Download & Extend

Refactor format_date() to avoid multiple calls to date_format()

Project:Drupal core
Version:7.x-dev
Component:base system
Category:task
Priority:critical
Assigned:Unassigned
Status:closed (fixed)
Issue tags:Performance

Issue Summary

I've pulled this out of #11077: Introduce Daylight Saving Time for Drupal as it is a separate task from the integration of Daylight Savings, but ideally both issues need to track as they touch the same code in several places.

The current format_date() function does several things - adjust for timezone, apply default formats, and translate the results of date formats. It is the last of these that is addressed here. The current code translates date elements by looping through every character of the date format string and applying various methods, translating the output if required by the format spec. It recurses on itself to do this. Potentially this recursion can add up to hundreds of the underlying date()/gmdate() calls per page - and these can be costly on the OS as timezone files need to be parsed on each invocation. It is also a very ugly (brute force) solution to the translation problem. We can do better.

My proposal is to escape all of the translatable strings so that date()/gmdate() need be called only once, and the cost of translation is borne in memory as string manipulations, instead of recursive OS calls.

This is the basic concept, which needs to be cleaned up and integrated:

  // Replace entries in $format string which need to be translated with escaped string '+=X=+'
  // where X is a format specifier which requires translation. The 'F' format specifier requires
  // special handling because in English (for example) May is both a short and long month name.
  // We need to specify for the 'F' format that it is the long month name we will be translating.
  // These patterns specifically exclude any pattern specifier which has been prefixed with '\\'
  // (octal 134).

  $format = preg_replace('@(?<!\134)([AaeDlMt])@','+=$1=+',$format);
  $format = preg_replace('@(?<!\134)(F)@','+=!\l\o\n\g-\m\o\n\t\h-\n\a\m\e F=+',$format);

  // Make a single call to gmdate using our specially formatted string.

  $date = gmdate($format,$timestamp);

  // Now look for anything that we escaped, and translate it.

  if(preg_match_all('@\+=([-! a-zA-Z]*)=\+@',$date,$matches)) {
    $arr = $matches[1];
    foreach($arr as $a) {
      if(substr($a,0,16) == '!long-month-name')
        $date = str_replace("+=$a=+",t($a,array('!long-month-name' => ''),$langcode),$date);
      else
        $date = str_replace("+=$a=+",t($a,array(),$langcode),$date);
   }
  }

  // $date now contains the correct translation of the date format.

It is intended to replace everything in the for loop of the current format_date() function, and needs a bit more work to get there as it is currently just proof of concept. It also may need to be integrated with and track changes in other work going on in format_date() such as #11077: Introduce Daylight Saving Time for Drupal.

Comments

#1

Status:active» needs review

Attached is a patch against current HEAD (2008-04-07). It needs regression testing, most importantly with language packs.

AttachmentSizeStatusTest resultOperations
format_date-2008-04-07.patch2.63 KBIdleFailed: Failed to apply patch.View details

#2

Added back the O and Z specifiers

AttachmentSizeStatusTest resultOperations
format_date-2008-04-07-2.patch2.78 KBIdleFailed: Failed to apply patch.View details

#3

Status:needs review» needs work

Thanks for your patch, but In fact there is no recursion here in most cases. Thus, your patch seems to solve a problem that does not exist.

The only case where there is recursion (the border case of the 'r' specifier) could be optimized separately. One way to do this is to extend $format and $max when the 'r' specifier is encountered, thus removing the need for a supplementary call to format_date().

Note that, even with the 'r' case, there can only be one recursive call, so the burden you are mentioning is very very very light.

#4

Right, recursion wasn't the right word, although there is recursion in the 'r' case. Mostly it's the time spent calling gmdate() over and over again for every character of the format. This has a cost. The time the OS spends filling in the struct tm over and over is more expensive than regex parsing - which is amusing because regex's are hardly cheap in terms of performance.

$ time php test1.php #10,000 iterations refactored code

real 0m0.556s
user 0m0.512s
sys 0m0.036s

$ time php test2.php #10,000 iterations original code

real 0m2.189s
user 0m1.972s
sys 0m0.040s

400% improvement ain't bad. But I guess if you like sucky performance ... who am I to argue?

There was a typo in the last patch. I'll try it again. It also looks like the timezone isn't right in the 'r' case - I was hoping at some point to merge this with daylight savings patches and use real time zones instead of the numeric offsets we're stuck with now. (Which would also eliminate 2-3 regex's when all is said and done).

It's part of a necessary re-working of all of the core date/time stuff where the issues are the stuff of legend. Hopefully there are one or two others out there who think it is a worthwhile endeavour to fix the bloody mess.

AttachmentSizeStatusTest resultOperations
format_date-2008-04-08.patch2.78 KBIdleFailed: Failed to apply patch.View details

#5

Status:needs work» needs review

A couple more changes to bring the patch into functional parity with current HEAD. The problem with 'r' has been fixed as well as applying a trim() to the long-month translation, and some pattern cleanup.

AttachmentSizeStatusTest resultOperations
format_date-2008-04-09.patch2.97 KBIdleFailed: Failed to apply patch.View details

#6

Thanks for your hard work on that. I'm wondering if it would not be easier just to call gmdate() unconditionally for all possible patterns (it should be very fast, because there is a huge fixed cost but a small marginal cost for any additional pattern), and just loop through each character of the string as we are doing right now.

#7

Title:Refactor format_date() to avoid recursion» Refactor format_date() to avoid multiple calls to gmdate()

Clarifying the title.

#8

Although a separate issue that is able to stand on its own, this work is intimately tied to #11077: Introduce Daylight Saving Time for Drupal. I'm thinking ahead to when we finally get daylight savings in core - which will eliminate the patterns which deal with timezone and leave three regex's total - one to catch the string translations, one for the weird case of full-month-name, and one to translate back at the end. In theory there shouldn't be any more to add. The timezone patterns are there just to keep things working until daylight savings in core moves forward.

Daylight savings additions makes this work critically important, because the functions to translate localtime for a specific timezone have significantly more overhead than gmdate(). If we continue to loop through every character of the format, it's going to be a killer when each character of the format has to open timezone db's and factor in every change to daylight savings since 1970. This can get quite expensive very rapidly and we only want to do it once for a given timestamp.

#9

Status:needs review» needs work

Back to CNW because patch no longer applies.

Maybe we should wait until #11077: Introduce Daylight Saving Time for Drupal is commit (now patch spotlight).

#10

Subscribing to this - format_date() is one of our most time-consuming functions (particularly when viewing node or comment listings where it can be called hundreds of times on a page).

#11

kcachegrind screenshot - as you can see, on a page with lots of dates (node with 90 comments), it's the slowest function in all of core.

AttachmentSizeStatusTest resultOperations
format_date.png158.62 KBIgnored: Check issue status.NoneNone

#12

Still slow. Subscribe

#13

Priority:normal» critical

#14

just updating the patch so it applies to HEAD, no other changes, date tests are failing.

AttachmentSizeStatusTest resultOperations
format_date.patch2.94 KBIdleFailed: 11295 passes, 10 fails, 0 exceptionsView details

#15

#334283: Add msgctxt-type context to t() seems relevant, we can ditch the short month prefix hack when that lands.

#16

Status:needs work» needs review

testbot?

#17

Status:needs review» needs work

The last submitted patch failed testing.

#18

subscribing

#19

Status:needs work» needs review

Should be twice as fast with this patch. I'm sure it could be even better, but I'm out of time at the moment.

AttachmentSizeStatusTest resultOperations
format_date.patch2.68 KBIdleFailed: Failed to apply patch.View details

#20

Title:Refactor format_date() to avoid multiple calls to gmdate()» Refactor format_date() to avoid multiple calls to date_format()

removed an incorrect + in regex.

AttachmentSizeStatusTest resultOperations
format_date.patch2.68 KBIdleFailed: Failed to apply patch.View details

#21

node/n, 300 comments is 4% faster with this. format_date() called 600 times.

HEAD:
0.74 [#/sec] (mean)

Patch:
0.77 [#/sec]

We could probably do with reviving #316189: Test for format_date alongside this.

#22

By the way, I'm curious why the "r" format string gets translated by format_date(). "r" is supposed to represent an RFC 2822 date, but when it gets translated it's no longer an RFC 2822 date...

#23

Status:needs review» needs work

The last submitted patch failed testing.

#24

Status:needs work» needs review

#25

Status:needs review» needs work

I'd like to commit this but I have a few questions that I'd like to see addressed in the code comments. You can answer them there.

1. Any particular reason we use '+=X=+' -- is that a special format of kinds or just something we picked? Looking at the code, it is just something we choose so we can use it in the next preg_match_all().

2. We don't explain _why_ we do this replacement. I'd better explain that in the comments.

3. I'd better explain why we have the special '!long-month-name' treatment. We lost that documentation in this patch.

#26

@dries
As the original patch creator perhaps I can answer -

1. It is entirely possible for people to pass format strings to gmdate which have nothing to do with date formatting characters. I chose this particular sequence because of the very low likelihood that it would show up in legitimate input, a challenge given that anything is legitimate. It is arbitrary but hopefully obscure enough that we won't get a bug report in the future because somebody had a justifiable need to use a 'creative' format string.

2. We aren't using any of the operating system translation facilities for dates for a variety of reasons so any result which returns a translatable 'string' (day names, month names, etc) needs to be passed through the Drupal translation function t().

3. The long_month_name hack is or at least was a part of the Drupal translation functions because translating a month name such as "May" is the same in both short and long forms for English, but may be different in other languages. Requesting t('May') gives no indication which form (long month name or short month name) is being requested - should they differ.

I have stepped aside and others have been updating this patch, so I'd like to request the most recent patch submitter (mfb?) to add the relevant comments.

#27

Seem like now we have context in t() we should be able to drop the long month name hack?

#28

Status:needs work» needs review

Rerolled to support t() context.

AttachmentSizeStatusTest resultOperations
format_date.patch2.83 KBIdleFailed: Failed to install HEAD.View details

#29

It seems like rather than working with unpredictable strings and possibly performing multiple replacements on the same date format character that the replacements should be done with a single function call and the date format information should be kept isolated. The attached patch incorporates the changes from #28, isolates the string modified by date_format, and uses a single str_replace call. This also allows F to be treated specially but without the additional exclamation point notation.

The second patch uses preg_replace throughout which may be appealing because then no arbitrary delimiters are required in the format string passed to the preg_match_all.

AttachmentSizeStatusTest resultOperations
format_date-243129-29.patch3 KBIdleFailed: Failed to install HEAD.View details
format_date-243129-29-preg.patch2.94 KBIdleFailed: 11816 passes, 10 fails, 0 exceptionsView details

#30

@jrchamp in your patches you're translating the numerical year, month, day, hour, minute etc. which is not needed. But it's good to only call str_replace once so I incorporated that into this patch.

AttachmentSizeStatusTest resultOperations
format_date.patch2.84 KBIdleFailed: 11825 passes, 1 fail, 0 exceptionsView details

#31

@mfb: Nice use of matches[0]! My initial concern is that $replace is not initialized to an empty array.

Edit: There appears to be an issue with escaped characters:

<?php
function t($input) {
  return
"z$input";
}

$format = 'F r \\F\\\\F\\\\\\F';
?>

The output is: zJune zThu, 25 zJun 2009 18:38:24 +0000 F\June\F

The output should be the following (the June with a literal backslash before it does not get translated): zJune zThu, 25 zJun 2009 18:38:24 +0000 F\zJune\F

#32

If you use the same arbitrary delimiter to swap out double backslashes, you can resolve the issue mentioned in #31.

Before the first preg_replace:

<?php
  $format
= str_replace('\\\\', '+==+', $format);
?>

Use a + instead of a * to require that something exist within the delimiters:

<?php
 
if (preg_match_all('@\+=([a-zA-Z/_!-]+)=\+@', $date, $matches)) {
?>

After the date has been prepared (this replacement could be appended to the arrays that perform the rest of the replacements to avoid an extra str_replace call):

<?php
  $date
= str_replace('+==+', '\\', $date);
?>

#33

Good god this is just horrible ('\\\\\\\\') but it seems to work?

AttachmentSizeStatusTest resultOperations
format_date.patch2.89 KBIdleUnable to apply patch format_date_11.patchView details

#34

Works for me. 8 backslashes make sense because you're trying to put those 2 back, which need to be escaped in the regex (4), which need to be escaped in the string (8). Personally, I would still like to see the $replace array initialized.

I'm not sure if I count as "community", but this seems RTBC to me.

#35

Theoretically everyone is community! but, I'd like to have additional reviews regardless.

I thought it might be faster to bundle everything into preg_replace and avoid extra calls to str_replace (where we could just use '\\\\').

#36

It makes sense to combine it into that function call. The only other option would be to add the multi-backslash replacement to the $matches[0] and $replace arrays (which would get you down to just the 4 backslashes). Though I'm not sure if that is even beneficial. Nice job though!

#37

Because the whole point of this patch is performance, changes need to be supported by benchmarks ;)

#38

I benchmarked admin/reports/dblog because node spends most of the time theming each node.

  • Short date format m/d/Y - H:i
    • Without patch
      Requests per second: 8.19 [#/sec] (mean)
      Time per request: 122.142 [ms] (mean)
    • With patch
      Requests per second: 8.40 [#/sec] (mean)
      Time per request: 119.030 [ms] (mean)
  • Short date format l, F j, Y - H:i
    • Without patch
      Requests per second: 7.79 [#/sec] (mean)
      Time per request: 128.449 [ms] (mean)
    • With patch
      Requests per second: 8.06 [#/sec] (mean)
      Time per request: 124.060 [ms] (mean)

As you can see some nice performance gains can be had simply by not using translatable date formats.

#39

I recommend a node with 300 comments for benchmarks - format_date is in the top 5 expensive functions on that page (if not number 1). Can also compare latest patch against #21

#40

I suggest micro-benchmarks for that, as they will be more accurate. We already know that this function is pretty expensive in a full page load.

#41

Status:needs review» needs work

The last submitted patch failed testing.

#42

Status:needs work» needs review

Let's retest (or does this really break install)?

Here are some micro-benchmarks* on 1000000 iterations of format_date() with date format l, F j, Y - H:i:

Without patch: 267.003068924 seconds
With patch #20: 184.690872908 seconds
With patch #33: 180.732372046 seconds

Ideally we could improve this even more :)

* as seen at http://www.garfieldtech.com/blog/magic-benchmarks except I was too impatient for 2000000 iterations

#43

Status:needs review» needs work

The last submitted patch failed testing.

#44

Status:needs work» needs review

I cannot reproduce testbot exception..

#45

@mfb: You might find a date format with "r" in it to have an even larger speedup because of the call duplication.

#46

I prefer the patch in #20. It's easier to read than the patch in #33, and it gives us the same kind of speed-up -- at least based on the benchmarks in #42.

#47

@Dries: the problem was that the patch at #20 is buggy: It doesn't support backslashes in the date format.

#48

Ok, I'll commit #33 then or does it need more work?

#49

@Dries: If any eyes are available for this, it would be nice to see if we can optimize even further and/or make it more readable.

#50

I think it's better to update #316189: Test for format_date assuming this one works - and nearly 50% speedup is very good anyway - we can always open a new issue to further optimize, but having the tests would make it easier to compare versions.

#51

OK I added some tests at #316189: Test for format_date

#52

I committed #316189: Test for format_date so asking for a re-test of this patch.

#53

A quick explanation:

<?php
 
// Steps:
  //  * Replace two backslashes with an empty escape sequence (allows us to quickly an correctly identify single backslashes).
  //  * Replace non-escaped 'r' with its literal expansion, 'D, d M Y H:i:s O'.
  //  * Replace non-escaped 'F' with a special escape sequence.
  //  * Replace non-escaped format characters which need translation with the escape sequence.
  //  * Replace the empty escape sequence with two backslashes.
?>

Unfortunately I ran into a problem with the following format string:

<?php
$format
= 'r F A a e D l M T';
?>

Here is an expanded regular expression to match the additional characters I encountered: colon and plus sign. I moved the exclamation point to the beginning to avoid the use of substr() and initialized the replace array (because I'm worried about notices).

<?php
 
// Now look for any escape sequences and translate them.
 
if (preg_match_all('@\+=(!?)([a-zA-Z0-9+/:_-]+)=\+@', $date, $matches)) {
   
$replace = array();
    foreach (
$matches[2] as $key => $match) {
      if (
$matches[1][$key] == '!') {
       
// Special treatment for long month names: May is both an abbreviation
        // and a full month name in English, but other languages have
        // different abbreviations.
       
$replace[] = t($match, array(), array('context' => 'Long month name', 'langcode' => $langcode));
      }
      else {
       
$replace[] = t($match, array(), array('langcode' => $langcode));
      }
    }
   
$date = str_replace($matches[0], $replace, $date);
  }
?>

#54

An alternative implementation. I'm not sure that multiple calls to date_format() is actually the main issue here. I guess that iterating over the string is probably part of the issue.

Could someone profile that?

AttachmentSizeStatusTest resultOperations
243129-alternative-approach.patch2.7 KBIdleFailed: 11825 passes, 1 fail, 0 exceptionsView details

#55

Status:needs review» needs work

The last submitted patch failed testing.

#56

@jrchamp: Where did you find a colon or plus sign in a date component that actually needed translation? We do not translate numeric components like times or time zone offsets. It's good practice but FYI there are no notices about uninitialized array (with E_ALL | E_STRICT).

#57

@Damien Tournoud: In my test your patch was slower than the unpatched format_date(). Here's my times on 1000000 iterations of format_date(REQUEST_TIME, 'custom', 'l, F j, Y - H:i'):
Without patch: 269.183619022 seconds
With patch #54: 280.549337864 seconds

#58

Status:needs work» needs review

The removal of two substr() calls in #53 and some further refactoring shaves almost 6 seconds off 1000000 iterations. I'm now down to 174.988138914 seconds,

AttachmentSizeStatusTest resultOperations
format_date.patch2.6 KBIdlePassed: 11567 passes, 0 fails, 0 exceptionsView details

#59

Here is a better approach.

  1. skipping most of format_date() when the language is English
  2. do not translate 'r' (that makes strictly no sense)
  3. encode date markers with invalid UTF-8 characters, not the arbitrary += =+ sequence
  4. use smart read-ahead expression to avoid multiple passes
  5. add a cache to avoid relatively costly calls to t()... after all there are only a few possible source strings

In my testing (string 'l, F j, Y - H:i'):

  • #58 is about 10% faster then unpatched
  • with (1), calls to format_date() are about 60% faster then unpatched for English
  • with (5), calls to format_date() are about 40% faster then unpatched for non-English
  • even without the advanced cache, this version of format_date() is about 16% faster the unpatched
AttachmentSizeStatusTest resultOperations
243129-alternative-approach2.patch3.56 KBIdleFailed: 11559 passes, 1 fail, 0 exceptionsView details

#60

Straight ab -c1 -n500 benchmarks, front page with 10 nodes (20 calls to format date in total).

HEAD:
7.05 reqs/sec

Patch:
7.34 reqs/sec

In webgrind.

HEAD:
1.98% of the request

Patch:
0.46% of the request

#61

1000000 iterations, using non-English language:
With patch #58: 199.153819084 seconds
With patch #59: 116.163262129 seconds
Very nice speed up :) I like removing the special treatment of 'r' as justified in #22. But I don't know if disabling translation for English is ok. What about sites that want to specify something like:

<?php
$conf
['locale_custom_strings_en'][''] = array('America/Los_Angeles' => 'Pacific', 'America/New_York' => 'Eastern');
?>

#62

The patch in #59 incorrectly evaluates an escaped format character with a literal slash before it (when the langcode is not 'en' of course).

<?php
$format
= "F \\F \\\\F \\\\\\F";
?>

Expected Result:
July F \July \F

Actual Result:
July F \July \July

This is the same issue that I brought up in #31 and proposed a solution to in #32. Perhaps a similar solution could be implemented using the invalid UTF-8 arbitrary delimiters.

#63

We should make the example in #62 a test case. We should also get into the habit of submitting tests instead of reporting bugs in the comments. ;-)

#64

Interestingly, this was not in the test case.

#65

@Dries: My apologies, I'm still getting accustomed to the workflow here.

As far as the example, the test case should be done at least in 1 non-'en' language. The patch in #59 passes that test in 'en'.

Hopefully this is the correct place to post this patch.

AttachmentSizeStatusTest resultOperations
format_date_test-243129-65.patch1013 bytesIdlePassed: 11568 passes, 0 fails, 0 exceptionsView details

#66

Oops, I meant to escape that (not that it makes a difference right now, but if they ever add a \F or \m escape sequence it would break).

AttachmentSizeStatusTest resultOperations
format_date_test-243129-66.patch1015 bytesIdlePassed: 11568 passes, 0 fails, 0 exceptionsView details

#67

@jrchamp let's keep the test improvements in #316189: Test for format_date as I already have another test patch over there. I will add this test to that patch but please review and make sure it's right.

I guess we want format_date() to work with any number of backslashes the way date() itself does.

#68

I like it, reviewed alternative approach, I dont' see any obvious problems but note that i only checked code and tried applying, didn't test live.

#69

The extra tests in #316189: Test for format_date were committed so I asked for a retest of #66.

#70

Here is #59 (the speedy UTF-8 character approach) with some simple changes to resolve the format_date test that it fails.

AttachmentSizeStatusTest resultOperations
format_date_alternative-243129-70.patch3.58 KBIdlePassed: 11567 passes, 0 fails, 0 exceptionsView details

#71

I'd still prefer not to lose translation of time zones with langcode "en". See comment at #61.

#72

@mfb: I tend to agree. Here is the reroll without treating 'en' specially.

AttachmentSizeStatusTest resultOperations
format_date_alternative-243129-72.patch3.47 KBIdleFailed: Failed to apply patch.View details

#73

Do we need to benchmark this version?

#74

<?php
// We cache translations of months (12 possible values) and week days
// (7 possible values) in order to avoid a rather costly call to t().
?>

This is technically incorrect (I know I wrote that... but still). We have two different months forms, and a few of other stuff (like am/pm, AM/PM, etc.). I think it's enough to say that there are only a few possible forms.

Otherwise, this version has my +1.

#75

Reworded comment regarding translation cache.

AttachmentSizeStatusTest resultOperations
format_date_alternative-243129-75.patch3.4 KBIdlePassed: 11567 passes, 0 fails, 0 exceptionsView details

#76

Committed to CVS HEAD. Thanks!

#77

Here's a webgrind screenshot of HEAD default front page with ten nodes, cvs updated. format_date() down from top 3 to top 25 most expensive functions. Nice work folks.

#78

Status:needs review» fixed
AttachmentSizeStatusTest resultOperations
after.png1.18 MBIgnored: Check issue status.NoneNone

#79

Category:task» bug report
Status:fixed» needs work

There are WAY too few test for this in core. This is a new PHP 5.2 feature and you expect this to work just like that? I find your faith amusing.

#80

@chx: Did you have any suggestions on what else needs to be tested? Currently we have the "Format date" unit tests in modules/simpletest/tests/common.test and some related functional tests "Date and time" in modules/system/system.test and "User time zones" in modules/user/user.test

#81

Category:bug report» task
Status:needs work» fixed

This patch hasn't introduced the date_format() call and is not using any new PHP 5.2 feature. Closing again. Let it rest.

#82

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

nobody click here