Rob Feature mentioned this in irc but I haven't seen an issue for it yet. The idea would be to run check_markup() on the body field within node_load(), so that there's always a sanitized version of it available rather than doing it later. I think the general idea is to make it more useful for themers but I reckon there's potentially a performance gain to be had there as well.

Once #111127: Cache node_load is in, we'll have the output of node_load() cached in the database. If that output includes the results of check_markup($body), then we don't need to hit the cache a second time in check_markup(). If we assume database caching, that's up to 30 queries saved on a default front page. Since we'll know that check_markup() was going to be cached in the node cache, we could possibly set those calls not to be added to cache_filter() to save duplication of the same cached data. I my nodes vs. comments benchmarks, one of the primary killers of both past a certain point is the cache_get() hit for each textarea in a list. With multiple load on nodes, the check_markup() can easily account for 50% of the database hits on a page.

If we did the same thing for fields, adding the check_markup() to the field cache, then we'd see even bigger gains across content types - for example user profiles, which can easily run to a dozen text fields. The rules for filter expiration are pretty the same for node/field expiration - since it's the same action of saving the content which should refresh the cache, so I don't think we'd add a lot of complexity there.

The disadvantages would be that if you're loading nodes but then not displaying them, you've got extra data which you don't need being pulled in. However that's a much rarer case than display, so I think it's minor if this gives the potential gains which I think it might. Obviously it doesn't make things any faster if you're running memcache (well, a few less function calls), but I don't think it'd make things any worse there either.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

catch’s picture

Assigned: Unassigned » catch
Priority: Normal » Critical
Status: Active » Postponed

Here's a patch, has to include the entire node_load() cache patch in order to work, so marking this postponed.

Looks like this gives us around 8% improvement on a completely vanilla front page over and above the node_load cache patch.

devel generated 500 article nodes, 10 on the front page, only default modules enabled. This absolutely minimises the effect of the node_load() cache since there's barely any hook_nodeapi_load() modules enabled at all.

Used devel query log, and ran ab -c1 -n500 twice for each version:

HEAD:
Executed 43 queries in 18.2 milliseconds.
9.3 requests/second
9.33 requests/second

node_load caching patch from http://drupal.org/node/111127#comment-1242672
Executed 40 queries in 15.72 milliseconds.

10.11 requests/second
10.19 requests/second

This patch:
Executed 30 queries in 11.02 milliseconds.
10.92 requests/second
10.94 requests/second

catch’s picture

FileSize
31.49 KB

and the patch.

I also compared 300 nodes vs. 300 comments. With this patch, 300 nodes on a vanilla HEAD get loaded in 30 queries, compared to over 300 queries for comments. However drupal_render()/element_sort() is still sufficiently slow that we don't yet outflank comments in overall page serving time (although it's close).

catch’s picture

With #370846: More drupal_render() optimisation and this patch both applied, nodes are consistently 10% faster than comments, whether loading 30 or 300.

moshe weitzman’s picture

I looked quickly at the patch but couldn't find the right spot to verify so I'll just state that we need to only do this for nodes/fields whose input format is marked cacheable.

catch’s picture

@moshe - chx pointed this out in irc, and it's in the patch:

+        if (filter_format_allowcache($node->format)) {
+          $node->filtered_body = check_markup($node->body, $node->format, $node->language, FALSE, TRUE);
+          $node->filtered_teaser = check_markup($node->teaser, $node->format, $node->language, FALSE, TRUE);
+        }
+        if ($n
yched’s picture

In terms of fields (and 'body as field'), this would mean moving hook_field_sanitize() invocation out of field_attach_view() into field_attach_load().
By this time, we have no 'build_mode' and we don't know what the actual formatter will be, what are its exact needs (might display 'plain text' - by far not the most current use case for formatted text fields though) or whether the field will end up 'hidden'.

Not really worse than what we do in D6 / current HEAD, where hook_field_sanitize() is called in field_attach_view(), but neglects $build_mode and formatters. Conceptually, in the 'late sanitize' approach we had so far, sanitizing would be a *formatter* operation - each formatter sanitizes for what it needs. I opened #371724: remove field data sanitization from hook_field_load about finding out the best place for sanitize,, but did not really give it much thought so far.
Becomes moot if we choose to cache sanitized data on load, of course.

catch’s picture

Status: Postponed » Needs work

So the only reason to move this earlier is to use the field/fieldable object level caching to avoid separate calls to check_markup() - but IMO it's a pretty compelling reason, since we save a query per formatted field, per object. If taxonomy terms are fieldable, and we move taxonomy descriptions to a formatted field (or equivalent), we'll save potentially dozens of queries on node listings with taxonomy terms attached. Doing it inside field_attach_load() means it'll get cached one way or the other (although we also need to handle the case of text formats which can't be cached as noted above).

I had a quick look at the hook invocations and text module's implementation, but it seems like it needs a fair bit of reworking rather than a copy / paste and formatters and build modes aren't a part of field API I've spent any time with. Either way this can be set back to active now since field_attach_load() looks like a viable place to do this.

sun’s picture

Status: Needs work » Closed (won't fix)

Sorry, this won't fix. Permissions for text formats and filters are different per user (so this patch would only work if you'd cache nodes per user). Filters are invoked upon view and modules need to be able to alter the content of a node or field.

If Drupal was using OOP all over the place, I guess one could do something more friendly like $node->output('body') or $node->output('field_foo'), but it's not.

catch’s picture

Status: Closed (won't fix) » Needs work

The filter system has a mechanism for filters to prevent themselves from being cached. Also check_markup() doesn't cache per-user, and the $access param of check_markup() has explicit documentation stating don't use it when viewing other people's content. So none of these concerns are valid. There might be other concerns (like it being a bit tweaky just to squeeze some database queries out), but that doesn't make it won't fix yet.

sun’s picture

- In correlation with #445902-4: (bool) menu_get_object() is *not* an equivalent of $page, we should not use _body and _teaser suffixes, but instead rely on node/object build modes.

- Filtered content should be stored/cached per field.

- We should bail out early if the text format uses a filter that is not cacheable.

Reviewing this patch is kinda hard - is it really dependent on 'cache node_load'?

catch’s picture

It's not dependent on node caching, but it's dependent on #333171: Optimize the field cache to have as much impact as I hope.

The motivation here is this:

For every formatted text field on a page, we have to hit {cache_filter} from check_markup() - with database caching this is an extra query for every field. On pages with lots of text fields (10 nodes with 5 text fields each or 300 comments), this can add up to hundreds of queries. On some sites I run cache_get can be quite a slow query (one query can take up to 15ms at least, several around 1-2) - so multiply that by a dozen or two queries and it's a measurable gain.

In the case where we can $something_load_multiple() the item, and also do some caching (doesn't matter if it's field or object cache here), we can save those queries by storing the formatted text with the data itself rather than separately.

So node_load_multiple() as it stands now - 5 text fields on 20 nodes = 100 queries from check_markup() (note that body as field already provides 2 of the 5 once it's in).

Put the result of check_markup() into the node or field cache - 0 additional queries.

An updated version of the patch would need to move hook_field_sanitize() from http://api.drupal.org/api/function/_field_attach_view/7 to _field_attach_load somewhere - and http://api.drupal.org/api/function/text_field_sanitize/7 is the obvious place that's going to use it. I think we'd need to change text_field_sanitize() to run through all build modes at once - that's going to be the main change.

And it'd also need to do so in such a way that uncacheable filters still get run at a later point. Haven't thought about that bit yet in terms of field API.

A bit more to it than just body and teaser - but then if we're able to make this optimization once here, then we can simply let modules convert to field storage to take advantage of it. I haven't sat down and tried to do this yet, but as yet it doesn't seem like it'd be overly complex, and I think we could get good gains out of it.

yched’s picture

Raises a similar 'clear cache' problem than the one discussed in 'cache node_load' : if we cache filtered text in field cache (and node cache if any), then it means filter.module need to be aware of field cache and node cache and clear them when a text format definition is changed.

catch’s picture

That's true. In this case filter configuration changes rarely enough that I don't think it'd hurt to clear the entire field cache. We'd not be adding additional cache_set() to any great extent because that'd already be necessary from normal check_markup() calls on fieldable objects, so the trade-off shouldn't be too bad on that side.

filter.admin.inc has a few calls to cache_clear_all('cache_filter') - I think it'd be enough to centralise those in a helper function and add a clear to the field cache in it. A lot less intrusive than the current node caching patch.

I haven't got my head around how hook_field_sanitize() should work yet. What I'm probably going to do is get a proof of concept patch worked up just for benchmarking. Then that's at least something to compare against / work off.

sun’s picture

When filters are re-configured, each and everything needs to be reset anyway. My guess and fear is that we are not resetting enough currently. For example, #446742: FAIL. might give you a clue about what would also need to be reset, but currently is not.

yched’s picture

It's not 'what exactly should be cleaned when a text format is changed' per se that worries me - resetting the whole field cache and node cache seems OK.
It's rather that this is yet another instance of 'X (filter.module) needs to know that one of its actions can cause stale cache on Y (field / node / ...), because of implementation choices internal to Y'

yched’s picture

Thinking about it a little more :
- #428296: Filter system doesn't communicate with modules about altered text formats already identified the need for a hook_filter_delete()
- a hook_filter_update() looks equally welcome
- A 'formatted text field' can be seen as an 'input_format_reference field', so text.module can use the hooks above to apply point c) of the strategy described in #111127-207: Cache node_load for field/node cache cleaning - actually the 'run check_markup() on load' part can probably be seen as point b). Other 'input format' consumers do what they have to do in their own implemntations of hook_field_(update|delete)(). Filter.module itself knows nothing about field cache, node cache, contrib_foobar cache...

catch’s picture

Heh, woke up this morning thinking about adding those hooks to filter.module, seems like a good plan that won't offend anyone too much :)

catch’s picture

FileSize
4.47 KB

Started a re-roll of this but body / field conversion has a lot of broken hunks and was planning to use that for testing. This doesn't deal with filter cache clearing, or filters which can't be cached yet, but it's a smaller patch that only includes changes to field.attach.inc and filter.module.

I had to do some (untested) borrowing from _field_invoke due to the way the field cache works - (similar issue to hook_field_load() which isn't cached at the moment). Once body and teaser as field applies again, I'll try to get this to a benchmarkable (if not properly reviewable) state.

yched’s picture

@catch : I rerolled #372743: Body and teaser as fields (gets broken about every day lately...)

catch’s picture

Title: Add check_markup() inside node/field loading. » Add check_markup() inside field loading.

Thanks yched.

Here's an updated patch which works (although not ready for proper review). I've also uploaded a composite patch with body + teaser as fields applied and a conflict resolved in case anyone wants to do independent benchmarks.

10 nodes on front page, default profile. Body and manual summary is roughly equivalent to having two text fields.

devel query log and benchmarks. These are fairly consistent with the 8% I got above with putting sanitized body and teaser in the node cache.

HEAD
Executed 51 queries in 16.73 milliseconds.
9.34 [#/sec]

Body and teaser as fields:
Executed 52 queries in 16.34 milliseconds.
7.24 [#/sec] (mean)

Body and teaser as fields with manual summaries:
Executed 61 queries in 23.15 milliseconds.
7.41 [#/sec]

This patch + body and teaser as fields:
Executed 43 queries in 15.09 milliseconds.
7.84 [#/sec] (mean)

This patch + body and teaser as fields and manual summaries:
Executed 41 queries in 15.33 milliseconds.
8.13 [#/sec]

As above with #333171: Optimize the field cache applied:
Executed 32 queries in 13.06 milliseconds.
8.31 [#/sec]

In that last one, all check_markup() is in the field cache instead of being fetched separately for each field, and the field cache itself is fetched in a single query - so we save 2 queries per node for the check_markup() on text fields, and 9 queries on the field cache itself.

catch’s picture

And the patches.

catch’s picture

Status: Needs work » Needs review
FileSize
5.53 KB

Updated patch which handles uncacheable filters now.

hook_field_sanitize() moves to field_attach_load() and is cached in the field cache. For 99% of field modules this shouldn't cause any changes internally whatsoever.

For modules which have to call check_markup() inside hook_field_sanitize() we add hook_field_sanitize_uncached() to allow them to deal with filter_format_allowcache(). I'm not overly keen on this, but don't see another way to do it, and the impact should be minimal since text module is presumably the main consumer of this hook.

The filter cache itself is skipped, so we don't duplicate storage of the cached field. This also means when loading a page without any primed caches, we have only the cache_set() for fields - no inserts from check_markup() at all - so that'd save 10 inserts on HEAD as well.

Invoking hook_field_sanitize() is awkward because of the way field_attach_load() works, #448598: Reduce PHP overhead in field_attach_load deals with that and might simplify things here a bit. If it doesn't we might need a helper function for that bit to clean it up.

If you're wondering why all the benchmarks are slower than HEAD, that's because of #438570: Field API conversion leads to lots of expensive function calls - the relative gain from the patch is about the same from #1 and #20 though.

Since field_attach_load() and the new filter hooks are both a bit further behind this issue, marking this to needs review despite there being some cleanup still to do.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

FileSize
6.52 KB

Patch failed because the field test examines all the contents of the field cache compared to the actual values. Changed this to just confirm that the cache entry exists. If hook_field_load() eventually gets cached then it'll need a similar change.

catch’s picture

Status: Needs work » Needs review
catch’s picture

Title: Add check_markup() inside field loading. » Performance: use field cache for hook_field_sanitize()
Issue tags: +Performance

More accurate title and tagging.

sun’s picture

Component: node system » field system
Issue tags: +FilterSystemRevamp

Proper component, tagging for my watchlist. This issue - more or less - starts to question the entire filter cache we have currently.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review

Test bot meltdown.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

I spoke to webchick about this in irc last week and promised a summary - here it is:

check_markup() maintains it's own cache - this means a cache_get() for every formatted text field on a page. When viewing large numbers of comments or nodes (or a node with lots of text fields), this means a lot of little queries when using database caching. For example, the body as field patch, by having two distinct textareas for summary and body, adds 10 queries to the default front page (bearing in mind we're down to 40 or so that's quite noticeable). This also means an additional cache_set() to build the cache in the first place.

The proposal here is cache the result of hook_field_sanitize() (which is currently only used by text module to run check_markup()) in the field cache itself. That way there's no additional lookup from (or save to) the filter cache. With body as field, this would save 20 queries on the front page (and 20 inserts with an empty filter cache). When I tested this combined with the body as field patch, it saved about 10% via ab benchmarks.

Where the number of queries is currently very noticeable in HEAD is viewing 300 comments (i.e. an extra query per comment) - this patch wouldn't fix that, but should comment body ever become a field it would.

There's one issue with doing this though - filter.module currently allows you to declare your filter as uncacheable - it will then disable caching for the entire text format. If we cache hook_field_sanitize() we need to deal with this.

The current patch deals with this by adding a hook_field_sanitize_uncached() - so text module can essentially do a first pass for cacheable formats and a second pass for uncacheable ones. This is a bit ugly (although there'll be a relatively small number of field modules which ever need to do it).

Talking with webchick though, we couldn't actually think of any uncacheable filters.

http://drupal.org/project/pirate seemed like an obvious example, but that one is cached - it just doesn't run unless it's talk like a pirate day. When working on the i18n project with Nedjo and Stella we loooked at multilingual support for projects like wordfilter and determined it ought to be done via hook_nodeapi_view() - both to get language context, and also to avoid disabling the filter cache for every other filter being run - which is a performance hit in itself.

yched’s picture

-I don't get why the patch needs to duplicate _field_invoke() to do the // Invoke hook_field_sanitize() on the additions. part, instead of simply calling _field_invoke() ?

- I'd personally vote to simply move the current content of text_field_sanitize() into text_field_load(), and do the 'non cache' part in the existing hook_field_sanitize(), instead of introducing hook_field_sanitize_uncached() in addition to hook_field_load(), hook_field_sanitize().

- might be a stupid question : why do we need to pass an explicit $skip_cache param to check_markup() ? Couldn't check_markup() do the filter_format_allowcache($format) check itself before trying to fetch from the cache ?

catch’s picture

#1 - the field_invoke() changes weren't in during the last revision of the patch - hook_field_sanitize() works on objects directly, field_attach_load() was acting on the $additions array. That shouldn't be necessary when this is re-rolled - just wanted something benchmarkable.

#2 - is hook_field_load() cached now? Last time I checked it wasn't (although it should be). Nice idea though!

#3 - if we're using the field cache, we don't also want to create cache records in the filter cache - extra writes, duplicate storage etc.

yched’s picture

Right, I didn't notice the date of the last patch. Yes, hook_field_load() is cached now, this whole area got streamlined with #362024: Field Attach: Make hook_field_load() be 'multiple' like field_attach_load(). So #1 and #2 should be quite natural when this gets rerolled.

Figured out check_plain($skip_cache) better now. One last point I don't get (minor, rather related to the current state of check_markup) : why doesn't it check for filter_format_allowcache($format) before trying to read from cache ? If the format is not cacheable, nothing will have been written to the cache to begin with...

catch’s picture

Title: Performance: use field cache for hook_field_sanitize() » Performance: do check_markup() in text_field_load()
Status: Needs work » Needs review
FileSize
4.53 KB

Re-roll - this incorporates all of yched's comments, which makes this patch a lot smaller and less far reaching.

text.module now adds the 'safe' version of each field in text_field_load() for cacheable text formats.

hook_field_sanitize() stays where it is, and text.module uses that for uncacheable text formats. No new hook, all the Field API changes are entirely within text module's hook implementations now.

check_markup() gets an extra $cache parameter, defaulting to TRUE, which you can set to false if you want to skip {cache_filter}.

This means we skip a cache_set() and cache_get() to the {cache_filter} table for every text field attached to an object when it's loaded since it's all stored in the field cache instead.

On filter_format_allowcache() and doing cache_get(), that just looks broken. But I didn't fix this here.

sun’s picture

Status: Needs review » Needs work

#446518: Remove $check argument from check_markup()

... means that:

+      // TODO D7 : this code is really node-related.
...
+        $check = is_null($object) || (isset($object->build_mode) && $object->build_mode == NODE_BUILD_PREVIEW);

will probably be nuked anyway (NOT in this patch). Form API validates and ensures that the user actually has access to the text format.

This leaves us with some nitpicks:

+  global $language;
+  foreach ($objects as $id => $object) {

Please squeeze a blank line in here.

+      if (isset($text)) {
+        $items[$id][$delta]['safe'] = $text;
+      }

$text is always defined here.

+      }
     $items[$delta]['safe'] = $text;
+    }

Wrong indentation.

+ * @param $cache
+ *    Whether to cache the results of this function in the {cache_filter} table.
+ *    If you are caching the results of check_markup() elsewhere you may want
+ *    to set this to FALSE to avoid storing the same data twice.

Less 'you' would be good. Also, s/twice/multiple times/

sun’s picture

oops, 2nd nitpick about isset() was wrong, sorry.

catch’s picture

Status: Needs work » Needs review
FileSize
4.54 KB

Yes that line should die soon, but not going to take that on here.

All nitpicks fixed. I'm happier with the new @param $cache comment but not 100%, so improvements welcome.

Note that the benchmarks in #1 and #20 are still valid for what this patch does, even if the implementation is different - but until we actually have a text field in core there's no concrete benchmarkable data set without installing CCK or merging this with the body as field patch.

yched’s picture

Great ! My own nitpick now : text_field_sanitize() could use a few words of comments to relate to what is done in text_field_load() (and vice versa).

Other than that, RTBC for me.

catch’s picture

FileSize
4.85 KB

Added those comments, also added @see to the phpdoc for both functions.

yched’s picture

Status: Needs review » Reviewed & tested by the community

Cool.

sun’s picture

+ * @param $cache
+ *    Whether to cache the results of this function in the {cache_filter} table.
+ *    Set this to FALSE if caching the output of check_markup() elsewhere to
+ *    avoid duplicating cached data.

I am simply overlooking the wrong indentation here, because the entire PHPDoc seems to use 3 spaces, so not invented here, should be fixed elsewhere.

However, what about:

+ * @param $cache
+ *    Boolean whether to cache the filtered output in the {cache_filter} table.
+ *    The caller may set this to FALSE when the output is already cached
+ *    elsewhere to avoid duplicate cache lookups and storage.
catch’s picture

FileSize
4.87 KB
Dries’s picture

This looks good to me.

The documentation is adequate but not extensive so it still took some time figuring out. I couldn't make sense of the TODO -- very cryptic -- but given that the TODO was already there, I'm not going to bikeshed about it.

Will commit later today.

catch’s picture

We can remove the TODO (and the lines of code it relates to) completely when either #446518: Remove $check argument from check_markup() or #409750: Overhaul and extend node build modes are committed.

bjaspan’s picture

Questions:

+      // TODO D7 : this code is really node-related.
+      if (!empty($instances[$id]['settings']['text_processing'])) {
+        $check = is_null($object) || (isset($object->build_mode) && $object->build_mode == NODE_BUILD_PREVIEW);

1. Can $object really be null here?
2. Isn't it the $check line, not the if line, that is node related?

catch’s picture

bjaspan - that's just copied/moved around, no change from HEAD, so I take no responsibility for that line. But as sun says, we'll be killing it off very soon anyway.

catch’s picture

FileSize
4.85 KB

Re-rolled for s/Implementation of/Implement

Dries’s picture

Can I request a quick re-roll. Patch fails to apply currently. Thanks!

Dries’s picture

It would be nice, but not required, to focus on #446518: Remove $check argument from check_markup() first. That seems to be the natural dependency.

catch’s picture

FileSize
4.85 KB

That patch still needs security team discussion for profile module and #after_build, although I marked back to CNR for the test bot.

Re-rolled for the whitespace changes.

sun’s picture

There is not really a dependency between this and the "remove $check argument" issue. That issue can be treated as a major API change, and we most likely need to rename check_markup() to ensure contrib developers to update their code properly.

In the meanwhile, we can commit this patch, and also fix that @todo in a separate issue. Please understand that this @todo just happens to be contained in the code that is touched here. The @todo is neither in the scope of this issue, nor really in the scope of the other issue. That is, because Field API needs to ensure that Form API validation was properly invoked in front of displaying any values in preview.

Testbot passed, so this is ready to go.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Alright, let's move forward with this then. Thanks. Committed.

yched’s picture

Status: Fixed » Needs review
FileSize
2.11 KB

Minor followup :
- moves a few lines around to make it clearer what the 'TODO, node-related code' comment applies to
- adds a comment for an important case where filtering happens in hook_field_sanitize(): node preview (only case where data go directly through 'view' before being loaded first)

catch’s picture

Status: Needs review » Reviewed & tested by the community

Looks good.

yched’s picture

Issue tags: +Quick fix

add tag

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Committed! Thanks. :)

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -Quick fix, -FilterSystemRevamp

Automatically closed -- issue fixed for 2 weeks with no activity.