The Crawler seems to work well (in fact I think it's a killer feature of Boost for D7), but to fine-tune it (with Rules, etc.) it is important for admins to see what is being queued for crawling.

Is it currently possible to view a list of URLs queued for crawling? If not, can it be added? A simple paged View linked from or underneath the settings on the Crawler tab would be more than enough. It's important that the View shows content type, time of queueing, title, URL, etc.
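
In the meantime, here is a rough way to peek at the queue, assuming the crawler submodule uses Drupal's core Queue API (so its items land in the {queue} table). The queue name 'boost_crawler' is my guess; check the submodule source for the real name before relying on this. Run it via `drush php-eval` or a snippet on Devel's /devel/php page:

```php
// Peek at pending crawler items via the core {queue} table.
// ASSUMPTION: the queue is named 'boost_crawler' -- verify in the
// crawler submodule source.
$result = db_query(
  "SELECT item_id, data, created FROM {queue}
   WHERE name = :name ORDER BY created DESC",
  array(':name' => 'boost_crawler')
);
foreach ($result as $row) {
  // Queue payloads are serialized PHP; the exact structure (and where
  // the URL lives in it) depends on what the crawler enqueues.
  $item = unserialize($row->data);
  print format_date($row->created) . '  ' . print_r($item, TRUE) . "\n";
}
```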

Comments

Anonymous

I just need to check whether you understand how the crawler works, as it is inappropriately named. The crawler doesn't ever crawl the site; it only regenerates a page (or a family of related pages) when something has been edited, inserted, or deleted. As such, a list wouldn't be much good: the crawler is not a spider, and Boost (7.x) relies on anonymous users to generate the cache.

6.x did have a spider, but the problem was that it overloaded cron on sites with large numbers of pages, so generation never finished; that's the main reason it was removed.

If the data is really important, then I suggest enabling the Boost block found in admin/structure/block, named "Boost: page cache status". It gives all sorts of information to non-anonymous users, and includes a flush-page button too.

Vacilando

Good points, @Philip_Clarke. But yes, I believe I understand the way it works in D7. The Crawler is not doing the discovery type of crawling the way e.g. a search engine spider does. Instead, it simply rushes to re-cache pages (on cron, but in the background) that have expired, e.g. due to time or as a result of the rules set in the Expire module.

So yes, it is caching each URL in a defined set.

I -- and I suppose many users -- need to be able to see the list of URLs set to be "crawled" (read: cached) by the Crawler at any given point. This will allow us to fine-tune the expiration rules -- e.g. if I set only a particular content type to be expired based on a particular action, I need to be able to quickly check that the Crawler does not have other URLs in the queue.

Another use case: we need to see how many items are left in the queue and judge by that whether the cron frequency and length are optimal for the site. Otherwise it can happen that the crawler does not even finish re-caching pages before the given (sub)set expires and has to be re-cached again.
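
For that counting use case, something as small as this would do, again assuming a core Queue API queue with the hypothetical name 'boost_crawler':

```php
// Count the items still waiting, e.g. from `drush php-eval` right
// before and after a cron run, to see whether cron keeps up.
// ASSUMPTION: queue name 'boost_crawler' as above.
$queue = DrupalQueue::get('boost_crawler');
print $queue->numberOfItems() . " crawler items still queued\n";
```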

Hope this is a clearer explanation. If not, let me know, please.

(The Boost block is only useful for the page flush button. The stats are the same as what is injected at the end of the HTML, and only for the given page.)

Vacilando

https://drupal.org/project/queue_ui may be an answer to my question. We need to evaluate what it lacks in comparison to a possible tailored view that could be realized as part of the Boost Crawler submodule.