I've begun to look at implementing filters to narrow searches down by Year/Month/Day, etc., and hit a pretty serious issue in ApacheSolr.
The schema that ships with the module has the line
<field name="changed" type="integer" indexed="true" stored="true"/>
for the date field, while it really should be
<field name="changed" type="date" indexed="true" stored="true"/>
The big difference is that, when dates are defined as DateFields, Solr's built-in facets for dates become useful. With the dates stored as integers, the stock date faceting facilities can't be used. Unfortunately, Solr doesn't automatically cast Unix timestamps to its date format, so all of the nodes to index will need their dates reformatted to the Solr date spec (which is ISO 8601).
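For illustration, the timestamp conversion described above could look roughly like this. It is a minimal sketch in Python rather than the module's PHP, and the function name `unix_to_solr_date` is hypothetical; it assumes Solr expects UTC in the form `YYYY-MM-DDThh:mm:ssZ`.

```python
from datetime import datetime, timezone


def unix_to_solr_date(timestamp: int) -> str:
    """Format a Unix timestamp as a UTC date string with a trailing 'Z'."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# A node's "changed" value is a Unix timestamp; this is the string
# that would be sent for a field of type "date".
print(unix_to_solr_date(1176163200))
```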
| Comment | File | Size | Author |
|---|---|---|---|
| #55 | date-facets-293989-55.patch | 16.31 KB | pwolanin |
| #52 | apachesolr_293989.patch | 16.38 KB | drewish |
| #49 | apachesolr_date_patch_orig_rej_files.zip | 23.98 KB | xnickmx |
| #43 | apachesolr-date-facet-293989-43.patch | 21.05 KB | JacobSingh |
| #42 | apachesolr-date-facet-293989-42.patch | 16.36 KB | bjaspan |
Comments
Comment #1
robertdouglass commented
Good issue. Actions to take: update the schema, update indexing and searching code to send and receive the appropriate formats. We might choose to store two fields, changed (as it is now) and changed_date in the format you're suggesting. We'd do this if a) we didn't want to convert between date and timestamp when searching, or b) the changed timestamp were interesting/useful for some other reason. Otherwise, we can just change changed to date as you suggest.
Comment #2
tmcw commented
An update on this issue: date faceting is supposedly coming in Solr 1.3, which is in alpha, I believe.
We've worked around this by storing year/month/day information in the schema and faceting gradually on those integers. It's a midway solution, and not a great one. I can roll a patch if you're interested, though.
Comment #3
JacobSingh commented
As an aside:
@RobertDouglass: How do you plan to deal with 1.3 vs 1.2 differences in the code? Are we going to have to add config options, or *shudder* branch?
Comment #4
robertdouglass commented
Comment #5
robertdouglass commented
From the Solr 1.3 release notes. To be considered:
Comment #6
vladimir.dolgopolov commented
It's interesting.
It looks like we can only use regular intervals here ("today, yesterday, the day before yesterday" or "Jan, Feb, Mar,...").
That's because of the facet.date.start, facet.date.end, and facet.date.gap parameters.
I wonder how to get a facet for irregular intervals like: "today, last week, last month" or "last hour, last 8 hours, last 24 hours".
Maybe facet.date.other can help, but I couldn't find any examples of it.
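As a rough sketch of how the three parameters fit together, the following Python builds the request parameters for one regular-interval date facet. The helper name `date_facet_params` and the example field and dates are assumptions for illustration; only the `facet.date.*` parameter names come from the discussion above.

```python
from urllib.parse import urlencode


def date_facet_params(field: str, start: str, end: str, gap: str) -> dict:
    """Build query parameters for a single date facet with equal-sized gaps."""
    return {
        "facet": "true",
        "facet.date": field,
        "facet.date.start": start,
        "facet.date.end": end,
        "facet.date.gap": gap,
    }


# Twelve monthly buckets covering 2007 on the "changed" field.
params = date_facet_params(
    "changed", "2007-01-01T00:00:00Z", "2008-01-01T00:00:00Z", "+1MONTH"
)
print(urlencode(params))
```

Because there is a single gap value per facet, every bucket has the same size, which is the limitation being pointed out.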
Comment #7
vladimir.dolgopolov commented
I've created a patch for the task.
You have to change schema.xml. You should add this line:
There is a new block called: Filter by date.
Screenshots (Don't mind it please, there are future dates):
date_facet1.gif - overview of the block.
date_facet2.gif - we click "2007"
date_facet3.gif - we click "2007 April"
date_facet4.gif - we click "2007 April 10"
NB.
This patch also introduces query range syntax.
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Range%20Searches
You can enter in search field:
uid:[50 TO 100]
and you'll see nodes by uid from 50 to 100.
It is not a useful query. :)
But for date facets and date ranges, a query like:
changed_date:[2007-04-10T00:00:00Z TO 2007-04-10T23:59:59Z]
is common.
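A range filter like the one above can be generated mechanically. This is a minimal sketch in Python, not code from the patch; the function name `day_range_filter` is hypothetical.

```python
from datetime import date


def day_range_filter(field: str, day: date) -> str:
    """Return a range query covering one whole UTC day on the given field."""
    start = f"{day.isoformat()}T00:00:00Z"
    end = f"{day.isoformat()}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"


print(day_range_filter("changed_date", date(2007, 4, 10)))
```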
Comment #8
robertdouglass commented
Vladimir,
thank you, this looks very exciting. There is a provision in the Solr_Base_Query.php class for theming the breadcrumb: see function get_breadcrumb()
It allows you to make functions like this:
and the breadcrumb will then have a "human readable" representation. Would you be able to add such a theme function for making the dates readable?
Comment #9
JacobSingh commented
Hi Vlad,
This looks great! However, you're missing a patch to schema.xml, aren't you?
Best,
Jacob
Comment #10
vladimir.dolgopolov commented
@Robert
I've added a human-readable breadcrumb.
@Jacob
Yes, I was. schema.xml is now included.
I also slightly rearranged the functions.
Comment #11
vladimir.dolgopolov commented
Here is a test for the patch.
Comment #12
pwolanin commented
Is this patch against 5.x or 6.x? It needs to be against the current 6.x.
I'm having trouble grasping what all the code does, and whether there might be easier ways of doing this. For example, look at these links:
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html
It seems like we might be able to do some of the date faceting using the "NOW" construct, like "NOW-1WEEK" or "NOW+1DAY"
We should start with a very minimal patch to change the schema and enable indexing of that field. Are CCK date fields already being stored correctly? We have dynamic date fields in the schema, but it's not clear we send dates in the correct format.
Any date faceting code ought to be generic enough that we can use it to facet both the node changed date and any CCK date fields.
Comment #13
vladimir.dolgopolov commented
"NOW" is useful for setting the beginning and end of a date facet period.
Or we can use "NOW" for 'Range Queries' like "[NOW-1DAY TO NOW+1DAY]".
But to set a date facet "gap" we shouldn't use an expression with 'NOW'.
We should use '+1DAY', '+5YEAR' and so on.
So 'faceting' here is only possible with equal-sized gaps, imho.
It would be nice to use some kind of "multi-gap" like '+1DAY, +1WEEK, +1MONTH' but that's impossible now.
Thus, using 'NOW' for the date facet *gap* is useless; using 'NOW' for date *limits* is good.
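To make the distinction concrete, here is a hedged Python sketch in which the limits are relative to NOW while the gap stays a fixed increment. The helper name `recent_days_facet` is hypothetical, and the exact date math strings are an illustration of the syntax discussed in this comment, not taken from any patch.

```python
def recent_days_facet(field: str, days: int) -> dict:
    """Facet the last `days` days in one-day buckets, relative to NOW."""
    return {
        "facet": "true",
        "facet.date": field,
        # Limits may use NOW; "/DAY" rounds down to the start of the day.
        "facet.date.start": f"NOW/DAY-{days}DAYS",
        "facet.date.end": "NOW/DAY+1DAY",
        # The gap is a plain increment, with no NOW in it.
        "facet.date.gap": "+1DAY",
    }


print(recent_days_facet("changed", 7))
```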
Comment #14
pwolanin commented
yes, I meant for setting beginning and end of the period, not for setting the facet gap.
Comment #15
vladimir.dolgopolov commented
I've created a patch that only handles the $node->changed date as an ISO 8601 date.
There are no facets, etc. It's just a first move.
Comment #16
vladimir.dolgopolov commented
change status
Comment #17
vladimir.dolgopolov commented
Rerolled the patch. Left only the date-related stuff.
Comment #18
JacobSingh commented
Awesome. Nice work Vlad.
I tested it w/ sorts and biasing.
Committed to 6.x
Comment #19
pwolanin commented
ok, remaining date facet stuff still needs to be worked on.
Comment #20
vladimir.dolgopolov commented
The next attempt to get the 'changed' date narrow facet block working.
There is a lot of stuff concerning ISO 8601 date manipulation.
It should be handled more elegantly so that it works with the other apachesolr date-field code.
Comment #21
vladimir.dolgopolov commented
It uses the patch for date range queries from #230376: Add range query feature.
Comment #22
bjaspan commented
I spent some time trying to create generalized support for faceted search on date fields today. As a disclaimer, I have NO IDEA WHAT I AM DOING. I've never looked at any of this code before today. God help us all.
I changed Solr_Base_Query.php mostly by adding a lot of phpdoc so I could figure out what the heck was going on. I also "enhanced" (or maybe "damaged") query_extract() to support range queries in a way that returns the range values in a useful format and changed make_field() to know that it should not double-quote range queries. Finally, I modified parse_query() to preserve the extra range info from query_extract() in $this->fields.
The real fun is in apachesolr_search.module. hook_apachesolr_facets() can now return facets with a 'facet_date' property. When building up the Solr query params, facets with a facet date property are treated specially. They have a callback to determine the start, end, and gap parameters to add to the query.
I honestly have no idea if this is right. However, in apachesolr_search_block(), I always just dump all date facet counts as a drupal message, and I am getting results for the 'changed' facet. So presumably I am not 100% on the wrong track.
FYI, I changed the title of this issue since the previous title has already been accomplished.
Commentary welcome. :-)
Comment #23
pwolanin commented
since we don't need to support URL hacking, this can probably be simplified to be stricter:
something like:
Comment #24
pwolanin commented
I'm going to merge in this range query matching and fix some of the code comments for Solr_Base_Query w/ my other patch: http://drupal.org/node/369944
Comment #25
pwolanin commented
committing the attached as the basis for further work
Comment #26
pwolanin commented
whoops - not the one I meant to mark fixed.
Comment #27
bjaspan commented
New patch for date faceting. This is basically the same as yesterday's except it works with the Solr_Base_Query.php currently in CVS. To explain what I am doing:
* Suppose there is a date facet, e.g. 'changed'. On an unfiltered search, we want date facets for all documents, with a gap determined by the min and max value for all documents. e.g. for a site more than a year old, start is the date of the oldest document, end is the date of the newest document, and gap is +1YEAR because those dates are more than a year apart. So, we'll get back date facets like "2008 (6), 2009 (22), ...".
* A facet block will turn those date facets into links containing a filter: the 2008 (6) link will include changed:[2008-01-01 TO 2008-12-31] (date format simplified for this post). Generating these links will be relatively easy because Solr computes and returns all the intermediate dates for us.
* For a search with one or more filters on a date facet, we submit the search requesting a date facet for the smallest from/to range specified by any of the filters. So, if the filters are changed:[2008-01-01 TO 2008-12-31] changed:[2008-12-01 TO 2008-12-31], we request from 2008-12-01 to 2008-12-31 with a gap of +1DAY.
One thing I'm not doing yet is normalizing the default min/max from and to dates to the beginning/end of their gap range. If documents go from 2008-11-15 to 2009-01-07, that's more than a month so the gap will be +1MONTH. We should submit the search from 2008-11-01 to 2009-01-31 so that the intermediate dates Solr generates are 2008-11-01, 2008-12-01, 2009-01-01, so that our easy-to-generate links correspond to meaningful date periods.
All this assumes no support for relative date facets, e.g. "last week." Start small.
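The "smallest from/to range specified by any of the filters" step can be sketched as follows. This is an illustrative Python version, not the patch's code; the function name `smallest_range` and the tuple representation of a filter are assumptions.

```python
from datetime import datetime

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def smallest_range(ranges):
    """Pick the (start, end) pair that spans the least time."""
    def width(pair):
        start, end = (datetime.strptime(value, SOLR_FORMAT) for value in pair)
        return end - start
    return min(ranges, key=width)


# The two filters from the example: all of 2008, then December 2008.
filters = [
    ("2008-01-01T00:00:00Z", "2008-12-31T00:00:00Z"),
    ("2008-12-01T00:00:00Z", "2008-12-31T00:00:00Z"),
]
print(smallest_range(filters))
```

The narrower December range wins, so the next facet request would cover that month with a +1DAY gap.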
Comment #28
bjaspan commented
Note that my patch anticipates some changes to Solr_Base_Query.php that I discussed with Peter; it is using a kludgey workaround until then.
Comment #29
bjaspan commented
New patch. Exactly like the old one except it now normalizes the start/end dates to the beginning of the year, month, day, etc. so that generating the facet links is easy.
Comment #30
bjaspan commented
New patch. This one adds a basic, unformatted date facet block. It shows the gap being used for each drill-down link so you can see as you go from year to month to day, etc. The gap ought to be used to format the links more nicely, of course.
Comment #31
bjaspan commented
New patch. Slight tweak to the gap computation so clicking on a year in the facet block results in a month gap, and clicking on a month results in a day gap.
Comment #32
pwolanin commented
would it be possible to do 2 day gaps? or 1-week gaps when on month? Just thinking a ~30 item facet block is a little unwieldy
Comment #33
janusman commented
Agree with peter on month -> week -> day, however it's not really critical =)
Now, wishing doesn't cost anything, so:
How about applying this to events with radically different timeframes? E.g. centuries (events like "Birth of rome" vs. "Berlin wall falls"), decades (imagine your family photo collection) and even minutes/seconds (e.g. filtering a webserver event log).
So either:
a) we let the granularity be configurable by the admin somehow
b) place generic functionality for this granularity: century -> decade -> year -> month -> week -> day -> hour -> minute (Or pick any X from the list)
Wishing mode off. =)
Comment #34
bjaspan commented
#32: Yes, I think we can do n-gaps (2 days, etc). We just need to figure out the rules for when to do what. I suggest, however, that we get the most obvious 1-gap code cleaned up and finished first.
#33: The code is currently using a simple rule for gaps from year to second. I do not know what other gaps Solr supports. But wishing-mode isn't useful until there is something simple to start from.
Comment #35
pwolanin commented
I think it's a bit silly to support down to 1 second gaps - can you see a 95% use case for anything less than 1 hour?
Comment #36
JacobSingh commented
Yeah, I agree. I think by the hour is fine. If we go higher res it would be by quarter hour I assume..
Best,
Jacob
Comment #37
bjaspan commented
I think at each level we should offer drill-down to the next level if and only if there is at least one group at the current level that has more than N items in it. I don't know if N should be 1, or the number of items shown per page on the response, or something else. But this way, on a site like jaspan.com with very infrequent posts, the drill-down will only go to MONTH and then stop, whereas on a busier site like Drupal it will go to some lower level. I do agree that HOUR is probably the smallest reasonable gap.
The code to generate the date facet has the counts for each gap so it is easy to know when to allow further drill-down. I guess what it does not have is the gaps and counts for the *previous* search in which the user drilled down (i.e. when looking at Month and you drill, you get back Days and see there is never more than 1 per day, but you no longer have the list of Month gaps and counts). So maybe this is not trivial to implement.
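The proposed rule is small enough to sketch directly. This Python is only an illustration of the idea; the function name `allow_drilldown` and the dictionary of counts are hypothetical, and the right value of N is left open as in the comment.

```python
def allow_drilldown(facet_counts: dict, threshold: int = 1) -> bool:
    """Offer the next level only if some group has more than `threshold` items."""
    return any(count > threshold for count in facet_counts.values())


# A quiet site: no month holds more than one post, so stop at MONTH.
print(allow_drilldown({"2008-11": 1, "2008-12": 1, "2009-01": 0}))
# A busier site: one month has 22 posts, so days are worth showing.
print(allow_drilldown({"2008-11": 6, "2008-12": 22}))
```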
Comment #38
pwolanin commented
@bjaspan - I'm not sure we need to worry about cutting off the drill-down for now. If the user is not clever enough to see that there is only 1 current search result and keeps clicking the next narrower range, then so be it.
Comment #39
robertdouglass commented
Beware endless link loops that Google might follow.
Comment #40
pwolanin commented
@robertDouglas - we already suggested that hour might be a minimum time unit - I'm certainly not suggesting that we infinitely sub-divide the time interval.
Comment #41
bjaspan commented
New patch. This one displays pretty labels in the drilldown block, Current Search block, and breadcrumbs. It's quite useable as is, though only node modification time is implemented as a date facet. CCK fields are not yet supported but adding them should be not too hard at this point. This code is not done, but setting to CNR because it needs to be reviewed. :-)
Comment #42
bjaspan commented
New patch. This one adds a created drilldown block in addition to the changed drilldown block. Having two date blocks makes it obvious that the breadcrumbs need to identify which date field they are for. i.e., when you search for "post created in 2009" and "CCK artwork painted field in 1638", the breadcrumbs should look something like
Search > Created 2009 > Painted 1638
and not
Search > 2009 > 1638
This is not implemented in this patch, though.
I'm done for right now.
Comment #43
JacobSingh commented
Works for me. Awesome contribution.
I spent a couple hours I probably shouldn't have today and hacked in CCK support. There are a couple issues with this, especially the following:
really, even the existing code to figure out the max range is bad because I imagine in a large node table this query is expensive for every run. I asked on solr-user about how to get the min and max values from solr. Probably the better thing to do is to cache this every cron run so we've just got it.
The only other changes I made were to fix a Foreach (foreach) and a comment about a variable which seemed useless ($active).
I'm leaving this for review, but if one doesn't come in the next day or two, I'll just commit, because it looks pretty good for now IMO.
Comment #44
pwolanin commented
I'm not totally happy with the code. Looks like we need a more consistent way to map deltas to actual facet field names and also some better methods in the query class.
Comment #45
JacobSingh commented
yeah, probably true. Also, I left a couple of queryd calls in there :(
I can post a new patch, but my wc is a little borked at the moment.
I'm happy enough with the approach to go forward and look at it later, but if you want to hold it off until we archetype it better, I defer :)
Comment #46
bjaspan commented
I think it will be helpful if I review and explain my own patch:
The date processing logic is a little subtle. There are two different ways and places dates are used: breadcrumbs and drilldown. And there are two different date searching scenarios: when the current query already includes a date filter, and when it does not. All four of these cases are per-facet, so having a filter in the current query for "changed" still leaves the "created" logic in the scenario of having no existing filter.
When there is no existing filter for a date facet, we need to supply an initial start and end datestamp and gap as part of the solr query. The code does this by looking in the database for min and max values *for that specific facet* (e.g. the node.changed column, or the CCK data storage table column (not implemented), etc.), and then dynamically choosing an initial gap based on those dates. The initial gap is the first match of: YEAR if min/max are more than a year apart, MONTH if they occur in separate calendar months, DAY if they are more than 86,400 seconds apart, HOUR if more than 3600 seconds apart, etc. The YEAR/MONTH rules have these consequences:
* 1/1/2007 to 1/2/2008 are YEAR. Facets: 2007, 2008.
* 1/1/2007 to 12/31/2007 are MONTH. Facets: Jan 07, Feb 07, ... Dec 07.
* 11/15/2007 to 2/1/2008 are MONTH (even though they occur in different years). Facets: Nov 07, Dec 07, Jan 08, Feb 08.
* 1/1/2007 to 1/31/2007 are DAY (because they are less than a year apart and in the same month). Facets: Jan 1 07, Jan 2 07, ..., Jan 31 07.
* 2/1/2007 to 3/1/2007 are MONTH (different months) even though that's actually fewer days than 1/1/2007 to 1/31/2007 (DAY).
I think these are intuitively the correct rules to use but are open to debate.
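The rules and the five examples above can be checked with a short sketch. This is Python for illustration, not the patch's PHP; `initial_gap` is a hypothetical name, "more than a year apart" is assumed to mean more than 365 days, and the MINUTE and SECOND steps are an assumed continuation of the "etc." in the description.

```python
from datetime import datetime, timedelta


def initial_gap(earliest: datetime, latest: datetime) -> str:
    """Choose the first gap whose rule matches the min/max dates."""
    span = latest - earliest
    if span > timedelta(days=365):  # assumption: "a year" means 365 days
        return "YEAR"
    if (earliest.year, earliest.month) != (latest.year, latest.month):
        return "MONTH"  # separate calendar months
    if span > timedelta(seconds=86400):
        return "DAY"
    if span > timedelta(seconds=3600):
        return "HOUR"
    if span > timedelta(seconds=60):  # assumed continuation of "etc."
        return "MINUTE"
    return "SECOND"


print(initial_gap(datetime(2007, 1, 1), datetime(2008, 1, 2)))
print(initial_gap(datetime(2007, 2, 1), datetime(2007, 3, 1)))
```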
Whenever we submit a start/end/gap filter to solr, we always round the start datestamp to the beginning of the period and the end date to the beginning of the NEXT period (not the end of the current one). For example, if we have 1/5/2007 9:10:11 to 2/27/2007 14:15:16, that's a MONTH gap, so we round to month boundaries and send 1/1/2007 00:00:00 to 3/1/2007 00:00:00. This is actually done with Solr's date math capability. We *really* send "2007-01-05T09:10:11Z/MONTH TO 2007-02-27T14:15:16Z+1MONTH/MONTH".
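To show what the rounding amounts to for a MONTH gap, here is a hedged Python sketch. The patch delegates the rounding to Solr's date math; the local functions `floor_to_month` and `start_of_next_month` are hypothetical and only reproduce the expected result, and `solr_month_range` assumes the dates are written in Solr's ISO 8601 form.

```python
from datetime import datetime


def floor_to_month(moment: datetime) -> datetime:
    """Round down to 00:00:00 on the first day of the month."""
    return moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def start_of_next_month(moment: datetime) -> datetime:
    """Round up to 00:00:00 on the first day of the following month."""
    first = floor_to_month(moment)
    if first.month == 12:
        return first.replace(year=first.year + 1, month=1)
    return first.replace(month=first.month + 1)


def solr_month_range(start_iso: str, end_iso: str) -> str:
    """Let Solr do the same rounding with date math on the raw dates."""
    return f"{start_iso}/MONTH TO {end_iso}+1MONTH/MONTH"


print(floor_to_month(datetime(2007, 1, 5, 9, 10, 11)))
print(start_of_next_month(datetime(2007, 2, 27, 14, 15, 16)))
print(solr_month_range("2007-01-05T09:10:11Z", "2007-02-27T14:15:16Z"))
```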
So, all date facet filters are ALWAYS from the beginning of a period to the beginning of the next period. This has two important consequences. (1) The date facet groups that Solr generates will always fall on natural period boundaries, either years, months, days, etc. We won't ever get a date facet from 3pm one day to 3pm the next day, because DAY gap facets always start at T00:00:00. (2) Given only a start/end filter pair in ISO format (which is what Solr gives us), we can determine the gap between them very easily: if the seconds are different, it must be a SECOND gap; else if the minutes are different, a MINUTE gap; and so on.
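Consequence (2) can be sketched as a comparison from the smallest unit upward. This Python is illustrative only; `gap_between` is a hypothetical name, and it relies on both dates lying on period boundaries as described.

```python
from datetime import datetime

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def gap_between(start_iso: str, end_iso: str) -> str:
    """Infer the gap from two dates that both sit on period boundaries."""
    start = datetime.strptime(start_iso, SOLR_FORMAT)
    end = datetime.strptime(end_iso, SOLR_FORMAT)
    # Check the smallest unit first; the first differing one is the gap.
    for unit in ("second", "minute", "hour", "day", "month"):
        if getattr(start, unit) != getattr(end, unit):
            return unit.upper()
    return "YEAR"


print(gap_between("2008-12-01T00:00:00Z", "2009-01-01T00:00:00Z"))
```

In the printed example the seconds, minutes, hours and days all match, so the first difference is the month, even though the year also changes.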
Okay. So we have no date filter yet, and we submit a search. We determine the min/max range and an initial gap using the rules above, so we get back date facet data, and we also have info on the range we searched for, which is always a period-to-period filter as described above. We need to determine what to put into the breadcrumb and what to put in the drilldown block.
For the breadcrumb, we look at the start/end range for the filter used in the search, one at a time. We can determine the gap for that range (see above), so we just format it appropriately. For YEAR, we show 2008. For MONTH, January 2008. etc. The fact that we do not need anything except the start/end date is good, because that's all we get.
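A breadcrumb label then only needs the start date and the inferred gap. The sketch below is Python for illustration; `breadcrumb_label` is a hypothetical name, the YEAR and MONTH formats follow the examples in the text, and the DAY and HOUR formats are assumptions.

```python
from datetime import datetime

# YEAR and MONTH follow the examples above; DAY and HOUR are assumed formats.
LABEL_FORMATS = {
    "YEAR": "%Y",
    "MONTH": "%B %Y",
    "DAY": "%B %d, %Y",
    "HOUR": "%B %d, %Y %H:00",
}


def breadcrumb_label(start: datetime, gap: str) -> str:
    """Format the start of a range according to its gap."""
    return start.strftime(LABEL_FORMATS[gap])


print(breadcrumb_label(datetime(2008, 1, 1), "YEAR"))
print(breadcrumb_label(datetime(2008, 1, 1), "MONTH"))
```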
For the drilldown, we also look at the start/end range, and determine the gap. In this case, the start/end range does not come from looking in the database but by looking at the Solr query results, which has all of the date facets nicely laid out for us, on period boundaries (because of how we specified the query). Each entry in the drilldown block is a link back to the search with a new filter added, from one facet value to the next, using the next smaller gap. Unfortunately, we cannot (or maybe just do not) currently encode the gap to use along with a date facet filter built up in the drilldown block. Therefore, in the date facet callback (see below), when actually building the Solr query, we iterate through all the filters for the current date facet and find the smallest, and drill down from that filter's gap for the query. Now that I think about it, we could probably just encode the gap to use in the URL along with the start/end time and eliminate the "find the smallest range" logic in the date facet callbacks (again, see below).
Whew! Okay, now to the code.
The big change that allows handling multiple date facets cleanly (and what I think was not clear to Jacob in his patch in #43) is that hook_apachesolr_facets() can now return date facets *with a callback to generate the default start/end/gap tuple*. In my patch, only one callback is implemented (apachesolr_search_date_range()) and it supports the created and changed deltas. We can implement a DIFFERENT callback (e.g. apachesolr_search_cck_date()) for handling CCK date fields. If we do not/cannot embed the gap to use in the URL, then the "find the smallest range" logic in apachesolr_search_date_range() should be factored out so it can be re-used by multiple date facet callbacks.
In apache.module, apachesolr_date_facet_block() is a slightly modified copy of apachesolr_facet_block(). Clearly it should be refactored, but I figured someone more familiar with the code could do it better.
Obviously, _hack_parse_date_range() is a hack indicating that Solr_Base_Query.php needs a new method. The comment and foreach loop in apachesolr_search_date_range() also indicate the need for a new method.
In order to make a single date breadcrumb theme function handle multiple CCK date fields, we'll have to pass it arguments telling it which date field it is being invoked for. I haven't figured out the D6 theme API yet but I doubt this is hard.
I'm going on vacation for the rest of this week and will be unreachable. I can help with this more next week.
Comment #47
David_Rothstein commented
Subscribe.
I was thinking I might try to review this if I had time, but based on the latest comments I'm not quite sure if the code is at the right state for that at the moment...
Comment #48
xnickmx commented
subscribing
Comment #49
xnickmx commented
I've been trying to get the patch from message 42 installed.
Here is what I tried, and the output from patch:
Maybe I'm just having a bad day? I had trouble applying another apachesolr patch today.
I am working against the HEAD branch, using Windows and the UnixUtils patch tool.
I'm not sure if they're helpful, but I've also attached the orig and rej files that the patch tool generated.
Any helpful hints will be very appreciated. I am really excited about the apachesolr module and I want to contribute. Unfortunately, I keep getting stuck on the seemingly easy parts.
Comment #50
pwolanin commented
Don't use HEAD, we are working off the DRUPAL-6--1 branch.
Comment #51
xnickmx commented
@pwolanin - Thanks for the suggestion of trying the DRUPAL-6--1 branch. Unfortunately that branch fails in the exact same way as HEAD failed for me yesterday. Do you have any other suggestions? Is there anything else simple that I may be missing?
Comment #52
drewish commented
here's a re-roll of bjaspan's patch from #42, hopefully xnickmx will have a little more luck with it.
Comment #53
smoothify commented
subscribing
Comment #54
xnickmx commented
I figured out my problem - it was line endings and not knowing which files to change. After I ran unix2dos on drewish's patch (and only the patch, not the files that were being patched), the patching worked just fine. I will be reviewing this patch and hopefully offering up some feedback in the next couple of weeks.
Comment #55
pwolanin commented
re-roll of #42 - only changed "Foreach" to "foreach"
Committing this - we need to revisit a bunch of refactoring and CCK support in a follow-up issue.