Rebuilding the SPARQL endpoint index is limited to the first 500 entries. The attached patch works around that limitation by using the Batch API for indexing; that way I've successfully imported more than 1 million RDF triples.
Also, the patch makes the code entity-generic using EntityFieldQuery (EFQ), so it could easily be extended to cover more or all entity types. I guess we should add a confirm_form() step before clearing the whole index, though?
Also, indexing on node_insert/update could be generalized to entity_insert/update, and the delete step is currently missing.
Note: the batch lacks a proper way to report progress; the output messages can, however, be read during processing by visiting another page of the same site in the same browser. This helps, as importing ~1 million triples took quite a while (several hours).
Note 2: I've been using this for the example "linked open data" portal of Austria: http://austria.drupaldata.com/sparql-queries
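To illustrate the approach described above, here is a minimal sketch of a Batch API based reindex for Drupal 7. The function names, the chunk size, and the `sparql_endpoint_index_node()` helper are illustrative assumptions, not the actual code from the patch; only the Batch API calls themselves (`batch_set()`, the `$context` sandbox) are standard Drupal 7 API.

```php
/**
 * Sketch only: rebuild the SPARQL index in Batch API chunks so the
 * process is not capped at the first 500 entries.
 */
function sparql_endpoint_rebuild_batch() {
  $batch = array(
    'title' => t('Rebuilding SPARQL index'),
    'operations' => array(
      array('sparql_endpoint_rebuild_batch_step', array()),
    ),
  );
  batch_set($batch);
}

function sparql_endpoint_rebuild_batch_step(&$context) {
  $limit = 50;
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = db_query('SELECT COUNT(nid) FROM {node}')->fetchField();
  }
  // Load and index the next chunk of nodes.
  $nids = db_query_range('SELECT nid FROM {node} ORDER BY nid',
    $context['sandbox']['progress'], $limit)->fetchCol();
  foreach (node_load_multiple($nids) as $node) {
    // Hypothetical indexing helper for a single node.
    sparql_endpoint_index_node($node);
  }
  $context['sandbox']['progress'] += count($nids);
  $context['finished'] = empty($context['sandbox']['max'])
    ? 1
    : $context['sandbox']['progress'] / $context['sandbox']['max'];
}
```

Because each step processes a bounded chunk and reports `$context['finished']` as a fraction, the batch can run for hours without hitting PHP's execution time limit.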
| Comment | File | Size | Author |
|---|---|---|---|
| #9 | 1389718_batch-index-rebuild_9.patch | 7.57 KB | scor |
| #8 | 1389718_batch-index-rebuild_8.patch | 4.26 KB | scor |
| #7 | 1389718_batch-index-rebuild_7.patch | 4.13 KB | scor |
| #5 | 1389718_batch-index-rebuild_2.patch | 4.61 KB | jessebeach |
| | sparql_rebuild.patch | 4.05 KB | fago |
Comments
Comment #1
Anonymous (not verified) commented:
Thanks, fago :) The work on the SPARQL endpoint has been handled mostly by scor, so it would be great if he could take a look at this.
Unfortunately, the performance of ARC2 is expected to be very bad at that scale, ARC2 is no longer maintained, and there don't seem to be any efforts to create a scalable triple store that uses only PHP and MySQL (at least, none that I'm aware of). If you're working at that scale, you might want to connect to an external triple store. I know scor had mocked up some code on GitHub for what such a connection might look like, but it never made it to Drupal.org. IIRC, it's called RDF DB.
Comment #2
scor commented:
Thanks for your patch, @fago! The EFQ approach makes more sense. In the long run I think integrating with search_api is the way to go, as it offers lots of good features: indexing on the fly (as sparql_endpoint does now) or indexing later, choosing which entity types and bundles to index, etc. This would also take care of your node_insert/update remark.
Comment #3
fago commented:
Thanks for the suggestions. An external triple store would be fine too, but since this is just for demo purposes, I guess the ARC2 store will do. Kasabi looks really interesting, though.
@scor: I'm not so sure about going with search_api, as I don't think the search backend receives the full entity, just the data items one selects for indexing. So a dedicated solution working directly with the entities should be fine, since implementing the indexing logic is rather simple.
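The entity-generic, EFQ-based querying discussed here could look roughly like the following sketch. `EntityFieldQuery`, `entity_load()`, and their signatures are standard Drupal 7 API; the `sparql_endpoint_index_entity()` helper and the `$offset`/`$limit` paging variables are assumptions for illustration.

```php
// Sketch: fetch one page of IDs for an arbitrary entity type and index
// the fully loaded entities directly (rather than preselected data items,
// as a search backend would receive).
$query = new EntityFieldQuery();
$result = $query
  ->entityCondition('entity_type', $entity_type)
  ->range($offset, $limit)
  ->execute();
if (!empty($result[$entity_type])) {
  $entities = entity_load($entity_type, array_keys($result[$entity_type]));
  foreach ($entities as $entity) {
    // Hypothetical helper that writes the entity's triples to the store.
    sparql_endpoint_index_entity($entity_type, $entity);
  }
}
```

Since EFQ queries by entity type rather than querying the `{node}` table directly, the same loop works for users, taxonomy terms, or any other entity type without node-specific code.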
Comment #4
scor commented:
Tagging.
Comment #5
jessebeach commented:
I updated the drupal_set_message() calls with a little more detail.
I can't tell from the comments if this patch meets the needs of the issue, but I can say that it applies cleanly and runs the indexing in batches without issue.
Comment #6
scor commented:
There is something funky going on in the batch integration.
I initially tested with a few nodes and got this:
I expected to see the number of nodes I had on my site in the list. (The reindexing did happen, though.)
Then I set SPARQL_ENDPOINT_BUILD_AT_ONCE to 30 just to see, and tested with 100+ nodes:
Comment #7
scor commented:
Add support for all entity types.
Comment #8
scor commented:
Add minimal support for configuring which entity types should be indexed, via an array stored in a Drupal variable.
Comment #9
scor commented:
This patch updates the various hook_entity_insert/update/delete() implementations to take the sparql_endpoint_entity_types variable into account.
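The guard described in this comment might look like the sketch below. The `sparql_endpoint_entity_types` variable name comes from the comment itself; the default value `array('node')` and the `sparql_endpoint_index_entity()` helper are assumptions, and the same check would apply in the update and delete hooks.

```php
/**
 * Sketch: implements hook_entity_insert().
 *
 * Only index entity types that the site administrator has enabled in the
 * sparql_endpoint_entity_types variable.
 */
function sparql_endpoint_entity_insert($entity, $type) {
  $enabled = variable_get('sparql_endpoint_entity_types', array('node'));
  if (in_array($type, $enabled)) {
    // Hypothetical helper that writes the entity's triples to the store.
    sparql_endpoint_index_entity($type, $entity);
  }
}
```

Keeping the check inside each hook means the module can still be enabled site-wide while indexing only the configured subset of entity types.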