Rebuilding the SPARQL endpoint index is limited to the first 500 entries. The attached patch works around that limitation by using the Batch API for indexing; that way I've successfully imported more than 1 million RDF triples.
Also, the patch makes the code entity-generic using EntityFieldQuery (EFQ), so it could easily be extended to cover more or all entity types. I guess we should add a confirm_form() step before clearing the whole index, though?
Also, indexing on node_insert/update could be generalized to entity_insert/update, and the delete step is currently missing.
Note: the batch lacks a proper way to report progress; the output messages can, however, be read during processing by visiting another page of the same site in the same browser. This helps, as importing ~1 million triples took quite a while (several hours).
Note 2: I've been using this for the example "linked open data" portal of Austria: http://austria.drupaldata.com/sparql-queries
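To illustrate the approach described above, here is a minimal sketch of a Batch API based reindex for Drupal 7. The function names, the chunk size, and the `sparql_endpoint_index_node()` helper are illustrative assumptions, not the actual code from the patch; only the Batch API calls themselves (`batch_set()`, the `$context` sandbox) are standard Drupal 7 API.

```php
/**
 * Sketch only: rebuild the SPARQL index in Batch API chunks so the
 * process is not capped at the first 500 entries.
 */
function sparql_endpoint_rebuild_batch() {
  $batch = array(
    'title' => t('Rebuilding SPARQL index'),
    'operations' => array(
      array('sparql_endpoint_rebuild_batch_step', array()),
    ),
  );
  batch_set($batch);
}

function sparql_endpoint_rebuild_batch_step(&$context) {
  $limit = 50;
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = db_query('SELECT COUNT(nid) FROM {node}')->fetchField();
  }
  // Load and index the next chunk of nodes.
  $nids = db_query_range('SELECT nid FROM {node} ORDER BY nid',
    $context['sandbox']['progress'], $limit)->fetchCol();
  foreach (node_load_multiple($nids) as $node) {
    // Hypothetical indexing helper for a single node.
    sparql_endpoint_index_node($node);
  }
  $context['sandbox']['progress'] += count($nids);
  $context['finished'] = empty($context['sandbox']['max'])
    ? 1
    : $context['sandbox']['progress'] / $context['sandbox']['max'];
}
```

Because each step processes a bounded chunk and reports `$context['finished']` as a fraction, the batch can run for hours without hitting PHP's execution time limit.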
| Comment | File | Size | Author |
|---|---|---|---|
| #9 | 1389718_batch-index-rebuild_9.patch | 7.57 KB | scor |
| #8 | 1389718_batch-index-rebuild_8.patch | 4.26 KB | scor |
| #7 | 1389718_batch-index-rebuild_7.patch | 4.13 KB | scor |
| #5 | 1389718_batch-index-rebuild_2.patch | 4.61 KB | jessebeach |
| | sparql_rebuild.patch | 4.05 KB | fago |
Comments
Comment #1
Anonymous (not verified) commented:
Thanks, fago :) The work on the SPARQL endpoint has been handled mostly by scor, so it would be great if he could take a look at this.
Unfortunately, the performance of ARC2 is expected to be very bad at that scale, ARC2 is no longer maintained, and there don't seem to be any efforts to create a scalable triple store that uses only PHP and MySQL (at least, none that I'm aware of). If you're working at that scale, you might want to connect to an external triple store. I know scor had mocked up some code on GitHub for what such a connection might look like, but it never made it to Drupal.org. IIRC, it's called RDF DB.
Comment #2
scor commented:
Thanks for your patch, @fago! The EFQ approach makes more sense. In the long run I think integrating with search_api is the way to go, as it offers lots of good features: indexing on the fly (as sparql_endpoint does now) or indexing later, choosing which entity types and bundles to index, etc. This would also take care of your node_insert/update remark.
Comment #3
fago commented:
Thanks for the suggestions. An external triple store would be fine too, but since this is just for demo purposes, I guess the ARC2 store will do. Kasabi looks really interesting, though.
@scor: I'm not so sure about going with search_api, as I don't think the search backend receives the full entity, just the data items one selects for indexing. So a dedicated solution working directly with the entities should be fine, since implementing the indexing logic is rather simple.
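The entity-generic, EFQ-based querying discussed here could look roughly like the following sketch. `EntityFieldQuery`, `entity_load()`, and their signatures are standard Drupal 7 API; the `sparql_endpoint_index_entity()` helper and the `$offset`/`$limit` paging variables are assumptions for illustration.

```php
// Sketch: fetch one page of IDs for an arbitrary entity type and index
// the fully loaded entities directly (rather than preselected data items,
// as a search backend would receive).
$query = new EntityFieldQuery();
$result = $query
  ->entityCondition('entity_type', $entity_type)
  ->range($offset, $limit)
  ->execute();
if (!empty($result[$entity_type])) {
  $entities = entity_load($entity_type, array_keys($result[$entity_type]));
  foreach ($entities as $entity) {
    // Hypothetical helper that writes the entity's triples to the store.
    sparql_endpoint_index_entity($entity_type, $entity);
  }
}
```

Since EFQ queries by entity type rather than querying the `{node}` table directly, the same loop works for users, taxonomy terms, or any other entity type without node-specific code.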
Comment #4
scor commented:
Tagging.
Comment #5
jessebeach commented:
I updated the drupal_set_message() calls with a little more detail.
I can't tell from the comments if this patch meets the needs of the issue, but I can say that it applies cleanly and runs the indexing in batches without issue.
Comment #6
scor commented:
There is something funky going on in the batch integration.
I initially tested with a few nodes and got this:
I expected to see the number of nodes I had on my site in the list. (The reindexing did happen, though.)
Then I set SPARQL_ENDPOINT_BUILD_AT_ONCE to 30 just to see, and tested with 100+ nodes:
Comment #7
scor commented:
Add support for all entity types.
Comment #8
scor commented:
Add minimal support for configuring which entity types should be indexed, via an array stored in a Drupal variable.
Comment #9
scor commented:
This patch updates the various hook_entity_insert/update/delete() implementations to take the sparql_endpoint_entity_types variable into account.
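The guard described in this comment might look like the sketch below. The `sparql_endpoint_entity_types` variable name comes from the comment itself; the default value `array('node')` and the `sparql_endpoint_index_entity()` helper are assumptions, and the same check would apply in the update and delete hooks.

```php
/**
 * Sketch: implements hook_entity_insert().
 *
 * Only index entity types that the site administrator has enabled in the
 * sparql_endpoint_entity_types variable.
 */
function sparql_endpoint_entity_insert($entity, $type) {
  $enabled = variable_get('sparql_endpoint_entity_types', array('node'));
  if (in_array($type, $enabled)) {
    // Hypothetical helper that writes the entity's triples to the store.
    sparql_endpoint_index_entity($type, $entity);
  }
}
```

Keeping the check inside each hook means the module can still be enabled site-wide while indexing only the configured subset of entity types.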