Rebuilding the SPARQL endpoint index is limited to the first 500 entries. The attached patch works around that limitation by using the Batch API for indexing; that way I've successfully imported more than 1 million RDF triples.
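For context, a minimal sketch of what driving the rebuild through the Batch API could look like. The function names and the operations list here are illustrative, not necessarily those of the patch:

```php
<?php

/**
 * Sketch: kick off an index rebuild via the Batch API so the work is
 * split across multiple HTTP requests instead of one capped query.
 * All function names below are hypothetical.
 */
function sparql_endpoint_rebuild() {
  $batch = array(
    'title' => t('Rebuilding SPARQL endpoint index'),
    // One operation per entity type; each callback pages through its
    // entities itself and signals completion via $context['finished'].
    'operations' => array(
      array('sparql_endpoint_index_entities', array('node')),
      array('sparql_endpoint_index_entities', array('user')),
    ),
    'finished' => 'sparql_endpoint_rebuild_finished',
  );
  batch_set($batch);
}
```

Because each operation runs in its own request, the rebuild is no longer bound by the range of a single query.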

Also, the patch makes the code entity-generic using EntityFieldQuery (EFQ), so it could easily be extended to cover more or all entity types. I guess we should show a confirm_form() before clearing the whole index, though?
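For readers unfamiliar with EFQ, the entity-generic selection boils down to something like this sketch (variable names are mine, not the patch's):

```php
<?php

// Sketch: page through IDs of any entity type with EntityFieldQuery
// (Drupal 7 core API). $entity_type, $offset and $limit are assumed
// to be supplied by the caller.
$query = new EntityFieldQuery();
$query->entityCondition('entity_type', $entity_type)
  ->range($offset, $limit);
$result = $query->execute();

if (!empty($result[$entity_type])) {
  // execute() returns IDs keyed by entity type; load the full entities.
  $entities = entity_load($entity_type, array_keys($result[$entity_type]));
}
```

The same loop works for nodes, users, taxonomy terms, or any other entity type, which is what makes the generic rebuild possible.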

Also, indexing upon hook_node_insert()/hook_node_update() could be generalized to hook_entity_insert()/hook_entity_update(), and it is currently missing the delete step.
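A sketch of what the generalized hooks could look like, including the missing delete step. The helper functions are hypothetical; only the hook signatures are Drupal 7 core:

```php
<?php

/**
 * Sketch: entity-generic indexing hooks (helpers are hypothetical).
 */
function sparql_endpoint_entity_insert($entity, $type) {
  sparql_endpoint_index_entity($entity, $type);
}

function sparql_endpoint_entity_update($entity, $type) {
  // Re-index so stale triples are replaced.
  sparql_endpoint_index_entity($entity, $type);
}

function sparql_endpoint_entity_delete($entity, $type) {
  // The currently missing step: remove the entity's triples from the store.
  sparql_endpoint_unindex_entity($entity, $type);
}
```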

Note: The batch lacks a proper way to report progress; the output messages can, however, be read during processing by visiting another page of the same site in the same browser. This helps, as importing ~1 million triples took quite a while (several hours).

Note 2: I've been using this for the example "linked open data" portal of Austria: http://austria.drupaldata.com/sparql-queries

Comments

Anonymous’s picture

Component: Code » SPARQL Endpoint
Assigned: Unassigned » scor

Thanks, Fago :) The work on SPARQL endpoint has been handled mostly by scor, so it would be great if he could take a look at this.

Unfortunately, the performance of ARC2 is expected to be very bad at that scale, ARC2 is no longer maintained, and there don't seem to be any efforts to create a scalable triple store that only uses PHP and MySQL (at least, not that I'm aware of). If you're working at that scale, you might want to connect to an external triple store. I know that scor had mocked up some code for what such a connection might look like on github, but it never made it to Drupal.org. IIRC, it's called RDF DB.

scor’s picture

Thanks for your patch @fago! The EFQ approach makes more sense. In the long run I think integrating with search_api is the way to go, as it offers lots of useful features: indexing on the fly (as sparql_endpoint does now) or indexing later, choosing which entity types and bundles to index, etc. This would also take care of your node_insert/update remark.

fago’s picture

Thanks for the suggestions. An external triple store would be fine too, but as it's just for demo purposes I guess the ARC2 store will do. Kasabi looks really interesting, though.

@scor: I'm not so sure about going with search_api, as I don't think the search back-end receives the full entity, just the data items one has selected for indexing. Thus, I think a dedicated solution working directly with the entities should be fine, as implementing the index logic is rather simple.

scor’s picture

Issue tags: +RDF, +sprint

tagging

jessebeach’s picture

New file attached (4.61 KB)

I updated the drupal_set_message() calls with a little more detail.

I can't tell from the comments if this patch meets the needs of the issue, but I can say that it applies cleanly and runs the indexing in batches without issue.

scor’s picture

Status: Needs review » Needs work

There is something funky going on in the batch integration.
I initially tested with a few nodes and got this:

Processed 0 node entities.
Processed 0 user entities.
Processed 0 taxonomy_term entities.

I expected to see the number of nodes I had on my site in the list. (the reindexing did happen though).

Then I set SPARQL_ENDPOINT_BUILD_AT_ONCE to 30 just to see, and tested with 100+ nodes:

Processed 0 node entities.
Processed 30 node entities.
Processed 60 node entities.
Processed 90 node entities.
Processed 120 node entities.
Processed 0 user entities.
Processed 0 taxonomy_term entities.
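The symptom above (counts lagging one step behind, plus a stray "Processed 0" per entity type) typically comes from emitting the message before incrementing the counter, or from resetting the counter per request. A hedged sketch of the usual Batch API pattern, keeping the cumulative count in `$context['sandbox']` (helper names hypothetical; SPARQL_ENDPOINT_BUILD_AT_ONCE is the constant mentioned above):

```php
<?php

/**
 * Sketch: a batch operation that tracks cumulative progress correctly.
 */
function sparql_endpoint_index_entities($entity_type, &$context) {
  // The sandbox persists across the repeated calls for this operation.
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
  }

  $query = new EntityFieldQuery();
  $query->entityCondition('entity_type', $entity_type)
    ->range($context['sandbox']['progress'], SPARQL_ENDPOINT_BUILD_AT_ONCE);
  $result = $query->execute();
  $ids = !empty($result[$entity_type]) ? array_keys($result[$entity_type]) : array();

  foreach (entity_load($entity_type, $ids) as $entity) {
    sparql_endpoint_index_entity($entity, $entity_type); // hypothetical helper
  }

  // Increment first, then report, so the message reflects work done.
  $context['sandbox']['progress'] += count($ids);
  $context['message'] = t('Processed @count @type entities.', array(
    '@count' => $context['sandbox']['progress'],
    '@type' => $entity_type,
  ));

  // A short page means this entity type is finished.
  $context['finished'] = (count($ids) < SPARQL_ENDPOINT_BUILD_AT_ONCE) ? 1 : 0;
}
```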

scor’s picture

New file attached (4.13 KB)

add support for all entity types.

scor’s picture

New file attached (4.26 KB)

add minimal support for setting which entity types need to be indexed via an array in a Drupal variable.
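Presumably along these lines when building the batch operations (defaults and variable usage here are illustrative; only the variable name `sparql_endpoint_entity_types` is from the patch):

```php
<?php

// Sketch: read the set of entity types to index from a Drupal variable.
// The default list is an assumption for illustration.
$types = variable_get('sparql_endpoint_entity_types',
  array('node', 'user', 'taxonomy_term'));

$operations = array();
foreach ($types as $entity_type) {
  $operations[] = array('sparql_endpoint_index_entities', array($entity_type));
}
```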

scor’s picture

New file attached (7.57 KB)

This patch updates the various hook_entity_insert/update/delete() implementations to take the sparql_endpoint_entity_types variable into account.
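The check shared by those hooks could be factored out roughly like this (a sketch; the helper name and default are mine):

```php
<?php

/**
 * Sketch: shared guard so the insert/update/delete hooks only act on
 * entity types selected for indexing. Helper name is hypothetical.
 */
function sparql_endpoint_entity_type_is_indexed($type) {
  $types = variable_get('sparql_endpoint_entity_types', array('node'));
  return in_array($type, $types);
}
```

Each hook would then bail out early with `if (!sparql_endpoint_entity_type_is_indexed($type)) { return; }` before touching the triple store.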