IR / Non-IR for RDF output (RDFa and/or restws) [#1293926]

Consider this page: http://gallows.inf.ed.ac.uk/dataset/medline

This page is about a dataset. The content-types involved were made entirely with the web interface by pointyclicking and assigning the correct RDF types and predicates and suchlike. It worked surprisingly well. With a bit more work one could imagine creating a "native RDF / DCat / voiD" data catalogue in such a way.

However, if you look at the generated RDF with either of these commands:

    rapper -i rdfa -o turtle http://gallows.inf.ed.ac.uk/dataset/medline
    rapper -i rdfxml -o turtle http://gallows.inf.ed.ac.uk/dataset/medline

there are a few things to note.

First there is a classic range-14 problem, where the web page about the dataset and the dataset itself have the same URI.

Second, the information given by restws and that in the rdfa are different. Because of the way the web page is laid out, the rdfa contains information about more nodes. This is not a big deal; in fact it might be better to make the web page with Views anyway and not bother with the rdfa. More problematic, restws doesn't seem to honour the instructions to not put creation, etc dates on the resource. Because the timestamps pertain to the document about the dataset and not the dataset itself, the range-14 problem actually bites hard here and causes incorrect information to be emitted.
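One way to picture a fix is to give the dataset its own URI and keep the document-level properties on the page URI. The sketch below is only an illustration in plain Python, not how restws or the RDF module actually behave. The `#dataset` fragment, the `DOCUMENT_PREDICATES` set, the `split_ir_nir` helper and the sample values are all assumptions made for the example.

```python
# Predicates assumed (for this sketch) to describe the web page, not the dataset.
DOCUMENT_PREDICATES = {"dc:created", "dc:modified", "dc:date"}


def split_ir_nir(triples, page_uri, fragment="#dataset"):
    """Move non-document triples from the page URI to a separate URI for the thing."""
    thing_uri = page_uri + fragment
    result = []
    for subject, predicate, obj in triples:
        if subject == page_uri and predicate not in DOCUMENT_PREDICATES:
            subject = thing_uri
        result.append((subject, predicate, obj))
    # Link the document to the thing it describes.
    result.append((page_uri, "foaf:primaryTopic", thing_uri))
    return result


page = "http://gallows.inf.ed.ac.uk/dataset/medline"
triples = [
    (page, "rdf:type", "dcat:Dataset"),
    (page, "dc:title", "example title"),           # made-up value
    (page, "dc:created", "2011-09-28T19:00:00Z"),  # made-up value
]
for triple in split_ir_nir(triples, page):
    print(triple)
```

With the split in place, the creation date stays attached to the page while the type and title describe the dataset, so the timestamps can no longer be misread as statements about the dataset.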

Comments

Comment #1

wwaites commented 28 September 2011 at 19:29

A further note, the extraneous dates in the restws output do not appear in the SPARQL endpoint. Whatever might be done to correct the restws output should also make sure that the corresponding information is stored in the SPARQL endpoint so we have the same information everywhere.
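To see exactly where the sources disagree, a line-level set comparison is enough as a first pass. The sketch below assumes each source has been dumped to N-Triples (for example with `rapper -o ntriples`) and read into a string. `parse_ntriples` and `diff_ntriples` are hypothetical helpers, the sample triples are made up, and the comparison is naive about blank nodes, whose labels can differ between runs.

```python
def parse_ntriples(text):
    """Return the set of non-empty, non-comment lines with whitespace normalised."""
    lines = set()
    for line in text.splitlines():
        line = " ".join(line.split())
        if line and not line.startswith("#"):
            lines.add(line)
    return lines


def diff_ntriples(a_text, b_text):
    """Return (lines only in a, lines only in b), each sorted."""
    a, b = parse_ntriples(a_text), parse_ntriples(b_text)
    return sorted(a - b), sorted(b - a)


# Made-up sample data standing in for the restws and SPARQL endpoint dumps.
restws_dump = """
<http://example.org/d> <http://purl.org/dc/terms/title> "x" .
<http://example.org/d> <http://purl.org/dc/terms/created> "2011-09-28" .
"""
sparql_dump = """
<http://example.org/d> <http://purl.org/dc/terms/title> "x" .
"""

only_restws, only_sparql = diff_ntriples(restws_dump, sparql_dump)
print("only in restws:", only_restws)
print("only in SPARQL:", only_sparql)
```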

Comment #2

scor commented 28 September 2011 at 19:48

> First there is a classic range-14 problem, where the web page about the dataset and the dataset itself have the same URI.

yes, that's the default behavior. I guess we could add an option in the content type to separate the Information Resource from the page (like user/uid does, though there is no option for that).

> Because of the way the web page is laid out, the rdfa contains information about more nodes

agree that's needed: #1237296: .rdf support for Views.

> More problematic, restws doesn't seem to honour the instructions to not put creation, etc dates on the resource.

yes... well, after all, the option is "Display author and date information."... so one could argue that this is presentation only, while you could still export it in the RDF, but you don't have to use it if you don't care about it. I understand however that in your case, they might get in the way if they are associated with the resource. So here we could either follow the content type setting "Display author and date information.", or provide a separate option for the RDF serializations. Another workaround is to alter the core RDF mappings and remove the predicates for created and changed, so that they don't end up in the RDF model.
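The last workaround amounts to dropping two entries from the node mapping. In Drupal this would be done in PHP against the RDF mapping API. The Python below only models the idea with a dictionary whose keys and predicates approximate the default node mapping, so treat the structure and the `without_date_predicates` helper as assumptions.

```python
def without_date_predicates(mapping, fields=("created", "changed")):
    """Return a copy of the mapping with the given field entries removed."""
    return {key: value for key, value in mapping.items() if key not in fields}


# Approximation of a default node mapping, for illustration only.
node_mapping = {
    "rdftype": ["sioc:Item", "foaf:Document"],
    "title": {"predicates": ["dc:title"]},
    "created": {"predicates": ["dc:date", "dc:created"]},
    "changed": {"predicates": ["dc:modified"]},
}

trimmed = without_date_predicates(node_mapping)
print(sorted(trimmed))  # ['rdftype', 'title']
```

Whether a removal like this sticks in practice depends on how default mappings are merged back in; see #1228872, mentioned in comment #3.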

> A further note, the extraneous dates in the restws output do not appear in the SPARQL endpoint. Whatever might be done to correct the restws output should also make sure that the corresponding information is stored in the SPARQL endpoint so we have the same information everywhere.

hum, that should not happen... I wonder if you are experiencing #1237090: SPARQL Endpoint index not immediately sync'ed on node update. you might want to wait and try with the upcoming alpha4 version of the modules...

thanks for your feedback William!

Comment #3

Anonymous (not verified) commented 28 September 2011 at 19:59

range-14

Regarding range-14: please review #1004338: Add support for information resources in RDF mappings API. In it, I explain that we can't support Non-IR uris in contrib fields.

Having spent time thinking about issues of distributed, volunteer based development, I just don't think that supporting range-14 is possible... it's not a technical issue, it's a social one. We would have to train all of the field formatter developers to think in IR/Non-IR terms and then they would have to create user interfaces for the end user to be able to think in IR/Non-IR terms. I talk a little bit about this (or rather, a related issue) on my blog.

It's just too large a social undertaking, especially with the extremely limited resources we have... there unfortunately aren't too many people who both understand SemWeb technologies and actually enjoy getting their hands dirty hacking. In the end, even some of the most visible supporters of range-14 have been critical of it lately, so I'm not sure what traction it will get with the mainstream (which is where we all hope these technologies will go).

If you need to have predicates about a document and predicates about a related resource, you can use Field Collection to create your Non-IR. It will have a distinct URI from the IR. Then you have to make sure that you don't have any document predicates on the Field Collection.

dc:created

I haven't looked into the progress on RestWS, but AFAIK, the only reason why it would output a creation date would be that there is an RDF mapping for the node that maps the creation date to dc:created. This is part of the default mapping for nodes. You have two options:

1. Use Entity API to create a non-node entity for datasets. Then you won't have the default node mapping.
2. Override the predicate mapping for date in the node mapping. Unfortunately, due to #1228872: RDF default mappings override empty values, you can't just leave it empty.

Comment #4

wwaites commented 28 September 2011 at 20:07

> yes... well, after all, the option is "Display author and date information."... so one could argue that this is presentation only, while you could still export it in the RDF, but you don't have to use it if you don't care about it. I understand however that in your case, they might get in the way if they are associated with the resource. So here we could either follow the content type setting "Display author and date information.", or provide a separate option for the RDF serializations. Another workaround is to alter the core RDF mappings and remove the predicates for created and changed, so that they don't end up in the RDF model.

I guess all of this is moot if the range-14 business is sorted. Because these dates relate to the node/document, it doesn't hurt to keep them around as long as they don't risk being confused with statements about the NIR.

That said, I could see the benefit of having some sort of option to turn these dates off completely in case some more sophisticated kind of provenance metadata is desired. For example, you could expose some linkage back to previous versions, with information about who made what change when. Maybe a bit of an "advanced" usage, but this could be important in some environments (e.g. government data catalogues).
