When exporting a vocabulary with GUIDs set, the RDF export feature produces nodes which contain both "rdf:ID" and "rdf:about" attributes. This results in invalid SKOS. I have tested this in the W3C validator, the Pool Party validator, and with a failed import into Protege.

The attached patch ensures that, when present, the GUIDs (or machine names) are set as the "rdf:about" attribute and the "rdf:ID" attribute is ignored.

It does raise an issue about URIs as GUIDs for Vocabularies (not possible currently), which I hope to address in a separate patch.

Comment | File | Size | Author
#1 | 1393316-guids.patch | 4.9 KB | xtfer

Comments

xtfer’s picture

Status: Active » Needs review

Patch attached...

dman’s picture

Yeah, I found this issue also months ago during a bout of extending/rewriting the dev code to use pure SKOS (D7-read/write-compatible) instead of the earlier W3C-thesaurus RDF examples the earlier code was modelled on.
Anyway, I got caught up in a few days of research into the state of play with other example/samples and how they managed (or failed to manage) the distinction between XML-base + IDs vs full-GUID-about attributes. And then other weirdo shortcuts and inconsistencies. And then tried to think about how *sometimes* my GUIDs are local and sometimes they are imported, but most of the ones I am importing start off being local to somewhere, and maybe I should stake a namespace like PURL and .. ah, I can't remember much after that, but that's the basic reason I never resolved it!

Yeah, and then I had to look very hard at D7 vocabs and GUIDs and the GUID module (I prefer machine_name) and ... there too, I didn't see a clear path in. I've been making it up a little as I go along, and only nominally writing this extra (probably useful, but not definitive) info when I output. In my experiments I've found readers that accept rdf:about, and others that only assume xml:id, and only more recently has my habit of writing both started hitting validators that actually care.
If I were writing a validator, I would only complain if both were given and differed. As it is, I was (attempting to) say the same thing twice in two equally valid ways. That is, if the xml:base directive is still working in the current code. I can't recall.

I'm on holiday, and not in a position to test ...
But I see where you are going with this patch, it's pretty readable.
I thought that any call to taxonomy_xml_get_term_guid() would return a valid value, even if it had to make it up on the spot, so I would have expected

+    // If a GUID is available, it should be used instead of the ID to avoid validation errors
+    $guid = taxonomy_xml_get_term_guid($term);    
+    if (isset($guid) && !empty($guid)) {

to never trigger.
I may be thinking of a different version of the code though.

I think I'd still like to get my original big conceptual problem sorted out though - should we use IDs *or* rdf:about - as a straight rule, rather than guess on a term-by-term basis.
I really would prefer IDs (plus an xml:base) because that's much more succinct, easier to read and transfer, and portable. Also MUCH tidier for internal references. But xml:base support was flaky in the various parsers I was using (several years ago) and rdf:about has the advantage of being really portable even when things are broken up into small fragments.

This calls for a hand-wavy discussion over some strong drink.
OR a pointer at several consistent examples that say "this is the best way to do this" from several different respectable sources - that actually agree.

xtfer’s picture

When you get back from your holiday then...

The standard (as I understand it), the examples I've read about (mostly on listservers), and the validators all seem to agree that, while the two are largely interchangeable, you can't have both an rdf:ID and an rdf:about attribute in a skos: statement. Either way, the W3C Validator rejects it when it occurs, so it is technically invalid. I believe this comes back to the problem of constructing the graph. Even if an ID and an about agree at both ends, there is no way for a consumer to know that they are the same thing, so it must assume they are different.
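As a rough way to check an export for the problem described above, the sketch below scans an RDF/XML string for elements that carry both attributes. The function name `find_double_identified_nodes()` and the sample document are made up for illustration and are not part of taxonomy_xml; only PHP's bundled DOM extension is assumed.

```php
<?php
// Namespace URI of the RDF syntax attributes (rdf:ID, rdf:about).
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

/**
 * Lists elements that carry both rdf:ID and rdf:about.
 *
 * Hypothetical helper for illustration; not part of taxonomy_xml.
 */
function find_double_identified_nodes(string $rdf_xml): array {
  $doc = new DOMDocument();
  $doc->loadXML($rdf_xml);
  $xpath = new DOMXPath($doc);
  $xpath->registerNamespace('rdf', RDF_NS);

  $hits = [];
  // Select any element that has both attributes at once.
  foreach ($xpath->query('//*[@rdf:ID and @rdf:about]') as $node) {
    $id = $node->getAttributeNS(RDF_NS, 'ID');
    $hits[] = $node->nodeName . ' (rdf:ID="' . $id . '")';
  }
  return $hits;
}

// Made-up sample: the first concept has both attributes, the second only one.
$sample = <<<XML
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xml:base="http://example.com/vocab">
  <skos:Concept rdf:ID="term-1" rdf:about="http://example.com/vocab#term-1"/>
  <skos:Concept rdf:about="http://example.com/vocab#term-2"/>
</rdf:RDF>
XML;

print_r(find_double_identified_nodes($sample));
```

Running this against a real export before and after applying the patch should show whether any doubly identified nodes remain.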

The choice comes down to practicalities, then...

I think I'd still like to get my original big conceptual problem sorted out though - should we use IDs *or* rdf:about - as a straight rule, rather than guess on a term-by-term basis.

If you had to pick one, I would assume it must be rdf:about. While the two are interchangeable, an rdf:ID is appended to the in-scope base URI (plus '#'), so it will only work when ALL your terms are in that scope. As the output defines a skos:ConceptScheme, its individual nodes could have ANY URI.
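To make the scoping point concrete, here is a minimal sketch of how the two attributes resolve to a full URI. Both function names and the example.com / example.org URIs are invented for illustration, and the rdf:about helper deliberately handles only the two simple cases discussed here (a '#fragment' or an already absolute URI), not general relative-URI resolution.

```php
<?php
/**
 * Resolves an rdf:ID value: the base URI, then '#', then the ID.
 *
 * Hypothetical helper for illustration only.
 */
function resolve_rdf_id(string $base, string $id): string {
  // Drop any fragment already on the base before appending the new one.
  $base = explode('#', $base, 2)[0];
  return $base . '#' . $id;
}

/**
 * Resolves an rdf:about value in the two simple cases discussed here.
 *
 * Hypothetical helper for illustration only.
 */
function resolve_rdf_about(string $base, string $about): string {
  if (strpos($about, '#') === 0) {
    // A '#fragment' reference is resolved against the base, like rdf:ID.
    return explode('#', $base, 2)[0] . $about;
  }
  // Anything else is assumed to be an absolute URI and is returned as-is.
  return $about;
}

$base = 'http://example.com/vocab';

// A local term can be written either way and names the same resource.
echo resolve_rdf_id($base, 'term-1'), "\n";
echo resolve_rdf_about($base, '#term-1'), "\n";

// A term that lives outside the base can only be named with rdf:about.
echo resolve_rdf_about($base, 'http://other.example.org/concepts/42'), "\n";
```

The first two calls yield the same URI, which is why the attributes look interchangeable for local terms; only rdf:about can name the third.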

On top of that, the skosCollection URI cannot currently be defined adequately, and because we are exporting a file, the base URI would have to be set explicitly, either from that skosCollection or from some other point.

An additional problem, from the module's point of view, is that the GUID concept allows any string, including URIs, so if I use an actual GUID in that field, it breaks anyway.

I really would prefer IDs (plus an xml:base) because that's much more succinct, easier to read and transfer, and portable. Also MUCH tidier for internal references.

rdf:about can also be used with a base (see ref 1) if its value starts with '#', but it can equally hold a full URI, so it is no less portable or tidy.

I thought that any call to taxonomy_xml_get_term_guid() would return a valid value, even if it had to make it up on the spot, so I would have expected

It seems to be returning a value only when one is set and is not causing problems.

Another option for you to consider...

I have been considering a way to provide configurable persistent identifiers for Drupal objects, through an API module providing ctools plugin types for different identifiers, that could replace this functionality wherever a module has implemented it in a custom way (with the exception of UUIDs, which it would defer to when required). This would allow you to pull POI for a term or vocabulary from the module, then based on its type (e.g. URI, UUID or machine_name) determine how to format the ID or attribute.
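A very rough sketch of the "format by type" step might look like the following. No such API module exists yet, so `classify_identifier()` and `format_rdf_about()` are purely hypothetical names. The three types and the formatting choices (a '#' prefix for machine names, the standard urn:uuid: scheme for UUIDs) are assumptions for the sake of the example.

```php
<?php
/**
 * Guesses the type of an identifier string: 'uuid', 'uri' or 'machine_name'.
 *
 * Hypothetical helper; the proposed API module does not exist yet.
 */
function classify_identifier(string $value): string {
  // Canonical 8-4-4-4-12 hexadecimal UUID layout.
  if (preg_match('/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i', $value)) {
    return 'uuid';
  }
  // Anything that starts with a URI scheme, such as "http:" or "urn:".
  if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $value)) {
    return 'uri';
  }
  return 'machine_name';
}

/**
 * Formats an identifier as an rdf:about value according to its type.
 *
 * Hypothetical helper; the formatting rules are assumptions.
 */
function format_rdf_about(string $value): string {
  switch (classify_identifier($value)) {
    case 'uri':
      // Already a full URI: use it unchanged.
      return $value;

    case 'uuid':
      // Wrap a bare UUID in the urn:uuid: scheme.
      return 'urn:uuid:' . strtolower($value);

    default:
      // Machine names become fragments relative to the document base.
      return '#' . $value;
  }
}

echo format_rdf_about('tags'), "\n";
echo format_rdf_about('http://example.com/term/1'), "\n";
echo format_rdf_about('123E4567-E89B-12D3-A456-426614174000'), "\n";
```

In this sketch every identifier ends up in rdf:about, which matches the direction of the patch; a real implementation would let each plugin decide its own formatting.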

Refs

  1. http://www.ibm.com/developerworks/xml/library/x-tiprdfai/index.html
  2. http://answers.semanticweb.com/questions/2189/should-i-use-rdfabout-or-r...