RDF in Drupal 7 code sprint

scor - April 24, 2009 - 11:30

There are only a few months left before the code freeze on September 1st. Now that Fields API has settled in core, it's time to extend it with some RDF semantics. DERI Galway is hosting an RDF in Drupal code sprint from May 11th until May 14th.

This sprint builds on Dries' ideas expressed in his recent posts "Drupal, the semantic web and search" and "RDFa and Drupal". With RDF in the core of Drupal and RDFa output by default, tens of thousands of websites will all of a sudden start publishing their data as RDF.

So far 8 people have signed up. How about you?

Some others are willing to come but cannot afford the trip until some funding is secured. To help us fund the sprint and bring more Drupal rockstars on board, please consider making a donation using the ChipIn widget on this page. The money will be used to cover flight, food and hotel costs for the sprinters. All sprinters are generously donating their time to make this happen. It would also be great to fly in a few additional people with extensive testing and Fields experience. Any excess money will be used to add more people, or will be donated to the Drupal Association.

Goals of the code sprint

The RDF code sprint will focus on Drupal core and aim at integrating RDF semantics in it.

  1. Extend Fields API to integrate RDF mappings for each field instance. The semantics of a field can differ from one bundle to another. This can be stored either in the existing settings property or by adding an rdf_mappings property to the Field Instance objects.
  2. Modify the Fields UI (contrib) to allow RDF mappings editing.
  3. Define the appropriate mappings for the core modules, based on the RDF core mapping proposal.
  4. Patch core modules with the mappings defined above.
  5. Export these mappings in RDFa via the theme layer and keep it as generic as possible in order to ease the work of the themers.
  6. Write tests for RDF in core.
  7. Identify other non-fieldable entities in core which could benefit from being RDF-ized, and see how to annotate them. Comment is one example. Terms also, though they might become fieldable.
  8. RSS 1 (RDF) in core. Arto volunteered to get started with that.
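
As a rough illustration of goals 1 and 5, here is a minimal sketch in Python (not Drupal code) of per-bundle mappings being rendered as RDFa. Only the rdf_mappings name comes from the list above; the mapping layout, the bundle and field names, and the render_field helper are hypothetical.

```python
from html import escape

# Hypothetical mappings keyed by (bundle, field name). The same field can
# carry different semantics in different bundles (goal 1).
RDF_MAPPINGS = {
    ("article", "title"): {"property": "dc:title"},
    ("person", "title"): {"property": "foaf:name"},
}


def render_field(bundle, field_name, value):
    """Wrap a field value in a span, adding an RDFa property attribute if mapped."""
    mapping = RDF_MAPPINGS.get((bundle, field_name))
    text = escape(value)
    if mapping is None:
        # Unmapped fields are rendered without any RDFa annotation.
        return f"<span>{text}</span>"
    prop = escape(mapping["property"], quote=True)
    return f'<span property="{prop}">{text}</span>'


print(render_field("article", "title", "RDF in Drupal 7 code sprint"))
print(render_field("person", "title", "Frank Jones"))
```

Keeping the mapping lookup in one generic helper is the point of goal 5: themers would not have to write the RDFa attributes by hand.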

See a list of current open RDF issues in RDF issues in core.
See also the RDF code sprint wiki page where we will keep an up to date list of goals.

What is RDF?

AlanT - April 26, 2009 - 14:33

I'm probably just showing how little I know, but what is RDF? And what does having it in core mean to me as a user?

Intro to the Semantic Web

scor - April 26, 2009 - 20:15

RDF is a W3C standard to add semantics to the data of your site and enable interoperability on the Web. Think of it as RSS on steroids. Watch this great video: Intro to the Semantic Web.

Some of the ways it helps end-users...

webchick - April 26, 2009 - 20:31

1. Better SEO; RDF allows Google and other search engines to have context about your site's content. They'll understand that "Frank Jones" is the name of a person, not just some random text. They'll understand that a random node on your site is a review for a book with a rating of 2/5 stars. Think search engines on steroids.

2. Better opportunities for interoperability. Data on your site can be "mashed up" with data from other people's sites in all sorts of interesting ways.

3. Once you explain what the content of your pages is, it becomes really easy to pull in related content from elsewhere on your site (or elsewhere on the web) to help your visitors find the things they're looking for quickly and easily.

(scor, feel free to correct me if I'm wrong in any of this; this is just what I learned from researching OpenCalais the other week.)

right!

scor - April 26, 2009 - 20:44

You're perfectly right, webchick!

Thank you, Webchick. I have

AlanT - April 27, 2009 - 12:39

Thank you, Webchick.

I have to say that this sounds like an interesting theory, and hopefully it will turn out to have practical uses as well. Has anyone split-tested this to see if it really does produce better SEO? Are there any examples of live sites using it to improve site usability?

Well note that this exists right now...

webchick - April 27, 2009 - 17:06

It's not like this is science-fiction stuff that "could" someday appear. :) Google, for example, is parsing this stuff as we speak, and directing priority traffic to sites that implement RDF and Microformats.

For example, try searching Google for "name of movie movie" and you'll see something like this:

[Image: Ratings]

That aggregated rating is parsed from sites that implement microformats to explain that the "5" that the search engine finds in that page is actually a "5 stars out of 5" rating on a movie review. If you click into that link, you'll see a variety of sites. One of them that usually comes up is http://www.commonsensemedia.org/ which also happens to be a Drupal site that implements the hreview microformat.

Drupal makes a particularly interesting/powerful platform to put RDF into because there is literally no limit to the type of content Drupal can manage, so we have a real opportunity to be leaders in this area, and move this power into the hands of people who are not comfortable hand-editing HTML.

Yahoo! SearchMonkey

scor - April 27, 2009 - 19:22

AlanT, make sure you also watch the video about SearchMonkey. It features enhanced results that you can already see on Yahoo! search, for example when searching for "art of pizza chicago".

Does this mean that Drupal will be documented?

pschopf - April 27, 2009 - 23:42

It looks like if you have to ask, you don't belong. Continuing with the strong Drupal tradition of writing new code without backward compatibility, we can now release Drupal 7 years ahead of documentation for Drupal 6. In fact, you can forget about 6 documentation entirely - read the code, that should be enough.

Excellent!

webchick - April 28, 2009 - 00:05

I'm always happy to meet someone passionate about seeing Drupal's documentation improved. :)

It's important to note that anyone can click the "edit" tab on any handbook page and fix it if they notice something inaccurate. Or, if you come across something that's not documented yet, write down as much as you've managed to figure out, and then file an issue in the queue, either against a particular module if it's for that, or against the "Documentation" project if it's for something more general such as a page in the handbook. The documentation team is a really great bunch of volunteers who love to help those who want to help Drupal, and would be more than happy to proof-read your work, collaborate with you on something, or direct you to the proper channels. http://drupal.org/contribute/documentation has more information on getting involved.

Looking forward to your contributions! :D

TOTALLY off-topic for this thread, but...

dman - April 28, 2009 - 10:49

I was bored, so did some quick calculations.

A quick and very unscientific grep of the drupal core modules says that from:
24699 lines (just the core /modules directory, not API, excluding the html templates)

22477 are /not/ blank.
4721 /look/ like inline function docs ( with *)- the core phpdoc documentation as seen on api.drupal.org
1586 are inline explanatory docs ( with //) - available on api.d.o and useful to any developer.

Taking a look at the code,
2601 lines are calls to t() - which contain more text than code and are just ui messages, it's not like there are per-line docs needed there.
2931 lines contain nothing but "}" on its own - not exactly confusing to anyone reading docs.
(2601 + 2931) = 5532 non-documentable lines

sooo .. the way I look at it, there are
(4721+1586)= 6307 lines of doc to (22477 -6307 -5532)= 10638 lines of code.

a little over 1 line of documentation per 1.7 lines of actual code that may need explaining, +/- 5%.

So that's (a little) like the developers spending 22 minutes of every hour explaining what they are doing in the remaining 38 minutes.

Line-count-based metrics are an extremely flawed way of measuring code quality, BUT I still don't understand why these results (roughly 3 lines of docs for every 5 lines of code) could be held up to call Drupal 6 'undocumented'.
Do we need the "talking about things" to outweigh the "actually doing things" portion of the code before it can be called "adequately documented"?

FTR, to expose how bad my maths/cli skills are:

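# Concatenate every core module file into one text file.
cat /var/www/drupal6/modules/*/*.module > drupal-cat.txt
# Read from stdin so wc prints only the number, not the filename.
export total_lines=`wc -l < drupal-cat.txt`
export line_count=`grep -cve '^\s*$' drupal-cat.txt`
export phpdoc_count=`grep -c '\*' drupal-cat.txt`
export inlinedoc_count=`grep -c '//' drupal-cat.txt`
# Lines calling t(). The pasted pattern duplicated the lone-brace one, so this pattern is a guess.
export translate_func_count=`grep -c '\bt(' drupal-cat.txt`
export lone_brace_count=`grep -ce '^ *}\s*$' drupal-cat.txt`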

.. there are many tweaks that could be made to this algorithm, have fun.
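
For anyone who wants to re-check the arithmetic, here is a small Python sketch that uses only the counts quoted in this post. It does not re-run the greps, so it inherits whatever error those counts carry.

```python
# Counts quoted in the post above.
non_blank = 22477
phpdoc = 4721
inline_doc = 1586
t_calls = 2601
lone_braces = 2931

doc_lines = phpdoc + inline_doc                          # 6307
non_documentable = t_calls + lone_braces                 # 5532
code_lines = non_blank - doc_lines - non_documentable    # 10638

# Lines of code per line of documentation.
ratio = code_lines / doc_lines
# Share of an hour spent "explaining", if time were split like the lines are.
minutes_on_docs = 60 * doc_lines / (doc_lines + code_lines)

print(doc_lines, non_documentable, code_lines)
print(round(ratio, 2), round(minutes_on_docs, 1))
```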

Of course, I may have totally missed the point as I'm only talking about docs intended for people who read documentation. I'm not sure what the wordcount on the Drupal handbooks vs contrib code would come in at.

.dan.

I'm afraid RDF is an abstract

momendo - May 2, 2009 - 04:23

I'm afraid RDF is an abstract and hard to understand topic. Even amongst seasoned web developers, you'll be hard pressed to find much excitement in RDF. I think that enthusiasm is reflected in the ChipIn widget. Can you provide more information about how that would translate into real world applications and use cases?

If it's worth anything to

Steve Dondley - May 2, 2009 - 14:31

If it's worth anything to you, Dries has given a presentation about RDF and explained his reasoning why he supports it. Tim Berners-Lee supports it. Here's what he said:

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize.

RDF _is_ cool, really!

graybeal - May 2, 2009 - 16:28

I totally understand this reaction. As technologies go, RDF is dry as dust. It's one step removed from all the cool stuff that it enables, so people don't get excited about it.

As someone who is trying to marry science and semantics -- that is, change current scientific methods so that automated processes can _understand_ them, in a computational sense at least -- I am wildly enthusiastic about this work. I am convinced it will be the most important contribution of Drupal to its users, the users of Drupal sites, and to information technology in general, in this decade at least.

Many of the other links provide the additional information that you are asking for, and I probably can't improve upon them in a post. But I will give a use case from our community (since I wasn't sure where to leave my use case anyway). Those of you who like text more than video may find it helpful. This use case relates the more semantic-oriented technologies of this change to the practical Drupal web site technologies.

Right now we [1] collect references to other documents on the web about science data management. We are starting to categorize and rate them, using custom fields we created for each kind of reference. Anyone who wants to find and use these ratings has to go to the Drupal site, look up each page, and grab that data. The ratings themselves are terms that come from vocabularies we maintain on another 'vocabulary repository' system [2]. We will probably add the ratings data to a custom-built table, but that will cost development time and still require people to navigate to the table and view it, or copy/paste it, to use it for their own purposes. They can only use what we can provide by developing custom software. They can't automate it because if we change our format, or the name we used for the title of a category, their automated scraping of our page will break. They can't tell what the rating words mean unless we also specifically add taxonomies to Drupal that match our rating taxonomies on our own vocabulary system (or build additional tools to do that automatically, which we might have to do). They can't relate the ratings on our site easily to the ratings on another site, or know that our John Doe that rated this Content Standard is the same John Doe that rated it differently on another site.

In the brave new world of RDF in Drupal, here's how I hope it will work:

  1. We create the page for rating, say, Content Standards. Our page will define for every category (Drupal field): a title; the vocabulary (from our vocabulary repository, but ideally this information is automatically aligned with Drupal taxonomies, ahem :-)) to use in filling out the field; and the RDF-encoded concept that corresponds to the title. In practice, this last means that we use terms from a vocabulary of concepts, like 'overall rating' and 'version reviewed' that we and Google and everyone else understand. Creating this page was easy and straightforward because it's all integrated in Drupal core.
  2. Members of our team make a page by filling out every field of information for a given Content Standard. They are prompted for the allowed terms to fill out each field. If they don't know what the field title means, they click on a help link that references the corresponding RDF-encoded concept ("version reviewed: the version string assigned by the information provider to the particular release of the information", and possibly much more). Note: Much of this is possible with Drupal taxonomies today, if you use them, but they are embedded inside your Drupal server, not widely accessible and exchangeable.
  3. In each presented Content Standard page, Drupal automagically provides the RDF metadata associated with that information.
  4. Anyone who wants to write an application that makes use of our data -- say, tracks the change in overall rating against the different versions reviewed -- can use the RDF information embedded in our web page to do so. Even if we change the format of the web page, and the title of the fields that the user sees, their application will still work. *And*, it will be able to explain to the user exactly what all these ratings mean that we're using, because we have defined those terms, and Drupal and the application understand how to use RDF to find the definitions. (Note to semantic web folks: Glossing over some URI dereferencing issues there.)
  5. Someone who isn't on our development team may now be inspired to write an application that lets a scientist find the perfect content standard for their needs, by using our ratings and information to automatically select content standards based on the scientist's input. This multiplies the value of our work, potentially a lot, without requiring any additional labor or agreement on our part. (Because the interface to the data is exposed automatically through RDF.)
  6. All the big players (Google, Yahoo, countless semantic tool developers) that write applications that crawl the web looking for information they understand ("look! there's an 'overall rating' for a 'web resource' that goes by the 'resource name' of -FGDC Content Standard- and it has an 'update date' that's less than a month old!") will now understand much, much more about what's on our web site, and can represent that information in their own contexts.
  7. If we ever decide to add a concept to one of our vocabularies -- say we add 'paradigm-changing' to our overall ratings vocabulary -- this 5-minute change automatically ripples through ALL these applications. Everyone can use it on our Drupal site to rate a Content Standard, every application will understand it is a term in the 'overall rating' scheme defined by RatingsRUs, and everyone who sees 'paradigm-changing' as a rating for a Content Standard, even if they have no idea where our site is or who created that Content Standard page, can immediately find out what that term means.
  8. Because ALL these terms and vocabularies are controlled and connected in well defined ways (thanks to RDF), we can understand that our 'paradigm-changing' rating actually means exactly the same thing as Google's '*****' and Consumer Reports' '10'. Similarly for names (thanks to an RDF vocabulary called Friend-of-a-Friend, or FOAF).
  9. Someday, automated systems can use all this knowledge to perform human-like reasoning automatically across all the data, concluding for example that the MMI rating system gives web content lower but more consistent ratings, while the Consumer Reports system gives higher ratings with more variation.
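
To make step 4 a little more concrete, here is a minimal sketch in Python of a consumer that reads values by their RDFa property rather than by page layout. It is an illustration only: the ex:overallRating property name and both page snippets are made up, and the parser assumes every element is explicitly closed (no void tags such as br). A real application would use a proper RDFa library.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect the text of every element that carries an RDFa property attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}   # property name -> text content
        self._stack = []   # one entry per open element: its property name or None

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("property"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open element, if that element has a property.
        if self._stack and self._stack[-1]:
            prop = self._stack[-1]
            self.values[prop] = self.values.get(prop, "") + data


def extract(html):
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.values


# The same (hypothetical) rating before and after a redesign of the page.
page_v1 = '<div><b>Overall rating:</b> <span property="ex:overallRating">paradigm-changing</span></div>'
page_v2 = '<table><tr><td>Score</td><td property="ex:overallRating">paradigm-changing</td></tr></table>'

print(extract(page_v1))
print(extract(page_v2))
```

Both versions yield the same property and value, even though the visible label and the markup around it changed, which is exactly what a screen scraper cannot promise.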

    I will donate some of my own money to make this happen, although the benefit will go to my work life. If someone from the project finds this use case interesting they are welcome to contact me about it.

    [1] Marine Metadata Interoperability project, http://marinemetadata.org
    [2] MMI Ontology Registry and Repository, http://mmisw.org/or

Structural hints open interesting doors

yelvington - May 2, 2009 - 18:03

While you're thinking about how RDF might empower what EPIC called "fact-stripping robots," give some attention to YQL Execute as well. The potential interaction of the two is mind-boggling.

RDF is out there right now and being used

PhillG - May 21, 2009 - 13:17

I think some seasoned web developers might not be excited about RDF because they may not all come from a data or information architecture background.

Have a look at http://www.london-gazette.co.uk/

All the Corporate Insolvency notices (and many others) contain large amounts of RDF triples encoded as RDFa. The documents are self-describing: in combination with the ontologies pointed to by the CURIEs, a machine can infer all sorts of information such as company name, number, nature of business, directors, the court hearing date, place, which administrators were appointed, which company they worked for and at which office, and so on.
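
For readers who have not met CURIEs: a CURIE is a compact prefix:reference pair that expands to a full URI through a prefix declaration, which is how a short attribute value can point at an ontology term. Here is a minimal sketch in Python, assuming the well-known FOAF and Dublin Core namespaces as the prefix table. These are not necessarily the vocabularies the Gazette uses, and expand_curie is a hypothetical helper.

```python
# Prefix table for illustration; a real page declares its own prefixes.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a prefix:reference CURIE into a full URI."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + reference


print(expand_curie("foaf:name"))
print(expand_curie("dc:title"))
```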

Phil

RDF Primer

giorgio79 - May 4, 2009 - 04:23

My Drupal sites:
Review Critical
ClipGlobe - World Travel
I created these 100% from concept, to design and build on Drupal.

My thoughts

Dries - May 6, 2009 - 16:13

I will take part as much as

Cloud - May 8, 2009 - 13:04

I will take part as much as possible (a few meetings on Monday but will be around for most of the week).

Memory usage

jaharmi - May 8, 2009 - 17:59

I hope that those developing this can take into account memory footprint. I don’t believe I’m alone, based on the RDF module’s issue queue, in running into memory problems with Drupal 6 + RDF on a shared hosting account. Enabling that module seems to take up another ~4 MB.

If this is going into core, I hope that it won’t have that kind of impact on every single Drupal site upgraded to v7.

I’m not complaining or railing against anyone’s hard work on this effort — heck, I want to be able to run the RDF module now — but I think that memory usage is an important consideration.

--
Jeremy
www.jaharmi.com

reduced memory footprint

scor - May 10, 2009 - 14:30

The RDF API module is not going into core and there won't be similar memory issues in core. We are working to make the RDF in core as lightweight as possible.

RDF in a pharmacist view

pharma - May 12, 2009 - 14:25

What I understand (as a non-technical guy) is that it will help established search engines (Google & Yahoo) display results like the new search engine from ex-Googlers, http://www.cuil.com (pronounced "cool")...

If I am not asking too much, is it possible to display standard search results of websites using Drupal like the Cuil search engine results...

If you check for "Drupal" in Google and Cuil ...you know what I mean

With RDF in Drupal 6 how much more SE traffic do you get?

giorgio79 - May 14, 2009 - 05:23

If you implemented RDF for Drupal 6, would you share your SEO results, such as the % increase in organic search traffic? I am really curious how much I can benefit from its implementation. Some case studies would be great.

I just tried the Calais analyzer, but in some of my posts with 300 words, it could only identify maybe 2 words as the name of the person...


RDF / Microformats will be mainstream soon

rupl - May 14, 2009 - 14:49

Right now I wouldn't expect RDF integration to have a very large impact on traffic to your site. However, just two days ago Google announced support for RDF and Microformats in regular search. Yahoo already has this capability featured in SearchMonkey.

http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snip...
http://developer.yahoo.com/searchmonkey/

So expect this to be a hot topic in the coming months/years. It's awesome that we're identifying this tremendous opportunity and making progress toward supporting it! (I donated, you should too)
