Distributed Drupal Search and Retrieval. Brainstorming, dreaming and rambling.

By Edward C. Zimmermann@drupal.org on 18 May 2005 at 08:24 UTC

Some ideas about turning Drupal into a distributed information system and fading the borders between individual sites (going beyond RSS Syndication or Open Archives).

[Warning: The following links are just experimental and are not intended ever to provide services outside the scope of current development. They are highly unstable. Please also keep in mind that development has only just started.]

RSS Search:
http://drupalix.ibu.de/?q=aaa/id/rss
Sample term: "Film"
Right now we are only collecting from a few Drupal test sites, but we have the capacity (if we wanted to and it made any sense) to include each and every Drupal site on the planet in an interface (one that is a module, so the services are available to nearly any Drupal).

Web Search:
http://drupalix.ibu.de/?q=aaa/ib/dash
Sample term: "Nazi"
(This is intended for D-A-S-H, for which we have many millions of anti-racist pages.)

There are actually a lot of other kinds of searches we could add to Drupal, such as mailing lists etc. Any ideas what makes sense?

We have implemented these features as a module.

For the discussion right now, let's focus on the Drupal-specific case (Web != Drupal).

The idea is to break down the borders between individual Drupals in large communities, to create communities of synthetic Drupal sites (really what Syndication could have been about). Would it make sense (as I'm leaning) to allow for remote rendering of content and perhaps even store-and-forward? This would offer the advantages of distributed designs but also the advantages of centralised services (connectivity, resources, etc.).

Right now in the above RSS search (which is, should one be wondering, a full-blown XML full-text engine that allows one to search structure down to individual siblings etc. and that also has objects like date/time, numbers, geospatial boxes etc.) we have glued a simple protocol into a module that creates virtual nodes as the result of a search (data is NOT sucked into the local database!), and these are then rendered by the Drupal machine. The links go to the "remote" Drupal site, where they are rendered by whatever theme design one happens to get. Especially given the issues of personalized user spaces and interfaces (a significant issue among the disabled), would it not make sense to allow "nodes" to be exportable and rendered at other sites?
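To make the "virtual node" idea concrete, here is a minimal sketch in Python (not the actual module, which is not shown in this thread). It assumes the remote search service answers with plain RSS 2.0; the names `VirtualNode`, `nodes_from_rss` and `render`, and the example URLs, are made up for illustration. The search result is turned into in-memory nodes and handed to a stand-in for the local theme, and nothing is written to a local database.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class VirtualNode:
    """A node that exists only in memory; nothing is stored locally."""
    title: str
    body: str
    source_url: str  # link back to the remote Drupal that owns the content


def nodes_from_rss(rss_xml: str) -> List[VirtualNode]:
    """Turn the <item> elements of an RSS search result into virtual nodes."""
    root = ET.fromstring(rss_xml)
    return [
        VirtualNode(
            title=item.findtext("title", default=""),
            body=item.findtext("description", default=""),
            source_url=item.findtext("link", default=""),
        )
        for item in root.iter("item")
    ]


def render(node: VirtualNode) -> str:
    """Stand-in for the local theme: the local site decides the markup."""
    return '<h2><a href="{0}">{1}</a></h2><p>{2}</p>'.format(
        escape(node.source_url, quote=True), escape(node.title), escape(node.body)
    )


# Hypothetical search result for the sample term "Film".
SAMPLE_RESULT = """<rss version="2.0"><channel><title>Search: Film</title>
<item><title>Film night</title><link>http://site-a.example/node/1</link>
<description>Screening on Friday.</description></item>
<item><title>Film review</title><link>http://site-b.example/node/7</link>
<description>A short review.</description></item>
</channel></rss>"""

for node in nodes_from_rss(SAMPLE_RESULT):
    print(render(node))
```

The point of the sketch is only the division of labour: the remote side supplies structure and content, the local side supplies the look.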

By remote rendering and having "canned searches" to create synthetic Drupals one can create a world of content navigation in Drupal that spans an entire locus of communities, creating small villages and townships in the incorporation.

Comments

This is very much what my

iandickson commented 18 May 2005 at 11:11

This is very much what my previous business was about.

See CommKit and look at the papers.

We put four years into it - but illness by my tech partner has forced closure.

For the record, as a coder and project manager with 30 years' experience, he considered it the most conceptually complex project he'd ever got involved in.

I'd like to see the lessons taken wider, and Drupal would be a good basis on which to work.

The core of any such solution is threefold:

1) Unified Taxonomy structure. Tag meaning to numbers, structure of words determines numbers, and so becomes language independent.

Provide a core of pre built taxonomy (we had 100,000 entries) that most people will adopt.

(Each community only chooses what it needs, and if something is missing, it's considered and added by the Gods - normally in a generalised fashion).

2) Identity is situational, and thus transferrable only where utilised taxonomies are similar.

E.g. a vicar's identity is very different in Vicars Forum to that in UK BDSM, where no one knows he's a vicar, and his parishioners don't know he's there :-)

3) It's more about psychology than code.

If you wanted to really get to grips with this, the best initial option would be to put a small team together, and meet up for a weekend. Experience has taught me that it takes a while for programmers to really understand what they need to achieve, and a face-to-face initial workshop will save a lot of dead ends and wasted hours.

I'm in the UK.

Ian,Sorry to hear about

Kobus commented 18 May 2005 at 11:23

Ian,

Sorry to hear about your tech partner's health, but at least your experience is still with you, and I am sure your input would be valuable to this idea of Edward's.

Meeting face to face would be difficult for me, as I am in South Africa, but I won't be that much involved in the coding, and I believe my involvement can be done electronically, if I am accepted to join this venture at that level.

This is a very exciting new development for me, and I want to help as much as I can!

E.g. a vicar's identity is very different in Vicars Forum to that in UK BDSM, where no one knows he's a vicar, and his parishioners don't know he's there :-)

I find this bit very funny. Very well put!

-- Kobus

Community

Edward C. Zimmermann@drupal.org commented 18 May 2005 at 16:00

"1) Unified Taxonomy structure. Tag meaning to numbers, structure of words determines numbers, and so becomes language independent."

I think we need to distinguish between a unified subject taxonomy and a unified semantic structure.

We have, of course, within the engine a model for attribute unification and mapping, and when used in Z39.50/ISO23950 it does need an Object ID (OID) for an attribute set (these are really just numbers that map to objects, structures etc.), as that's part of the design of the protocol, but it is not really needed in this application.

As an aside:
[One could have exported meta-records to a controlled vocabulary or, if you want numbers, to Z-Tokens ( http://www.gils.net/z-tokens.rdfs ) or BSR (ISO Basic Semantics Registry, see http://www.ubsr.org/bsreCadre.htm ) or a map into UDDI (see http://www.uddi.org/specification.html ). This has over the years played a significant role in our work in information locators (see GILS http://www.gils.net and also some of the ideas explored in the Advanced Search Facility http://asf.gils.net ). ]

I think this is all irrelevant, as we have a (reasonably) well defined NODE structure in Drupal (which, at least in the current development, needed to be slightly extended) and the aim is not to integrate a universe of heterogeneous community systems but Drupals. We have RSS/RDF.

Why no need for a taxonomy? Because how each Drupal organizes its content is up to each. The magic is Search-and-retrieval, language. A controlled vocabulary for subject classification and some other bits are nice and very useful, but we can live without them.

The path from data to rendered page demands some of this, but it need never go beyond the private sphere, since the creation of a node from content is in my model always handled by the owner of the data, viz. the native Drupal. It is just a node, exported as structure and content and rendered by the local theme variants.

"Provide a core of pre built taxonomy (we had 100,000 entries) that most people will adopt."

"2) Identity is situational, and thus transferrable only where utilised taxonomies are similar.

E.g. a vicar's identity is very different in Vicars Forum to that in UK BDSM, where no one knows he's a vicar, and his parishioners don't know he's there :-)"

I think the "roles" don't matter, since the views to roles are always contextual. In an engineering department a specific person is well known and one can distinguish between one person and the next, but from another group they are all just "engineers". The refinement of identity, using your example of someone named "Smith", a vicar at the local Anglican church, is different within a church's Drupal, within a Vicar's forum, within a religious community and within an interest community. It's like birds in a swarm or ants in a hill.

The associations with identity within this model are derived from the name of the site from which the content originates more than from the identity of the person (who is generally unknown outside his/her community). We would thus announce the source site of the content but not the name of the person owning the content within the "front page"; it's also not exported by Drupal in its RSS. It's only relevant within the rendering of the page/document itself. There it's then no longer just "Smith" but, for example, "Smith@drupal.vicarsforum.org".

QuoteWhy no need for a

iandickson commented 19 May 2005 at 21:23

Quote

Ends

This might be us agreeing, but speaking different languages.

With a generic taxonomy each site STILL does what it wants, but, instead of building from scratch, the generic taxonomy resource will be used, at least for large elements of taxonomies. It's faster, and easier.

Example - I decide I want to track any posts about a particular drug.

Call it Chemical Name X.

Some drugs have over 100 different brand names (different makes, different markets). Locally, Acol.

Chances are that, at best, I'd run searches on Acol and Chemical Name X. I'd miss all references based around other brand names in other big markets.

With a generic taxonomy, any Drupal site which considered that drug an important aspect of its content would have Drug X in its taxonomy, as it could give a text label (or synonyms) to the Chemical Name X.

Thus when I, as a patient in a UK Drupal community, look for (say) Acol, what I get back is references to Chemical Name X and lots of other brand names around the world, because they all reference an underlying generic. (The MEANING resides in the number tag, not the text.)
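A rough sketch of that "meaning lives in the number" idea, in Python. This is not CommKit's actual design, which isn't described here; the class name `SharedTaxonomy`, the concept ID 1001 and the label "Brand B" are invented for illustration. Posts are tagged with concept IDs, and any label for the concept finds them.

```python
from typing import Dict, List, Optional, Set, Tuple


class SharedTaxonomy:
    """Maps many text labels (any language, any brand) onto one numeric concept ID."""

    def __init__(self) -> None:
        self._concept_by_label: Dict[str, int] = {}

    def add_label(self, concept_id: int, label: str) -> None:
        # Labels are matched case-insensitively.
        self._concept_by_label[label.casefold()] = concept_id

    def concept_for(self, label: str) -> Optional[int]:
        return self._concept_by_label.get(label.casefold())


def find_posts(
    taxonomy: SharedTaxonomy, posts: List[Tuple[str, Set[int]]], label: str
) -> List[str]:
    """Return titles of posts tagged with the concept behind `label`."""
    concept_id = taxonomy.concept_for(label)
    if concept_id is None:
        return []
    return [title for title, tags in posts if concept_id in tags]


taxonomy = SharedTaxonomy()
for name in ("Chemical Name X", "Acol", "Brand B"):  # "Brand B" is a made-up label
    taxonomy.add_label(1001, name)  # 1001 is an arbitrary example ID

posts = [
    ("UK post about Acol", {1001}),
    ("Post from another market about Brand B", {1001}),
    ("Unrelated post", {2002}),
]

# Searching by the local brand name also finds the post from the other market.
print(find_posts(taxonomy, posts, "acol"))
```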

Also the use of such a model allows Drupal Site Owners to locate other sites with significant overlap re taxonomies.

For example I could set up a Drupal Site for people in Gloucester with Autistic Children. Later someone in Bristol sets up one for Bristol, quite possibly unknown to me. But our taxonomies will have big overlap.

For rare conditions think about my site finding a French one, in French, which I don't speak, because of the taxonomy overlap. Suddenly two small, isolated groups, become a bigger one.
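One possible way to measure "taxonomy overlap", sketched in Python. The choice of Jaccard similarity, the 0.5 threshold, the site names and the concept IDs are all my assumptions for illustration, not something taken from CommKit. Because only numeric concept IDs are compared, the language of the labels plays no part.

```python
from typing import Dict, List, Set, Tuple


def taxonomy_overlap(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two sets of concept IDs (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_sites(
    mine: Set[int], others: Dict[str, Set[int]], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Rank other sites by overlap with my taxonomy, dropping weak matches."""
    scored = [(site, taxonomy_overlap(mine, terms)) for site, terms in others.items()]
    strong = [entry for entry in scored if entry[1] >= threshold]
    return sorted(strong, key=lambda entry: entry[1], reverse=True)


gloucester = {101, 102, 103, 104}
others = {
    "bristol.example": {102, 103, 104, 105},
    "fr.example": {101, 102, 103, 104, 109},  # labels in French, same concept IDs
    "cooking.example": {900, 901},
}

for site, score in similar_sites(gloucester, others):
    print(site, round(score, 2))
```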

Cheers

Project for the disabled

Kobus commented 18 May 2005 at 11:17

Hi, Edward,

Your idea sounds very good, and I'd love to get involved in testing and perhaps documentation and translation. My development skills are ok, but I don't believe up to the standard of this project, so I won't offer assistance there.

I believe I would be able to use your module very well in my current project I am starting to get off the ground, which is a project for disabled people. See this node: http://drupal.org/node/22997 for more information about this.

The benefit I see for your module for my project is that it, if I understand your post correctly, allows me to search (and maybe import?) information from other Drupal sites and have them formatted by *my* site theme, which will be all about making information accessible for disabled people.

Before I even completely read your text, I realized the potential for disabled people, and wrote my comment, and then saw that you specifically mention disabled people in your text. This is absolutely wonderful!

Would it be possible for you to give me a bit more information about this, please?

-- Kobus

"The benefit I see for your

Edward C. Zimmermann@drupal.org commented 19 May 2005 at 05:20

"The benefit I see for your module for my project is that it, if I understand your post correctly, allows me to search (and maybe import?) information from other Drupal sites and have them formatted by *my* site theme, which will be all about making information accessible for disabled people."

That's one of the (intentional) side-effects. It's more than that: it also creates a virtual one-stop for a larger community of Drupal-based communities and blogs.

Imagine I'm interested in contributions/stories/articles about the flurry of right-wing SPAM hitting the German networks. I'd maybe search for "Nazi SPAM" (or maybe ask for something like "Nazi" and "SPAM" in the same sentence) and get back a front page with the articles as if they were on my local system. Selecting one would then allow me to view the article as if it too were local (and not, as now, where we just provide a link to the remote system's page). From the vantage point of the user, beyond the names (roles), the difference between local and remote fades, and content/interest etc. defines one's own virtual front page, rendered in one's own personal way.

Some of this can be

jvandyk commented 18 May 2005 at 23:49

Some of this can be accomplished with the publish and subscribe modules that I am working on. Screenshots. Taxonomies are mappable between Drupal sites, but there is no "master vocabulary" at this point.

Publish Subscribe

Edward C. Zimmermann@drupal.org commented 19 May 2005 at 05:09

Are you not importing content from remote sites into your local database? The nodes for rendering are created locally from local replicas--- node created locally and rendered locally? Or are you storing a linkage to a remote method to have the content exported on demand--- node created remotely but rendered locally?
Tell me more!

[my suggestion was the latter, and the information discovery, filtering, search etc. are via remote services, so there are no additional local storage demands to participate in a large community]

Correct, our approach is to

jvandyk commented 19 May 2005 at 11:14

Correct, our approach is to import the content, but for the purpose of efficient searching and synchronization of content. For example, Drupal module documentation could be published on many developers' sites but searchable as one body of information on drupal.org. When results are displayed, we intend to have the URLs point to the original (non-local) node.

"Export Quality"

Edward C. Zimmermann@drupal.org commented 19 May 2005 at 18:48

"Correct, our approach is to import the content,"

This demands resources (storage, internet transfer etc.), which is hardly efficient.

" but for the purpose of efficient searching and synchronization of content. For example, Drupal module documentation could be published on many developers' sites but searchable as one body of information on drupal.org."

And my model is that the user interface to search is via one's own Drupal! The toy module we demonstrate above talks via a simple protocol to a remote machine that handles the search. In the case of the D-A-S-H project ( http://www.d-a-s-h.org ) we will provide a module that will allow people to integrate within their own Drupal sites a search of the collection of millions of web pages that were considered subject-appropriate (in the case of the D-A-S-H project, to the issues of racism, anti-Semitism and xenophobia). The result of the search is used to build a synthetic node, and this is then handled by the Drupal rendering apparatus (themes etc.). The design and UI of results is whatever theme has been chosen.

"When results are displayed, we intend to have the URLs point to the original (non-local) node."

In the case of our "network of Drupals" it's what we are doing now, but I'd like to have a module fetch a node remotely to be rendered locally. Perhaps we might need to extend property rights to allow content owners to control their rendering rights (as in "I don't want my content rendered remotely"), similar perhaps to the concepts of cache control and robot exclusion on the standard Web.
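A tiny Python sketch of what such a rendering-rights check might look like. Nothing like this exists in Drupal as far as this thread shows; the `allow_remote_render` flag and the `present` function are hypothetical names chosen for illustration. The owner sets a flag on the exported node, and a remote site that honours it falls back to a plain link, much as a well-behaved robot honours an exclusion rule.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExportedNode:
    title: str
    body: str
    source_url: str
    allow_remote_render: bool = True  # hypothetical flag set by the content owner


def present(node: ExportedNode, is_local: bool) -> Tuple[str, str]:
    """Decide how a site may show a node.

    Returns ("render", body) when the node may be rendered with the local
    theme, or ("link", url) when the owner only permits a link back.
    """
    if is_local or node.allow_remote_render:
        return ("render", node.body)
    return ("link", node.source_url)


open_node = ExportedNode("Open", "Free to render.", "http://a.example/node/1")
closed_node = ExportedNode(
    "Closed", "Render at home only.", "http://a.example/node/2",
    allow_remote_render=False,
)

print(present(open_node, is_local=False))
print(present(closed_node, is_local=False))
print(present(closed_node, is_local=True))
```

Like robot exclusion, this only works among cooperating sites; it is a convention, not an enforcement mechanism.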

Efficiency

jvandyk commented 19 May 2005 at 20:25

There will be tradeoffs in efficiency, bandwidth, and time-to-result just like in any software project. In our case, it's much faster to search in one place rather than doing distributed search, and since the sites are hosted on the same machine bandwidth is a nonissue. Your experience will, of course, differ.

A request for pull-publishing can be thought of as a search request if it comes with conditions (e.g., author = fred, created > 5/19/2005).
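To illustrate the point that a conditional pull is really a search, here is a minimal Python sketch. It is not the publish/subscribe modules' actual interface, which isn't shown in this thread; `Node` and `pull` are made-up names, and the date 5/19/2005 is read as 19 May 2005.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Node:
    title: str
    author: str
    created: date


def pull(
    nodes: Iterable[Node],
    author: Optional[str] = None,
    created_after: Optional[date] = None,
) -> List[Node]:
    """Return the nodes matching every condition that was supplied."""
    selected = []
    for node in nodes:
        if author is not None and node.author != author:
            continue
        if created_after is not None and not node.created > created_after:
            continue
        selected.append(node)
    return selected


published = [
    Node("Old note", "fred", date(2005, 5, 1)),
    Node("New note", "fred", date(2005, 5, 20)),
    Node("Other note", "mary", date(2005, 5, 21)),
]

# author = fred, created > 5/19/2005
for node in pull(published, author="fred", created_after=date(2005, 5, 19)):
    print(node.title)
```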

Because the publishing sites will also be sending their taxonomy terms, one of our design goals is to offer taxonomy-based browsing on the subscribing site.

I am in no way denigrating your model. It's open source -- build what you dream of.

Networked Metropolis

Edward C. Zimmermann@drupal.org commented 20 May 2005 at 07:17

" it's much faster to search in one place rather than doing distributed search, "

I've not talked about distributed search but, instead, from the point of view of services, about centralized back-end search with multiple front-ends. In this context, and especially in some of the youth communities that have been addressed in D-A-S-H, not all have or will have adequately high grades of connectivity to offer good grades of search services. Distributed search among equally well connected peers with sufficient computing resources makes a lot of sense, but distributed search to everyone and everything does not. This has been the lesson we've all learned over the years, and it is the idea behind not just Advanced Search Facility, the "Open Archive Initiative" and other initiatives and projects for information distribution and interoperability but really also RSS.

"and since the sites are hosted on the same machine bandwidth is a nonissue."

It's a complete non-issue, since one can be assured that all modules and services are available to each and every virtual Drupal and one can create access rights to the information in each and everyone's RDBMS tables. The issue is, at most, property and rights management AND, of course, search (for which these RDBMSs are ill-suited, thus demanding another layer of applications anyway).

Most of the kinds of networked communities I'm thinking about are not all housed at the same location, rack or machine; the computer is literally the network. Think about global companies, organizations or interest groups where there are pockets of teams across the planet. Each working group might have its own "community" defined on its own installation of Drupal on one of its machines, but how do they all connect up with each other? How does, say, a salesman in Italy get to know that the marketing team in Brazil has a project already running on the national office of a key account? Sucking all the "data" in from all the computer nodes does not make sense, since it creates unnecessary demands on resources. Syndication is, instead, a good model for this. Now, one could demand that people go to a one-stop (like FirstGov, FindIt or Yeehaw), or one can provide a layer that allows participating nodes (Drupals) to integrate the search-and-retrieval functionality of the one-stop within their own community. Now remove the issue of "goto" as in external linkage (which has long been an issue that has plagued hypermedia, jokingly called "Lost in Hyperspace" back in the days of even richer hypermedia and better defined linkage models) and it quickly becomes concept-driven personalized communities, not unlike how each of us lives in our cities, each in our own world.

I'm really interested in this topic

dikini commented 19 May 2005 at 07:48

I'm really interested in this, or at least the larger picture. I want to work on distributed communities, or communities of interest, if you like, where the location of content does not really matter. Something similar to what you are speaking about.

I have experimental code, in the pan, to do p2p-like neighbourhoods between webservers. I am going to port it eventually to Drupal. Until then I'm very much interested in helping out in such development.

...

adrian commented 19 May 2005 at 14:24

You have experimental code to do just about anything, don't you =)

--
The future is so Bryght, I have to wear shades.

:))

dikini commented 20 May 2005 at 09:00

not just about everything, but losing my sleep over the missing bits

The p2p part, is something I want to use for distributed identification, and something like the search you are describing. Why? Apart from being a sysadmin, webmaster, webdesigner (wannabe at least), video-conferencing victim, I am doing a part-time PhD and I am big time into chaotic and weird behaviours - so you can see where I am coming from. I need to prepare a paper or some other kind of publication on these to make a claim to fame, and the code and results will end up here, well, maybe. It's not only my call.

--
"Confusing the world for the sake of humanity" just sounds right.

Peer-to-Peer, Drop-to-Drop

Edward C. Zimmermann@drupal.org commented 20 May 2005 at 07:19

"I have experimental code, in the pan, to do p2p-like neighbourhoods between webservers. "

Part of what I've suggested is that Drupal itself is (or can be) the peer-to-peer!

yes, it is

dikini commented 20 May 2005 at 09:13

drop-to-drop, raining could or I would say should be implemented as a p2p kind of network. The biggest challenge is how to make it as simple as possible, yet as flexible as possible.

Why?

It's not only the elegance of implementation. For example:

How do you filter and route content items to the most appropriate site feed?
How do you maintain an interest neighbourhood?
How do you discover new sites or new feeds related to yours?
How do you tackle the lack of control over the remote content, while maintaining your own feed cred?

These are just issues off the top of my head. A p2p network can easily fall prey to spam, disinformation, identity fraud for the same reasons that make it useful. It is quite a nice challenge I must admit.

I can't post my code into the open at the moment, but will try to very soon. A promise. Sometimes I'm a bit slow, but usually everything ends up just fine.

--
"Confusing the world for the sake of humanity" - why not

Its raining content

Edward C. Zimmermann@drupal.org commented 20 May 2005 at 09:58

How do you filter and route content items to the most appropriate site feed?

No need to route or even filter: wrong paradigm. It's about search. One gathers all the RSS feeds and indexes them. There is some structure in RSS. The rest is "search magic" with queries like:
- What "articles" have mentioned "SPAM" and "Nazi" in the same sentence? Or in the title? Sorted maybe by a weighted combination?
- Or what other articles are relevant to this one?
- Who has a linkage or cross-reference to such-and-such a resource?
- What articles published before 11 Sept 2001 mention "Bin Laden" and "World Trade Center" in the same paragraph?
(I would suggest that search services should not be restricted to just current frontpage/RSS content but should maintain a "way-back" archive. In contrast to the Web, I think the general usefulness of a specific syndication snapshot is higher and the demands on resources are, on the whole, much lower.)

Or any of a number of queries.
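As a toy illustration of the first and last kinds of query above, here is a Python sketch of "same sentence" matching combined with a date cut-off. It is in no way the real engine: it scans text instead of using an index, splits sentences crudely on ". ! ?", and only handles single-word terms. `FeedItem`, `in_same_sentence` and `search` are names invented for this sketch.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class FeedItem:
    title: str
    text: str
    published: date


def sentences(text: str) -> List[str]:
    """Very rough sentence splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def in_same_sentence(text: str, *terms: str) -> bool:
    """True if one sentence contains every term as a whole word (case-insensitive)."""
    wanted = {term.casefold() for term in terms}
    for sentence in sentences(text):
        words = set(re.findall(r"\w+", sentence.casefold()))
        if wanted <= words:
            return True
    return False


def search(
    items: Iterable[FeedItem], terms: List[str], before: Optional[date] = None
) -> List[str]:
    """Titles of items with all terms in one sentence, optionally published before a date."""
    return [
        item.title
        for item in items
        if in_same_sentence(item.text, *terms)
        and (before is None or item.published < before)
    ]


items = [
    FeedItem("A", "Nazi spam is flooding the networks. Admins are busy.", date(2005, 5, 17)),
    FeedItem("B", "A story about spam. Another about Nazi propaganda.", date(2005, 5, 18)),
    FeedItem("C", "More Nazi spam arrived today!", date(2005, 5, 20)),
]

print(search(items, ["SPAM", "Nazi"]))
print(search(items, ["SPAM", "Nazi"], before=date(2005, 5, 19)))
```

Item "B" mentions both words but in different sentences, so it is not returned; that is the distinction a plain keyword search cannot make.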

How do you maintain an interest neighbourhood?

Canned queries and the search equivalent to a bookmark.

How do you discover new sites or new feeds related to yours?

In my model it's not about how "I discover new feeds" but about how "feeds announce themselves". I'd argue, as City Hall, that it's probably pragmatic and appropriate to have feeds register with a federated directory. In the context of the D-A-S-H project, for instance, new sites built using Drupal and relevant to the subject would register their site with the EU D-A-S-H registry (moderated, in this case, to establish barriers to sabotage and SPAM, especially relevant in this highly charged and highly visible political arena). Sites can, of course, also be proposed, as in the many Open Directories.
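A minimal Python sketch of a moderated registry along these lines. It is purely illustrative: the D-A-S-H registry's real interface is not described here, and `FeedRegistry` with its methods is a made-up name. Sites announce their own feeds, and only feeds a moderator has approved are handed to the indexer.

```python
from typing import Dict, List


class FeedRegistry:
    """Sites announce their feeds; a moderator decides which ones get indexed."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}
        self._approved: Dict[str, str] = {}

    def announce(self, site: str, feed_url: str) -> None:
        """Called by (or on behalf of) a site; the feed waits for moderation."""
        if site not in self._approved:
            self._pending[site] = feed_url

    def approve(self, site: str) -> bool:
        """Moderator accepts a pending site. Returns False if nothing was pending."""
        feed_url = self._pending.pop(site, None)
        if feed_url is None:
            return False
        self._approved[site] = feed_url
        return True

    def reject(self, site: str) -> None:
        """Moderator discards a pending announcement."""
        self._pending.pop(site, None)

    def feeds_to_index(self) -> List[str]:
        """Only approved feeds are ever handed to the search service."""
        return sorted(self._approved.values())


registry = FeedRegistry()
registry.announce("youth.example", "http://youth.example/node/feed")
registry.announce("spammer.example", "http://spammer.example/node/feed")
registry.approve("youth.example")
registry.reject("spammer.example")
print(registry.feeds_to_index())
```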

In a corporate intranet it's irrelevant, since only well-defined sites would be "collected".

How do you tackle the lack of control over the remote content, while maintaining your own feed cred?

It's not my content to control. The role is, at most, one of censor (really a nasty word for the concept, but the most appropriate one), or of how to try to defend the open metropolis from its enemies. Since these cosmopolitical communities have (or should have) some common ground, and since we'd rank things by relevance (and not links or popularity), there is also less chance of letting things get co-opted by information oligarchies (as is the case with the popular Web search guides and services).

No need to route or even

dikini commented 20 May 2005 at 11:51

No need to route or even filter: wrong paradigm. It's about search. One gathers all the RSS feeds and indexes them. There is some structure in RSS.

Canned queries and the search equivalent to a bookmark.

Yes, RSS has structure and you can and should use it to your advantage. It will work, but it has limits. How do you update the 'bookmarks'? Probably by user recommendation or editor intervention. I personally dislike the subscription model. How do I, or why should I, maintain central taxonomies? There is too much centralised control in it. I prefer lazy gathering of information. That is, relevant information comes to you as opposed to you hunting for it.

I could probably (not sure about it, it's more of a gut feeling) prove that a distributed system with local, per-server content filters can, over time, perform equally well or better than a centrally administered search engine, or a collective of independent centrally administered search engines. It should behave like a geographically dispersed neural network; that's not 100% correct, but it describes the feeling of what I mean.

My talk at the moment is quite cheap, no code. I will try to put what I can in vi or something, and describe exactly what I mean by content routing and filtering and how I see it working. Actually I prefer all of that to be public; there is more benefit in having peer scrutiny or even destruction of my ideas than in being ashamed afterwards.

Central Committee

Edward C. Zimmermann@drupal.org commented 22 May 2005 at 09:49

I personally dislike the subscription model.

It's hardly a subscription model. Here you don't generally subscribe to a feed from a specific source (although this is not excluded) but, if anything, to a subject theme defined by a query (and there is no real need to subscribe to anything). It's up to the federated search services to track syndicated content.

How do I, or why should I, maintain central taxonomies?

You don't. They don't work well anyway.

There is too much centralised control in it.

Not really. These federated services don't need to be "central"; they are really only centralized from the viewpoint of the user, not of the architecture. A search, in fact, could well be distributed over a virtual target that spans not only multiple data collections but also multiple servers and services. If you want, you can think of it as being like name services such as bind.
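The "virtual target" can be sketched in a few lines of Python. This is an illustration under loud assumptions: each back-end is modelled as a plain function returning (relevance, URL) pairs, scores from different back-ends are assumed to be directly comparable (in practice they rarely are without normalisation), and `federated_search`, `backend_a` and `backend_b` are invented names. The caller sees one result list and never learns how many servers answered.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]               # (relevance score, URL)
Backend = Callable[[str], List[Hit]]  # one search service behind the virtual target


def federated_search(query: str, backends: List[Backend], limit: int = 10) -> List[Hit]:
    """Ask every back-end, keep the best score per URL, return one ranked list."""
    best: Dict[str, float] = {}
    for backend in backends:
        for score, url in backend(query):
            if score > best.get(url, float("-inf")):
                best[url] = score
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(score, url) for url, score in ranked[:limit]]


# Two stand-in back-ends with canned answers.
def backend_a(query: str) -> List[Hit]:
    return [(0.9, "http://a.example/node/1"), (0.4, "http://a.example/node/2")]


def backend_b(query: str) -> List[Hit]:
    return [(0.7, "http://b.example/node/5"), (0.5, "http://a.example/node/2")]


for score, url in federated_search("film", [backend_a, backend_b]):
    print(score, url)
```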

Instead of sending users to 100s of servers, perhaps using 100s of different user interfaces, it's a common interface to information within one's own Drupal, thus even using the look and feel ("theme" or "chrome") of one's own space.

Yes, there is peer review on these federated services, and that might not include services from some sites that one might personally want. Should such a site, however, deploy an interoperable search service (such as SRW: http://www.loc.gov/z3950/agency/zing/srw/ ) then it too could be, in this design, transparently added.

Right now I'm just addressing the simple subcase. Drupal servers and RSS.

I prefer lazy gathering of information. That is, relevant information comes to you as opposed to you hunting for it.

I don't see the difference.