e! Science News is a site dedicated to providing the very latest science news, with a special twist: it is entirely automated! There is no human editor behind it. The engine finds relationships between news stories from all major science sites, then regroups, categorizes, ranks and tags them, finds related press releases, and publishes everything directly on the site. The result is an efficient overview of everything happening in science, right when it happens.
On my previous site, Biology News Net, I hit the limitations of Movable Type very quickly: adding functionality was complicated, performance was not great (even on a dedicated server), and customization was not really doable. As an example, the forum is actually a phpBB installation with its sessions tied to the Movable Type sessions; it works, but it is clunky, and upgrading is a nightmare. I could not really expect more from a blogging engine – Movable Type served me well – so I searched for something better. That is when I found Drupal and fell in love with it! It has a significant learning curve, but it is so powerful that the time invested in learning it easily pays off in the long run.
I wanted to build a site that would report science news as it happens. I felt the need to automate the process while keeping the quality high. So much science news is posted every day, but frankly, not all of it is interesting. I found inspiration in initiatives like Techmeme and Google News.
Building the e! Engine
First, I identified the principal components of an intelligent news aggregator:
- A source of news, such as an RSS aggregator
- A clustering engine, to group news together
- A classification engine, to categorize the news (Is this Biology, Physics, Medicine or Astronomy?)
- A way to assign scores to clusters, to determine in which order the news should be displayed
First, I needed a good RSS aggregator; the default one provided by Drupal was inadequate, as it did not create Drupal nodes out of RSS items. The other aggregators available at the time, Leech and Feedparser, implemented a lot of functionality I did not really need and were not yet mature. Fortunately, Ted Serbinski (m3avrck) came to the rescue, releasing Simplefeed just when I needed it. Simple, fast, efficient: I could not ask for more!
Next, I built three custom modules to implement the remaining functionality. The process was easy thanks to the infinite extensibility of Drupal – the hooks system is just wonderful! The result is what I call the Eureka! Engine. Here are some details about these three modules:
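All three modules plug into Drupal through its hooks. As a minimal sketch of the pattern (assuming the Drupal 6 hook_nodeapi signature; the node type and the queueing function are hypothetical), a module like clusterer.module can react the moment Simplefeed saves a new feed item node:

```php
<?php
function clusterer_nodeapi(&$node, $op) {
  // React when Simplefeed creates a node out of an RSS item.
  // 'feeditem' is a placeholder for Simplefeed's actual node type.
  if ($op == 'insert' && $node->type == 'feeditem') {
    // Queue the fresh item so the next cron run can cluster it.
    clusterer_queue_item($node->nid);
  }
}
```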
The first step is to regroup similar items into clusters. Two things are needed to cluster items together: a similarity metric and a clustering algorithm. In the case of text, a similarity metric can be based on the words two texts have in common – each document is represented as a vector of word occurrences (see the vector space model).
Fortunately, the MySQL fulltext engine does exactly this: by using a whole article as a fulltext query against all other articles, it is possible to calculate a similarity score between every pair of articles. I use a sliding window technique to limit the number of items clusterer.module needs to look at – highly related news items are usually published within a relatively small timeframe, so looking at a 'window' of a few days' worth of news at a time is adequate. Hierarchical clustering algorithms have a complexity of O(n^2), sometimes worse, so computing clusters gets quadratically more expensive as you add items. To regroup items, I used a Perl API to a C library implementing hierarchical clustering; it is very fast, clustering a thousand items in a few seconds. I initially tried to implement my own clustering algorithm in PHP, but it was slow and memory-inefficient. Do not reinvent the wheel if you do not have to! A lot of tweaking was necessary to find the clustering parameters that give good precision and accuracy. Be too permissive and you end up with mega-clusters of unrelated stories; be too restrictive and related stories no longer cluster together. In the end, I got something very satisfactory.
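To make the metric concrete, here is a minimal sketch of the fulltext trick, written for a hypothetical {news_item} table with a FULLTEXT index on (title, body); the schema and the three-day window are illustrative, not the site's actual setup:

```php
<?php
// Score every recent article against the text of one article. MySQL's
// natural-language fulltext relevance doubles as the similarity metric.
$result = db_query("
  SELECT nid, MATCH (title, body) AGAINST ('%s') AS similarity
  FROM {news_item}
  WHERE created > %d -- sliding window: only the last few days of news
    AND MATCH (title, body) AGAINST ('%s') > 0
  ORDER BY similarity DESC",
  $text, time() - 3 * 86400, $text);
```

Each pairwise score can then be turned into a distance and handed to the hierarchical clusterer.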
There are many classification algorithms out there; I needed an accurate one but, most importantly, a very fast one (there are many items to classify from the RSS feeds). I chose a naïve Bayesian filter – your email software probably uses a similar approach to determine whether incoming mail is spam or not. In the case of Eureka Science News, the algorithm needs to classify incoming news items into eight categories: Astronomy, Biology, Climate, Health, Math, Palaeontology, Physics and Psychology. For this purpose, I ported and reworked a naïve Bayesian PHP library for Drupal. It is reasonably fast, categorizing a new item in about a tenth of a second. The system is surprisingly accurate once trained properly; of course, it makes errors every now and then (especially when it encounters a post made up of words it has not seen before), but I am pleased with its performance so far. In the future, the system could be improved by using latent semantic analysis.
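The core of the approach fits in a dozen lines. Here is a self-contained sketch of naïve Bayesian scoring (not the actual library the site uses); the probability tables would come from training counts:

```php
<?php
// $prior[$cat] is P(category); $word_prob[$cat][$word] is P(word|category).
function nb_classify(array $words, array $prior, array $word_prob) {
  $best_cat = NULL;
  $best_score = -INF;
  foreach ($prior as $cat => $p) {
    // Sum logs instead of multiplying probabilities, to avoid underflow.
    $score = log($p);
    foreach ($words as $word) {
      // Tiny floor for words never seen in this category (crude smoothing).
      $prob = isset($word_prob[$cat][$word]) ? $word_prob[$cat][$word] : 1e-6;
      $score += log($prob);
    }
    if ($score > $best_score) {
      $best_score = $score;
      $best_cat = $cat;
    }
  }
  return $best_cat;
}
```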
Finally, I built a module to rank clusters of news and to find and parse related press releases. The module ranks clusters based on the number of items they contain, the timeliness of each of those items, and a few other factors such as popularity. The score is time-decayed using a formula based on radioactive half-life decay, keeping only fresh or popular news at the top of the front page.
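The half-life idea itself is a one-liner. A sketch (the 24-hour half-life below is an assumption for illustration, not the value the site uses):

```php
<?php
// A cluster loses half of its remaining score every $half_life seconds,
// exactly like radioactive decay.
function eureka_decayed_score($base_score, $published, $half_life = 86400) {
  $age = time() - $published;
  return $base_score * pow(0.5, $age / $half_life);
}
```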
The system as a whole outperformed my expectations; it even outsmarted me on a few occasions, finding links between stories I did not think were related. (Why did it regroup those four articles? They are not highly related! Oh, wait… they are slightly related, and they were all presented at the same international conference. Cool!)
You should be able to do the same on your own sites pretty soon, thanks to the memetracker.module from Kyle Mathews, a Google Summer of Code 2008 student! He plans to build a generic framework allowing the automatic clustering and classification of Drupal nodes. I cannot wait to see what kind of implementation he comes up with – a generic framework is more complicated to build than a site-specific one such as Eureka Science News.
Search: Meet the Sphinx
At first I used Google Site Search as the search solution, as I felt that Drupal's search.module could not keep up with a very large number of nodes (the site produced about 50,000 nodes in a month and a half of testing – what about in five years, with more sources?). With Google Site Search, updates were not instant (there is a significant delay between when an item is published and when Google crawls it), and the interface was very restrictive: even though I could integrate the results directly into the site, I could not alter the layout or the results in any way. I discovered Sphinx Search thanks to a post by chx on his blog. It is easy to configure and extremely fast – it indexes hundreds of thousands of nodes in mere seconds, and all searches return in one second or less. Using a 'main + delta' indexing scheme, I can index news as soon as it is published.
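If you have not seen the 'main + delta' pattern before: the large main index is rebuilt rarely, while a tiny delta index, covering only the nodes added since the last main build, is rebuilt every few minutes. Here is a sketch of what that can look like in sphinx.conf, following the pattern from the Sphinx documentation (the table, field and path names are assumptions, not the site's actual configuration):

```
# Track the highest indexed node id in a helper table at main-build time.
source main
{
  type          = mysql
  sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(nid) FROM node
  sql_query     = SELECT nid, title, body FROM node \
                  WHERE nid <= (SELECT max_nid FROM sph_counter WHERE id = 1)
}

# The delta source inherits everything from main but only fetches new rows.
source delta : main
{
  sql_query_pre =
  sql_query     = SELECT nid, title, body FROM node \
                  WHERE nid > (SELECT max_nid FROM sph_counter WHERE id = 1)
}

index main
{
  source = main
  path   = /var/data/sphinx/main
}

index delta : main
{
  source = delta
  path   = /var/data/sphinx/delta
}
```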
In addition to powering the search form, Sphinx generated the stopwords list used for clustering and classification, via one of its built-in functions.
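Sphinx's indexer has a built-in mode that dumps the most frequent words of an index, which is a natural way to build such a list (the index and output file names below are placeholders):

```
# Write the 100 most frequent words of the 'main' index to stopwords.txt,
# along with their frequencies.
indexer main --buildstops stopwords.txt 100 --buildfreqs
```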
I used a CSS framework: Blueprint CSS. It is ideal for a grid design (which in turn is great for news sites). The results obtained with Blueprint are cross-browser compliant, which is a huge timesaver. The framework comes bundled with a CSS reset (so that the site looks the same in every browser) and a nice basic typography that keeps a vertical rhythm of 18px: every line of text falls on the same virtual 'grid line', giving a vertical rhythm that is easy on the eye.
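The rhythm boils down to making every line-height and vertical margin a multiple of the 18px baseline. An illustrative fragment (not Blueprint's actual source):

```css
/* Baseline: 12px type on an 18px line. */
body { font-size: 12px; line-height: 18px; }
/* Paragraph spacing consumes exactly one line of rhythm. */
p { margin: 0 0 18px; }
/* Headings span a whole number of 18px lines, so text below stays aligned. */
h2 { font-size: 24px; line-height: 36px; margin: 0 0 18px; }
```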
Coupled with Panels.module, Blueprint allowed me to create the design of most major pages on the site in record time; it is also very simple to test different layouts, since the framework is so easy to use. Even with a framework, CSS still has pitfalls – I was hit by somewhat obscure bugs in different browsers (the guillotine bug and menus completely disappearing in IE6, to name a few). I used Browsershots.org to a great extent: screenshots of your site in just about every browser, free!
Performance is always the #1 concern when building a site with a potentially huge number of nodes. Eureka Science News is quite complex, so I wrote custom SQL for almost everything instead of using the Views module; just about every query is optimized and thus very fast. Views is used exclusively for the Archive.
MySQL optimization matters: the Drupal page caching mechanism and the MySQL query cache are not an excuse for badly optimized queries. What if I want users to register in the future? What about the one user who hits an uncached page and is forced to wait seconds while it loads?
The site is hosted on a ServInt VPS. I installed the APC opcode cache and tweaked Apache and MySQL, but have not yet installed more advanced performance improvements like Memcache. Right now, ab (ApacheBench) gives upward of 500 requests per second on a cached page, which is very satisfying.
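For reference, this is the kind of ApacheBench run I mean (the URL and numbers are placeholders):

```
# 1000 requests, 10 concurrent, against the cached front page.
ab -n 1000 -c 10 http://example.com/
```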
I applied the YSlow principles thoroughly – a special thanks to Wimleers for his detailed post about YSlow vs. Drupal. While I left out things like a content delivery network for now, as I feel it would be overkill, I am keeping a very interested eye on his CDN.module. Some of the things that made a small but noticeable difference in page loading times:
- Minified CSS and JS using the Yahoo! YUI Compressor (see the example after this list)
- Aggregated JS and CSS
- Used CSS sprites for small icons
- Used pngcrush and PNGslim to reduce the size of all the images on the site
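Minification with the YUI Compressor is a one-liner per file; something like this, where the jar and file names are placeholders for whatever versions you use:

```
java -jar yuicompressor.jar style.css -o style.min.css
java -jar yuicompressor.jar script.js -o script.min.js
```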
I also used Drupal's cache_set() and cache_get() functions for SQL-intensive blocks that appear on every page, like the 'popular' block.
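The pattern looks like this, sketched against the Drupal 6 cache API (the cache ID, the builder function and the 15-minute lifetime are made up for the example):

```php
<?php
function _eureka_popular_block_content() {
  // Serve the block from the cache table whenever possible.
  if ($cached = cache_get('eureka:popular_block')) {
    return $cached->data;
  }
  // Cache miss: run the expensive SQL once...
  $content = _eureka_popular_block_build();
  // ...and keep the result for 15 minutes.
  cache_set('eureka:popular_block', $content, 'cache', time() + 900);
  return $content;
}
```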
The optimization paid off: the site was hit by Reddit (two different stories), YC News and StumbleUpon on launch day and the following days, and the relatively modest VPS did not even break a sweat. Page loads were still instant, even with about 3,000 page views within a few hours and about 15,000 over a few days. Since then we have been hit by Slashdot multiple times and were featured on the front page of Mashable, with no performance problems whatsoever. I feel confident that a single VPS will be able to handle the growth of the site for quite some time.
Developing on Windows
Doing web development work on Windows is possible! I built everything using Windows exclusively, mainly with WAMP and Notepad++. I also could not live without the Firebug and 'It's All Text!' extensions for Firefox.