Eureka! Science News just launched!
Eureka! Science News just launched – it is a site dedicated to providing the very latest science news, but with a special twist: it is entirely automated! There is no human editor behind it. It finds relationships between news stories from all major science sites, then regroups, categorizes, ranks and tags them, finds related press releases and publishes everything directly on the site. The result is an efficient overview of everything happening in science, right when it happens. The following details how we built the site.
History
First, a little bit of history about how I discovered Drupal; I launched Biology News Net 4 years ago using Movable Type – biology is the #1 science and I found it weird that no site was dedicated to biology news. The site quickly became popular (#1 on Google for ‘biology news’) – it was unexpected, as the site was started as a hobby project / blog and thus I hit the limitations of Movable Type really fast; adding functionality was complicated, performance was not great (even on a dedicated server), customization was not really doable. Just as an example, the forum is actually a phpBB installation that has its sessions tied to the Movable Type sessions - it's clunky even if it works, and upgrading is a nightmare. As I could not really expect more from a blogging engine – Movable Type served me well - I searched for something better - this is when I found Drupal (about 2 years ago) and fell in love with it! It has a significant learning curve, but it is so powerful that the time invested to learn it is easily worth it in the long run. While I do not have time to contribute much to the actual development of Drupal, I help when I can and maintain one module (quickstats.module, coded by chx with small improvements from me).
Goals
I wanted to build something bigger than Biology News Net, but also something different – a site that would report news as it happens, but for science as a whole. I update Biology News Net manually and being a busy guy, that means usually only once per day. I felt the need to automate the process while keeping the quality at a high level – there is lots of science news reported daily and frankly, not all of it is interesting! I found the inspiration in initiatives like Techmeme and Google News.
Building the Eureka Engine
First, I identified the principal components of an intelligent news aggregator:
- A source of news, such as an RSS aggregator
- A clustering engine, to group news together
- A classification engine, to categorize the news (Is this Biology, Physics, Medicine or Astronomy?)
- A way to assign scores to clusters, to determine in which order the news should be displayed
First, I needed a good RSS aggregator; the default one provided by Drupal was inadequate, as it did not create Drupal nodes out of RSS items. The other aggregators available at the time were Leech and Feedparser, but they were implementing lots of functionality I did not really need and were not yet mature. Fortunately, Ted Serbinski (m3avrck) came to the rescue, releasing Simplefeed just when I needed it. Simple, fast, efficient: I could not ask for more!
Next, I built three custom modules to implement the remaining functionalities. The process was easy thanks to the infinite extensibility of Drupal – the hooks system is just wonderful! The result is what I call the Eureka! Engine. Here are some details about these three modules:
Clusterer.module
The first step is to regroup similar items in clusters. Two things are needed to cluster items together: a similarity metric and a clustering algorithm. In the case of text, similarity metrics can be based on the occurrence of words in two texts – a document can be represented as a vector (see vector space model).
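To make the vector space idea concrete, here is a minimal sketch in Python (not the code running on the site, which relies on MySQL for this step). It assumes a plain bag-of-words representation with raw counts and cosine similarity as the metric; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = term_vector("Mars rover finds water ice")
    second = term_vector("Water ice found by rover on Mars")
    print(cosine_similarity(first, second))
```

Two articles sharing most of their words score close to 1, while articles with no words in common score 0. A real system would usually weight rare words more heavily than common ones instead of using raw counts.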
Fortunately, the MySQL fulltext engine does exactly this – by using a whole article as a fulltext query against all other articles, it is possible to calculate a similarity score between every pair of articles. I use a sliding window technique to limit the number of items clusterer.module needs to look at (highly related news items are often published in a relatively small timeframe, so looking at a ‘window’ of a few days' worth of news at a time is adequate). This matters because hierarchical clustering algorithms have a complexity of O(n^2), sometimes worse, so computing clusters gets at least quadratically more expensive as you add items: doubling the number of items roughly quadruples the work.
To regroup items, I used a Perl API to a C library implementing hierarchical clustering; it is very fast, clustering a thousand items in a few seconds. I initially tried to implement my own clustering algorithm in PHP but it was slow and memory inefficient. Do not reinvent the wheel if you do not have to! Lots of tweaking was necessary to find clustering parameters that give both good precision and good accuracy. Be too permissive and you end up with mega clusters of unrelated stories; be too restrictive and related stories no longer cluster together. In the end, I got something very satisfying.
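The combination of a sliding window and a similarity threshold can be sketched as follows. This is an illustration in Python, not the Perl/C implementation described above: the Jaccard word-overlap similarity is a stand-in for the MySQL fulltext score, average linkage is one of several possible linkage rules, and the `window_days` and `threshold` values are made-up defaults.

```python
from datetime import datetime, timedelta


def jaccard(a, b):
    """Stand-in similarity: shared words / all words (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def cluster(items, now, window_days=3, threshold=0.3):
    """Average-linkage agglomerative clustering of recent items.

    items: list of (published_datetime, text) pairs. Returns a list of
    clusters, each a list of texts. Merging stops once no pair of clusters
    has an average similarity of at least `threshold`.
    """
    # Sliding window: ignore everything older than `window_days`.
    cutoff = now - timedelta(days=window_days)
    clusters = [[text] for published, text in items if published >= cutoff]

    def linkage(c1, c2):
        return sum(jaccard(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (naive search, fine for a sketch).
        best, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Lowering `threshold` is the "too permissive" direction and produces a few huge clusters; raising it is the "too restrictive" direction and leaves most stories on their own. The pairwise search is deliberately naive, which is exactly why a dedicated library is the better choice for a thousand items.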
Categorizer.module
There are many classification algorithms out there; I needed an accurate one, but most importantly, a very fast one (there are many items to classify from RSS feeds). I chose a naïve Bayesian filter – your email software probably uses a similar approach to determine whether incoming mail is spam or not. In the case of Eureka Science News, the algorithm needs to classify incoming news items into eight categories – Astronomy, Biology, Climate, Health, Math, Palaeontology, Physics and Psychology. For this purpose, I ported / reworked a naïve Bayesian PHP library to Drupal. It is reasonably fast and categorizes a new item in about a tenth of a second. The system is surprisingly accurate once trained properly - of course, it makes errors every now and then (especially when it encounters a post made up of words it has not seen before) but I am pleased with its performance so far. In the future, the system could be improved by using latent semantic analysis.
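For readers unfamiliar with the technique, a naïve Bayesian classifier fits in a few lines. The sketch below is in Python and is not the PHP library mentioned above; it assumes a multinomial model with add-one smoothing and whitespace tokenization, and the class and method names are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs
        self.vocabulary = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if untrained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Work in log space so long articles do not underflow to zero.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("Astronomy", "telescope galaxy star orbit")
    nb.train("Biology", "cell gene protein species")
    print(nb.classify("new galaxy seen by telescope"))
```

The smoothing term is what keeps a single unseen word from forcing a category's probability to zero, which is also why a post made up entirely of unseen words is essentially classified on the category priors alone.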
Publisher.module
Finally, I built a module to rank clusters of news and to find and parse related press releases. The module ranks clusters based on the number of items they contain, the timeliness of each of those items and a few other factors such as popularity – the score is time-decayed using a formula based on radioactive half-life decay, keeping only fresh or popular news at the top of the front page.
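A half-life style decay can be sketched as below. This is only an illustration of the general idea in Python: the 12-hour half-life, the function names and the way item weights are summed into a cluster score are assumptions, not the actual formula used by publisher.module.

```python
def decayed_score(base_score, age_hours, half_life_hours=12.0):
    """Halve the score every `half_life_hours`, like radioactive decay."""
    return base_score * 0.5 ** (age_hours / half_life_hours)


def cluster_score(item_ages_hours, half_life_hours=12.0):
    """Sum of per-item decayed weights: big, fresh clusters score highest."""
    return sum(
        decayed_score(1.0, age, half_life_hours) for age in item_ages_hours
    )


if __name__ == "__main__":
    fresh_small = cluster_score([0, 1, 2])       # three items, just published
    stale_large = cluster_score([48] * 5)        # five items, two days old
    print(fresh_small > stale_large)
```

With this shape of formula an old cluster can only stay near the top if it keeps receiving new items (or some other boost such as popularity), which matches the goal of keeping only fresh or popular news on the front page.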
The system as a whole outperformed my expectations; it even outsmarted me on a few occasions where it found links between stories I did not think were related (why did it regroup those four articles, they are not highly related! Oh, wait… they are slightly related but were presented at the same international conference! Cool!).
You should be able to do the same on your sites pretty soon - thanks to the memetracker.module from Kyle Mathews, a Google Summer of Code 2008 student! He plans to build a generic framework which will allow the automatic clustering and classification of Drupal nodes. Cannot wait to see what kind of implementation he will come up with – a generic framework is more complicated to build than a site-specific one such as Eureka Science News. If you read this Kyle, just send me a message should you want to talk about it!
Design

Having more time than money, I decided to design the site myself - professional designers are expensive! I used a CSS framework: Blueprint CSS. It is ideal for a grid design (which in turn is great for news sites). It’s really easy to build any grid design you might think of using Blueprint – but it only allows for fixed width designs (at least for now). In my opinion it's better to use a framework than to fiddle with a custom made semi-liquid CSS design that will break in just about every browser, but in different, obscure ways. The results obtained with Blueprint are cross-browser compliant, which is a huge timesaver. The framework comes bundled with a CSS reset (so that the site will look the same in every browser) and a nice basic typography which keeps a vertical rhythm of 18px (this means that every line of text falls on the same virtual ‘line’, giving a vertical rhythm which is pleasing to the eye).
Coupled with Panels.module, it allowed us to create the design of most major pages on the site in record time – it is also very simple to test different layouts since it is so easy to use. Even when using a framework, CSS still has pitfalls – I was hit by somewhat obscure bugs in different browsers (guillotine bug, some menu completely disappearing in IE6, just to name a few). I used Browsershots.org to a great extent – screenshots of your site in just about every browser, free!
Drupal - Module list
The site is built using Drupal 5.x as some critical components / modules were not yet available for 6.x. It is sometimes frustrating to see the shiny new Drupal version that is much better than what you’re developing on, but that’s the way it is – I plan to upgrade soon, but in the meantime, I patched 5.x with a few of the 6.x features that were most interesting for this project (such as JS aggregation).
The power of Drupal lies in the strength of the community; someone somewhere has probably contributed something you need for a particular feature! Here is a short list of the contributed modules I used, in no particular order:
- Simplefeed
- Views
- CCK
- Pathauto
- Global Redirect
- Imagecache
- Forward
- Panels (1.x as 2.x came a bit late for us to use, sadly)
- jLightbox
- Taxonomy access control
- Quickstats
- CAPTCHA
- Service Links
Notice that the list is very short. Drupal offers easy access to contributed modules, which is great but can be the source of a disease that I call 'modulitis' - too many modules - which leads to bad performance. Most contributed modules use a broad ‘catch every scenario’ approach (‘kitchen sink approach’), which leads to feature bloat and sometimes bad performance; don’t forget that a module has to be loaded completely in memory, so every feature you don’t need is an extra cost on every page load. For example, the Adsense.module, while very useful in some cases, can most of the time be replaced by simply pasting the Adsense code in node.tpl.php (or where appropriate). Same thing with Google Analytics; there is a module for that, but just incorporating the code in the footer of every page does the job most of the time. I would also like to see more ‘modular’ modules in the future; it is often hard for maintainers to resist feature creep. Module ratings on D.O. would also help when choosing between the five image module solutions, the three different lightbox implementations, etc.
As a site developer, it is also easy to fall in the same kind of trap; adding tons of features users do not want / need and that lead to bad performance in general. I could have added voting, comments, a forum, a blog for every user, instant messaging, etc. I started with what users want: the news. I can add other things later on if there is significant demand for it.
Search: Meet the Sphinx
At first I used Google Site Search as the search solution, as I felt that Drupal's search.module could not keep up with a very large number of nodes (the site produced about 50 000 nodes in a month and a half during testing – what about in 5 years, with more sources?). With Google Site Search, however, updates were not instant (there is a significant delay between when an item is published and when Google crawls it) and the interface was very restrictive: even if I could integrate the results directly in the site, I could not alter the layout or results in any way. I discovered Sphinx Search thanks to a post made by chx on his blog. It is easy to configure and extremely fast - it indexes hundreds of thousands of nodes in mere seconds and searches are all returned in one second or less. Using a “main + delta” indexing scheme, I can index news items as soon as they are published.
In addition to the search form, I used one of Sphinx's built-in functions to generate our stopwords list (for clustering / classification).
Performance
Performance is always the #1 concern when building a site with a potentially huge number of nodes. Eureka Science News is quite complex, so I generated custom SQL for almost everything instead of using the very convenient Views.module - just about every query is optimized and thus very fast. Views is used exclusively for the Archive.
MySQL optimization is important and the Drupal page caching mechanism and the MySQL query cache are not an excuse to have badly optimized queries. What if I want users to register in the future? What about the one user hitting an uncached page that is forced to wait seconds while the page loads?
The site is hosted on a VPS at ServInt – I installed APC cache, tweaked Apache and MySQL but have not yet installed more advanced performance improvements like Memcache. Right now AB (Apache benchmark) gives us upward of 500 requests per second on a cached page, which is very satisfying.
I applied the YSlow principles thoroughly – a special thanks to Wimleers for his detailed post about YSlow vs Drupal – while I left things out like using a Content delivery network for now as I feel it is overkill, I keep a very interested eye on Wimleers CDN.module. Some of the things which made a small but noticeable difference on the page loading times:
- Minified CSS and JS using Yahoo YUI compressor
- Aggregated JS and CSS
- Used CSS sprites for small icons
- Used PNG crush and PNGslim to reduce the size of all the images on the site
I also used Drupal cache_set and cache_get functions for SQL intensive blocks that are on every page - like the ‘popular’ block.
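The pattern behind this is "look in the cache first, rebuild only on a miss". The Python sketch below shows the pattern only: `simple_cache_set`, `simple_cache_get` and `popular_block` are hypothetical in-memory stand-ins, not Drupal's cache_set and cache_get (whose signatures differ), and the five-minute lifetime is an arbitrary assumption.

```python
import time

_cache = {}  # key -> (value, expiry timestamp)


def simple_cache_set(key, value, ttl_seconds, now=None):
    """Store a value together with the time at which it expires."""
    now = time.time() if now is None else now
    _cache[key] = (value, now + ttl_seconds)


def simple_cache_get(key, now=None):
    """Return the cached value, or None if it is missing or expired."""
    now = time.time() if now is None else now
    entry = _cache.get(key)
    if entry is None or entry[1] <= now:
        return None
    return entry[0]


def popular_block(build, ttl_seconds=300, now=None):
    """Serve the block from cache; call the expensive `build` only on a miss."""
    html = simple_cache_get("popular_block", now)
    if html is None:
        html = build()  # e.g. runs the SQL-intensive 'popular' query
        simple_cache_set("popular_block", html, ttl_seconds, now)
    return html
```

The trade-off is freshness: with a five-minute lifetime the block can be up to five minutes out of date, which is usually acceptable for a 'popular' list.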
Be wary of implementing tons of JavaScript plugins, as that can affect load time performance. At one point the ‘recent images’ block on the front page was embedded into a JavaScript carousel, but it impacted the page load times significantly (users had to download additional images that they wouldn’t see most of the time, plus the actual JavaScript parsing and execution time was high - 500 milliseconds for that carousel alone). Sure it looked cool, but cool is not our #1 priority, performance is!
Drupal's page cache is great for anonymous users, but can also create problems; I wanted to display the amount of time elapsed since the publication of news items in some places (‘published 2 hours ago’) but with minute precision. I did not want users to see, for example, ‘published 1 minute ago’ for 10 minutes because the page was served from cache. I resolved the problem by porting Drupal's format_interval function to JavaScript, so the elapsed time is computed in the browser.
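The logic being ported is small. The sketch below is written in Python for consistency with the other examples on this page (the site itself runs the equivalent in JavaScript, in the browser); it is modelled on the behaviour of Drupal's format_interval with its default granularity of 2, but it is a re-implementation under that assumption, not the original code.

```python
# Units from largest to smallest: (singular, plural, length in seconds).
UNITS = [
    ("year", "years", 31536000),
    ("week", "weeks", 604800),
    ("day", "days", 86400),
    ("hour", "hours", 3600),
    ("min", "min", 60),
    ("sec", "sec", 1),
]


def format_interval(seconds, granularity=2):
    """Format a duration using at most `granularity` units, e.g. '1 hour 1 min'."""
    parts = []
    remaining = int(seconds)
    for singular, plural, size in UNITS:
        if remaining >= size:
            count = remaining // size
            parts.append(f"{count} {singular if count == 1 else plural}")
            remaining %= size
            granularity -= 1
        if granularity == 0:
            break
    return " ".join(parts) if parts else "0 sec"


if __name__ == "__main__":
    print("published " + format_interval(7200) + " ago")
```

The cached page then only needs to contain the publication timestamp; the script subtracts it from the current time on the client and formats the difference.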
The optimization paid off as the site was hit by Reddit (two different stories), YC News and Stumbleupon the day of the launch and the following days: the relatively modest VPS did not even break a sweat; page loads were still instant, even with about 3000 page views within a few hours, about 15 000 in a few days. Since then we have been hit by Slashdot multiple times and were featured on the front page of Mashable with no performance problems whatsoever. I feel confident that a single VPS will be able to handle the growth of the site for quite some time.
Miscellaneous
Developing on Windows
Doing web dev work on Windows is possible! I built everything using Windows exclusively – using Wamp and Notepad++ mainly. I also could not live without the Firebug and the ‘It’s all text!’ extensions for Firefox.
Of course developing on Windows has disadvantages; I was bitten by a strange bug where my text editor would sometimes insert an invisible character or line at the beginning or end of a module file, which would cause errors (white screens of death mainly); once I figured things out, I had to re-edit those files within Linux to get rid of the problem. Speaking of Linux, it is actually much easier than I previously thought; I started learning it a month ago to tweak the server; it is easy if you have DOS experience (but I hate VI – anything better to edit files?).
Moreover, my dev box got somehow infected by a nasty piece of adware that made IE and Firefox crash and opened tons of pop-ups. Fortunately, it happened post-launch and I had plenty of backups (on the server, on an external HDD and on a laptop). It still took me 3 days straight to clean completely.
Lessons learned / random thoughts
Here are a few lessons I learned in no particular order
- Finding a good domain name is hard and takes time (and / or money) – start early and never stop searching, even if you already have one or two good ones! You might find something even better.
- Backup everything often! Especially on Windows.
- Think simple; more is often less – do only what your users actually need / want as a starting point
- Don’t be afraid to redo something from scratch if it’s not working right the first time around - I rebuilt critical components of the clustering system days before launch
- Drupal is a market disrupting tool – it allows a single guy part time to build something great while learning it; imagine what a whole team of professionals can do!
- I wish I knew about simpletest a year ago; I chased bugs for a long time (and sometimes the same bug that was reoccurring). Things like clustering and regex-based parsing could have been a whole lot easier with appropriate tests.
- Release early; don’t be afraid to put off minor features to later
- Keep a todo list through the process, and try to remove items from it as fast as they appear! (this is not as easy as it sounds!)
- You know you built something great when you visit your own site and find it interesting :)
The future
The platform I built is highly adaptable to other news sites (sports, tech, general news source). The only problem is that press releases are not available in a centralized location as is the case for science; I might seek venture capital to license Associated Press news to build a news site similar to Newsvine (any VC reading this?). I am also toying with the idea of implementing anonymous comments on the site using Mollom. But of course, I have new unrelated projects to start, and a PhD to finish! I hope you enjoyed the write-up and that you will enjoy Eureka Science News! If you have any questions feel free to ask ;) As a closing note, special thanks to funkyhat who helped me tremendously with this write-up.

That is incredible. I'd say I'm speechless, but I have enough in me to say, "Awesome job." : )
(Quick question if you have a moment... any reason behind the . separator in URLs or did it just suit you?)
----------------------
Drupal by Wombats | Current Drupal project: http://www.ubercart.org
No real reason - they're all equivalent (dot hyphen and underscores) for search engines (well there's some controversy but it's all unfounded opinions). I'm already getting good traffic from search engines so I really don't think this kind of small details matter to Google and the like. Personally I think dots look better on long urls, as they're smaller.
---
Biology Articles
Works for me. : ) Thanks.
(I asked b/c my use of underscores in Ubercart URLs has been pointed out as bad for search engines, but as you mentioned... there hasn't been any lack of organic referrals from Google et al.)
----------------------
Drupal by Wombats | Current Drupal project: http://www.ubercart.org
It's all baseless speculation (there's lots of that in the SEO world); Matt Cutts from Google specifically said that underscores = dashes.
I love your site, but I'm a bit confused. In this post on "Dashes vs. Underscores", Matt Cutts says :
http://www.mattcutts.com/blog/dashes-vs-underscores/
Can you direct me to where Matt said "underscores = dashes."?
You are right, there is a lot of SEO snake oil floating around.
Thanks,
Shane
BWCA
You're right, this was true in 2005. In the meantime, Google changed their algo to recognize underscore as word separators, too : http://www.webpronews.com/topnews/2007/07/24/cutts-google-honors-the-und...
"One key development that Matt shared with the audience was that underscores in URLs are now (or at least very soon to be) treated as word separators by Google. That's great news, because it historically hasn't been that way. Back in 2005, Matt stated that Google did not view underscores in URLs as word separators."
Anyway I think underscores look terrible, I wouldn't use them :) Dashes are still the 'standard' but it doesn't matter anymore (or if it does, it's a very very very small factor).
Well put
"Drupal is a market disrupting tool – it allows a single guy part time to build something great while learning it"
Indeed, this is very true. Great write up :)
--
John Forsythe
Need reliable Drupal hosting?
Amazing Site
Thanks for sharing the details on creating your site via Drupal. Have been trying to learn how to make sites automated like http://ectio.us/ . That site does not use drupal but it's a simple site to read almost anything up to date by just clicking on a term in the tag cloud. And it's not updated by the webmaster.
But now with your site. It's my fav. Now I'm working on the idea again.
Best wishes on your PhD work.
But ectio.us was indeed *born* a drupal site... using one of those feed-to-node modules. Feel free to ask me specific questions about it. It's one of my favorite sites to fiddle with.
hi - Is it still a Drupal site? If so, what modules are you using? Many thanks!!
Ectio.us
No, not anymore, I had to rewrite it. Drupal was too intense for the shared server it was living on.
If I were to do it again with Drupal, there would be a lot more options for feed -> node modules. FeedAPI seems good but I haven't played with it too much.
Great Work!
Great work and beautiful explanation. Congratulations !
Phenomenal work and great looking site! Reading your write-up was encouraging as I now feel like I'm on the right track with some of the approaches I've taken thus far with my set-up. Many of the items you described resonated with me and I could relate to your lessons learned. I'm also a member of the 'have more time than money' club, ha!
CLAP! CLAP! CLAP! CLAP! CLAP!
WOW!
Really, really in awe here.
So, I guess another round of applause is in order.
CLAP! CLAP! CLAP! CLAP! CLAP!
Warm regards from sunny México!
Pepe
:-)
You might find pico a lot easier to deal with than vi.
________________________
dave hansen-lange
Developer
Advomatic.com
East Asia office
Hong Kong
Great Work
FiReaNG3L
Drupal does allow one man to build wonders !!!
------
GiorgosK
Geoland Web development / web marketing
Grid
Good job with BluePrint! I started building a BluePrint base theme a few months ago, but found it difficult to keep the vertical rhythm in Drupal, especially on form elements. That was mainly because of some drupal.css defaults that were not reset by BluePrint. I should give it another look with Drupal 6, as drupal.css is now optional.
elv, q about blueprint...
first, amazing site and incredible write-up - i would love to see this for other areas that don't get great coverage...
elv, regarding blueprint, any interest in porting/adapting for basic d6 use? i'd be open to sponsoring the work if it's reasonable...just ping me via contact if you're interested...
........................................................................
i love to waste time: http://twitter.com/passingnotes
There's not much to 'port' really, just add the Blueprint css to page.tpl.php and build your layout from there.
Awesome job
---
All Tutorials
http://www.alltutorials.org
Unbelievable work, very useful information. Thanks so much!
Great!
Great site and great description!
---
Drupal Theme Garden
amazing work
stellar effort.
A really great case study
Thanks for the attention to detail in your case study.
- Robert Douglass
-----
my Drupal book
Getting content question
Amazing what a good website single person can do while studying Drupal :)
You wrote that you get content by RSS aggregation, but RSS feeds provide only small bit of content, and you have full articles on your website. Do you somehow parse those websites?
---
A2 design
No; in the world of science, for just about every news item, you can find a press release on universities / research centers websites or in specialized repositories (such as Eurekalert). In fact, major science news sites such as Science Daily and PhysOrg only cut and paste these press releases; I just automated the process using a couple of regexes. Getting it to work right was hard as the format of press releases is not always the same!
---
Biology Articles
Thanks for the insight
Great work followed by an excellent write up.
I hope you built some fail-safe in there in the case where one of the PR sites become unreachable or change their URL structure?
---
Dee
iScene Interactive :: iScene.eu
Of course; the site can be switched to semi-manual mode (the press releases part). It's a weakness of relying on other websites. It's the same for the tagging system, which relies on Yahoo Term Extractor; should they discontinue their API, I'd have to find an alternative.
SimpleFeed
Could you please go into a little more detail on how you were able to make SimpleFeed parse the entire press release into a node on your site? When I try it, I only get the heading and a link to the original article. For instance, what do you mean by using regexes?
Also, what was the process like in getting the all the different blocks on the homepage. Was that done through a module or themeing?
This insight would be really helpful to me, and I am sure a lot of others would greatly benefit from this.
Thank You
Hmmm..
It would be kinda cool to be able to use the tools created here to build other news sites... anyone interested in Formula 1?
Heh
Eric
__________
Eric Aitala - f1m@f1m.com
The Formula 1 Modeling Website
www.f1m.com
Great work Michael!!
(The link to simplefeed project needs a fix.)
Nice catch, it should be http://drupal.org/project/simplefeed
Fix in original post? :D
I would but I can't edit it myself :)
Fixed
I hopped in and edited the link.
Lullabot loves you | Be a Ninja, join the Drupal Dojo
Brilliant site...
Great site, great write up. Bookmarked!. We need more news sites like this one. Getting everything on one site is the way I like it.
How does one get your images on the nodes like that though ... cck imagefield and imagecache I suppose? WIth the title at the top of the image and the images below each other that is... I'm fairly new to Drupal so still learning ...
Thanks for a brilliantly written article.
Interesting stuff! The design is seriously lacking, but could certainly be a lot worse!
____________________________________________________
Tj Holowaychuk
Vision Media - Victoria Web Design
I certainly am reading this
Awesome work! I love what you've done. I certainly will want to get in contact with you and learn from your experience. I think my work will be much simpler now that you've "broken ground" in the memetracker in Drupal arena. Hopefully with yours and other's help, many new memetracker news sites powered by Drupal will be running later this year.
I've just started working this week on GSoC but I'll be in contact with you soon.
Kyle Mathews
I'm glad you saw this Kyle, you should come on IRC from time to time ;) Feel free to email me or whatever if you want; lots of people showed interest in a memetracker (I received numerous emails after the publication of this write-up) and my implementation is a bit too specific for my site; I'm sure your module will be a better solution for those people :)
Yes, I support you!
I am waiting for your memetracker module so I can create a news site in another field.
Judging from the number of emails I received from people who would like to use my clustering engine, I think your module will generate a lot of interest! Unfortunately my implementation would need lots of tweaking / polish before it could be usable by the general public; I think your module will be a great solution for everyone who wants to build a memetracker. Did you setup a group on groups.drupal.org?
Ha! No, I tried but the content moderator said that because I had no code or issues yet there is nothing to discuss. That seems a bit short-sighted but whatever. I'll be writing up a report tomorrow on my research so far. I'll post in the SOC group for now. I'll ping you when it's up.
I'm excited to start coding up the memetracker and get it into people's hands. Are you willing to share code you've already written? Some of it might not fit the architecture I'm planning but polishing rough code is easier than writing new code.
Kyle Mathews
FYI -- Memetracker group created
For everyone interested in helping build the memetracker (or just want to keep track on its progress), I've created a memetracker group over at groups.drupal.org. Come over and join.
Kyle Mathews
Great write-up, thanks!
Of all of the Drupal website write-ups I've seen in the past year, yours is the only one that I read all the way through. I found it very interesting and informative.
I have a question regarding your use of blueprint. Did you use zen to integrate your blueprint design into a Drupal theme or another method?
The Cosmic Gift | Complete Computer Care | Team Hope
Excellent job!
The part about the Eureka Engine is very interesting, in fact so much that I want to start experimenting with that myself! It also made me wonder if this is strongly related to your studies?
I do DNA microarray analysis (in the context of HIV-1), so I applied some of the techniques I use, such as hierarchical clustering, so it's sorta related ;)
Fascinating Blog
Very cool description of the mechanics behind an automated site with good content. That is brilliant.
Amazing!
Wonderful concept, great execution and amazing site. Though some of the things you mentioned were a few levels above me. Thanks for the writeup.
==================================
test, retest and backup! no guarantees in life!
VPS Specs
?
____________________________________________________
Tj Holowaychuk
Vision Media - Victoria Web Design
VPS
Tell us more about the VPS - where is it hosted? The site sure is blazing fast! Nice writeup too!
Selwyn
Servint
It's on a signature VPS at servint, and I have still plenty of resources left on it.
http://www.servint.net/vps/index.php
2 GB Burst RAM
768 MB Guaranteed RAM
30 GB Storage
800 GB Monthly Transfer
---
Biology Articles
Haha damn! I pay the same with only 512MiB of RAM. Site does run well though, I'm assuming the majority of these pages are served static?
____________________________________________________
Tj Holowaychuk
Vision Media - Victoria Web Design
They upgraded all their plans very recently (for free!), it used to be 512 megs.
All pages are served cached since there's no user login yet (debating whether or not registration is useful on such a site).
Firstly: great write-up. Loved it.
Secondly: just wanted to note that I moved our site from a dual-core dedicated server with 512 MB of RAM (where a lack of #148849: Merge {node_comment_statistics} and {node_counter} into {node} and usage of private forums were killing the server) to a VPS with roughly the same specs. The site has gone from taking 5-15 seconds to load a page, to ~1 second.
Even if you're sharing resources with other users, a VPS on a screaming-fast machine is often better than a mediocre dedicated server.
Thirdly:
There is nothing better than Vi, ever, for anything. You'd better hide when the revolution comes. There'll be mandatory Vi/m lessons for everyone, and anyone discovered using Emacs will be thrown in the Pits of Ice! *insert evil laugh here*
But seriously: you can use nano. Vim is vastly superior to Nano, if you learn it, though. I use Vim for all my PHP development. :)
----
Web Design, GNU/Linux and Drupal.
I host on ServInt too on a
I host on ServInt too on a Signature plan and have been forever planning the kind of news site you've built. Nice to know that my VPS will hold up if I ever get around to actually doing it:-)
What an amazing effort you've put together! More important for the community is that you shared in such detail. Thanks a bunch.
On a different note, how close to or different from Development Seed's Managing News is your system? The aggregation and tagging parts sound pretty similar. I have been considering it for a while.
----
Previously user venkat-rk.
Great writeup !
Great writeup with attention to details.
Any plans to release any modules (or at least the modules based on other open source projects, like the Naive Bayesian Filter) to the community?
Great Design, but any RSS Copyright issue?
Really amazing work and engine. I am also developing a site using RSS feeds from the web, and I would like to know: is there any copyright issue with pulling RSS feeds from all these sources?
Popular Block
Thanks for the detailed write up - you've built a great site!
I wondered how you built the Popular Block so that it shows what is popular today, this week and this month. You mentioned above that you used the cache, so I imagine you created a custom query. Would you mind shedding some light on this?
This is something I've been looking to add to my spiritlibrary site for a while now!
Thanks
Ben
popular module
I've now managed to create a similar block on the home page of Spirit Library.
I created a module which, after a bit of tidying up, I should be able to release on drupal.org. At each cron run it calculates the views for each node within different time periods: 24 hours, 7 days and 30 days. These counts are then made available to templates and also in Views, so you can sort by them.
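For readers who want to see the idea in code, here is a minimal Python sketch of the per-period counting described above. It is not the module itself: the `(node_id, viewed_at)` log format and the names `WINDOWS`, `count_views` and `top_nodes` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical time windows matching the three periods mentioned above.
WINDOWS = {
    "day": timedelta(hours=24),
    "week": timedelta(days=7),
    "month": timedelta(days=30),
}


def count_views(view_log, now):
    """Return {window: {node_id: views}} for views that fall inside each window.

    view_log is an iterable of (node_id, viewed_at) tuples; this log format
    is an assumption made for the sketch.
    """
    counts = {name: {} for name in WINDOWS}
    for node_id, viewed_at in view_log:
        age = now - viewed_at
        if age < timedelta(0):
            continue  # ignore timestamps that lie in the future
        for name, span in WINDOWS.items():
            if age <= span:
                counts[name][node_id] = counts[name].get(node_id, 0) + 1
    return counts


def top_nodes(counts, window, limit=5):
    """Node ids for one window, most viewed first (ties broken by node id)."""
    ranked = sorted(counts[window].items(), key=lambda kv: (-kv[1], kv[0]))
    return [node_id for node_id, _ in ranked[:limit]]
```

Something like `count_views` would run once per cron pass, with the results stored so that the block itself only reads precomputed numbers.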
Help us improve forums
@FiReaNG3L: One of the things that comes to mind is that your application of statistical analysis to find groups of articles could be used to group forum topics on Drupal.org. It would be much easier to provide support for forum topics if like issues could be automatically grouped. Please put your sharp mind and existing code to work on this problem and produce a suggestion for how we can build machine intelligence into the Drupal.org redesign.
- Robert Douglass
-----
my Drupal book
I'll see what we can do
I'll see what we can do about that. I think clustering similar to what I did for Eureka Science News is out of the question unless you have access to a dedicated server just to process it; you have to calculate and keep relationships between every pair of nodes (n x n complexity). The cost grows much faster than linearly, so a thousand nodes might take 1 second to cluster, but 2000 might take 10 :)
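To make the n x n cost concrete, here is a small Python sketch that compares every pair of documents once. The Jaccard word-overlap measure is only a stand-in chosen for illustration; the comment above does not say which similarity measure the site uses, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of words (illustrative measure only)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_similarities(docs):
    """Compare every pair of documents once; returns {(i, j): similarity}."""
    token_sets = [set(text.lower().split()) for text in docs]
    return {
        (i, j): jaccard(token_sets[i], token_sets[j])
        for i, j in combinations(range(len(docs)), 2)
    }


def pair_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2
```

Doubling the number of nodes roughly quadruples the number of pairs to compute and store (`pair_count(1000)` is 499500, `pair_count(2000)` is 1999000), before any clustering work is done on top of them.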
I think a mysql fulltext `related` block would be much more efficient - I have those on my articles pages and it works very well to find highly similar posts. These can be cached and refreshed every few days on topics that receive hits.
I'll give you a server to play with
And a sample Drupal db (sanitized, of course) that I use for Solr research. If it's a matter of more machines, that's an easy problem to solve. Let me know if you want to pursue.
- Robert Douglass
-----
my Drupal book
Wow
This is really amazing. Web 3.0 will one day become what you have already implemented. Thanks for the detailed write up, you have singlehandedly caused many of us to rethink our baseline. It's really amazing work. -NP
Great Job
FiReaNG3L (or Michael),
You did a fantastic job on the site. I am really impressed by your end results.
Well done. Thanks for the explanation.
I am looking forward to your sharing of your "secret" sauce in more detail :-).
wouldn't it be possible to
Wouldn't it be possible to use the index generated by search.module to implement a similar clusterer module?
----
http://www.manalaa.net
Hey, great site, technically
Hey, great site, technically amazing, and great science behind it, too!
I don't mean to troll, but two things bother me about this kind of website:
a\ What about the copyright issues? If you are collecting RSS-s from other news sites and re-publishing them as a whole, and on an automated basis, isn't that a big copyright infringement? You could be facing issues similar to Google News. You don't even seem to be quoting your sources, which is even worse.
b\ This is a little over the top and more on a philosophical side... but I feel kinda strange to see another job (editor) go... Researching HIV sounds like a noble thing to do... But don't you guys feel that there should be some limits to the use of A.I.? We can make ourselves unnecessary altogether so easily. Right now, the world market is driven by the masses, so whatever is good for the masses wins. But soon the masses will have no value to contribute, as the machines will do every single job better, including intellectual jobs (like selecting articles for a news site about science). It's such a small step to imagine that soon the research published on this site will be done by A.I. as well... and soon after that, A.I. will also be the main consumer of these news, putting us humans out of the loop entirely... Yeah like I said, this is a bit over the top, but you are a scientist so I'd love to hear your thoughts about this! Am I paranoid or rightfully worried?
Cheers and congratulations on your site
I'm not republishing content
I'm not republishing content from RSS feeds. The full articles come from copyright-free press releases related to the content found in RSS feeds of other sites. Sometimes these sites (such as science daily and physorg) copy-paste the same press releases as the content - this might have led you to think that we were 'stealing' their content.
understood!
I see. That's a whole different story then! :)
Nice site, very
Nice site, very impressive.
However....
You shouldn't be republishing the RSS feeds in their entirety. It's fine to aggregate content from other sites but you should be aggregating links back to the original source, not republishing the content entirely on your own site.
At best this could be looked on as bad etiquette, at worst it's just plain stealing. I wrote a blog post about another site that was doing the same thing a while ago so I'm just going to link to that instead of going into it in detail here.
I would encourage you to change this detail of your site.
You misunderstand how the
You misunderstand how the site works. I'm not republishing full content coming from RSS feeds of any of the source on the site. The articles come from copyright-free press releases related to the news published by other sites (sometimes, their content is the copy-pasted press release, too, so that might have confused you).
As long as the content is
As long as the content is released in such a way that it can legally be re-copied there is nothing wrong with it. This, of course, wouldn't work for a lot of sites since it relies on the availability of essentially "open source" articles
Exactly; for another site
Exactly; for another site I'm planning to build, I'd have to license Associated Press content (and this is very pricey! so I can't do it yet)
The purpose of RSS feeds is republishing
I don't get this at all - at least not as a blanket rule. The purpose of RSS (and other syndication mechanisms) is to make information available for easy re-use. I think that at least in the absence of a clear and prominent notice to the contrary, the very act of providing an RSS feed implies permission for the redistribution of its contents.
If you don't want people aggregating the entire content of your articles, don't put it all in your feed. I got a threatening email from a pretty prominent blogger about a year ago, insisting I was violating his copyright for republishing the full text of his articles, when it was really his decision as to whether I had the full text or just a teaser available to me. He accused me of benefiting financially in some way from aggregating his feed, which was just utter madness. I felt (until that point) that he was pretty smart, and wrote good stuff which would be of interest to the people who were likely to visit my (advertisement-free, by the way) site, and I wanted to point them in his direction. If anything, I expected him to benefit from my aggregation of his feed, not me.
So I pulled his feed from my site, and now people who happen upon my site will likely remain ignorant of this fellow's existence. I hope this business strategy of keeping his brilliance a closely-guarded secret is working out for him. (I also notice, by the way, that his RSS feed now contains truncated teasers of his articles as per my advice to him.)
I agree that you could republish content from RSS feeds in ways that are unquestionably beyond the pale - no links back to the source, no attribution in order to imply that the aggregator is the original source, etc.. - but I think outside of these specific cases of outright fraud, anybody who feels aggrieved about how others are using their feeds should reconsider how much they want to put in their feeds before sending out threatening letters about their "intellectual property".
I think that "You shouldn't be republishing the RSS feeds in their entirety," is just way too broad a rule. The legal question of copyright violation and the ethical question of dishonesty must be kept separate from the technical means employed in one or another case. Virtually every interaction with the web is a technical copyright violation, so it's only the ethical question here that's worth seriously considering. Similarly questionable ethical practices on the web predate the existence of RSS by some time. And remember, Google (among others) is as we speak publicly reproducing your entire site, without your permission - implied or otherwise, via the cached copies in their index. If your concern is technical copyright violation - and Google is certainly deriving a direct financial benefit from screen-scraping your site; it's what their business is, after all - then there are bigger offenders out there than some individual who likes your work and wants to let other people know about it.
Considering every possible use of your full RSS feed a misuse that will either somehow harm you, or benefit someone else without delivering proper compensation to you is untenable.
Amazing Work
Congratulations, and thank you for the detailed information. Hey, source code would be nice, but I think you should hold onto it. The hints you have given are a great starting point for anyone interested in creating a similar site.
I'm curious about SEO duplicate content issues using this system. I'd like to hear other people's thoughts on this.
Regarding visual design, I think it is awesome. It is clean, looks professional, and the important stuff - the content - does not play 2nd fiddle to eye candy.
Again, well done!
Amazing
I would never have believed it could be done.
But here it is - amazing!
The SEO duplicate content is
The SEO duplicate content question is an unclear one; some people think that it applies to 'same content, different sites', some people think it's just if you duplicate the same pages on the same site. One thing is for sure: lots of very popular sites are replicating the same content (either AP news or press releases, etc.) and seem to have no problem getting hits from search engines. So I would tend to think that only the second case applies.
editing files on linux
To me, the advantage of editing files on linux is: since remote file systems are so transparent, I can just use my favorite text editor.
If you use kde, try fish://login@server/home/yoursite
bravo
oh i say.. bravo!
and as an Egyptian I can only be proud of your use of the sphinx search ;)
__________
http://namima.in-egypt.net latest actor news
I'm impressed!
Great job and even better sharing attitude!
hear, hear... excellent site
Hear, hear... excellent site and job!
It's really great work and
It's really great work and it may be the step to the future!
praise & question
Hi Michael,
Great job! Nice how you really worked out everything so well and shared it with us. thanks.
I especially like your use of blocks/panels, it makes the site very dynamic and informative
Could you explain one thing in more detail?
You say that you "(...) built a module to rank clusters of news and to find and parse related press releases"
How did you create that parser? What kind of approach did you use? Is it all in php? Did you make a parser per site?
Thanks for any details.
Robert.
Awesome work...
Great resource, and great of you to share your methods. Hope some VCs are taking note.
This is my new favourite portal to science news, as it often finds little gems that other news sites don’t think are interesting.
Can I make a couple of observations?
visited links are darker – makes them stand-out rather than fade into the background – slightly counter-intuitive.
some titles are wrapped onto several lines and ‘spaced out’. This makes the links non-contiguous, and I frequently ‘fall between the cracks’ when clicking.
Thanks for the site.
___________________
It’s in the detaιls…
Opera issue
Your site is not rendered correctly by (my version of?) Opera (9.27 on Ubuntu 8.04): the 'Psychology' header is shown under the red bar above 'Popular Science news articles', instead of in the top bar next to 'Physics'.
Not sure about the cause. Maybe it is a very specific thing, all the more so since you used http://browsershots.org. But as that shows that browser compatibility matters to you, I thought I'd let you know...
And congrats on the site. Very impressive, especially given that it's only a 'hobby project'.
Cheers
edit: I just noticed there's an issue with the 'automatically updated x ago' as well. Firefox says 10min, Opera says 14h10min. I'm inclined to believe FF. JS issue, or is it a local thing?
thanks
thanks for the information
otelara.org
Wow, this is just great
Wow, this is just great work. The automation part is priceless. May I ask where you are located? Also, what module is being used to create the Tabbed block on the bottom right that shows most read Today || 7 days || 30 days?
Congrats!
Are you for hire?
Impressive. I have a project I'd like you to develop if you're interested please let me know and I can provide details for a development quote.
I am blown away. Awesome
I am blown away. Awesome site and awesome case study. Thanks a lot for sharing this here!
--
Websites: SEO-Expert-Blog.com | Torlaune.de
thanks for the write up.
thanks for the write up. these case-studies are always very interesting. good luck with the phd :)
I too am using Blueprint CSS - it's easy to use and gives great results. Did you adapt it at all to make the markup more semantic?
Full Article from Feed, beyond Read More...
Has anyone figured out how Eureka pulls the full article from the feed into a node? When I try to do it, I only get a summary. I tried SimpleFeed, but maybe there is a setting I missed. Any insight into this would be greatly appreciated. Thank you!
Whether you get a full
Whether you get a full article or just a summary depends on the feed. The creator of the feed (the blog author or news site) chooses on their end whether to put the whole article in their feed or only a part.
Or this might be an issue with SimpleFeed. I've only used FeedAPI which returns full articles with aplomb.
Kyle Mathews
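A minimal Python sketch of the point above, assuming an RSS 2.0 feed: publishers that want to ship the whole article commonly put it in a `content:encoded` element, while `description` often holds only a teaser (some feeds put the full text in `description` instead). The function name and the sample feed used to try it are made up for illustration, and this is not how SimpleFeed or FeedAPI work internally.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def item_bodies(rss_xml):
    """Return (title, kind, text) per item, preferring the full body if present."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        full = item.findtext(CONTENT_NS + "encoded")
        summary = item.findtext("description", default="")
        if full:
            results.append((title, "full", full))
        else:
            # The publisher only put a teaser in the feed; nothing more to extract.
            results.append((title, "summary", summary))
    return results
```

If an item comes back as "summary", no aggregator setting will recover the rest of the article; the text simply is not in the feed.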
ah ha
Thank you for shedding light on that; full article or just summary. You just undid a big knot for me.
I'm trying to use FeedAPI, because it seems a little more backed by the Drupal community, but when you say aplomb, are you referring to another program, or just general "composure" ?
I don't pull full articles
I don't pull full articles from RSS feeds, as that would be a Bad Thing (stealing content from other sites). The full articles come from press releases on other sites. Simplefeed is only used as an aggregator.
how do you get?
So you select those sites manually and add them to your engine's list to get the full articles from their press releases? Could you explain a bit more? I feel some link is missing here...
a little clearer :)
Ok, that helps tremendously in my understanding of how your site is functioning and how you are using SimpleFeed; but are you using SimpleFeed to pull the full press releases from the universities, or are you manually importing the full press releases onto your site? That is really what I am trying to do: pull press releases from a company into nodes on my site. For instance, Caterpillar Corp. has a press release section with the latest PRs about their company. I would like to pull those into nodes. Thank you very much for helping me understand this a little better, I've almost fully got it.
For this I built a very
For this I built a very basic crawler; just use wget from PHP to get the page(s) you want, then use preg_match with a regular expression to get the title, the text, the source, links, images, etc. from the page. Of course, if the site you're crawling changes format, you want failure checks in place so your site won't be littered with nonsense :)
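The approach above can be sketched in Python (the original uses wget and PHP's preg_match; this is an illustration, not the site's code). The page markup, the `release-body` class name, the length threshold and the function name are all assumptions. Fetching is left out so the example runs offline; `urllib.request.urlopen` could stand in for wget.

```python
import re

# Hypothetical patterns for one source site; every site needs its own.
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(
    r'<div class="release-body">(.*?)</div>', re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")

# Assumed sanity threshold: shorter bodies are treated as a failed parse.
MIN_BODY_CHARS = 40


def parse_release(html):
    """Extract title and body text, or return None if the page looks wrong."""
    title_match = TITLE_RE.search(html)
    body_match = BODY_RE.search(html)
    if not title_match or not body_match:
        return None  # layout changed or wrong page: skip rather than publish junk
    title = TAG_RE.sub("", title_match.group(1)).strip()
    body = " ".join(TAG_RE.sub(" ", body_match.group(1)).split())
    if not title or len(body) < MIN_BODY_CHARS:
        return None
    return {"title": title, "body": body}
```

Returning None on any mismatch is the failure mechanism: a redesigned source page yields nothing instead of a broken article. Note that the body pattern stops at the first closing div, so pages with nested divs would need a real HTML parser.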
Very interesting
That is very interesting. Have you seen any good tutorials that you would recommend for building such a crawler? This sounds like something I would be very interested in doing, not just for the way you used it, but also as a possible means of importing articles from my old MediaWiki site. It seems I should be able to crawl my old MediaWiki install, which I still have up, and bring the articles into my new Drupal site using your technique. Do you think it would work for something like that?
French version available
This is just to let French readers know that this case study has been selected by the French community as an example, and is available in French. It is currently only available as a download, but should soon be made available on the Drupal France community site.
Wow that's neat (btw: I'm
Wow, that's neat (btw: I'm French Canadian)
it's my sample,thanks
it's my sample,thanks
I am very impressed myself
I am very impressed myself as well, great job on this!
HostV - VPS Hosting - http://www.hostv.com
Awesome job btw, I notice
Awesome job btw. I notice the spacing on the left and the right is not balanced. The content on the right side almost touches the border, so why not adjust the space on the left?
----------
My Drupal site: Philippines Travel - Come and Visit the beautiful Philippine Paradise!
by Busby SEO Challenge
Thank you
Thank you for the detailed information you've provided.
You've done great work and deserve congratulations.
Ghana Real Estate | Voacanga Africana Seeds | Shea Butter
Hi Well done. Great
Hi
Well done. Great site.
But have you asked the news sources for permission to publish full stories on your website? They might be copyrighted.
Web hosting sites