Looking for a crawler script
beponto - September 1, 2008 - 05:29
I am looking for a Drupal developer who can develop a crawler script that will read and download HTML/PDF content into Drupal nodes. The script will have to visit about 50 fixed URLs and then run the read/download process on each.
Please contact me if interested.
Thanks

bump
bump for no interest.
Which crawler are you
Which crawler are you looking to integrate with? Do you want it to be similar to the "Feed Aggregator Node" module of the FeedParser project? I suppose some scraping of the HTML will need to be done before it is added as a node. Is this a one-time extraction, or do you want an ongoing cron job?
- Rajeev Karajgikar, Drupaler in South Bay Area
not too familiar with this module
Thanks for your response. The source site does not have an RSS feed; it's pure HTML, so I am not sure the feed aggregator would work. One module that apparently does this is "Import HTML", but it seems too complicated and requires modifications on the server side that I am not too fond of...
It is a not-too-critical ongoing cron job (it will run once a month). Here is a sample page with the links to the HTML I want to extract: http://moscow.usembassy.gov/job_opportunities.html
I don't think a general
I don't think a general module can be done for this task.
It would depend on the HTML of each site, and it would give wrong results whenever a site changes its main HTML.
With RSS it's possible because a feed has a standard structure, which arbitrary HTML pages don't.
Even the most general module I can think of would need maintenance of its configuration to make sure it is still capable of reading the specific HTML you want.
Something like that would be hard to implement and might not even do the job the way you want.
Thanks
for your assistance. I have tried a few crawler tools as well, with no success. I may have to give up on this one.
Control over the HTML?
beponto,
Do you have any control over the contents (markup) of the HTML you're looking to download/import? If not, I agree that this would have to be maintained on an ongoing basis. We used to do this with a Perl script to grab news items from sites before the days of RSS. We would use regular expressions to tell the script what parts of the page are what, but whenever the pages changed (typically formatting changes that would happen every few months), we'd have to update our regular expressions. That's the beauty of RSS, but if it's not available, you could approximate it by adding in HTML comments or div tags to let your Drupal module know what to grab.
But if you don't have access to the HTML you're grabbing, then a maintainer would certainly need to keep an eye on it.
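To illustrate the marker idea, here is a minimal sketch in Python (not Drupal code, and not the original Perl script). It assumes the source page can be edited to wrap the interesting part in a pair of HTML comments; the marker strings and the sample page are made up for the example.

```python
import re

# Hypothetical marker comments; the source page would have to include them.
START = "<!-- drupal-import:start -->"
END = "<!-- drupal-import:end -->"

# Non-greedy match of everything between a start marker and the next end marker.
MARKED = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def extract_marked(html):
    """Return the stripped contents of every marked region, in document order."""
    return [chunk.strip() for chunk in MARKED.findall(html)]


# Made-up page used only to exercise the function.
sample = """
<html><body>
<div id="nav">Home | Jobs</div>
<!-- drupal-import:start -->
<h2>Vacancy: Clerk</h2><p>Closing date: open until filled.</p>
<!-- drupal-import:end -->
<div id="footer">Footer text</div>
</body></html>
"""

print(extract_marked(sample))
```

As long as the markers stay in place, formatting changes elsewhere on the page should not affect what gets extracted. Without control over the markup, the pattern has to key on whatever tags happen to be there, which is where the maintenance comes from.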
-Rob
custom module using wget
This can be done many ways within Drupal - but it would need some degree of code maintenance.
One simple way would be to develop a custom module that runs on cron and:
- uses wget's recursive functionality to crawl and download the targeted web pages
- saves the downloaded pages in a table
- extracts the meat you want from the pages in the table, anchoring on known ids/tags/patterns that exist in the HTML. This is the only part of the module code that may need maintenance, if the anchor identifiers around your target page's meat change
- adds new nodes to Drupal from the data thus scraped
I think the amount of code maintenance would depend on what you want to extract from the HTML pages. If your target sites are stable, with more or less static backend programs generating your meat, this maintenance may not be needed often. I guess there is no way around regular maintenance if you want to scrape data from other websites' pages.
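The steps above can be sketched outside Drupal as follows. This is only an illustration under several assumptions: Python's urllib stands in for wget, an in-memory SQLite table stands in for the Drupal table, plain dicts stand in for nodes, and the `<div id="content">` anchor and the example.com URLs are hypothetical. The regular expression stops at the first closing div, so nested markup would need a real HTML parser.

```python
import re
import sqlite3
import urllib.request


def fetch_url(url):
    """Download one page; stands in for the wget step."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def store_pages(db, urls, fetch=fetch_url):
    """Save each downloaded page in a table, replacing any older copy."""
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")
    for url in urls:
        db.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
            (url, fetch(url)),
        )
    db.commit()


# The only part tied to the target site's markup (hypothetical anchor id).
CONTENT = re.compile(r'<div id="content">(.*?)</div>', re.DOTALL)
TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def build_nodes(db):
    """Turn stored pages into node-like dicts; pages without the anchor are skipped."""
    nodes = []
    for url, html in db.execute("SELECT url, html FROM pages ORDER BY url"):
        body = CONTENT.search(html)
        if body is None:
            continue  # anchor missing: the page layout has probably changed
        title = TITLE.search(html)
        nodes.append({
            "title": title.group(1).strip() if title else url,
            "body": body.group(1).strip(),
            "source_url": url,
        })
    return nodes


# Demo with canned pages instead of real downloads (placeholder URLs).
fake_pages = {
    "http://example.com/a.html": '<title>Job A</title><div id="content">Details A</div>',
    "http://example.com/b.html": "<title>Redesigned</title><p>No anchor here</p>",
}
db = sqlite3.connect(":memory:")
store_pages(db, fake_pages, fetch=fake_pages.get)
print(build_nodes(db))
```

Keeping the two patterns in one place means a layout change on the target site only touches those lines. The second canned page shows what such a change looks like to the script: the anchor is gone, so the page is skipped rather than imported with wrong content.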
I hope this helps. Thanks
- Rajeev Karajgikar, Drupaler in South Bay Area