web crawler

staminna - August 19, 2008 - 14:40
Project: Sphinx search
Version: 5.x-1.x-dev
Component: Miscellaneous
Category: feature request
Priority: normal
Assigned: Unassigned
Status: by design
Description

Hello mate!

Will you be releasing for Drupal6?
Is there a way to use the spiders to index outside the project? I mean to index the whole network beyond the realms of the local search?

Regards,
/Dex

#1

markus_petrux - August 19, 2008 - 19:15

>> Will you be releasing for Drupal6?

Yes, this is planned some time in the near future.

>> Is there a way to use the spiders to index outside the project? I mean to index the whole network beyond the realms of the local search?

Ugh! :-)

Sphinx is a search engine, but you have to give it the data somehow. It currently supports mysql/postgresql and xmlpipe source types. The former let Sphinx build its indexes by pulling data directly from an SQL query. The latter lets Sphinx build its indexes by opening a pipe and reading data in a particular XML format.
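To give an idea of what the pipe approach looks like, here is a rough sketch of a script that prints documents in the schema-based xmlpipe2 variant of that XML format. Treat it as an illustration only: the choice of xmlpipe2, the build_docset() helper, the "title" and "body" field names and the sample document are all assumptions made for the example, not something this module ships.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(documents, fields=("title", "body")):
    """Return an xmlpipe2-style document set as a string.

    documents: iterable of (doc_id, {field_name: text}) pairs.
    The field names are hypothetical; use whatever your data has.
    """
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>"]

    # Declare the full-text fields once, before any document.
    lines.append("<sphinx:schema>")
    for name in fields:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    lines.append("</sphinx:schema>")

    # Emit one element per document, escaping the text content.
    for doc_id, values in documents:
        lines.append('<sphinx:document id="%d">' % doc_id)
        for name in fields:
            text = escape(values.get(name, ""))
            lines.append("<%s>%s</%s>" % (name, text, name))
        lines.append("</sphinx:document>")

    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import sys

    # Made-up sample data standing in for whatever a crawler collected.
    sample = [(1, {"title": "Hello", "body": "Sphinx & Drupal"})]
    sys.stdout.write(build_docset(sample))
```

In sphinx.conf, a source of type xmlpipe2 would then point its xmlpipe_command at a script like this, and the indexer would read whatever the script writes to standard output.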

There might be more alternatives, but you may wish to look here for a possible way to use Sphinx with OpenWebSpider:

http://www.sphinxsearch.com/forum/view.html?id=746

OpenWebSpider is open source, written in C#, and works with MS.NET/Mono (multiplatform). It's a multithreaded engine that parses web pages and stores them in MySQL, so you could set up Sphinx to build its own indexes from the OWS tables. Finally, you could probably use Drupal as the front-end site for users to search your data.

However, I'm afraid this is far beyond the scope of this sphinxsearch project. Web crawling is so complex that I believe it is not possible to achieve using PHP.

#2

markus_petrux - August 19, 2008 - 19:16
Component: User interface » Miscellaneous

#3

markus_petrux - August 19, 2008 - 19:19
Status: active » by design

hmm... it looks like the issue statuses are somewhat limited when it comes to feature requests... I would say "by design" here, since the second question is about something not really planned for this project.

 
 
