Need some help with my project planning
Hi Guys,
I am about to embark on my first Drupal project.
I come from a Joomla background and am fairly clever with PHP/MySQL scripts, though I am not a coder by any stretch.
The project I have in mind needs more than Joomla can offer, and looking at Drupal, most of what I need is available as contributions.
The project is basically a comparison search engine for specific vertical markets.
I need to scrape/import data from various sites and then serve them up as comparison search results. Similar to a shopping.com.
I am using a third-party web scraping script for now. I understand there is a dataminer API, but I don't have the skills to write a front end for it.
To store the data, I am planning on using CCK, as there are many custom bits of data associated with each "product".
To serve the data I am planning on using Sphinx.
My reason for choosing this over Apache Solr is that it can run on the one web server, and I am trying to keep costs down. I realise Solr might be a better solution for my needs (faceted search etc).
I have a couple of questions.
Eventually this project is expected to get large, many gigabytes of databases etc and I will run multiple servers but for now, as a concept, I am trying to keep it to one server.
As there will be a few installations of Drupal, a separate one for each vertical market, I am thinking of using Drupal's multisite setup (multiple sites from one codebase).
As I am using Sphinx, I want to be able to have a Sphinx install/index for each vertical market Drupal installation.
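To make the one-codebase idea concrete, here is a rough sketch in Python of how a hostname could map to a per-site directory and a per-site Sphinx index. It assumes the usual multisite behaviour of trying sites/ directories from the most specific hostname down to sites/default, and it ignores ports and sub-paths. The hostnames and the `idx_` naming scheme are made up for illustration; they are not something Drupal or Sphinx prescribes.

```python
def candidate_site_dirs(hostname):
    """Return sites/ subdirectory names to try, most specific first."""
    labels = hostname.lower().rstrip(".").split(".")
    # Drop leading labels one at a time: www.cars.example.com, cars.example.com, ...
    candidates = [".".join(labels[i:]) for i in range(len(labels))]
    candidates.append("default")  # fallback when nothing more specific exists
    return candidates


def resolve_site(hostname, existing_dirs):
    """Pick the first candidate directory that actually exists."""
    for name in candidate_site_dirs(hostname):
        if name in existing_dirs:
            return name
    return "default"


def sphinx_index_name(site_dir):
    """Hypothetical naming scheme: one Sphinx index per site directory."""
    return "idx_" + site_dir.replace(".", "_").replace("-", "_")
```

With directories for `cars.example.com` and `default` present, a request for `www.cars.example.com` would resolve to `cars.example.com`, and the matching index would be `idx_cars_example_com`. Each vertical then gets its own settings and its own index while sharing the code.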
Does anyone see any major difficulties with this at the moment?
Secondly, does anyone see any major difficulties scaling this across multiple servers, i.e. moving the databases to a different server and the code to another server or several, when time and money permit?
Should I be looking at separate installs of Drupal now in preparation for the scaling issue?
Also.
I want to be able to hide the fact that I am using Drupal. This is for two reasons: I don't want potential hackers getting an idea of what script I am using, and I also don't want my competitors figuring out that I am using Drupal. I understand that I can clean the URLs and make them SEO friendly, but what about image directory structures for templates, for instance? What other ways do people figure out a site is made with Drupal?
I realise that last issue can be seen as anti-community. No offence is intended; it's a commercial issue, and probably not an uncommon one.
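One way to see what gives a site away is to check a page for well-known markers yourself. Below is a small Python sketch that does this for a set of response headers and an HTML body. The marker list is illustrative rather than exhaustive, and the function name is made up. It assumes a stock setup where core scripts live under /misc, files under /sites/.../files, and the default fixed Expires date is still being sent.

```python
import re

# Markers commonly seen on stock Drupal sites; illustrative, not exhaustive.
HTML_MARKERS = {
    "files path": re.compile(r"/sites/[^/\"']+/files/"),
    "module or theme path": re.compile(r"/sites/all/(modules|themes)/"),
    "core script": re.compile(r"/misc/drupal\.js"),
    "Drupal.settings": re.compile(r"Drupal\.settings"),
}

# Fixed Expires date that Drupal core sends by default.
DRUPAL_EXPIRES = "Sun, 19 Nov 1978 05:00:00 GMT"


def drupal_telltales(headers, html):
    """Return labels of the telltale markers found in a response."""
    found = []
    for name, value in headers.items():
        if name.lower() == "expires" and value.strip() == DRUPAL_EXPIRES:
            found.append("Expires header")
    for label, pattern in HTML_MARKERS.items():
        if pattern.search(html):
            found.append(label)
    return found
```

Running this against your own pages after each change gives a rough idea of which signs are still visible. An empty list only means none of these particular markers matched, not that the site is unrecognizable.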
Thanks for any pointers that can be provided.

Filepaths/imagepaths
There is a module called http://drupal.org/project/uploadedfilesmover which will at least let you move your files (uploaded/user-contributed files) out of the telltale sites/default/files (or sites/yoursite.com/files or similar) if you are using FileField in CCK.
That said, a lot of Drupal's paths are configurable from the admin pages, so you will be able to move them.
However, a lot of files will be served from their default locations, and I'm not sure how you would get around that, short of stuffing your .htaccess file with some kind of rewrite rules to mess with people's heads. Others will be able to answer your question more fully, I guess.
Paul K Egell-Johnsen
Thanks
That's an awesome start; it points me in the right direction. It helps me heaps.
Thanks again
I also think
the .htaccess of Drupal restricts access to some directories containing code which is included and doesn't need to be served directly. In many PHP environments one is told to move the whole installation out of htdocs and into a directory which isn't served by the webserver. I tried to google this, but a search like "moving drupal out of htdocs" yields no results.
As for the files directory, it is configured from admin/settings/file-system, and most modules respect that root as the place to put their uploaded/downloaded/generated content.
.js and .css files will live with their modules, and have to be available from the outside world, and thus inside the sites/all/modules or similar place.
However, there are some cache modules which help with obfuscation, for example the boost module, which will try to put all html into a cache folder; and the performance settings (admin/settings/performance) can join all css files into one file and all js files into one file, serving them without letting the client know where the original files came from.
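The joining idea can be sketched in a few lines of Python. This is not Drupal's actual code, only an illustration of the technique: concatenate the files in load order and name the result after a hash of the original paths, so the module locations do not appear in the URL. The function name and the `.css` suffix are assumptions for the example.

```python
import hashlib


def aggregate(files):
    """Join several stylesheets into one file.

    `files` is a list of (path, content) pairs in load order. The output
    name is an MD5 hash of the paths, so the original locations are not
    visible in the file name.
    """
    joined_paths = "\n".join(path for path, _ in files)
    digest = hashlib.md5(joined_paths.encode("utf-8")).hexdigest()
    body = "\n".join(content for _, content in files)
    return digest + ".css", body
```

Note that this only hides the paths in the file name. Paths can still show up inside the combined file, for example in url() references to images.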
So all in all it seems that you can make Drupal pretty unrecognizable from the outside; but there is no guarantee, and there are probably a lot of telltale signs for sniffers with some experience; but that is probably true for a lot of systems.
BTW, look into feedapi. If your data sources use XML, for example, feedapi has an additional module called extensive parser which will take care of all the data in an XML document. You will probably also have to look into cck and views for what you want to do.
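To give an idea of what "taking care of all the data" amounts to, here is a minimal Python sketch that flattens each item of an XML feed into a record whose keys could later map onto CCK fields. It does not reproduce how feedapi works internally. The feed layout and the element names (`product`, `name`, `price`, `merchant`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; real sources will use their own element names.
SAMPLE_FEED = """<products>
  <product>
    <name>Widget</name>
    <price currency="USD">9.99</price>
    <merchant>Shop A</merchant>
  </product>
  <product>
    <name>Gadget</name>
    <price currency="USD">19.50</price>
    <merchant>Shop B</merchant>
  </product>
</products>"""


def parse_products(xml_text):
    """Turn every <product> element into a flat dict of its child values."""
    root = ET.fromstring(xml_text)
    products = []
    for item in root.iter("product"):
        record = {}
        for child in item:
            record[child.tag] = (child.text or "").strip()
            # Keep attributes too, e.g. price_currency.
            for attr, value in child.attrib.items():
                record[child.tag + "_" + attr] = value
        products.append(record)
    return products
```

Each record then has one key per element or attribute, which is roughly the shape you need before deciding which CCK field each value belongs in.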
Paul K Egell-Johnsen
Boost
The Boost module does nothing to help with obfuscation; its .htaccess rewrite rules make Boost an unobtrusive caching system that kicks some serious butt. Beautify can remove all extra whitespace characters from your HTML. Change the Expires headers and turn off JavaScript if you can; there are many more options with system URL aliases.
My mistake
You are right of course.
Paul K Egell-Johnsen