Hi all
Background info:
I am performing a drupal_http_request() on a URL supplied by authenticated users. I would prefer to detect whether the HTML returned was generated by Drupal, WordPress or "unknown" without asking the user to supply that information (e.g. via radios on the form). I know I can obtain the Content-Type from the response headers, and I intend to offer an option to import from RSS feeds (.xml), which are much easier for me to parse before storing in the database. However, I also want to be able to import data directly from the HTML page (essentially I am scraping the page to extract individual posts, e.g. blog nodes).
My question is this:
Is there a definitive signature (or fingerprint) in the HTML markup that identifies the application which generated the page as either Drupal or WordPress?
Initial research suggests:
- Drupal: <div> containers with class "nodetitle" or "node" wrap another <div> container with class "content".
- WordPress: <div> containers with an id like "post-1234" and a class like "post-1234".
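To make the idea concrete, here is a rough sketch of how the two markers above could be turned into a guess. It is written in Python with only the standard library, purely for illustration; the names guess_cms and _MarkerCounter are made up for this example, and the markers are just the ones from my initial research, so a custom theme can defeat it. It also simplifies the Drupal rule by counting "node"/"nodetitle" containers without checking for the nested "content" <div>.

```python
import re
from html.parser import HTMLParser

# Assumption: WordPress post containers look like id="post-1234" / class="post-1234".
WP_POST_ID = re.compile(r"^post-\d+$")


class _MarkerCounter(HTMLParser):
    """Counts <div> elements that carry the Drupal or WordPress markers."""

    def __init__(self):
        super().__init__()
        self.drupal_hits = 0
        self.wordpress_hits = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        element_id = attr.get("id") or ""
        # Drupal marker (assumed theme default): class "node" or "nodetitle".
        if "node" in classes or "nodetitle" in classes:
            self.drupal_hits += 1
        # WordPress marker (assumed theme default): id or class "post-<number>".
        if WP_POST_ID.match(element_id) or any(WP_POST_ID.match(c) for c in classes):
            self.wordpress_hits += 1


def guess_cms(html):
    """Return "drupal", "wordpress" or "unknown" based on marker counts."""
    counter = _MarkerCounter()
    counter.feed(html)
    counter.close()
    if counter.drupal_hits > counter.wordpress_hits:
        return "drupal"
    if counter.wordpress_hits > counter.drupal_hits:
        return "wordpress"
    return "unknown"
```

A tie (including zero hits on both sides) falls back to "unknown", which is the case where the user would still have to be asked.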
My concern is that theme developers can (and may choose to) override these system defaults, which would break my extraction code. I also do not have the time or resources to investigate thousands of blog pages to find out which structures are most commonly used to format blog nodes on Drupal and WordPress sites.
I would welcome any comments, tips and guidance from the community.
Many thanks
Comments
QueryPath
It looks like the QueryPath library and its API are perfect for pulling out the parts of the document I am most interested in, but I still need to know which specific tags, ids and class attributes to use to locate those sections.
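For illustration only, this is the kind of extraction I have in mind once the right class is known. It is not QueryPath code (QueryPath is a PHP library); it is a standard-library Python sketch, and the names DivTextExtractor and extract_div_text are hypothetical. The target class (for example "content") is passed in, because which class to use is exactly the open question.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collects the text inside every <div> that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside the current match; 0 = outside
        self.buffer = []    # text fragments of the current match
        self.results = []   # one string per matched <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # Nested <div> inside a match: track it so we find the right close tag.
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the text.
                self.results.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_div_text(html, target_class):
    """Return the text of each <div> whose class list contains target_class."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_div_text(page_html, "content") would return one string per matching container; the same function could be pointed at a different class for WordPress once a reliable one is identified.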