JSON interface to project search [#363466]

Comment	File	Size	Author
#20	363466-plugin-manager-repository.patch	9.81 KB	Anonymous (not verified)
#19	363466-plugin-manager-repository.patch	19.31 KB	Anonymous (not verified)
#18	363466-plugin-manager-repository.patch	9.82 KB	Anonymous (not verified)
#9	repo.patch	11.72 KB	Anonymous (not verified)
#5	repo.tar_.gz	3.79 KB	Anonymous (not verified)
	dorg-repository.patch	10.14 KB	Anonymous (not verified)

Comment #1

we/he/they

commented 24 January 2009 at 09:12

Status:

Needs review

» Postponed

I don't want to add this code, and have you write your own code to use it, since it perpetuates the evil assumption that there's a 1:1 relationship between release nodes and files.

See #11416: Please provide *.zip downloads. and #179471: release file attachments should use drupal upload functionality.

Those issues are about to change as part of the D6 port (one way or another), so I can't commit/deploy code that only makes fixing that problem harder.

Comment #2

Anonymous (not verified) commented 24 January 2009 at 10:38

Status:

Postponed

» Postponed (maintainer needs more info)

It's possible that I'm missing the part that assumes there is a 1:1 relationship between release nodes and files. It's also possible that I just did a bad job of phrasing the intent of the patch.

The first (and most important) file, build_repository.php, uses project_release_nodes to tell whether a release is present for a specific API version. The outputted XML file should be the same regardless of the number of files (or their types) for a release. Only whether or not there has been a release for a particular API version will affect it.

The second file, serve_md5sum.php, takes a file path and name as an argument and returns the appropriate md5sum for this file. Different files will have different md5sums, but since this is the way that md5sums work, I believe that would be the expected behaviour. The release doesn't matter, only the filename used to access the file.

The last file, .htaccess, just makes it easier to use serve_md5sum.php.

After reading the issues linked above, I just don't see how either of them relates to the issue at hand. If there is still a problem that I am missing, please try explaining it again so that I can rewrite the patch. I do not believe that development on the plugin manager can continue until I have sufficient feeds available.

J Rogers

Comment #3

dww

we/he/they

commented 24 January 2009 at 11:02

Status:

Postponed (maintainer needs more info)

» Postponed

+// Find the appropriate md5sum.
+$row = db_fetch_array(db_query(
+  "SELECT file_hash
+   FROM {project_release_nodes}
+   WHERE file_path = '%s'", FILE_PATH_PREFIX . $_GET['q']));
+
+// Display the md5sum js.
+if (!$row) {
+  print drupal_to_js(array('error' => TRUE));
+}
+else {
+  print drupal_to_js(array(check_plain($_GET['q']) => check_plain($row['file_hash'])));
+}

That assumes that {project_release_nodes} contains the file_hash and file_path data. That's not what's going to happen by the end of next week. There's going to have to be a separate {project_release_files} table with a different schema, etc, etc.
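To illustrate the schema concern, here is a minimal sketch of a one-to-many layout in which hashes live in a separate files table rather than on the release row. The table and column names (release_nodes, release_files, nid, fid) and the hash values are invented for illustration and are not the actual project_release schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one release that has two files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE release_nodes (nid INTEGER PRIMARY KEY, version TEXT);
        CREATE TABLE release_files (
            fid INTEGER PRIMARY KEY,
            nid INTEGER REFERENCES release_nodes(nid),
            file_path TEXT,
            file_hash TEXT
        );
        INSERT INTO release_nodes VALUES (1, '6.x-1.0');
        INSERT INTO release_files VALUES
            (1, 1, 'files/projects/example-6.x-1.0.tar.gz', 'placeholder-hash-1'),
            (2, 1, 'files/projects/example-6.x-1.0.zip', 'placeholder-hash-2');
        """
    )
    return conn


def hashes_for_release(conn, nid):
    """Return {file_path: file_hash} for every file attached to one release."""
    rows = conn.execute(
        "SELECT file_path, file_hash FROM release_files WHERE nid = ? ORDER BY fid",
        (nid,),
    )
    return dict(rows.fetchall())
```

With this layout a lookup by release returns several hashes, which is why code that reads a single file_hash column off the release row stops working once a release can carry more than one file.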

As I said, I haven't looked too closely at every line of your patch, I just know that it's not going to land until the dust settles on the project_release schema in the D6 port, which (with all due respect to the plugin manager) is higher priority.

If you want to help with the discussions/code around these file schema changes, and the UI implications of having multiple files per release, that'd be the best way to make progress on this issue.

Cheers,
-Derek

Comment #4

Anonymous (not verified) commented 22 February 2009 at 21:31

Since d.o has moved over to D6 now (along with a very functional project module) would it be safe to rewrite this bit of code or are there other planned db changes in the near future?

Thanks,
JR

Comment #5

Anonymous (not verified) commented 5 March 2009 at 16:20

Status	File	Size
new	repo.tar_.gz	3.79 KB

I have rewritten the code to take advantage of the new project schema for D6. Also, the purpose of the serve_md5sum (now serve_release_info) has changed. Given a project name and api version it returns every md5sum, filepath and version that goes with it. This supports multiple files per release.

If this meets approval then I should be able to finish plugin_manager 2. If not, then I'm certainly willing to redo it as necessary.

Thanks,
JR

Comment #6

Anonymous (not verified) commented 5 March 2009 at 16:21

Status:

Postponed

» Active

Comment #7

greggles

he/him

English

Denver, Colorado, USA

commented 10 March 2009 at 15:28

Status:

Active

» Needs review

More appropriate status given that it has code.

Comment #8

dww

we/he/they

commented 10 March 2009 at 16:46

Status:

Needs review

» Needs work

Eeek, any reason not to just have a patch file? Why the .tar.gz? Thanks.

Comment #9

Anonymous (not verified) commented 19 March 2009 at 14:01

Status:

Needs work

» Needs review

Status	File	Size
new	repo.patch	11.72 KB

Sorry about taking so long. I just got back from Tennessee. (Spring Break.) Also, sorry about the .tar.gz. Here it is in the form of a patch.

Comment #10

Anonymous (not verified) commented 31 March 2009 at 23:28

Would it help any if I were to give a more detailed explanation of what this should help accomplish that cannot be accomplished now?

Comment #11

dww

we/he/they

commented 1 April 2009 at 00:49

Yes, that'd be helpful. I have severe doubts about where this is heading. See the following comments:

#395472-19: Plugin Manager in Core: Part 1 (backend)
#395472-25: Plugin Manager in Core: Part 1 (backend)
#395472-29: Plugin Manager in Core: Part 1 (backend)

Thanks,
-Derek

Comment #12

greggles

he/him

English

Denver, Colorado, USA

commented 1 April 2009 at 01:59

@dww - you've certainly made your points there about what does/doesn't belong in core. However...if we want the idea of browsing modules inside of the site itself to be able to get any real world testing then it will have to happen first in contrib. In order for that to be possible...we need something like these files.

Otherwise, Joshua would have to build a d.o scraper to get the data and his own server to host that information and point plugin_manager at that.

So, understood that this may not be the best way but let's at least give it a chance so that we can see if people like it.

Comment #13

Anonymous (not verified) commented 6 April 2009 at 16:03

Check out the prototype: http://joshuarogers.net/admin/build/plugin_manager. The data was scraped off the drupal website. Sometimes the scraper grabbed the wrong data. Sorry. It was just meant to show what could be done.

This is most definitely unfinished, but I would still suggest trying it. Search for "facebook" or "plugin manager" or "chatroom" or any other term you want to. The goal is to eventually give the user a simple (or advanced) way to search for different modules and themes that they can install.

Having the ability to search for new stuff without having to leave your own site seems to give a feeling of safety to those who aren't very familiar with the inner workings of Drupal. For those like me that can do it, we should be able to have more advanced searches eventually. Giving a full text search just gives more potential to give better results.

I know there are several people who really want to get this into core. That would be really exciting. Honestly though, what I want more than that is to get this module polished. That led to this conversation in the usability group: http://groups.drupal.org/node/17765. I personally feel that this would be one giant step toward usability.

Comment #14

dww

we/he/they

commented 6 April 2009 at 17:04

This is why I'm hesitant to spend any time on this issue. You keep talking about all the cool features without ever mentioning the underlying reality that this system faces: WAY TOO MUCH DATA. I'm sure the usability group is happy to bikeshed about the cosmetic details and gush about how wonderful this would be for UX without a drop of concern for how the code would actually work and what the implications would be.

"Giving a full text search just gives more potential to give better results."

And it gives users the chance to wait while you do expensive string operations iterating through every byte of a 5 meg (and growing) array of module/theme data... assuming you haven't already run out of RAM just parsing the XML in the first place. ;)

The only reason your prototype is at all functional (haven't tried it yet) is because you have a subset of the current data and/or your PHP configuration is intentionally very forgiving for RAM and CPU usage. If you had the full data as of today, you might already be in trouble, certainly given the RAM limits on most shared hosting accounts (which, clearly, are the target audience for this feature). Plus, as I pointed out in the issue where people are talking about trying to do this in core, the dataset is growing a lot, perhaps exponentially. There's "only" [sic] 4560 published projects hosted on d.o right now. There have been 50 new projects added in the first week of April. I'd be interested to see a graph of the # of new projects a week -- wonder how typical that number is. Point being, by the time of the D7 code freeze (Sept 1st), there will likely be another ~1000 projects (20 weeks * 50 projects/week). By the time D7 is released, there might be 7K-8K projects, and by the time it's EOL there might be 20K or more. Multiply those figures by the amount of metadata you want to include in your internal browser "for the most awesome functionality" (e.g. full text of the body, taxonomy terms, project title, perhaps maintainer information, etc), and we're looking at gargantuan amounts of RAM that we'd need to have available for this plugin browser module to work at all.
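The arithmetic behind those projections can be written out as a small sketch. The project counts and the "5 meg" figure come from this comment; the assumption that the data size scales linearly with the number of projects is an added simplification, not something measured.

```python
def projected_projects(current, per_week, weeks):
    """Project count after a number of weeks at a constant weekly rate."""
    return current + per_week * weeks


def scaled_size_mb(current_mb, current_projects, future_projects):
    """Data size if it grows in proportion to the project count (assumed linear)."""
    return current_mb * future_projects / current_projects


at_code_freeze = projected_projects(4560, 50, 20)  # 4560 + 20 * 50
size_at_20k = scaled_size_mb(5, 4560, 20000)       # 5 MB scaled up to 20K projects
```

Under those assumptions the count at code freeze comes to 5560 projects, and a 5 MB array scaled to 20K projects would be about 22 MB before any extra metadata is added per project.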

"But it'd be such a cool feature, and all the usability folks think so, too" won't convince me otherwise. I agree it *would* be cool, but it just reminds me of all the code in project.module itself that worked fine when d.o had O(100) contributed projects which completely melted and fell apart when we hit another order of magnitude in the number of projects. I've spent way too much time the last 2 years re-writing this code I inherited and fixing problems resulting from this growth to let anyone else naively walk down the same path again.

I'm *not* being an obstructionist, and I'm not resistant to change. I'm genuinely concerned that you're walking into a minefield with your eyes closed, and I don't want to see your legs blown off. And, I don't want me (or someone else on the infra team) to be tending the lawn in the minefield for the indefinite future...

Comment #15

Anonymous (not verified) commented 6 April 2009 at 18:48

"I'm *not* being an obstructionist, and I'm not resistant to change. I'm genuinely concerned that you're walking into a minefield with your eyes closed, and I don't want to see your legs blown off."

The thought never crossed my mind. I honestly believe that we are both wanting to do whatever needs to be done for Drupal's future. Though I definitely disagree with the outcome, I respect and appreciate your concern. :) Back to the fun stuff though...

"And it gives users the chance to wait while you do expensive string operations iterating through every byte of a 5 meg (and growing) array of module/theme data... assuming you haven't already run out of RAM just parsing the XML in the first place. ;)"

I took this into consideration during the design phase. XML files are being processed with XMLReader. It only loads into memory the section of XML that it needs to parse at any given instant. It is extremely light on resources. I didn't want to use too much memory.
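As an illustration of the streaming approach described here, the following sketch uses Python's xml.etree.ElementTree.iterparse as a stand-in for PHP's XMLReader. The element names (projects, project, short_name, title) are assumptions made for the example and are not taken from the actual repository file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the repository file; the real one would be far larger.
SAMPLE = b"""<projects>
  <project><short_name>coder</short_name><title>Coder</title></project>
  <project><short_name>views</short_name><title>Views</title></project>
</projects>"""


def iter_projects(stream):
    """Yield (short_name, title) one project at a time from an XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "project":
            yield elem.findtext("short_name"), elem.findtext("title")
            # Drop this project's children and text once it has been handled,
            # so only an empty element shell stays attached to the root.
            elem.clear()


for short_name, title in iter_projects(io.BytesIO(SAMPLE)):
    print(short_name, title)
```

The parser never needs the whole document as one structure; each project is handed over and then released, which is the property that keeps memory use low.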

"The only reason your prototype is at all functional (haven't tried it yet) is because you have a subset of the current data and/or your PHP configuration is intentionally very forgiving for RAM and CPU usage."

I'm pretty sure that I'm using most if not all of the data (at least as of a week or two ago.) Also, mysql is doing the text searching. Since it supports full text indices, I believe that could help out greatly in terms of resource cost.

I do understand that there are likely to be more and more people using Drupal. Quite honestly, I'm thrilled by that prospect. I also realize that many people have limited resources. That's great, because I do too. I'm not trying to get this added to core though. Right now I'm just wanting to make a completely optional contrib module.

"'But it'd be such a cool feature, and all the usability folks think so, too' won't convince me otherwise. I agree it *would* be cool, but it just reminds me of all the code in project.module itself that worked fine when d.o had O(100) contributed projects which completely melted and fell apart when we hit another order of magnitude in the number of projects."

I'm not worried as much about the cool features as the useful features. I also can't prove that the same problem won't happen here. I can give two options though:

1) We actually run the script to see how large it would be. I'll take the gzipped file, double it and see if the system can still stand.

OR

2) We let d.o handle all of the searching itself. It has already been pointed out that Drupal's infrastructure can handle a heavy load. I can't say that this would be the way I would want to go though. (Or that it would have any shorter wait times.)

J Rogers

Comment #16

Anonymous (not verified) commented 6 April 2009 at 18:54

Actually, the more I think about it, the more okay I would be with putting the actual search portion on d.o and just returning XML or JSON to the browser or server that requests it. Would that be a more acceptable avenue to take?

Comment #17

Anonymous (not verified) commented 17 April 2009 at 20:10

Title:	Add Full XML Repository for use by plugin manager	» JSON/XML interface to project search
Status:	Needs review	» Active

Would it be possible to create an interface that can return JSON/XML results to searches through the various projects hosted on Drupal.org? This method provides several advantages:

This would completely eliminate the need for a local repository.
This would lower the memory footprint of the plugin manager module to just a few kilobytes.
Since the alternative would be to use d.o's search, this should be no more stressful than someone searching normally.
It would be able to scale as far as d.o can.

I believe that would negate most of the earlier issues that were raised without causing any lack of functionality.

Comment #18

Anonymous (not verified) commented 4 May 2009 at 00:42

Component:	Miscellaneous	» Code
Status:	Active	» Needs review

Status	File	Size
new	363466-plugin-manager-repository.patch	9.82 KB

@dww: You were right. The other method was a bad idea. I'm hoping that this corrects that.

The attached patch creates two files: .htaccess and plugin-manager-repo.php. It allows three functions: get a list of project categories, get a list of releases with md5sums (and sorted by recommended version) and search for projects in a particular category with a particular API version with a certain term. (This uses the same Solr search that powers the searching on d.o)

All queries return either XML or JSON data. This allows the plugin manager to work without a copy of the repository. (Thus with much lower overhead.)

If this is inadequate, I am more than happy to work on it. I do need this functionality for my project.

Thank you.
J Rogers

Comment #19

Anonymous (not verified) commented 4 May 2009 at 04:32

Status	File	Size
new	363466-plugin-manager-repository.patch	19.31 KB

Corrected some typos that got by me the first time.

Comment #20

Anonymous (not verified) commented 4 May 2009 at 04:34

Status	File	Size
new	363466-plugin-manager-repository.patch	9.81 KB

And finally curse the fact that kwrite automatically creates files ending with ~. Yeah. I forgot it was going to create one of those before I submitted the patch.

Comment #21

jpetso commented 4 May 2009 at 05:40

You know you can disable that "feature" in KWrite :P

Comment #22

Anonymous (not verified) commented 4 May 2009 at 15:11

To be honest, I had never really stopped to check. :P Anyway, I've disabled it now.

With this patch in place, this is what a plugin manager install would look like:
1) Upon loading the plugin manager, the browser would load whatever.drupal.org/json/categories/6.x to get a list of all categories that are available.
2) Once the user is ready to search, the browser would load whatever.drupal.org/json/search/6.x?text=my search string&tid=15&page=1. This would return the results to the browser for display.
3) Once the user is ready to install, the browser would load whatever.drupal.org/json/release/6.x/coder for example. It would get a list of all releases with md5sums and filepaths. The releases are sorted by recommended version, then major, then minor, then patch. Thus, the one at the top of the list should be the newest recommended.
4) The browser would post the md5sum and version of the release to install to the local server.
5) The server would download whatever.drupal.org/xml/release/6.x/coder. It would read the version and the download path.
6) The server would download the file for the appropriate version then compare the generated md5sum to the posted md5sum.
7) The rest of the magic happens here.
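The request URLs in steps 1 to 3 and the release ordering in step 3 can be sketched as follows. The host whatever.drupal.org is the placeholder used in this comment, and the release fields (recommended, major, minor, patch) are assumed names for illustration, not the fields the patch actually returns.

```python
from urllib.parse import quote, urlencode

BASE = "http://whatever.drupal.org"  # placeholder host from the steps above


def categories_url(api):
    """Step 1: list of categories for an API version."""
    return f"{BASE}/json/categories/{quote(api)}"


def search_url(api, text, tid, page=1):
    """Step 2: search within a category; the query string is URL-encoded."""
    query = urlencode({"text": text, "tid": tid, "page": page})
    return f"{BASE}/json/search/{quote(api)}?{query}"


def release_url(api, project, fmt="json"):
    """Steps 3 and 5: release list for one project, as JSON or XML."""
    return f"{BASE}/{fmt}/release/{quote(api)}/{quote(project)}"


def sort_releases(releases):
    """Recommended releases first, then highest major, minor and patch."""
    return sorted(
        releases,
        key=lambda r: (r["recommended"], r["major"], r["minor"], r["patch"]),
        reverse=True,
    )
```

Encoding the query string matters because a search term such as "my search string" contains spaces, which cannot appear literally in a URL.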

Comment #23

dww

we/he/they

commented 4 May 2009 at 17:44

Status:

Needs review

» Needs work

Sure, this seems better than fetching all the data in d.o and locally searching it. I haven't tested anything, but a fairly quick skim of patch #20 revealed the following areas that need work:

A) Code style: if () and foreach() always need {} even for 1 liners.

B) Fragile design: There's no vocabulary on d.o called "Project release API compatibility". Vocabularies can be renamed, too. Instead of the name, you should probably just use the vocab id (vid) directly, and either use _project_release_get_api_vid() or go with a constant at the top of the script.

C) Duplicate code: the whole code block to deal with this vocab is cut + paste in two places. Use functions and have a helper that finds the tid based on the arguments to the script.

D) Expensive design: The queries for the list of releases are pretty bad (nothing you can do about that, it's just an expensive thing to compute without much more denormalization). We already do all this for the release history XML. Why do you need to duplicate all of this via this script? Can't you use this script for the list of categories and searching projects, but re-use the XML release history files that update.module uses for the actual list of releases?

E) project_solr is under fairly active development (as is solr itself). I'm worried about the big cut + paste here since it means that someone has to remember to fix both places as we make changes to project_solr + solr on d.o. We should consider if any of the logic in project_solr can be refactored into helper functions that this script could reuse, instead of duplicating the code.
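Points B and C can be illustrated with a short sketch: one helper that resolves an API version to a term id using a fixed vocabulary id instead of a vocabulary name. The ids and the in-memory term list are invented for the example; in the real script the vid would come from _project_release_get_api_vid() or a constant, as suggested above.

```python
# Hypothetical vocabulary id for the API compatibility vocabulary.
API_VID = 6

# Invented stand-in for the taxonomy term table.
TERMS = [
    {"tid": 87, "vid": 6, "name": "6.x"},
    {"tid": 78, "vid": 6, "name": "5.x"},
    {"tid": 15, "vid": 3, "name": "6.x"},  # same name, different vocabulary
]


def api_tid(version, terms=TERMS, vid=API_VID):
    """Return the term id for an API version in the API vocabulary, or None."""
    for term in terms:
        if term["vid"] == vid and term["name"] == version:
            return term["tid"]
    return None
```

Because every caller goes through api_tid(), renaming the vocabulary has no effect on the lookup, and the logic exists in one place rather than being pasted twice.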

Also, you should check out what Adrian is doing with his proposal for a Drupal 'ports' collection. In fact, I'd like him to comment here with his thoughts on this thread and the proposed solutions, since he's thought a lot about the client-side implications of all this data...

Comment #24

adrian commented 4 May 2009 at 18:03

Status:

Needs work

» Needs review

I've been working on this stuff recently, but from the perspective of an outside system.
http://groups.drupal.org/node/21295

This is probably outside of the update module use case, but I require all the information up front, to be able to make decisions about what packages to go fetch in the first place.

So I am building a directory structure of easily mirror-able meta-information, that can be parsed on the client side.
This is specifically with the goal of writing something like apt-get for Drupal.

By keeping it outside of the project module, I also open the possibility for developers to generate package indexes of their own repositories, without needing to have drupal and project running. By keeping the client side implementation out of Drupal itself (in Drush), it allows me to manage major version upgrades too.

Comment #25

adrian commented 4 May 2009 at 18:14

Status:

Needs review

» Needs work

Fixing a status change I didn't intend.

Comment #26

dww

we/he/they

commented 4 May 2009 at 18:24

@adrian: re: "By keeping it outside of the project module, I also open the possibility for developers to generate package indexes of their own repositories, without needing to have drupal and project running."

If the primary data source is Drupal project + release data, and that data lives in project*, you need to be able to query project* to get it. If you don't have project* and manage your own releases of Drupal code via other means, you can certainly provide data in the same format (just like you can for update_status right now). But, the bulk of the data is going to come from the projects hosted on d.o, and that's living in project*, so something(s) need to understand project's schema and generate the files.

Maybe this isn't the best thread to discuss it in... we probably shouldn't hijack JoshuaRogers's request with a debate about the details of your needs for yaml files and the philosophy behind who/what/where they're generated. ;)

Comment #27

Anonymous (not verified) commented 4 May 2009 at 18:58

@dww: Okay, 'A', 'B' and 'C' were all me being an idiot. ;)

D: That sounds fair enough. I'll replace this section with a section to give the md5sum of a file in json. Using update sounds good enough.

E: Most of this came directly from project_solr_browse_page. At the end of the function it uses a loop to turn $response into themed html. If it had just returned $response then I would not have needed to copy any of the code. I suppose I could make a patch that moves the majority of project_solr_browse_page into a helper function. Would you suggest this?

Comment #28

Anonymous (not verified) commented 4 May 2009 at 18:58

Title:

JSON/XML interface to project search

» JSON interface to project search

Comment #29

dww

we/he/they

commented 4 May 2009 at 20:24

A-C: I wouldn't say "idiot". That's the point of peer-review.

D: Good. However, the release history XML includes the md5sum -- why do you need json for that at all?

E: Not just "would" suggest, that's exactly what I did suggest... ;)

"We should consider if any of the logic in project_solr can be refactored into helper functions that this script could reuse, instead of duplicating the code."

Comment #30

Anonymous (not verified) commented 4 May 2009 at 22:03

D: The way the system currently works (if you can call it that), the plugin manager displays an iframe of the release page before a module can be installed. The user has to copy the md5sum from the release page to the plugin manager form. This is done for security's sake.

If the server downloads both the file and the md5sum, and the server has been affected by DNS poisoning, then it would be simple to make malicious packages appear to be legit. Thus, the user needs to post the md5sum to the server. One possibility would be to use ajax to fetch it from the XML release history. Unfortunately, security settings prevent most browsers from loading XML from remote sites, so this isn't an option... JSON data can be loaded though. Serving md5sums in a JSON wrapper would allow the browser to automatically get the md5sum and then post it to the local server. This would successfully hide that nasty step from the user without endangering them.

E: My mistake... again... Dang it. Sorry... again. I'll get to work on that patch now. :)

Comment #31

dww

we/he/they

commented 4 May 2009 at 23:11

Re: D) If updates.drupal.org has been affected by DNS poisoning, I don't see how JSON is going to help you any. Please explain.

Comment #32

Anonymous (not verified) commented 5 May 2009 at 00:13

I think I might have worded it badly. The worry here isn't updates.drupal.org falling prey to DNS poisoning. The worry is that a server hosting a Drupal install might become poisoned. If that were to happen then all attempts to connect to updates.drupal.org and ftp.drupal.org could be redirected to sinister.example.com.

If the local server is poisoned and it grabs the md5sum itself, then both the package and the md5sum could be pulled from sinister.example.com. If the user enters the md5sum and the server downloads the package, however, then both the server and the client PC would have to be subject to DNS poisoning for this attack to be successful. Otherwise the fetched md5sum and the calculated md5sum would most likely be different.
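The check described here can be sketched as a comparison between the md5sum the browser posted and the md5sum the server computes over the package it downloaded. The function names are hypothetical, and the sketch covers only the comparison, not the download itself.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of the downloaded package bytes."""
    return hashlib.md5(data).hexdigest()


def package_matches(package_bytes: bytes, md5_from_browser: str) -> bool:
    """True only if the server-side digest equals the one the browser posted."""
    return md5_hex(package_bytes) == md5_from_browser.strip().lower()
```

The two values travel over different paths (one fetched by the admin's browser, one computed from what the web server downloaded), so an attacker who controls only the web server's DNS cannot make both agree on a substituted package.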

Comment #33

dww

we/he/they

commented 5 May 2009 at 00:22

"server" is overloaded here. Let's say there are three machines involved:

1) *.d.o (let's pretend it's one box for the sake of argument).

2) the server where the website is hosted that someone is trying to update.

3) the machine where the browser is running where the admin is trying to initiate the update.

right?

So, you're saying we want JSON between #1 and #3 in case #2 is poisoned, right?

I'm not fundamentally opposed to that, but #3 could also fetch the XML release history for a given project and find the right md5sums, no?

Also, if we go the JSON route, it's complicated because there could be multiple files per release node (e.g. .tar.gz and .zip versions) and you need to be able to specify which one you actually care about. In the XML version, we can just list all the md5sums "next to" the links to the files themselves, and you know via parsing which file each md5sum corresponds to...
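To show what "md5sums next to the links" could look like to a client, here is a sketch that parses a release listing with several files per release. The XML layout (files, file, url, md5), the example host, and the hash values are assumptions for illustration and do not reproduce the real release history format.

```python
import xml.etree.ElementTree as ET

# Hypothetical release history fragment with two files for one release.
SAMPLE_HISTORY = """<project>
  <short_name>coder</short_name>
  <releases>
    <release>
      <version>6.x-1.0</version>
      <files>
        <file><url>http://ftp.example.com/coder-6.x-1.0.tar.gz</url><md5>tar-gz-hash</md5></file>
        <file><url>http://ftp.example.com/coder-6.x-1.0.zip</url><md5>zip-hash</md5></file>
      </files>
    </release>
  </releases>
</project>"""


def md5_by_url(history_xml, version):
    """Map each file URL of one release to the md5 listed beside it."""
    root = ET.fromstring(history_xml)
    for release in root.iter("release"):
        if release.findtext("version") == version:
            return {f.findtext("url"): f.findtext("md5") for f in release.iter("file")}
    return {}
```

Since each hash sits inside the same element as its link, the parser always knows which file a given md5sum belongs to, with no extra parameter needed to pick one.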

Comment #34

Anonymous (not verified) commented 5 May 2009 at 01:02

That is correct.

The reason we haven't attempted letting the browser parse the XML is because of the "Same-Origin" policy (or at least what we understand it to mean.) Simply, it will not allow XML from one domain to be read by a script on another. (Once again, that is my personal understanding of it.) I do know that json data can get around that limitation, however.

As far as knowing which file we are referring to, appending ?filename=path/to/the/file.tgz to the end of the URL would precisely identify which md5sum should be returned.
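A sketch of how such a response might be produced is below. It assumes the cross-domain mechanism meant here is JSONP (JSON wrapped in a callback supplied by the page), which the comment does not state explicitly; the function name, the callback parameter, and the hash table are hypothetical. The payload shape mirrors the serve_md5sum.php excerpt quoted in comment #3.

```python
import json
import re

# Only plain identifiers are accepted as callback names, to avoid script injection.
_CALLBACK_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def md5_response(hashes, filename, callback=None):
    """Return the JSON (or JSONP) body for one ?filename= lookup."""
    if filename in hashes:
        payload = {filename: hashes[filename]}
    else:
        payload = {"error": True}
    body = json.dumps(payload)
    if callback is None:
        return body
    if not _CALLBACK_RE.match(callback):
        raise ValueError("invalid callback name")
    return f"{callback}({body});"
```

A script tag pointing at such a URL is not subject to the same restriction as an XML request made by script, which is the usual reason JSONP is chosen for cross-domain reads.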

Comment #35

Anonymous (not verified) commented 16 May 2009 at 23:54

29e is beginning to look unlikely. Issue #453718: Provide easier access to raw Solr results is meant to address it. Unfortunately, it has received no love. :(