in working on http://drupal.org/node/128827 and thinking about http://drupal.org/node/125742 it's now clear that the existing XML-RPC protocol between update_status.module and project_release.module isn't good enough. :(
once upon a time, i thought update_status sent an array of (module name, major version) pairs in the request, along with a single field to indicate the core version (since you can't possibly have multiple versions of core on the same site). however, really it just sends an array of names, and a single core *and* major revision number, or no revision number at all if it wants to find out about the latest "default" version (as defined by the project admins) for each module. :( the reason for this is that it was easier to write a single (and therefore, faster) DB query in the project server code to answer a request if all the major revisions are the same. :( in fact, update_status.module itself currently gets this wrong for sites that have different major revisions installed -- it's just hard-coded to always ask about major revision 1. :(
so, either:
- we have to change the protocol between update_status and project, so that we send (name, major) pairs, which would be better for update_status, and better for project_usage, but worse for project's XML-RPC server (more expensive to answer the requests b/c we'd need a more complicated query, or perhaps more queries.
- forget about tracking major revision in all this, which is pretty much out of the question -- update_status can't actually provide the right answers if major revision isn't involved.
- have update_status send multiple queries, 1 for each set of modules at a given major revision. this seems inefficient and wonky. it makes it harder to store the queries for recording usage info, and wouldn't necessarily be any faster for project to answer the XML-RPC requests (since there are more of them), than putting all the data in a single request.
- work on the project* DB schema to consider ways to cache the info and make these XML-RPC requests less expensive to answer, so there's no problem with more complicated queries. for example, maybe project_release should just maintain a {project_release_latest} table with project_nid, core_version, major_revision, and release_nid as the fields (more or less).
a sort-of-related complication is how this protocol can be used to correctly flag updates as including a security fix. it'd be easy enough to have a new "release type" taxonomy, so you can flag individual release nodes as providing a security fix. however, say 5.x-1.2 is a security fix, a site currently has 5.x-1.1, but the most recent release is now 5.x-1.3, which fixes bugs, but not security holes. update_status should still tell the admin they need to upgrade to 5.x-1.3 because of a security problem, even though that was from 5.x-1.2, not 5.x-1.3. :( so, it seems like part of the answer that project_release's XML-RPC server sends back has to be something like "last security update on this branch", and let update_status do the work to decide if it needs to flag a given out-of-date module using that info. i only raise this here while we're considering how to fix this protocol.
the goal is to get update_status into core for 6.x, so time's really running out if we're going to get this right before the code freeze. hence, the critical priority for this issue.
| Comment | File | Size | Author |
|---|---|---|---|
| #31 | project-release-history.php_5.txt | 11.18 KB | dww |
| #26 | signup-5.x.xml_2.txt | 1.52 KB | dww |
| #25 | project_release_history_callback.patch_1.txt | 5.25 KB | dww |
| #24 | project-release-history.php_4.txt | 11.07 KB | dww |
| #22 | project_release_history_callback.patch.txt | 5.02 KB | dww |
Comments
Comment #1
drewish commentedthe last bit about security status ties in with one feature that chx was asking for in my SoC project. he wanted to make sure that security got factored in. having the ability to mark the cvs tag releases (I'm not sure what terminology you use to distinguish the HEAD -dev release from the "stable" tag releases) as containing a security issue would be very critical to evaluating a module's quality.
Comment #2
drewish commentedCouple more thoughts:
* As I sort of said in the last post, I think the security issue flag should be per-release, but its not really applicable to -dev/HEAD releases so I don't know if a taxonomy is the right way to do this. It might be better off as a release node field.
* I don't know if I'm seeing the the or part of #1 and #4 they both sound like the right idea to me.
Comment #3
dwwthe basic terminology we use is:
see http://drupal.org/handbook/cvs/releases/types for more.
anyway, yeah, flagging official releases as including a security fix has been part of the design goals since day 0 of the new release system. the taxonomy i propose above is most of what we need. however, it just gets tricky with the update_status.module for the reasons i outlined above, which is why i'm bringing it up here.
Comment #4
dwwoh yeah, and to clarify, i only envision this taxonomy applying to official releases, not dev snapshots. however, i've run into this same problem of wanting a taxonomy to only apply to a subset of nodes of a given type (e.g. the database compatibility term i'd also love to see on projects only makes sense for modules and installation profiles, not translations or themes) but sadly, there doesn't seem to be a good, general solution to this right now. so, meanwhile, we might have to play some tricks with form_alter() to just hide this taxonomy selector when creating release nodes that are development snapshots (we already play some tricks like that to alter the taxonomy UI on project nodes for the "Project type" radio buttons vs. the "Project categories" multi-selects to classify modules)...
that said, i still think a taxonomy is the right solution for this, instead of something hard-coded into project_release.module as another field directly in the {project_release_nodes} table. here's why:
however, it does get a little messy once we get into the XML-RPC protocol, and this whole business of reporting security updates. then, the general, flexible niceness of a taxonomy can't remain clean and pure. someone has to get dirty and have specific knowledge of the security update term, and give it some special case treatment, because of the problem in the original post about a site running 5.x-1.1, which missed the 5.x-1.2 security update, and now 5.x-1.3 (which itself is not a security update) is already released. :(
Comment #5
dries commentedComment #6
dries commentedArgh, I typed a long reply here but a 500 ate it. I'll re-post it later on, have to run now.
Comment #7
dwwattempting to summarize a few long discussions with merlinofchaos, webchick, and dries...
one possible downside of #2 is that not only would the client need to parse these XML files (good god, let's not turn this into another theme .info file thread!), more importantly, the client would have to have all the brains required to figure out stuff like "what's the latest release?", "is there a security update in between?", etc, etc. if we ever change anything about how project_release works, if most of these brains live in project_release, we can change everything in one place. once it's in the client (and especially once that client is in core), we're going to be pretty locked into how everything works.
Dries asked me to post info on the "brains" of the code i'm talking about. basically, see
function project_release_data()from http://drupal.org/files/issues/xmlrpc_22.patch (care of http://drupal.org/node/48580#comment-191462). or, just look in the current source:http://cvs.drupal.org/viewcvs/drupal/contributions/modules/project/relea...
basically, all the smarts are currently handled by building the right SQL query, since the db schema related to release nodes was designed to answer these kinds of queries. so, we'd have to basically do the same kind of logic in php on the array of parsed info we got from these .xml files.
i'm still torn, since the idea of putting all the brains in the client is great from a performance standpoint, but a little scary for other reasons (especially if/when the client moves into system.module in core)...
more brainstorming required, clearly. ;)
Comment #8
nedjoConsidering that we have XML-RPC parsing in core, we could use that as the format for the (static) XML files. Then the Drupal client would just have an array to deal with. So basically we'd be generating an XML-RPC file and caching it rather than responding on demand.
Comment #9
merlinofchaos commentedI don't know enough about the guts of our xmlrpc library to really comment; but if we can basically format the messages ahead of time and store flat-files it seems like a fantastic idea. Server load would be well reduced.
Comment #10
dwwsummary from another IRC session, this time with Dries and merlin at the same time. ;)
ideas about storing usage data:
Dries still didn't seem too thrilled with the hostkey thing, but merlin and I think that's the approach with the best "price/performance" ratio (i.e. most accurate stats for the least trouble).
everyone agrees this effort is important, so more thinking is required. however, there's a sense of urgency about getting this figured out, a new server side implemented, a new version of update_status.module as a 5.x client that supports it, and a patch against system.module for D6 that moves the client code into core, all before the june 1st code freeze. yikes... ;)
Comment #11
dwweek, sorry, i realized i was a little bit sloppy with terminology. to clarify:
and, to be clear, no one's actually in favor of the application key approach, since it's way too much work, and no one would bother to use it.
Comment #12
Paul Natsuo Kishimoto commentedThis is exciting - subscribing!
Am I correct in thinking that the XML-RPC requests are actually serving two purposes? Namely:
Food for thought: these could be separated. From my (limited) experience with apt-based package managers, they download lists containing information (latest versions, dependencies, descriptions, details of security updates) on all packages. With 24000 packages in the Ubuntu repositories, the lists are less than 5 MB. There is a popularity contest analogous to Drupal's usage stats, but it functions independently from the package manager. Users can turn it off.
There are several repositories; analogues in Drupal:
If the processing onus is truly on the client (as in package managers), d.o would generate the very few module list files (gzip'ed) and serve them (straight from cache or via public download) as requested. If a client wasn't running any modules under development, it wouldn't download the snapshots list. Complete lists also enable the client to inform users "If you want Module X, version 5.x-1.2, you must also install these versions of Modules J, K and L. Here are the download links for all four."
In short, thought lessons from package managers (which seems to be, in the very very long run, where this is going):
I hope that isn't too far off-topic. My point is that the above would obviate the need for an advanced protocol for popularity/usage tracking; the client would send d.o only the list of the modules it was running. It would obviate the need for *any* protocol at all for update status checking, in favour of an XML package list format. And, of course, that everyone else is doing it isn't sufficient reason to follow suit (unless it really works!)
Comment #13
dwwFYI: see http://drupal.org/node/142120 about setting up a "Release type" taxonomy on d.o as per my comments in #4 above. that's basically in the critical path for all of this security update flagging stuff to really work properly...
Comment #14
dwwafter considerable further thought and more discussions in IRC, here are my current ideas about this:
files/release-history/[project_name]/[core_version].xml. i suppose we could just do everything in 1 giant directory, and just name the files[project_name]-[core_version].xmlbut i'm slightly worried about performance, and i think subdirs might be better. this point doesn't really matter, and we could change it later, since i'm not envisioning that clients pull the file directly from apache, but instead go through a menu call-backthe .xml files for release histories would be something like this (this is cut and paste from a local test site, the URLs, nids, etc will all be different for real):
my concrete action plan:
as soon as i get the initial .xml files up there, merlin can start working on the new brains in the client for update_status 5.x-2.0, so we can get a working prototype out in the wild to make sure it's all happy while we work on a) storing usage data and b) a patch for D6 core.
any objections to anything above? speak now or forever hold your peace. ;)
thanks,
-derek
Comment #15
dwwdries and i chatted via IM about this. mostly he loves it. he suggested a few simplifying assumptions:
i'm all in favor of #1, that sounds great. not sure about #2 yet, honestly, i think files on disk will be easier to get working quickly, but i could be wrong.
Comment #16
dwwhere's a mostly working prototype to generate the xml files. currently, it only logs to the watchdog, instead of spitting out files, but i have to run right now and can't spend the extra 15 minutes to get this fully working. ;) however, i wanted to share what i've got as a backup and so others can review and poke holes in the mean time...
Comment #17
dwwcode-complete and tested locally. writes everything (carefully) to the file system. the live .xml files are only touched via an atomic call to rename(), so there's no fear of clients accidentally grabbing a partially-written version if they happen to ask in the middle of while this script is generating files. i guess the next step is to let this rip on d.o. however, i wouldn't mind a 2nd pair of eyes before i did that, just to make sure everything is cool. of course, we're not writing to the DB at all, and only writing to a very specific, well-known directory tree, so there's really no harm that could be done. but, better safe than sorry. ;)
Comment #18
dwwa few extra check_plains(), just to be safe... thanks to webchick for having the extra paranoia filter on while reviewing. ;)
while i was at it, i added a big comment to where we create the filenames we use to explain what parts are safe and what parts aren't. and, for pure paranoia, i added checks to remove '/' from the Drupal core compatibility vocab terms when generating the filename... tried it on a test site and it worked nicely to avoid "../../../" style attacks using these vocab terms.
Comment #19
drewish commentedlooked over #18 and had a few comments.
i wouldn't mind seeing $fatal_err = FALSE before it's first used. it'd be E_ALL compatible and make it's lifetime a bit clearer.
not to be too picky but i though the Drupal coding standard was to capitalize TRUE and FALSE
you've got what looks like some dead code:
should 'Published' and 'Unpublished' really be translated? It seems like it's being used as a constant more than a label.
$xml .= ' <status>'. ($release->status ? t('Published') : t('Unpublished')) ."</status>\n";would it make sense for $drupal_root, $site_name, $dest_root, $dest_rel and $dest_full to be defined as constants rather than global variables? they don't change and it'd help them stand out...
also would you mind posting one of the output files?
Comment #20
dwwthanks for the careful review, drewish. actually, last night before i went to sleep, i started working on the project_release.module side of things, and decided that it'd be best (and more consistent with the rest of drupal) to have a setting for the directory to use for this, so a bunch of the $dest_* stuff is gone in my workspace now. just putting the finishing touches on that. i'll fold in your other suggestions, for sure.
re: TRUE vs. true, wasn't there recently a thread on devel@ that said 'true' was a few % faster? i don't really care, i'll stick with TRUE for consistency, until core switches...
and yeah, i guess there's no good reason to t() published vs. unpublished in this case. ;)
attached is a sample output file from a test site. it's full of bogus data, and not all of the releases even have files attached to them, so some of the attrs are missing, but it should give you an idea. (speaking of which, just noticed another bug: we shouldn't include the
<download_link>tag if there's no file attached to the release, yet.) looks a lot like what i posted above in #14.Comment #21
dwwnew improved version of the generation script, including all of drewish's suggestions. also relies on a variable_get() for the relative directory stuff, instead of having that configured in here, since project_release.module needs the same info. initial patch against project_release.module coming next.
Comment #22
dwwinitial patch against project_release.module. doesn't actually serve up the .xml files yet, but it does everything else (all the validation, handling error cases, adds the necessary settings, etc).
Comment #23
drewish commentedLooking at the output..
It seems like these should be made consistent. Perhaps without the a href bit? less parsing and all...
It's probably been mentioned before but why are we including unpublished releases?
Should we be including the version in the name? It seems redundant. Is that already baked in to the node title?
And now looking at the code...
DIR_SEP isn't really needed... PHP's pretty good about treating \ and / the same on windows... You could use the built in DIRECTORY_SEPARATOR http://us.php.net/manual/en/ref.dir.php
it may be over thinking it but depending on which version of PHP is being used and how the script is called, $argv[0] may have '/usr/bin/php' as a value.
drupal_chdir() isn't being called from anywhere.
Comment #24
dwwre: including info about unpublished release nodes: if you happen to be running a version that's since been unpublished, that's *really* bad news, and the client should be able to tell you that. however, i agree it's dumb to include links to the release node and the download link in this case -- we just want to include enough info for the client to identify it's the same release, and then to be able to freak out appropriately. ;)
otherwise, i incorporated all of drewish's suggestions in #23. thanks!
so, here's the new generator-script...
Comment #25
dwwfinished the project_release.module side of things to actually send back XML. ;) for consistency, all errors at this menu callback are reported via XML, too, so once you land on that page, you always see XML (e.g. if the requested project is not found, has no releases for the requested version, etc).
IMHO, both parts of this are now RTBC, but additional reviews wouldn't hurt. i'll probably deploy on scratch.d.o this afternoon and let it rip, to see how things are looking.
Comment #26
dwwoh, and re: "Should we be including the version in the name? It seems redundant. Is that already baked in to the node title?" -- yeah, the version is just part of the node title for the release nodes.
at request, a new copy of the generated xml file (still with bogus data from the test site... soon y'all will just be able to see this directly on s.d.o...)
Comment #27
dwwran the generator script on s.d.o, and it took less than a minute (and the initial run is the one that has to create the most directories). a subsequent run took only 12 seconds to re-generate all the files. ;) granted, that's on s.d.o, which is using the new beefy DB server, but still...
i also applied my patch against project_release.module on s.d.o, so the menu callback is now working. for example:
http://scratch.drupal.org/project/release-history/views/5.x
http://scratch.drupal.org/project/release-history/project_issue/5.x
...
obviously, your browser is going to do funny things to display the XML, but if you view source, you'll see it's all happy. the project_issue example includes some taxonomy terms, too, since i had already classified some of the security releases i've done with the new Release type terms.
any final tweaks to any of the behavior or comments on the code? i'd like to commit all of this to CVS and install it for real on d.o.
Comment #28
drewish commentedsweet! it looks good to me. when you think about it most of the execution time is probably spent running you error-triple-checking code ;)
Comment #29
drewish commentedsweet! it looks good to me. when you think about it most of the execution time is probably spent running you error-triple-checking code ;)
Comment #30
dwwre: when you think about it most of the execution time is probably spent running you error-triple-checking code ;)
hah, you laugh, but i've been developing project* code on d.o for over a year now, and it *never* ceases to amaze me how often the impossible happens once i move to the live environment. ;) sure enough, the script generated 25 errors on those "impossible" cases that should never happen, due to the weirdness that is the live d.o data...
about 20 of those appear to be from behavior i didn't think about before, which i now can't decide if it's a bug or not (probably it is). apparently, it's not uncommon for entire project nodes to be deleted (i thought we usually unpublished them, but i guess not). anyway, there's nothing in project* that notices when you delete a project node that goes off to delete all the release nodes for that project. so, we've got a bunch of entries in {project_release_nodes} pointing to projects that no longer exist. who knew? ;) submitted http://drupal.org/node/142957 about this, so if you're interested in discussing further, go there.
the other ~5 are from cases where the "HEAD" release node has been assigned a core compatibility term (e.g. 5.x or 6.x) without having a reasonable version string. i really need to finish up http://drupal.org/node/89699.
Comment #31
dwwin IRC, merlin and i decided we might want the project short name returned in the XML, for example, if you wanted to fetch a bunch, smoosh them together, and parse all at once. furthermore, "name" is rather ambiguous for the full project name (aka "title"). so, here's a new version that calls the full name "title" and adds the short name as "short_name".
Comment #32
dwwcommitted to HEAD and installed on d.o.
the initial run took a *lot* longer on d.o than on s.d.o, i guess the old DB server really is orders of magnitude worse than the new one ... took 7 minutes and 5 seconds. instead of ~45 seconds. second run took 5 minutes and 21 seconds, as opposed to ~15 seconds on s.d.o. yikes.
added a cron job as drupal-cron, too, to re-generate every 6 hours at 25 minutes after the hour. we can tweak that as needed.
Comment #33
(not verified) commented