The d.o upgrade changes the project table structures, so we can no longer run the same direct queries. #669910: Expose list of projects to external services (via JSON, XML, etc) introduced a stop-gap solution for some of the data we gather. For now we need to disable the old queries and then work out how to get all the data we need. Committing a first interim patch to disable things for now.

Comments

SebCorbin’s picture

It seems that we lost some taxonomy terms regarding translations (e.g. https://drupal.org/project/fr) in the d.o upgrade. The term id was 29, and it is now gone: https://drupal.org/taxonomy/term/29
This is holding up the first two steps listed here for release fetching.

The only other way I see now is to filter out projects both marked as "Obsolete" and "Unsupported".

Is that ok?

gábor hojtsy’s picture

I think it's fine to filter those out.

gábor hojtsy’s picture

Closed #2139775: Commerce economic not listed on l.d.o as duplicate of this one... Now with so much time passed, this is getting to be a problem.

cheatlex’s picture

How can one help?

gábor hojtsy’s picture

@SebCorbin: I looked at the query before/after your patch (http://drupalcode.org/project/l10n_server.git/commitdiff/583b6eae7c8109f...) and it looks like our db query user does not have access to the taxonomy_index table. There is no term_node table anymore (I assume due to the D7 upgrade), so there is no way to get data from there either. Trying to get access to that table now :)

gábor hojtsy’s picture

Status: Needs work » Postponed
gábor hojtsy’s picture

Priority: Normal » Critical
gábor hojtsy’s picture

Status: Postponed » Needs work
Attached file (new): 5.12 KB

Ok, the permission is now granted on taxonomy_index. I did some more digging and improved sebcorbin's patch. The project usage sync will not work, as the timestamp max query runs into a filesort and MySQL resets the connection on us for that, heh :D So I commented that out. That would be an acceptable sacrifice if at least the release syncing worked, but that does not work either :/ Looks like the release files are not in the files table anymore, or some other table is not tracking them anymore... That needs more looking into.

I also provisionally put in the Oct 31 last release timestamp to the last sync so it will pick up all releases even though our last erroneous attempt knocked that number over on the live site :)

Unfortunately I am out of steam for tonight but will look into this on Monday.

gábor hojtsy’s picture

Just got an IRC report from @csakiistvan that Drupal 7.24 with l10n_update will not download proper translations either since 7.24 is not on l.d.o yet :/

gábor hojtsy’s picture

Did more debugging. Figured out we cannot get results from our join on [files] since that is a D6 leftover table. D7 has file_managed. Doh. Updated my issue to ask for access to that too: https://drupal.org/comment/8293461#comment-8293461

hass’s picture

For about 3 months we have had no updated po files on l.d.o. Is there nobody who can fix these issues, PLEASE? :-(((

gisle’s picture

I also find the present situation problematic. Since there has been no progress for more than a month, I suggest that, as a stopgap solution, project maintainers be allowed to manually upload po files for specific releases.

dydave’s picture

I've also been following this issue with a lot of interest and am getting anxious to see some progress to unblock this annoying situation.

It seems there have been some developments lately with #2148907-9: Give read access to localize_ro user to file_managed. Gábor should now have proper access rights, which should perhaps allow this issue to start moving again.

@Gábor Hojtsy
With the recent update of the access to file_managed, would you be able to get this issue to move forward again?
Would there be anything we could potentially work on to assist you, to allow you to do the work we can't?

Thanks very much in advance.

gábor hojtsy’s picture

I share your frustrations. I spent several hours yesterday trying to untangle this, only to be more depressed about the situation :( The extent to which we need changes is way bigger than I thought. I thought I had fixed several queries and only needed access to the file_managed table, and then figured out yesterday that the tables we have access to stopped collecting data on October 31st, and that data like project shortnames or dev/stable designations now live in entirely different places. E.g., data we used to get from the project_release_nodes and project_projects tables is now in field_data_field_project_machine_name and field_data_field_release_build_type. :/

Earlier I also found out that the data for project usage reached a threshold where our queries would resort to filesort, which would bring down the database.

So I think if we want to get this fixed sooner rather than later, we need to give up most of the syncing functionality that localize.drupal.org used to have (e.g. ordering projects by usage or having real project titles). Which is pretty sad. I mean, finding some (old) projects under their human titles and new ones under machine names only is quite odd. But the data provided publicly by drupal.org is not sufficient to get this type of information with any reasonable amount of resources.

The offered solution from the drupal.org upgrade team was/is a single TSV file at https://drupal.org/files/releases.tsv which provides all releases ever made. This can be used to create the release info on l.d.o, with some assumptions about where the files are put by the packager (there are no filenames in this dump). It cannot be used to collect real project names (project node titles), project usage data, and it cannot be used to tell if a project / release was unpublished (eg. security unsupported). These are things the current code does.

At least from a stop-gap perspective, the https://drupal.org/files/releases.tsv data could be used to save raw project info (basically the machine name) to l.d.o's db, so people could search for their projects with that, and it can be used to set up minimal info about releases as well. We can keep using old project usage data for a while until people start complaining about that too.... Not having that is probably not as big a problem ATM as not having any new projects or releases available.

Since we will not get to where we wanted to be by asking for db table access one by one, I think everybody who wants to dedicate some time and knows PHP can help move this forward :) The task would be to update the logic in http://drupalcode.org/project/l10n_server.git/blob/refs/heads/6.x-3.x:/c... based on data from https://drupal.org/files/releases.tsv and gut out all the stuff that cannot be done with this data (i.e. project status tracking, real project titles, tracking of project usage, tracking of file hashes).

Who wants to help?

SebCorbin’s picture

Assigned: Unassigned » SebCorbin
Related issues: +#2100597: Add a connector for www.drupal.org’s REST API
Attached file (new): 9.41 KB

I've taken the skeleton from #2100597: Add a connector for www.drupal.org’s REST API and adapted it as per #18.

Note that for now, I put WHERE p.connector_module IN ('l10n_project_drupalorg', 'l10n_drupal_rest_restapi') in the parsing process, but this may be useless, thoughts?

SebCorbin’s picture

Patch applied on http://syncing-localize.redesign.devdrupal.org/, accessible with drush uli and devwww.drupal.org access

Installed l10n_drupal_rest module and enabled connector (the other one is not visible), ran cron (this took a while)

Took this example #2176591: Project jQuery Nicescroll not listed on l.d.o which was not listed before and then http://syncing-localize.redesign.devdrupal.org/admin/l10n_server/project...

Cron after cron, you can see parsed releases, and they become available in the translate interface http://syncing-localize.redesign.devdrupal.org/admin/reports/dblog

barraponto’s picture

I'd love to chime in, but how to start working on that?
Should I set up a l10n_server instance and try the patch above to see if it works?
(would it require an awful lot of CPU/RAM to replicate l.d.o locally for development?)

gábor hojtsy’s picture

Thanks sebcorbin for jumping on this. At least this will get us a minimum level to move forward :) Some things to fix before we deploy:

  1. +++ b/connectors/l10n_drupal_rest/l10n_drupal_rest.rest.inc
    @@ -0,0 +1,106 @@
    +  $local_projects = l10n_server_get_projects(array('all' => TRUE));
    ...
    +    unset($local_projects[$project]);
    ...
    +  if (count($local_projects)) {
    +    // If we still have local projects lingering, those are not anymore
    +    // available with non-dev releases on drupal.org, so we should turn off
    +    // their listing in our database.
    +    $disabled_projects = array_keys($local_projects);
    +    db_query('UPDATE {l10n_server_project} SET status = 0 WHERE uri IN (' . db_placeholders($disabled_projects, 'varchar') . ')', $disabled_projects);
    +  }
    

    I think this is the kind of stuff that we cannot do anymore. At least I would not parse this multi-megabyte file in its entirety; that sounds like a recipe for problems... :/ Also, since we don't really know the project status, the rollout would be safer if this code did not remove/disable anything :D

  2. +++ b/connectors/l10n_drupal_rest/l10n_drupal_rest.rest.inc
    @@ -0,0 +1,106 @@
    +  $destination_path = file_directory_path() . '/releases.tsv';
    +  $url = variable_get('l10n_drupal_rest_refresh_url', 'https://drupal.org/files/releases.tsv');
    +  // This will take some time, so we need to increase timeout.
    +  $response = drupal_http_request($url, array(), 'GET', NULL, 3, 300);
    +  if ($response->code == 200) {
    +    // Save as temporary file
    +    file_save_data($response->data, $destination_path, FILE_EXISTS_REPLACE);
    +    _l10n_drupal_rest_read_tsv($destination_path, $before, $projects,
    +      $releases);
    +  }
    

    I think downloading the whole file and parsing the whole thing would be problematic. IMHO we should read the tsv line by line and stop after the last sync timestamp minus a day. That should mean we only read a few hundred lines at most (except the first run now that we need to pick up all our missing stuff).

  3. +++ b/connectors/l10n_drupal_rest/l10n_drupal_rest.rest.inc
    @@ -0,0 +1,106 @@
    +      // New project, not recorded before.
    +      db_query("INSERT INTO {l10n_server_project} (uri, title, last_parsed, home_link, connector_module, status) VALUES ('%s', '%s', %d, '%s', '%s', %d)", $project, $project, time(), 'http://drupal.org/project/' . $project, $connector_name, 1);
    

    Add a @todo that titles need to be grabbed from somewhere *later*. Not blocking this patch at all.

  4. +++ b/connectors/l10n_drupal_rest/l10n_drupal_rest.rest.inc
    @@ -0,0 +1,106 @@
    +  // @TODO Filter them so that they do not include the term "Translations" (tid 29).
    

    I don't think this is relevant if we only consider new releases, since there are no new translation releases allowed.
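
The "read line by line and stop early" idea from point 2 could look roughly like the following. This is only a minimal sketch in plain PHP, not the code from the patch: the function name `example_read_recent_releases()`, the position of the timestamp column, and the assumption that the TSV lists the newest releases first are all hypothetical and would need checking against the real releases.tsv.

```php
<?php
/**
 * Collects rows from a TSV stream until they get older than the last sync.
 *
 * Hypothetical helper: assumes rows are sorted newest first and that the
 * release timestamp is in column $ts_col. Neither is confirmed for the
 * real releases.tsv.
 */
function example_read_recent_releases($handle, $last_sync, $ts_col = 0) {
  // Go back one extra day as a safety margin.
  $cutoff = $last_sync - 86400;
  $rows = array();
  while (($line = fgets($handle)) !== FALSE) {
    $fields = explode("\t", rtrim($line, "\r\n"));
    // Skip a header line or any row without a numeric timestamp.
    if (!isset($fields[$ts_col]) || !is_numeric($fields[$ts_col])) {
      continue;
    }
    // Everything from here on is older than the cutoff, so stop reading.
    if ((int) $fields[$ts_col] < $cutoff) {
      break;
    }
    $rows[] = $fields;
  }
  return $rows;
}
```

Because reading stops at the first row older than the cutoff, only the top of the file is ever held in memory. Note that the function needs an already opened stream; how that stream is obtained on the server is a separate question.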

baluertl’s picture

I also want to help your heroic efforts, guys. Currently I'm blessed with dozens of hours of free time, but with limited technical possibilities. Please count on me for any browser-based (e.g. clicking through l.d.o for testing) or textfile-editing (e.g. processing .csv/.tsv dumps) tasks.

SebCorbin’s picture

Status: Needs work » Needs review
Attached files (new): 4.63 KB, 8.61 KB

Here's the updated patch as per #22.

gábor hojtsy’s picture

Looks good. If you find it works well on staging, it looks good to me to deploy. Thanks for jumping on this so fast.

SebCorbin’s picture

"So fast" => 3 months late ;)

Unfortunately, I get this error on the server

Warning: fopen(): https:// wrapper is disabled in the server configuration by allow_url_fopen=0 in _l10n_drupal_rest_read_tsv() (line 67 of /var/www/dev/syncing-localize.redesign.devdrupal.org/htdocs/sites/all/modules/l10n_server/b/connectors/l10n_drupal_rest/l10n_drupal_rest.rest.inc).

Thank my company for letting me work on this; it would not have been possible without them (btw, if you have spare clients, we are open :p)

gábor hojtsy’s picture

Oh, well, sorry for the detour then. Let's get back to the drupal http request code BUT don't disable projects and only look at the first needed part of the file (to be quicker and use less memory). Hopefully the file will be OK size for a while.

gábor hojtsy’s picture

In the meantime the tsv changed to include the project full name, so we can create it with that. See http://drupalcode.org/project/infrastructure.git/commitdiff/311211d. Also with this the size of the tsv changed from 2.9MB to 4.4MB, so a pretty huge increase :/ Hope this will not mean problems for downloads for a while...

SebCorbin’s picture

Attached file (new): 4.21 KB

I don't have the courage to update existing project titles, so they will be updated as soon as they have new releases.

Also, I've switched to using column headers for data since the patch to release-list.sh in infra changed the order of the columns.
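
To illustrate why keying on column headers survives a column reorder, here is a minimal plain-PHP sketch. It is not the code from the patch: `example_rows_by_header()` is a hypothetical name, and it assumes the first line of the TSV holds the column names.

```php
<?php
/**
 * Reads a TSV stream into rows keyed by the names in its header line.
 *
 * Hypothetical helper: assumes the first line contains the column names.
 */
function example_rows_by_header($handle) {
  $first = fgets($handle);
  if ($first === FALSE) {
    return array();
  }
  $header = explode("\t", rtrim($first, "\r\n"));
  $rows = array();
  while (($line = fgets($handle)) !== FALSE) {
    $fields = explode("\t", rtrim($line, "\r\n"));
    // Skip blank or malformed lines that do not match the header width.
    if (count($fields) != count($header)) {
      continue;
    }
    // Key each value by its column name instead of its position.
    $rows[] = array_combine($header, $fields);
  }
  return $rows;
}
```

With rows keyed this way, callers look values up by name rather than by index, so a change in column order upstream does not require a code change.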

gábor hojtsy’s picture

Status: Needs review » Fixed

I think we can call this fixed. @Sebcorbin amazingly rolled this out and thousands of releases are now in the queue to parse. It will take some time to catch up with parsing, but now it's running. There are still 751 releases in the queue, but yesterday it was above 2000, so it seems to be running well :)

tvn’s picture

SebCorbin++ !

baluertl’s picture

@SebCorbin, I hope we can meet at a Drupal event someday, so I can say Thank You in person :)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.