Since the update, I am seeing entries like this in {twitter}.twitter_id -
mysql> select distinct twitter_id from twitter ;
+------------------+
| twitter_id |
+------------------+
| 29623769287 |
| 29631773515 |
| 29691431908 |
| 368407153222000 |
| 369535781704000 |
| 374879970927000 |
| 375296213656000 |
| 376538260308000 |
| 386220945314000 |
| 387828525564000 |
| 689419451367000 |
| 1024631595080000 |
| 1026387909550000 |
| 1097669099390000 |
I believe that (at least in MySQL, but possibly only on my installation) the use of decimal(20,0) is losing data for Twitter IDs beyond the first 12 digits, so Twitter IDs end up with the final several digits being '0'.
When doing some work on #584514: Add search support to module I noticed that imported statuses were triggering #966748: Duplicate key error on normal import (my comment). In MySQL Drupal has used DECIMAL(20,0) to represent array( 'type' => 'numeric', 'precision' => 20 ), and I think both of these are capable of storing the full sequence of digits we need. But for some reason we end up with zeroes.
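For reference, the column described above would look roughly like this in a Drupal 6 hook_schema() implementation (a sketch only - the key layout follows the Schema API docs, not necessarily the module's actual twitter.install):

```php
<?php
// Illustrative Schema API fragment: 'numeric' with precision 20, scale 0
// maps to DECIMAL(20,0) in MySQL, which can hold a full 64-bit Twitter ID.
$schema['twitter']['fields']['twitter_id'] = array(
  'type' => 'numeric',
  'precision' => 20,
  'scale' => 0,
  'not null' => TRUE,
);
```

If the column is wide enough, that would point at the data being mangled before it ever reaches MySQL.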
Are we maybe missing something in the schema definition?
Can anybody confirm they see similar entries in {twitter}.twitter_id? (Just so I know it's not a local issue!)
http://drupalcode.org/viewvc/drupal/contributions/modules/twitter/twitte...
| Comment | File | Size | Author |
|---|---|---|---|
| #62 | twitter-985544-62-reroll.patch | 2.2 KB | xurizaemon |
| #56 | twitter-985544-checks_and_kludges-56.patch | 2.64 KB | xurizaemon |
| #41 | twitter-985544-41.patch | 2.18 KB | gapple |
| #40 | twitter-985544-40.patch | 1.6 KB | gapple |
| #38 | twitter-985544-38.patch | 1.54 KB | gapple |
Comments
Comment #1
protoplasm commented
I can confirm that all my entries have between 2-7 trailing zeros (mostly 2-4). All numbers are 16 digits.
Comment #2
xurizaemon
Thanks protoplasm, appreciate your double-checking. Can you confirm what DB engine you're using (just in case it differs)?
Comment #3
protoplasm commented
mysql/MyISAM
Comment #4
kjholla commented
Could this be something to do with the function json_decode, which is used at line 261 in twitter.lib.php?
I got thinking that this may not be anything to do with the database or how the insert into the database works. It may be something to do with the way the twitter status id is returned/obtained from the twitter api itself.
Hence, I started following the twitter_fetch_user_timeline function in the twitter_cron() hook. I found that it finally obtains the twitter timeline and then uses json_decode to parse the results.
I looked up the PHP manual for json_decode ( http://php.net/manual/en/function.json-decode.php ) and found that there is one example (Example# 5 in the PHP Manual) which talks about how long integers will be handled by this function.
Maybe using json_decode() with the JSON_BIGINT_AS_STRING option instead of the plain call will solve the problem?
But, there may be downstream impact since the Database has twitter_id as a Float and not a string.
Your comments?
Regards,
KH
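The json_decode() behaviour described in #4 can be reproduced directly; the sample ID below is illustrative (borrowed from later in this thread), not a real status:

```php
<?php
// json_decode() coerces an integer literal larger than PHP_INT_MAX to a
// float, which keeps only ~15-16 significant decimal digits - hence the
// trailing zeros that end up in {twitter}.twitter_id.
$data = json_decode('{"id": 10765432100123456789, "id_str": "10765432100123456789"}', TRUE);

var_dump(is_float($data['id']));          // the id arrives as a float
echo sprintf('%.0f', $data['id']), "\n";  // digits beyond the float's precision are wrong
echo $data['id_str'], "\n";               // exact: 10765432100123456789
```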
Comment #5
kjholla commented
I was probably right. It appears to me that the JSON format is what is causing the issue with the last few digits of the twitter_id being corrupted.
However, the earlier approach of replacing the json_decode line will not work since json_decode will support JSON_BIGINT_AS_STRING only from PHP version 5.3 onwards. Most of us are still on PHP 5.2.x
I knew that this problem started manifesting itself only after I upgraded to version 6.x-3.0-beta2. Hence, I compared the code in version 6.x-3.0-beta3 with that in version 2.6 and found that version 3.0 uses the JSON format whereas version 2.6 used XML.
Since the twitter api still supports the XML format, I borrowed the code from version 2.6 and added it to the version 3.0-beta3 code files.
The resultant file is attached. I can confirm that using this updated code solved this problem for me. However, I am not too sure if there are other problems potentially waiting in the wings, which I haven't encountered yet.
You could take a look if you choose.
Regards,
KH
Comment #6
kjholla commented
Comment #7
kjholla commented
Comment #8
kjholla commented
Here is the file.
Comment #9
xurizaemon
Thanks kjholla, that is a good find. And yes, your updated twitter.lib.php does seem to resolve the issue on import.
Had a quick look thru CVS but no indication of why the switch from XML to JSON. Likewise no obvious issues leap out ... it would be good to review why this change was made, in case this fix reverts something that change was meant to address.
I re-rolled your changes as a .patch, attached. Probably needs a small cleanup (remove comments, check indentation & coding standards etc) but good enough for testing. I'm running this now and it appears to work as advertised.
Comment #10
xurizaemon
Attached is a patch with a bit more cleanup.
The PHP manual on json_decode has a few comments on this, but yes, this issue will appear on sites running PHP < 5.3 (excepting 64-bit, presumably?)
Comment #11
Gilneas commented
I applied your patch from comment #10.
Thank you very much, it seems to work. There are no errors any longer in the log after cron runs.
Comment #12
xurizaemon
Thanks Gilneas. kjholla's method of restoring the XML code does work, but I'm wondering if there's a simpler fix too.
The code in #8/#10 reintroduces several methods which aren't necessary for JSON parsing - Twitter::_twitter_convert_xml_to_array(), Twitter::_twitter_convert_status(), Twitter::_twitter_convert_user(). If we can avoid having to special case each XML result then that's good IMO.
Twitter's XML format isn't universal either (search results are available only in Atom or JSON) so reverting to XML affects the approach needed for #584514: Add search support to module (adding support for Atom as a new format isn't a huge task, but using JSON avoids that need).
So what about a workaround for the JSON bug? What if we just do something like this instead?
Needs testing to see if the stringified bigint will be stored correctly tho.
It's another special case approach still, but this seems like a lot less code so might be worth considering as an alternative.
Setting back to needs review so we can discuss a bit further - hope that's OK.
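A minimal sketch of that regex-before-decode idea (the pattern and the 12-digit threshold here are illustrative assumptions, not the actual patch):

```php
<?php
// Quote the "id" value in the raw JSON before json_decode() sees it, so PHP
// keeps it as an exact string instead of a lossy float.
$response = '{"id": 10765432100123456789, "text": "hello"}';
$response = preg_replace('/"id"\s*:\s*(\d{12,})/', '"id":"$1"', $response);
$data = json_decode($response, TRUE);
// $data['id'] is now the exact string "10765432100123456789".
```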
Comment #13
lanouettea commented
Thanks a lot!
I had the exact same issue and the patch at #10 fixed it for me.
Comment #14
nosweat commented
Storing big IDs in the MySQL DB no longer seems to be a problem. However:
(in twitter.inc -> function twitter_status_save)
This error check, to see if tweets already exist in the db, seems unreliable to me.
This looks better. But is it?
%d - the argument is treated as an integer, and presented as a (signed) decimal number.
%b - the argument is treated as an integer, and presented as a binary number.
Huh? What? OK, I'm not sure myself, let's test:
My findings:
1845501952 is actually a very strange number, as max_int on 32-bit = 2147483647.
Output from %b is expected, I guess, but I find it kind of strange that it works quite well when using it in queries.
Result from %s is unexpected; I guess I would expect a string as output, but it seems to work as well.
Still looking for a reason to choose either %b (binary) or %s (string) representation of $status->id. Or maybe a better solution?
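A quick illustration of why %s is the safer choice here (the sample value is illustrative):

```php
<?php
// A twitter_id read from the DB arrives as a string; %s passes it through
// verbatim, while %d forces an integer conversion that cannot preserve a
// value beyond PHP_INT_MAX.
$twitter_id = '10765432100123456789';
echo sprintf("%s", $twitter_id), "\n";  // exact: 10765432100123456789
echo sprintf("%d", $twitter_id), "\n";  // mangled by the integer conversion
```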
Comment #15
xurizaemon
Jeff's comments re twitter_update_6005() in twitter.install discuss this as well.
See also http://drupal.org/node/333788
I don't really think this is ready yet, so let's go back to needs work for now.
Comment #16
xurizaemon
There's a patch in #10 which restores the old XML behaviour.
This patch is an alternative approach which I suggested in #12 - "forget XML, let's regex". I think it's worth contemplating the regex approach being applied to the JSON before we hit json_decode().
So - thoughts on this patch vs the one in #10?
Comment #17
pcoucke commented
@grobot: your patch from #16 works for me.
Comment #18
hedac commented
Tested #16 with no issues. Thanks!
Comment #19
hedac commented
Sorry.. I spoke too fast.. #16 didn't work for me.
I had this error: json_decode() expects at most 3 parameters, 4 given.
Comment #20
Anonymous (not verified) commented
@hedac: Me too with #16.
A touch more detail: the error reports JSON_BIGINT_AS_STRING is an undefined constant, and json_decode() expects 3 parameters, not 4.
Just a thought - if I'm reading the PHP docco linked correctly, the 4th param (JSON decode option) isn't contingent on PHP > 5.3.0, it's only in the development ver?
Comment #21
xurizaemon
Something like that ... The PHP docs seem to be out of date; PHP 5.3.0 was released in mid-2009.
But if you are hitting the error json_decode() expects at most 3 parameters, 4 given. then apparently there are PHP builds which pass version_compare(PHP_VERSION, '5.3.0', '<') but don't like JSON_BIGINT_AS_STRING. Feh.
I noticed tonight that Twitter API now gives a couple of (undocumented?) _str entries as well in its JSON output. I say "undocumented" - can't see them in API docs @ http://dev.twitter.com/doc/get/statuses but they are detailed at http://groups.google.com/group/twitter-api-announce/browse_thread/thread... (google groups login required; appears to be the canonical post/discussion).
So - Twitter API offers e.g.
{"id": 10765432100123456789, "id_str": "10765432100123456789"}
... should we just use the id_str values? I'm a bit suspicious of hooking onto API data which is not blessed with docs. And it seems that StatusNet doesn't have those entries, so we'd lose compatibility there. (Related issue on StatusNet.)
Tempting, but I am still in favour of preg, ugly as it may be. My thoughts on using XML are outlined in #12 above.
And if we can't test for PHP 5.3.0 to decide if json_decode will like BIGINT or not, let's go all the way and just preg_ the lot.
Comment #22
Greg Varga commented
Subscribing.
Comment #23
yark commented
Please don't try to reinvent the wheel.
After json_decode() that BIGINT exists in memory as correct integer. To output it you just need to format it like this:
sprintf('Here your BIGINT: %.0f', $decoded_json['some_bigint_eg_id']);
Voilà!
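Worth noting: %.0f only controls the formatting; it cannot restore digits the float has already discarded. A quick check (sample value illustrative):

```php
<?php
// By the time the ID is a float, precision beyond ~15-16 digits is gone;
// %.0f merely prints the float without scientific notation.
$decoded = json_decode('{"id": 10765432100123456789}', TRUE);
echo sprintf('Here your BIGINT: %.0f', $decoded['id']), "\n";
// Prints a 20-digit number, but not necessarily the original one.
```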
Comment #24
gapple
Why not use defined() instead of the version check? If the constant exists, then json_decode() must support the fourth parameter.
Comment #25
joachim commented
> Just a thought - if I'm reading the PHP docco linked correctly, the 4th param (JSON decode option) isn't contingent on PHP > 5.3.0, it's only in the development ver?
Yup. I'm on PHP 5.3 and getting an error message about that 4th param. I don't think the JSON_BIGINT_AS_STRING is something we can count on.
>sprintf('Here your BIGINT: %.0f', $decoded_json['some_bigint_eg_id']);
@yark, I'm not entirely sure what you mean by the above, but looking in the entire array of stuff returned by JSON, there is a string version of the id. Also, I've no idea what your sprintf is doing -- the best way to really explain what you mean is a patch :)
So here is my patch, which fixes the problem for me.
It could do with more work, probably giving the in_reply_to_status_id key the same treatment, and the sprintf thing if that needs doing too.
Comment #26
Anonymous (not verified) commented
Joachim, the patch in #25 didn't work for me.
Comment #27
michaek commented
Here's an updated patch that takes into account that the return from the API may be a JSON array (in my case it seems to always be one). @bumathan, please let me know if that takes care of the issue for you.
Comment #28
joachim commented
I'm not entirely sure I get what this does...
I'm not sure what this is for as it's not going to change what you return...
And just some coding standards stuff...
Comments should start with a capital and end with a full stop.
Coding standards: don't coddle elses.
Comment #29
raintonr commented
Also MySQL/PHP related?
I have a custom module that converts tweets to nodes. As part of this one has to read from the {twitter} table of course, and interestingly this fails:
But this works (note the treatment of the ID as a string):
In both cases $twitter_id was read from the database using db_fetch_object() (and passed to a function from $row->twitter_id).
Hopefully that is of some help here.
Basically - as one never really has to mathematically manipulate these IDs I would suggest just treating them as strings in the twitter module.
Comment #30
michaek commented
#28: The problem is that the integer is too big for PHP to represent, leading to corruption of the id with trailing zeros. For older tweets, there's no difference, but newer tweets have much higher id values. If you're running on a 64-bit system, you might not see the problem at all.
Sorry about the coding standards. I will work on getting that back into compliance, but it's better to have a fix than nothing at all!
#29: Yep, same underlying issue, see #966748: Duplicate key error on normal import.
Comment #31
gapple
@28
The assignment inside the loop wouldn't normally do anything, since foreach copies array elements, but in this instance the reference operator (as &$item) allows the foreach loop to operate directly on the original array's elements.
----
Here's a patch that builds upon #27, making PHP do the work if it can so we don't have to.
First it checks if we are running on a 64-bit version of PHP, so that any special handling can be avoided.
Next, if JSON_BIGINT_AS_STRING is defined it will use that as a parameter to json_decode().
Finally, it will overwrite the values of ['id'] with the string values in ['id_str'].
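The three-step strategy described above can be sketched like this (the function and variable names are illustrative, not the actual patch; the sample ID is chosen to fit a 64-bit integer):

```php
<?php
// Hedged sketch of the decode strategy: use native capability when present,
// fall back to id_str only when the integer would be mangled.
function twitter_example_decode($json) {
  // 64-bit PHP holds current Twitter IDs natively; no special handling.
  if (PHP_INT_SIZE >= 8) {
    return json_decode($json, TRUE);
  }
  // If available, ask json_decode() to keep big ints as strings (PHP 5.4+).
  if (defined('JSON_BIGINT_AS_STRING')) {
    return json_decode($json, TRUE, 512, JSON_BIGINT_AS_STRING);
  }
  // Last resort: decode (mangling the int), then overwrite from id_str.
  $data = json_decode($json, TRUE);
  if (isset($data['id_str'])) {
    $data['id'] = $data['id_str'];
  }
  return $data;
}

$data = twitter_example_decode('{"id": 1097669099390000123, "id_str": "1097669099390000123"}');
```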
----
As in #21, I'm also reluctant to rely on id_str if it's not in the docs. Using preg_replace also allows the opportunity to replace every large integer, not just those keyed with 'id', and at any level of depth in the JSON data structure without extra iterations or recursion.
Comment #32
gapple
Here's my attempt at using a general regular expression pattern to convert large integers, however there are a few issues.
Anything like a JSON string inside the JSON being parsed could make things blow up spectacularly.
While matching scalar values (e.g. "id":12345) should be pretty safe, the match for values in an array (e.g. [123,456,789]) has to be pretty lenient and could cause problems. The issue is that if the first element in an array is matched, the next element cannot use the separating comma as part of its match; trying to match both will result in only the odd-positioned elements being converted (e.g. [123,456,789,123] changes to ["123",456,"789",123]).
The pattern for the second replacement call in the patch will match all the values, but also any number (of suitable length) followed by a comma or closing bracket will match (e.g. "stringvalue":"this is some numbers: 123, 456" will get mangled to "stringvalue":"this is some numbers: "123", 456").
If it's safe to ignore arrays, the second replacement could be omitted.
To be extra safe, if all numbers to be converted have identifiers ending with 'id', the first match could be made more restrictive:
preg_replace('/"(\\w*id)":\\s*(\\d{' . $length . ',})\\s*([},])/', '"\\1":"\\2"\\3', $json)
Comment #33
michaek commented
Hi, @gapple.
I like your approach in #31. Somehow I'd missed that id_str wasn't documented. I feel really wary of using preg_replace on JSON, as in #32, if we can avoid it - string manipulation always seems simpler than it turns out to be.
Comment #34
michaek commented
Also, @gapple, could you reroll your patch against the current dev version? That seems like it's against a slightly older version.
Comment #35
joachim commented
Just so we're fixing the right problem... when does the long integer actually break? Is it broken by json_decode, or once Drupal tries to do something with the result of that?
Comment #36
michaek commented
I can't test at the moment, but I'm pretty sure it's inherently broken, because of the limitations of PHP's integer. json_decode attempts (correctly) to store the value in an integer, which doesn't work (details repeated, repeatedly, elsewhere). It's not a problem with Drupal or json_decode, but a limitation of the language.
Comment #37
michaek commented
And the reason that the preg_replace approach is a solution (albeit a sort of troublesome one) is that it makes the change to the JSON string _before_ it gets decoded into an integer.
Comment #38
gapple
I didn't know you had pushed your patch to the repository before working on my own.
Here is an updated patch for the integer size and JSON_BIGINT_AS_STRING checks, based on current 3.x-dev.
----
I am also wary of the preg_replace solution due to the potential for it to mangle data, just worried that handling each individual element will get out of hand (and that it's not supported by StatusNet yet). In #21 @grobot noted that "in_reply_to_status_id" may also need to be handled as a bigint.
Comment #39
michaek commented
Thanks! I've come around to the preg_replace approach since my first post on it - you're right that it's the way to go. I think we should probably get this patch into the module!
Comment #40
gapple
Here's a patch with the preg_replace, and the checks for 64bit and JSON_BIGINT_AS_STRING.
I've used
"((?:\w+_)?id)"\s*:\s*(\d{' . $length . ',})\s*(,|})
as the matching expression, as I think it's the simplest, safest match we can start with. It will match elements with the key "id", or a key ending in "_id", whose integer value is long enough to potentially not fit within the integer limit.
It will not match any integer values within an array, but I'm not sure if these are an issue or a priority.
Comment #41
gapple
Here is an improved patch for converting all large integers using a regular expression.
This patch uses a look-ahead subexpression in the array regex to check that the proper delimiters are both before and after the integer, fixing some issues I ran into in #32.
The only remaining problem is JSON-like string values. The most likely is a list of large numbers separated by commas (e.g. "string": "1234,5678,9101,2345" will convert to "string": "1234,"5678","9101",2345", and "string": "list: 1234, 5678, 9101" to "string": "list: "1234", "5678", 9101").
Comment #42
gapple
With some more experimenting, I think the issue of matching within a string can be avoided by adding the following look-ahead match to the end of each expression:
(?=(?:[^"]*(?<!\\\)"[^"]*(?<!\\\)")*[^"]*$)
This will check that the remaining string to be matched only contains pairs of unescaped quote characters, so that numbers inside quotes will not match.
Comment #43
michaek commented
All this regex stuff really does make me nervous. I think I'm going to check with Twitter to see 1) why they're providing id_str and 2) if they can get it into the documentation so we can rely on it. Until then, I would rather just use id_str - replacing all long integers with strings seems like it might be a solution in search of a problem, as the only thing we've actually got a problem with is the id.
Comment #44
xurizaemon
I don't like the regex, it feels brittle.
I don't like reintroducing XML because it's lots more code to maintain, and I didn't want to be reversing eaton's removal of the XML code. (but now maybe we can consider that if it's the robust thing to do.)
I don't like id_str because it ties us to an undocumented part of the Twitter API (alarm bells), and because it breaks support for status.net (unless they copy, or have already copied, id_str). May not be a big issue right now but I really want this module to work with twitter-compatible services too (ideally per-user).
:)
@michaek are you interested in another co-maintainer? I'd be in if you can use it.
Comment #45
xurizaemon
I'm pretty sure Twitter offer id_str for this exact reason: Snowflake announcement, Oct 2010
May already be linked above but a related issue in status.net is #3124 provide id_str / in_reply_to_str in JSON API results.
I'm not against any of the above approaches btw, just flagging that none seems like a perfect solution at the moment. Beginning to think myself that bringing back XML is the sensible (if ugly) option.
In the Twitter announcement linked above, they state that they offer _str versions of: Status, User, Direct Message and Saved Search IDs, so those are the large-int versions we need to handle.
Comment #46
joachim commented
That seems pretty well-documented to me.
Comment #47
xurizaemon
#21 -
Yes. "Documented" in the Hitchhiker's Guide sense of "on display", I guess. Beware of the Leopard. ;)
Comment #48
joachim commented
So it's a bug :) Maybe someone can contact Twitter and ask them to update their documentation?
Comment #49
michaek commented
Hi, folks. I assumed that this was the reason they were providing id_str, because they were aware their ids were getting too big to reliably represent as integers - that's what I wanted to ask them about, but the explanation @joachim points to makes that unnecessary. I believe they'll update the documentation if we bring it to their attention.
I think we can use id_str if it's there, use id if it's not, and be pretty confident that we're not going to break another implementation of Twitter's API.
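That "id_str if it's there, id if it's not" rule is a one-liner. A sketch (input illustrative; the ID is chosen to fit a 64-bit integer so the fallback cast stays exact):

```php
<?php
// Prefer Twitter's exact string form; fall back to casting id for
// API-compatible services (e.g. StatusNet) that don't send id_str.
$status = json_decode('{"id": 1097669099390000123, "id_str": "1097669099390000123"}', TRUE);
$id = isset($status['id_str']) ? $status['id_str'] : (string) $status['id'];
```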
Comment #50
michaek commented
I've created a ticket with Twitter to get that added. We'll see where it goes, but I feel pretty comfortable taking a chance on this. Working with Twitter's API always involves a leap of faith. :)
Comment #51
michaek commented
Here's the response from Twitter:
We should be good to go on using id_str.
Comment #52
michaek commented
I've just checked the code, and that's exactly what we're doing.
Comment #53
gapple
Let's not forget the patch back in #31 that doesn't have any funky regex stuff, and just tries to use PHP's capabilities if available first (64bit or JSON_BIGINT_AS_STRING). I think it should still apply to the latest 3.x-dev, but let me know if you would like a re-roll.
I think it was also noted that there are other id values that may need to be handled as well, and if id values are nested at any deeper levels some extra handling will be required.
Comment #54
michaek commented
I'm ok with the approach from #31, but I don't see the downside of just using the string Twitter provides. If other services get into tweet ids that are high enough to cause problems, they'll need to implement something similar.
I'd say let's worry about any other id values as we come to them. If Twitter is concerned enough about this id to provide it as a string, they would probably do so anywhere else an integer might break.
Comment #55
gapple
Because the string value is being used to overwrite the (mangled) integer value, there is overhead to looping through each item and performing the assignment.
This overhead should be unnecessary if the system uses 64bit integers, or supports JSON_BIGINT_AS_STRING; both of which should become more common in the future.
Comment #56
xurizaemon
This expands on #31 for reduced overhead on sane systems, and packs the _str stuff away in a separate (badly named) function.
Comment #57
dela.deyoungster commented
Hi folks.
I've gone through the entire thread and the analysis made is very reasonable. I applied the patch (manually, just to make sure) from #56 while going through each line to understand what it does. Unfortunately I still get the 'Duplicate key' error on cron run (though the tweets are rightly posted to twitter).
Funnily enough, I'm running PHP 5.3.2 (as shown below from a version request in the terminal) but a PHP request for the defined constant JSON_BIGINT_AS_STRING returns nothing.
With the explanations given above everything should be working, but for some reason it doesn't on my system.
Can you help me out, @grobot?
Thanks in advance.
Comment #58
gapple
@dela.deyoungster
The $options parameter to json_decode() (where JSON_BIGINT_AS_STRING is used) is only available in PHP 5.4+ (it was previously available in some development versions of 5.3).
What is the value of your PHP_INT_SIZE constant? (4 means 32bit, 8 means 64bit)
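For anyone checking their own host, these one-liners report the relevant capabilities of a PHP build:

```php
<?php
// 4 means 32-bit integers, 8 means 64-bit.
printf("PHP_INT_SIZE: %d\n", PHP_INT_SIZE);
printf("PHP_INT_MAX:  %d\n", PHP_INT_MAX);
// TRUE only when json_decode() accepts the bigint option (PHP 5.4+).
var_dump(defined('JSON_BIGINT_AS_STRING'));
```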
Comment #59
juampynr commented
How could I reproduce this bug? Can anyone provide steps or scenarios? I have not faced it while fixing bugs over the last three weeks.
Cheers
Comment #60
xurizaemon
Reproducing this issue requires PHP which lacks support for JSON_BIGINT_AS_STRING (versions before 5.3, but no documented specific version) and PHP on a 32-bit system.
Comment #61
kjholla commented
Looks like the same issue occurs with the field in_reply_to_status_id as well.
We will need to include a line of code to handle this error.
More details of the change I have made are available in the attachments in #1520008: Extract actual URLs from t.co wrappers using Tweet Entities
Regards,
KH
Comment #62
xurizaemon
We committed a workaround in db2a4a, but this issue may still affect people on 32-bit systems? JSON_BIGINT_AS_STRING got pushed to 5.4 in the end.
Here's a reroll of #56 for 7.x-3.x. It should cover @kjholla's case above?
Comment #63
13rac1 commented
Is this still an issue? If so, it needs updating for 7.x-5.x.
Comment #64
xurizaemon
I think this is resolved; a later issue to check is #2055951: Interesting MySQL issue appears to cause randomish "twitter_views_handler_field_formatted_tweet->render()" errors. If there are module users with 32-bit / older PHP hosting setups, they aren't making a lot of noise in the issue queue.