Problem / Motivation
Currently, the author column in {aggregator_items} is a 255-character varchar. However, there are definitely feed items out there where the author entry is longer than 255 characters. An example is an article authored by multiple people, as per the example below:
PDOException: SQLSTATE[22001]: String data, right truncated: 1406 Data too long for column 'author' at row 1: INSERT INTO {aggregator_item} (title, link, author, description, guid, timestamp, fid) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => Biodiversity loss and its impact on humanity [:db_insert_placeholder_1] => http://feeds.nature.com/~r/nature/rss/current/~3/R0f1-VRuwlw/nature11148 [:db_insert_placeholder_2] => Bradley J. Cardinale J. Emmett Duffy Andrew Gonzalez David U. Hooper Charles Perrings Patrick Venail Anita Narwani Georgina M. Mace David Tilman David A. Wardle Ann P. Kinzig Gretchen C. Daily Michel Loreau James B. Grace Anne Larigauderie Diane S. Srivastava Shahid Naeem [:db_insert_placeholder_3] => Biodiversity loss and its impact on humanity Nature 486, 7401 (2012). doi:10.1038/nature11148 Authors: Bradley J. Cardinale, J. Emmett Duffy, Andrew Gonzalez, David U. Hooper, Charles Perrings, Patrick Venail, Anita Narwani, Georgina M. Mace, David Tilman, David A. Wardle, Ann P. Kinzig, Gretchen C. Daily, Michel Loreau, James B. Grace, Anne Larigauderie, Diane S. Srivastava & Shahid Naeem. The most unique feature of Earth is the existence of life, and the most extraordinary feature of life is its diversity. Approximately 9 million types of plants, animals, protists and fungi inhabit the Earth. So, too, do 7 billion people. Two decades ago, at the first Earth Summit, the vast majority of the world’s nations declared that human actions were dismantling the Earth’s ecosystems, eliminating genes, species and biological traits at an alarming rate. This observation led to the question of how such loss of biological diversity will alter the functioning of ecosystems and their ability to provide society with the goods and services needed to prosper. Only local images are allowed. [:db_insert_placeholder_4] => [:db_insert_placeholder_5] => 1338933600 [:db_insert_placeholder_6] => 33 ) in aggregator_save_item() (line 156 of .../modules/aggregator/aggregator.processor.inc).
Proposed resolutions
Truncate the author list to 255 characters. This is the approach recommended as a general rule in #1284658: MySQL strict mode is causing PDO Exceptions when lengths of varchar are exceeded. We took this approach with the aggregator_item title field in #1244116: Feed items with title longer than 255 characters fail to insert into aggregator_item as titles were being encountered with more than 255 characters.
Remaining tasks
#1811458: User should be able to set the length of the aggregator field e.g. the author field is a followup issue to discuss whether we should increase the size of the database field to allow longer author texts.
Comment | File | Size | Author |
---|---|---|---|
#19 | aggregator-test-only-1622974-19.patch | 3.57 KB | iflista |
#18 | aggregator-1622974-18.patch | 4.55 KB | iflista |
#18 | interdiff.txt | 4.7 KB | iflista |
#15 | interdiff-14.txt | 1.34 KB | xjm |
#14 | aggregator-1622974-14.patch | 5.03 KB | xjm |
Comments
Comment #1
DocuAnt CreditAttribution: DocuAnt commentedTo truncate the author field seems to be the right way for me because of the similarity to the title field.
Thanks in advance for the patch. :-)
Comment #2
marcingy CreditAttribution: marcingy commentedTitle was truncated as the spec states the maximum length is 100 characters. I suppose the key question is the likelihood of someone querying against author but in the case of the example above it is 'authors'. I am incline to say make it a text field because in reality the above example shows that in truncating would invalidate a search anyway.
Comment #3
Pomliane CreditAttribution: Pomliane commentedChanging priority and tags with reference to #218004: Allow aggregator feed URLs longer than 255 characters
Comment #4
marcingy CreditAttribution: marcingy commentedThis is not a major bug.
Comment #5
marcingy CreditAttribution: marcingy commentedNote to self read linked issue first ;)
Comment #6
cweagansFixing tags per http://drupal.org/node/1517250
Comment #7
xjmIn this case I think we should truncate it. The URL is a special case.
Comment #8
joevansteen CreditAttribution: joevansteen commentedTruncation is the approach I used to get a quick and dirty fix for myself under D7. I ran into this same scenario with a different "Nature" article. Scientific journals would seem more the culprits here. The only thing that might be a little cleaner is to try to avoid chopping a fraction of a name, and maybe inserting a "[more]" notice at the end to replace whatever is at the back of the field.
Comment #9
xjmI'll write a test.
Comment #10
xjm@joevansteen, the truncate_utf8() function takes care of that for us (adding an ellipsis).
Tests and fix.
Comment #11
xjmOopsie, broken tests patch above. Fixed here.
Comment #13
xjmAdjusting another test that relies on the number of items in the sample feed.
Comment #14
xjmOops, bad copypasta. Fixed in attached.
Comment #15
xjmUpload failed apparently.
Comment #16
BerdirSimple fix, has tests. Let's get this in.
o.0
Comment #17
webchickAgreed with the fix, and thanks for the literature lesson in the process of the tests. ;)
Committed and pushed to 8.x. Moving to 7.x.
Comment #18
iflista CreditAttribution: iflista commentedBackported to D7. Needs review.
Comment #19
iflista CreditAttribution: iflista commentedAdded test only version of patch.
Comment #21
xjm#18: aggregator-1622974-18.patch queued for re-testing.
Comment #22
BerdirBackport looks good too, RTBC (Patch in #18, #19 is test only)
Comment #23
Anonymous (not verified) CreditAttribution: Anonymous commentedTruncating the author names is sorta not done in the publishing world AFAIK. It could lead to people not getting enough credit for their work. For example with research publications it is not uncommon that there are more than 3 authors.
It should really accommodate for all authors of a (research) publication. If there are multiple author names added to a publication then they should all be listed. How would you feel if your name got lobbed of a project because of an IT 'feature'? Not to happy, I imagine. ;-)
Could use a text field for example, and possibly limit this to for example 1500 characters. If people still run into trouble after that spec could always be upped. Having some kind of cap might be good though, for example standard allow for 30 really long names (50 names of 30 characters is 1500 characters). Perhaps make the number of characters user configurable? This could still leave the issue that names will get cut off.
Also if there is not already there should be some text in for example a readme.txt or at entry level that specifies how many characters are allowed per field. This way if the text is being lobbed off then it is more easy to discover why.
Edit from what I read here:
http://stackoverflow.com/questions/1303476/is-a-varchar20000-valid-in-mysql
In MySQL 5.0.3 and later versions varchar is allowed up to 65,535 characters. If that is the case could just set the varchar() to 1500 characters for example. This may be the least amount of work, while providing the solution. Could also do the same for the URL field. This needs further testing though.
Drupal 7 has as minimum MySQL spec 5.0.15.
What will the minimal MySQL spec for Drupal 8 be? It could be standardly implemented from 8. This would then still need a fix for 7.
Comment #24
xjmThanks @design_dolphin. See #218004: Allow aggregator feed URLs longer than 255 characters for extensive discussion of why we are limiting the varchar, and why we used a text there. Also keep in mind that not all sites use MySQL for their storage.
My reasoning for truncation was that in each case, the author data is just a summary of the RSS item. The aggregator item still has a link to the actual item, so the full title, author, etc. information will be included there.
The current behavior when you have an author field over 255 characters isn't that the author is partially imported--it's that there's a fatal error that prevents any feed data from being imported at all. That's why this is a major bug. So let's commit #18 as a stopgap to resolve this major bug on production sites impacted by it, and then open a followup issue to discuss increasing the allowed length?
Comment #25
Anonymous (not verified) CreditAttribution: Anonymous commentedo.k., I can see the reasoning behind that, and the need for a solution.
However, AFAIK there is no way to tell how the names are coming through the author field. (I find RSS feed a pain at times because they get improperly filled at the source.)
So how will varchar know if it is cutting of the persons first name or last name?
That might lead to strange displays of data. For example:
'John Smith, Jane Smith, Dr. Joe'
Would adding a text that states something like "Please see article for the full list of authors." if the author text exceeds the varchar limit be a cleaner solution?
For example something like this alfa pseudocode:
This won't solve the problem, however, in situations where the publisher insists on having the author names there.
How sure is it that the string coming through is UTF8? I've seen RSS feeds with all sorts of encoding. Won't this mess special characters up then? (I couldn't find any definitive info on aggregator encoding at the moment.)
If we're changing the field in the database to text, then a better solution may be to at the same time increase the string length entered for the field. Showing all the author names would solve the problem as well.
Edit:
Changed some text.
Comment #26
xjmPlease open a followup issue?
Comment #27
xjmReiterating, this issue is not changing any feature of Drupal 7. It's only stopping the whole feed import from breaking. Whether to lengthen the field is worth discussing, but it should not block this.
Comment #28
webchickWhile I can definitely appreciate design_dolphin's point, xjm's correct that in this particular instance, I think truncating is the right thing to do, because:
- Aggregator items are ephemeral; reloading http://drupal.org/planet twice a week is going to come up with a completely different list of stuff. These are not intended to be reference pieces that stick around for a long time. AFAIK they don't even have their own URLs.
- Aggregator items are specifically intended to be a jumping off point to the "real" work that exists somewhere else, and will contain the full article, along with its full title and list of authors.
- Changing the data type of a field is not as easy as it sounds, due to cross-database concerns as well as query performance concerns.
If we were talking about something like Feeds module, whose primary purpose is to import this external content as actual site data, then I think the truncation argument makes a lot less sense. But that's not what we're talking abut here.
So, in the interest of closing down a major bug whose consequence is a database failure, Committed and pushed to 7.x. Thanks!
Now, to 6.x. :)
Comment #29
Anonymous (not verified) CreditAttribution: Anonymous commentedI can understand the necessity for the current implementation, but I cannot help feel it is broken. Author credit can be a touchy subject in publishing. This is certainly the case when publishing content through a feed.
I don't have a whole lot of experience with the issue cue, so learning as I'm going (after having read the manual).
From what I understand the general consensus is that a follow up issue is necessary.
The follow up issue can be found here:
http://drupal.org/node/1811458
Please feel free to improve this issue where needed.
Have a nice day. :-)
Comment #30
xjmThanks @design_dolphin, that followup looks good. I'll link it in this issue's summary so folks investigating this issue can find it easily.
Comment #31
dcam CreditAttribution: dcam commentedShould this issue be backported to 6.x? The related 255-character title field did not get ported to 6.x.
To be honest, I want to make certain because the 6.x code is quite a bit different than it is in 7.x.
Comment #32
webchickActually, that's a good point. Can you test it and see if a feed with a super long author field causes the site to die with a horrible database failure like in the original post? I suspect not, since D6 just uses db_query() wrappers and not PDO. If not, I think we can move back to 7.x and fixed. Sorry, I was just following the tags here. :)
Comment #33
BerdirYes, I believe this error only happens in 7.x because we enable the mysql strict mode, which isn't on in 6.x so MySQL should actually just anything that's too long. On the other side, it might break PostgreSQL, which 6.x kinda supports.
Comment #34
David_Rothstein CreditAttribution: David_Rothstein commentedAdding this one to the release notes and CHANGELOG.txt.
Comment #34.0
David_Rothstein CreditAttribution: David_Rothstein commentedUpdated issue summary.
Comment #35
xjm