Allow aggregator feed URLs longer than 255 characters
pcambra - February 5, 2008 - 10:21
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | aggregator.module |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Dave Reid |
| Status: | needs review |
| Issue tags: | needs backport to D6, Novice |
Description
Hi
Recently I've experienced problems dealing with some long feeds which exceeded the VARCHAR 255 limitation of the field 'link' of aggregation_item table.
I've changed this datatype for TEXT and i have it working.
Another solution would be changing to VARCHAR(65535), but this is only available in MySQL 5.0.3 and older. (http://dev.mysql.com/doc/refman/5.0/en/char.html)
I suggest to change this data type for avoiding limitation issues
Regards
Pedro

#1
Moved to 7.x.
Seriously though, are insane URLs worth changing the data structure like that? This sets a precedent - we have URLs stored elsewhere, and email addresses can also get pretty long. Turning a VARCHAR into a TEXT could be a big deal for high-traffic sites.
#2
Well, then i propose 2 alternatives
-> Detect "insane" URLs (I think this is a must, to avoid getting mad looking for the error), and correct them, maybe with a tinyurl function, or letting the user decide if he wants to work with the field in long or short format through the settings (warning of the lost of data and the performance impact, of course)
-> If Drupal 7 won't work over MySQL 4.X anymore (i am not sure), letting the user choose the max length of the field, keeping it as a VARCHAR
I found myself in this error and I don't decide the length of the external URL, even they have an absurd length!
Regards and Thanks
Pedro
#3
Oh. That's interesting, MySQL 5.x allows VARCHAR up to 65,536. I must confess that the VARCHAR(255) concept is still deeply ingrained in me. I don't know how long we have to support MySQL 4. Drupal 7 will enforce PHP5, but the database?
What's the limit on Postgre? Perhaps a combination of VARCHAR(512) and intelligent filtering could do the trick. I admit that silently truncating on database insert is not the best way to handle such things.
If we have to keep MySQL 4 for the forseeable future, then perhaps only the warning or a validation function could be sufficient. Refusing to handle longer URLs is still better than silently failing.
#4
This is a novice task.
#5
#3: we use MySQL TEXT with a 255 character index in feedapi for sites with hundreds of thousands of feed items. This is doable (see http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/feedapi/fee...)
There are a bunch of URLs out there that are longer than 255 characters (and some that are longer than 512 for that matter). There is no specification for the length of a URL.
I'd say let's use TEXT with an index limited to the first 255 characters (I'm thinking in MySQL world, I hope this is supported by Drupal's DB layer.)
#6
Begin to work on this.
I am comparing if is better use TEXT (max 65535 characters) or a 65535 length VARCHAR (new in MySQL5), but I've seen that VARCHAR(65535) its only available from 5.0.3. Not sure if postgre supports 'huge' VARCHAR so I think that link and url fields will be modified to get a TEXT field.
aggregator_feed contains these fields: url (indexed) & link (not indexed) for storing urls
aggregator_item contains link field (not indexed)
The three store varchar type with 255 length, I think that all of them should be converted to TEXT type.
link field from aggregator_feed table is not invoked in any sql condition, so I don't think it should be indexed.
url field from aggregator_feed table is currently indexed but it is only invoked as a sql condition in one admin process (aggregator.admin.inc) so I am not sure if this index provides benefits.
link field from aggregator_item is not indexed but it is used in a WHERE clause by hook_aggregator_process, maybe this should be indexed.
I'll take a look to indexes for text columns, as far as I know, you must specify a index lenght when indexing MySQL text columns, I'll take a look to drupal db layer & postgre specs.
#7
Let's also add indexes to URL/guid - currently aggregator_item does not have any.
#8
Subscribe. We should really fix this and backport. technically URLS have no length limit. Practically I have heard limits of 2,000 characters in some layers of internet technology. We should certainly be closer to 2,000 than 255.
Regarding data types - that is handled somewhat automatically by the schema API (see the data types). I think this should use the "text" type which stores 16KB worth of characters. Should be plenty.
#9
Subscribing.
#10
Drupal 7 patch ready for review.
#11
The last submitted patch failed testing.
#12
It seems the various CRUD functions won't work now.
#13
Aha! I can see the problem.
'not null' => TRUE,- 'default' => '',
'description' => 'URL to the feed.',
),
'refresh' => array(
@@ -143,10 +141,8 @@ function aggregator_schema() {
'description' => 'Last time feed was checked for new items, as Unix timestamp.',
),
'link' => array(
- 'type' => 'varchar',
- 'length' => 255,
+ 'type' => 'text',
'not null' => TRUE,
- 'default' => '',
You removed the empty string default value, but you left in the NOT NULL constraint.
The link is an optional field, and need not be passed. But with the updated schema, omitting this value will violate the not-null constraint.
If you're going to remove the empty default, you should probably remove the not-null as well.
#14
Ooops, I was working on the issue but I have patched aggregator.install aggregator.module and aggregator.admin.inc files.
I don't know how to mix both patches, I am attaching mine.
#15
So, the patches are similar, but not the same.
pcambras misses a space in the url,255 near line 191, adds an index to link (also missing a space after comma), and as stated modifies aggregator.module or aggregator.admin.inc.
Dave Reid's handles updating the tables by adding the indexes that are in the schema and adds an index on guid.
Aside from the whitespace fixes, I think they can just be merged and we'll be set, right?
#16
I've compared Dave Reid's patch with mine and I think I've merged our changes into this new patch.
I've included the guid index and the update in the install file and the space missing that greggles pointed out is fixed now.
#17
Reviewed and tested on PHP 5.2.6 and MySQL 5.0.77
This is coming together now. Nice work. Minor fixes:
- minor code style fixes (mostly missing spaces)
- adjusted comment to latest update to reflect that there's also an index on aggregator_item.guid being created
- aggregator_remove() should not reset link nor url, it needs to use db_update() instead of db_merge() though - otherwise this patch breaks remove functionality.
- description is not necessary in db_change_field() - removed
pcambra: the patch in #16 was created on modules/aggregator - please create future patches from Drupal's root.
There are two outstanding issues from my point of view:
1) Looking at this again, we should convert the guid field to text as well. Many times feed producers use the feed item's URL for the guid and there is no specified character limit to it: http://cyber.law.harvard.edu/rss/rss.html
2) Needs testing on pgsql.
#18
Re-rolled patch for head containing improvements from the previous few versions.
In summary:
- Converts {aggregator}_feed.url and .link to text fields since they are URLs and shouldn't be limited to 255 characters.
- Converts {aggregator}_item.link and guid to fext fields for the same reason.
- Drops the {aggregator}_item.fid index and uses fid and guid instead, since there are a couple queries that use
WHERE fid = :fid AND guid = :guid. See http://api.drupal.org/api/function/aggregator_aggregator_process/7.- Removes the #maxlength limitation on the feed URL textfield (tested) instead of using a hardcoded limit of 65535
- Also ran a coder check to be sure there weren't any major things wrong.
I tested the upgrade path and it worked just fine (also found a bug in the previous aggregator update function not returning an array). I'm currently being bitten by the 'cache_update table does not exist' testing error so I'll let the bot test this one.
#19
After review of http://api.drupal.org/api/function/aggregator_aggregator_process/7 we do not need to have an index on guid or link. That is the only place where we get those fields in a WHERE condition and it's simply used to find duplicate feeds. If we really wanted to, we'd have to add an index on title as well since that is used in a third query, but I don't think that we should add those indexes here. If it's necessary we should have a separate issue for revising aggregator_aggregator_process().
EDIT: Summary (again)
- Converts {aggregator}_feed.url and .link to text fields since they are URLs and shouldn't be limited to 255 characters.
- Converts {aggregator}_item.link and guid to fext fields for the same reason.
- Removes the #maxlength limitation on the feed URL textfield (tested) instead of using a hardcoded limit of 65535 or 255.
- Changes aggregator_remove() to use db_update instead of db_merge because the feed item should be in the database or there are problems. Remove the description and image from the update since those things aren't going to be changing.
- Also ran a coder check to be sure there weren't any major things wrong.
#20
Tested this patch after running into the issue on a d7 sandbox site and it works well. Coding style also looks good. There is just one problem. When calling db_change_field() you must include the field description in the schema definition. Otherwise, the column comment will be lost.
#21
Ran into this testing the opml import function on aggregator. Tried http://opml.tagjag.com/all and was mystified that it thought that url was longer than 255 characters. Of course it isn't and it actually triggering this error on the feeds contained in that opml feed but wow...
So yeah, I'd say this is actually a bug more than a feature request. Its starting to cause weird problems. See #70201: Aggregator: Feed URL Length Is A Limitation where it actually is a closed(?) bug.
Reroll sync with core and update with field descriptions requested by mfb. Any chance of still getting this in to 7?
#22
The last submitted patch failed testing.
#23
Apparently needed another reroll.
One question I have about this patch: we now allow feed URLs > 255 characters. However, there is a unique key on the first 255 chars of the URL column. What if you want two distinct URLs that have identical first 255 chars?
#24
I don't have a good answer for that.
#25
The unique key length is about the index length on the column. If you have X URLs that are identical in their first 255 characters, comparisons between them will get slow. This is an unlikely case.
1)
#maxlength => NULL - I think this does not break anything but doesn't have a precedence in Drupal either. I'd just kill this line.
2)
There are changes in aggregator.install that don't seem related.
#26
If you don't manually specify
#maxlength => NULL, the textfield will get a limit of 64 characters. If we're changing that field to a text, we shouldn't limit the input.#27
"If you don't manually specify #maxlength => NULL, the textfield will get a limit of 64 characters."
- wasn't aware of that. Thanks for the clarification.
#28
Yeah, I think the only really odd change is the $ret change in update_7001. If that where removed I think we can agree to RTBC this.
#29
Definitely needs a reroll, $ret is gone now.
#30
Easy enough.
#31
#32
Crash on Uncaught PDO exception. Very close, maybe test a bit in the UI before submitting next patch. See below:
Example very long URL (476 chars)
http://www.google.com/search?hl=en&client=firefox-a&rls=com.ubuntu%3Aen-...
How I tested:
- mysql
- apply patch
- enable module
- clicked help
- started to test all links on that page
- CRUD a long link.
- add through UI succeeded - http://d7p1.dev/admin/config/services/aggregator/add/feed
- add duplicate - appropriate error
- add similar long (append "test2" to above URL to test 255 index) - CRASH
PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'http://www.google.com/search?hl=en&client=firefox-a&rls=com.ubun' for key 3: INSERT INTO {aggregator_feed} (title, url, refresh, block, description, image, link) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4, :db_insert_placeholder_5, :db_insert_placeholder_6); Array ( [:db_insert_placeholder_0] => test2 [:db_insert_placeholder_1] => http://www.google.com/search?hl=en&client=firefox-a&rls=com.ubuntu%3Aen-... [:db_insert_placeholder_2] => 3600 [:db_insert_placeholder_3] => 5 [:db_insert_placeholder_4] => [:db_insert_placeholder_5] => [:db_insert_placeholder_6] => ) in aggregator_save_feed() (line 522 of /home/drupaldev/websites/d7p1.dev/modules/aggregator/aggregator.module).
NOT TESTED:
- Postgres
- Update script
NOTES:
Postgres 8.3 can have long varchar: "In any case, the longest possible character string that can be stored is about 1 GB. "
- http://www.postgresql.org/docs/8.3/static/datatype-character.html
#33
#34
The problem is the unique flag on url. This patch replaces it with an index.