Since the img source filter was deployed, a lot of images are now broken.

Attached are two text files with lists of nodes and comments that probably need to be updated.

I just ran this analysis:

select count(1), n.type
from node n
inner join node_revisions nr on n.vid = nr.vid
where nr.body like '%src="http://%'
group by n.type;

+----------+-----------------+
| count(1) | type            |
+----------+-----------------+
|      452 | book            |
|     1251 | forum           |
|        3 | page            |
|     1663 | project_issue   |
|      190 | project_project |
|        6 | project_release |
|        6 | story           |
+----------+-----------------+

There are ~4,257 comments that need to be updated, based on a similar query. Note that the query has false positives, such as http://drupal.org/node/5890#comment-8871, which has an iframe src on http, but it's inside a code tag and so not actually a problem.
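
One way to trim false positives like that before anyone reviews the list by hand is to drop code blocks from each body and only then look for an http src. A minimal Python sketch, assuming the bodies have been exported from the database as plain strings; `has_http_src` is a hypothetical helper, not part of any Drupal tooling, and the regular expressions are a heuristic rather than a real HTML parser:

```python
import re

# Matches <code>...</code> blocks, including attributes and embedded newlines.
CODE_BLOCK = re.compile(r"<code\b[^>]*>.*?</code>", re.IGNORECASE | re.DOTALL)
# The same substring the SQL LIKE pattern looks for.
HTTP_SRC = re.compile(r'src="http://', re.IGNORECASE)


def has_http_src(body: str) -> bool:
    """Return True if body contains src="http:// outside any <code> block."""
    visible = CODE_BLOCK.sub("", body)
    return HTTP_SRC.search(visible) is not None
```

Bodies for which this returns False could be dropped from the list; anything it keeps still needs a human look.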

I suggest anyone actually doing this work download the files and turn the cid and nid values into links with something like awk in their terminal:

awk '{print "http://drupal.org/comment/edit/" $2}' cids_to_update.txt
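
For anyone who would rather not use awk, roughly the same thing can be sketched in Python. This assumes the decompressed files are whitespace-separated, with the cid in the second column (as the awk command implies) and the nid in the first column of the node file; the column positions and the `edit_links` helper are assumptions for illustration. The node edit path follows the usual Drupal `node/NID/edit` pattern.

```python
NODE_EDIT = "http://drupal.org/node/{}/edit"
COMMENT_EDIT = "http://drupal.org/comment/edit/{}"


def edit_links(lines, template, column):
    """Yield one edit URL per line, taking the ID from the 0-based column.

    Lines are split on whitespace, like awk's default field splitting.
    Lines that are too short or have a non-numeric ID are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) > column and fields[column].isdigit():
            yield template.format(fields[column])


# Example with one made-up line in the assumed node-file layout.
sample = ["223\tPowered By Link\t325\tstory"]
print(list(edit_links(sample, NODE_EDIT, 0)))
# prints ['http://drupal.org/node/223/edit']
```

For the comment file, the equivalent call would be `edit_links(open("cids_to_update.txt"), COMMENT_EDIT, 1)`.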

File                      Size       Author
cids_to_update.txt.bz2    26.15 KB   greggles
nids_to_update.txt.bz2    76.28 KB   greggles

Comments

dww’s picture

I thought none of those would be broken until their input format was changed from Documentation to Filtered HTML. See #1275424: Deal with documentation role and documentation input format for more...

greggles’s picture

http://drupal.org/node/7774#comment-12824

The scenario is comments/nodes set to Filtered HTML, where the img tag was previously being stripped out and is now being replaced with the red X.

dww’s picture

Gotcha. Although your LIKE is going to have a lot of false positives with people who just cut+paste'd the file attachment URL (which gives you absolute URLs to http://drupal.org/files/...) into an img tag. So I think the problem is less severe than it seems from the stats in the summary.
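
To get a feel for how many matches fall into that bucket, the src URLs could be split by host. A rough Python sketch, assuming the bodies are available as strings; `external_srcs` and the `LOCAL_HOSTS` set are hypothetical names, and whether the filter actually leaves drupal.org URLs alone is not something this code establishes:

```python
import re
from urllib.parse import urlparse

# Captures the URL of every src="http://..." attribute.
SRC_URL = re.compile(r'src="(http://[^"]+)"', re.IGNORECASE)
# Hosts treated as local to the site; an assumption for illustration.
LOCAL_HOSTS = {"drupal.org", "www.drupal.org"}


def external_srcs(body):
    """Return the http src URLs in body whose host is not in LOCAL_HOSTS."""
    urls = SRC_URL.findall(body)
    return [u for u in urls if (urlparse(u).hostname or "") not in LOCAL_HOSTS]
```

A body for which `external_srcs` returns an empty list only references drupal.org-hosted files and may belong in the false-positive pile.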

jhodgdon’s picture

Issue tags: +docs infrastructure

Does the filter allow those kinds of URLs (#3)?

arianek’s picture

sub

vegantriathlete’s picture

So I started to take a look at this, and I see that we will need a much better way of coordinating. Here are the results for the first 15 items. My comments are at the end of each line, after the ->

223	Powered By Link	325	story -> Access Denied
1901	Import your LJ through an IFRAME held in a book page or similar	11	book -> Legitimately blocked
2344	image filter order problem	3056	project_issue -> false positive: it's inside <code>
3281	Project	46549	project_project -> already fixed?
3331	Adc	4179	project_project -> Can't access page. Besides this is from 4.7 and it never got out of dev
3351	Sunflower	4179	project_project -> Can't access page. Besides this is from 4.7 and it never got out of dev
3826	Taxonomy Browser	101412	project_project -> This is truly an external image, which happens to be broken. And, interestingly enough it is not being blocked.
5026	New Module: Wordfilter for filtering posted content	5378	forum -> false positive: This is an <a href src= . . .> not an image tag.
5841	e-Commerce	959	project_project -> Can't edit page
6754	Node section	4179	book -> false positive: it's inside <code>
7233	Comment	4179	book -> false positive: it's inside <code>
8121	Taxonomy Image	101412	project_project -> This is truly an external image, which happens to be broken. And, interestingly enough it is not being blocked.
8334	YA image.module problem	9604	forum -> false positive: it's inside <code>
9070	Buttons	35733	book -> already fixed?

The point is that I went 0 / 15. How do we coordinate so that somebody else doesn't do the exact same thing? How can we let others know which items have already been addressed?

Maybe as we work on the items, we could edit the original issue, remove the old version and attach a new version of the file [with the completed items removed]. Edit: NO GO on the idea of changing the files attached to the original issue. We'd have to attach the new version to our reply.

I suppose it would also be helpful if we agree on some type of format for including notes about things like nids 223, 3331 and 3351 which need further follow up by someone with the appropriate permission. I would also recommend keeping this a simple text file instead of compressing it with any utility to avoid the potential that somebody wouldn't be able to open it [I understand the desire to save space].
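
As one possible format, the annotated lines above could stay exactly as they are: the original tab-separated columns, then " -> " and a free-text note. A small Python sketch of reading that back, assuming the columns are nid, title, uid and type separated by tabs; `ReviewedItem` and `parse_line` are hypothetical names:

```python
from typing import NamedTuple, Optional


class ReviewedItem(NamedTuple):
    nid: int
    title: str
    uid: int
    node_type: str
    note: Optional[str]  # None if the line has not been reviewed yet


def parse_line(line: str) -> ReviewedItem:
    """Split 'nid<TAB>title<TAB>uid<TAB>type -> note' into its parts."""
    record, sep, note = line.rstrip("\n").partition(" -> ")
    nid, title, uid, node_type = record.split("\t")
    return ReviewedItem(int(nid), title, int(uid), node_type,
                        note.strip() if sep else None)
```

Lines without a note are the ones nobody has looked at yet, so a plain text file in this shape would double as the to-do list.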

greggles’s picture

I thought of querying the cache_filter table to find problems but that doesn't work b/c drupal.org uses memcache.

Thanks for your work, vegantriathlete.

vegantriathlete’s picture

@greggles: I'm happy to go through more. I just want to make sure that we don't duplicate effort. Any thoughts on implementing a protocol?

jhodgdon’s picture

To avoid duplication of effort, can I suggest that anyone who wants to take on a chunk of the file just say something like "I'm taking nodes 123 - 4567" in a comment? That is probably good enough.

vegantriathlete’s picture

Sounds good! Leave it to a doc person to come up with a simple solution. We code monkeys have a way of over-complicating things: if it doesn't involve programming, then it isn't any fun ;-)

Would it work well if we leave another comment with a status update when we've finished? Probably it would be best just to comment on the "exceptions" rather than on things we were able to resolve.

Will you take a look at what I've done above and give your thoughts about how I would have reported back on those 15?

Then I'll do another go 'round with nodes 10235 - 14886.

jhodgdon’s picture

I put myself in the category of "code monkey" by the way. But I also am a project manager, freelance site builder, doc writer, and any number of other categories. And I'm fond of non-tech simple solutions where possible, and have managed a number of "divide and conquer meta issues". :)

jhodgdon’s picture

Bump. Do we need to turn this into a meta-issue and organize a sprint on it? Are we agreed that this has to be a manual fix-it process, or is there something we can automate?

jhodgdon’s picture

Looking at this again... it's been a while.

So it looks like we are just talking about pages with Filtered HTML format that contain IMG tags.

Previously the images were completely filtered out. Now they are showing up as big red X's.

I guess I am not sure why this is a huge problem. If people see the page with the problem, they should have an Edit button and can fix it. Can't we just let that happen organically? We don't have plans to do any wholesale "change everything that was docs format to filtered HTML" or anything...

I'm inclined to say this is "won't fix".

webchick’s picture

Created #1335904: Proxy external images which could help solve this issue.

killes@www.drop.org’s picture

Status: Active » Closed (won't fix)

Yeah, I am very much with jhodgdon. People should file a webmaster issue if they can't fix it.