Drupal generates a page at /filter/tips, and that page is indexed by search engines, which exposes it to the public as a destination on your website.
Currently, robots.txt is configured only to block sub-pages of /filter/tips/, but to block the tips page itself from indexing engines, the trailing slash must be removed. Patch #28, posted by Tor Arne Thune, addresses this issue by removing the trailing slash.
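The prefix matching described above can be reproduced offline. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed and the two rule sets are illustrative assumptions, and real crawlers such as Googlebot may differ in details from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, path: str) -> bool:
    """Return True if a generic crawler may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", path)


# Rule with the trailing slash, and the same rule with the slash removed
# (the change made by patch #28).
WITH_SLASH = "User-agent: *\nDisallow: /filter/tips/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /filter/tips\n"

for label, rules in (("with slash", WITH_SLASH), ("without slash", WITHOUT_SLASH)):
    for path in ("/filter/tips", "/filter/tips/1"):
        verdict = "allowed" if is_allowed(rules, path) else "blocked"
        print(f"{label}: {path} is {verdict}")
```

Because a Disallow value is matched as a path prefix, the slash-less rule covers both the page and its sub-pages, while the rule with the slash covers only the sub-pages.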
Consideration should be given to re-scoping and/or rolling this issue in with similar issues related to robots.txt.
Committed and pushed 4300e616cc to 8.6.x and 17c9a8a27a to 8.5.x. Thanks!
Comments
Comment #1
jensimmons commented:
Add this:
Disallow: /filter/tips
to line 32 of robots.txt. Like this:
Does someone want to make a patch?
Comment #2
BrockBoland commented:
Sure!
Comment #3
BrockBoland commented:
Aw, fer cryin - fixed attached.
Comment #4
Dave Reid commented:
We need to add the un-clean URL version of it as well. Note that the page is not a file, so it should go one section lower.
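As a rough illustration of the point about un-clean URLs: each disallowed path needs a second rule for the ?q= form. The sketch below generates both lines. The function name disallow_lines is hypothetical, and the /?q= form follows the rule quoted in comment #17 of this thread.

```python
def disallow_lines(path: str) -> list[str]:
    """Build Disallow rules for the clean and un-clean (?q=) form of a path."""
    path = path.strip("/")
    return [
        f"Disallow: /{path}",     # clean URL, e.g. /filter/tips
        f"Disallow: /?q={path}",  # un-clean URL, e.g. /?q=filter/tips
    ]


print("\n".join(disallow_lines("/filter/tips")))
```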
Comment #5
tim.plunkett commented:
Looks good.
Comment #6
Dave Reid commented:
And as always, fix in 8.x first, then backport easily.
Comment #7
tim.plunkett commented:
Whoops.
Comment #8
tim.plunkett commented:
Updated. Crazy cross-posts. It's what happens when Jen asks for Drupal things on Twitter.
Comment #9
BrockBoland commented:
I've read and understand the backport policy (http://drupal.org/node/767608), but what's the actual process for an issue like this? For a simple item like this, it makes sense that a single patch can be applied to D7 and D8, but in more complex cases where the patches differ, should a separate issue be spun off for the D7 version?
Apologies for being a newb - I haven't done any core patches before.
Comment #10
ksenzee commented:
No, normally it all stays in the same issue. It works fine for simple stuff like this but it's kind of a messy process for complicated issues.
Also, subscribing. I saw drupal.org/filter/tips in some Google results the other day and said huh what?
Comment #11
jensimmons commented:
Twitter FTW!!
Yeah, the CVS-centric workflow's been to update things in the dev version of Drupal (now D8), then backport to the current version (D7), and then the one-older version (D6). IMO, this workflow could/should/might change now that we have Git, and we can work with branches instead of patches.... but that's not happened yet. So meanwhile, we are following the same rules that were used two years ago when D6 was brand-new and D7 development had just opened. (Or was that three years ago?)
Issues like this one are the test. Super easy to understand. Super easy to write the code. Not much to debate.... now let's see how long it takes to get this into D7, with the crazy D8-first-rule. Especially since we don't have a D8 co-maintainer, and Angie (webchick) doesn't have commit access to D8. Will this be no-biggy? Or will it take months to fix? Our process post-switch-to-git is still evolving.
Meanwhile, welcome BrockBoland to core development! You've been awarded the "My First Drupal Core Patch" badge. :D YAY!
Comment #12
ksenzee commented:
I don't think the rule about committing to the newest version first is likely to change just because of git. The process is being discussed over at #1050616: Figure out backport workflow from Drupal 8 to Drupal 7.
Comment #13
ksenzee commented:
Oh, and this passed tests since I was last here, so RTBC.
Comment #14
Dave Reid commented:
+1 from me as well, although now I will no longer be able to google for sites that have their full HTML input filter on... which is a good thing!
Comment #15
webchick commented:
Makes sense to me.
Committed to 8.x and 7.x. Thanks!
Comment #16
pillarsdotnet commented:
Requested D6 backport:
Comment #17
Dave Reid commented:
Don't forget the Disallow: /?q=filter/tips
Comment #18
pillarsdotnet commented:
Oops.
Comment #19
pillarsdotnet commented:
(sigh) Probably better as one patch. Sorry for the noise.
Comment #20
Damien Tournoud commented:
Comment #21
juliangb commented:
This has been RTBC for 3 months.
I'm using this patch on my live sites and would greatly like for new D6 releases to include this as standard.
Is there anything stopping this from being committed?
Comment #22
Gábor Hojtsy commented:
Committed to 6.x too, thanks!
Comment #24
juliangb commented:
I'm now finding that Google is not blocking filter/tips because the line in robots.txt has a trailing slash.
We need to remove the slash to ensure that Google always knows to block this page.
Comment #25
pillarsdotnet commented:
The Redirect module has an option to remove trailing slashes.
Comment #26
juliangb commented:
Actually the Redirect module doesn't help in this instance.
The issue is that in the robots.txt the paths all have trailing slashes, which means that Google does not block any paths without the trailing slashes.
To ensure that it catches everything, we should include a version without the trailing slash in robots.txt.
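To see which rules would be affected by this suggestion, the Disallow lines can be scanned for paths that end in a slash and have no slash-less counterpart. This is only a sketch: the function name rules_missing_bare_form and the SAMPLE rules are made up for illustration and are not Drupal's actual robots.txt.

```python
def rules_missing_bare_form(robots_txt: str) -> list[str]:
    """List Disallow paths ending in '/' that have no slash-less twin."""
    paths = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return sorted(
        p for p in paths
        if p.endswith("/") and p != "/" and p.rstrip("/") not in paths
    )


# Hypothetical rules, used only to exercise the function.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /node/add
"""

print(rules_missing_bare_form(SAMPLE))
```

Note that adding the slash-less form widens the match: because rules are matched as prefixes, a rule for /admin would also cover any other path that merely starts with /admin.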
Comment #27
pillarsdotnet commented:
Ah. That explains the module which *adds* trailing slashes to everything.
Write a patch, please?
Comment #28
Tor Arne Thune commented:
juliangb is right. It should not have a trailing slash. Attaching a patch that corrects it. As for the suggestion to add a non-trailing-slash version of paths with a trailing slash, I feel that it deserves its own issue.
Comment #29
Tor Arne Thune commented:
Uploading the D7 backport.
Comment #30
juliangb commented:
Thanks for posting the patch, Tor Arne - a good reminder for me seeing this pop up in my issues tracker.
I disagree with fixing the other links in a separate issue though, hence the "needs work" for now. This would leave a slightly "hacked" state until the other issue was fixed.
Comment #31
GaëlG commented:
I'm on it.
Comment #32
GaëlG commented:
Here's a new patch. I checked in the router table to see if the path can have subpaths. If so, we need to list both formats (end slashes and no end slashes).
/search/ needs indeed to be listed to avoid search results indexing, but it seems not bad to me that the search landing page can be indexed. That's why I did not add /search.
Comment #33
oenie commented:
Fixing the Amsterdam sprint tag to amsterdam2014.
Comment #34
ronaldmulero commented:
Related: #180379 - Fixing Robots.txt
Comment #35
cilefen commented:
The scope of this issue is /filter/tips only and that is all that should be fixed here, considering #180379: Fix path matching in robots.txt exists. So, proceed from #28.
Comment #36
ericjenkins (at Digital Bridge Solutions) commented:
I'm at a sprint in Los Angeles. I'm going to check that the patch in #28 still applies to D8 core.
Comment #37
ericjenkins (at Digital Bridge Solutions) commented:
Patch #28 still applies successfully to robots.txt. I will seek a way to test it against an indexing validator.
Comment #38
ericjenkins (at Digital Bridge Solutions) commented:
I'm hiding Patch #32 because it was beyond the scope of this ticket.
Comment #39
ericjenkins (at Digital Bridge Solutions) commented:
I've tested the indexing of filter/tips using a personal development machine with the Google Webmaster Tools robots.txt tester. I confirmed that, prior to applying Patch #28, the tips page was indexed by Google. After applying Patch #28, the tips page is no longer indexed by Google. This validates the removal of the trailing slash on filter/tips.
Comment #40
YesCT commented:
Seems like the problem is that we have some listings where we intended to disallow, but they are not being disallowed because of an erroneous trailing slash.
Do that same test in google webmaster tools on node/add to see. (for example)
If so, I would suggest retitling and rescoping this issue to just address that problem. #180379: Fix path matching in robots.txt might be about a variety of problems.
(also, an issue summary update would be nice, explaining the back and forth of the direction of the issue)
Depending on the result, maybe add the novice tag back, with explicit next steps.
Comment #41
ericjenkins (at Digital Bridge Solutions) commented:
I adjusted Add Node permissions on my Drupal 8 test site to allow anonymous browsing to node/add and also to node/add/article. Here are the results of my findings from Webmaster Tools index testing of node/add/, with and without the trailing slash in robots.txt:
Trailing slash:
Disallow: /node/add/
Disallow: /index.php/node/add/
The node/add page is indexed, but sub-URLs of node/add are blocked.
No trailing slash:
Disallow: /node/add
Disallow: /index.php/node/add
The node/add page is blocked, and sub-URLs of node/add are blocked.
Comment #42
ericjenkins (at Digital Bridge Solutions) commented:
Comment #43
ericjenkins (at Digital Bridge Solutions) commented:
Comment #44
mgifford commented:
Needs re-roll.
Comment #45
opdavies commented:
@mgifford: Which patch and branch are you testing with? The patch in #28 applies cleanly to both 8.0.x and 8.1.x.
Comment #46
mgifford commented:
I was applying it to SimplyTest.me: https://simplytest.me/project/drupal/8.0.x?patch[]=https://www.drupal.or...
Comment #47
opdavies commented:
It looks like it's trying to apply a Drupal 7 patch to 8.0.x.
Comment #48
mgifford commented:
My bad... I see what I did wrong. My main goal was looking at the bots not being able to test the patches.
I'm just going to re-upload the patch from #28.
Comment #58
FiNeX (as a volunteer) commented:
Hi, will this patch be included in the next Drupal release? Thanks!
Comment #59
FiNeX (as a volunteer) commented:
In a multilanguage environment the path could be in the following form: /LANGCODE/filter/tips. Do we need to manually patch robots.txt in order to Disallow all those pages?
Comment #60
Pancho commented:
The followup still isn’t committed.
Very straightforward, #48 does the job: the trailing slash must go.
Comment #62
alexpott commented:
Re #59: this is true for all of the things listed in robots.txt, and as such we need a general solution. This patch does not make the situation worse. There are other issues around this topic. For example, doing something like #1032234: Use Robots Meta Tag rather than robots.txt when possible would be a better solution.
Let's proceed with this small improvement.
Credit is a bit of a mess for this issue, therefore just going with everyone who added a file since the last commit.
Committed and pushed 4300e616cc to 8.6.x and 17c9a8a27a to 8.5.x. Thanks!
Drupal 7 backports are now filed as separate issues linked to this one.
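Until a general solution to the language-prefix question from #59 exists, the extra rules could be generated rather than typed by hand. A minimal sketch, assuming URL language prefixes of the form /LANGCODE/...; the function name and the language codes are hypothetical, and nothing here is part of the committed patch.

```python
def prefixed_disallow_lines(path: str, langcodes: list[str]) -> list[str]:
    """Build Disallow rules for a path and its language-prefixed variants."""
    path = "/" + path.strip("/")
    lines = [f"Disallow: {path}"]
    lines += [f"Disallow: /{code}{path}" for code in langcodes]
    return lines


# 'fr' and 'de' are placeholder language codes, not taken from the issue.
print("\n".join(prefixed_disallow_lines("filter/tips", ["fr", "de"])))
```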
Comment #65
Pancho commented:
Another possible followup: #2581637: robots.txt paths incorrect