Defect #6734
Closed
robots.txt: disallow crawling issues list with a query string
Added by Ве Fio about 15 years ago. Updated over 5 years ago.
Description
When robots visit robots.txt, it tells them to disallow /projects/project/issues, but nowhere does it tell them to disallow /issues.
From looking at access logs, Googlebot (but all other bots do it too) was indexing /issues, and was indexing many useless pages, mainly like this:
66.249.68.115 - - [24/Oct/2010:07:05:00 -0700] "GET /issues?sort=assigned_to%2Cupdated_on%2Cstatus%3Adesc HTTP/1.1" 200 6254 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
There are a few hundred of those entries. I disallowed the sort parameter with Google Webmaster Tools, but that's just working around the issue for now.
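For anyone who wants to check their own access logs for the same pattern, here is a minimal sketch in Ruby. It assumes the Apache/nginx "combined" log format shown in the entry above; the helper names `parse_log_line` and `bot_query_hits` are made up for this illustration and are not part of Redmine.

```ruby
# Parse one line in Apache/nginx "combined" log format.
# Returns nil when the line does not match.
def parse_log_line(line)
  m = line.match(/\A(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"/)
  return nil unless m
  { ip: m[1], verb: m[2], path: m[3], status: m[4].to_i, agent: m[5] }
end

# Count requests per path (query string stripped) made by clients whose
# User-Agent contains agent_substring and whose URL carried a query string.
def bot_query_hits(lines, agent_substring)
  entries = lines.map { |line| parse_log_line(line) }.compact
  hits = entries.select { |e| e[:agent].include?(agent_substring) && e[:path].include?("?") }
  hits.group_by { |e| e[:path].split("?", 2).first }.transform_values(&:size)
end

sample = [
  '66.249.68.115 - - [24/Oct/2010:07:05:00 -0700] "GET /issues?sort=assigned_to%2Cupdated_on%2Cstatus%3Adesc HTTP/1.1" 200 6254 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
]
p bot_query_hits(sample, "Googlebot")
```

Feeding it `File.readlines("access.log")` instead of `sample` would give a per-path count of parameterized requests from a given bot (the file name is only an example).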
Files
6734.patch (1.14 KB) | Go MAEDA, 2020-06-28 09:38
Related issues
       Updated by Ве Fio about 15 years ago
Aww, excuse me for putting this in "search engine"; I just realized the category doesn't actually fit this bug report. >.<
       Updated by Ве Fio about 15 years ago
From searching the Google index, it also appears that they have not indexed /projects/project/issues, but they did index /projects/project/issues?tracker_id=1. Whether Googlebot follows robots.txt mostly but not completely, I do not know, but that page is indexed when it shouldn't be.
       Updated by Felix Schäfer about 15 years ago
    - Category deleted (Search engine)
Do you have any idea/example of how to prevent bots from navigating parameterized URLs?
       Updated by Ве Fio about 15 years ago
    Hi,
From what documentation I can get my hands on, this doesn't seem to be documented. I know that putting an entry like:
Disallow: /issues
Will work, however I am guessing that not disallowing that might have been intentional.
After searching a bit however, I came across a bit of code that is said to work, but I haven't been able to verify it yet.
Disallow: *sort=
Disallow: *&sort=
# This should disallow all URLs that have a query string; not necessarily a good idea, but it's just an example
Disallow: *?
Disallow: *sort=*
# If the above won't work (I heard that wildcards aren't supported), maybe something like:
Disallow: /issues?sort=
I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.
I tried to put in as many examples as I could. Like I said, I am unable to verify them though. Also, there may be more parameters that should be disallowed that I missed (or that weren't yet navigated). I'll keep on the lookout for more, and update this report as needed. Hope that helps!
Please note: As you can see when visiting Redmine's robots.txt, it lists some URLs to disallow. It appears that Googlebot disregards a lot of these even though it knows they're disallowed. I know this because Google Webmaster Tools showed me that the bot knows they're disallowed URLs, even though it visited them.
       Updated by Felix Schäfer about 15 years ago
    Ве Fio wrote:
From what documentation I can get my hands on, this doesn't seem to be documented. I know that putting an entry like:
Disallow: /issues
Will work, however I am guessing that not disallowing that might have been intentional.
I guess so too, the one rule about the issue list is to prevent bots indexing stuff twice.
After searching a bit however, I came across a bit of code that is said to work, but I haven't been able to verify it yet.
[...]I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.
Some sort of "official" documentation would be nice, or at least confirmation that this works. Care to share your sources?
I tried to put in as many examples as I could. Like I said, I am unable to verify them though. Also, there may be more parameters that should be disallowed that I missed (or that weren't yet navigated). I'll keep on the lookout for more, and update this report as needed. Hope that helps!
Please note: As you can see when visiting Redmine's robots.txt, it lists some URLs to disallow. It appears that Googlebot disregards a lot of these even though it knows they're disallowed. I know this because Google Webmaster Tools showed me that the bot knows they're disallowed URLs, even though it visited them.
That's a problem you should tackle with google, not with us ;-)
       Updated by Ве Fio about 15 years ago
This isn't "official", but I tried to find as many sources as I could; hopefully these will help. From what I read, the wildcards will work (for some bots), but are a bad idea because other bots won't follow them.
http://www.webmasterworld.com/forum93/823.htm
http://www.ihelpyou.com/forums/showthread.php?t=27849
http://www.velocityreviews.com/forums/t608728-robots-txt-and-regular-expressions.html
This is already following the standards, so it's a safe fallback:
Disallow: /issues?sort=
Conclusion: Wildcards are too risky, but we of course already know that the above will work normally as it conforms to the rules. It's up to you if you want to do something, or nothing. ;)
Official documentation: http://www.robotstxt.org/robotstxt.html
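To illustrate the plain prefix matching described in that documentation, here is a small Ruby sketch. It assumes a crawler compares each Disallow value against the start of the URL path plus query string; `blocked_by_prefix?` is a made-up name for this example, not any crawler's real API.

```ruby
# Prefix-only matching: a URL is blocked when its path (including the
# query string) starts with one of the Disallow values.
# An empty Disallow value blocks nothing.
def blocked_by_prefix?(rules, path)
  rules.any? { |rule| !rule.empty? && path.start_with?(rule) }
end

rules = ["/issues?sort="]

puts blocked_by_prefix?(rules, "/issues?sort=assigned_to%2Cupdated_on") # true
puts blocked_by_prefix?(rules, "/issues?page=2&sort=updated_on")        # false
puts blocked_by_prefix?(rules, "/issues/123")                           # false
```

Note the second case: with prefix matching, the rule only applies when sort is the first parameter in the query string.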
Felix Schäfer wrote:
That's a problem you should tackle with google, not with us ;-)
Oh, I was just noting that so that you guys know about it. Like a "warning" :)
       Updated by Ве Fio about 15 years ago
Oh, and if /issues?sort= isn't the only one that bots might follow (because there's other parameterized stuff on the page), it'd probably be good to put those in too. I don't know all of the possible parameters, but you guys should. :)
       Updated by Ве Fio almost 15 years ago
An alternative, and MUCH MUCH better solution, is to add a noindex meta tag to the pages that shouldn't be indexed. There are a lot of those on Redmine that robots.txt doesn't cover, and Google is going crazy indexing them.
<!-- tell robots not to index that page -->
<meta name="robots" content="noindex">
<!-- this page is the same as this other page (good for when /issues/21/?reply=2 is the same as /issues/21/) -->
<link rel="canonical" href="/">
I highly suggest this gets implemented as soon as possible. :)
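As a rough sketch of how the noindex idea could be decided per page, here is a hypothetical Ruby helper. It is not part of Redmine; the name `robots_meta_for` and the rule "any URL with a query string is a filtered or sorted view" are assumptions made for this example only. Keep in mind that a crawler can only see a noindex tag on pages it is allowed to fetch, so this approach does not combine with a robots.txt Disallow for the same URL.

```ruby
require "uri"

# Hypothetical helper, not part of Redmine: return a noindex meta tag for
# URLs that carry a query string, and nil for everything else.
# Assumption: a query string marks a filtered or sorted view.
def robots_meta_for(url)
  query = URI.parse(url).query
  return nil if query.nil? || query.empty?
  '<meta name="robots" content="noindex">'
end

puts robots_meta_for("/issues/21/?reply=2")
p robots_meta_for("/issues/21/")
```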
       Updated by Antoine Beaupré over 14 years ago
It seems like this could be easily fixed by the patch in #3754.
       Updated by Harald Welte over 6 years ago
We've just observed that this issue still exists in Redmine 3.4. I couldn't find any rationale in this issue for why the related patch was not merged at some point during the past 8 years.
       Updated by Eduardo Ramos over 5 years ago
    Harald Welte wrote:
We've just observed that this issue still exists in Redmine 3.4. I couldn't find any rationale in this issue for why the related patch was not merged at some point during the past 8 years.
Still failing in Redmine 4.1.1 stable, with similar GETs on issues.
I'm receiving requests from various bots, which exhaust my Raspberry Pi's CPU:
172.162.119.114.in-addr.arpa domain name pointer petalbot-114-119-162-172.aspiegel.com.
146.168.229.46.in-addr.arpa domain name pointer crawl18.bl.semrush.com.
...
       Updated by Go MAEDA over 5 years ago
    - Category set to SEO
- Target version set to Candidate for next minor release
Most people here think that the problem is that search engines index URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).
I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in www.redmine.org with Google).
The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..dbe9f04dd 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -1,4 +1,5 @@
 User-agent: *
+Disallow: /*?
 <% @projects.each do |project| -%>
 <%   [project, project.id].each do |p| -%>
 Disallow: <%= url_for(:controller => 'repositories', :action => :show, :id => p) %>
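To see which URLs a rule like `Disallow: /*?` would cover, here is a minimal Ruby sketch of wildcard matching. It assumes the wildcard extension that crawlers such as Googlebot support (and that RFC 9309 later described): `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names are invented for this illustration.

```ruby
# Convert a robots.txt path pattern into a Regexp.
# Assumes "*" matches any run of characters and a trailing "$" anchors
# the end of the URL; every other character is taken literally.
def robots_pattern_to_regexp(pattern)
  anchored = pattern.end_with?("$")
  body = anchored ? pattern[0...-1] : pattern
  source = body.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + source + (anchored ? "\\z" : ""))
end

def blocked?(pattern, path)
  robots_pattern_to_regexp(pattern).match?(path)
end

puts blocked?("/*?", "/issues?sort=updated_on") # true
puts blocked?("/*?", "/issues?page=2")          # true
puts blocked?("/*?", "/issues/123")             # false
```

Under these assumptions the rule leaves single issue pages crawlable, but it also covers paginated list URLs such as /issues?page=2.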
       Updated by Eduardo Ramos over 5 years ago
    Go MAEDA wrote:
Most people here think that the problem is that search engines index URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).
I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in www.redmine.org with Google).
The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.
[...]
Thank you. With that modification (Disallow: /*?) my Raspberry Pi is not so stressed (CPU is now under 15%; before it was about 90% due to crawler requests).
How could it be patched in a docker-compose setup? What I did was modify 'robots.text.erb' in the Redmine container and restart that container.
       Updated by Go MAEDA over 5 years ago
    - Target version changed from Candidate for next minor release to 4.0.8
Eduardo Ramos wrote:
Thank you. With that modification (Disallow: /*?) my Raspberry Pi is not so stressed (CPU is now under 15%; before it was about 90% due to crawler requests).
Thank you for testing the patch and for giving feedback. I am setting the target version to 4.0.8.
How could it be patched in a docker-compose setup? What I did was modify 'robots.text.erb' in the Redmine container and restart that container.
I don't know much about Docker. I suggest you ask questions on the forums.
       Updated by Go MAEDA over 5 years ago
Updated the patch. The previous patch posted in #6734#note-13 has a problem: it prevents crawlers from accessing "/issues?page=". This means that crawlers can get only the first page of the issues list and will not index issues on the second and later pages.
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..8c2732e00 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,6 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path) %>?sort=
+Disallow: <%= url_for(issues_path) %>?query_id=
+Disallow: <%= url_for(issues_path) %>?*set_filter=
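The effect of the three added rules can be sketched in Ruby as follows. This assumes `issues_path` renders as "/issues" (a Redmine installed under a sub-URI would produce a different prefix) and that the crawler treats `*` as "any run of characters"; `rule_matches?` and `blocked?` are names invented for this illustration.

```ruby
# Match one Disallow value against a URL path, treating "*" as a wildcard
# and every other character literally (prefix match from the start).
def rule_matches?(rule, path)
  source = rule.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  Regexp.new("\\A" + source).match?(path)
end

def blocked?(rules, path)
  rules.any? { |rule| rule_matches?(rule, path) }
end

# Assumption: issues_path renders as "/issues".
rules = ["/issues?sort=", "/issues?query_id=", "/issues?*set_filter="]

puts blocked?(rules, "/issues?sort=updated_on%3Adesc")  # true
puts blocked?(rules, "/issues?utf8=1&set_filter=1")     # true
puts blocked?(rules, "/issues?page=2")                  # false
puts blocked?(rules, "/issues?page=2&sort=updated_on")  # false
puts blocked?(rules, "/issues/123")                     # false
```

Under these assumptions, pagination and single issue pages stay crawlable. The sort and query_id rules only match when that parameter comes first in the query string, which is the gap the related Defect #38201 refers to.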
       Updated by Go MAEDA over 5 years ago
    - Related to Feature #31617: robots.txt: disallow crawling dynamically generated PDF documents added
       Updated by Eduardo Ramos over 5 years ago
    Go MAEDA wrote:
Updated the patch. The previous patch posted in #6734#note-13 has a problem: it prevents crawlers from accessing "/issues?page=". This means that crawlers can get only the first page of the issues list and will not index issues on the second and later pages.
[...]
Tested OK. CPU usage is even better. No bot activity is registered in the Redmine logs or in the nginx access logs for Redmine.
It could be a coincidence (no crawlers accessing right now), so I will keep monitoring it over the next few hours anyway.
Thank you.
       Updated by Go MAEDA over 5 years ago
    - Subject changed from Robots index /issues (which isn't disallowed in robots.txt) to robots.txt: disallow crawling issues list with a query string
       Updated by Go MAEDA over 5 years ago
    - Status changed from New to Resolved
- Assignee set to Go MAEDA
- Resolution set to Fixed
Committed the patch.
       Updated by Go MAEDA almost 3 years ago
    - Related to Defect #38201: Fix robots.txt to disallow issue lists with a sort or query_id parameter in any position added