Defect #6734: robots.txt: disallow crawling issues list with a query string - Redmine

Actions

Copy link

Defect #6734

closed

robots.txt: disallow crawling issues list with a query string

Added by Ве Fio over 14 years ago. Updated about 5 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Go MAEDA

Category:

SEO

Target version:

4.0.8

Start date:

2010-10-24

Due date:

% Done:

Estimated time:

Resolution:

Fixed

Affected version:

1.0.2

Description

When robots visit robots.txt, it tells them to disallow /projects/project/issues, but nowhere does it tell it to disallow /issues

From looking at access logs, Googlebot (but all other bots do it to) was indexing /issues, and was indexing many useless pages, mainly like this:

66.249.68.115 - - [24/Oct/2010:07:05:00 -0700] "GET /issues?sort=assigned_to%2Cupdated_on%2Cstatus%3Adesc HTTP/1.1" 200 6254 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

There are about a few hundred of those entries. I disallowed the sort parameter with Google Webmaster Tools, but that's just working around the issue for now.

Files

6734.patch (1.14 KB) 6734.patch

Go MAEDA, 2020-06-28 09:38

Related issues

Actions

Copy link

Updated by Ве Fio over 14 years ago

Aww, excuse me for putting this in "search engine", just realized the category doesn't actually fit this but report. >.<

Actions

Copy link

Updated by Ве Fio over 14 years ago

From searching the Google index, it also appears that they have not indexed /projects/project/issues, but they did index /projects/project/issues?tracker_id=1, whether Googlebot is following the robots.txt mostly but not completely I do not know, but that page is indexed regardless, where it shouldn't be.

Actions

Copy link

Updated by Ве Fio over 14 years ago

You can just ignore comment 2.

Actions

Copy link

Updated by Felix Schäfer over 14 years ago

Category deleted (~~Search engine~~)

Do you have any idea/example on how to disable bots to navigate parametrized URLs?

Actions

Copy link

Updated by Ве Fio over 14 years ago

Hi,

From what documentation I can get my hands on, this doesn't seem to be documented. I know that putting an entry like:
Disallow: /issues

Will work, however I am guessing that not disallowing that might have been intentional.

After searching a bit however, I came across a bit of code that is said to work, but I haven't been able to verify it yet.

Disallow: *sort=
Disallow: *&sort=
Disallow: *? // This should disallow all URL's that request something, not necessarily a good idea, but it's just an example
Disallow: *sort=*
// if above's won't work, I heard that wildcards aren't supported, so maybe something like..
Disallow: /issues?sort=

I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.

I tried to put in as many examples as I could. Like I said, I couldn't and am unable to verify them though. Also, there may be more parameters that should be disallowed, but I missed (or they weren't yet navigated). I'll keep on the lookout for more, and update this report as needed. Hope that helps!

Please note: As you can see when visiting Redmine's robots.txt, it states some URL's to disallow. It appears that Googlebot disregards a lot of these even though it knows they're disallowed. I know this, because using Google Webmaster Tools, it showed me that the bot knows that they're disallowed URL's, even though it visited them.

Actions

Copy link

Updated by Felix Schäfer over 14 years ago

Ве Fio wrote:

From what documentation I can get my hands on, this doesn't seem to be documented. I know that putting an entry like:
Disallow: /issues

Will work, however I am guessing that not disallowing that might have been intentional.

I guess so too, the one rule about the issue list is to prevent bots indexing stuff twice.

After searching a bit however, I came across a bit of code that is said to work, but I haven't been able to verify it yet.
[...]

I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.

So sort of "official" documentation would be nice, or at least confirmation that this works. Care to share your sources?

I tried to put in as many examples as I could. Like I said, I couldn't and am unable to verify them though. Also, there may be more parameters that should be disallowed, but I missed (or they weren't yet navigated). I'll keep on the lookout for more, and update this report as needed. Hope that helps!

Please note: As you can see when visiting Redmine's robots.txt, it states some URL's to disallow. It appears that Googlebot disregards a lot of these even though it knows they're disallowed. I know this, because using Google Webmaster Tools, it showed me that the bot knows that they're disallowed URL's, even though it visited them.

That's a problem you should tackle with google, not with us ;-)

Actions

Copy link

Updated by Ве Fio over 14 years ago

This isn't "official", but I tried to find as many sources as I could, hopefully these'll help out. From what I read, the wildcards will work (for some bots), but are a bad idea because others won't follow it.
http://www.webmasterworld.com/forum93/823.htm
http://www.ihelpyou.com/forums/showthread.php?t=27849
http://www.velocityreviews.com/forums/t608728-robots-txt-and-regular-expressions.html

This is already following the standards, so it's a safe fallback:

Disallow: /issues?sort=

Conclusion: Wildcards are too risky, but we of course already know that the above will work normally as it conforms to the rules. It's up to you if you want to do something, or nothing. ;)
Official documentation: http://www.robotstxt.org/robotstxt.html

Felix Schäfer wrote:

That's a problem you should tackle with google, not with us ;-)

Oh, I was just noting that so that you guys know about it. Like a "warning" :)

Actions

Copy link

Updated by Ве Fio over 14 years ago

Oh, and if /issues?sort= isn't the only one that bots might follow (because there's other parameterized stuff on the page), I suppose it'd probably be good to maybe put those in. I don't know all of the possible parameters, but you guys should. :)

Actions

Copy link

Updated by Ве Fio over 14 years ago

An alternative, and MUCH MUCH better solution, is to add a noindex meta tag to the pages that shouldn't be indexed, which there are a lot of those on Redmine that robots.txt doesn't cover, and Google is going crazy indexing them.

# tell robots not to index that page
<meta name="robots" content="noindex">

# this page is the same as this other page (good for when /issues/21/?reply=2 is the same as /issues/21/)
<link rel="canonical" href="/">

I highly suggest this gets implemented as soon as possible. :)

Actions

Copy link

#10

Updated by Antoine Beaupré about 14 years ago

It seems like this could be easily fixed by the patch is #3754.

Actions

Copy link

#11

Updated by Harald Welte about 6 years ago

We've just observed that this issue still exists in redmine 3.4. I couldn't find any rationale here in this issue why the related patch was not merged during the past 8 years at some point?

Actions

Copy link

#12

Updated by Eduardo Ramos about 5 years ago

Harald Welte wrote:

We've just observed that this issue still exists in redmine 3.4. I couldn't find any rationale here in this issue why the related patch was not merged during the past 8 years at some point?

Still failing in redmine 4.1.1 stable, similar GETs on issues.
Receiving requests from various bots which exhaust my raspberry cpu:

172.162.119.114.in-addr.arpa domain name pointer petalbot-114-119-162-172.aspiegel.com.
146.168.229.46.in-addr.arpa domain name pointer crawl18.bl.semrush.com.
...

Actions

Copy link

#13

Updated by Go MAEDA about 5 years ago

Category set to SEO
Target version set to Candidate for next minor release

Most people here think that the problem is that search engines indexes URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).

I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in www.redmine.org with Google).

The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.

diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..dbe9f04dd 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -1,4 +1,5 @@
 User-agent: *
+Disallow: /*?
 <% @projects.each do |project| -%>
 <%   [project, project.id].each do |p| -%>
 Disallow: <%= url_for(:controller => 'repositories', :action => :show, :id => p) %>

Actions

Copy link

#14

Updated by Eduardo Ramos about 5 years ago

Go MAEDA wrote:

Most people here think that the problem is that search engines indexes URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).

I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in www.redmine.org with Google).

The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.

[...]

Thank u, with that modification (Disallow: /*?) my raspberry is not so stressed (at least cpu is under 15%, before it was about 90% due to crawlers requests)
How could it be patched on a docker-compose layout ? What I did, is to modify the 'robots.text.erb' in redmine container, and restart such container.

Actions

Copy link

#15

Updated by Go MAEDA about 5 years ago

Target version changed from Candidate for next minor release to 4.0.8

Eduardo Ramos wrote:

Thank u, with that modification (Disallow: /*?) my raspberry is not so stressed (at least cpu is under 15%, before it was about 90% due to crawlers requests)

Thank you for testing the patch and for giving feedback. I am setting the target version to 4.0.8.

How could it be patched on a docker-compose layout ? What I did, is to modify the 'robots.text.erb' in redmine container, and restart such container.

I don't know much about Docker. I suggest you ask questions on the forums.

Actions

Copy link

#16

Updated by Go MAEDA about 5 years ago

Updated the patch. The previous patch posted in #6734#note-13 has a problem that it prevents crawlers from accessing "/issues?page=". It means that crawlers can get only the first page of the issues list and will not index issues after the second page.

diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..8c2732e00 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,6 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path) %>?sort=
+Disallow: <%= url_for(issues_path) %>?query_id=
+Disallow: <%= url_for(issues_path) %>?*set_filter=

Actions

Copy link

#17

Updated by Go MAEDA about 5 years ago

Related to Feature #31617: robots.txt: disallow crawling dynamically generated PDF documents added

Actions

Copy link

#18

Updated by Eduardo Ramos about 5 years ago

Go MAEDA wrote:

Updated the patch. The previous patch posted in #6734#note-13 has a problem that it prevents crawlers from accessing "/issues?page=". It means that crawlers can get only the first page of the issues list and will not index issues after the second page.

[...]

Tested OK. The cpu even better. No activity registered in redmine logs regarding bots, neither at redmine access logs from nginx.
It could be casuality (no crawlers accessing now), i will monitor it in the following hours anyway.

Thank u

Actions

Copy link

#19

Updated by Go MAEDA about 5 years ago

Subject changed from Robots index /issues (which isn't disallowed in robots.txt) to robots.txt: disallow crawling issues list with a query string

Actions

Copy link

#20