Strange bot swarm overwhelming a website - SOLVED

bikegremlin Moderator, OG, Content Writer

I saw a client's website using up 100% of its allotted CPU resources non-stop, and a huge amount of bandwidth.

It wasn't a DDoS, and it wasn't a real bot attack either.

I'd never experienced anything like it (you live and learn).

I wrote it all down as a sort of pulp-fiction detective story - just for some fun while documenting it:

Website Attacked By Ghosts

Thought it would be cool to not put any spoilers here. :)

Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews

Comments

  • bikegremlin Moderator, OG, Content Writer
    edited September 2023

    According to this tweet (by SEO expert Lily Ray), this site was not the only one bothered by Google crawling and indexing query strings instead of the canonical URLs:

    https://twitter.com/lilyraynyc/status/1706444837610750119

    I believe that the .htaccess redirect I used for the Medisite would work for that one too.

    #BEGIN Redirect from msclkid to the canonical page
    # (Assumes RewriteEngine On is already set, as in a standard WordPress .htaccess)
    # Match any request whose query string contains "msclkid=" (case-insensitive):
    RewriteCond %{QUERY_STRING} "msclkid=" [NC]
    # 301-redirect to the same path with the query string stripped (the trailing "?"):
    RewriteRule (.*) /$1? [R=301,L]
    #END Redirect from msclkid to the canonical page
    

    Here, I added a more detailed explanation of how to see if your site is affected and fix it.
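    For a quick check (assuming "msclkid" is the offending query parameter - swap in whatever parameter your logs show), a Google search along these lines should list any indexed query-string variants:

    site:your-domain.com inurl:msclkid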


  • Encoders

    Good find - bandwidth aside, using 100% CPU isn't acceptable. I recall this isn't the first fuckup by Googlebot, is it?

    But then again, most of my websites are just flat static HTML. I probably won't see this kind of problem anytime soon.

    Thanked by (2)bikegremlin vyas

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • bikegremlin Moderator, OG, Content Writer

    @Encoders said:
    Good find - bandwidth aside, using 100% CPU isn't acceptable. I recall this isn't the first fuckup by Googlebot, is it?

    But then again, most of my websites are just flat static HTML. I probably won't see this kind of problem anytime soon.

    Depending on one's priorities, it could be argued that the duplicate indexing - the same pages indexed thousands of times over - is the biggest problem.

    The affected site(s) had the same pages indexed several times - the same URL, with various query string combinations at its end.

    But yes, CPU load is no fun either.
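    For reference, each of those pages already carried a canonical tag pointing to the clean URL - something like this (example.com as a placeholder):

    <link rel="canonical" href="https://example.com/some-page/" />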


  • ResellerWiz

    I've seen similar issues with Bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user who actually does so until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.
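    For what it's worth, Bingbot does honor a Crawl-delay directive in robots.txt, so that one can at least be throttled without touching any console - a minimal example (10 seconds between requests):

    User-agent: bingbot
    Crawl-delay: 10

    Googlebot ignores Crawl-delay, though; its crawl rate can only be limited via Search Console.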

    SimpleSonic - We Make Fast... Easy!
    New High Performance Economy Shared Hosting Plans Available As Low As $1.46/mo

  • bikegremlin Moderator, OG, Content Writer

    @ResellerWiz said:
    I've seen similar issues with Bingbot as well, where it amounted to a small DoS.

    Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user who actually does so until it becomes an issue.

    Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.

    Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.

    This was more than just a crawl rate issue.

    It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).

    Relja GottaLoveSeo Novovic

    Thanked by (1)AlwaysSkint


  • ResellerWiz

    @bikegremlin said:

    This was more than just a crawl rate issue.

    It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).

    Relja GottaLoveSeo Novovic

    Either way, not good.

    Thanked by (1)bikegremlin


  • bikegremlin Moderator, OG, Content Writer

    @ResellerWiz said:

    Either way, not good.

    Yup. I'd say it's worse than just a high crawl speed.

    Thanked by (1)SimpleSonic


  • This article truly talks to me. :relieved:

    Thanked by (2)FrankZ bikegremlin
  • Dazzle

    IMHO, the best way to address this issue is by preventing Googlebot from crawling faceted URLs. This can be achieved by using the disallow directive in the robots.txt file.

    If you choose to redirect the URLs instead, it's important to note that Googlebot will still crawl them.
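    A minimal robots.txt sketch along those lines (assuming the faceted URLs are query-string variants like the msclkid ones above - adjust the pattern to your own parameters):

    User-agent: Googlebot
    Disallow: /*msclkid=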

    UpCloud free $25 through this aff link - Akamai, DigitalOcean and Vultr alternative, multiple location, IPv6.

  • bikegremlin Moderator, OG, Content Writer
    edited September 2023

    @Dazzle said:
    IMHO, the best way to address this issue is by preventing Googlebot from crawling faceted URLs. This can be achieved by using the disallow directive in the robots.txt file.

    If you choose to redirect the URLs instead, it's important to note that Googlebot will still crawl them.

    They will try to crawl them - and get 301 redirected.
    Thanks to the .htaccess redirects, there is no bashing the server - WordPress won't even realize someone requested those pages.

    After a while, with 301 redirects, Google ditches the redirected pages in favour of the pages they 301 redirect to.
    Canonical tags, on the other hand, are apparently treated as a suggestion, not as a hard rule.
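    A quick way to verify the redirect is doing its job, with a hypothetical URL (the trailing "?" in the rule makes Apache drop the query string from the Location header):

    curl -sI "https://example.com/some-page/?msclkid=abc123"
    # Expect: HTTP/1.1 301 Moved Permanently
    # Location: https://example.com/some-page/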

    Edit:
    My initial (gut) response was to block the crawling of those pages (using the firewall - lol, making matters worse).
    But that was a very bad idea.
    In my defense, the server was "redlining", so I wanted a quick fix to keep the site online, and then take time to think.
    Disallowing via robots.txt is a far less bad solution, but still not ideal IMO.

    301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the best and most efficient way.
    I could be wrong, but based on my experience so far, that's what I did and what I would recommend.

    We'll see how things end up in the Google Search Console over the next month or two.


  • Dazzle

    @bikegremlin said:

    301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the best and most efficient way.
    I could be wrong, but based on my experience so far, that's what I did and what I would recommend.

    It can solve the problem, though not in the ideal way. The robots.txt file was created specifically for this purpose - to control how bots behave on your site. If the faceted URLs are also visited by human visitors, then a 301 redirect is the ideal solution.

    Regarding GSC, the URLs will remain under the "Crawled - currently not indexed" warning for quite a long time even if you 301 redirect them. A faster method is to serve a 404 status code and request index removal in GSC (if the URL is indexed in the SERP). Otherwise, Google may reindex the faceted URLs if they are found elsewhere.
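    If you go that route, a mod_rewrite sketch in the same spirit as the redirect above (untested, with msclkid again assumed as the parameter):

    RewriteCond %{QUERY_STRING} "msclkid=" [NC]
    RewriteRule .* - [G,L]

    The [G] flag answers with "410 Gone" rather than a literal 404 - Google treats both as a dead URL, and reportedly drops 410s a bit faster.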

    I don't want to start a debate, just sharing my thoughts as a second opinion. I only read this topic today.

    Best regards,
    Re

    Thanked by (1)bikegremlin


  • bikegremlin Moderator, OG, Content Writer

    @Dazzle said:

    It can solve the problem, though not in the ideal way. The robots.txt file was created specifically for this purpose - to control how bots behave on your site. If the faceted URLs are also visited by human visitors, then a 301 redirect is the ideal solution.

    Regarding GSC, the URLs will remain under the "Crawled - currently not indexed" warning for quite a long time even if you 301 redirect them. A faster method is to serve a 404 status code and request index removal in GSC (if the URL is indexed in the SERP). Otherwise, Google may reindex the faceted URLs if they are found elsewhere.

    I don't want to start a debate, just sharing my thoughts as a second opinion. I only read this topic today.

    Best regards,
    Re

    It's good to hear different opinions and points of view - especially when they disagree. It's difficult to learn otherwise.

    Here's my experience (hoping to get corrected if I'm wrong):
    301s are pretty good at getting Google to drop the redirected pages from the SERP and replace them with the pages you redirect to (provided it's basically or literally the same page, not some strange "hack").

    The old version will get dropped from indexing and shown under non-indexed page links (apparently, Google never forgets, LOL), with the reason "Page with redirect."

    It doesn't negatively affect rankings (the dropped pages usually get swapped for the pages you 301 redirected to).

    I've used 301 redirects when I moved the cycling website in my native language from the "www." to the "bicikl." subdomain, and when I moved all the computer-related articles from the cycling website(s) to the "io." subdomain.

    I've also used 301s when I ditched Google AMP from my sites.

    It all seems to have worked fine with no measurable negative ranking effects.

    Having said that, this is the first time I've seen "random" URLs get indexed, and it will take a while to check and see if it went well.

    But I would expect a 301 redirect to be more efficient than just blocking the bots (even via the robots.txt file). My reasoning:

    • Block: "Don't go there, don't look."
    • 301 redirect: "That page is now here, see?"

    Especially when the indexed pages all had canonical tags pointing to the same URLs that the 301 redirects take the bots to.

    Relja AmateurSEO Novovic

    Thanked by (1)Dazzle


  • Great thread. I have recently been receiving warning emails from Google about some of my site's pages not being indexed - when they're actually 301 redirected to another domain. So far I've just ignored the warnings, but I think I need to rethink this now.

    Thanked by (2)skorous vyas

    ProlimeHost Dedicated Servers
    Los Angeles, CA - Denver, CO - Singapore

  • AlwaysSkint OG, Senpai

    I've had the search engines totally ignore directives for years! Google in particular loves adding query strings. Webmaster Tools be damned.
    When I'm a bit more "on the ball", I'll slowly read the above and try to glean some tips.
    Cheers.

    Thanked by (2)skorous bikegremlin

    It wisnae me! A big boy done it and ran away.
    NVMe2G for life! until death (the end is nigh)

  • AlwaysSkint OG, Senpai
    edited March 11

    Ironically, the two sites that had massive bandwidth usage recently seem not to have the above issue with indexed parameters.
    A point of note: I found the following in robots.txt to be a totally wasted effort:

    User-agent: *
    Disallow: /*?

    Along with these, though hardly surprising from the scum/scourge of the web (IMHO):

    User-agent: AhrefsBot
    Disallow: /

    User-agent: YandexDirect
    Disallow: /

    User-agent: YandexBot
    Disallow: /

    Thanked by (1)bikegremlin

