Strange bot swarm overwhelming a website - SOLVED
bikegremlin
Moderator, OG, Content Writer
in WordPress
I saw a client's website using up 100% of its allotted CPU resources non-stop, and a huge amount of bandwidth.
It wasn't a DDoS, and it wasn't a real bot attack either.
Never experienced something like that (you live and learn).
I wrote it all down as a sort of pulp-fiction detective story - just for some fun while documenting it:
Thought it would be cool to not put any spoilers here.
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
Comments
According to this tweet (by SEO expert Lily Ray), this site was not the only one bothered by Google crawling and indexing query strings instead of the canonical URLs:
https://twitter.com/lilyraynyc/status/1706444837610750119
I believe that the .htaccess redirect I used for the Medisite would work for that one too.
Here, I added a more detailed explanation of how to see if your site is affected and fix it.
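For reference, the kind of rule in question - a minimal sketch, not the exact one used on the Medisite, and a real deployment would need exclusions for wp-admin, search, previews, and any other legitimate query strings:

<IfModule mod_rewrite.c>
RewriteEngine On
# Leave WordPress admin requests alone
RewriteCond %{REQUEST_URI} !^/wp-admin/
# Only act when a query string is present
RewriteCond %{QUERY_STRING} .
# 301 to the same path, dropping the query string (QSD, Apache 2.4+)
RewriteRule ^(.*)$ /$1 [QSD,R=301,L]
</IfModule>

On older Apache versions, ending the substitution with a "?" (i.e. /$1?) drops the query string the same way the QSD flag does.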
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
Good find. Bandwidth aside, using 100% of the CPU isn't acceptable. I recall this isn't the first fuckup by Googlebot, is it?
But then again, most of my websites are just flat static HTML. I'll probably never see this kind of problem anytime soon.
Fuck this 24/7 internet spew of trivia and celebrity bullshit.
Depending on one's priorities, it could be argued that the duplicate indexing (thousands of extra indexed URLs) is the biggest problem.
The affected site(s) had the same pages indexed many times over - the same URL, with various query string combinations appended to it.
But yes, CPU load is no fun either.
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
I've seen similar issues with Bingbot as well, where it amounted to a small DoS.
Users can set a max crawl rate for these bots in their search console, but I'm not aware of any user that actually does until it becomes an issue.
Search companies really need to set reasonable max crawl rates by default and actually stick to them, regardless of what is being indexed.
Users shouldn't have to go to lengths just to keep search bots from eating up their website's resources.
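As an aside, the only bot-side rate control that doesn't require a search console account is the Crawl-delay directive in robots.txt - and only for bots that honor it. Bingbot respects it; Googlebot ignores it entirely, so Google's crawl rate can only be limited from Search Console. A minimal sketch:

# Bingbot honors Crawl-delay (roughly the number of seconds between fetches)
User-agent: bingbot
Crawl-delay: 10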
SimpleSonic - We Make Fast... Easy!
New High Performance Economy Shared Hosting Plans Available As Low As $1.46/mo
This was more than just a crawl rate issue.
It was Google crawling the query string variants, and completely ignoring the canonical URL tags of each page (according to their own search result page - no greater proof than that).
Relja GottaLoveSeo Novovic
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
Either way, not good.
SimpleSonic - We Make Fast... Easy!
New High Performance Economy Shared Hosting Plans Available As Low As $1.46/mo
Yup. I'd say it's worse than just a high crawl speed.
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
This article truly speaks to me:
IMHO, the best way to address this issue is by preventing Googlebot from crawling faceted URLs. This can be achieved by using the disallow directive in the robots.txt file.
If you choose to redirect the URLs instead, it's important to note that Googlebot will still crawl them.
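For completeness, the robots.txt approach the article recommends can be as blunt as disallowing every URL that contains a query string - a minimal sketch (the pattern is an example, not taken from the affected site):

User-agent: *
# Block crawling of any URL containing a "?"
Disallow: /*?

One caveat: robots.txt stops the crawling, but URLs that are already indexed can linger in the SERP, since Google can no longer fetch them to see the canonical tag or a redirect.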
UpCloud free $25 through this aff link - Akamai, DigitalOcean and Vultr alternative, multiple location, IPv6.
They will try to crawl them - and get 301 redirected.
Thanks to the .htaccess redirects, there is no bashing the server - WordPress won't even realize someone requested those pages.
After a while, with 301 redirects, Google ditches the redirected pages in favour of the pages they 301 redirect to.
Canonical tags, on the other hand, are apparently treated as a suggestion, not as a hard rule.
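A quick way to confirm the redirect is served by Apache itself, before WordPress/PHP ever runs, is to check the response headers - example.com and the query string below are placeholders:

curl -sI 'https://example.com/some-page/?utm_source=test'
# Expected first lines of the response:
# HTTP/1.1 301 Moved Permanently
# Location: https://example.com/some-page/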
Edit:
My initial (gut) response was to block the crawling of those pages (using the firewall - lol, making matters worse).
But that was a very bad idea.
In my defense, the server was "redlining" so I wanted to do a quick fix, keep the site online, and then take time to think.
Disallowing them via robots.txt is a much less bad solution, but still not ideal IMO.
301 .htaccess redirects are an elegant solution that should fix the problem properly in the long run - in the most efficient way.
I could be wrong, but based on my experience so far, that's what I did and what I would recommend.
Will see how things end up in the Google Search Console over the next month or two.
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
It can solve the problem, but not in the ideal way. The robots.txt file was created specifically for this purpose - to control how bots behave on your site. If the faceted URLs are visited by a human visitor, then a 301 redirect is the ideal solution.
Regarding GSC, the URLs will remain in the "crawled but not indexed" warning for quite a long time even if you 301 redirect them. A faster method is to use a 404 status code and request index removal from GSC (if the URL is indexed on the SERP). Otherwise, Google may reindex the faceted URLs if they are found elsewhere.
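For anyone wanting to try the 404 route on Apache, a hypothetical .htaccess sketch (the condition is deliberately broad - a real rule would need to spare legitimate query strings):

<IfModule mod_rewrite.c>
RewriteEngine On
# Any request with a query string gets a 404 instead of a redirect
RewriteCond %{QUERY_STRING} .
RewriteRule ^ - [R=404,L]
</IfModule>

The indexed variants can then be submitted through GSC's Removals tool, as described above, to speed up the cleanup.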
I don't want to start a debate, just to share my thoughts as a second opinion. I only read this topic today.
Best regards,
Re
UpCloud free $25 through this aff link - Akamai, DigitalOcean and Vultr alternative, multiple location, IPv6.
It's good to hear different opinions and points of view - especially when they disagree. It's difficult to learn otherwise.
Here's my experience (hoping to get corrected if I'm wrong):
301s are pretty good at getting Google to drop the redirected pages from the SERP and replace them with the pages you redirected to (if it's basically or literally the same page, not some strange "hack").
The old version will get dropped from indexing and shown under non-indexed page links (apparently, Google never forgets, LOL), with the reason "Page with redirect."
It doesn't negatively affect rankings (the dropped pages usually get swapped for the pages you 301 redirected to).
I've used 301 redirects when I moved the cycling website in my native language from the "www." to the "bicikl." subdomain, and when I moved all the computer-related articles from the cycling website(s) to the "io." subdomain.
I've also used 301s when I ditched Google AMP from my sites.
It all seems to have worked fine with no measurable negative ranking effects.
Having said that, this is the first time I've seen "random" URLs get indexed, and it will take a while to check and see if it went well.
But I would expect a 301 redirect to be more efficient than just blocking the bots (even via the robots.txt file) - especially since the indexed pages all had canonical tags pointing to the same URLs that the 301 redirects take the bots to.
Relja AmateurSEO Novovic
Relja of House Novović, the First of His Name, King of the Plains, the Breaker of Chains, WirMach Wolves pack member
BikeGremlin's web-hosting reviews
Great thread. I have recently been receiving warning emails from Google about one of my site's pages not being indexed, when they're actually 301 redirected to another domain. So far I've just ignored the warnings, but I think I need to rethink this now.
ProlimeHost Dedicated Servers
Los Angeles, CA - Denver, CO - Singapore
I've had the search engines totally ignore directives for years! Google in particular loves adding query strings. Webmaster Tools be damned.
When I'm a bit more "on the ball", I'll slowly read the above and try to glean some tips.
Cheers.
It wisnae me! A big boy done it and ran away.
NVMe2G for life! until death (the end is nigh)
Ironically, the two sites that had massive bandwidth usage recently seem to not have the above issue with indexed parameters.
A point of note is that I found the following in robots.txt to be a totally wasted effort:
Along with these, though hardly surprising from the scum/scourge of the web (IMHO):
It wisnae me! A big boy done it and ran away.
NVMe2G for life! until death (the end is nigh)