Search engine crawler bots feeding frenzy
One of the darker sides of web development is downtime. The site owners don’t want it, the site developers don’t want it and, most importantly, the site users don’t want it. Unfortunately, however, it will happen. This is not a defeatist view or an excuse; it’s realistic.
An experienced software development team will know this and, rather than bury their heads in the sand, will be well prepared to deal with the consequences. It’s all about having the problem-solving skills, the tools and the right approach to find and fix the root cause.
Recently one of our sites went down and so we began asking why…
- MySQL ran out of memory – why?
- Session lookups in the database were taking too long – why?
- There were far more sessions in the database than normal – why?
- The site log showed search engine spiders were hammering the site – why?
- They were trapped in an indexing frenzy, crawling an effectively unlimited number of unique URLs – why?
It was only after a lot of digging that we realised we were storing parameters in the URLs that could instead have been stored in the user’s session – and so we had our root cause. Every combination of those parameters was a distinct URL to a crawler, and since crawlers don’t send back session cookies, each request they made created a brand new session row.
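To make the scale of that concrete, here is a back-of-the-envelope sketch. The parameter names and value counts below are invented for illustration, not our real figures; the point is just how quickly a few independent query parameters multiply into a huge space of unique URLs.

```python
# Back-of-the-envelope: how query parameters multiply into unique URLs.
# These parameters and value counts are purely illustrative.
params = {
    "sort":     4,   # e.g. price / name / date / rating
    "order":    2,   # asc / desc
    "page":     50,
    "category": 25,
}

unique_urls = 1
for count in params.values():
    unique_urls *= count

print(f"Unique crawlable URLs: {unique_urls:,}")
# -> Unique crawlable URLs: 10,000
# A cookie-less crawler looks like a brand new visitor on every request,
# so one full crawl can mean one new session row per unique URL.
```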
Now there are a couple of side issues here that we also need to address (like clearing out the sessions database table more often). That said, if we had been clearing it out more often it would undoubtedly have been harder to discover that search bots had been hammering our website – which brings me to the crawling issue.
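As for that housekeeping, the cleanup itself is just a scheduled one-liner. Here is a minimal sketch, assuming a `sessions` table with a `last_active` timestamp column (both names invented) and using the stdlib `sqlite3` module purely for illustration; the same DELETE works against MySQL.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # assumed retention window

def purge_stale_sessions(db_path: str) -> int:
    """Delete session rows idle longer than MAX_AGE; return rows removed."""
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "DELETE FROM sessions WHERE last_active < ?",
            (cutoff.isoformat(" "),),
        )
        return cur.rowcount

# Run this from cron (hourly, say) so the table can never balloon unnoticed.
```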
A quick search online revealed numerous other site owners complaining that their sites were being indexed far too heavily. It does make you wonder how many websites out there are creaking under the avalanche of search engine spiders, their owners blissfully unaware.
We’ve since changed the way the site works so that it relies more on session variables and less on URL parameters. We’re also now sending these bots to 404 pages if they request a URL using the old parameters, so at least our database is no longer being hit.
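Both changes are small in code terms. Here is a sketch of their shape (Flask is used purely for illustration – our site wasn’t built on it – and the route and parameter names are hypothetical): state is set via a form POST and read from the session, and any request still carrying the retired query parameters is answered with a 404 before it can touch the database.

```python
from flask import Flask, abort, redirect, request, session, url_for

app = Flask(__name__)
app.secret_key = "change-me"  # required for Flask's signed session cookie

LEGACY_PARAMS = ("sort", "order", "page")  # the retired URL parameters (invented names)

@app.before_request
def reject_legacy_urls():
    # Crawlers still requesting old-style parameterised URLs get a 404
    # up front -- the request never reaches the sessions table.
    if any(p in request.args for p in LEGACY_PARAMS):
        abort(404)

@app.post("/products/sort")
def set_sort():
    # Preferences now arrive via a form POST (which crawlers don't follow),
    # live server-side, and the visitor is redirected back to one
    # canonical, parameter-free URL.
    session["sort"] = request.form.get("sort", "name")
    return redirect(url_for("products"))

@app.get("/products")
def products():
    # One URL for everyone; per-visitor state comes from the session.
    sort = session.get("sort", "name")
    return f"Product list sorted by {sort}"
```

The before-request hook runs ahead of every route, which is what keeps the database entirely out of the picture for the old bot traffic.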
However, they are still requesting pages that no longer exist and aren’t linked to anywhere, so they must be working from a back catalogue of pages to index. Hopefully they will calm down. Google Webmaster Tools offers some control over the crawl rate, but our experience here has been a little inconsistent.
I’ve read reports of site owners contacting Google directly and ending up with their sites not being indexed at all.
I guess you have to be careful what you wish for.