Paul Sturgess

Search engine crawler bots feeding frenzy

One of the darker sides of web development is downtime. The site owners don’t want it, the site developers don’t want it and, most importantly, the site users don’t want it. Unfortunately, however, it will happen. This is not a defeatist view or an excuse; it’s realistic.

An experienced software development team will know this and, rather than bury their heads in the sand, will be well prepared to deal with the consequences. It’s all about having the problem-solving skills, the tools and the right approach to find and fix the root cause of the problem.

Problem solving

Recently one of our sites went down and so we began asking why…

  • MySQL ran out of memory – why?
  • The session lookups in the database were taking too long – why?
  • We had far more sessions in the database than normal – why?
  • The site log showed search engine spiders were hammering the site – why?
  • They were trapped in an indexing frenzy, crawling an unlimited number of unique URLs – why?

It was only after a lot of digging that we realised we were storing parameters in the URLs that could instead have been stored in the user’s session. Every combination of those parameters looked like a new page to a crawler, and so we had our root cause.
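A lot of that digging comes down to reading the access log. As a rough sketch of the kind of check that surfaces the problem, the Python below counts how many distinct URLs each crawler has requested. It assumes the log is in the common "combined" format; the `BOT_MARKERS` list, the function name and the log file name are invented for this example and are not taken from our setup.

```python
import re
from collections import defaultdict

# Pulls the request path and the user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings used to spot crawlers. Illustrative and far from complete.
BOT_MARKERS = ("googlebot", "bingbot", "slurp")


def distinct_urls_per_bot(lines):
    """Return {marker: set of distinct paths requested} for the crawlers above."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for marker in BOT_MARKERS:
            if marker in agent:
                seen[marker].add(match.group("path"))
    return seen


if __name__ == "__main__":
    # "access.log" is a placeholder file name.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for bot, paths in distinct_urls_per_bot(log).items():
            print(bot, len(paths))
```

A crawler whose distinct URL count is far higher than the number of real pages on the site is a strong hint that it is working its way through parameter combinations.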

There are a couple of side issues here that we also need to address, such as clearing out the sessions database table more often. That said, if we had been clearing it out more often, it would undoubtedly have been harder to discover that search bots had been hammering our website. That brings me on to the crawling issue.
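For the housekeeping side, a periodic purge of stale sessions is usually enough. The sketch below is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the `sessions` table, its `updated_at` column (a Unix timestamp) and the function name are all assumptions rather than our actual schema.

```python
import sqlite3
import time


def purge_stale_sessions(conn, max_age_seconds, now=None):
    """Delete sessions not updated within max_age_seconds; return rows removed.

    Assumes a hypothetical table: sessions(id, updated_at), where updated_at
    holds a Unix timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    cursor = conn.execute("DELETE FROM sessions WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY, updated_at REAL)")
    conn.execute("INSERT INTO sessions VALUES ('old', ?)", (time.time() - 86400 * 30,))
    conn.execute("INSERT INTO sessions VALUES ('fresh', ?)", (time.time(),))
    # Remove anything untouched for more than a week.
    print(purge_stale_sessions(conn, max_age_seconds=86400 * 7))
```

Logging the number of rows removed on each run keeps some of the visibility we would otherwise lose: a sudden jump in purged sessions is the same warning sign that the bloated table gave us.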

A quick search online revealed numerous other site owners complaining that their sites were being crawled far too heavily. It does make you wonder how many websites there are out there now, creaking under the avalanche of search engine spiders, their owners blissfully unaware.

We’ve since made changes to the way the site works so that it relies more on session variables and less on URL parameters. We’re now sending these bots to 404 pages if they request a URL using these parameters, so at least our database is not being hit.
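The 404 rule itself can be very small. Here is a minimal sketch in Python, assuming the retired parameters are called `sort` and `view`; those names, and the `status_for` function, are hypothetical and stand in for whatever the site really used. The point is that the decision is made from the URL alone, before any session or database lookup.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical names for parameters that now live in the session instead.
RETIRED_PARAMS = {"sort", "view"}


def status_for(url):
    """Return 404 if the URL carries a retired parameter, otherwise 200."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if not RETIRED_PARAMS.isdisjoint(query):
        return 404  # answer immediately, without touching the database
    return 200


if __name__ == "__main__":
    print(status_for("/products?sort=price&page=2"))
    print(status_for("/products?page=2"))
```

In a real application this check would sit in whatever runs first for each request, such as a middleware or a before-filter.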

However, they are still requesting pages that no longer exist and aren’t linked to, so they must have a back catalogue of pages to index. Hopefully they will calm down. Google Webmaster Tools offers some control over the crawl rate, but our experience here has been a little inconsistent.

I’ve read reports of some site owners contacting Google directly and ending up with their site not being indexed at all.

I guess you have to be careful what you wish for.

Tags: google, seo, search, indexing, crawlers, bots
