Paul Sturgess

Search engine crawler bots feeding frenzy

One of the darker sides of web development is downtime. The site owners don’t want it, the site developers don’t want it and, most importantly, the site users don’t want it. Unfortunately, it will happen. This is not a defeatist view or an excuse; it’s realistic.

An experienced software development team will know this and, rather than bury their heads in the sand, will be well prepared to deal with the consequences. It’s all about having the problem-solving skills, the tools and the right approach to find the root cause of the problem.

Problem solving

Recently one of our sites went down and so we began asking why…

  • MySQL ran out of memory – why?
  • The session lookups in the database were taking too long – why?
  • We had far more sessions in the database than normal – why?
  • The site log showed search engine spiders were hammering the site – why?
  • They were trapped in an indexing frenzy, crawling an unlimited number of unique URLs – why?

It was only after a lot of digging that we realised we were storing parameters in the URLs that could instead have been stored in the user’s session – and so we had our root cause.
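To illustrate why URL parameters of this kind give a crawler an effectively endless supply of pages, here is a minimal Python sketch. The parameter names (sort, view, per_page) and the example.com URLs are made up for illustration, since the actual parameters aren’t listed here. The point is only that many distinct URLs collapse to the same page once the state-only parameters are removed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names: display state that could live in the session.
STATE_PARAMS = {"sort", "view", "per_page"}


def canonical_url(url: str) -> str:
    """Drop state-only query parameters so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in STATE_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Illustrative URLs a crawler might discover by following links.
crawled = [
    "http://example.com/products?sort=price&view=grid",
    "http://example.com/products?sort=name&view=list",
    "http://example.com/products?view=grid&per_page=50",
]

# Every combination of parameters is a new URL to a crawler,
# but they all point at the same underlying page.
unique_pages = {canonical_url(url) for url in crawled}
```

Each extra parameter multiplies the number of URLs a bot can find, while the number of real pages stays the same.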

There are a couple of side issues here that we also need to address, like clearing out the sessions database table more often. However, if we had been clearing it out more often, it would undoubtedly have been harder to discover that search bots had been hammering our website, which brings me on to the crawling issue.

A quick search online revealed numerous other site owners complaining that their sites were being indexed far too heavily. It does make you wonder how many websites there are out there now, creaking under the avalanche of search engine spiders, their owners blissfully unaware.
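If you want to check whether your own site is in this position, a rough count of requests per crawler user agent is a reasonable starting point. The Python sketch below assumes an access log in the common "combined" format, where the user agent is the last quoted field. The sample lines and the list of substrings used to spot bots are illustrative assumptions, not taken from our logs.

```python
import re
from collections import Counter

# In the combined log format the user agent is the final quoted field.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')

# Assumed substrings that commonly appear in crawler user agents.
BOT_HINTS = ("bot", "spider", "crawl", "slurp")


def crawler_hits(log_lines):
    """Count requests per user agent for agents that look like crawlers."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group(1)
        if any(hint in agent.lower() for hint in BOT_HINTS):
            counts[agent] += 1
    return counts


# Made-up sample lines: two crawler requests and one ordinary browser request.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2011:10:00:00 +0000] "GET /products?sort=price HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [01/Jan/2011:10:00:01 +0000] "GET /products?sort=name HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.20 - - [01/Jan/2011:10:00:02 +0000] "GET /products HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0"',
]

hits = crawler_hits(SAMPLE_LOG)
for agent, count in hits.most_common():
    print(count, agent)
```

Substring matching on the user agent is crude and will miss bots that disguise themselves, but it is enough to show whether crawlers account for a surprising share of traffic.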

We’ve since changed the way the site works so that it relies more on session variables and less on URL parameters. We’re now sending these bots to 404 pages if they request a URL using these parameters, so at least our database is not being hit.
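The check itself can be very small. As a sketch only, and not the code we actually deployed, the Python function below returns a 404 for any request whose query string carries one of the retired parameters, before any session or database work would happen. The parameter names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical names of the parameters that moved into the session.
RETIRED_PARAMS = {"sort", "view", "per_page"}


def status_for(url: str) -> int:
    """Return 404 if the URL uses a retired parameter, otherwise 200."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if not RETIRED_PARAMS.isdisjoint(query):
        # Reject early, without touching sessions or the database.
        return 404
    return 200


print(status_for("http://example.com/products?sort=price"))
print(status_for("http://example.com/products"))
```

In a real application this would sit as early as possible in request handling, so that a crawler working through stale URLs costs almost nothing to turn away.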

However, they are still requesting pages that no longer exist and aren’t linked to, so they must have a back catalogue of pages to index. Hopefully they will calm down. Google Webmaster Tools offers some control over the crawl rate, but our experience here has been a little inconsistent.

I’ve read reports of site owners contacting Google directly and ending up with their site not being indexed at all.

I guess you have to be careful what you wish for.

Tags: google, seo, search, indexing, crawlers, bots

