
Web Crawlers: Love the Good, but Kill the Bad and the Ugly


[Image: Daltons poster]

I am staring at my screen.

We’re in the midst of the Christmas holidays, and I have nothing else to do but stare at my screen.

And scratch my head. And make faces of disbelief and confusion.

Where does my web server load come from?

I have a dozen websites hosted on a large Linux VPS web server. Well, at the beginning of last year it was “medium” sized, but by now it is definitely “large”! Over the past year, I saw an increasing server load, even though Google Analytics didn’t show a similar (massive) increase in user/web traffic. Being lazy, I gradually upgraded my VPS server. And now even that server is getting too small.

It is Xmas time. So what else to do with my spare time but to optimize my server. Once more! #Ihaa!

Ugly crawlers suspected

In the past year, I refused to allocate any time to website upgrades – which often seem a waste of time and are a continuous source of new problems. I did not change anything significant on the server (other than making it bigger). Recalling my past experience with excessive crawling activity, I suspected excessive crawler, bot and spider traffic again.
And those don’t leave traces in Google Analytics, which explains the discrepancy between server load/traffic and actual page views.

Ok, roll up your sleeves, clean your guns, lock and load. We’re gonna kill us some bad-ass crawlers and stuff..!

How to limit the Google and Bing crawl rate

Google (by far) and Bing (to a lesser extent) have the most active web crawlers. They can be over-active if you have a lot of content, like I have on my news sites. To make it worse, neither of them respects the crawl-delay setting in robots.txt. You have to adjust their crawl rate manually, using the Google and Bing webmaster tools. Here’s how to do it for Google and for Bing.

To top it all off, Google resets its crawl rate every 3 months. Each time I forget to renew Google’s crawl-delay adjustment, my server goes down under a sudden and excessive load. Google.. #dah. If I unleash Google with its default crawl settings on my sites, I get 7,000 crawl hits. PER MINUTE.. Imagine.. Poor server…

Adjusting the Google and Bing crawlers is important for the type of sites I run, but it might not be for yours. The more content you have, and the more frequently it is updated, the heavier the crawler traffic will be.

Just check the crawl statistics in the same Google and Bing webmaster tools. If there is no excessive access (say, fewer than 1,000 crawl hits per day), I would leave it as is. If it is higher, slow down the crawlers.

That was the first kill.. Now on to the real work!

How to spot the villains? Check your access log for excess crawling!

But there are many more villains in town! I found an easy way to check for crawler activity using the Apache access log. The access log registers all requests Apache (the actual web server software) receives. Analyzing the access log is like following the breadcrumbs to find the villains. Or something like that.

1. Locate the Apache access log on your server (typically /var/log/apache2/access.log on Debian/Ubuntu or /var/log/httpd/access_log on CentOS/Red Hat, depending on your setup). Take the data, typically a month’s worth or more, and copy it to a work directory. Name it “test.log”, for instance. I actually copy the file to my Mac, as macOS is Unix-based anyway.
2. With a single Linux command, extract all the user agents and sort them by number of occurrences: (magic!!)

cat test.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -n

(where “test.log” is the access logfile you want to analyze)

The output will look like this:

51916 MetaURI API/2.0 +metauri.com
59899 Twitterbot/1.0
87819 Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
111261 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
187812 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.1; +http://flipboard.com/browserproxy)
189834 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
390477 facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

The first number on each line is the number of times that spider/crawler/user agent has accessed your site. Beware: these are not all crawlers, as the data is intermixed with actual human user traffic and other useful traffic.

In my example above, you see that the “facebookexternalhit” user agent accessed the site 390,477 times per month. That is roughly 541 times per hour. Excessive. On the kill list, you go!
Other heavy ones are FlipboardProxy, Twitterbot, Baiduspider and MetaURI. Those are part “crawler”, part “service”. Whatever they are, their usefulness does not justify the amount of traffic/load on my server, so.. on to some more killing!
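If you want to zoom in on likely bots only, a variation of the same pipeline can filter for common bot keywords first. This is just a sketch: the keyword list is illustrative and will miss crawlers that do not identify themselves.

# count user agents, but only for requests whose log line mentions a bot-like keyword
grep -iE "bot|crawler|spider|proxy" test.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -n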

Limit crawler traffic using robots.txt

The first port of call is to limit crawler access using a robots.txt file in your website’s root directory.

On my larger sites, I disallow ANY crawler except a handful of useful ones. Check the robots.txt from my news site as an example. That will stop a good deal of traffic.
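For reference, here is a minimal sketch of that kind of robots.txt. The allowed bots below are just examples (I am assuming you want to keep Googlebot and bingbot); adjust the list to your own needs.

# Block every crawler by default...
User-agent: *
Disallow: /

# ...then allow a few useful ones (an empty Disallow means full access)
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: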

Now, in my case, I had already tuned my robots.txt, so it was clear that the over-eager villains caught in the previous step either did not respect robots.txt, or were not “crawlers” in the strict sense.

So, now for the final kill.

Use .htaccess to hard-block over-anxious spiders and crawlers

The .htaccess is a (hidden) file which can be found in any directory of your website. It sprinkles magical stardust on your website using heavenly redirects, surgical reconstructions of URLs, etc. Anyone who masters the secrets of .htaccess is worth his/her weight in gold.
Which does not include me, by the way.. I barely mastered the first chapter of the “Dummies Guide to .htaccess”.

We’ll edit the .htaccess file in the root directory of each of our websites.
BUT before you do anything, big WARNING: make a backup copy of the .htaccess file, as one dot or one comma too many or too few can render your site inaccessible. Edit it with a plain-text editor that preserves Unix line endings (like TextWrangler).

One of the things you can do with .htaccess is redirect web requests coming from certain IP addresses or user agents.
So,… I blocked the excessively active crawlers/bots by matching a string in the USER_AGENT field and redirecting their web requests to a static page, before the request ever reaches the CMS and its database. So they can hit my sites with all their force; it won’t load my application or hog any database resources, short of a DDoS attack.

Here are the lines I added to my .htaccess file:

#redirect bad bots to one page 
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/nocrawler.htm
RewriteRule .* http://yoursite/nocrawler.htm [L]

It catches the server-hogging spiders (at least those which were hogging MY server resources), bots and crawlers by a substring of their user agent’s name (case insensitive). End each line with a user-agent string in it with [NC,OR], except the last one, which gets [NC] only. (Beware!)
This piece of code redirects the unwanted crawlers to a dummy HTML file “http://yoursite/nocrawler.htm” in your root directory.

An example could be:

<!DOCTYPE html>
<html>
<body>
<p>This crawler was blocked</p>
</body>
</html> 

Note that the last condition, “RewriteCond %{REQUEST_URI} !\/nocrawler.htm”, is needed to avoid an infinite redirect loop. (Thanks Emanuele Quinto!)

I implemented this blocking last night, and voilà: for the first time in months, I saw a constant 50% free memory, whereas before, memory usage almost always hit 90%.

Or.. the hard way to hard-block over-anxious spiders and crawlers

The previous method redirected any request from the blocked spiders or crawlers to one page.. That is the “friendly” way. However, if you get A LOT of spider requests, this also means that your Apache server does double work: it gets the original request, which is redirected, and then a second request to deliver your “nocrawler.htm” file.
While this eases the load on your SQL server, it won’t ease the pressure on your Apache server.

A hard (and simple) way to block unwanted spiders, crawlers and other bots is to return a “403 – Forbidden”, and that is the end of it.

Using the same spiders I wanted to block in the previous example, just add this code to your .htaccess:

#block bad bots with a 403
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
SetEnvIfNoCase User-Agent "Twitterbot" bad_bot
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "MetaURI" bad_bot
SetEnvIfNoCase User-Agent "mediawords" bad_bot
SetEnvIfNoCase User-Agent "FlipboardProxy" bad_bot

<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>
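A note on Apache versions, since I am assuming nothing about your setup: the Order/Allow/Deny directives above are the older Apache 2.2 syntax, which on Apache 2.4 only works if mod_access_compat is enabled. If it is not, a sketch of the 2.4-native equivalent, reusing the same bad_bot variable, would be:

#block bad bots with a 403 (Apache 2.4 syntax)
<RequireAll>
  Require all granted
  Require not env bad_bot
</RequireAll>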

And kill some more: catch the hacker-bots

While you are playing with the access log, try to catch the IP addresses of the malicious bots trying to hack into your website.
A simple hacking technique is to poll your site for login or user-registration pages.

On my Drupal site, for instance, I can catch the IP addresses of bots trying to access the WordPress login page (wp-login) with this Linux command (using the same test.log access log we used previously):

grep "wp-login" test.log | sort | uniq -c | sort -n

There is no reason why anyone with honorable intentions would try to access the WordPress login on a Drupal site, so that gave me 3-4 really suspicious IPs.
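The same trick works for any URL that no legitimate visitor of your particular site should ever request. Here is a sketch, with an illustrative list of probe targets (adjust it to paths that are never valid on your own site):

# count requests per IP for a few typical hacker probe paths
grep -E "xmlrpc\.php|wp-admin|/administrator/" test.log | awk '{print $1}' | sort | uniq -c | sort -n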

You can double-check where the suspected IP addresses come from with a reverse IP lookup tool. In my case I found multiple hack attempts coming from 91.200.12.14.
The reverse IP info gave me:

General information and location of 91.200.12.14
IPv4 address: 91.200.12.14
Reverse DNS: 2.1.com
RIR: RIPENCC
Country: Ukraine
RBL Status: Listed in HTTP:BL
Threat: Suspicious / Bot, Comment Spammer

You can double-check whether that IP address is known for malicious activity via Project Honey Pot:
For my example above, I got this report. No doubt then.

I actually found loads of hacking attempts from 91.200.11.* and 91.200.12.*, so I denied them access by adding the following lines to my .htaccess file:

#deny malicious crawler IP addresses
deny from 91.200.11.
deny from 91.200.12.
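As with the user-agent block earlier, deny from is Apache 2.2 syntax. On Apache 2.4 without mod_access_compat, a sketch of the equivalent, covering the same two ranges, would be:

#deny malicious crawler IP ranges (Apache 2.4 syntax)
<RequireAll>
  Require all granted
  Require not ip 91.200.11 91.200.12
</RequireAll>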

Server tuning is an art, not a science

After almost a decade of running my own VPS servers, I have learned that there is no single recipe for web server tuning. So this post should not be seen as “the one solution for all performance problems”.. You should first look at using a proper cache on all your sites, ensure all plugins are working well, check what is causing system bottlenecks, tune the MySQL server,..
BUT over the years, and from talking with many webmasters, I have learned one thing: they often forget to look at the crawler traffic. And with that, they forget that even the best-tuned massive server will be brought to its knees if even one bot or crawler really misbehaves..

Have fun!

Picture discovered via Glogster

