Hardening Drupal Against Badly Behaved Bots

Aug 27, 2014

If you've run Drupal for any length of time, you know that spiders and bots can make the difference between a well-performing site and one that requires a coffee refill while the page loads.  Some bots are particularly nasty.  Even with a meticulously crafted robots.txt file, they'll slam you from an entire farm of machines, sometimes with as many as five to ten requests a second.  And that's just from one search engine.

Bots only ask for anonymous pages, so one would think they could always be served out of a fast front-end cache like Varnish or NGINX.  Unfortunately, many of us like to run our front-end caches in memory for speed, meaning they're of finite size.  They work really well at keeping the most current content of our sites available for anyone who wants it.  But when bots come along and ask for a whole string of two-year-old posts, those pages aren't there.  Generating those deep (rarely used) pages results in Drupal and database load.  What's worse, after those pages are generated, they get inserted into the front-end cache, effectively bumping out more frequently used pages that now have to be regenerated too.

At Digital Services we needed a way to stop bots from generating Drupal load.  This led us to rethink how we cache Drupal as a whole.

Front-end Cache

We use a pair of NGINX servers as a reverse proxy cache in front of Drupal.  Not only do they help decrease database load by serving frequently used pages, they can also serve pages when the database is down.  That helps us maintain uptime for anonymous users even during maintenance windows.

Our NGINX config includes one more element.  We have it examine the request's User-Agent and siphon bots and spiders off to a separate set of NGINX servers that we call botcache.
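
To make that concrete, here's a minimal sketch of how such a User-Agent siphon can be expressed in NGINX.  The bot pattern, upstream names, and server addresses are illustrative assumptions, not our exact production config.

    # Classify the request by User-Agent; anything bot-like gets flagged.
    map $http_user_agent $is_bot {
        default                                        "";
        ~*(bot|spider|crawl|slurp|yandex|baiduspider)  1;
    }

    # Pick the upstream pool based on that flag.
    map $is_bot $backend_pool {
        default  drupal;      # real browsers fall through to Drupal on a miss
        1        botcache;    # bots and spiders are siphoned off to botcache
    }

    upstream drupal   { server drupal1.internal:80;   server drupal2.internal:80; }
    upstream botcache { server botcache1.internal:80; server botcache2.internal:80; }

    server {
        listen 80;

        location / {
            proxy_set_header Host $host;
            proxy_pass http://$backend_pool;   # caching directives shown in the next section
        }
    }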

Botcache

Botcache is the second set of NGINX servers in our setup.  Its role is to maintain a long-term, persistent store of all of our anonymous pages.  The data in botcache should be as current as possible, and it isn't susceptible to cache clears.  Let's examine the differences between these two groups of NGINX servers:

frontcache1/frontcache2

  • Cache clears:  In order to keep our sites up during database failure or maintenance, we never want to delete a frequently used page from cache.  So we've re-engineered our cache clear routine to instead mark the affected entries as expired.  The next request for that item (while the db is up) will trigger Drupal to regenerate the page and the cache to get updated.  If the database is down, the expired copy is still available to be served to the user.
  • Filesystem:  NGINX can run its proxy cache on disk or in memory, and there are advantages to each (manual entry inspection vs speed, respectively).  We wanted both sets of advantages, so we cheated and told NGINX to use an on-disk cache but put that cache on a RAMDISK.
  • Finite size:  Because our NGINX cache actually lives on a RAMDISK, it can't grow forever.  Currently frontcache1 and frontcache2 each have 16G dedicated to their cache.  This means they can't afford to cache the entire site's history, which spiders will want to dredge up -- only the current/active pages.
  • Serving logic:  When a request comes in, the cache is searched first.  If a valid cache hit is found, it's served from there and the request is complete.  For cache misses, the User-Agent is examined.  If it's a regular browser making the request, it's passed on to a Drupal upstream server.  Results generated by Drupal are inserted into the local (16G) cache before being returned to the client.  If it's a bot/spider making the request, it's passed on to botcache as the upstream server.  Results that are "generated" by botcache are NOT cached on frontcache1/2, so they never bump a more frequently accessed page out of the primary cache.  A configuration sketch of this logic follows the list.
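
Here is a sketch of what the caching side of that serving logic can look like, reusing the $is_bot and $backend_pool variables from the siphon sketch above.  The cache path, zone name, sizes, and timings are assumptions for illustration; the real config differs.

    # 16G cache whose on-disk path actually lives on a RAMDISK (tmpfs) mount.
    proxy_cache_path /var/cache/nginx/frontcache levels=1:2
                     keys_zone=frontcache:256m max_size=16g inactive=30d;

    server {
        listen 80;

        location / {
            proxy_cache      frontcache;
            proxy_cache_key  "$scheme$host$request_uri";

            # An expired entry can still be served if Drupal or the database is down.
            proxy_cache_use_stale error timeout updating http_500 http_502 http_503;

            # Misses from bots are answered by botcache; don't store those responses
            # here, so they never evict a frequently used page from the 16G cache.
            proxy_no_cache   $is_bot;

            proxy_set_header Host $host;
            proxy_pass       http://$backend_pool;
        }
    }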

botcache1/botcache2

  • Cache clears:  There are none.  Pages live on in botcache forever, which is fitting since nothing is ever deleted from Drupal.
  • Filesystem:  Botcache runs on regular hard disks, so it can have effectively unlimited space.  We're currently using more than 480G for our cache.
  • Logic:  Requests for cached pages come strictly from frontcache1/2.  If the request is in the cache, it's served up.  If it's not, the request is given a 502 and goes no further.  In short, bots no longer generate Drupal load.  A sketch of this setup follows the list.
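
A botcache server block might look something like the sketch below.  The paths, sizes, and the deliberately dead upstream are illustrative assumptions, and the separate path our own spiders use to populate this cache through Drupal isn't shown.

    # Long-lived cache on regular disk; entries are effectively never evicted.
    proxy_cache_path /var/cache/nginx/botcache levels=1:2
                     keys_zone=botcache:512m max_size=500g inactive=10y;

    # Nothing listens on this port, so a cache miss fails instead of reaching Drupal.
    upstream blackhole {
        server 127.0.0.1:9;
    }

    server {
        listen 80;

        location / {
            proxy_cache       botcache;
            proxy_cache_key   "$scheme$host$request_uri";
            proxy_cache_valid 200 10y;       # cached pages live essentially forever

            # Hits are served straight from the cache; misses try the dead
            # upstream, fail, and come back as a 502 -- bots never touch Drupal.
            proxy_pass http://blackhole;
        }
    }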

[Botcache flow diagram]

That brings us to the question of how botcache gets and stays populated with current data for the spiders to hit.  We run four levels of spiders of our own, with the important difference that these spiders are under our direction.  We can decide how frequently they run, how many run concurrently, and their overall impact.  Plus, a single spider run of ours can service all of the search engines instead of each of them hitting us independently, so it's a huge win in terms of impact.

We run a level1 spider that starts on the front page of each of our 200+ sites and only goes 1 level deep (it follows links 1 level deep off the start page).  Our level2 spiders go 2 links deep, and so on.  A sketch of one such crawl run follows the list below.

  • Level1 takes about half an hour across all of our sites;  we cron them to run every 2 hours
  • Level2 takes about 2.5 hours;  we cron them 3 times a day
  • Level3 takes about 9 hours;  we cron them once a day
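
As a rough illustration, one of those tiered crawl runs could look like the shell sketch below.  The site list, user-agent string, rate limits, and paths are hypothetical; the real runs are tuned per site.

    #!/bin/sh
    # Crawl every site in the list to a fixed depth; pass the depth (1, 2, 3)
    # as the first argument.  --delete-after downloads pages only to warm the
    # caches and then throws them away.
    DEPTH="$1"
    while read -r SITE; do
        wget --recursive --level="$DEPTH" \
             --wait=1 --random-wait \
             --user-agent="our-internal-spider" \
             --delete-after --no-directories --quiet \
             "http://$SITE/"
    done < /etc/crawler/site-list.txt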

We also had a level0 that would spider entire sites with no depth limit.  For us, with the crawl rate scaled back to minimize site impact, it took over 10 days to complete a single run.  But periodically it's still necessary to grab that deep content and keep it available to search engines.  The problem we ran into was that new stories were published in Drupal faster than the crawl could keep up.  Since a spider (we use "wget" with its -r option) crawls in a not-exactly-predefined order, we saw some rather odd results.  We theorized that if the spider crawls page 500 and reads all the links on it (stories M, N, O and P), then proceeds to recurse down and crawl those individual story post pages (/post/M, /post/N, etc), then by the time it decided to read page 499, story L may have jumped from page 499 to page 500 due to a new post on page 1.  Suddenly story L is completely missed and doesn't get crawled.  Given that "last" is one of the page links at the bottom of every page, some pages can easily end up crawled in this reverse order.  We saw as many as 30% misses on stories we expected to be covered by our level0 crawl.

As a result we replaced the level0 crawl with a simple database query to dump all /post, /people and /term pages to a file, which we in turn feed to wget (no recursion).  We schedule that dump and crawl once a week to keep the deep stories from falling out of botcache.
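
The weekly dump-and-crawl might look roughly like the sketch below.  The database name, table and column names, hostname, and file paths are assumptions for illustration; the real query simply lists every /post, /people and /term path.

    #!/bin/sh
    # Dump the deep URLs straight from the database...
    mysql --batch --skip-column-names \
          -e "SELECT CONCAT('http://www.example.gov/', alias)
              FROM url_alias
              WHERE alias LIKE 'post/%' OR alias LIKE 'people/%' OR alias LIKE 'term/%'" \
          drupal > /tmp/deep-urls.txt

    # ...and feed the list to wget with no recursion, just to warm botcache.
    wget --input-file=/tmp/deep-urls.txt --wait=1 --delete-after \
         --user-agent="our-internal-spider" --quiet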

Back-end Cache

The back-end cache (a.k.a. the "Drupal cache") was relatively straightforward.  We started with the database and quickly moved to memcache for speed.  In recent years we tried transitioning it to Couchbase to remove our dependency on a single server.  Couchbase shards data by design and can support a cluster that loses a machine or two and keeps right on going.  The main downside to Couchbase was that, when used as a memcache replacement, you couldn't disable its persistence setting.  That meant it wrote everything to disk so that data persisted between cluster shutdowns.  With cache data that wasn't necessary, and the disk access actually made the cluster slower.  Sometimes a Couchbase node would be marked as down and removed from the cluster because it spent too much time on disk I/O and didn't respond to a request fast enough.  We tried moving the disk partition it uses to a RAMDISK, but that triggered a Couchbase bug that resulted in it storing all of the data on a single node (unsharded).

We'd increased the speed of our Couchbase hardware over the years but were still looking for a solution that would let us lose a server without having to write anything to disk.  Finally we stumbled upon repcache, a fork of memcache that handles replication.  This allows us to have two memcache servers with identical data, synced in real time.

Today we have a highly redundant Drupal architecture that's resistant to aggressive spiders and multiple server failures.