Thomas Ardal

Entrepreneur and founder of elmah.io

Analyzing user agents to identify bots with Elasticsearch

Most of you probably know my startup elmah.io. If not, check it out and head back here when you’ve signed up 🙂 At elmah.io we index errors generated at our customers’ websites. Among a lot of other things, we index the user agent causing each error. If you’ve ever logged uncaught errors (like 404s) on a web server, you know that bots, crawlers, spiders, etc. cause a lot of them.

To minimize the number of logged errors for ourselves and our customers, I want to be able to identify bots by looking at the user agent causing an error. I know there are lists of both whitehat and blackhat bots out there, but for this purpose I want a single, short, and fast query that identifies as many whitehat bots as possible.

To have some data to play around with, I’ve fetched a list of browser user agents and another list of bot user agents (source: http://useragentstring.com/). Both lists are converted to Elasticsearch bulk index format. Since the purpose of this post is playing around with aggregations in Elasticsearch, I won’t go into detail on how to convert HTML to Elasticsearch bulk format.

To create a new Elasticsearch index to store all of the user agents, execute the following request:

PUT http://localhost:9200/useragents
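
As a quick sanity check, you can fetch the index back. The response contains the index settings and, once documents have been indexed, the dynamically created mappings:

GET http://localhost:9200/useragents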

To index the data and let Elasticsearch create the mapping for the browsers and bots types, POST all of the user agents using the _bulk endpoint:

POST http://localhost:9200/_bulk
{ "index": { "_index": "useragents", "_type": "browsers" } }
{ "userAgent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" }
{ "index": { "_index": "useragents", "_type": "bots" } }
{ "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" }
...
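
If the converted user agents live in a file on disk, the bulk body can be posted directly with curl. A minimal sketch, assuming the lists are stored in a file called useragents.json (the file name is hypothetical, and remember that the bulk format requires newline-delimited JSON with a trailing newline):

curl -XPOST http://localhost:9200/_bulk --data-binary @useragents.json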

All user agents are now successfully indexed in Elasticsearch. Since the userAgent field in both browsers and bots is analyzed, Elasticsearch breaks each user agent string down into terms. This actually helps us find common keywords in bot user agents.
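
To see exactly which terms Elasticsearch extracts from a user agent, you can run one of the strings through the _analyze endpoint. A quick sketch (older Elasticsearch versions expect field and text as query string parameters rather than a JSON body):

POST http://localhost:9200/useragents/_analyze
{
    "field": "userAgent",
    "text": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
}

The tokens in the response are the terms that the aggregation below counts.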

To find the most commonly used terms in bot user agents, create a new terms aggregation on all of the bot user agents:

POST http://localhost:9200/useragents/bots/_search
{
    "aggs": {
        "bot_terms": {
            "terms": {
                "field": "userAgent",
                "size": 100
            }
        }
    }
}
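
Since we only care about the buckets and not the individual documents, the search hits can be suppressed with a top-level size of 0 (not to be confused with the size inside the terms aggregation, which controls the number of buckets):

POST http://localhost:9200/useragents/bots/_search
{
    "size": 0,
    "aggs": {
        "bot_terms": {
            "terms": {
                "field": "userAgent",
                "size": 100
            }
        }
    }
}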

Elasticsearch returns a list of buckets, ordered by the most frequent terms:

"buckets": [
{
    "key": "Mozilla",
    "doc_count": 97
},
{
    "key": "5.0",
    "doc_count": 85
},
{
    "key": "compatible",
    "doc_count": 58
}
...
]

This is great input for our bot-identifying quest. Using terms like Mozilla and compatible probably isn’t a good idea, since browser user agents include these as well. Scrolling down through the list, we find bot, spider, crawler, and other interesting terms. Let’s create a query against all bot user agents with these terms:

POST http://localhost:9200/useragents/bots/_search
{
    "query": {
        "query_string": {
            "query": "userAgent: *bot* OR userAgent: *crawl* OR userAgent: *spider* OR userAgent: *search*"
        }
    }
}

As we can see from the result, Elasticsearch finds a lot of hits:

{
  ...
  "hits": {
    "total": 298
    ...
  }
}

Time to test the query against browser user agents to make sure that browsers aren’t treated as bots:

POST http://localhost:9200/useragents/browsers/_search
{
    "query": {
        "query_string": {
             "query": "userAgent: *bot* OR userAgent: *crawl* OR userAgent: *spider* OR userAgent: *search*"
        }
    }
}

The query returns zero hits.
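
Both checks can also be combined into a single request against the whole index, bucketing the matches by type. A sketch relying on the _type metadata field, which is aggregatable in the Elasticsearch versions that support multiple types per index:

POST http://localhost:9200/useragents/_search
{
    "query": {
        "query_string": {
            "query": "userAgent: *bot* OR userAgent: *crawl* OR userAgent: *spider* OR userAgent: *search*"
        }
    },
    "aggs": {
        "hits_by_type": {
            "terms": {
                "field": "_type"
            }
        }
    }
}

If the query behaves as intended, all hits land in the bots bucket and no browsers bucket shows up.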

Using Elasticsearch to analyze user agents and building a smart query based on terms turned out to be a great way to identify most bots, without having to maintain long lists of user agents, make API calls, or similar. Querying Elasticsearch takes a few milliseconds, and the result is almost as good as writing a lot of code.
