Terms aggregations on analyzed fields in Elasticsearch

I recently had the chance to clean up an Elasticsearch mapping that I’d been dying to refactor for some time. The problem was a number of not_analyzed fields that really should have been analyzed, which would make them available for full-text search. While making the change, I ran into a problem with a couple of terms aggregations that seemed odd at first but turned out to make a lot of sense.

To illustrate the problem, let’s look at an Elasticsearch mapping containing a not_analyzed field:

{
  "mappings": {
    "mydocument": {
      "properties": {
        "title": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

To test this scenario, let’s put some documents into the index:

[
  {"title": "This is the first title"},
  {"title": "This is the second title"}
]
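
One way to load these documents in a single request would be the bulk API, which expects newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end. The sketch below only builds that request body and does not send it. The index name myindex is made up for illustration; the type mydocument comes from the mapping above.

```python
import json


def build_bulk_body(docs, index, doc_type):
    """Build a newline-delimited bulk request body that indexes each document."""
    # Action metadata; "_type" matches the typed mappings used in this post.
    action = {"index": {"_index": index, "_type": doc_type}}
    lines = []
    for doc in docs:
        lines.append(json.dumps(action))  # what to do
        lines.append(json.dumps(doc))     # the document source
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"title": "This is the first title"},
    {"title": "This is the second title"},
]

# "myindex" is a hypothetical index name.
body = build_bulk_body(docs, "myindex", "mydocument")
print(body, end="")
```

The resulting string would be sent as the body of a POST to the _bulk endpoint.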

And finally execute the terms aggregation:

{
  "aggs": {
    "titles": {
      "terms": {
        "field": "title"
      }
    }
  }
}

The result shouldn’t be surprising. We asked Elasticsearch to split the documents into buckets of unique titles:

{
  "buckets": [
    {"key": "This is the first title", "doc_count": 1},
    {"key": "This is the second title", "doc_count": 1}
  ]
}

So far so good, but now for the problem. I changed the title property to an analyzed field and re-indexed all of my documents. The mapping now looks similar to this:

{
  "mappings": {
    "mydocument": {
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed"
        }
      }
    }
  }
}

Notice how the value of the index property changed from not_analyzed to analyzed. Let’s run our aggregation once more:

{
  "aggs": {
    "titles": {
      "terms": {
        "field": "title"
      }
    }
  }
}

And for the results:

{
  "buckets": [
    {"key": "is", "doc_count": 2},
    {"key": "the", "doc_count": 2},
    {"key": "this", "doc_count": 2},
    {"key": "title", "doc_count": 2},
    {"key": "first", "doc_count": 1},
    {"key": "second", "doc_count": 1}
  ]
}

At first sight I was baffled.

But then I realized that the results make perfect sense. This isn’t a blog post about how inverted indexes work in Elasticsearch, but in short, analyzed instructs Elasticsearch to run the value of the title property through an analyzer, which splits it into individual lowercased terms and stores those in the index. The terms aggregation runs on top of the inverted index, which is why Elasticsearch simply answers our (sort of stupid) question: split the values in the inverted index into buckets of unique terms.
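
To make this concrete, here is a small Python simulation of what happens. It is only a sketch: the analyze function is a crude stand-in for a real analyzer (it lowercases and splits on non-word characters), and terms_buckets is a made-up helper rather than anything Elasticsearch exposes. The ordering (doc_count descending, then term) is chosen to match the output shown above.

```python
import re
from collections import defaultdict


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def terms_buckets(docs, field):
    """Count, for every indexed term, how many documents contain it."""
    postings = defaultdict(set)  # term -> ids of the documents containing it
    for doc_id, doc in enumerate(docs):
        for term in analyze(doc[field]):
            postings[term].add(doc_id)
    # Sort by doc_count descending, then alphabetically by term.
    ordered = sorted(postings.items(), key=lambda item: (-len(item[1]), item[0]))
    return [{"key": term, "doc_count": len(ids)} for term, ids in ordered]


docs = [
    {"title": "This is the first title"},
    {"title": "This is the second title"},
]

buckets = terms_buckets(docs, "title")
for bucket in buckets:
    print(bucket)
```

The aggregation never sees the original titles, only the terms in the index, so the words shared by both documents end up with a doc_count of 2.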

To fix this, we need to store both an analyzed and a not_analyzed version of the title. You could store two separate properties for this, but Elasticsearch already provides a nice way to do it using a multi-field. With a multi-field, our mapping looks like this:

{
  "mappings": {
    "mydocument": {
      "properties": {
        "title": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Here we define the title field with both an analyzed version and a not_analyzed version named raw, which is addressed as title.raw.
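
Conceptually, the same source value now gets indexed twice. The Python sketch below illustrates the idea under simplifying assumptions: tokenize is a rough approximation of an analyzer, and index_fields and doc_counts are hypothetical helpers, not part of any Elasticsearch client.

```python
import re
from collections import Counter


def tokenize(text):
    """Rough approximation of an analyzer: lowercase and split into words."""
    return re.findall(r"\w+", text.lower())


def index_fields(doc):
    """Expand one source document into the set of terms indexed for each field."""
    title = doc["title"]
    return {
        "title": set(tokenize(title)),  # analyzed: one term per distinct word
        "title.raw": {title},           # not_analyzed: the whole value as one term
    }


def doc_counts(docs, field):
    """Number of documents containing each term of the given field."""
    counts = Counter()
    for doc in docs:
        counts.update(index_fields(doc)[field])
    return dict(counts)


docs = [
    {"title": "This is the first title"},
    {"title": "This is the second title"},
]

raw_counts = doc_counts(docs, "title.raw")
analyzed_counts = doc_counts(docs, "title")
print(raw_counts)
print(analyzed_counts)
```

Full-text search keeps using the tokenized title terms, while anything that needs exact values can work on the untouched title.raw terms.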

With the mapping above in place and the documents re-indexed, I ran my terms aggregation on the title.raw field:

{
  "aggs": {
    "titles": {
      "terms": {
        "field": "title.raw"
      }
    }
  }
}

And the results look good once again:

{
  "buckets": [
    {"key": "This is the first title", "doc_count": 1},
    {"key": "This is the second title", "doc_count": 1}
  ]
}

It’s funny how basic stuff still manages to surprise you once in a while, even after working with a technology for years.