06 Jul 2019

Elasticsearch, types and indices

The other day I added some more logging to a service at work, but not all logs appeared in Kibana. Some messages got lost between CloudWatch Logs and Elasticsearch. After turning up the logging in the Lambda that shuffles the log messages, I was in for a bit of learning about Elasticsearch.

Running the following in a Kibana console will show what the issue was

PUT idx-0/_doc/1
{
  "name": "John",
  "full_name": "John Doe"
}

PUT idx-0/_doc/2
{
  "name": "Jane",
  "full_name": {
    "first": "Jane",
    "last": "Doe"
  }
}

Executing them in order results in the following error on the second command

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [full_name] of type [text]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [full_name] of type [text]",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "Can't get text on a START_OBJECT at 3:16"
    }
  },
  "status": 400
}

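The reason for this is that a schema for the data is built up dynamically as documents are pushed in.¹ The first document made full_name a text field, so the second document, where full_name is an object, is rejected. It is possible to turn off dynamic schema building for an index using a mapping. Creating the index for the documents above would then look something like this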

PUT idx-0
{
  "mappings": {
    "_doc": {
      "dynamic": false
    }
  }
}

Now it's possible to push both documents. Searching, however, is not possible because, as the documentation for dynamic says:

fields will not be indexed so will not be searchable but will still appear in the _source field of returned hits

If there's something that determines the value of logs it's them being searchable.
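To make the trade-off concrete, here's a toy model in Python of the two behaviours described above. It is not Elasticsearch code and involves no client library. ToyIndex, MappingConflict and kind are made-up names, and the model only distinguishes "text" from "object", which is all the example needs.

```python
# Toy model of dynamic mapping. Illustration only, not how Elasticsearch
# is implemented; all names here are made up for this sketch.


class MappingConflict(Exception):
    """Raised when a field's value doesn't match the type recorded for it."""


def kind(value):
    # Only the two shapes from the example matter: objects and everything else.
    return "object" if isinstance(value, dict) else "text"


class ToyIndex:
    def __init__(self, dynamic=True):
        self.dynamic = dynamic
        self.mapping = {}  # field name -> "text" or "object"
        self.docs = {}     # document id -> original document (think _source)

    def put(self, doc_id, doc):
        if self.dynamic:
            # Check every field first, so a rejected document leaves no trace
            # in this toy.
            for field, value in doc.items():
                known = self.mapping.get(field)
                if known is not None and known != kind(value):
                    raise MappingConflict(
                        f"failed to parse field [{field}] of type [{known}]"
                    )
            # The first document to use a field decides its type.
            for field, value in doc.items():
                self.mapping.setdefault(field, kind(value))
        # With dynamic off nothing is added to the mapping: the document is
        # stored, but none of its fields become searchable.
        self.docs[doc_id] = doc

    def searchable_fields(self):
        return sorted(self.mapping)


john = {"name": "John", "full_name": "John Doe"}
jane = {"name": "Jane", "full_name": {"first": "Jane", "last": "Doe"}}

dynamic_idx = ToyIndex(dynamic=True)
dynamic_idx.put(1, john)
try:
    dynamic_idx.put(2, jane)
except MappingConflict as err:
    print(err)  # failed to parse field [full_name] of type [text]

static_idx = ToyIndex(dynamic=False)
static_idx.put(1, john)
static_idx.put(2, jane)
print(len(static_idx.docs), static_idx.searchable_fields())  # 2 []
```

Both documents fit in the index with dynamic turned off, but only because nothing about them is recorded for searching.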

As far as I understand, one solution to all of this might have been mapping types, but they're being removed (see removal of mapping types), and that page also says that same-named fields in different types of one index had to share a mapping anyway, so they aren't a solution. I'm not sure if Elasticsearch offers any good solution to it nowadays. There is, however, a workaround: more indices.

Using two indices instead of one does work, so modifying the first commands to use separate indices gets both documents in.

PUT idx-0/_doc/1
{
  "name": "John",
  "full_name": "John Doe"
}

PUT idx-1/_doc/1
{
  "name": "Jane",
  "full_name": {
    "first": "Jane",
    "last": "Doe"
  }
}

When creating an index pattern for idx-* there's a warning about many analysis functions not working due to the type conflict. However, searching does work and that's all I really care about in this case.

When shuffling the logs from CloudWatch Logs to Elasticsearch we already use multiple indices. They're constructed based on service name, deploy environment (staging, production) and date (a new index each day). To deal with these type conflicts I added a log type, taken out of the log message itself, to that list. It's not an elegant solution – it pushes the problem into the services themselves – but it's acceptable.
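A minimal sketch of what such an index name could look like, in Python. Everything specific here is an assumption made for illustration rather than what our Lambda actually does: the function names, the "log_type" field, the "generic" fallback, the order of the parts, the "-" separator and the date format. The one thing that isn't a free choice is the lowercasing, since Elasticsearch requires index names to be lowercase.

```python
from datetime import date


def log_type_of(message, default="generic"):
    # "log_type" is a hypothetical field name; the point is only that the
    # service puts the type into the log message itself.
    return message.get("log_type", default)


def index_name(service, environment, log_type, day):
    """Build an index name from service, environment, log type and date.

    The order of the parts, the separator and the date format are
    assumptions made for this sketch.
    """
    parts = [service, environment, log_type, day.strftime("%Y.%m.%d")]
    # Elasticsearch index names must be lowercase.
    return "-".join(part.lower() for part in parts)


msg = {"log_type": "request", "message": "GET /health"}
print(index_name("billing", "staging", log_type_of(msg), date(2019, 7, 6)))
# billing-staging-request-2019.07.06
```

Messages whose fields clash then only need to carry different log types to end up in different indices.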

Footnotes:

¹ Something that makes me wonder what the definition of schema-free is. I sure didn't expect there to ever be a type constraint preventing pushing a document into something that's called schema-free (see the Wikipedia article). (The initiated say it's Lucene, not Elasticsearch, but to me that doesn't make any difference at all.)