Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene. It is designed for horizontal scalability, maximum reliability, and easy management. This article will delve into the advanced features of Elasticsearch, such as document immutability, custom routing logic, document versioning, data reading and writing, update queries, bulk APIs, analysis and mapping, and analyzers. We will explore each of these features in detail, with examples to illustrate their practical applications.
Elasticsearch: Understanding the Basics
Document Immutability and Updates
In Elasticsearch, documents are immutable, meaning they cannot be changed once written. Instead, when an update is made, the old document is marked as deleted and a new document is indexed in its place. The update API performs this retrieve-change-reindex cycle on the shard itself, so the client does not have to make two separate network calls: one to read the old document and another to write the new one.
For instance, consider a document with an ID of 1. If we want to update this document, Elasticsearch doesn’t modify the existing document. Instead, it marks the existing document as deleted and indexes a new document with the new information.
Scripted Updates
Scripted updates provide a way to modify a document without having to retrieve it, change it, and reindex it yourself. This is done using the ctx._source object, which represents the source of the document the script is operating on. Here's an example of a scripted update:
POST /my-index-000001/_update/1
{
  "script" : {
    "source": "ctx._source.likes++"
  }
}
In this example, the likes field of the document with ID 1 is incremented by one.
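Scripts can also take parameters, which keeps the script source stable so it does not need to be recompiled for every new value. A minimal sketch, assuming the same numeric likes field (the count parameter name is purely illustrative):
POST /my-index-000001/_update/1
{
  "script": {
    "source": "ctx._source.likes += params.count",
    "lang": "painless",
    "params": {
      "count": 4
    }
  }
}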
Upserts
Upserts are operations that update a document if it exists or insert a new document if it doesn't. This is done using the update API with the doc_as_upsert parameter set to true. Here's an example of an upsert:
POST /my-index-000001/_update/1
{
  "doc": {
    "name": "John Doe"
  },
  "doc_as_upsert": true
}
In this example, if a document with ID 1 exists, its name field is updated to "John Doe". If it doesn't exist, a new document with ID 1 and a name field of "John Doe" is created.
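Upserts also work together with scripted updates: if the document does not exist yet, the contents of the upsert block are indexed as a new document instead of running the script. A sketch, using an illustrative counter field:
POST /my-index-000001/_update/1
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "params": {
      "count": 4
    }
  },
  "upsert": {
    "counter": 1
  }
}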
Custom Routing Logic
By default, Elasticsearch uses a hash of the document’s ID to determine which shard the document should go to. However, you can override this behavior by providing a custom routing value. This can be useful for ensuring that related documents are stored on the same shard, which can improve query performance.
Here’s an example of indexing a document with a custom routing value:
PUT /my-index-000001/_doc/1?routing=user1
{
  "name": "John Doe"
}
In this example, the document with ID 1 is indexed with a routing value of “user1”. This means that the document will be stored on the shard determined by the hash of “user1” instead of the hash of “1”.
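The benefit shows up at query time: if you pass the same routing value with a search, Elasticsearch only has to query the shard that holds those documents instead of fanning out to every shard. A sketch using the same illustrative routing value:
GET /my-index-000001/_search?routing=user1
{
  "query": {
    "match": {
      "name": "John Doe"
    }
  }
}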
Document Versioning
Elasticsearch keeps track of the number of times a document has been updated using a _version field, which is incremented each time the document is updated. For optimistic concurrency control, however, modern Elasticsearch relies on the document's _seq_no and _primary_term: you pass the values you last read in the if_seq_no and if_primary_term parameters of the index and update APIs, and the operation is rejected if the document has changed in the meantime.
Here’s an example of updating a document with optimistic concurrency control:
POST /my-index-000001/_update/1?if_seq_no=10&if_primary_term=2
{
  "doc": {
    "name": "John Doe"
  }
}
In this example, the document with ID 1 is updated only if its current _seq_no is 10 and its _primary_term is 2.
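The values to pass usually come from the last time you read the document: a plain GET returns the current _seq_no and _primary_term alongside the source.
GET /my-index-000001/_doc/1
An abridged response sketch (the exact set of fields varies by Elasticsearch version):
{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 3,
  "_seq_no": 10,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "name": "John Doe"
  }
}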
Reading Data in Elasticsearch
Reading data in Elasticsearch involves several components:
- Routing: When a read request is received, it is routed to the appropriate shard based on the document’s ID or a custom routing value.
- Coordinating Node: This is the node that the client connects to for a read request. It is responsible for forwarding the request to the appropriate shard(s) and gathering the responses.
- Replica Group: A replica group consists of a primary shard and its replicas. When a read request is received, it can be served by any shard in the replica group.
- Adaptive Replica Selection (ARS): ARS is a feature that allows the coordinating node to select the “best” replica to serve a read request based on various factors such as query load and response time.
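ARS is enabled by default. If you ever need to change it, it is exposed as a dynamic cluster setting; a hedged sketch of turning it off via the cluster settings API (there is rarely a good reason to do so):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.use_adaptive_replica_selection": false
  }
}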
Writing Data in Elasticsearch
Writing data in Elasticsearch also involves several components:
- Primary Terms and Sequence Numbers: Primary terms and sequence numbers are used to ensure consistency in a distributed system. The primary term is incremented each time a new primary shard is elected, and the sequence number is incremented each time a document is indexed, updated, or deleted.
- Global and Local Checkpoints: Checkpoints are used to track the progress of replication and recovery. The global checkpoint is the highest sequence number that is known to be replicated on all copies of a shard, and the local checkpoint is the highest sequence number that is known to be processed on a single copy of a shard.
- Concurrency Control: Elasticsearch uses optimistic concurrency control to ensure that conflicting updates to a document are handled correctly. This is done using the if_seq_no and if_primary_term parameters of the update API, as explained in the Document Versioning section.
Update Queries in Elasticsearch
An update by query (the _update_by_query API) is effectively a two-step process. First, Elasticsearch takes a snapshot of the index. It then runs the update against every document that matches the query, indexing each result as a new version of the document; if a document is modified between the snapshot and the update, the operation fails with a version conflict.
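In practice this is driven by a script plus a query that selects the documents to change. A minimal sketch (the status field and its value are illustrative):
POST /my-index-000001/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++"
  },
  "query": {
    "term": {
      "status": "published"
    }
  }
}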
Bulk API
The bulk API allows you to perform multiple indexing, updating, and deleting operations in a single request. This can be more efficient than performing each operation individually because it reduces the overhead of network round trips.
The bulk API expects the operations to be specified in newline-delimited JSON (NDJSON) format. Here’s an example of a bulk request:
POST /_bulk
{ "index" : { "_index" : "my-index-000001", "_id" : "1" } }
{ "name" : "John Doe" }
{ "update" : { "_index" : "my-index-000001","_id" : "2" } }
{ "doc" : { "name" : "Jane Doe" } }
{ "delete" : { "_index" : "my-index-000001", "_id" : "3" } }
In this example, a new document is indexed, an existing document is updated, and another document is deleted.
Analysis and Mapping in Elasticsearch
Analysis and mapping are two fundamental concepts in Elasticsearch. Analysis is the process of converting text into tokens or terms which are added to the inverted index for searching. Mapping is the process of defining how a document and its fields are stored and indexed.
Text Analysis
Text analysis in Elasticsearch is performed by an analyzer, which is composed of zero or more character filters, exactly one tokenizer, and zero or more token filters. Character filters preprocess the raw text, the tokenizer splits it into tokens, and the token filters modify, add, or delete tokens. Here's an example of how to define a custom analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
In this example, a custom analyzer named my_custom_analyzer is defined. It uses the html_strip character filter, the standard tokenizer, and two token filters: lowercase and asciifolding.
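You can check what this analyzer actually produces by running it against the index with the analyze API (covered in more detail later); for example:
GET /my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
The html_strip character filter removes the <b> tags, the standard tokenizer splits the text on word boundaries, and the lowercase and asciifolding filters turn "déjà" into "deja".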
Mapping
Mapping is the process of defining how a document and the fields it contains are stored and indexed. For instance, you can define which fields should be searchable and how they should be tokenized for search.
In Elasticsearch, each field has a dedicated data type, which allows Elasticsearch to index the data appropriately for efficient searches. Elasticsearch supports a number of data types such as text, keyword, date, long, double, boolean, and many more.
Here’s an example of how to define a mapping:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
In this example, a mapping is defined with two fields: name and age. The name field is of type text, and the age field is of type integer.
Inverted Index
An inverted index is a data structure used by search engines to store text data. It is called an “inverted” index because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
This allows for fast full-text searches, as instead of scanning each document for the existence of a word, the search engine can use the inverted index to find out which documents contain the word.
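As a purely illustrative sketch (not an Elasticsearch API or on-disk format), an inverted index over two tiny documents might look like this:
Doc 1: "the quick brown fox"
Doc 2: "the lazy dog"
brown -> [1]
dog   -> [2]
fox   -> [1]
lazy  -> [2]
quick -> [1]
the   -> [1, 2]
A search for "lazy fox" then only has to look up two terms and combine their document lists, rather than scanning every document.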
Keyword Data Type
The keyword data type is used for structured content such as email addresses, hostnames, status codes, zip codes, or tags. Keyword fields are typically used for filtering (find me all blog posts where status is published), sorting, and aggregations, and they are only searchable by their exact value.
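A short sketch of what that looks like in practice, with an illustrative status field mapped as keyword and an exact-value term query that filters on it:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword"
      }
    }
  }
}
GET /my-index-000001/_search
{
  "query": {
    "term": {
      "status": "published"
    }
  }
}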
Explicit Mapping
Explicit mapping is defined by the user. It allows you to precisely choose how to index and store fields. Here’s an example of how to define an explicit mapping:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
In this example, a mapping is defined with two fields: name and age. The name field is of type text, and the age field is of type integer.
Dynamic Mapping
Dynamic mapping is the process by which Elasticsearch automatically defines a mapping based on the input data. For instance, if you index a document without previously defining mappings, Elasticsearch will automatically create them based on the data you provided.
Here’s an example of how dynamic mapping works:
PUT /my-index-000001/_doc/1
{
  "name": "John Doe",
  "age": 30
}
In this example, Elasticsearch will automatically create a text mapping for the name field and a long mapping for the age field.
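You can inspect what Elasticsearch generated with the get mapping API:
GET /my-index-000001/_mapping
With default settings, a JSON string is dynamically mapped as text with a keyword sub-field, and a whole number as long; an abridged response sketch (exact output depends on version and settings):
{
  "my-index-000001": {
    "mappings": {
      "properties": {
        "age": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}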
Combining Explicit and Dynamic Mapping
You can combine explicit and dynamic mapping by defining explicit mappings for some fields and allowing Elasticsearch to dynamically create mappings for other fields. Here’s an example:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
PUT /my-index-000001/_doc/1
{
  "name": "John Doe",
  "age": 30
}
In this example, an explicit mapping is defined for the name field, and Elasticsearch dynamically creates a mapping for the age field.
Dynamic Templates
Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:
- The field’s name
- The field’s data type
- The field’s full path
Here’s an example of a dynamic template:
PUT /my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "longs_as_integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"
          }
        }
      }
    ]
  }
}
In this example, a dynamic template named longs_as_integers is defined. It matches any dynamically added field that would default to type long and maps it as integer instead.
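Templates can also match on the field's name or full path. A hedged sketch that maps any newly seen string field whose name ends in _id to keyword (the naming convention here is just an assumption for illustration):
PUT /my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "ids_as_keywords": {
          "match_mapping_type": "string",
          "match": "*_id",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}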
Analyzers in Elasticsearch
Analyzers are at the heart of text analysis in Elasticsearch. They process text into tokens for indexing in the inverted index. An analyzer is composed of a single tokenizer and zero or more token filters, and it may also contain character filters. A tokenizer breaks a string down into individual terms or tokens. A token filter then modifies those tokens, for example by lowercasing them or filtering out stop words. A character filter, on the other hand, preprocesses the string before it is passed to the tokenizer.
Built-in Analyzers
Elasticsearch comes with a number of built-in analyzers that are ready to use. Some of the most commonly used ones include:
- standard: The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols and is the default analyzer for Elasticsearch.
- simple: The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
- whitespace: The whitespace analyzer divides text into terms whenever it encounters any whitespace character.
- stop: The stop analyzer is like the simple analyzer, but also supports removal of stop words.
- keyword: The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
- pattern: The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves.
- language: Elasticsearch provides several language analyzers like english, french, german, etc. These analyzers are capable of understanding the peculiarities of their respective languages.
Here's an example of how to use the english analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "english"
        }
      }
    }
  }
}
In this example, the english analyzer is set as the default analyzer for the index.
Custom Analyzers
In addition to the built-in analyzers, you can also define your own custom analyzers. A custom analyzer is defined by configuring a combination of:
- Character filters: These are used to preprocess the string before it is passed to the tokenizer. For example, you can use a character filter to replace & with and (a concrete sketch of this appears at the end of this section).
- Tokenizer: This is used to break the string down into individual terms or tokens. For example, you can use the whitespace tokenizer to break the string into terms whenever it encounters any whitespace character.
- Token filters: These are used to postprocess the tokens emitted by the tokenizer. For example, you can use a token filter to lowercase all tokens.
Here's an example of how to define a custom analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
In this example, a custom analyzer named my_custom_analyzer is defined. It uses the html_strip character filter, the standard tokenizer, and two token filters: lowercase and asciifolding.
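The character-filter example mentioned in the list above (replacing & with and) can be expressed with the built-in mapping character filter. A sketch, with illustrative names for the filter and analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "my_and_analyzer": {
          "type": "custom",
          "char_filter": ["and_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}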
Using Analyzers
Analyzers can be used in several places in Elasticsearch:
- During indexing: When a document is indexed, any text fields are analyzed to convert the text into terms that are added to the inverted index.
- During searching: When a query string is passed to a query that searches a text field, the query string is analyzed to convert it into terms that are looked up in the inverted index.
- In the analyze API: The analyze API allows you to test your analyzers and see the tokens that a piece of text is broken down into.
Here's an example of how to use the analyze API:
POST /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox jumped over the lazy dog"
}
In this example, the standard analyzer is used to analyze the provided text.
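The response lists every token together with its position and character offsets. An abridged sketch of what the standard analyzer returns for this text:
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    ...
  ]
}
The remaining tokens are brown, fox, jumped, over, the, lazy, and dog, all lowercased by the standard analyzer.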
Conclusion
Elasticsearch is a powerful search and analytics engine that provides a wide range of features and capabilities. Whether you're building a simple search application or a complex analytics system, features such as scripted updates, custom routing, optimistic concurrency control, the bulk API, and configurable analysis and mappings give you fine-grained control over how your data is indexed, stored, and searched.