Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene. It is designed for horizontal scalability, maximum reliability, and easy management. This article will delve into the advanced features of Elasticsearch, such as document immutability, custom routing logic, document versioning, data reading and writing, update queries, bulk APIs, analysis and mapping, and analyzers. We will explore each of these features in detail, with examples to illustrate their practical applications.
Elasticsearch: Understanding the Basics
Document Immutability and Updates
In Elasticsearch, documents are immutable, meaning they cannot be changed once written. Instead, when an update is made, the old document is marked as deleted and a new document is indexed in its place. The update API performs this retrieve-change-reindex cycle on the shard itself, so the client does not have to make two separate network calls: one to read the old document and another to write the new one.
For instance, consider a document with an ID of 1. If we want to update this document, Elasticsearch doesn’t modify the existing document. Instead, it marks the existing document as deleted and indexes a new document with the new information.
Scripted Updates
Scripted updates provide a way to modify a document without having to retrieve it, change it, and reindex it yourself. This is done using the ctx._source object, which represents the source of the document the script is operating on. Here's an example of a scripted update:
POST /my-index-000001/_update/1
{
  "script" : {
    "source": "ctx._source.likes++"
  }
}
In this example, the likes field of the document with ID 1 is incremented by one.
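Scripts can also take parameters, which keeps the script source stable so it does not need to be recompiled for every new value. A minimal sketch, assuming the same numeric likes field (the count parameter name is purely illustrative):
POST /my-index-000001/_update/1
{
  "script": {
    "source": "ctx._source.likes += params.count",
    "lang": "painless",
    "params": {
      "count": 4
    }
  }
}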
Upserts
Upserts are operations that update a document if it exists or insert a new document if it doesn't. This is done using the update API with the doc_as_upsert parameter set to true. Here's an example of an upsert:
POST /my-index-000001/_update/1
{
  "doc": {
    "name": "John Doe"
  },
  "doc_as_upsert": true
}
In this example, if a document with ID 1 exists, its name field is updated to "John Doe". If it doesn't exist, a new document with ID 1 and a name field of "John Doe" is created.
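Upserts also work together with scripted updates: if the document does not exist yet, the contents of the upsert block are indexed as a new document instead of running the script. A sketch, using an illustrative counter field:
POST /my-index-000001/_update/1
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "params": {
      "count": 4
    }
  },
  "upsert": {
    "counter": 1
  }
}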
Custom Routing Logic
By default, Elasticsearch uses a hash of the document’s ID to determine which shard the document should go to. However, you can override this behavior by providing a custom routing value. This can be useful for ensuring that related documents are stored on the same shard, which can improve query performance.
Here’s an example of indexing a document with a custom routing value:
PUT /my-index-000001/_doc/1?routing=user1
{
  "name": "John Doe"
}
In this example, the document with ID 1 is indexed with a routing value of “user1”. This means that the document will be stored on the shard determined by the hash of “user1” instead of the hash of “1”.
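The benefit shows up at query time: if you pass the same routing value with a search, Elasticsearch only has to query the shard that holds those documents instead of fanning out to every shard. A sketch using the same illustrative routing value:
GET /my-index-000001/_search?routing=user1
{
  "query": {
    "match": {
      "name": "John Doe"
    }
  }
}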
Document Versioning
Elasticsearch keeps track of the number of times a document has been updated using a _version field, which is incremented each time the document is updated. For optimistic concurrency control, however, modern Elasticsearch relies on the document's _seq_no and _primary_term: you pass the values you last read in the if_seq_no and if_primary_term parameters of the index and update APIs, and the operation is rejected if the document has changed in the meantime.
Here’s an example of updating a document with optimistic concurrency control:
POST /my-index-000001/_update/1?if_seq_no=10&if_primary_term=2
{
  "doc": {
    "name": "John Doe"
  }
}
In this example, the document with ID 1 is updated only if its current _seq_no is 10 and its _primary_term is 2.
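The values to pass usually come from the last time you read the document: a plain GET returns the current _seq_no and _primary_term alongside the source.
GET /my-index-000001/_doc/1
An abridged response sketch (the exact set of fields varies by Elasticsearch version):
{
  "_index": "my-index-000001",
  "_id": "1",
  "_version": 3,
  "_seq_no": 10,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "name": "John Doe"
  }
}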
Reading Data in Elasticsearch
Reading data in Elasticsearch involves several components:
- Routing: When a read request is received, it is routed to the appropriate shard based on the document’s ID or a custom routing value.
- Coordinating Node: This is the node that the client connects to for a read request. It is responsible for forwarding the request to the appropriate shard(s) and gathering the responses.
- Replica Group: A replica group consists of a primary shard and its replicas. When a read request is received, it can be served by any shard in the replica group.
- Adaptive Replica Selection (ARS): ARS is a feature that allows the coordinating node to select the “best” replica to serve a read request based on various factors such as query load and response time.
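ARS is enabled by default. If you ever need to change it, it is exposed as a dynamic cluster setting; a hedged sketch of turning it off via the cluster settings API (there is rarely a good reason to do so):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.use_adaptive_replica_selection": false
  }
}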
Writing Data in Elasticsearch
Writing data in Elasticsearch also involves several components:
- Primary Terms and Sequence Numbers: Primary terms and sequence numbers are used to ensure consistency in a distributed system. The primary term is incremented each time a new primary shard is elected, and the sequence number is incremented each time a document is indexed, updated, or deleted.
- Global and Local Checkpoints: Checkpoints are used to track the progress of replication and recovery. The global checkpoint is the highest sequence number that is known to be replicated on all copies of a shard, and the local checkpoint is the highest sequence number that is known to be processed on a single copy of a shard.
- Concurrency Control: Elasticsearch uses optimistic concurrency control to ensure that conflicting updates to a document are handled correctly. This is done using the if_seq_no and if_primary_term parameters of the update API, as explained in the Document Versioning section.
Update Queries in Elasticsearch
An update by query (the _update_by_query API) is effectively a two-step process. First, Elasticsearch takes a snapshot of the index. It then runs the update against every document that matches the query, indexing each result as a new version of the document; if a document is modified between the snapshot and the update, the operation fails with a version conflict.
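In practice this is driven by a script plus a query that selects the documents to change. A minimal sketch (the status field and its value are illustrative):
POST /my-index-000001/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++"
  },
  "query": {
    "term": {
      "status": "published"
    }
  }
}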
Bulk API
The bulk API allows you to perform multiple indexing, updating, and deleting operations in a single request. This can be more efficient than performing each operation individually because it reduces the overhead of network round trips.
The bulk API expects the operations to be specified in newline-delimited JSON (NDJSON) format. Here’s an example of a bulk request:
POST /_bulk
{ "index" : { "_index" : "my-index-000001", "_id" : "1" } }
{ "name" : "John Doe" }
{ "update" : { "_index" : "my-index-000001","_id" : "2" } }
{ "doc" : { "name" : "Jane Doe" } }
{ "delete" : { "_index" : "my-index-000001", "_id" : "3" } }
In this example, a new document is indexed, an existing document is updated, and another document is deleted.
Analysis and Mapping in Elasticsearch
Analysis and mapping are two fundamental concepts in Elasticsearch. Analysis is the process of converting text into tokens or terms which are added to the inverted index for searching. Mapping is the process of defining how a document and its fields are stored and indexed.
Text Analysis
Text analysis in Elasticsearch is performed by an analyzer, which is composed of zero or more character filters, exactly one tokenizer, and zero or more token filters. Character filters preprocess the raw text, the tokenizer splits it into tokens, and the token filters modify, add, or delete tokens. Here's an example of how to define a custom analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
In this example, a custom analyzer named my_custom_analyzer is defined. It uses the html_strip character filter, the standard tokenizer, and two token filters: lowercase and asciifolding.
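You can check what this analyzer actually produces by running it against the index with the analyze API (covered in more detail later); for example:
GET /my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
The html_strip character filter removes the <b> tags, the standard tokenizer splits the text on word boundaries, and the lowercase and asciifolding filters turn "déjà" into "deja".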
Mapping
Mapping is the process of defining how a document and the fields it contains are stored and indexed. For instance, you can define which fields should be searchable and how they should be tokenized for search.
In Elasticsearch, each field has a dedicated data type, which allows Elasticsearch to index the data appropriately for efficient searches. Elasticsearch supports a number of data types such as text, keyword, date, long, double, boolean, and many more.
Here’s an example of how to define a mapping:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
In this example, a mapping is defined with two fields: name and age. The name field is of type text, and the age field is of type integer.
Inverted Index
An inverted index is a data structure used by search engines to store text data. It is called an “inverted” index because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).
This allows for fast full-text searches, as instead of scanning each document for the existence of a word, the search engine can use the inverted index to find out which documents contain the word.
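As a purely illustrative sketch (not an Elasticsearch API or on-disk format), an inverted index over two tiny documents might look like this:
Doc 1: "the quick brown fox"
Doc 2: "the lazy dog"
brown -> [1]
dog   -> [2]
fox   -> [1]
lazy  -> [2]
quick -> [1]
the   -> [1, 2]
A search for "lazy fox" then only has to look up two terms and combine their document lists, rather than scanning every document.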
Keyword Data Type
The keyword data type is used for structured content such as email addresses, hostnames, status codes, zip codes, or tags. Keyword fields are typically used for filtering (find me all blog posts where status is published), sorting, and aggregations, and they are only searchable by their exact value.
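A short sketch of what that looks like in practice, with an illustrative status field mapped as keyword and an exact-value term query that filters on it:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword"
      }
    }
  }
}
GET /my-index-000001/_search
{
  "query": {
    "term": {
      "status": "published"
    }
  }
}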
Explicit Mapping
Explicit mapping is defined by the user. It allows you to precisely choose how to index and store fields. Here’s an example of how to define an explicit mapping:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
In this example, a mapping is defined with two fields: name and age. The name field is of type text, and the age field is of type integer.
Dynamic Mapping
Dynamic mapping is the process by which Elasticsearch automatically defines a mapping based on the input data. For instance, if you index a document without previously defining mappings, Elasticsearch will automatically create them based on the data you provided.
Here’s an example of how dynamic mapping works:
PUT /my-index-000001/_doc/1
{
  "name": "John Doe",
  "age": 30
}
In this example, Elasticsearch will automatically create a text mapping for the name field and a long mapping for the age field.
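You can inspect what Elasticsearch generated with the get mapping API:
GET /my-index-000001/_mapping
With default settings, a JSON string is dynamically mapped as text with a keyword sub-field, and a whole number as long; an abridged response sketch (exact output depends on version and settings):
{
  "my-index-000001": {
    "mappings": {
      "properties": {
        "age": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}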
Combining Explicit and Dynamic Mapping
You can combine explicit and dynamic mapping by defining explicit mappings for some fields and allowing Elasticsearch to dynamically create mappings for other fields. Here’s an example:
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
PUT /my-index-000001/_doc/1
{
  "name": "John Doe",
  "age": 30
}
In this example, an explicit mapping is defined for the name field, and Elasticsearch dynamically creates a mapping for the age field.
Dynamic Templates
Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:
- The field’s name
- The field’s data type
- The field’s full path
Here’s an example of a dynamic template:
PUT /my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "longs_as_integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"
          }
        }
      }
    ]
  }
}
In this example, a dynamic template named longs_as_integers is defined. It matches any dynamically added field that would default to type long and maps it as integer instead.
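Templates can also match on the field's name or full path. A hedged sketch that maps any newly seen string field whose name ends in _id to keyword (the naming convention here is just an assumption for illustration):
PUT /my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "ids_as_keywords": {
          "match_mapping_type": "string",
          "match": "*_id",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}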
Analyzers in Elasticsearch
Analyzers are at the heart of text analysis in Elasticsearch. They process text into tokens for indexing in the inverted index. An analyzer is composed of a single tokenizer and zero or more token filters, and it may also contain character filters. A tokenizer breaks a string down into individual terms or tokens. A token filter then modifies those tokens, for example by lowercasing them or filtering out stop words. A character filter, on the other hand, preprocesses the string before it is passed to the tokenizer.
Built-in Analyzers
Elasticsearch comes with a number of built-in analyzers that are ready to use. Some of the most commonly used ones include:
- standard: The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols and is the default analyzer for Elasticsearch.
- simple: The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
- whitespace: The whitespace analyzer divides text into terms whenever it encounters any whitespace character.
- stop: The stop analyzer is like the simple analyzer, but also supports removal of stop words.
- keyword: The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs the exact same text as a single term.
- pattern: The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves.
- language: Elasticsearch provides several language analyzers like english, french, german, etc. These analyzers are capable of understanding the peculiarities of their respective languages.
Here's an example of how to use the english analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "english"
        }
      }
    }
  }
}
In this example, the english analyzer is set as the default analyzer for the index.
Custom Analyzers
In addition to the built-in analyzers, you can also define your own custom analyzers. A custom analyzer is defined by configuring a combination of:
- Character filters: These are used to preprocess the string before it is passed to the tokenizer. For example, you can use a character filter to replace & with and (a concrete sketch of this appears at the end of this section).
- Tokenizer: This is used to break the string down into individual terms or tokens. For example, you can use the whitespace tokenizer to break the string into terms whenever it encounters any whitespace character.
- Token filters: These are used to postprocess the tokens emitted by the tokenizer. For example, you can use a token filter to lowercase all tokens.
Here's an example of how to define a custom analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
In this example, a custom analyzer named my_custom_analyzer is defined. It uses the html_strip character filter, the standard tokenizer, and two token filters: lowercase and asciifolding.
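The character-filter example mentioned in the list above (replacing & with and) can be expressed with the built-in mapping character filter. A sketch, with illustrative names for the filter and analyzer:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "my_and_analyzer": {
          "type": "custom",
          "char_filter": ["and_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}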
Using Analyzers
Analyzers can be used in several places in Elasticsearch:
- During indexing: When a document is indexed, any text fields are analyzed to convert the text into terms that are added to the inverted index.
- During searching: When a query string is passed to a query that searches a text field, the query string is analyzed to convert it into terms that are looked up in the inverted index.
- In the analyze API: The analyze API allows you to test your analyzers and see the tokens that a piece of text is broken down into.
Here's an example of how to use the analyze API:
POST /_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox jumped over the lazy dog"
}
In this example, the standard analyzer is used to analyze the provided text.
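The response lists every token together with its position and character offsets. An abridged sketch of what the standard analyzer returns for this text:
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    ...
  ]
}
The remaining tokens are brown, fox, jumped, over, the, lazy, and dog, all lowercased by the standard analyzer.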
Conclusion
Elasticsearch is a powerful search and analytics engine that provides a wide range of features and capabilities. Whether you're building a simple search application or a complex analytics system, features such as scripted updates, custom routing, optimistic concurrency control, the bulk API, and configurable analysis and mappings give you fine-grained control over how your data is indexed, stored, and searched.