Viktor Tyshchenko

Site search organization: basic concepts

index, elasticsearch, mapping, analyzers

Now it’s time to get acquainted with Elasticsearch. This NoSQL database is used to store logs, analyze information and, most importantly, search. At its core it is a search engine over JSON documents, built on Apache Lucene. Elasticsearch also offers features like sharding and replication out of the box, but I will not touch their settings yet.

Technology stack

The purpose of this article is to explain the basic things you will encounter when implementing search. Strangely enough, there is not much detailed, structured information on the basics of ES on the Internet, so I hope this text will be useful. Of course, the official documentation is pretty good, but reading it can take a long time, and its examples are given on abstract data whose origin is unclear. Elasticsearch is usually used as part of the so-called ELK stack:

  • Elasticsearch - the main service, which exposes only a REST API;
  • Logstash - a data-processing service that prepares data for pushing into Elasticsearch;
  • Kibana - a web UI for charts, debugging tools and so on.

The headquarters of Elasticsearch is located in Amsterdam, so some things may seem very strange. For example, where the 3.x and 4.x branches have disappeared to, even the developers themselves do not know ¯\_(ツ)_/¯.

Installation

Well, we have a clean Ubuntu 16.04 box for deploying the system that will index the site’s content. I decided to install version 5.6 as the most stable in terms of features and documentation. In general, Elasticsearch has many stable versions:

  • 2.4 - too old; the last release was in July 2017
  • 5.0 - 5.5 - similar to 2.4
  • 5.6 - according to the documentation it is compatible with the 2.x branch, and it is still actively developed
  • 6.x - the new generation, actively developed but largely incompatible with previous versions

The installation process on deb-based systems amounts to downloading the package and running dpkg -i elasticsearch.deb. Elasticsearch uses 2 GB of RAM by default, but the minimum requirement is 512 MB. You can lower this limit in the /etc/elasticsearch/jvm.options file, via the command-line arguments -Xms512m and -Xmx512m, or, if you use the Docker image, via the environment variable ES_JAVA_OPTS="-Xms512m -Xmx512m". To be sure Elasticsearch is up and running, send a GET request to http://localhost:9200/. The response should be JSON with some system information: the cluster name, versions, etc.
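If you want to script that sanity check, the response body is easy to parse; here is a minimal Python sketch (the helper name and the abridged sample body are mine, though the field names match what ES returns at GET /):

```python
import json

def parse_cluster_info(body: str) -> tuple:
    """Extract the cluster name and ES version from the GET / response body."""
    info = json.loads(body)
    return info["cluster_name"], info["version"]["number"]

# Abridged example of what ES 5.6 answers on GET http://localhost:9200/
sample = '{"name": "node-1", "cluster_name": "elasticsearch", "version": {"number": "5.6.16"}}'
print(parse_cluster_info(sample))  # ('elasticsearch', '5.6.16')
```

In a real script you would fetch the body with urllib or requests before parsing it.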

You can interact with Elasticsearch via raw HTTP requests using curl or wget, but I prefer the Developer Tools in Kibana.

Request syntax

As I said earlier, Elasticsearch has only a REST API for everything: creating an index or an analyzer, searching, aggregating, retrieving diagnostic information and so on. From here on, instead of the full GET http://localhost:9200/_cluster/health I will use the abbreviated form GET /_cluster/health, omitting the server name and port. A query also accepts optional flags, for example pretty, which formats the output as human-readable JSON, or explain, which is needed to analyze the execution plan. Each response carries some technical information: how long the query took, the number of shards, the number of objects found, etc. By the way, responses are limited to 10 entries by default. If you need more, specify the size and from values in the query parameters, or open a scrollable search, an analog of a server-side cursor in relational databases.
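The from/size arithmetic is trivial but easy to get wrong by one page; a small sketch (the helper name is mine):

```python
def page_params(page: int, per_page: int = 10) -> dict:
    """Build the from/size parameters for paginating search results.

    ES returns 10 hits by default; from/size works for shallow paging,
    while a scrollable search is preferable for walking a whole index.
    """
    return {"from": page * per_page, "size": per_page}

print(page_params(0))      # {'from': 0, 'size': 10}
print(page_params(3, 25))  # {'from': 75, 'size': 25}
```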

Indexes, mappings

As an example I will work with a database of tweets. We will implement search by content and by tags, plus autocompletion. All documents are stored in an index, which is created with the command PUT /twitter/. Now we can put our objects there, and ES will try to infer the data schema, called a mapping. Of course, defining a mapping this way is far from ideal, and a mapping is something more than a set of fields with their types, but it makes the material easier to grasp.

POST /twitter/tweet/1?pretty
{
  "subject": "First tweet",
  "geotag": "Dublin",
  "hashtags": ["tweet", "news", "Dublin"],
  "published": "2014-09-12T20:44:42+00:00"
}
{
  "_index": "twitter",
  "_type": "tweet",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

The main thing is that the response contains created: true. In this case _id is specified in the request path; if it is omitted, an id will be generated automatically. By the way, Elasticsearch is a distributed system, so conflicts must be resolved somehow; this is done using the _version field.

POST /twitter/tweet/2?pretty
{"subject": "First tweet in London", "geotag": "London", "hashtags": ["tweet", "mist", "London"], "published": "2014-09-13T20:44:42+00:00"}

POST /twitter/tweet/3?pretty
{"subject": "Sunny Dublin", "geotag": "Dublin", "hashtags": ["Dublin"], "published": "2014-09-11T20:44:42+00:00"}

POST /twitter/tweet/4?pretty
{"subject": "Sunny London", "geotag": "London", "hashtags": ["London"], "published": "2014-09-11T10:44:42+00:00"}

Review the mapping of the twitter index with GET /twitter/_mapping/:

{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "geotag": {
            "type": "text",
            "fields": {
              "keyword": {"type": "keyword", "ignore_above": 256}
            }
          },
          "hashtags": {
            "type": "text",
            "fields": {
              "keyword": {"type": "keyword", "ignore_above": 256}
            }
          },
          "published": {
            "type": "date"
          },
          "subject": {
            "type": "text",
            "fields": {
              "keyword": {"type": "keyword", "ignore_above": 256}
            }
          }
        }
      }
    }
  }
}

The response reads as: “for the index twitter there is a mapping for the document type tweet, which has three text fields and one date field”. ES makes no distinction between a single value and an array, so the tags are still represented as ordinary strings; each element of the array is simply processed separately. Let’s try to add one more type of document to the same index:

PUT /twitter/retweet/1?pretty
{"source": "https://twitter.com/3564123", "subject": "Sunny London", "geotag": "London", "hashtags": ["London"], "published": "2014-09-11T10:44:42+00:00"}
{
  "_index": "twitter",
  "_type": "retweet",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

The object was successfully added and indexed. So it is possible to add different document types to the same index and find all of them with a single query.

Back to the mapping of the subject field:

"subject": {
  "type": "text",
  "fields": {
    "keyword": {"type": "keyword", "ignore_above": 256}
  }
}

In this case the subject field is treated both as ordinary text with the default analyzer and as a whole keyword (tag). Analyzers will be described a little later; here I want to draw your attention to the fact that the same field can be analyzed in several ways simultaneously: as an entire string, as separate tokens, as a set of letters for autocompletion, and so on. And in queries you can use all of these at once. The mapping for such a field looks something like this:

"subject": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "keyword"
    },
    "autocomplete": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  },
  "analyzer": "english"
}

  • subject - ordinary text with English morphology
  • subject.raw - the raw text, without any analysis
  • subject.autocomplete - the value of subject processed by the autocomplete analyzer
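Mappings like this are plain JSON, so they are convenient to assemble in code; a minimal sketch (the helper name is mine, the field bodies follow the snippet above):

```python
def text_multi_field(base_analyzer: str, sub_fields: dict) -> dict:
    """Build a `text` field mapping with per-analyzer sub-fields.

    sub_fields maps a sub-field name to its mapping body,
    e.g. {"raw": {"type": "keyword"}}.
    """
    return {"type": "text", "analyzer": base_analyzer, "fields": dict(sub_fields)}

subject = text_multi_field("english", {
    "raw": {"type": "keyword"},
    "autocomplete": {"type": "text", "analyzer": "autocomplete",
                     "search_analyzer": "standard"},
})
print(sorted(subject["fields"]))  # ['autocomplete', 'raw']
```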

Let’s try to change the analyzer of an existing field:

PUT twitter/_mapping/tweet
{
  "properties": {
    "subject": {
      "type": "text",
      "analyzer": "english",
      "search_analyzer": "english"
    }
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Mapper for [subject] conflicts with existing mapping 
in other types:\n[mapper [subject] has different [analyzer], 
mapper [subject] is used by multiple types. 
Set update_all_types to true to update [search_analyzer] across all types., 
mapper [subject] is used by multiple types. 
Set update_all_types to true to update [search_quote_analyzer] across all types.]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Mapper for [subject] conflicts with existing mapping 
in other types:\n[mapper [subject] has different [analyzer], 
mapper [subject] is used by multiple types. 
Set update_all_types to true to update [search_analyzer] across all types., 
mapper [subject] is used by multiple types. 
Set update_all_types to true to update [search_quote_analyzer] across all types.]"
  },
  "status": 400
}

We got error 400 (illegal_argument_exception): the parameters of an existing field cannot be changed. However, you can add a new sub-field:

PUT twitter/_mapping/tweet
{
  "properties": {
    "subject": {
      "type": "text",
      "fields": {
        "en": {
          "type": "text",
          "analyzer": "english",
          "search_analyzer": "english"
        }
      }
    }
  }
}
{
  "acknowledged": true
}

The mapping of subject now looks like a set of fields with different analyzers:

"subject": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    },
    "en": {
      "type": "text",
      "analyzer": "english"
    }
  }
}

However, adding a field this way does not index existing documents; for that you need to re-save each document or call POST twitter/_update_by_query:

{
  "took": 69,
  "timed_out": false,
  "total": 6,
  "updated": 6,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}

Queries

We have figured out how Elasticsearch handles objects; that is enough to practice writing queries. First, let’s make sure the data is still in the database: GET /twitter/tweet/1/ should return the document we saved in the previous step. Now let’s try to find tweets published after September 1, 2010:

GET /twitter/_search
{
  "query": {
    "range": {
      "published": {
        "gte": "2010-09-01"
      }
    }
  }
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "2",
        "_score": 1,
        "_source": {
          "subject": "First tweet in London",
          "geotag": "London",
          "hashtags": [
            "tweet",
            "mist",
            "London"
          ],
          "published": "2014-09-13T20:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "4",
        "_score": 1,
        "_source": {
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "1",
        "_score": 1,
        "_source": {
          "subject": "First tweet",
          "geotag": "Dublin",
          "hashtags": [
            "tweet",
            "news",
            "Dublin"
          ],
          "published": "2014-09-12T20:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "retweet",
        "_id": "1",
        "_score": 1,
        "_source": {
          "source": "https://twitter.com/3564123",
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "3",
        "_score": 1,
        "_source": {
          "subject": "Sunny Dublin",
          "geotag": "Dublin",
          "hashtags": [
            "Dublin"
          ],
          "published": "2014-09-11T20:44:42+00:00"
        }
      }
    ]
  }
}
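Query bodies like the range query above are plain JSON too, so assembling them programmatically is straightforward; a Python sketch (the helper is hypothetical):

```python
from datetime import date

def published_after(day: date) -> dict:
    """Build a search body with a range query over the `published` date field."""
    return {"query": {"range": {"published": {"gte": day.isoformat()}}}}

body = published_after(date(2010, 9, 1))
print(body)  # {'query': {'range': {'published': {'gte': '2010-09-01'}}}}
```

You would then send this dict as the JSON body of GET /twitter/_search.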

Let’s try to search for objects that contain the word “Sunny”:

GET /twitter/_search?pretty
{
  "query": {"match": {
    "subject": "Sunny"
  }}
}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.7373906,
    "hits": [
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "4",
        "_score": 0.7373906,
        "_source": {
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "retweet",
        "_id": "1",
        "_score": 0.7373906,
        "_source": {
          "source": "https://twitter.com/3564123",
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "3",
        "_score": 0.25811607,
        "_source": {
          "subject": "Sunny Dublin",
          "geotag": "Dublin",
          "hashtags": [
            "Dublin"
          ],
          "published": "2014-09-11T20:44:42+00:00"
        }
      }
    ]
  }
}

Because we searched the whole index, both “tweet” and “retweet” objects were found. In total there were three hits, one of which relates to Dublin; let’s try to exclude it:

GET /twitter/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "subject": "Sunny"
          }
        }
      ],
      "filter": [
        {
          "match": {
            "hashtags": "London"
          }
        }
      ]
    }
  }
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.7373906,
    "hits": [
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "4",
        "_score": 0.7373906,
        "_source": {
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "retweet",
        "_id": "1",
        "_score": 0.7373906,
        "_source": {
          "source": "https://twitter.com/3564123",
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      }
    ]
  }
}
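The same bool query can be composed in code; a sketch (the helper name is mine):

```python
def bool_query(must: list, filter_: list) -> dict:
    """Combine scoring clauses (must) with non-scoring ones (filter)."""
    return {"query": {"bool": {"must": must, "filter": filter_}}}

body = bool_query(
    must=[{"match": {"subject": "Sunny"}}],       # participates in scoring
    filter_=[{"match": {"hashtags": "London"}}],  # only filters, no score impact
)
```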

I added a node of type bool, which can combine several must queries, a negation via must_not, or a filter (for the details it is better to refer to the documentation). must differs from filter in that its queries participate in the relevance calculation. Elasticsearch is famous for its fuzzy search, so let’s try searching for some similar text:

GET /twitter/_search?pretty
{
  "query": {"match": {
    "subject": "Sun"
  }}
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Not a single document? This happens because we have to take into account which field representation the search runs against. If you recall the mapping of the subject field, it has additional sub-fields: keyword and en.

GET /twitter/_search?pretty
{
  "query": {"match": {
    "subject.en": "Sun"
  }}
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "4",
        "_score": 0.68640786,
        "_source": {
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "3",
        "_score": 0.25811607,
        "_source": {
          "subject": "Sunny Dublin",
          "geotag": "Dublin",
          "hashtags": [
            "Dublin"
          ],
          "published": "2014-09-11T20:44:42+00:00"
        }
      }
    ]
  }
}

Done, and the record about Dublin showed up as well! Great, now let’s try to search for “Sunrise”:

GET /twitter/_search
{
  "query": {"match": {
    "subject.en": "Sunrise"
  }}
}

The result is empty (don’t believe me? install Kibana and check :)). The english analyzer does not understand such a strong morphological difference. Searching for “Dublin” or “First” would also succeed, but how do we understand why our “sun” was not found? Let’s turn to analyzers!

Analyzers, filters, tokenizers

To debug analyzers (you can also add your own), ES provides the _analyze endpoint, which lets you see exactly what information will be put into the index:

GET /twitter/_analyze
{
  "analyzer": "english",
  "text": "Sunny London"
}
{
  "tokens": [
    {
      "token": "sunni",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "london",
      "start_offset": 6,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

So the index contains only the token “sunni”, which cannot be found by the query “sun” at all. The list of built-in analyzers is quite long; the most popular are:

  • standard - splits the text into words and lowercases each of them
  • simple - splits the text into words, lowercases them and removes all non-alphabetic tokens (numbers, for example)
  • keyword - does nothing with the text and treats it as a single token (suitable for tags, so as not to look inside them)
  • stop - removes stop words
  • english, russian, italian - lexical analyzers for human languages

An analyzer is the general concept for converting a string into a set of tokens; it includes:

  • char_filter - processes the whole input string. For example, the standard installation ships html_strip, which removes HTML tags; you can also write your own
  • tokenizer - splits the string into tokens (there can be only one)
  • filter - processes each token separately (lowercasing, removing stop words, etc.) and can also add synonyms
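To get a feel for the pipeline, here is a toy Python imitation of the three stages. It only roughly approximates html_strip and the standard tokenizer and is in no way how ES itself works:

```python
import re

def analyze(text: str) -> list:
    """Toy analyzer mirroring the char_filter -> tokenizer -> filter pipeline."""
    text = re.sub(r"<[^>]+>", " ", text)                       # char_filter: strip HTML tags
    tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]   # tokenizer: keep words only
    return [t.lower() for t in tokens]                         # token filter: lowercase

print(analyze("<b>Sunny</b> London"))  # ['sunny', 'london']
```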

As an example, let’s write an analyzer for autocompletion. The basic idea is to split all words into N-grams and search for occurrences of the text from the input field in that set. First we need to choose a tokenizer, i.e. decide how to extract tokens from the text. I think the standard tokenizer will do, since it keeps only the words (not to be confused with the standard analyzer, which also lowercases!). We will then cut the tokens into N-grams (before changing the settings we have to close the index with POST /twitter/_close and reopen it afterwards with POST /twitter/_open):

PUT twitter/_settings
{
  "analysis": {
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_filter"]
      }
    },
    "filter": {
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 2,
        "max_gram": 12
      }
    }
  }
}
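What the edge_ngram filter does to a single token can be reproduced in a few lines of Python (a sketch mirroring the min_gram/max_gram settings above, not ES internals):

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 12) -> list:
    """Emit the leading n-grams of a token, like the edge_ngram token filter."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("sunny"))  # ['su', 'sun', 'sunn', 'sunny']
```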

Hmm, Elasticsearch even replied {"acknowledged": true}, which is 200 OK in its dialect. Let’s check how the analyzer works:

POST /twitter/_analyze
{
  "analyzer": "autocomplete",
  "text":     "Sunny Dublin"
}
{
  "tokens": [
    {
      "token": "su",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "sun",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "sunn",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
...
    {
      "token": "dublin",
      "start_offset": 6,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

We get a list of tokens in the response: [‘su’, ‘sun’, ‘sunn’, ‘sunny’, ‘du’, …]. Does it work?! Let’s wire up one more sub-field on subject and check:

PUT twitter/_mapping/tweet
{
  "properties": {
    "subject": {
      "type": "text",
      "fields": {
        "autocomplete": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}


POST twitter/_update_by_query

GET /twitter/_search
{
  "query": {"match": {
    "subject.autocomplete": "Sunn"
  }}
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.065146,
    "hits": [
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "4",
        "_score": 1.065146,
        "_source": {
          "subject": "Sunny London",
          "geotag": "London",
          "hashtags": [
            "London"
          ],
          "published": "2014-09-11T10:44:42+00:00"
        }
      },
      {
        "_index": "twitter",
        "_type": "tweet",
        "_id": "3",
        "_score": 0.4382968,
        "_source": {
          "subject": "Sunny Dublin",
          "geotag": "Dublin",
          "hashtags": [
            "Dublin"
          ],
          "published": "2014-09-11T20:44:42+00:00"
        }
      }
    ]
  }
}

And we see our records about the sunny cities! However, the word ‘Sundfs’ would find them too, and that is already a mess. It happens because the same analyzer is used for searching as well: ‘Sundfs’ is split into the N-grams [‘su’, ‘sun’, ‘sund’, ‘sundf’, ‘sundfs’], which intersect with those stored in the index. The way out is to treat the search input as a single word. This is controlled by the search_analyzer setting of the field:

"autocomplete": {
    "type": "text",
    "analyzer": "autocomplete",
    "search_analyzer": "standard"
  }
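Why the choice of search_analyzer matters can be shown with a toy edge n-gram function (names and logic are a simplified imitation, not ES internals):

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 12) -> list:
    """Leading n-grams of a token, like the edge_ngram filter."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

index_tokens = set(edge_ngrams("sunny"))  # what the autocomplete analyzer stored

# search_analyzer = autocomplete: the query is cut into n-grams too,
# so any shared prefix ("su", "sun") produces a false match for "sundfs"
assert index_tokens & set(edge_ngrams("sundfs"))

# search_analyzer = standard: the query stays a single lowercased token,
# so only a real prefix of "sunny" matches
assert "sunn" in index_tokens
assert "sundfs" not in index_tokens
```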

After the field mapping is updated, autocompletion works fine!

Conclusion

I hope you now have an understanding of how to work with Elasticsearch, what mappings and analyzers are, and how to build the simplest queries. I tried to describe the minimum set of knowledge you need to organize search over your own content independently. That is all for now, but Elasticsearch has many other interesting things that go beyond the scope of this article. Here is just a small list of topics:

  • highlighting found matches
  • special query characters (“…” for exact phrases, “-” to exclude words, etc.)
  • working with geo objects - search around a location
  • cluster configuration
  • writing your own scripts in the [Painless language](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-scripting-painless.html)
  • the wonderful monitoring endpoint GET /_cat
  • integration with Logstash for logging
  • aggregations
  • request flags: _source, size, sort, etc.
  • working with aliases, i.e. reindexing without downtime (_reindex, _aliases)
  • viewing the query execution plan with _explain
  • relevance calculation (_score)