- # Elasticsearch knowledge **(STARTER ONLY)**
- This page collects useful information for developers working with Elasticsearch.
- Information on how to enable Elasticsearch and perform the initial indexing is kept in the [Elasticsearch integration documentation](../integration/elasticsearch.md#enabling-elasticsearch).
- ## Deep Dive
- In June 2019, Mario de la Ossa hosted a [Deep Dive] on GitLab's [Elasticsearch integration] to share his domain-specific knowledge with anyone who may work in this part of the code base in the future. You can find the [recording on YouTube], and the slides on [Google Slides] and in [PDF]. Everything covered in this deep dive was accurate as of GitLab 12.0, and while specific details may have changed since then, it should still serve as a good introduction.
- [Deep Dive]: https://gitlab.com/gitlab-org/create-stage/issues/1
- [Elasticsearch integration]: ../integration/elasticsearch.md
- [recording on YouTube]: https://www.youtube.com/watch?v=vrvl-tN2EaA
- [Google Slides]: https://docs.google.com/presentation/d/1H-pCzI_LNrgrL5pJAIQgvLX8Ji0-jIKOg1QeJQzChug/edit
- [PDF]: https://gitlab.com/gitlab-org/create-stage/uploads/c5aa32b6b07476fa8b597004899ec538/Elasticsearch_Deep_Dive.pdf
- ## Initial installation on OS X
- We recommend using the Docker image. After installing Docker, you can immediately spin up an instance with:
- ```sh
- docker run --name elastic56 -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:5.6.12
- ```
- Use `docker stop elastic56` and `docker start elastic56` to stop and start it.
- ### Installing on the host
- We currently only support Elasticsearch [5.6 to 6.x](../integration/elasticsearch.md#version-requirements).
- Version 5.6 is available on Homebrew and is the recommended version to use in order to test compatibility.
- ```sh
- brew install elasticsearch@5.6
- ```
- There is no need to install any plugins.
- ## New repo indexer (beta)
- If you're interested in working with the new beta repo indexer, all you need to do is:
- ```sh
- git clone git@gitlab.com:gitlab-org/gitlab-elasticsearch-indexer.git
- cd gitlab-elasticsearch-indexer
- make
- make install
- ```
- This adds `gitlab-elasticsearch-indexer` to `$GOPATH/bin`. Make sure that directory is in your `$PATH`. After that, GitLab will find it and you'll be able to enable it in the admin settings area.
- **Note:** `make` does not recompile the executable unless you run `make clean` beforehand.
- ## Helpful rake tasks
- - `gitlab:elastic:test:index_size`: Tells you how much space the current index is using, as well as how many documents are in the index.
- - `gitlab:elastic:test:index_size_change`: Outputs index size, reindexes, and outputs index size again. Useful when testing improvements to indexing size.
- Additionally, if you need large repos or multiple forks for testing, consider [following these instructions](rake_tasks.md#extra-project-seed-options).
- ## How does it work?
- The Elasticsearch integration depends on an external indexer. We ship an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer). The user must trigger the initial indexing via a rake task but, after this is done, GitLab itself will trigger reindexing when required via `after_` callbacks on create, update, and destroy that are inherited from [/ee/app/models/concerns/elastic/application_search.rb](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/app/models/concerns/elastic/application_search.rb).
- All indexing after the initial one is done via `ElasticIndexerWorker` (Sidekiq jobs).
- Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control and have historically been a source of security bugs, so please pay close attention to them!
- ## Existing Analyzers/Tokenizers/Filters
- These are all defined in <https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/elasticsearch/git/model.rb>
- ### Analyzers
- #### `path_analyzer`
- Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.
- Please see the `path_tokenizer` explanation below for an example.
- #### `sha_analyzer`
- Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.
- Please see the `sha_tokenizer` explanation below for an example.
- #### `code_analyzer`
- Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.
- The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.
- Please see the `code` filter for an explanation of how tokens are split.
- #### `code_search_analyzer`
- Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
- ### Tokenizers
- #### `sha_tokenizer`
- This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any leading portion of at least 5 characters.
- Example:
- `240c29dc7e` becomes:
- - `240c2`
- - `240c29`
- - `240c29d`
- - `240c29dc`
- - `240c29dc7`
- - `240c29dc7e`
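The expansion above can be reproduced with a few lines of Ruby. This is only an illustrative sketch, not GitLab's or Elasticsearch's implementation: the `edge_ngrams` helper is hypothetical, the minimum length of 5 comes from the description above, and the maximum of 40 (a full SHA-1) is an assumption.

```ruby
# Hypothetical helper: emit every prefix of `text` whose length lies
# between min_gram and max_gram (inclusive).
# min_gram: 5 is taken from the description above; max_gram: 40 is assumed.
def edge_ngrams(text, min_gram: 5, max_gram: 40)
  upper = [text.length, max_gram].min
  (min_gram..upper).map { |len| text[0, len] }
end

puts edge_ngrams('240c29dc7e')
```

Inputs shorter than the minimum length produce no tokens in this sketch.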
- #### `path_tokenizer`
- This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.
- Example:
- `'/some/path/application.js'` becomes:
- - `'/some/path/application.js'`
- - `'some/path/application.js'`
- - `'path/application.js'`
- - `'application.js'`
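As a rough illustration of the reversed path hierarchy shown above, the following Ruby sketch produces the same tokens for the example path. The `reverse_path_tokens` helper is hypothetical and ignores edge cases (such as trailing delimiters) that the real tokenizer may treat differently.

```ruby
# Hypothetical helper: split a path on the delimiter and emit every
# trailing portion, from the full path down to the bare file name.
def reverse_path_tokens(path, delimiter: '/')
  parts = path.split(delimiter) # a leading "/" yields an empty first part
  (0...parts.length).map { |i| parts.drop(i).join(delimiter) }
end

puts reverse_path_tokens('/some/path/application.js')
```

Because every trailing portion is indexed, a search for `path/application.js` or just `application.js` can match the same blob.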
- ### Filters
- #### `code`
- Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
- Patterns:
- - `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowerCamelCased strings as separate tokens
- - `"(\\d+)"`: extracts digits
- - `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
- - `'"((?:\\"|[^"]|\\")*)"'`: captures terms inside quotes, removing the quotes
- - `"'((?:\\'|[^']|\\')*)'"`: same as above, for single-quotes
- - `'\.([^.]+)(?=\.|\s|\Z)'`: separate terms with periods in-between
- - `'\/?([^\/]+)(?=\/|\b)'`: separate path terms `like/this/one`
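To get a feel for what some of these patterns capture, here is a small Ruby sketch that applies three of them with `String#scan`. It is an approximation under stated assumptions: Ruby's regex engine stands in for the Java one used by Elasticsearch, the `capture_tokens` helper and constant names are hypothetical, and the original token is kept alongside the captures (the Pattern Capture filter's default `preserve_original` behaviour).

```ruby
# Ruby equivalents of three of the patterns listed above.
CAMEL_PARTS    = /(\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}+)/
DIGITS         = /(\d+)/
CAMEL_SUFFIXES = /(?=(\p{Lu}+\p{L}+))/

# Hypothetical helper: return the original token plus every capture
# produced by the given patterns, without duplicates.
def capture_tokens(token, patterns)
  captures = patterns.flat_map { |pattern| token.scan(pattern).flatten }
  ([token] + captures).uniq
end

puts capture_tokens('ThisIsATest', [CAMEL_SUFFIXES]).inspect
puts capture_tokens('fooBar42', [CAMEL_PARTS, DIGITS]).inspect
```

The lookahead in `CAMEL_SUFFIXES` matches without consuming input, which is why overlapping suffixes such as `IsATest` and `ATest` are all emitted.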
- #### `edgeNGram_filter`
- Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into prefixes starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
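The sketch below illustrates why this works: prefixes are generated when indexing, while the search input is only normalized (as `code_search_analyzer` does), so a query matches when it equals one of the indexed prefixes. The helper names are hypothetical, the filter chain is simplified to lowercasing plus edge n-grams, the minimum length of 2 follows the `gl` example, and the maximum of 40 is an assumption.

```ruby
# Index-time analysis (simplified): lowercase, then emit edge n-grams.
# min_gram: 2 follows the "gl" example above; max_gram: 40 is assumed.
def index_terms(token, min_gram: 2, max_gram: 40)
  normalized = token.downcase
  upper = [normalized.length, max_gram].min
  (min_gram..upper).map { |len| normalized[0, len] }
end

# Search-time analysis (simplified): lowercase only, no n-grams.
def search_term(input)
  input.downcase
end

# A query matches when its single term equals one of the indexed terms.
def matches?(indexed_token, query)
  index_terms(indexed_token).include?(search_term(query))
end

puts index_terms('glasses').inspect
puts matches?('glasses', 'glass')
puts matches?('glasses', 'lasses')
```

Only leading portions match: `glass` finds `glasses`, while `lasses` does not, because edge n-grams are anchored at the start of the token.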
- ## Gotchas
- - Searches can have their own analyzers. Remember to check them when editing analyzers.
- - `Character` filters (as opposed to token filters) always replace the original character, so they're not a good choice as they can hinder exact searches.
- ## Zero downtime reindexing with multiple indices
- Currently, GitLab can only handle a single version of the index settings. Any setting or schema change requires reindexing everything from scratch. Since reindexing can take a long time, this can cause search functionality downtime.
- To avoid downtime, GitLab is working to support multiple indices that
- can function at the same time. Whenever the schema changes, the admin
- will be able to create a new index and reindex to it, while searches
- continue to go to the older, stable index. Any data updates will be
- forwarded to both indices. Once the new index is ready, an admin can
- mark it active, which will direct all searches to it, and remove the old
- index.
- This is also helpful for migrating to new servers, e.g. moving to/from AWS.
- We are currently in the process of migrating to this new design. Everything is hardwired to work with one single version for now.
- ### Architecture
- The traditional setup, provided by `elasticsearch-rails`, is to communicate through its internal proxy classes. Developers would write model-specific logic in a module for the model to include (e.g. `SnippetsSearch`). The `__elasticsearch__` methods would return a proxy object, e.g.:
- - `Issue.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::ClassMethodsProxy`
- - `Issue.first.__elasticsearch__` returns an instance of `Elasticsearch::Model::Proxy::InstanceMethodsProxy`.
- These proxy objects would talk to the Elasticsearch server directly (see top half of the diagram).
- ![Elasticsearch Architecture](img/elasticsearch_architecture.svg)
- In the planned new design, each model would have a pair of corresponding subclassed proxy objects, in which model-specific logic is located. For example, `Snippet` would have `SnippetClassProxy` and `SnippetInstanceProxy` (subclasses of `Elasticsearch::Model::Proxy::ClassMethodsProxy` and `Elasticsearch::Model::Proxy::InstanceMethodsProxy`, respectively).
- `__elasticsearch__` would represent another layer of proxy object, keeping track of multiple actual proxy objects. It would forward method calls to the appropriate index. For example:
- - `model.__elasticsearch__.search` would be forwarded to the one stable index, since it is a read operation.
- - `model.__elasticsearch__.update_document` would be forwarded to all indices, to keep all indices up-to-date.
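The forwarding behaviour described above could look roughly like the following plain-Ruby sketch. It is not the actual GitLab implementation: `MultiVersionProxy`, `FakeIndexProxy`, and the `READ_METHODS` list are hypothetical names, and the fake proxy records calls instead of talking to Elasticsearch.

```ruby
# Hypothetical stand-in for a per-index proxy; it only records calls.
class FakeIndexProxy
  attr_reader :version, :calls

  def initialize(version)
    @version = version
    @calls = []
  end

  def search(query)
    @calls << [:search, query]
    "results for #{query} from #{version}"
  end

  def update_document
    @calls << [:update_document]
    true
  end
end

# Hypothetical outer proxy: reads go to the stable index only,
# every other call is forwarded to all indices.
class MultiVersionProxy
  READ_METHODS = %i[search].freeze

  def initialize(stable:, others: [])
    @stable = stable
    @all = [stable, *others]
  end

  def method_missing(method_name, *args, &block)
    return super unless @stable.respond_to?(method_name)

    if READ_METHODS.include?(method_name)
      @stable.public_send(method_name, *args, &block)
    else
      @all.map { |proxy| proxy.public_send(method_name, *args, &block) }
    end
  end

  def respond_to_missing?(method_name, include_private = false)
    @stable.respond_to?(method_name, include_private) || super
  end
end

stable = FakeIndexProxy.new('v12p1')
fresh = FakeIndexProxy.new('v12p3')
proxy = MultiVersionProxy.new(stable: stable, others: [fresh])

puts proxy.search('foo')   # served by the stable index only
proxy.update_document      # applied to both indices
puts fresh.calls.inspect
```

Keeping the list of read operations explicit means any method not listed is treated as a write and reaches every index, which errs on the side of keeping the new index up to date.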
- The global configurations per version are now in the `Elastic::(Version)::Config` class. You can change mappings there.
- ### Creating new version of schema
- NOTE: **Note:** this is not applicable yet as multiple indices functionality is not fully implemented.
- Folders like `ee/lib/elastic/v12p1` contain snapshots of search logic from different versions. To keep a continuous Git history, the latest version lives under `ee/lib/elastic/latest`, but its classes are aliased under an actual version (e.g. `ee/lib/elastic/v12p3`). When referencing these classes, never use the `Latest` namespace directly, but use the actual version (e.g. `V12p3`).
- The version name basically follows GitLab's release version. If settings are changed in 12.3, we will create a new namespace called `V12p3` (p stands for "point"). Raise an issue if there is a need to name a version differently.
- If the current version is `v12p1`, and we need to create a new version for `v12p3`, the steps are as follows:
- 1. Copy the entire folder of `v12p1` as `v12p3`
- 1. Change the namespace for files under `v12p3` folder from `V12p1` to `V12p3` (which are still aliased to `Latest`)
- 1. Delete `v12p1` folder
- 1. Copy the entire folder of `latest` as `v12p1`
- 1. Change the namespace for files under `v12p1` folder from `Latest` to `V12p1`
- 1. Make changes to files under the `latest` folder as needed
- ## Troubleshooting
- ### Getting `flood stage disk watermark [95%] exceeded`
- You might get an error such as:
- ```
- [2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
- flood stage disk watermark [95%] exceeded on
- [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
- all indices on this node will be marked read-only
- ```
- This happens because you've exceeded the disk space threshold: Elasticsearch considers the remaining disk space insufficient, based on the default 95% threshold.
- In addition, the `read_only_allow_delete` setting will be set to `true`, which blocks indexing, `forcemerge`, and similar operations. You can inspect the current index settings with:
- ```sh
- curl "http://localhost:9200/gitlab-development/_settings?pretty"
- ```
- Add this to your `elasticsearch.yml` file:
- ```yaml
- # turn off the disk allocator
- cluster.routing.allocation.disk.threshold_enabled: false
- ```
- _or_
- ```yaml
- # set your own limits
- cluster.routing.allocation.disk.threshold_enabled: true
- cluster.routing.allocation.disk.watermark.flood_stage: 5gb # ES 6.x only
- cluster.routing.allocation.disk.watermark.low: 15gb
- cluster.routing.allocation.disk.watermark.high: 10gb
- ```
- Restart Elasticsearch, and the `read_only_allow_delete` setting will clear on its own.
- _from "Disk-based Shard Allocation | Elasticsearch Reference" [5.6](https://www.elastic.co/guide/en/elasticsearch/reference/5.6/disk-allocator.html#disk-allocator) and [6.x](https://www.elastic.co/guide/en/elasticsearch/reference/6.x/disk-allocator.html)_
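If the `read_only_allow_delete` block is still present after a restart, the Elasticsearch reference describes resetting it explicitly by setting `index.blocks.read_only_allow_delete` to `null` through the index settings API. The Ruby sketch below builds such a request. The helper names are hypothetical, and the base URL and index name are assumptions taken from the `curl` example above. Nothing is sent until `send_request` is called.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build (but do not send) a PUT request that resets the read-only block.
# base_url and the index name are assumptions; adjust them for your setup.
def build_unblock_request(index, base_url: 'http://localhost:9200')
  uri = URI.join(base_url, "/#{index}/_settings")
  request = Net::HTTP::Put.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate('index.blocks.read_only_allow_delete' => nil)
  request
end

# Send the request; only call this against a running Elasticsearch node.
def send_request(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

request = build_unblock_request('gitlab-development')
puts "#{request.method} #{request.uri}"
puts request.body
```

To apply it, call `send_request(request)` once enough disk space has been freed or the thresholds have been adjusted. Otherwise Elasticsearch may set the block again.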