PageRenderTime 44ms CodeModel.GetById 29ms RepoModel.GetById 0ms app.codeStats 0ms

/doc/development/adding_database_indexes.md

https://gitlab.com/realsatomic/gitlab
Markdown | 327 lines | 248 code | 79 blank | 0 comment | 0 complexity | 9f90319ad9264eacd0b59d8d4e48362e MD5 | raw file
  1. ---
  2. stage: Enablement
  3. group: Database
  4. info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
  5. ---
  6. # Adding Database Indexes
  7. Indexes can be used to speed up database queries, but when should you add a new
  8. index? Traditionally the answer to this question has been to add an index for
  9. every column used for filtering or joining data. For example, consider the
  10. following query:
  11. ```sql
  12. SELECT *
  13. FROM projects
  14. WHERE user_id = 2;
  15. ```
  16. Here we are filtering by the `user_id` column and as such a developer may decide
  17. to index this column.
  18. While in certain cases indexing columns using the above approach may make sense
  19. it can actually have a negative impact. Whenever you write data to a table any
  20. existing indexes need to be updated. The more indexes there are the slower this
  21. can potentially become. Indexes can also take up quite some disk space depending
  22. on the amount of data indexed and the index type. For example, PostgreSQL offers
  23. "GIN" indexes which can be used to index certain data types that can not be
  24. indexed by regular B-tree indexes. These indexes however generally take up more
  25. data and are slower to update compared to B-tree indexes.
  26. Because of all this one should not blindly add a new index for every column used
  27. to filter data by. Instead one should ask themselves the following questions:
  28. 1. Can I write my query in such a way that it re-uses as many existing indexes
  29. as possible?
  30. 1. Is the data going to be large enough that using an index will actually be
  31. faster than just iterating over the rows in the table?
  32. 1. Is the overhead of maintaining the index worth the reduction in query
  33. timings?
  34. We'll explore every question in detail below.
  35. ## Re-using Queries
  36. The first step is to make sure your query re-uses as many existing indexes as
  37. possible. For example, consider the following query:
  38. ```sql
  39. SELECT *
  40. FROM todos
  41. WHERE user_id = 123
  42. AND state = 'open';
  43. ```
  44. Now imagine we already have an index on the `user_id` column but not on the
  45. `state` column. One may think this query will perform badly due to `state` being
  46. unindexed. In reality the query may perform just fine given the index on
  47. `user_id` can filter out enough rows.
  48. The best way to determine if indexes are re-used is to run your query using
  49. `EXPLAIN ANALYZE`. Depending on any extra tables that may be joined and
  50. other columns being used for filtering you may find an extra index is not going
  51. to make much (if any) difference. On the other hand you may determine that the
  52. index _may_ make a difference.
  53. In short:
  54. 1. Try to write your query in such a way that it re-uses as many existing
  55. indexes as possible.
  56. 1. Run the query using `EXPLAIN ANALYZE` and study the output to find the most
  57. ideal query.
  58. ## Data Size
  59. A database may decide not to use an index despite it existing in case a regular
  60. sequence scan (= simply iterating over all existing rows) is faster. This is
  61. especially the case for small tables.
  62. If a table is expected to grow in size and you expect your query has to filter
  63. out a lot of rows you may want to consider adding an index. If the table size is
  64. very small (for example, fewer than `1,000` records) or any existing indexes filter out
  65. enough rows you may _not_ want to add a new index.
  66. ## Maintenance Overhead
  67. Indexes have to be updated on every table write. In case of PostgreSQL _all_
  68. existing indexes will be updated whenever data is written to a table. As a
  69. result of this having many indexes on the same table will slow down writes.
  70. Because of this one should ask themselves: is the reduction in query performance
  71. worth the overhead of maintaining an extra index?
  72. If adding an index reduces SELECT timings by 5 milliseconds but increases
  73. INSERT/UPDATE/DELETE timings by 10 milliseconds then the index may not be worth
  74. it. On the other hand, if SELECT timings are reduced but INSERT/UPDATE/DELETE
  75. timings are not affected you may want to add the index after all.
  76. ## Finding Unused Indexes
  77. To see which indexes are unused you can run the following query:
  78. ```sql
  79. SELECT relname as table_name, indexrelname as index_name, idx_scan, idx_tup_read, idx_tup_fetch, pg_size_pretty(pg_relation_size(indexrelname::regclass))
  80. FROM pg_stat_all_indexes
  81. WHERE schemaname = 'public'
  82. AND idx_scan = 0
  83. AND idx_tup_read = 0
  84. AND idx_tup_fetch = 0
  85. ORDER BY pg_relation_size(indexrelname::regclass) desc;
  86. ```
  87. This query outputs a list containing all indexes that are never used and sorts
  88. them by indexes sizes in descending order. This query can be useful to
  89. determine if any previously indexes are useful after all. More information on
  90. the meaning of the various columns can be found at
  91. <https://www.postgresql.org/docs/current/monitoring-stats.html>.
  92. Because the output of this query relies on the actual usage of your database it
  93. may be affected by factors such as (but not limited to):
  94. - Certain queries never being executed, thus not being able to use certain
  95. indexes.
  96. - Certain tables having little data, resulting in PostgreSQL using sequence
  97. scans instead of index scans.
  98. In other words, this data is only reliable for a frequently used database with
  99. plenty of data and with as many GitLab features enabled (and being used) as
  100. possible.
  101. ## Requirements for naming indexes
  102. Indexes with complex definitions need to be explicitly named rather than
  103. relying on the implicit naming behavior of migration methods. In short,
  104. that means you **must** provide an explicit name argument for an index
  105. created with one or more of the following options:
  106. - `where`
  107. - `using`
  108. - `order`
  109. - `length`
  110. - `type`
  111. - `opclass`
  112. ### Considerations for index names
  113. Index names don't have any significance in the database, so they should
  114. attempt to communicate intent to others. The most important rule to
  115. remember is that generic names are more likely to conflict or be duplicated,
  116. and should not be used. Some other points to consider:
  117. - For general indexes, use a template, like: `index_{table}_{column}_{options}`.
  118. - For indexes added to solve a very specific problem, it may make sense
  119. for the name to reflect their use.
  120. - Identifiers in PostgreSQL have a maximum length of 63 bytes.
  121. - Check `db/structure.sql` for conflicts and ideas.
  122. ### Why explicit names are required
  123. As Rails is database agnostic, it generates an index name only
  124. from the required options of all indexes: table name and column name(s).
  125. For example, imagine the following two indexes are created in a migration:
  126. ```ruby
  127. def up
  128. add_index :my_table, :my_column
  129. add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
  130. end
  131. ```
  132. Creation of the second index would fail, because Rails would generate
  133. the same name for both indexes.
  134. This is further complicated by the behavior of the `index_exists?` method.
  135. It considers only the table name, column name(s) and uniqueness specification
  136. of the index when making a comparison. Consider:
  137. ```ruby
  138. def up
  139. unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL')
  140. add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
  141. end
  142. end
  143. ```
  144. The call to `index_exists?` will return true if **any** index exists on
  145. `:my_table` and `:my_column`, and index creation will be bypassed.
  146. The `add_concurrent_index` helper is a requirement for creating indexes
  147. on populated tables. Since it cannot be used inside a transactional
  148. migration, it has a built-in check that detects if the index already
  149. exists. In the event a match is found, index creation is skipped.
  150. Without an explicit name argument, Rails can return a false positive
  151. for `index_exists?`, causing a required index to not be created
  152. properly. By always requiring a name for certain types of indexes, the
  153. chance of error is greatly reduced.
  154. ## Temporary indexes
  155. There may be times when an index is only needed temporarily.
  156. For example, in a migration, a column of a table might be conditionally
  157. updated. To query which columns need to be updated within the
  158. [query performance guidelines](query_performance.md), an index is needed that would otherwise
  159. not be used.
  160. In these cases, a temporary index should be considered. To specify a
  161. temporary index:
  162. 1. Prefix the index name with `tmp_` and follow the [naming conventions](database/constraint_naming_convention.md) and [requirements for naming indexes](#requirements-for-naming-indexes) for the rest of the name.
  163. 1. Create a follow-up issue to remove the index in the next (or future) milestone.
  164. 1. Add a comment in the migration mentioning the removal issue.
  165. A temporary migration would look like:
  166. ```ruby
  167. INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled'
  168. def up
  169. # Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234
  170. add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME
  171. end
  172. def down
  173. remove_concurrent_index_by_name :projects, INDEX_NAME
  174. end
  175. ```
  176. ## Create indexes asynchronously
  177. For very large tables, index creation can be a challenge to manage.
  178. While `add_concurrent_index` creates indexes in a way that does not block
  179. normal traffic, it can still be problematic when index creation runs for
  180. many hours. Necessary database operations like `autovacuum` cannot run, and
  181. on GitLab.com, the deployment process is blocked waiting for index
  182. creation to finish.
  183. To limit impact on GitLab.com, a process exists to create indexes
  184. asynchronously during weekend hours. Due to generally lower levels of
  185. traffic and lack of regular deployments, this process allows the
  186. creation of indexes to proceed with a lower level of risk. The below
  187. sections describe the steps required to use these features:
  188. 1. [Schedule the index to be created](#schedule-the-index-to-be-created).
  189. 1. [Verify the MR was deployed and the index exists in production](#verify-the-mr-was-deployed-and-the-index-exists-in-production).
  190. 1. [Add a migration to create the index synchronously](#add-a-migration-to-create-the-index-synchronously).
  191. ### Schedule the index to be created
  192. Create an MR with a post-deployment migration which prepares the index
  193. for asynchronous creation. An example of creating an index using
  194. the asynchronous index helpers can be seen in the block below. This migration
  195. enters the index name and definition into the `postgres_async_indexes`
  196. table. The process that runs on weekends pulls indexes from this
  197. table and attempt to create them.
  198. ```ruby
  199. # in db/post_migrate/
  200. INDEX_NAME = 'index_ci_builds_on_some_column'
  201. def up
  202. prepare_async_index :ci_builds, :some_column, name: INDEX_NAME
  203. end
  204. def down
  205. unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
  206. end
  207. ```
  208. ### Verify the MR was deployed and the index exists in production
  209. You can verify if the MR was deployed to GitLab.com by executing
  210. `/chatops run auto_deploy status <merge_sha>`. To verify existence of
  211. the index, you can:
  212. - Use a meta-command in #database-lab, such as: `\d <index_name>`.
  213. - Ensure that the index is not [`invalid`](https://www.postgresql.org/docs/12/sql-createindex.html#:~:text=The%20psql%20%5Cd%20command%20will%20report%20such%20an%20index%20as%20INVALID).
  214. - Ask someone in #database to check if the index exists.
  215. - With proper access, you can also verify directly on production or in a
  216. production clone.
  217. ### Add a migration to create the index synchronously
  218. After the index is verified to exist on the production database, create a second
  219. merge request that adds the index synchronously. The synchronous
  220. migration results in a no-op on GitLab.com, but you should still add the
  221. migration as expected for other installations. The below block
  222. demonstrates how to create the second migration for the previous
  223. asynchronous example.
  224. WARNING:
  225. The responsibility lies on the individual writing the migrations to verify
  226. the index exists in production before merging a second migration that
  227. adds the index using `add_concurrent_index`. If the second migration is
  228. deployed and the index has not yet been created, the index is created
  229. synchronously when the second migration executes.
  230. ```ruby
  231. # in db/post_migrate/
  232. INDEX_NAME = 'index_ci_builds_on_some_column'
  233. disable_ddl_transaction!
  234. def up
  235. add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
  236. end
  237. def down
  238. remove_concurrent_index_by_name :ci_builds, INDEX_NAME
  239. end
  240. ```
  241. ## Test database index changes locally
  242. You must test the database index changes locally before creating a merge request.
  243. ### Verify indexes created asynchronously
  244. Use the asynchronous index helpers on your local environment to test changes for creating an index:
  245. 1. Enable the feature flags by running `Feature.enable(:database_async_index_creation)` and `Feature.enable(:database_reindexing)` in the Rails console.
  246. 1. Run `bundle exec rails db:migrate` so that it creates an entry in the `postgres_async_indexes` table.
  247. 1. Run `bundle exec rails gitlab:db:reindex` so that the index is created asynchronously.
  248. 1. To verify the index, open the PostgreSQL console using the [GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/postgresql.md) command `gdk psql` and run the command `\d <index_name>` to check that your newly created index exists.