{"id":37992583,"url":"https://github.com/antonio-alexander/sql-blog-indexing","last_synced_at":"2026-01-16T18:44:43.639Z","repository":{"id":193802007,"uuid":"678170131","full_name":"antonio-alexander/sql-blog-indexing","owner":"antonio-alexander","description":"A repo showing a proof of concept for indexing and benchmarking in sql","archived":false,"fork":false,"pushed_at":"2023-10-08T01:17:00.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-26T02:49:04.500Z","etag":null,"topics":["benchmark","index","indexing","mysql","performance","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/antonio-alexander.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-08-13T23:05:16.000Z","updated_at":"2023-09-10T03:44:33.000Z","dependencies_parsed_at":"2023-11-10T16:05:30.487Z","dependency_job_id":"3384c883-8f5a-4a0e-8361-c84a66a434de","html_url":"https://github.com/antonio-alexander/sql-blog-indexing","commit_stats":{"total_commits":2,"total_committers":1,"mean_commits":2.0,"dds":0.0,"last_synced_commit":"5a2dd335fd3bed55f4aeb4e30910f958a9941466"},"previous_names":["antonio-alexander/sql-blog-indexing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/antonio-alexander/sql-blog-indexing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonio-alexander%2Fsql-blog-indexing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonio-alexander%2Fsql-blog-indexing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonio-alexander%2Fsql-blog-indexing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonio-alexander%2Fsql-blog-indexing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/antonio-alexander","download_url":"https://codeload.github.com/antonio-alexander/sql-blog-indexing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/antonio-alexander%2Fsql-blog-indexing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28481180,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","index","indexing","mysql","performance","sql"],"created_at":"2026-01-16T18:44:43.573Z","updated_at":"2026-01-16T18:44:43.624Z","avatar_url":"https://github.com/antonio-alexander.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# sql-blog-indexing (github.com/antonio-alexander/sql-blog-indexing)\n\nThe goal of this repository is to research how queries can become less performant with scale, how indexes can be used to reduce query times, how queries can be optimized and through this, how to structure queries better. This repository seeks to do the following:\n\n- How to use the explain command to understand the query plan\n- How to determine what indexes to create\n\nThis repository is meant to be a deep[er] dive into the \"Performance Considerations\" section in the [github.com/antonio-alexander/sql-blog-paging](github.com/antonio-alexander/sql-blog-paging) repository.\n\n## Bibliography\n\nThese links were used while putting together this article:\n\n- [https://dev.mysql.com/doc/refman/8.0/en/using-explain.html](https://dev.mysql.com/doc/refman/8.0/en/using-explain.html)\n- [https://dev.mysql.com/doc/refman/8.0/en/explain.html](https://dev.mysql.com/doc/refman/8.0/en/explain.html)\n- [https://github.com/datacharmer/test_db/tree/d01cb62fcfa4671773f167480df174d8ce4316c1](https://github.com/datacharmer/test_db/tree/d01cb62fcfa4671773f167480df174d8ce4316c1)\n- [https://dev.mysql.com/doc/refman/8.0/en/table-scan-avoidance.html](https://dev.mysql.com/doc/refman/8.0/en/table-scan-avoidance.html)\n- [https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/](https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/)\n- [https://dev.mysql.com/doc/refman/8.0/en/create-index.html](https://dev.mysql.com/doc/refman/8.0/en/create-index.html)\n- [https://stackoverflow.com/q/10524651](https://stackoverflow.com/q/10524651)\n- [https://www.w3schools.com/mysql/mysql_join_inner.asp](https://www.w3schools.com/mysql/mysql_join_inner.asp)\n- [https://dba.stackexchange.com/questions/74261/optimize-the-query-with-big-and-low-cardinality](https://dba.stackexchange.com/questions/74261/optimize-the-query-with-big-and-low-cardinality)\n\n## Getting-Started\n\nI think the flow of this repo will be different than other repos I've done because there isn't a core \"concept\" that needs to be _really_ broken down. The general goal of this repo is to attempt some scenarios and attempt to optimize them using explain. For each scenario I'll generate a schema a data set and a number of variations on the scenario to be able to compare and contrast between different kinds of optimizations.\n\nThese are the scenarios we're going to attempt to work through:\n\n1. How does querying against a table become affected by the presence or absence of indexes? 10 employees? 100 employees? 1000 employees? 10000 employees?\n2. How does the deleted column being added to the employees table affect queries that include a  constraint for deleted? 10 employees? 100 employees? 1000 employees? 10000 employees?\n3. For soft-deletes, what's the comparison on using 'WHERE deleted=true' when we're using a solution with one table versus a solution with two tables\n4. Comparison for migration of delete+recreate vs update in place in terms of time\n\nThings I'm curious about:\n\n- Is there any room to optimize inserts?\n- How do triggers affect mass inserts/mass updates?\n- Is there an overall difference in on-disk size if using varchar() vs char()?\n\nThere's a lot of ways to get the test database into the image, I want to give you the opportunity to figure out how to do it.\n\n## Explain and Indexes\n\nBefore we dive too deep into each scenario, I want to briefly describe both EXPLAIN and INDEX. EXPLAIN is a command you can use with any query to determine how the SQL server will go about determining the most efficient way to complete your query amongst a host of variables. An INDEX works hand-in-hand with query plans, indexes (sort of) pre-calculate some of the effort of a given query and if done correctly can significantly impact the query plan.\n\n\u003e I wanted to give some examples, but EXPLAIN and INDEX are a bit impossible out of context, so we'll get into the scenarios now-ish\n\n## Scenario 1: Adding Indexes to Optimize Queries with Constraints\n\nThis scenario uses the [https://github.com/datacharmer/test_db](https://github.com/datacharmer/test_db) test database as a starting point. We're going to do a benchmark of the database as-is and see what we can do to optimize the employees table for specific queries, especially at scale.\n\nA hard question is what query we can attempt to optimize; I have a little foreknowledge about how indexes work (from sql-blog-pagination) so I'll do some min/max work on some of the fields given the data set:\n\n- emp_no: 1001 - 499999 (300024)\n- birth_date: 1952-02-01 - 1965-02-01 (300024)\n- hire_date: 1984-01-01 - 2000-01-28 (300024)\n- gender: M(179973), F(120051)\n\nAlthough first_name and last_name can be searched (and indexed), there's not much useful context (at scale) for searching using either, especially at scale. I'm going to make a strong assumption that there's the MOST variation in hire_date as opposed to emp_no or birth_date; I think that more variation equals more to optimize.\n\nSo let's attempt to query for employees that have a birthdate AFTER the median hire_date (1992-01-01):\n\n```sql\nSELECT COUNT(*) FROM employees WHERE hire_date \u003e= '1992-01-01';\n```\n\nThis gives us 87049 different employees which should cover the scale we want to work with. So to start, let's do some basic benchmarking:\n\n```sql\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 10;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 100;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 1000;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 10000;\n```\n\nThese are the results of the earlier queries:\n\n- 10 employees: 0.010s\n- 100 employees: 0.010s\n- 1000 employees: 0.051s\n- 10000 employees: 0.071s\n\nLooking at the above, it's clear that it's not until we try to get more than 100 employees that the plan gets weird and it takes longer, maybe we can optimize those two middle queries and see if we can get those query times down.\n\n\u003c!-- THOSE ARE ROOKIE NUMBERS (gif) --\u003e\n\nAlso, I'll show the indexes that exist on the table:\n\n```sql\nSHOW INDEXES FROM employees;\n```\n\n```log\n+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+\n| Table     | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |\n+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+\n| employees |          0 | PRIMARY  |            1 | emp_no      | A         |      299335 |     NULL | NULL   |      | BTREE      |         |               | NO      |\n+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+\n```\n\nThis indicates that there's only a single index on emp_no (the primary key); with that done, let's take a look at the EXPLAINs for those two middle queries at 100 and 1000 employees.\n\n```sql\nEXPLAIN SELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 100;\n```\n\n```log\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299335 | Using where |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n```\n\nDecoding this EXPLAIN, we can determine the following:\n\n- it wasn't able to use any keys (note that possible_keys, key, key_len and ref are null)\n- to read 100 rows, we had to read 299335 (out of 300024)\n- the type is \"ALL\" which generally communicates a table scan meaning that its fastest to just scan the entire table (than doing ANYTHING else)\n\n```sql\nEXPLAIN SELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 1000;\n```\n\n```log\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299335 | Using where |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n```\n\nThis explain is pretty much the same as the above (except it's 10x the data). So, I think we can optimize queries against hire_date by adding an index:\n\n```sql\nCREATE INDEX idx_hire_date ON employees (hire_date);\n```\n\n\u003e Keep in mind that we're creating an index to optimize queries rather than a constraint (e.g. a unique index for a primary key)\n\nCreating an index takes time (in this case it took 1.936s) and increases the on-disk size of the table; hopefully an OK price to optimize the queries. Let's try to benchmark again and see what the results are:\n\n```sql\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 10;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 100;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 1000;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 10000;\n```\n\nThese are the results of the earlier queries:\n\n- 10 employees: 0.004s (vs 0.010s)\n- 100 employees: 0.007s (vs 0.010s)\n- 1000 employees: 0.030s (vs 0.051s)\n- 10000 employees: 0.115s (vs 0.071s)\n\nIn comparison to our earlier benchmark, we see a significant decrease in query time, we're talking orders of magnitude difference such that in the time we previously could only query 10 employees, now we can query 10000 employees. Now the fun part, let's re-examine the explain to see if there are any noticeable differences:\n\n```sql\nEXPLAIN SELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' LIMIT 1000;\n```\n\n```log\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+-----------------------+\n| id   | select_type | table     | type  | possible_keys | key           | key_len | ref  | rows   | Extra                 |\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+-----------------------+\n|    1 | SIMPLE      | employees | range | idx_hire_date | idx_hire_date | 3       | NULL | 149667 | Using index condition |\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+-----------------------+\n```\n\nNow this is significantly more interesting, we can see that it's no longer using 'ALL' (so no table scan) and that it has an index it can use. We also see that the number of rows it had to scan through was halved (149667 vs 299335). To put this in context, if we put together this database and found that a lot of the queries that were run against the database involved hire_date, we could add this index and significantly reduce the query time.\n\nIn short, for this specific query, adding an index would significantly reduce the query time __for this specific query__ OTHER queries may not be able to take advantage of the index or may have some ability to take advantage of the index, SQL could also find that a table scan is the fastest way as well.\n\n## Scenario 2: How Do Logical Deletes Affect Queries with Constraints\n\n\u003c!-- How does the deleted column being added to the employees table affect queries that include a  constraint for deleted? 10 employees? 100 employees? 1000 employees? 10000 employees? --\u003e\n\nWhen I implemented [sql-blog-soft-deletes](github.com/antonio-alexander/sql-blog-soft-deletes); there was a question I had that didn't make sense to ask (or answer) within that effort: I wanted to know how performance was affected when adding the deleted column as a query and if using a separate table was more performant. I'm pretty sure using the table is more performant, but I wanted to quantify it (and at scale).\n\nBecause I need the data set to prove either at scale, we'll be modifying the database post [scenario 1](#scenario-1-adding-indexes-to-optimize-queries-with-constraints). We'll do the following to attempt to prove out this scenario:\n\n1. Modify the existing table to add the deleted column to gather some benchmarks\n2. Attempt to optimize attempts to query those deleted (or un-deleted) employees\n3. Create a new table to copy those \"deleted\" employees\n4. Attempt to optimize attempts to query those deleted (or un-deleted) employees using the different table\n\n\u003e Keep in mind that there are strong architectural implications with regards to data consistency when performing a logic delete using an alternate table. Things such as enforcing foreign key constraints, handling auditing and triggers. It's always easier if you leave data consistency up to SQL\n\nTo add the deleted column to the existing employees database, we'll need to enter the following sql:\n\n```sql\nALTER TABLE employees ADD COLUMN deleted bool DEFAULT false;\n```\n\nWe're also going to simulate deleting \"half\" of the employees, but we want to do this in a way that does an unbiased distribution, so we don't have to worry about anything weird when we attempt to add indexes/optimize. We'll do this by deleting employees that have an even employee number. I found this on [https://tableplus.com/blog/2019/09/select-rows-odd-even-value.html](https://tableplus.com/blog/2019/09/select-rows-odd-even-value.html):\n\n```sql\nUPDATE employees SET deleted=true WHERE mod(emp_no,2) = 0;\n```\n\nThis statement should take a while to complete (a couple of seconds); but once it's done, we should have a relative even distribution of \"deleted\" employees. We can confirm the following after we set that command, that out of 300,024 employees, we've deleted 150,012 of them. So here are the queries we want to attempt to optimize:\n\n```sql\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE deleted=true LIMIT 10;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE deleted=true LIMIT 100;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE deleted=true LIMIT 1000;\nSELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE deleted=true LIMIT 10000;\n```\n\nThese are the results of those queries:\n\n- 10 employees: .09s - .020s\n- 100 employees: .007s - .052s\n- 1000 employees: .016s - .070s\n- 10000 employees: .054s - .092s\n\n\u003e Keep in mind that although i'm doing benchmarks, there are __tons__ of variables that affect how long a query takes that have little to do with sql but more to do with available RAM, disk speed (e.g. if its SSD or spinny) etc...\n\nIn comparison to the benchmarks we did earlier, there was significantly more variation in the queries, so I've given a range instead of a single value. Before we start modifying the table, let's do an EXPLAIN on the query with 1000 employees:\n\n```sql\nEXPLAIN SELECT emp_no, birth_date, first_name, last_name, gender, hire_date FROM employees WHERE hire_date \u003e= '1992-01-01' AND deleted=true LIMIT 1000;\n```\n\n```log\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+------------------------------------+\n| id   | select_type | table     | type  | possible_keys | key           | key_len | ref  | rows   | Extra                              |\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+------------------------------------+\n|    1 | SIMPLE      | employees | range | idx_hire_date | idx_hire_date | 3       | NULL | 149667 | Using index condition; Using where |\n+------+-------------+-----------+-------+---------------+---------------+---------+------+--------+------------------------------------+\n```\n\nSo...let's start with the obvious first step (which I don't think will make a difference). Let's add an index to the deleted column, and see if it changes any of the results:\n\n```sql\nCREATE INDEX idx_deleted ON employees(deleted);\n```\n\nLet's attempt to run the same queries we ran earlier:\n\n- 10 employees: .008s - .017s (.09s - .020s)\n- 100 employees: .011s - .019s (.007s - .052s)\n- 1000 employees: .037s - .082s (.016s - .070s)\n- 10000 employees: .094s - .231s (.054s - .092s)\n\nIn comparison to the previous benchmarks, there is what seems like improvement, but the range is still all over the place. There's not a clear and consistent change in the query time and at a glance, I'll concede that adding a index to the deleted column did not improve our query.\n\nThere's a [stack overflow discussion](https://stackoverflow.com/q/10524651) that I think is super useful, but the gist is that boolean columns (in general) have _very low_ cardinality; this means that in a given data set there's not a huge variation of boolean values so adding an index DOES have benefits, but is an exercise in diminishing returns. The exercise I did early in [scenario 1](#scenario-1-adding-indexes-to-optimize-queries-with-constraints) was to try to show the min/max of the columns; this (kind of) communicated the level of cardinality available within each of those columns.\n\n\u003e You can also think that indexes use binary trees (at least in sql) and that with a boolean column, the binary tree is __INCREDIBLY__ shallow...we can call it a binary shrub: valid values for the column are true/false/null\n\nBut we have an alternate solution (suggested by the caveman DBA): we can create another table that mirrors employees, but instead of holding \"employees\", it'll hold \"deleted \"employees\" which don't exist in the other table. This will be more of a \"proof of concept\" because we're not going to attempt to solve the problem with foreign key constraints.\n\nTo create this other table, we can execute the following:\n\n```sql\nCREATE TABLE employees_deleted (\n    emp_no INT NOT NULL,\n    FOREIGN KEY (emp_no) REFERENCES employees(emp_no)\n);\n```\n\nWe can populate this table with deleted employees (from earlier) by entering the following:\n\n```sql\nINSERT INTO employees_deleted\n    SELECT emp_no FROM employees WHERE deleted=true;\n```\n\nAlternatively, if we wanted to do this from scratch (without adding a deleted column), we could do the following to _delete_ every other user:\n\n```sql\nINSERT INTO employees_deleted\n    SELECT emp_no from employees WHERE mod(emp_no,2) = 0;\n```\n\nWe can semi-confirm that this has been successful with the following queries:\n\n```sql\nSELECT count(*) FROM employees WHERE deleted=true;\nSELECT count(*) FROM employees_deleted;\n```\n\nBoth queries should have a count of 150,012 rows. Now that we have a table of deleted employees and a table of employees, we can query the _deleted_ employees by doing a join on the two tables:\n\n```sql\nSELECT employees.emp_no, employees.birth_date, employees.first_name, employees.last_name, employees.gender, employees.hire_date FROM employees INNER JOIN employees_deleted ON employees.emp_no = employees_deleted.emp_no LIMIT 10;\n```\n\nInner joins aren't special, but we may be able to optimize this because there's an __exploitable__ cardinality that we can get from emp_no that we couldn't get from a boolean column.\n\n```sql\nSELECT employees.emp_no, employees.birth_date, employees.first_name, employees.last_name, employees.gender, employees.hire_date FROM employees INNER JOIN employees_deleted ON employees.emp_no = employees_deleted.emp_no LIMIT 10;\nSELECT employees.emp_no, employees.birth_date, employees.first_name, employees.last_name, employees.gender, employees.hire_date FROM employees INNER JOIN employees_deleted ON employees.emp_no = employees_deleted.emp_no LIMIT 100;\nSELECT employees.emp_no, employees.birth_date, employees.first_name, employees.last_name, employees.gender, employees.hire_date FROM employees INNER JOIN employees_deleted ON employees.emp_no = employees_deleted.emp_no LIMIT 1000;\nSELECT employees.emp_no, employees.birth_date, employees.first_name, employees.last_name, employees.gender, employees.hire_date FROM employees INNER JOIN employees_deleted ON employees.emp_no = employees_deleted.emp_no LIMIT 10000;\n```\n\nThese are the results of those queries:\n\n- 10 employees: .008s - .016s (.008s - .017s)\n- 100 employees: .011s - .017s (.011s - .019s)\n- 1000 employees: .030s - .046s (.037s - .082s)\n- 10000 employees: .143s - .165s (.094s - .231s)\n\nIn my opinion, there's not a huge difference from the previous queries in terms of overall timing, but I do think there was more precision in the queries than before. An EXPLAIN gives us the following:\n\n```log\n+------+-------------+-------------------+--------+---------------+---------+---------+------------------------------------+--------+-------------+\n| id   | select_type | table             | type   | possible_keys | key     | key_len | ref                                | rows   | Extra       |\n+------+-------------+-------------------+--------+---------------+---------+---------+------------------------------------+--------+-------------+\n|    1 | SIMPLE      | employees_deleted | index  | emp_no        | emp_no  | 4       | NULL                               | 150306 | Using index |\n|    1 | SIMPLE      | employees         | eq_ref | PRIMARY       | PRIMARY | 4       | employees.employees_deleted.emp_no | 1      |             |\n+------+-------------+-------------------+--------+---------------+---------+---------+------------------------------------+--------+-------------+\n```\n\nIt feels like there's more room for optimization or there's something _else_ I don't know about, but I think this is good enough to communicate that there's a certain difficulty associated with optimizing queries involving columns with very low cardinality.\n\n## Scenario 3: How Can You Optimize a Table for Pagination?\n\nThis scenario is _almost_ a copy+paste from [github.com/antonio-alexander/sql-blog-paging](github.com/antonio-alexander/sql-blog-paging). To briefly summarize: for large data sets, it's not always practical to query the entire data set, so instead you query a subset of the data or more functionally, a page. A page, for all intents and purposes, is a consistent query you can make for the _same_ data set. You can think of a bank statement that shows you all of your transactions for a certain criteria and if its over a certain number of transactions, instead of giving you every transaction it'll give you the first page of transactions.\n\nThe purpose of this scenario is to try to optimize those queries; we're going to start from a state of zero indexes. Similar to other scenarios, we have a handful of columns with reasonable [cardinality](https://en.wikipedia.org/wiki/Cardinality_(SQL_statements)) such that there's room to _actually_ optimize.\n\nThe queries for pagination come in a handful of forms, so we'll start with the simple (...well bad) queries to give us a starting point to what's happening:\n\n```sql\nEXPLAIN SELECT * FROM employees LIMIT 10;\nEXPLAIN SELECT * FROM employees;\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT * FROM employees LIMIT 10;\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299246 |       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n1 row in set (0.001 sec)\n\nMariaDB [employees]\u003e EXPLAIN SELECT * FROM employees;\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299246 |       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------+\n1 row in set (0.001 sec)\n```\n\nUnsurprisingly, the explain indicates the queries are totally un-optimized: there were no possible keys to use and it had to read almost every row (299246 of 300024) to perform each query. Alterantively, this is an un-optimized query because the type is ALL indicating that a table scan was the fastest way to query the data. An ideal result (ignoring the time to complete the query) would be that the query with the limit ONLY had to read 10 rows and would use a type other than ALL.\n\nThe below query attempts to approximate offset/limit based pagination (we're just querying employee number to nudge our explain in the right direction)\n\n```sql\nEXPLAIN SELECT emp_no FROM employees LIMIT 10;\nEXPLAIN SELECT emp_no FROM employees LIMIT 10 OFFSET 10;\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees LIMIT 10;\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | index | NULL          | PRIMARY | 4       | NULL | 299246 | Using index |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n1 row in set (0.001 sec)\n\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees LIMIT 10 OFFSET 10;\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | index | NULL          | PRIMARY | 4       | NULL | 299246 | Using index |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+-------------+\n1 row in set (0.001 sec)\n```\n\nThese queries are a bit more specific, the \"nudge\" we gave is by using emp_no (the primary key) as criteria which allows use of index associated with the primary key; still we read almost all of the rows, but the type is not ALL but index. This is a better query and should generally be more performant, especially at scale. The queries we did above are based on offset/limit based criteria while the next will be more cursor based pagination:\n\n```sql\nEXPLAIN SELECT emp_no FROM employees WHERE emp_no \u003e 10010 LIMIT 10;\nEXPLAIN SELECT emp_no FROM employees WHERE emp_no \u003c 10010 LIMIT 10;\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees WHERE emp_no \u003e 10010 LIMIT 10;\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+--------------------------+\n| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows   | Extra                    |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+--------------------------+\n|    1 | SIMPLE      | employees | range | PRIMARY       | PRIMARY | 4       | NULL | 149623 | Using where; Using index |\n+------+-------------+-----------+-------+---------------+---------+---------+------+--------+--------------------------+\n1 row in set (0.001 sec)\n\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees WHERE emp_no \u003c 10010 LIMIT 10;\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+\n| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+\n|    1 | SIMPLE      | employees | range | PRIMARY       | PRIMARY | 4       | NULL | 9    | Using where; Using index |\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+\n1 row in set (0.001 sec)\n```\n\nThese queries still use LIMIT, but don't use OFFSET and there's a distinct difference to the EXPLAINs: you'll see that the type is different (range) and more interesting, the number of rows is now significantly reduced (to both 149623 and 9 respectively).\n\nAlthough there's a big difference in performance, I'll ruin the success in noting that calculating the cursors (knowing the start emp_no) requires an additional query which could kill your performance in aggregate. Let's say that our data set is 10 pages of 100 employees each sorted by employee number, we can pre-calculate the cursors (10 pages of 100 employees each) for each page with the following query:\n\n```sql\nSET @a = 0;\nSELECT emp_no FROM employees WHERE emp_no \u003e 10100 AND ((@a := @a + 1) % 100 = 0 OR @a %100 = 99) LIMIT 10;\n```\n\n```log\nMariaDB [employees]\u003e SELECT emp_no FROM employees WHERE emp_no \u003e 10100 AND ((@a := @a + 1) % 100 = 0 OR @a %100 = 99) LIMIT 10;\n+--------+\n| emp_no |\n+--------+\n|  10199 |\n|  10200 |\n|  10299 |\n|  10300 |\n|  10399 |\n|  10400 |\n|  10499 |\n|  10500 |\n|  10599 |\n|  10600 |\n+--------+\n10 rows in set (0.001 sec)\n```\n\nFrom this query we can determine that the first page is from emp_no 10100 to 10199 so forth and so on.\n\n\u003e This query is very safe to copy+paste because it is a complete solution for situations where the index of the page (e.g., emp_no) isn't consecutive; this handles the situation where a row has been deleted\n\nOnce we have the cursors, we can then query a specific page (this would be page 9):\n\n```sql\nEXPLAIN SELECT emp_no, first_name, last_name, gender FROM employees WHERE emp_no BETWEEN 10500 AND 10599 ORDER BY emp_no ASC;\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no, first_name, last_name, gender FROM employees WHERE emp_no BETWEEN 10500 AND 10599 ORDER BY emp_no ASC;\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+\n| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows | Extra       |\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+\n|    1 | SIMPLE      | employees | range | PRIMARY       | PRIMARY | 4       | NULL | 100  | Using where |\n+------+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+\n1 row in set (0.001 sec)\n```\n\nAs you can see above, this seems relatively optimized, the number of rows read is equal to the number of rows we're interested in and the type is not _ALL_. This technique can work for different kinds of pages ordered differently as long as you index by a unique column (using a non-unique column is possible, but you most likely won't get a consistent number of items).\n\n## Scenario 4: How to Optimize Columns With Low Cardinality\n\nThis scenario will be very extreme, but it's an attempt to try to solve the problem when you have a column with significantly low cardinality. Data types with low cardinality are things like enums or booleans which have very few unique values. With our sample data set, the column with the least amount of cardinality is the gender column (M or F).\n\nLet's try to do an EXPLAIN on the following queries:\n\n```sql\nEXPLAIN SELECT emp_no FROM employees WHERE gender ='M';\nEXPLAIN SELECT emp_no FROM employees WHERE gender ='F';\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees WHERE gender ='M';\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299246 | Using where |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n1 row in set (0.001 sec)\n\nMariaDB [employees]\u003e EXPLAIN SELECT emp_no FROM employees WHERE gender ='F';\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n| id   | select_type | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n|    1 | SIMPLE      | employees | ALL  | NULL          | NULL | NULL    | NULL | 299246 | Using where |\n+------+-------------+-----------+------+---------------+------+---------+------+--------+-------------+\n```\n\nFrom the above EXPLAINs, we see that a table scan (ALL) was the fastest way to complete this query. If we run the queries themselves, these are the times we get (on average):\n\n```sql\nSELECT emp_no, first_name, last_name, gender FROM employees WHERE gender = 'F';\nSELECT emp_no, first_name, last_name, gender FROM employees WHERE gender = 'M';\n```\n\n- Male: 0.301s - 0.319s\n- Female: 0.278s - 0.355s\n\nThis isn't terrible (in general), it's probably really great given the circumstances, but maybe there's a way we can optimize it using JOINs. What if we created two new tables, employee_male and employee_female (with a single column):\n\n```sql\nCREATE TABLE employees_male (\n    emp_no INT NOT NULL,\n    FOREIGN KEY (emp_no) REFERENCES employees(emp_no)\n);\n\nCREATE TABLE employees_female (\n    emp_no INT NOT NULL,\n    FOREIGN KEY (emp_no) REFERENCES employees(emp_no)\n);\n\nINSERT INTO employees_male SELECT emp_no FROM employees WHERE gender = 'M';\nINSERT INTO employees_female SELECT emp_no FROM employees WHERE gender = 'F';\n```\n\nWith these new tables, we should be able to query for our male (or female) employees by doing a RIGHT JOIN:\n\n```sql\nEXPLAIN SELECT e.emp_no, first_name, last_name, gender FROM employees AS e RIGHT JOIN employees_male AS em ON e.emp_no = em.emp_no;\n```\n\n```log\nMariaDB [employees]\u003e EXPLAIN SELECT e.emp_no, first_name, last_name, gender FROM employees AS e RIGHT JOIN employees_male AS em ON e.emp_no = em.emp_no;\n+------+-------------+-------+--------+---------------+---------+---------+---------------------+--------+-------------+\n| id   | select_type | table | type   | possible_keys | key     | key_len | ref                 | rows   | Extra       |\n+------+-------------+-------+--------+---------------+---------+---------+---------------------+--------+-------------+\n|    1 | SIMPLE      | em    | index  | NULL          | emp_no  | 4       | NULL                | 180154 | Using index |\n|    1 | SIMPLE      | e     | eq_ref | PRIMARY       | PRIMARY | 4       | employees.em.emp_no | 1      |             |\n+------+-------------+-------+--------+---------------+---------+---------+---------------------+--------+-------------+\n```\n\nThis explain, while not great, is definitely a bit more promising, let's skip to how long it takes to do the queries we did earlier:\n\n```sql\nSELECT e.emp_no, first_name, last_name, gender FROM employees AS e RIGHT JOIN employees_male AS em ON e.emp_no = em.emp_no;\nSELECT e.emp_no, first_name, last_name, gender FROM employees AS e RIGHT JOIN employees_female AS ef ON e.emp_no = ef.emp_no;\n```\n\nThese are the results of those queries:\n\n- Male: 1.090s - 1.102s (vs 0.301s - 0.319s)\n- Female: 0.743s - 0.759s (vs 0.278s - 0.355s)\n\nThis is a bit unexpected (for me) and there may still be some situations where this kind of setup makes sense, but in general, what it means is that given the query, using a WHERE constraint is the fastest way.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantonio-alexander%2Fsql-blog-indexing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fantonio-alexander%2Fsql-blog-indexing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fantonio-alexander%2Fsql-blog-indexing/lists"}