Database indexing is a crucial aspect of efficient data retrieval in modern databases. It plays a significant role in optimizing query performance and speeding up data retrieval operations.
In this tutorial, we’ll delve into the world of database indexing, exploring its inner workings, benefits, and limitations.
2.1. SQL Databases
To understand indexing, let’s first take a quick look at databases. They are an organized way to reliably insert, store, update, retrieve and delete data. Historically, SQL databases have been the first, and they have stood the test of time as they are versatile and robust. The way they work is that information about all kinds of entities is stored in tables, each consisting of rows and columns:
2.2. Table Structure and Columns
In a table, data is organized into rows, with each row representing a record or entry. Columns, on the other hand, define the various attributes or properties of these records. To illustrate this, let’s consider a simple table with three columns: ID, e-mail, and name. Here’s an example:
For our purposes, a column can be one of 3 types:
- Primary key – only one per table, it identifies each entry of the table through a unique value
- Unique column – zero, one, or multiple per table, it defines an attribute that happens to be unique for each row
- Non-unique column – zero, one or multiple per table, it defines an attribute that can have duplicates
Thus, each user is identified by his ID, has an e-mail address that is unique to him, and also has a name that other users can happen to share.
Let’s talk about indexing. When we index a column, we make searches faster, at the cost of administering an additional data structure behind the scenes.
This makes all searches significantly faster on large volumes of data.
3.1. Searching Through Non-Unique Columns
Let’s say we want to find users named “Michael”. Since multiple users can have this name, we need to go through all rows in the table in order to guarantee we find all of them.
If there are rows, we’ll need to go through of them, on average, to find our answer.
Here’s a graph showing how fast we’ll find our item in a standard column:
3.2. Searching Through Unique Columns
Let’s say we want to find the user with the “[email protected]” e-mail address. Since the e-mail is supposed to be unique, we can stop the search as soon as we find the row with the desired value.
If there are rows, we need to go through of them, on average, to find our answer.
Here’s a graph showing how fast we’ll find our item in a unique column:
3.3. Searching Through the Primary Key
Let’s say we want to find a user with an ID equal to 4291. Simply because of the fact that the ID is a primary key, we are guaranteed significantly fewer search steps.
In fact, if there are rows, we’ll go through of them, on average, to find our answer.
Finally, here’s a graph showing the performance of search with a primary key column:
In order to understand the magnitudes of these 3 scenarios, we can take a look at the following table:
Not only is the version always faster than the alternatives, the difference between it and the other 2 becomes astronomical for large database sizes. This could be the difference between waiting seconds for a query instead of a couple of hours.
3.4. Searching Through Indexed Columns
The idea of indexing is to enable searches through specific columns as fast as searches through the primary key.
Let’s say checking if an e-mail address exists in our database is an operation we do really often, we should then optimize it.
We don’t want to replace our primary key ID with the e-mail column as we want to allow users to change their e-mail addresses without losing their identity.
In this situation, the solution is creating a database index on the e-mail column.
The following table shows the link between the average number of checks we need to make and the column type used in the query:
In this section, we’ll go over a couple of performance considerations that should be considered when deciding if an index is necessary for a specific column.
We’ll start from the following user table that initially has no indexes applied to it:
4.1. Indexing Multiple Columns
The primary purpose of indexing is vastly improved performance on database retrieval. There is, however, a trade-off that is made. While each query involving columns with indexes is way faster, each database insert, update or delete becomes slower. The more indexes we have, the more time it takes to modify entries, and more than that, each entry takes significantly more space.
Taking this into account, we could consider indexing the “Telephone” and “E-mail” columns, as these columns would be pretty useful at login time:
What we shouldn’t do is also index the “Name” and “Address” columns. This would be overkill, as finding a user fast by using his address or his name probably isn’t a realistic application need:
4.2. Indexing Columns with Low Cardinality
Columns with low cardinality, meaning they have a small number of distinct values compared to the total number of records, pose a challenge for indexing. In such cases, the index may not provide significant performance improvements as it would result in scanning a large portion of the table anyway. It’s important to assess the cardinality and uniqueness of columns before deciding whether to create an index.
As a result, indexing our table on the “Gender” column would be extremely ineffective, as it only has two possible values:
In this article, we discuss why database indexing is vital in optimizing query performance and enhancing the overall efficiency of data retrieval operations.
By creating efficient data structures and leveraging them during searches, indexing significantly reduces the time complexity of queries. It allows quick access to specific records, particularly in scenarios involving unique or primary key columns.