1. Introduction

Database indexing is a crucial aspect of efficient data retrieval in modern databases. It plays a significant role in optimizing query performance and speeding up data retrieval operations.

In this tutorial, we’ll delve into the world of database indexing, exploring its inner workings, benefits, and limitations.

2. Databases

2.1. SQL Databases

To understand indexing, let’s first take a quick look at databases. They are an organized way to reliably insert, store, update, retrieve and delete data. Historically, SQL databases have been the first, and they have stood the test of time as they are versatile and robust. The way they work is that information about all kinds of entities is stored in tables, each consisting of rows and columns:

2.2. Table Structure and Columns

In a table, data is organized into rows, with each row representing a record or entry. Columns, on the other hand, define the various attributes or properties of these records. To illustrate this, let’s consider a simple table with three columns: ID, e-mail, and name. Here’s an example:

Primary Key Unique Column Non-unique Column
ID email name
1 [email protected] Michael
2 [email protected] Bob
3 [email protected] Michael

For our purposes, a column can be one of 3 types:

  • Primary key – only one per table, it identifies each entry of the table through a unique value
  • Unique column –  zero, one, or multiple per table, it defines an attribute that happens to be unique for each row
  • Non-unique column – zero, one or multiple per table, it defines an attribute that can have duplicates

Thus, each user is identified by his ID, has an e-mail address that is unique to him, and also has a name that other users can happen to share.

3. Indexing

Let’s talk about indexing. When we index a column, we make searches faster, at the cost of administering an additional data structure behind the scenes.

This makes all searches significantly faster on large volumes of data.

3.1. Searching Through Non-Unique Columns

Let’s say we want to find users named “Michael”. Since multiple users can have this name, we need to go through all rows in the table in order to guarantee we find all of them.

If there are $n$ rows, we’ll need to go through $O(n)$ of them, on average, to find our answer.

Here’s a graph showing how fast we’ll find our item in a standard column:

3.2. Searching Through Unique Columns

Let’s say we want to find the user with the “[email protected]” e-mail address. Since the e-mail is supposed to be unique, we can stop the search as soon as we find the row with the desired value.

If there are $n$ rows, we need to go through $O(\frac{n + 1}{2})$ of them, on average, to find our answer.

Here’s a graph showing how fast we’ll find our item in a unique column:

3.3. Searching Through the Primary Key

Let’s say we want to find a user with an ID equal to 4291. Simply because of the fact that the ID is a primary key, we are guaranteed significantly fewer search steps.

In fact, if there are $n$ rows, we’ll go through $O(log(n))$ of them, on average, to find our answer.

Finally, here’s a graph showing the performance of search with a primary key column:

To understand the magnitudes of these 3 scenarios, we can take a look at the following table:

Size of table (entries) O(n) O((n+1)/2) O(log(n))
10 10 6 4
1000 1000 501 10
100000 100000 50001 17
10000000 10000000 5000001 24

Not only is the O(log(n)) version always faster than the alternatives, but the difference between it and the other 2 becomes astronomical for large database sizes. This could be the difference between waiting seconds for a query instead of a couple of hours.

3.4. Searching Through Indexed Columns

The idea of indexing is to enable searches through specific columns as fast as searches through the primary key.

Let’s say checking if an e-mail address exists in our database is an operation we do really often, we should then optimize it.

We don’t want to replace our primary key ID with the e-mail column as we want to allow users to change their e-mail addresses without losing their identity.

In this situation, the solution is creating a database index on the e-mail column.

The following table shows the link between the average number of checks we need to make and the column type used in the query:

Size of table (entries) Column Unique column Primary Key / Index
10 10 6 4
1000 1000 501 10
100000 100000 50001 17
10000000 10000000 5000001 24

4. Performance

In this section, we’ll go over a couple of performance considerations that should be considered when deciding if an index is necessary for a specific column.

We’ll start from the following user table that initially has no indexes applied to it:

ID Name Telephone E-mail Address Gender
1 John Doe 555-1234 [email protected] 123 Main St Male
2 Jane Smith 555-5678 [email protected] 456 Elm St Female
3 Alex Johnson 555-9876 [email protected] 789 Oak Ave Male
4 Lisa Davis 555-4321 [email protected] 321 Pine St Female
5 Mark Wilson 555-2468 [email protected] 654 Cedar Rd Male
6 Emily Brown 555-8642 [email protected] 987 Maple Ln Female

 

4.1. Indexing Multiple Columns

The primary purpose of indexing is vastly improved performance on database retrieval. There is, however, a trade-off that is made. While each query involving columns with indexes is way faster, each database insert, update or delete becomes slower. The more indexes we have, the more time it takes to modify entries, and more than that, each entry takes significantly more space.

Taking this into account, we could consider indexing the “Telephone” and “E-mail” columns, as these columns would be pretty useful at login time:

ID Name Telephone E-mail Address Gender
1 John Doe 555-1234 [email protected] 123 Main St Male
2 Jane Smith 555-5678 [email protected] 456 Elm St Female
3 Alex Johnson 555-9876 [email protected] 789 Oak Ave Male
4 Lisa Davis 555-4321 [email protected] 321 Pine St Female
5 Mark Wilson 555-2468 [email protected] 654 Cedar Rd Male
6 Emily Brown 555-8642 [email protected] 987 Maple Ln Female

 

What we shouldn’t do is also index the “Name” and “Address” columns. This would be overkill, as finding a user fast by using his address or his name probably isn’t a realistic application need:

ID Name Telephone E-mail Address Gender
1 John Doe 555-1234 [email protected] 123 Main St Male
2 Jane Smith 555-5678 [email protected] 456 Elm St Female
3 Alex Johnson 555-9876 [email protected] 789 Oak Ave Male
4 Lisa Davis 555-4321 [email protected] 321 Pine St Female
5 Mark Wilson 555-2468 [email protected] 654 Cedar Rd Male
6 Emily Brown 555-8642 [email protected] 987 Maple Ln Female

 

4.2. Indexing Columns with Low Cardinality

Columns with low cardinality, meaning they have a small number of distinct values compared to the total number of records, pose a challenge for indexing. In such cases, the index may not provide significant performance improvements as it would result in scanning a large portion of the table anyway. It’s important to assess the cardinality and uniqueness of columns before deciding whether to create an index.

As a result, indexing our table on the “Gender” column would be extremely ineffective, as it only has two possible values:

ID Name Telephone E-mail Address Gender
1 John Doe 555-1234 [email protected] 123 Main St Male
2 Jane Smith 555-5678 [email protected] 456 Elm St Female
3 Alex Johnson 555-9876 [email protected] 789 Oak Ave Male
4 Lisa Davis 555-4321 [email protected] 321 Pine St Female
5 Mark Wilson 555-2468 [email protected] 654 Cedar Rd Male
6 Emily Brown 555-8642 [email protected] 987 Maple Ln Female

5. Conclusion

In this article, we discuss why database indexing is vital in optimizing query performance and enhancing the overall efficiency of data retrieval operations.

By creating efficient data structures and leveraging them during searches, indexing significantly reduces the time complexity of queries. It allows quick access to specific records, particularly in scenarios involving unique or primary key columns.

Comments are open for 30 days after publishing a post. For any issues past this date, use the Contact form on the site.