The Relationship Between Primary Keys and Clustered Indexes Baeldung on SQL

1. Introduction

In relational database management systems (RDBMS), primary keys and clustered indexes are the foundation of data organization and database performance. Understanding their relationships and limitations is indispensable in designing efficient and scalable databases.

In this tutorial, we’ll explore the concept of clustered indexes, their relation to and differences from primary keys, and examine best practices and use cases.

2. Understanding Clustered Indexes

Unlike primary keys that refer to a field (column) or set of fields in a database table that uniquely identifies each record, clustered indexes seek instead to impose an order on how data values or records are stored in a table. Clustered indexing is an indexing technique that organizes the data rows in a table based on an already specified clustered index key. The index definition includes the clustered index keys as columns.

2.1. Mechanism

Clustered indexes use a Balanced Tree (B-tree) data structure:

A root node containing pointers to intermediate nodes
Intermediate nodes that guide the search to the appropriate leaf node
Leaf nodes which contain the actual data rows in the order of the clustered index key

The database uses the B-tree structure during each search to locate the appropriate leaf nodes containing the requested data. We can insert new rows into the proper physical location to maintain the order of the index.

Updating the clustered index key often necessitates physically moving the row to preserve its order. However, updates to non-key columns do not affect the clustered index. Deleting a row from the table removes that row from the clustered index structure and reorganizes the index if needed.

2.2. Clustered vs. Non-clustered Indexes

Databases group index architecture into clustered and non-clustered types. A clustered index directly integrates the data into the index structure, whereas a non-clustered index stores pointers (called row locators) to the data rows.

In non-clustered tables, we store table rows in an unordered structure called a heap. The row locator, therefore, points to the row for the heap, while for a clustered table, the row locator is the clustered index key.

3. Relationship Between Primary Keys and Clustered Indexes

For a database table, we can create a clustered index when defining a table or add it later by explicitly defining it.

3.1. Default Association

When we create a primary key in most RDBMS, the database automatically creates a clustered index on the primary key column. It could be a simple or composite primary key unless a clustered index exists or is otherwise specified.

The default behavior simplifies database setup because the primary key is frequently searched in reference columns.

Let’s consider a relevant example for the Baeldung University database schema:

We can create a new table, Classroom, that stores details about classrooms used for teaching:

CREATE TABLE Classroom (
  id INT PRIMARY KEY,
  room_number VARCHAR(50),
  building_name VARCHAR(100),
  capacity INT
);

After table creation, the database automatically creates a clustered index on the id column:

IndexName	IndexType	ColumnName
PK__Classroo__3213E83FEDFB26F2	CLUSTERED	id

A unique index name was assigned to the clustered index in the Classroom table.

3.2. Custom-Created Index

If the primary key is not the most logical for clustering, we can define a clustered index on a different column. A relevant example is the Library table, where, despite the book_id column being the primary key uniquely identifying each row, it is inefficient since the table is frequently queried on the borrowed_date column:

CREATE TABLE Library (
  book_id INT PRIMARY KEY,
  book_title VARCHAR(255),
  author VARCHAR(255),
  isbn VARCHAR(20),
  available_copies INT,
  borrowed_date DATE
);
CREATE CLUSTERED INDEX IX_borrowed_date ON Library (borrowed_date);

Because a table can only contain one clustered index, when the table was created initially, the primary key doubled as the clustered index. However, to recreate a clustered index on borrowed_date, the clustered index on the book_id is dropped, and the clustered index pointer is moved from the book_id to the borrowed_date.

3.3. Primary Key With Non-clustered Index

As an alternative to the default association, we can define the primary key to use a non-clustered index:

CREATE TABLE Alumni (
  alumni_id INT PRIMARY KEY NONCLUSTERED,
  student_id INT, 
  employment_status VARCHAR(50),
  employer VARCHAR(255),
  position VARCHAR(100), 
  graduation_year INT
);

The primary key is enforced with a non-clustered index, thereby leaving the clustered index available for any other columns that can be specified later.

3.4. Clustered Index on a Composite Key

A clustered index is created on the associated columns in a Mentorship table defined by a composite primary key:

CREATE TABLE Mentorship (
  mentor_id INT,
  student_id INT, 
  start_date DATE,
  end_date DATE, 
  status VARCHAR(50),
  PRIMARY KEY (mentor_id, student_id)
);

By default, the composite primary key (mentor_id, student_id) will have a clustered index:

IndexName	IndexType	ColumnName
PK__Mentorsh__57714E9AA3CC3EFA	CLUSTERED	mentor_id
PK__Mentorsh__57714E9AA3CC3EFA	CLUSTERED	student_id

We created a clustered index on the primary key, which includes the mentor_id and student_id columns. Since these columns form the primary key, the index name is the same for both columns.

4. Practical Insights and Best Practices

The relationship between primary keys and clustered indexes significantly influences query performance, especially in large datasets. Clustered indexes improve searching efficiency for student_id by aligning physical data storage with the order imposed by the primary key, eliminating the need to scan unordered rows.

In addition, clustered indexes support queries that involve ranges such as BETWEEN or >= without additional sorting. Furthermore, storing data rows in order allows us to read fewer pages for queries, which reduces disk I/O operations.

Finally, primary keys and clustered indexes have divergent behaviors across SQL dialects:

Feature	MySQL	PostgreSQL	SQL Server
Primary Key Behavior	It automatically creates a clustered index unless it is an InnoDB table without indexes.	It consistently enforces uniqueness but does not automatically create a clustered index.	The database automatically creates a clustered index unless specified otherwise.
Clustered Index Default	The primary key can be defined on different columns in a table.	Not automatically linked to the primary key; must be explicitly defined	Defaults to the primary key if no other clustered index exists
Index Maintenance	Requires manual optimization for fragmentation	Provides VACUUM and REINDEX commands to maintain index health	Supports automated and manual index rebuilding

These differences highlight the importance of understanding the nuances of the database platform when designing tables and indexes.

4.1. Challenges and Limitations

The foremost limitation of clustered indexes is that each table can only have one clustered index. Therefore, optimizing tables where multiple columns are frequently queried is difficult.

Next, operations like INSERT, UPDATE, or DELETE can experience slower performance due to the need to maintain the physical order of the rows. For instance, inserting data into the middle of a clustered index causes page splits, fragmenting the index and degrading performance.

Finally, in scenarios where the primary key contains random or non-sequential values (for example, GUIDs), clustered indexes disrupt the logical order and exacerbate fragmentation.

4.2. Best Practices

First, select a column for clustering that aligns with the most frequent query patterns to mitigate the challenges associated with clustered indexes. Ensure that the chosen column is sequential, unique, and rarely updated.

Second, regularly monitor, maintain, and rebuild fragmented indexes.

Third, minimize over-indexing by evaluating the trade-off between read performance and the overhead introduced by maintaining multiple indexes.

Where the primary key does not align with the query patterns, define a separate clustered index and enforce the primary key with a non-clustered index.

4.3. Use Cases

Clustered indexes are beneficial in scenarios where query performance is paramount. For example, in systems where efficient querying is necessary, clustering a sequential primary key like student_id in a Student table ensures optimized data retrieval for looking up student records and performing range queries.

On the other hand, write-heavy applications such as logging systems benefit from clustering on monotonically increasing columns to reduce page splits. In addition, data warehousing and analytics applications often benefit from clustering on columns like Timestamp or composite keys that include dimensions frequently used in filtering or grouping.

5. Conclusion

In this article, we investigated the relationship between primary keys and clustered indexes, their default association, and the applicable exceptions. We also examined the mechanism and defining structure of clustered indexes.

Finally, we considered challenges associated with clustered indexes and their use cases and recommended best practices.

Learn Java Collections

Learn Spring

Learn Maven

View All Courses

Full Archive

About Baeldung