Baeldung Pro – SQL – NPI EA (cat = Baeldung on SQL)
announcement - icon

Learn through the super-clean Baeldung Pro experience:

>> Membership and Baeldung Pro.

No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.

1. Overview

In this tutorial, we’ll learn how to find duplicate rows in our SQL tables and remove them correctly.

First, we’ll look at a database management system (DBMS) agnostic approach to find and remove duplicates. Then, we’ll explore techniques specific to SQL Server and PostgreSQL. Finally, we’ll explore how to remove duplicates without a unique identifier.

As an illustration, we’ll use the Baeldung University schema with the Registration table that holds information about students’ course registrations.

2. Defining the Problem

Before we start, we need to understand what duplicate rows are. A duplicate entry is when parts of more than one row contain the same information. Therefore, a duplicate row is specific to our data set. For example, in the Registration table, a duplicate row means one with the same semester, year, course_id, and student_id in our Registration table, even if the id column has a different value for each row.

Note that unique constraints can prevent duplicate rows from being inserted in the first place.

3. Setup

Let’s start by inserting some duplicate records:

INSERT INTO Registration (id, semester, year, course_id, student_id)
VALUES 
    (100, 'Fall', 2021, 'CS211', 1001),
    (200, 'Fall', 2021, 'CS211', 1001),
    (300, 'Spring', 2022, 'CS212', 1007),
    (400, 'Spring', 2022, 'CS212', 1007),
    (500, 'Spring', 2022, 'CS213', 1003),
    (600, 'Spring', 2022, 'CS213', 1003);

The duplicate pairs are the entries with IDs 100 and 200, 300 and 400, and 500 and 600.

Even though we already know our duplicate rows, an important step in deleting duplicate rows is understanding how to find them first.

4. Finding Duplicates

Let’s explore two ways to look for duplicates in our tables.

First, the most straightforward way is to use the GROUP BY clause together with the COUNT function. Second, we’ll use a window function to provide greater flexibility for complex queries, such as ranking or creating running totals.

4.1. Using GROUP BY and COUNT

We’ll use the GROUP BY clause and COUNT function to count the number of rows having the same values for columns that should be unique:

SELECT semester, year, course_id, student_id, COUNT(*) as count
FROM Registration
GROUP BY semester, year, course_id, student_id
HAVING COUNT(*) > 1;

We are filtering the result with the HAVING clause to return the duplicates along with the count of each duplicate:

semester | year | course_id | student_id | count
---------|------|-----------|------------|------
Fall     | 2021 | CS211     | 1001       | 2
Spring   | 2022 | CS212     | 1007       | 2
Spring   | 2022 | CS213     | 1003       | 2

We can see that we have two duplicates per entry of what we inserted in the Registration table.

4.2. Using ROW_NUMBER() With PARTITION BY

Alternatively, we can use a window function to assign a row number to each row within a group of duplicates.

In short, window functions perform calculations across a set of rows that are somehow related to the current row. They are convenient when calculating running totals, ranking rows, or comparing a row to preceding or subsequent rows without having to join the table to itself. Specifically, we’ll use ROW_NUMBER() to assign a unique integer to each row of our partitioned result set. Additionally, the OVER clause defines this partitioning and ordering of the rows. Therefore, when we partition by unique columns, any row number greater than 1 is a duplicate:

SELECT id, semester, year, course_id, student_id,
    ROW_NUMBER() OVER (
        PARTITION BY semester, year, course_id, student_id
        ORDER BY id
    ) As rn
FROM
    Registration;

This results in a view with an additional column rn that shows the row number within the group of duplicates:

id  | semester | year | course_id | student_id | row_number
----|----------|------|-----------|------------|-----------
100 | Fall     | 2021 | CS211     | 1001       | 1
200 | Fall     | 2021 | CS211     | 1001       | 2
300 | Spring   | 2022 | CS212     | 1007       | 1
400 | Spring   | 2022 | CS212     | 1007       | 2
500 | Spring   | 2022 | CS213     | 1003       | 1
600 | Spring   | 2022 | CS213     | 1003       | 2

Now that we know how to find duplicates let’s look at how to remove them.

5. Removing Duplicate Rows

There are several techniques to delete duplicates that work for all SQL databases, as they are part of the core SQL specification.

5.1. Deleting Duplicates Using Self-Join

To delete duplicates, we can use a self-join to compare the rows and only keep one instance of each duplicate group:

DELETE r1 FROM Registration r1
INNER JOIN Registration r2
WHERE
    r1.semester = r2.semester AND
    r1.year = r2.year AND
    r1.course_id = r2.course_id AND
    r1.student_id = r2.student_id AND
    r1.id > r2.id;

This method is easy to understand and works across all SQL databases. It compares each row with every other row in the table and deletes the duplicates based on conditions we specified in the WHERE clause. In this strategy, we keep the row with the lowest id among the duplicates, as identified by the r1.id > r2.id condition.

5.2. Deleting Duplicates Using Subqueries

Alternatively, we can reuse the queries we used to look for the duplicates. For example, we can use a window function along with a subquery:

DELETE FROM Registration
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
            ROW_NUMBER() OVER (
                PARTITION BY semester, year, course_id, student_id
                ORDER BY id
            ) as rn
        FROM Registration
    ) t
    WHERE t.row_number > 1
);

Indeed, we could use the GROUP BY/COUNT query as a subquery with identical results as well. Also, choosing between join vs subquery is a matter of personal taste.

6. Removing Duplicates in PostgreSQL

In PostgreSQL, we can also use the USING clause to join the table with a subquery that identifies the duplicates:

DELETE FROM Registration
USING (
    SELECT id
    FROM (
        SELECT id,
            ROW_NUMBER() OVER (
                PARTITION BY semester, year, course_id, student_id
                ORDER BY id
            ) as rn
        FROM Registration
    ) t
    WHERE t.rn > 1
) duplicate
WHERE Registration.id = duplicate.id;

This query uses a named subquery to generate row numbers for each group of duplicates and then selects the rows with row numbers greater than one for deletion.

7. Removing Duplicates in SQL Server

In SQL Server, we can use a Common Table Expression (CTE) to identify duplicates and then delete them:

WITH COUNTED_DUPLICATES AS (
    SELECT id,
        ROW_NUMBER() OVER (
            PARTITION BY semester, year, course_id, student_id
            ORDER BY id
        ) AS row_number
    FROM Registration
)
DELETE FROM CTE
WHERE row_number > 1;

This query uses a CTE to assign row numbers to each group of duplicates and then deletes the rows with row numbers greater than 1.

8. Removing Duplicate Rows Without a Unique Identifier

In tables where there is no unique identifier (such as a primary key or unique column), removing duplicate rows becomes slightly more complex. This is common when data imports occur without applying constraints or in cases where historical data doesn’t follow a standardized format. There are several techniques to identify duplicates while retaining only one instance of each unique row.

For illustration, let’s modify the Registration table and create a new table called modifiedregistration:

CREATE TABLE modifiedregistration (
     semester VARCHAR(30),
     year INT,
     reg_datetime TIMESTAMP,
     course_id VARCHAR(10),
     student_id INT
 );
 
 INSERT INTO modifiedregistration (semester, year, reg_datetime, course_id, student_id)
 VALUES 
     ('Fall', 2021, '2021-09-01 08:00:00', 'CS211', 1001),
     ('Fall', 2021, '2021-09-01 08:00:00', 'CS211', 1001),
     ('Spring', 2022, '2022-01-15 09:30:00', 'CS212', 1007),
     ('Spring', 2022, '2022-01-15 09:30:00', 'CS212', 1007);
CREATE TABLE
INSERT 0 4
Time: 0.009s

Here, we created a new table without a primary key or unique columns to illustrate removing rows without a unique identifier.

8.1. Identifying Duplicate Rows

Now, let’s proceed to identify the duplicate rows. We use the aggregate function and the GROUP BY clause:

SELECT semester, year, reg_datetime, course_id, student_id, COUNT(*)
FROM modifiedregistration
GROUP BY semester, year, reg_datetime, course_id, student_id
HAVING COUNT(*) > 1;
+----------+------+---------------------+-----------+------------+-------+
| semester | year | reg_datetime        | course_id | student_id | count |
|----------+------+---------------------+-----------+------------+-------|
| Spring   | 2022 | 2022-01-15 09:30:00 | CS212     | 1007       | 2     |
| Fall     | 2021 | 2021-09-01 08:00:00 | CS211     | 1001       | 2     |
+----------+------+---------------------+-----------+------------+-------+
SELECT 2
Time: 0.008s

The query returns only rows that appear more than once, revealing duplicates based on all columns. This method also groups identical rows based on all columns and uses COUNT(*) to highlight records with duplicate occurrences.

8.2. Deleting Duplicates Using Internal ctid

Since the modifiedregistration table doesn’t have a primary key or any unique identifier, we can leverage PostgreSQL’s internal ctid to uniquely identify each row. The ctid is a system column that stores the physical location of each row in the table.

For example, we can use internal ctid along with other queries to remove duplicates:

WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER(PARTITION BY semester, year, reg_datetime, course_id, student_id ORDER BY semester) AS row_num
    FROM modifiedregistration
)
DELETE FROM modifiedregistration
WHERE ctid IN (SELECT ctid FROM duplicates WHERE row_num > 1);
DELETE 2
Time: 0.005s

Now, let’s explain what each of these queries does in the example above:

  • ROW_NUMBER(): assigns a unique sequential integer (row_num) to each row within the group defined by PARTITION BY
  • PARTITION BY: specifies duplicates are identified based on the combination of all the columns
  • ORDER BY: ensures rows are ordered within their group by semester
  • DELETE: this statement removes all rows except the first one in each group (that is, where row_num > 1)

This process effectively eliminates duplicates, providing one instance of each unique row.

To verify it removed duplicate rows successfully, let’s verify by querying the table:

select * from modifiedregistration;
+----------+------+---------------------+-----------+------------+
| semester | year | reg_datetime        | course_id | student_id |
|----------+------+---------------------+-----------+------------|
| Fall     | 2021 | 2021-09-01 08:00:00 | CS211     | 1001       |
| Spring   | 2022 | 2022-01-15 09:30:00 | CS212     | 1007       |
+----------+------+---------------------+-----------+------------+
SELECT 2
Time: 0.005s

As seen in the output, the duplicates are removed successfully.

8.3. Using Temporary Tables

Another approach to remove duplicates is to use a temporary table to hold unique rows. Then, we replace the original table’s content with the temporary table.

First, let’s create the temporary table with unique rows:

CREATE TEMPORARY TABLE unique_rows AS
 SELECT DISTINCT *
 FROM modifiedregistration;
SELECT 2
Time: 0.006s

Next, we clear the original table and reinsert unique rows:

TRUNCATE TABLE modifiedregistration;

INSERT INTO modifiedregistration
SELECT * FROM unique_rows;
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
TRUNCATE TABLE
INSERT 0 2
Time: 0.006s;

Let’s verify the deletion by querying the table:

SELECT * FROM modifiedregistration;
+----------+------+---------------------+-----------+------------+
| semester | year | reg_datetime        | course_id | student_id |
|----------+------+---------------------+-----------+------------|
| Spring   | 2022 | 2022-01-15 09:30:00 | CS212     | 1007       |
| Fall     | 2021 | 2021-09-01 08:00:00 | CS211     | 1001       |
+----------+------+---------------------+-----------+------------+
SELECT 2
Time: 0.010s

This approach is useful when working with a large dataset or avoiding complex joins in deletion logic.

9. Conclusion

We’ve learned how to deduplicate tables in an SQL database. First, we learned how to find if a table has duplicate rows. Second, we explored solutions that work for all SQL databases as they adhere to the SQL standard. Additionally, we explored specific solutions for PostgreSQL and SQL Server. Finally, we explored how to remove duplicates without unique identifiers.

The code backing this article is available on GitHub. Once you're logged in as a Baeldung Pro Member, start learning and coding on the project.
2 Comments
Oldest
Newest
Inline Feedbacks
View all comments