
Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: July 30, 2024
In this tutorial, we’ll learn how to find duplicate rows in our SQL tables and remove them correctly.
First, we’ll look at a database management system (DBMS) agnostic approach to find and remove duplicates. Then, we’ll explore techniques specific to SQL Server and PostgreSQL. Finally, we’ll explore how to remove duplicates without a unique identifier.
As an illustration, we’ll use the Baeldung University schema with the Registration table that holds information about students’ course registrations.
Before we start, we need to understand what duplicate rows are. A duplicate entry is when parts of more than one row contain the same information. Therefore, a duplicate row is specific to our data set. For example, in the Registration table, a duplicate row means one with the same semester, year, course_id, and student_id in our Registration table, even if the id column has a different value for each row.
Note that unique constraints can prevent duplicate rows from being inserted in the first place.
Let’s start by inserting some duplicate records:
INSERT INTO Registration (id, semester, year, course_id, student_id)
VALUES
(100, 'Fall', 2021, 'CS211', 1001),
(200, 'Fall', 2021, 'CS211', 1001),
(300, 'Spring', 2022, 'CS212', 1007),
(400, 'Spring', 2022, 'CS212', 1007),
(500, 'Spring', 2022, 'CS213', 1003),
(600, 'Spring', 2022, 'CS213', 1003);
The duplicate pairs are the entries with IDs 100 and 200, 300 and 400, and 500 and 600.
Even though we already know our duplicate rows, an important step in deleting duplicate rows is understanding how to find them first.
Let’s explore two ways to look for duplicates in our tables.
First, the most straightforward way is to use the GROUP BY clause together with the COUNT function. Second, we’ll use a window function to provide greater flexibility for complex queries, such as ranking or creating running totals.
We’ll use the GROUP BY clause and COUNT function to count the number of rows having the same values for columns that should be unique:
SELECT semester, year, course_id, student_id, COUNT(*) as count
FROM Registration
GROUP BY semester, year, course_id, student_id
HAVING COUNT(*) > 1;
We are filtering the result with the HAVING clause to return the duplicates along with the count of each duplicate:
semester | year | course_id | student_id | count
---------|------|-----------|------------|------
Fall | 2021 | CS211 | 1001 | 2
Spring | 2022 | CS212 | 1007 | 2
Spring | 2022 | CS213 | 1003 | 2
We can see that we have two duplicates per entry of what we inserted in the Registration table.
Alternatively, we can use a window function to assign a row number to each row within a group of duplicates.
In short, window functions perform calculations across a set of rows that are somehow related to the current row. They are convenient when calculating running totals, ranking rows, or comparing a row to preceding or subsequent rows without having to join the table to itself. Specifically, we’ll use ROW_NUMBER() to assign a unique integer to each row of our partitioned result set. Additionally, the OVER clause defines this partitioning and ordering of the rows. Therefore, when we partition by unique columns, any row number greater than 1 is a duplicate:
SELECT id, semester, year, course_id, student_id,
ROW_NUMBER() OVER (
PARTITION BY semester, year, course_id, student_id
ORDER BY id
) As rn
FROM
Registration;
This results in a view with an additional column rn that shows the row number within the group of duplicates:
id | semester | year | course_id | student_id | row_number
----|----------|------|-----------|------------|-----------
100 | Fall | 2021 | CS211 | 1001 | 1
200 | Fall | 2021 | CS211 | 1001 | 2
300 | Spring | 2022 | CS212 | 1007 | 1
400 | Spring | 2022 | CS212 | 1007 | 2
500 | Spring | 2022 | CS213 | 1003 | 1
600 | Spring | 2022 | CS213 | 1003 | 2
Now that we know how to find duplicates let’s look at how to remove them.
There are several techniques to delete duplicates that work for all SQL databases, as they are part of the core SQL specification.
To delete duplicates, we can use a self-join to compare the rows and only keep one instance of each duplicate group:
DELETE r1 FROM Registration r1
INNER JOIN Registration r2
WHERE
r1.semester = r2.semester AND
r1.year = r2.year AND
r1.course_id = r2.course_id AND
r1.student_id = r2.student_id AND
r1.id > r2.id;
This method is easy to understand and works across all SQL databases. It compares each row with every other row in the table and deletes the duplicates based on conditions we specified in the WHERE clause. In this strategy, we keep the row with the lowest id among the duplicates, as identified by the r1.id > r2.id condition.
Alternatively, we can reuse the queries we used to look for the duplicates. For example, we can use a window function along with a subquery:
DELETE FROM Registration
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY semester, year, course_id, student_id
ORDER BY id
) as rn
FROM Registration
) t
WHERE t.row_number > 1
);
Indeed, we could use the GROUP BY/COUNT query as a subquery with identical results as well. Also, choosing between join vs subquery is a matter of personal taste.
In PostgreSQL, we can also use the USING clause to join the table with a subquery that identifies the duplicates:
DELETE FROM Registration
USING (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY semester, year, course_id, student_id
ORDER BY id
) as rn
FROM Registration
) t
WHERE t.rn > 1
) duplicate
WHERE Registration.id = duplicate.id;
This query uses a named subquery to generate row numbers for each group of duplicates and then selects the rows with row numbers greater than one for deletion.
In SQL Server, we can use a Common Table Expression (CTE) to identify duplicates and then delete them:
WITH COUNTED_DUPLICATES AS (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY semester, year, course_id, student_id
ORDER BY id
) AS row_number
FROM Registration
)
DELETE FROM CTE
WHERE row_number > 1;
This query uses a CTE to assign row numbers to each group of duplicates and then deletes the rows with row numbers greater than 1.
In tables where there is no unique identifier (such as a primary key or unique column), removing duplicate rows becomes slightly more complex. This is common when data imports occur without applying constraints or in cases where historical data doesn’t follow a standardized format. There are several techniques to identify duplicates while retaining only one instance of each unique row.
For illustration, let’s modify the Registration table and create a new table called modifiedregistration:
CREATE TABLE modifiedregistration (
semester VARCHAR(30),
year INT,
reg_datetime TIMESTAMP,
course_id VARCHAR(10),
student_id INT
);
INSERT INTO modifiedregistration (semester, year, reg_datetime, course_id, student_id)
VALUES
('Fall', 2021, '2021-09-01 08:00:00', 'CS211', 1001),
('Fall', 2021, '2021-09-01 08:00:00', 'CS211', 1001),
('Spring', 2022, '2022-01-15 09:30:00', 'CS212', 1007),
('Spring', 2022, '2022-01-15 09:30:00', 'CS212', 1007);
CREATE TABLE
INSERT 0 4
Time: 0.009s
Here, we created a new table without a primary key or unique columns to illustrate removing rows without a unique identifier.
Now, let’s proceed to identify the duplicate rows. We use the aggregate function and the GROUP BY clause:
SELECT semester, year, reg_datetime, course_id, student_id, COUNT(*)
FROM modifiedregistration
GROUP BY semester, year, reg_datetime, course_id, student_id
HAVING COUNT(*) > 1;
+----------+------+---------------------+-----------+------------+-------+
| semester | year | reg_datetime | course_id | student_id | count |
|----------+------+---------------------+-----------+------------+-------|
| Spring | 2022 | 2022-01-15 09:30:00 | CS212 | 1007 | 2 |
| Fall | 2021 | 2021-09-01 08:00:00 | CS211 | 1001 | 2 |
+----------+------+---------------------+-----------+------------+-------+
SELECT 2
Time: 0.008s
The query returns only rows that appear more than once, revealing duplicates based on all columns. This method also groups identical rows based on all columns and uses COUNT(*) to highlight records with duplicate occurrences.
Since the modifiedregistration table doesn’t have a primary key or any unique identifier, we can leverage PostgreSQL’s internal ctid to uniquely identify each row. The ctid is a system column that stores the physical location of each row in the table.
For example, we can use internal ctid along with other queries to remove duplicates:
WITH duplicates AS (
SELECT ctid,
ROW_NUMBER() OVER(PARTITION BY semester, year, reg_datetime, course_id, student_id ORDER BY semester) AS row_num
FROM modifiedregistration
)
DELETE FROM modifiedregistration
WHERE ctid IN (SELECT ctid FROM duplicates WHERE row_num > 1);
DELETE 2
Time: 0.005s
Now, let’s explain what each of these queries does in the example above:
This process effectively eliminates duplicates, providing one instance of each unique row.
To verify it removed duplicate rows successfully, let’s verify by querying the table:
select * from modifiedregistration;
+----------+------+---------------------+-----------+------------+
| semester | year | reg_datetime | course_id | student_id |
|----------+------+---------------------+-----------+------------|
| Fall | 2021 | 2021-09-01 08:00:00 | CS211 | 1001 |
| Spring | 2022 | 2022-01-15 09:30:00 | CS212 | 1007 |
+----------+------+---------------------+-----------+------------+
SELECT 2
Time: 0.005s
As seen in the output, the duplicates are removed successfully.
Another approach to remove duplicates is to use a temporary table to hold unique rows. Then, we replace the original table’s content with the temporary table.
First, let’s create the temporary table with unique rows:
CREATE TEMPORARY TABLE unique_rows AS
SELECT DISTINCT *
FROM modifiedregistration;
SELECT 2
Time: 0.006s
Next, we clear the original table and reinsert unique rows:
TRUNCATE TABLE modifiedregistration;
INSERT INTO modifiedregistration
SELECT * FROM unique_rows;
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
TRUNCATE TABLE
INSERT 0 2
Time: 0.006s;
Let’s verify the deletion by querying the table:
SELECT * FROM modifiedregistration;
+----------+------+---------------------+-----------+------------+
| semester | year | reg_datetime | course_id | student_id |
|----------+------+---------------------+-----------+------------|
| Spring | 2022 | 2022-01-15 09:30:00 | CS212 | 1007 |
| Fall | 2021 | 2021-09-01 08:00:00 | CS211 | 1001 |
+----------+------+---------------------+-----------+------------+
SELECT 2
Time: 0.010s
This approach is useful when working with a large dataset or avoiding complex joins in deletion logic.
We’ve learned how to deduplicate tables in an SQL database. First, we learned how to find if a table has duplicate rows. Second, we explored solutions that work for all SQL databases as they adhere to the SQL standard. Additionally, we explored specific solutions for PostgreSQL and SQL Server. Finally, we explored how to remove duplicates without unique identifiers.