Deleting One of Two Identical Rows in SQL Baeldung on SQL

1. Overview

In a relational database, the presence of perfectly identical rows can lead to data inconsistencies and hinder performance. Dealing with duplicate rows is a common data management challenge.

In SQL, perfectly identical rows are the ones where, despite identical data values across all relevant columns, each row is technically distinct due to an underlying, often auto-generated, identifier (like an AUTO_INCREMENT or IDENTITY column). In this tutorial, we’ll explore different methods to delete all the identical rows in SQL.

Note that before executing any DELETE statement on the database, make sure to back up the data.

2. Sample Table and Data

Let’s create a table, DuplicateRecords, and insert sample data:

-- Create sample table
CREATE TABLE DuplicateRecords (
RecordID SERIAL PRIMARY KEY, -- Use INT AUTO_INCREMENT for MySQL
Value1 VARCHAR(50),
Value2 INT,
Value3 DATE
);
-- Insert sample data with duplicates
INSERT INTO DuplicateRecords (Value1, Value2, Value3) VALUES
('Apple', 10, '2023-01-01'),
('Banana', 20, '2023-02-01'),
('Apple', 10, '2023-01-01'), -- Identical to the first row (logical duplicate)
('Cherry', 30, '2023-03-01'),
('Banana', 20, '2023-02-01'), -- Identical to the second row (logical duplicate)
('Apple', 10, '2023-01-01'); -- Another identical row

The above SQL query will create a sample table, DuplicateRecords, and insert 6 rows. These rows are logically identical but have different RecordID values.

Note that we’ll be using this sample data to demonstrate different SQL queries in this article.

3. Using ROW_NUMBER() with a Common Table Expression (CTE)

ROW_NUMBER() is a SQL function that assigns a sequential integer (starting from 1) to rows within a partition of a result set. By partitioning on all columns that define a duplicate (i.e., Value1, Value2, Value3), we can identify all instances of a duplicate set. Finally, we delete all the rows having ROW_NUMBER() greater than 1.

The below query will delete all the duplicate rows in SQL:

-- PostgreSQL / MS SQL Server / MySQL Server
WITH CTE_DuplicateRecords AS (
    SELECT
        RecordID,
        Value1,
        Value2,
        Value3,
        ROW_NUMBER() OVER (PARTITION BY Value1, Value2, Value3 ORDER BY RecordID) as rn
    FROM
        DuplicateRecords
)
DELETE FROM DuplicateRecords
WHERE RecordID IN (SELECT RecordID FROM CTE_DuplicateRecords WHERE rn > 1);

In the above query, the CTE_DuplicateRecords assigns a rank (rn) to each row. Further, we group identical data rows using the “PARTITION BY Value1, Value2, Value3” clause. The “ORDER BY RecordID” clause ensures a consistent selection of which RecordID gets “rn = 1″ (typically, the one with the lowest RecordID is kept). Finally, we use the DELETE statement to remove all but the first occurrence of each logical duplicate.

Note that the above query requires MySQL 8.0 or later versions, as it supports CTEs and window functions, such as ROW_NUMBER().

4. Using DELETE with Self-Join

In this method, we use SQL joins to identify rows that have identical data values but different RecordIDs. It’s often used when we want to keep the row with the minimum (or maximum) RecordID and delete the others.

4.1. Sample Query for PostgreSQL Server

The below query will delete all the duplicate rows in a PostgreSQL server:

-- PostgreSQL
DELETE FROM DuplicateRecords DR1
USING DuplicateRecords DR2
WHERE
    DR1.Value1 = DR2.Value1 AND
    DR1.Value2 = DR2.Value2 AND
    DR1.Value3 = DR2.Value3 AND
    DR1.RecordID > DR2.RecordID;

The above query performs a self-join where DR1 and DR2 are aliases for the DuplicateRecords table. Further, we use the WHERE clause to define the criteria for deletion. The crucial part is “DR1.RecordID > DR2.RecordID”, which ensures that for every pair of identical rows, the row with the higher RecordID is identified for deletion.

4.2. Sample Query for MySQL Server

In MySQL server, the USING clause in the DELETE statement is typically replaced by adding the table alias directly after DELETE. A similar query is shown below for MySQL:

-- MySQL
DELETE FROM DR1
USING DuplicateRecords DR1
JOIN DuplicateRecords DR2 ON
    DR1.Value1 = DR2.Value1 AND
    DR1.Value2 = DR2.Value2 AND
    DR1.Value3 = DR2.Value3 AND
    DR1.RecordID > DR2.RecordID;

The above query will remove all the duplicate rows from the DuplicateRecords table.

4.3. Sample Query for MS SQL Server

Note that the USING clause in a DELETE statement is not supported in MSSQL Server. We can use the below query to remove duplicates from MS SQL server:

-- MS SQL Server
DELETE DR1
FROM DuplicateRecords DR1
JOIN DuplicateRecords DR2 
    ON DR1.Value1 = DR2.Value1 
    AND DR1.Value2 = DR2.Value2 
    AND DR1.Value3 = DR2.Value3 
    AND DR1.RecordID > DR2.RecordID;

5. Using DELETE with MIN/MAX and Subquery

We can use the SQL aggregate functions within a subquery to identify and delete duplicate rows. By using the MIN() or MAX() aggregate functions with a GROUP BY clause, we can select a single RecordID for each group of identical rows.

5.1. Sample Query for PostgreSQL/MS SQL Server

-- PostgreSQL / MS SQL Server
DELETE FROM DuplicateRecords
WHERE RecordID NOT IN (
    SELECT MIN(RecordID)
    FROM DuplicateRecords
    GROUP BY Value1, Value2, Value3
);

Here, we have a subquery that finds the smallest RecordID for each unique combination of Value1, Value2, and Value3. Finally, the outer DELETE statement then removes all rows from DuplicateRecords whose RecordID is not present in that list. Eventually, this will delete all duplicates from the table.

5.2. Sample Query for MySQL Server

The above query will not work in MySQL as it requires an alias for the subquery when it’s used in the IN clause of a DELETE statement.

-- MySQL
DELETE FROM DuplicateRecords
WHERE RecordID NOT IN (
    SELECT T2.MinRecordID FROM (
        SELECT MIN(RecordID) AS MinRecordID
        FROM DuplicateRecords
        GROUP BY Value1, Value2, Value3
    ) AS T2
);

Again, the inner subquery identifies the minimum RecordID similar to the PostgreSQL/MS SQL Server query. Finally, the outer DELETE then removes rows that do not have these minimum RecordIDs.

This approach is generally efficient for moderate to large datasets.

6. Conclusion

In this article, we learned different approaches to deleting duplicate rows in SQL. First, we discussed the ROW_NUMBER() with CTE method. This approach is the most flexible and robust, providing clear control over which duplicate to keep (e.g., based on RecordID, creation date, or another criteria).

We also discussed the self-joins that are useful to delete duplicates when a primary key is available to differentiate them. Finally, we discussed the MIN/MAX with the subquery approach.

The code backing this article is available on GitHub. Once you're logged in as a Baeldung Pro Member, start learning and coding on the project.

Full Archive

About Baeldung