How to Identify Duplicate Values in a SQL Table Baeldung on SQL

1. Overview

Data persistence plays an important role in software development. Though NoSQL databases are popular, the relational databases are still widely used. Relational databases are preferred when the data is well-defined and relationships are crucial.

Relational databases interact with each other using SQL (Structured Query Language). SQL provides many advanced features that allow us to work with databases efficiently, such as using multiple SQL clauses in a single query.

In this tutorial, we’ll use GROUP BY and HAVING clauses to find the duplicate values from a table. So, let’s get started.

2. Setting up an Example

In this section, we’ll set up a PostgreSQL database using a Docker container.

An important point to note is that we’ll use the PostgreSQL database throughout the tutorial. However, the SQL queries will work with any other database that is SQL-compliant.

2.1. Setting up PostgreSQL

First, let’s pull the PostgreSQL docker image from the DockerHub:

$ docker image pull postgres:16.3
16.3: Pulling from library/postgres
09f376ebb190: Pull complete 
119215dfb3e3: Pull complete 
Digest: sha256:1bf73ccae25238fa555100080042f0b2f9be08eb757e200fe6afc1fc413a1b3c
Status: Downloaded newer image for postgres:16.3
docker.io/library/postgres:16.3

In this example, we’ve used the 16.3 tag with the image.

Next, let’s create a directory on a host machine to use with the bind mount:

$ mkdir -p $HOME/docker/volumes/postgres

Finally, let’s create the PostgreSQL container:

$ docker container run --rm --name pg-docker -e POSTGRES_PASSWORD=baeldung -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres:16.3

Here, the container name is pg-docker, and the database password is baeldung.

Also, we’ve used the –rm option to create the container. This option automatically removes the container when it exits.

2.2. Setting up Database Schema

In the previous section, we created a PostgreSQL instance. Now, let’s create a sample database schema to use as an example.

First, let’s exec to the pg-docker container and connect to the database using the psql utility:

$ docker exec -it pg-docker bash
$ PGPASSWORD=baeldung psql -h localhost -U postgres

Next, let’s create a new database – duplicates_demo – and connect to it:

postgres=# CREATE DATABASE duplicates_demo;
CREATE DATABASE
postgres=# \c duplicates_demo

In this example, we’ve used the \c meta-command of the psql utility to establish a connection with the duplicates_demo database.

Finally, let’s create a sample table:

CREATE TABLE employee (
    emp_id INT, 
    first_name VARCHAR(128),
    last_name VARCHAR(128),
    email VARCHAR(128),
    office_location VARCHAR(128)
);

2.3. Inserting Data Into the Table

Now, the employee table is ready. Next, let’s insert some sample data into it.

To begin, let’s insert all unique records:

INSERT INTO
    employee (emp_id, first_name, last_name, email, office_location)
VALUES
    (1, 'John', 'Doe', '[email protected]', 'New York'),
    (2, 'Tom', 'Fisk', '[email protected]', 'Palo Alto'),
    (3, 'Linda', 'Miller', '[email protected]', 'Chicago'),
    (4, 'Jessica', 'Williams', '[email protected]', 'Palo Alto');

Next, let’s insert the records with one column duplicated:

INSERT INTO
    employee (emp_id, first_name, last_name, email, office_location)
VALUES
    (1, 'Emma', 'Jackson', '[email protected]', 'Austin'),
    (3, 'Donald', 'Smith', '[email protected]', 'Seattle'),
    (4, 'Olivia', 'Davis', '[email protected]', 'Boston');

In this example, we’ve created records with the duplicate emp_id field.

Now, let’s insert the records with multiple duplicate columns:

INSERT INTO
    employee (emp_id, first_name, last_name, email, office_location)
VALUES
    (2, 'Tom', 'Fisk', '[email protected]', 'Palo Alto'),
    (3, 'Linda', 'Miller', '[email protected]', 'Chicago'),
    (2, 'Tom', 'Fisk', '[email protected]', 'Palo Alto'),
    (3, 'Linda', 'Miller', '[email protected]', 'Chicago'),
    (2, 'Tom', 'Fisk', '[email protected]', 'Palo Alto');

3. General Approach

In this section, we’ll see examples of the COUNT function and the GROUP BY and HAVING clauses. In the subsequent sections, we’ll use them to find duplicate values.

3.1. COUNT Function

COUNT is one of the most widely used aggregate functions in SQL. It returns the number of rows that match a specific criteria.

To understand this, let’s use the COUNT function to find out the number of rows from the employee table:

select COUNT(emp_id) from employee;

The query returns the following result:

 count 
-------
    12

Here, we can see that the output shows that the employee table has 12 rows.

3.2. GROUP BY Clause

As the name suggests, the GROUP BY clause is used to group the rows. An important point to note is that it’s always used in conjunction with one or more aggregate functions.

So, let’s write a simple query to count the number of employees per office:

select office_location, count(emp_id) from employee group by office_location;

The query shows the following output:

office_location  | count
-----------------+------
 New York        |     1
 Austin          |     1
 Chicago         |     3
 Seattle         |     1
 Boston          |     1
 Palo Alto       |     5

In this example, we’ve used the GROUP BY clause to group the rows by the office_location column and the COUNT aggregate function to count rows.

3.3. HAVING Clause

The HAVING clause is used to specify the search condition on a group.

In the previous example, we grouped rows by office_location. Now, Let’s use the HAVING clause to filter the result where the office_location is Palo Alto:

select office_location, count(emp_id) from employee group by office_location having office_location = 'Palo Alto';

The query generates the following result:

office_location | count
-----------------+------
 Palo Alto       |     5

4. Identifying Duplicate Values

In the previous section, we saw how to use the COUNT function, GROUP BY clause, and HAVING clause. Now, let’s use them to identify the duplicate values from the table.

4.1. Duplicate Values in One Column

While setting up an example, we inserted a few records with a duplicate emp_id field.

Now, let’s write a query to list those records:

SELECT emp_id, COUNT(emp_id)
FROM employee
GROUP BY emp_id
HAVING COUNT(emp_id) > 1;

The query returns the duplicated emp_id and the number of times it’s duplicated:

emp_id | count 
--------+-------
      4 |     2
      2 |     4
      3 |     4
      1 |     2

In this example, we searched for duplicated records using the emp_id column.

In the next section, we’ll see how to identify the duplicate values in multiple columns.

4.2. Duplicate Values in Multiple Columns

While setting up an example, we also inserted a few records with multiple duplicate fields.

So, let’s write a query to list rows in which emp_id, first_name, and last_name fields are identical:

SELECT emp_id, first_name, last_name, COUNT(*)
FROM employee
GROUP BY emp_id, first_name, last_name
HAVING COUNT(*) > 1;

The query generates the following output:

emp_id | first_name | last_name | count 
--------+------------+-----------+-------
      3 | Linda      | Miller    |     3
      2 | Tom        | Fisk      |     4

Here, we can see the records from the employee table in which the emp_id, first_name, and last_name columns are identical.

5. Conclusion

In this article, we saw how to find duplicate values from a SQL table.

First, we discussed how to use the COUNT function. After that, we discussed how to use GROUP BY and HAVING clauses. Next, we wrote the SQL query to find duplicates from the single column.

Finally, we wrote the SQL query to find duplicates from multiple columns.

Learn Java Collections

Learn Spring

Learn Maven

View All Courses

Full Archive

About Baeldung