1. Overview

When querying data in SQL, it’s crucial to understand how to efficiently retrieve and organize results. GROUP BY and DISTINCT are two common tools for this: GROUP BY collects rows into groups so that we can aggregate them, while DISTINCT filters duplicate rows out of the result.

GROUP BY and DISTINCT may look similar, since we can use both to eliminate duplicate records. Nevertheless, they serve different purposes and are applied differently within SQL queries.

In this tutorial, we’ll explore GROUP BY and DISTINCT clauses and their characteristics. Also, we’ll provide practical examples using the Baeldung University Student table. Finally, we’ll conclude with a comparative analysis to highlight their distinct features.

2. GROUP BY

We use GROUP BY to aggregate data across multiple records by grouping the rows that share the same values in the specified columns. This clause is particularly powerful when combined with aggregate functions such as COUNT, SUM, AVG, MAX, and MIN. In practice, it enables the retrieval of summary data from a table.

For example, let’s find out how many students enrolled each year with the use of GROUP BY:

SELECT EXTRACT(YEAR FROM enrollment_date) AS enrollment_year, COUNT(id) AS student_count
FROM Student
GROUP BY EXTRACT(YEAR FROM enrollment_date);
+-----------------+---------------+
| enrollment_year | student_count |
|-----------------+---------------|
| 2022.0          | 5             |
| 2020.0          | 10            |
| 2021.0          | 10            |
+-----------------+---------------+
SELECT 3
Time: 0.015s

In this query, SQL groups the records of the Student table by the year specified in the enrollment_date column. Then, for each group (that is, each enrollment year), it counts the number of id values, which correspond to individual students. The result is a list of enrollment years, each associated with the number of students who enrolled that year.
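To make the grouping easy to verify by hand, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database through Python's sqlite3 module. The five sample rows are hypothetical and the table is reduced to three columns. SQLite has no EXTRACT(), so strftime('%Y', ...) stands in for it and returns the year as text:

```python
import sqlite3

# Hypothetical sample data; the real Student table has more rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (id INTEGER PRIMARY KEY, enrollment_date TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Student (id, enrollment_date, gpa) VALUES (?, ?, ?)",
    [
        (1, "2020-01-15", 3.8),
        (2, "2020-01-15", 3.3),
        (3, "2021-01-15", 3.64),
        (4, "2022-01-15", 4.65),
        (5, "2022-01-15", 4.59),
    ],
)

# strftime('%Y', ...) is SQLite's substitute for EXTRACT(YEAR FROM ...).
counts = conn.execute(
    """
    SELECT strftime('%Y', enrollment_date) AS enrollment_year,
           COUNT(id) AS student_count
    FROM Student
    GROUP BY strftime('%Y', enrollment_date)
    ORDER BY enrollment_year
    """
).fetchall()

for enrollment_year, student_count in counts:
    print(enrollment_year, student_count)
```

Each distinct year yields exactly one output row, and the count for that row equals the number of sample students enrolled in that year.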

Additionally, let’s use the GROUP BY clause to calculate the average GPA of students grouped by their enrollment year:

SELECT EXTRACT(YEAR FROM enrollment_date) AS enrollment_year, AVG(gpa) AS average_gpa
FROM Student
GROUP BY EXTRACT(YEAR FROM enrollment_date);
+-----------------+--------------------+
| enrollment_year | average_gpa        |
|-----------------+--------------------|
| 2022.0          | 4.107500076293945  |
| 2020.0          | 3.9229999780654907 |
| 2021.0          | 3.7171429225376675 |
+-----------------+--------------------+
SELECT 3
Time: 0.029s

In this case, the GROUP BY clause groups the students by their enrollment year, and the AVG() function calculates the average GPA for each group. The output gives a year-by-year breakdown of average GPAs, highlighting trends in academic performance over time.
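One detail worth checking when averaging is how missing values behave, since the Student table contains at least one row with a null gpa. In standard SQL, AVG() and COUNT(column) skip NULL values, while COUNT(*) counts every row. The following sketch illustrates this with hypothetical sample rows in an in-memory SQLite database, again using strftime('%Y', ...) in place of EXTRACT():

```python
import sqlite3

# Hypothetical sample data, including one student without a GPA.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (id INTEGER PRIMARY KEY, enrollment_date TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Student (id, enrollment_date, gpa) VALUES (?, ?, ?)",
    [
        (1, "2020-01-15", 3.5),
        (2, "2020-01-15", 2.5),
        (3, "2021-01-15", 3.0),
        (4, "2021-01-15", None),  # NULL gpa
        (5, "2021-01-15", 4.0),
    ],
)

averages = conn.execute(
    """
    SELECT strftime('%Y', enrollment_date) AS enrollment_year,
           COUNT(*)   AS students,           -- counts every row in the group
           COUNT(gpa) AS students_with_gpa,  -- skips NULL gpa values
           AVG(gpa)   AS average_gpa         -- also skips NULL gpa values
    FROM Student
    GROUP BY strftime('%Y', enrollment_date)
    ORDER BY enrollment_year
    """
).fetchall()

for row in averages:
    print(row)
```

For the 2021 group, the average is computed over the two students who have a GPA, not over all three rows.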

In both examples, the GROUP BY clause is instrumental in breaking down the data into logical groups. In particular, it produces summary information that a SELECT statement without grouping can’t easily provide.

Logically, SQL evaluates the GROUP BY clause before the SELECT list. As a result, in some database systems like SQL Server, the enrollment_year alias defined in the SELECT clause can’t be used in the GROUP BY clause as a substitute for EXTRACT(YEAR FROM enrollment_date). However, other databases, such as MySQL and PostgreSQL, permit this usage as an extension.
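Whether an alias is accepted in GROUP BY is easy to check on whichever engine we use. As a sketch, the snippet below runs both forms against an in-memory SQLite database, which also happens to accept the alias. SQLite is only a convenient stand-in here, so the result says nothing about SQL Server; the sample rows are hypothetical, and strftime('%Y', ...) replaces EXTRACT():

```python
import sqlite3

# Hypothetical sample data for comparing the two GROUP BY spellings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, enrollment_date TEXT)")
conn.executemany(
    "INSERT INTO Student (id, enrollment_date) VALUES (?, ?)",
    [(1, "2020-01-15"), (2, "2020-09-01"), (3, "2021-01-15")],
)

# Portable form: repeat the grouping expression in GROUP BY.
by_expression = conn.execute(
    """
    SELECT strftime('%Y', enrollment_date) AS enrollment_year, COUNT(id)
    FROM Student
    GROUP BY strftime('%Y', enrollment_date)
    ORDER BY enrollment_year
    """
).fetchall()

# Alias form: accepted by SQLite, but not by every database system.
by_alias = conn.execute(
    """
    SELECT strftime('%Y', enrollment_date) AS enrollment_year, COUNT(id)
    FROM Student
    GROUP BY enrollment_year
    ORDER BY enrollment_year
    """
).fetchall()

print(by_expression == by_alias)
```

Repeating the expression in GROUP BY is the safer choice when a query has to run on several database systems.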

3. DISTINCT

DISTINCT eliminates duplicate rows from the result set, ensuring that it returns unique values. Furthermore, we employ DISTINCT to retrieve unique records from a column or a combination of columns without performing any aggregation.

For example, let’s retrieve a list of unique enrollment years from the Student table:

SELECT DISTINCT EXTRACT(YEAR FROM enrollment_date) AS enrollment_year
FROM Student;
+-----------------+
| enrollment_year |
|-----------------|
| 2022.0          |
| 2020.0          |
| 2021.0          |
+-----------------+
SELECT 3
Time: 0.013s

Here, SQL scans the enrollment_date column and extracts the year from each date. Then, we apply the DISTINCT keyword to return unique years. Consequently, the result lists enrollment years with duplicates removed.

Furthermore, we can use the DISTINCT clause to retrieve unique combinations of multiple columns. For example, let’s find out which unique combinations of enrollment years and GPAs exist in the Student table:

SELECT DISTINCT EXTRACT(YEAR FROM enrollment_date) AS enrollment_year, gpa
FROM Student;
+-----------------+--------+
| enrollment_year | gpa    |
|-----------------+--------|
| 2022.0          | 4.65   |
| 2021.0          | <null> |
| 2020.0          | 3.3    |
| 2020.0          | 3.8    |
| 2020.0          | 4.9    |
| 2021.0          | 3.64   |
| 2021.0          | 2.75   |
| 2022.0          | 4.59   |
...

In this case, the DISTINCT keyword eliminates any duplicate combinations of enrollment_year and gpa.
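The sketch below shows both uses of DISTINCT on a handful of hypothetical rows in an in-memory SQLite database, with strftime('%Y', ...) standing in for EXTRACT(). It also illustrates how NULL is handled: for the purposes of DISTINCT, two NULL values count as duplicates, so rows with the same year and a null gpa collapse into a single row:

```python
import sqlite3

# Hypothetical sample data with repeated (year, gpa) pairs and NULL GPAs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (id INTEGER PRIMARY KEY, enrollment_date TEXT, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Student (id, enrollment_date, gpa) VALUES (?, ?, ?)",
    [
        (1, "2020-01-15", 3.8),
        (2, "2020-01-15", 3.8),   # duplicate of the pair above
        (3, "2020-01-15", 3.3),
        (4, "2021-01-15", None),
        (5, "2021-01-15", None),  # duplicate pair with a NULL gpa
    ],
)

# DISTINCT over a single expression: one row per enrollment year.
years = conn.execute(
    """
    SELECT DISTINCT strftime('%Y', enrollment_date) AS enrollment_year
    FROM Student
    ORDER BY enrollment_year
    """
).fetchall()

# DISTINCT over two columns: one row per unique (year, gpa) combination.
pairs = conn.execute(
    """
    SELECT DISTINCT strftime('%Y', enrollment_date) AS enrollment_year, gpa
    FROM Student
    ORDER BY enrollment_year, gpa
    """
).fetchall()

print(years)
print(pairs)
```

Five input rows reduce to two distinct years and three distinct combinations of year and GPA.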

4. Differences

Let’s summarize the key differences between the GROUP BY clause and the DISTINCT clause:

| Feature | GROUP BY | DISTINCT |
|---|---|---|
| Purpose | Groups rows based on one or more columns and allows for aggregation using functions like COUNT, SUM, and AVG | Filters out duplicate rows from the result set, ensuring that only unique rows are returned |
| Aggregation | Supports aggregate functions to summarize data within each group | Doesn’t aggregate on its own, although it can appear inside an aggregate function, as in COUNT(DISTINCT column) |
| Output | Produces one row per group with aggregated data | Produces distinct rows without duplicates |
| Performance | May require more processing power when combined with aggregate functions | Often comparable to a GROUP BY on the same columns for simple deduplication; the cost grows with the size of the dataset |
| Functionality | Can handle complex queries involving multiple columns and aggregate functions | Limited to filtering out duplicates without additional processing |
| Use Case | Useful when we need to perform calculations or summarizations within grouped data, such as calculating averages, totals, or counts | Useful when we need to retrieve a list of unique values from one or more columns |
| Order | Is processed before the SELECT clause | Is processed as part of the SELECT clause |
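The overlap and the difference summarized above can be seen side by side in a short sketch. It assumes hypothetical sample rows in an in-memory SQLite database and uses strftime('%Y', ...) in place of EXTRACT(). A GROUP BY without any aggregate function returns the same rows as DISTINCT, while only GROUP BY can attach a per-group summary; DISTINCT can still appear inside an aggregate such as COUNT:

```python
import sqlite3

# Hypothetical sample data spanning three enrollment years.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, enrollment_date TEXT)")
conn.executemany(
    "INSERT INTO Student (id, enrollment_date) VALUES (?, ?)",
    [
        (1, "2020-01-15"),
        (2, "2020-01-15"),
        (3, "2021-01-15"),
        (4, "2022-01-15"),
        (5, "2022-01-15"),
    ],
)

# Deduplication with DISTINCT.
distinct_years = conn.execute(
    """
    SELECT DISTINCT strftime('%Y', enrollment_date) AS enrollment_year
    FROM Student
    ORDER BY enrollment_year
    """
).fetchall()

# The same deduplication expressed as GROUP BY without an aggregate.
grouped_years = conn.execute(
    """
    SELECT strftime('%Y', enrollment_date) AS enrollment_year
    FROM Student
    GROUP BY strftime('%Y', enrollment_date)
    ORDER BY enrollment_year
    """
).fetchall()

# DISTINCT used inside an aggregate: how many different years exist?
unique_year_count = conn.execute(
    "SELECT COUNT(DISTINCT strftime('%Y', enrollment_date)) FROM Student"
).fetchone()[0]

print(distinct_years == grouped_years, unique_year_count)
```

When all we need is the list of unique values, DISTINCT states that intent more directly; GROUP BY becomes the better fit as soon as a count, sum, or average per value is required.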

5. Conclusion

In this article, we’ve explored the differences between the GROUP BY and DISTINCT clauses in SQL. GROUP BY is ideal for aggregating data with summary functions, whereas DISTINCT is useful for filtering out duplicate rows.

Understanding when and how to use each clause can greatly enhance the efficiency and clarity of our SQL queries.

The code backing this article is available on GitHub.