
Learn through the super-clean Baeldung Pro experience:
>> Membership and Baeldung Pro.
No ads, dark-mode and 6 months free of IntelliJ Idea Ultimate to start with.
Last updated: November 6, 2024
In SQL, efficient data retrieval is paramount in building dynamic, scalable applications. As databases grow more complex, the ability to filter, manipulate, and access data effectively becomes important.
In this tutorial, we’ll explore two powerful tools for data handling and retrieval, subqueries and correlated subqueries, and their similarities, differences, and use cases.
A subquery represents an SQL query embedded within another SQL query (outer query). It is also known as an inner or nested query. Subqueries run independently and serve as the input to the outer query.
The inner query executes only once, regardless of how many rows the outer query processes. Subqueries prove more efficient for large datasets compared to correlated subqueries.
Let’s consider a relevant example for the Baeldung University database schema:
We can retrieve relevant results using subqueries in the SELECT, FROM, or WHERE clause, which allows us to filter data dynamically:
SELECT s.name AS student_name, s.gpa
FROM Student s
WHERE s.id IN (SELECT r.student_id
FROM Registration r
INNER JOIN Course c ON r.course_id = c.id
WHERE c.department_id = (SELECT id FROM Department WHERE name = 'Mathematics'));
There are two inner subqueries. First, the innermost subquery identifies which department corresponds to “Mathematics” so that the courses can be accurately filtered. The second subquery retrieves the student_ids of students who registered in any course that belongs to the Mathematics department.
Finally, the outer query uses these student_ids to retrieve the corresponding student name and gpa from the Student table, thus completing the data retrieval process:
student_name | gpa |
---|---|
John Liu | 4 |
Rita Ora | 4.2 |
Samantha Prabhu | 4.9 |
Rose Rit | 3.57 |
Phellum Luis | 4.21 |
Ritu Raj | Null |
Piu Liu | 2.99 |
Roni Roto | 4.44 |
Of all the eight students enrolled in the Mathematics department, only one is yet to have a gpa.
In correlated subqueries, the inner and outer queries are interdependent. The correlated inner query reference columns from the outer query hence, they cannot run independently.
Unlike regular subqueries, correlated subqueries execute once for every row processed by the outer query, making them more complex. We use correlated subqueries when we need to compare values on a row-by-row basis:
SELECT f.name AS faculty_name, COUNT(t.course_id) AS courses_taught, f.position
FROM Faculty f
INNER JOIN Teaching t ON t.faculty_id = f.id
WHERE year = '2023' AND t.semester = 'SPRING'
GROUP BY f.id, t.semester, t.year, f.name, f.position
HAVING COUNT(t.course_id) = (
SELECT MAX(semester_course_count)
FROM (SELECT COUNT(t2.course_id) AS semester_course_count
FROM Teaching t2
WHERE t2.faculty_id = f.id
GROUP BY t2.semester, t2.year) AS semester_counts);
The correlated subquery first counts the number of courses each faculty member taught in a semester and then finds the highest count among all faculty members. The main query then compares each faculty course count for Spring 2023 with this maximum.
Using the HAVING clause, it filters to show only those faculty members who match the highest course count. The entire query effectively identifies faculty members and their respective positions who taught the most courses in that semester:
faculty_name | courses_taught | position |
---|---|---|
Risa Sodi | 4 | Associate Professor |
Milos Simic | 3 | Teaching Assistant |
AV Subramanium | 3 | Assistant Professor |
Ajit Singh | 3 | Assistant Professor |
Wlliam Liu | 3 | Professor |
… | … | … |
We can see a list of faculty members who had varying levels of teaching load (in terms of course content) in Spring 2023.
Generally, subqueries help SQL developers break down complex operations into smaller, more manageable steps. However, some factors set regular and correlated subqueries apart from each other:
Parameters | Subquery | Correlated subquery |
---|---|---|
Approach | Bottom-up approach (inner query first, then outer query) | Top-down approach (outer query first, then inner query for each row) |
Nested Levels | Subqueries can be deeply nested, allowing for multiple levels of queries within queries, especially for complex filtering and aggregate operations. | Nested correlated subqueries can increase complexity and performance costs. |
Data cardinality | Works well with high cardinality datasets where broad filtering or aggregation needs to be done on the entire dataset or across multiple records. | It is ideal for low to medium cardinality scenarios that require row-by-row comparisons or calculations. |
Reusability | Easier to reuse in other queries or as a standalone query | The correlated subquery’s tight coupling with the outer query limits its reusability when isolated. |
Optimization by SQL engine | SQL engines can optimize subqueries by using techniques like query flattening. | Optimization techniques like caching on correlated subqueries are difficult because of their dependency on the outer query. |
Use cases | They are used when the result of the inner query can be calculated independently, such as retrieving a specific value or performing an aggregate calculation that applies globally. | Applied when the inner query needs to compare or evaluate data row-by-row. |
The decision to use a subquery or a correlated subquery depends on the problem and the desired solution. Alternatively, we can often rewrite both queries as JOIN statements for better performance and clarity.
In this article, we discussed the similarities and distinctions between a subquery and a correlated subquery. We also examined how the independence or lack of it influences the use cases of the types of subqueries. Additionally, we highlighted the different challenges that plague each subquery type.