In general you have to possibilities to build extended queries in SQL: Joins and subselects. In most cases both of them are applicable, but which one should be chosen then? Sometimes I prefer the subselect, because it is, at least in my opinion, easier to read, although many sources say that a join is often faster. Of course this is only my personal preference, but a completely objective fact to consider is performance.
The Tests
So I decided to do some performance tests, to be more specific I solved some tasks using both of these methods, and compared them. These tests were run on a Lenovo T420s Notebook running ArchLinux and MySQL 5.5.27, using MySQL Workbench. Each test was repeated 10 times, so that I could calculate an average time, which hopefully reduces unavoidable side effects.
As example database I used the already existing Employee-Database provided by the mysql team. It already contains many records, so it was excellent for my purpose. I assume that you may also be interested in the structure of this database.
Test 1
The first task is to return the number of employees earning more than 150.000,– at the moment.
I begun with the join version:
SELECT COUNT(*) FROM employees INNER JOIN salaries ON employees.emp_no = salaries.emp_no WHERE salaries.salary > 150000 AND NOW() BETWEEN from_date AND to_date;
As you can see I join the employees with its corresponding salaries, choose only the salaries over 150.000,– and limit the result to the actual salaries. Due to the big size of the database this operation already took 931,2 ms.
After this first test run I tried the following statement, using a subselect:
SELECT COUNT(*) FROM employees WHERE EXISTS( SELECT 'ok' FROM salaries WHERE salaries.emp_no = employees.emp_no AND NOW() BETWEEN from_date AND to_date AND salaries.salary > 150000 );
This time I select all the employees, and restrict the result set with the exists-Keyword, using a subquery, which returns something, if the employee from the outer select has a salaray over 150.000,–.
And also this operation took quite long: 1376,4 ms in average. That means the subselect version takes nearly half time longer than the join version (47,8%).
Test 2
Of course I have also tried a second task, just to double check my results. This time I wanted to know how many employees are working for departments with a manager having a salary over 100.000,–.
Again the join version first:
SELECT COUNT(*) FROM employees INNER JOIN dept_emp ON employees.emp_no = dept_emp.emp_no INNER JOIN departments ON departments.dept_no = dept_emp.dept_no INNER JOIN dept_manager ON departments.dept_no = dept_manager.dept_no INNER JOIN employees managers ON managers.emp_no = dept_manager.emp_no INNER JOIN salaries ON managers.emp_no = salaries.emp_no WHERE NOW() BETWEEN dept_emp.from_date AND dept_emp.to_date AND NOW() BETWEEN dept_manager.from_date AND dept_manager.to_date AND NOW() BETWEEN salaries.from_date AND salaries.to_date AND salaries.salary > 100000;
It’s the same technique used in the first task: I joined all the required tables, and specified the desired rows in the where-clause. This statement took the RDBMS in average 147,4 ms to complete.
Now I will have a look at the subselect version again:
SELECT COUNT(*) FROM employees INNER JOIN dept_emp ON employees.emp_no = dept_emp.emp_no WHERE NOW() BETWEEN dept_emp.from_date AND dept_emp.to_date AND dept_emp.dept_no IN ( SELECT dept_manager.dept_no FROM dept_manager INNER JOIN employees ON employees.emp_no = dept_manager.emp_no INNER JOIN salaries ON employees.emp_no = salaries.emp_no WHERE NOW() BETWEEN dept_manager.from_date AND dept_manager.to_date AND NOW() BETWEEN salaries.from_date AND salaries.to_date AND salaries.salary > 100000 );
As you can already see the task is splitted into two smaller tasks: The outer select searches for all the employees with its department number, while the subselect searches for all the departments with a manager earning more than 100.000,–. This is what I meant with the easier understanding of subselects at the beginning of this blog entry. But I repeat myself, so let the real numbers talk.
In this task the performance difference is even bigger: In average this statement took 3201,4 ms, which means it took 200 times longer!
Conclusion
Summarized I have to say that the Join-Method really seems to be much faster, at least in my tests. I also think that this is valid for most cases, since I have the opinion that the tasks are showing real world examples.
Nevertheless I will continue to use subselects also, as they are much easier to read and understand. If there are some real performance problems, refactoring can be done afterwards, so the rule saying that to early optimization is bad can still be followed.