RSS

Joins vs Subselects

In general you have to possibilities to build extended queries in SQL: Joins and subselects. In most cases both of them are applicable, but which one should be chosen then? Sometimes I prefer the subselect, because it is, at least in my opinion, easier to read, although many sources say that a join is often faster. Of course this is only my personal preference, but a completely objective fact to consider is performance.

The Tests

So I decided to do some performance tests, to be more specific I solved some tasks using both of these methods, and compared them. These tests were run on a Lenovo T420s Notebook running ArchLinux and MySQL 5.5.27, using MySQL Workbench. Each test was repeated 10 times, so that I could calculate an average time, which hopefully reduces unavoidable side effects.

As example database I used the already existing Employee-Database provided by the mysql team. It already contains many records, so it was excellent for my purpose. I assume that you may also be interested in the structure of this database.

Test 1

The first task is to return the number of employees earning more than 150.000,– at the moment.

I begun with the join version:

SELECT COUNT(*)
FROM employees
INNER JOIN salaries
    ON employees.emp_no = salaries.emp_no
WHERE salaries.salary > 150000 
    AND NOW() BETWEEN from_date AND to_date;

As you can see I join the employees with its corresponding salaries, choose only the salaries over 150.000,– and limit the result to the actual salaries. Due to the big size of the database this operation already took 931,2 ms.

After this first test run I tried the following statement, using a subselect:

SELECT COUNT(*)
FROM employees
WHERE EXISTS(
    SELECT 'ok'
    FROM salaries
    WHERE salaries.emp_no = employees.emp_no 
        AND NOW() BETWEEN from_date AND to_date
        AND salaries.salary > 150000
    );

This time I select all the employees, and restrict the result set with the exists-Keyword, using a subquery, which returns something, if the employee from the outer select has a salaray over 150.000,–.

And also this operation took quite long: 1376,4 ms in average. That means the subselect version takes nearly half time longer than the join version (47,8%).

Test 2

Of course I have also tried a second task, just to double check my results. This time I wanted to know how many employees are working for departments with a manager having a salary over 100.000,–.

Again the join version first:

SELECT COUNT(*)
FROM employees
INNER JOIN dept_emp
    ON employees.emp_no = dept_emp.emp_no
INNER JOIN departments
    ON departments.dept_no = dept_emp.dept_no
INNER JOIN dept_manager
    ON departments.dept_no = dept_manager.dept_no
INNER JOIN employees managers
    ON managers.emp_no = dept_manager.emp_no
INNER JOIN salaries
    ON managers.emp_no = salaries.emp_no
WHERE NOW() BETWEEN dept_emp.from_date AND dept_emp.to_date
    AND NOW() BETWEEN dept_manager.from_date AND dept_manager.to_date
    AND NOW() BETWEEN salaries.from_date AND salaries.to_date
    AND salaries.salary > 100000;

It’s the same technique used in the first task: I joined all the required tables, and specified the desired rows in the where-clause. This statement took the RDBMS in average 147,4 ms to complete.

Now I will have a look at the subselect version again:

SELECT COUNT(*)
FROM employees
INNER JOIN dept_emp
    ON employees.emp_no = dept_emp.emp_no
WHERE NOW() BETWEEN dept_emp.from_date AND dept_emp.to_date
    AND dept_emp.dept_no IN (
        SELECT dept_manager.dept_no
        FROM dept_manager
        INNER JOIN employees
            ON employees.emp_no = dept_manager.emp_no
        INNER JOIN salaries
            ON employees.emp_no = salaries.emp_no
        WHERE NOW() BETWEEN dept_manager.from_date AND dept_manager.to_date
            AND NOW() BETWEEN salaries.from_date AND salaries.to_date
            AND salaries.salary > 100000
        );

As you can already see the task is splitted into two smaller tasks: The outer select searches for all the employees with its department number, while the subselect searches for all the departments with a manager earning more than 100.000,–. This is what I meant with the easier understanding of subselects at the beginning of this blog entry. But I repeat myself, so let the real numbers talk.

In this task the performance difference is even bigger: In average this statement took 3201,4 ms, which means it took 200 times longer!

Conclusion

Summarized I have to say that the Join-Method really seems to be much faster, at least in my tests. I also think that this is valid for most cases, since I have the opinion that the tasks are showing real world examples.

Nevertheless I will continue to use subselects also, as they are much easier to read and understand. If there are some real performance problems, refactoring can be done afterwards, so the rule saying that to early optimization is bad can still be followed.

 
Leave a comment

Posted by on August 11, 2012 in MySQL

 

Tags: , ,

Hello world!

Two or three months ago I have read a very interesting article on Smashing Magazine, called “Publish What You Learn”. I really liked the idea of running a blog, although I know that it is a very time-consuming activity. In particular it is even a bigger problem if you are doing a full-time study accompanied by a part-time job.

But especially because of my current internship, in which I am learning many new things, as I am using a technology which is completely new to me, has lead me to the decision, that it’s the best time to start this intent. On the other hand side it is quite difficult to find time during an internship, and maybe even harder to arouse passion, but of course i am willed to write one or two posts a month.

So wish me luck, and hope that I will have an interesting topic quite soon…

 
Leave a comment

Posted by on July 12, 2012 in Uncategorized

 
 
Follow

Get every new post delivered to your Inbox.