employees with the highest salary in the department
Anyone who has ever been to a job interview knows that they ask all sorts of strange questions. Often you can come across a task to find employees with the highest salary in a department. And your potential boss thinks that this task has only one “correct solution”, the one he read about on the Internet. Is that true?
Formulation of the problem
Of course, your potential boss didn't come up with either this problem or its solution. He looked up both the problem and the “correct” answer on the Internet to demonstrate his mental superiority. Like, he would have solved this problem “correctly” in seconds, and the potential employee should do the same.
The problem is formulated as follows. There is a database containing employees with their division by department and salary. I tried to google the root of evil. The links led me to the article: https://www.jitbit.com/news/181-jitbits-sql-interview-questions/
These are the recommendations for hiring employees at JitBit, which describe examples of questions, as well as the correct behavior of the person conducting the interview. I would like to separately note the quote.
And there's more than one correct solution to each of these questions.
In other words, there is more than one correct solution here. There are several tasks, but this article is only about one of them. The task is formulated as follows. There is a database containing information about employees, salaries and their division by department.
CREATE TABLE departments (
department_id integer PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE employees (
employee_id integer PRIMARY KEY,
department_id integer NOT NULL REFERENCES departments(department_id),
name text NOT NULL,
salary money NOT NULL
);
It is necessary to obtain a list of employees receiving the maximum salary in the department.
Select employees who have the highest salaries in their departments
Sometimes they add that the name of the department also needs to be obtained, but there is nothing complicated or unusual about this, for the sake of simplicity of presentation, I will solve without this.
Solution options
I suppose there could be many. The first thing that comes to mind is to solve it the way I solved it all my youth, when there were no window functions in PostgreSQL yet.
SELECT name AS employee, salary
FROM (
SELECT department_id, max(salary) AS salary
FROM employees
GROUP BY department_id
) AS m
JOIN employees AS e USING (department_id, salary);
A completely workable option, but the potential boss wants more, he assured that cool specialists will immediately write a cool “correct” answer from the Internet. He begins to suggest that you need to use window functions. And the second option that immediately comes to mind in the allotted seconds is to do the same thing, but through window functions.
SELECT name AS employee, salary
FROM (
SELECT name, salary, max(salary) OVER (PARTITION BY department_id) AS ms
FROM employees
) AS e
WHERE salary=ms;
But that's not it either. Where did your potential boss get the idea that there is only one “correct” cool answer? And it's not the one the potential employee suggests? For example, from here.
https://stackoverflow.com/questions/16799640/employees-with-largest-salary-in-department
Apparently, this interview task has become popular not only in our country. So popular that it is discussed on Stack Overflow. And there, an authoritative cool guru (whom 6 people voted for) claims that the coolest way to solve this task is to use the rank() function. This is exactly what is expected of you at the interview:
SELECT name AS employee, salary
FROM (
SELECT name, salary, rank() OVER (PARTITION BY department_id ORDER BY salary DESC)
AS rnk FROM employees e
) AS e
WHERE rnk=1;
If you don't offer this solution, you won't get the job.
Well, if you think about it? Rank() is a rather heavy function in terms of computing power consumption, created for several other tasks. Will this solution really be the best?
Let's imagine that you work in a large company: 2000 employees divided into 20 departments. You have a real, not educational, task: to get a list of employees with the highest salaries by department. And it is very important for you to solve it so that it is carried out as efficiently as possible. Why? Well, for example, to show your superiors that you are the best, you are more efficient than ever, you do everything in the best possible way and deserve a raise.
Testing environment
I have provided the DDL for creating tables with employees above, you need to add a table to store the measurement results.
CREATE TABLE results (
method text NOT NULL, -- название метода
start timestamptz NOT NULL, -- время начала теста
finish timestamptz NOT NULL, -- время завершения
execution_time real NOT NULL -- время выполнения по данным EXPLAIN ANALYZE
);
Fill in the information about the employees.
INSERT INTO departments (department_id, name)
SELECT generate_series(0,19), random()::text;
INSERT INTO employees (employee_id, department_id, salary, name)
SELECT id AS employee_id, id/100 AS department_id,
rnd::numeric::money AS salary, rnd::text AS name
FROM (SELECT gs, random()*300000 FROM generate_series(0,1999) AS gs) AS t(id,rnd);
And conduct testing.
DO $do$
DECLARE
start_ timestamptz;
finish_ timestamptz;
execution_time_ real;
explain text; -- вывод EXPLAIN
methods CONSTANT text[] := ARRAY['max', 'window', 'rank'];
method_query CONSTANT text[] := ARRAY[
'EXPLAIN ANALYZE SELECT name AS employee, salary FROM (SELECT department_id, max(salary) AS salary FROM employees GROUP BY department_id) AS m JOIN employees AS e USING (department_id, salary)',
'EXPLAIN ANALYZE SELECT name AS employee, salary FROM (SELECT name, salary, max(salary) OVER (PARTITION BY department_id) AS ms FROM employees) AS e WHERE salary=ms',
'EXPLAIN ANALYZE SELECT name AS employee, salary FROM (SELECT name, salary, rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk FROM employees e) AS e WHERE rnk=1'
];
BEGIN
PERFORM * FROM employees; -- warm cache
FOR j in 1..1000 LOOP FOR m IN 1..array_length(methods, 1) LOOP
start_ := clock_timestamp();
-- будет работать только в английской локализации lc_messages
FOR explain IN EXECUTE method_query[m] LOOP
IF starts_with(explain, 'Execution Time:')
THEN execution_time_ := substring(explain FROM 'Execution Time: ([.0-9]*) ms')::real; END IF;
END LOOP;
finish_ := clock_timestamp();
INSERT INTO results (method, start, finish, execution_time)
VALUES (methods[m], start_, finish_, execution_time_);
END LOOP;
END LOOP;
END; $do$;
In order to achieve maximum accuracy in this matter, I make 1000 series of measurements.
Test results
Of course, if we need to bother with performance, then we will have to talk about creating an index and indexing the query. But, out of scientific curiosity, let's do the first experiment without indexes.
test=# select method, avg(execution_time) as execution_time
from results group by method;
method | execution_time
--------+-------------------
max | 98.34957398986816
rank | 51.50633999633789
window | 46.65021300888061
(3 строки)
And we see that even here the “correct” and “cool” solution to the problem that the potential boss knows is not the best.
But since we consider ourselves highly qualified specialists, we will think about an index. It seems that this index would be a good option:
CREATE INDEX ds1 ON employees (department_id, salary);
First, go to the department along the tree, and then, using the fact that btree has a bidirectional structure, read the maximum salary from the “right end”. But will PostgreSQL guess to do this? Test results.
method | execution_time
--------+---------------------
max | 0.24314399652183055
rank | 0.8777229990959168
window | 0.8027530006766319
Hmm, the “correct and cool” solution gives the worst result in terms of time. Is our potential boss really that stupid that he will only hire those candidates who offer the worst solution of all possible?
But since we are not bosses, but technical specialists, we prove our rightness not with “confidence in our voices”, but with technical details. Let's figure out what's going on there.
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary
FROM (
SELECT department_id, max(salary) AS salary
FROM employees
GROUP BY department_id
) AS m
JOIN employees AS e USING (department_id, salary);
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Nested Loop
-> GroupAggregate
Group Key: employees.department_id
-> Index Only Scan using ds1 on employees
-> Index Scan using ds1 on employees e
Index Cond: ((department_id = employees.department_id) AND (salary = (max(employees.salary))))
(6 строк)
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary
FROM (
SELECT name, salary, max(salary) OVER (PARTITION BY department_id) AS ms
FROM employees
) AS e
WHERE salary=ms;
QUERY PLAN
-----------------------------------------------
Subquery Scan on e
Filter: (e.salary = e.ms)
-> WindowAgg
-> Index Scan using ds1 on employees
(4 строки)
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary
FROM (
SELECT name, salary,
rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
FROM employees e
) AS e
WHERE rnk=1;
QUERY PLAN
------------------------------------------------------------
Subquery Scan on e
Filter: (e.rnk = 1)
-> WindowAgg
Run Condition: (rank() OVER (?) <= 1)
-> Incremental Sort
Sort Key: e_1.department_id, e_1.salary DESC
Presorted Key: e_1.department_id
-> Index Scan using ds1 on employees e_1
(8 строк)
It is clear that the first two methods have perfectly figured out how to use this index, but the last one, which uses rank(), for some reason did not cope and uses it only halfway. No problem, to help him you can fix the index.
DROP INDEX ds1;
CREATE INDEX ds2 on employees (department_id, salary DESC);
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary FROM (SELECT department_id, max(salary) AS salary FROM employees GROUP BY department_id) AS m JOIN employees AS e USING (department_id, salary);
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Nested Loop
-> GroupAggregate
Group Key: employees.department_id
-> Index Only Scan using ds2 on employees
-> Index Scan using ds2 on employees e
Index Cond: ((department_id = employees.department_id) AND (salary = (max(employees.salary))))
(6 строк)
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary FROM (SELECT name, salary, max(salary) OVER (PARTITION BY department_id) AS ms FROM employees) AS e WHERE salary=ms;
QUERY PLAN
-----------------------------------------------
Subquery Scan on e
Filter: (e.salary = e.ms)
-> WindowAgg
-> Index Scan using ds2 on employees
(4 строки)
test=# EXPLAIN (COSTS false) SELECT name AS employee, salary FROM (SELECT name, salary, rank() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk FROM employees e) AS e WHERE rnk=1;
QUERY PLAN
---------------------------------------------------
Subquery Scan on e
Filter: (e.rnk = 1)
-> WindowAgg
Run Condition: (rank() OVER (?) <= 1)
-> Index Scan using ds2 on employees e_1
(5 строк)
The first two methods work the same as before. And the rank method began to use the index to its full potential. The results were not long in coming.
method | execution_time
--------+---------------------
max | 0.24539799855649472
rank | 0.32680700132250784
window | 0.8033939964771271
Even in the most favorable case for oneself, the “cool and only correct” method by which a potential boss selects is one and a half times slower than the method that first comes to mind.
conclusions
I think the rank() method is the worst solution to this problem, because it never managed to take first place for any index considered. In other words, there was no situation where its use was justified. It turns out that the whole point of a job interview is to select the employee who offers the worst solution to the problem. The rest will not be hired.
This can be understood and easily generalized, the problem is actually fundamental. How does it usually happen? There is a full stack employee who turns into a team lead. PostgreSQL maintenance initially falls on him, but he cannot cope, and a decision is made to hire a highly specialized PostgreSQL specialist to solve the problems that the team lead himself cannot solve. But during the interview, if the potential employee offers better solutions to technical problems than the potential boss knows about, then he will think that “the answers are wrong” and will not hire him. They will only hire someone who offers the same bad solutions to technical problems that the potential boss already knows about.
For example, you shouldn't offer a failover cluster on Pacemaker during an interview, which is much better than Patroni.
https://habr.com/ru/companies/domclick/articles/516538/
Because due to his limitations, the potential boss knows only one single “correct and cool” solution on Patroni, he read about it on the Internet. The same applies to other issues. Any good solution that goes beyond the limitations of the potential boss will be regarded as a mistake or a demonstration of incompetence.