Ranking in SQL
The Ordering
Lets get a table with some data. Imagine a table of revenue💵 generated by each sales representative👔. We want to find the top performers🌟
| Id | Team | Name | Revenue |
|---|---|---|---|
| 101 | Angel's Wings | Abraham | 6005 |
| 102 | Boosters | John | 5611 |
| 103 | Devil's Cry | Helena | 8461 |
| 104 | Redemption | Kate | 8180 |
| 105 | Angel's Wings | Jacob | 9327 |
| 106 | Devil's Cry | Mary | 8461 |
| 107 | Boosters | Menalaus | 8461 |
| 108 | Redemption | Adam | 8461 |
| 109 | Redemption | Noah | 7483 |
| 110 | Devil's Cry | Jenny | 6000 |
SELECT
Id,
Name,
Revenue
FROM
salesData_tbl
ORDER BY
Revenue DESC;
The default ordering in SQL is ascending (keyword:
ASC). To get descending we used theDESCProfessionally, its better to be explicit and mention
ASC(even if its default) to avoid confusion while understanding the code
To, us the result looks something like this:
| Id | Team | Name | Revenue |
|---|---|---|---|
| 105 | Angel's Wings | Jacob | 9327 |
| 103 | Devil's Cry | Helena | 8461 |
| 106 | Devil's Cry | Mary | 8461 |
| 107 | Boosters | Menalaus | 8461 |
| 108 | Redemption | Adam | 8461 |
| 104 | Redemption | Kate | 8180 |
| 109 | Redemption | Noah | 7483 |
| 101 | Angel's Wings | Abraham | 6005 |
| 110 | Devil's Cry | Jenny | 6000 |
| 102 | Boosters | John | 5611 |
The ranking
What we did above is sufficient for cases like getting the top performer. In many cases people end up doing ranking. While it has its use cases, it is not necessary right now
Then, when is it necessary🤔?
We have 3 different window functions in SQL for ranking. They are:
- Row number
- Rank
- Dense Rank
Most practically, we use Row number in Data engineering to remove duplicates, so you might have heard and used it before😃. But today, we shall explore where to use all of them
But before that, lets actually look at what they do to avoid confusion and a side-by-side comparison. So, lets run the query:
SELECT
Id,
Name,
Revenue,
ROW_NUMBER() OVER (ORDER BY Revenue Desc) as RowNum,
RANK() OVER (ORDER BY Revenue Desc) as Rnk,
DENSE_RANK() OVER (ORDER BY Revenue Desc) as DRnk
FROM
salesData_tbl;
Notice how we didn't specify the
ORDER BY Revenuein the end because its part of window function syntax
| Id | Team | Name | Revenue | RowNum | Rnk | Drnk |
|---|---|---|---|---|---|---|
| 105 | Angel's Wings | Jacob | 9327 | 1 | 1 | 1 |
| 103 | Devil's Cry | Helena | 8461 | 2 | 2 | 2 |
| 106 | Devil's Cry | Mary | 8461 | 3 | 2 | 2 |
| 107 | Boosters | Menalaus | 8461 | 4 | 2 | 2 |
| 108 | Redemption | Adam | 8461 | 5 | 2 | 2 |
| 104 | Redemption | Kate | 8180 | 6 | 6 | 3 |
| 109 | Redemption | Noah | 7483 | 7 | 7 | 4 |
| 101 | Angel's Wings | Abraham | 6005 | 8 | 8 | 5 |
| 110 | Devil's Cry | Jenny | 6000 | 9 | 9 | 6 |
| 102 | Boosters | John | 5611 | 10 | 10 | 7 |
The real deal is when there is a tie. Obviously, there is a tie between Helena, Mary, Menalaus and Adam
Row Number
Row number ensures that it does not contain a duplicate value. So, when there is a tie among them, it puts the tied people in random order
If we want a specific order for tied people we can do that as
ROW_NUMBER() OVER (ORDER BY Revenue DESC, Id ASC)
Imagine a scenario where I held this as an examination/interview and these are their scores. I want to select 4 people for my company🏭. Row number does the job perfect for me
Rank
Rank is used to find how many people are better than you. Suppose, Kate wants to login and see how many people out-performed her. Since, her rank is 6 it means 5 people out-performed her. Similarly for Menalaus at rank 2, 1 person out performed him
Dense Rank
Dense rank does not leave gaps in between because someone tied above you. It is not as ruthless as Rank
Suppose I am giving bonus to TOP 3 performers. Using row number means injustice to Menalaus, Adam and Kate. Using rank means injustice to Kate. Enter Dense rank and it fixes everything
Catch the duplicates
So, now you know how and where to use ranking in SQL
More commonly we see row number in our code. It's most commonly used to remove duplicates in Data Engineering. Imagine you are tracking the log-ins and log-outs of employees. You want to see if there are any duplicates in the table (it shouldn't. Otherwise you are wasting your storage for nothing). To keep it simple, lets say there is only 1 employee
| Id | Action | Time |
|---|---|---|
| 101 | In | 10:01 |
| 101 | Out | 12:07 |
| 101 | In | 12:49 |
| 101 | Out | 17:57 |
| 101 | In | 18:09 |
| 101 | Out | 18:32 |
| 101 | Out | 18:32 |
| 101 | Out | 18:32 |
Now, its acceptable for a person to log out and log back in for some left over task. But the last 3 entries are true duplicates. To automatically find such entries we can run the query
SELECT
*,
ROW_NUMBER() OVER (ORDER BY Time Desc) as RowNum
FROM
logs_tbl
WHERE
Action = 'Out';
Its a bit unfair we knew before hand that the duplicates are in
Action = 'Out'. But bear with me ...
| Id | Action | Time | RowNum |
|---|---|---|---|
| 101 | Out | 18:32 | 1 |
| 101 | Out | 18:32 | 2 |
| 101 | Out | 18:32 | 3 |
| 101 | Out | 17:57 | 4 |
| 101 | Out | 12:07 | 5 |
And for the magic🪄, we add PARTITION BY. Partition... well makes partitions based on a criteria. Here we want to create a partition/compartment of particular times
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY Time
ORDER BY Time Desc
) as RowNum
FROM
logs_tbl
WHERE
Action = 'Out';
| Id | Action | Time | RowNum |
|---|---|---|---|
| 101 | Out | 18:32 | 1 |
| 101 | Out | 18:32 | 2 |
| 101 | Out | 18:32 | 3 |
| 101 | Out | 17:57 | 1 |
| 101 | Out | 12:07 | 1 |
Hence, if we find any row number > 1, it means that the table has duplicates. As for the reason why I chose Action = 'Out' (even though in this case it makes no difference), is to ensure I don't catch cases where the person logs in and logs out within the minute. In that case, the system would have noted down the same time for both log in and log out, and this would have shown up as duplicate
The Partitioning
Since, you already learnt partitioning above, it does not require another introduction. Remember the very first table, with all the crazy team names. Imagine, you want the best performer from each team to be made the captain🤠 or something... In this case we can use partition on the team name
SELECT
Id,
Name,
Revenue,
ROW_NUMBER() OVER (
PARTITION BY Team
ORDER BY Revenue Desc
) as RowNum
FROM
salesData_tbl;
| Id | Team | Name | Revenue | RowNum |
|---|---|---|---|---|
| 105 | Angel's Wings | Jacob | 9327 | 1 |
| 101 | Angel's Wings | Abraham | 6005 | 2 |
| 103 | Devil's Cry | Helena | 8461 | 1 |
| 106 | Devil's Cry | Mary | 8461 | 2 |
| 110 | Devil's Cry | Jenny | 6000 | 3 |
| 107 | Boosters | Menalaus | 8461 | 1 |
| 102 | Boosters | John | 5611 | 2 |
| 108 | Redemption | Adam | 8461 | 1 |
| 104 | Redemption | Kate | 8180 | 2 |
| 109 | Redemption | Noah | 7483 | 3 |
So, in Devil's Cry, even though Helena and Mary were tied, Helena got the captaincy🤠