Window Functions in SQL

Prabhat Rawat
8 min readFeb 20, 2023

SQL is a programming language used to communicate with and manipulate databases. SQL helps control information stored in databases, allowing users to retrieve the specific data they’re looking for when they need it. In the Below article we are going to cover one of the most important concept for SQL i.e. Window functions.

Window functions are powerful functions that performs calculations along the set rows of the query result. Sometimes they work quite similar like Aggregate functions but in case of aggregates generally we lose the individual rows whereas in Window functions we are able to prevent this. I’ll be able to explain this in a better way in examples that I’ll use below.

Let’s began….

Example : For this demonstration i’m using a sample Video game Sales Table, named “VG_Sales” shown below :

Sample Data

I’ll refer only this table for all the examples below in this article.

OVER()

Over is a type of window function that works exactly like GROUP BY on a set of rows but in this aggregated result goes along with every row of the window.

Syntax : OVER(PARTITION BY col1, col2….. ORDER BY col1,col2….)

Here, col1, col2…. are column names on which you need to create the window for calculation.

CASE : Let’s suppose we want to put the video games into category of “premium” or “economy” on basis of Global_Sales for each platform available in the list. So, for every Platform if the Global Sales is greater than the average sales then “premium” and if less than average sales then it’s “economy”.

We can easily achieve this by applying a window on brand name for average price calculation in every set of rows.

/* game sales comparision with average sales,

Platform wise evaluation of video game Segment
IF Global_Sales of product(on every platform) GREATER THEN AVERAGE -> "Premium"
IF Global_Sales of product(on every platform) LESS THEN AVERAGE -> "Economy"
*/

SELECT [RANK] as ID, [Name], [Platform], [Year],
Genre, Publisher, Global_Sales,
AVG(Global_Sales) OVER(PARTITION BY [PLATFORM]) as Average_Sales,
CASE
WHEN Global_Sales > AVG(Global_Sales) OVER(PARTITION BY [PLATFORM]) THEN 'Premium'
WHEN Global_Sales <= AVG(Global_Sales) OVER(PARTITION BY [PLATFORM]) THEN 'Economy'
END as Segment
FROM [VG_Sales]
Output Case 1

NOTE : All the other window functions will work along with the over clause

ROW_NUMBER()

Row_Number is our first ranking based window function which can used to rank the rows on basis of order and partitions.

Syntax : ROW_NUMBER() OVER(PARTITION BY cols….. ORDER BY cols….)

For this case passing Order by clause is mandatory because logically we can’t rank any data point without an order.

Let’s try to rank our data for on basis of highest and lowest price of product

/* Rank the Games on basis of there Global Sales */

SELECT [RANK] as ID, [Name], [Platform], [Year],
Genre, Publisher, Global_Sales,
ROW_NUMBER() OVER(ORDER BY Global_Sales) as Rank_by_Sales
FROM [VG_Sales]

CASE : Suppose we want to identify that for every year which Genre and it’s respective video game performing with maximum global sales.

/*
Ranking Games on basis of Global Sales per year
*/

SELECT [RANK] as ID, [Name], [Platform], [Year],
Genre, Publisher, Global_Sales,
ROW_NUMBER() OVER(PARTITION BY [Year] ORDER BY Global_Sales) as Rank_by_Sales
FROM [VG_Sales] ORDER BY [Year]
Output Case 2

In the above image we can see that in 1980 Shooter and in 1981 Platform & Action are most popular gaming genres.

Row_Number is quite useful, but if the window segment have same set of values then still it will rank them differently as shown below :

Row_Number (Different rank for same Sales amount)

Logically these all sales values are same so this is not upto the mark and sometimes can be an issue(depends on the use case), for handle this we have 2 other rank based window functions i.e. RANK and DENSE_RANK. Let’s understand how both of them works.

RANK()

Rank is our second ranking based window function.

Syntax : RANK() OVER(PARTITION BY cols….. ORDER BY cols….)

For this case passing Order by clause is mandatory because logically we can’t rank any data point without an order.

This works quite similar to Row_Number() but if the window segment have same set of values then it will give them same rank but will skip the rank for next non-similar data points as shown below :

RANK working difference with ROW_NUMBER

Since on the partition of GLOBAL_SALES top 168 values are same hence ranked as ‘1’ but for next non-same value it skips the number of ranks and put ‘169’ as new rank value.

That’s great, but still this have some issue of skipping the rank for same set of values

DENSE_RANK()

Dense_Rank() is our third ranking based window function, which can handle both the issue that we just saw in the previous 2 functions :

Syntax : DENSE_RANK() OVER(PARTITION BY cols….. ORDER BY cols….)

For this case passing Order by clause is mandatory because logically we can’t rank any data point without an order.

Here below we can see that how DENSE_RANK() function changes the result as compare to above 2 ranking window functions. Personally, this one comes more frequently in use in my tasks.

Working of DENSE_RANK()

I hope by these 3 examples I’m able to explain these ranking window functions. These 3 functions plays very important role in daily activities of data engineers and data analysts.

Now, let’s move to see some more functions.

NTILE()

NTILE is a SQL function which can used to divide the data roughly in equal segments or buckets by providing each segment an integer segment id. One can easily create ‘n’ number of segments in there data based on ordering.

Syntax : NTILE(No. of Segments) OVER(PARTITION BY cols.. ORDER BY cols…)

For this case passing Number of Segments and Order by clause is mandatory.

Suppose we want to partition our data in 5 segments (Low to High) based on the Global sales.

That’s how NTILE() works.

If we want we can tune this scenario a bit, this time let’s make 7 roughly equal sets for every year in the data.

SELECT 
Rank as ID, Name, Platform, [Year], Genre, Publisher, Global_Sales,
NTILE(7) OVER(PARTITION BY [YEAR] ORDER BY Global_Sales) as Segments
FROM VG_SALES

This is a simple and quite useful function and I hope this above example gives you a good idea of working of NTILE function.

LAG() & LEAD()

LAG() & LEAD() are analytical window functions from which we can refer the data from rows above or below the current row. Both these functions goes along with each other. Also we don’t require both of them, we can achieve both the functionalities with either only LAG or LEAD.

Syntax : LAG(column, value, default) OVER(PARTITION BY cols.. ORDER BY cols…)

column : the column which needs to be lagged

value : the value of lag we want. eg : lag of 1 or 2 etc..

default : while moving values above and below, we can set a default value for the cell for which there is no above/below present.

For this case passing Order by clause is mandatory.

To explain them better I will directly jump into example shown below :

CASE : Let’s Suppose we want to analyze profit percentages for video games from the sales of current year to last year sales.

Step 1 : First we need to group by the global sales year-wise as shown below


SELECT [Year], sum(Global_Sales) as Total_Global_Sales
FROM [dbo].[VG_Sales] GROUP BY [Year]

Step 2 : Next we need to use LAG function to put the respective last year sales along with Current Year sales.

--With No Default Value

SELECT *, LAG(Total_Global_Sales,1) OVER(ORDER BY [YEAR])
as Prev_Year_Sales FROM
(
SELECT [Year], sum(Global_Sales) as Total_Global_Sales
FROM [dbo].[VG_Sales] GROUP BY [Year]
) as temp_1


--With Default Value as '0'

SELECT *, LAG(Total_Global_Sales,1,0) OVER(ORDER BY [YEAR])
as Prev_Year_Sales FROM
(
SELECT [Year], sum(Global_Sales) as Total_Global_Sales
FROM [dbo].[VG_Sales] GROUP BY [Year]
) as temp_1

NOTE : We can pass a default value for first row(optional) i.e. for the row with no previous value.

Step 3 : Now we have current year and respective previous year sales values together, hence we only need to calculate profit percentage as shown below


SELECT *, round(((Total_Global_Sales - Prev_Year_Sales)/Prev_Year_Sales)*100,2) as [Profit %age]
FROM
(
SELECT *, LAG(Total_Global_Sales,1) OVER(ORDER BY [YEAR])
as Prev_Year_Sales FROM
(
SELECT [Year], sum(Global_Sales) as Total_Global_Sales
FROM [dbo].[VG_Sales] GROUP BY [Year]
) as temp_1
) as temp_2 ORDER BY [Year]
Sales Comparison from last year

Working with LEAD()

Lead works quite similar to LAG, only difference is LAG() brings the previous rows values together and LEAD brings the next rows values together.

Let’s replace the LAG function with LEAD in our previous example and see the result :

SELECT *, round(((Total_Global_Sales - Prev_Year_Sales)/Prev_Year_Sales)*100,2) as [Profit %age]
FROM
(
SELECT *, LEAD(Total_Global_Sales,1) OVER(ORDER BY [YEAR])
as Prev_Year_Sales FROM
(
SELECT [Year], sum(Global_Sales) as Total_Global_Sales
FROM [dbo].[VG_Sales] GROUP BY [Year]
) as temp_1
) as temp_2 ORDER BY [Year]
Comparing LEAD() vs LAG()

As you can see above in case of LAG it’s comparing sales of current year and last year whereas in case of LEAD it’s comparing sales of current year and next year.

I hope this example works well in explaining the working of Lead & Lag 😎

By this blog post I feel we have covered many important window functions that actually helps me alot in my day-to-day activities in data engineering & data analysis. I hope you enjoyed the scenarios and this post gives you a well defined understanding of how window functions works.

In future posts I’ll came up with multiple different scenarios regarding SQL, ETL, Data Engineering and Data Analysis, So stay connected and for any further queries and Suggestions you can reach me out via mail at : ragvenderrawat@gmail.com

--

--

Prabhat Rawat

Azure Data Engineer|| ETL Developer|| Deep Learning Enthusiastic|| Machine Learning Enthusiastic|| Python Programmer