Time series de-seasonality with BigQuery

As a frequent R user in school, I was used to applying the time series packages directly for decomposition when it came to de-seasonality. I've done several time series analyses with BigQuery in my current job, and found that the step-by-step "recipe" actually helped me understand the mechanism at a deeper level. I'd like to share some experience on how to manually perform time series decomposition with the query language, for business analysis purposes[1].

Before coding, you first need to understand what seasonality is and why we need to remove it for certain analyses. Let's consider a simple case:

  • An ice cream vendor increased the price per ice cream by 3% in May
  • Revenue from a 3-month window Feb – Apr before the price change was 1000 €
  • Revenue from a 3-month window Jun – Aug after the price change was 3000 €
  • Can he conclude that the 3% price change tripled revenue?

The answer is no, because ice cream sales are seasonally higher in the summer months than in the rest of the year, and those "extra sales" that repeat every summer are what we call "seasonality". To evaluate how much the 3% price change contributed to the revenue growth, you need to remove this "foam" (seasonality) from the time series to obtain a relatively smooth curve, and then compare the 3-month revenue before and after.

Secondly, you need to understand a simple formula for a time series, from an econometric perspective.

  • A time series Yt, which can be sales, temperature, population, etc. over a certain period with daily/weekly/monthly/quarterly/yearly frequency, is composed of 3 elements:
    • Trend Tt
    • Seasonality St
    • Noises / Residual / Remainder component Rt
  • Yt = Tt + St + Rt
  • The purpose of de-seasonality is to isolate and remove the seasonal component St from the time series Yt
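The additive model above can be illustrated with a minimal numeric sketch (in Python rather than SQL, with purely hypothetical numbers):

```python
# Toy additive time series: Y_t = T_t + S_t + R_t
trend = [100, 102, 104, 106, 108, 110]    # T_t: slowly rising level
seasonality = [-5, 0, 5, -5, 0, 5]        # S_t: repeats every 3 periods
residual = [1, -2, 0, 2, -1, 0]           # R_t: irregular noise

# The observed series is the element-wise sum of the 3 components
y = [t + s + r for t, s, r in zip(trend, seasonality, residual)]
print(y)  # [96, 100, 109, 103, 107, 115]

# De-seasonality = subtracting S_t, leaving T_t + R_t
y_deseasonalized = [yt - st for yt, st in zip(y, seasonality)]
print(y_deseasonalized)  # [101, 100, 104, 108, 107, 110]
```

Note how the de-seasonalized series tracks the trend much more closely than the observed one.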

The example of time series de-seasonality uses a BigQuery public dataset. To access the project, this guidance will help you set it up. The classic decomposition method will be applied, and the output will be presented and visualized in a connected Google Sheet.

STEP 1. DATA SHORTLIST

The monthly time series new_residential_construction is used as the example. The table is in the dataset census_bureau_construction, in the project bigquery-public-data. The earliest period dates back to 1959-01-15. To simplify the analysis, we shortlist the dimensions and the date range by:

  • Choosing the period after year 2000, period_date > "2000-01-01"
  • Reporting at “Total Units” level, datatype_id = 1
  • Using the unadjusted data, is_adjusted = 0
  • Selecting the category as “Housing Units Under Construction”, category_id = 6
  • Constraining the geographic dimension to "United States" (ignoring the regional dimensions, e.g. South, West, Northeast) to see the full picture, geo_id = 1

The query result is the time series that we will use for decomposition.

SELECT 
period_date,
category_desc,
datatype_desc,
geo_desc,
is_adjusted,
value
FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 
WHERE period_date > "2000-01-01"
AND datatype_id = 1
AND is_adjusted = 0
AND category_id = 6
AND geo_id = 1
period_date | category_desc | datatype_desc | geo_desc | is_adjusted | value
2000-01-15 | Housing Units Under Construction | Total Units | United States | 0 | 946.6
2000-02-15 | Housing Units Under Construction | Total Units | United States | 0 | 948.4
2000-03-15 | Housing Units Under Construction | Total Units | United States | 0 | 952.5
2000-04-15 | Housing Units Under Construction | Total Units | United States | 0 | 979.5
2000-05-15 | Housing Units Under Construction | Total Units | United States | 0 | 994.6
Query result: customized time series

STEP 2. TIME SERIES SETTING AND MOVING AVERAGE

Before finding the seasonal component St, 3 columns are added to assist the calculation:

  • t: a running number as an alias for the period, from the earliest to the most recent
  • t_month: the month number extracted from the corresponding date. We need the month number because the seasonal component repeats in the same month every year
  • value_12_ma: the moving average of the observations (column: value) over a 12-month window. In this case, the 12-month window is the 11 preceding months (11 rows before) plus the current month (the current row).

Note that a moving average with a window of m observations is usually called an m-MA, or a moving average of order m. If m is an even number (for example, monthly data with an MA of order 12, or quarterly data with an MA of order 4), the trend component is based on a further moving average of the moving average (introduced later), which makes the even-order moving average symmetric[2]. If m is an odd number, for example weekly data with an MA of order 7, the 7-MA can directly be treated as the trend component.
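The odd-order case can be sketched in Python (toy numbers, not the construction data): with an odd m, the window centers exactly on one observation, so the smoothed value can directly serve as the trend.

```python
def centered_ma(xs, m):
    # Centered m-MA for odd m: (m-1)/2 values before, the current value,
    # and (m-1)/2 values after; None where the full window does not fit.
    k = (m - 1) // 2
    return [None if i < k or i + k >= len(xs)
            else sum(xs[i - k:i + k + 1]) / m
            for i in range(len(xs))]

# Toy daily series smoothed with a 7-MA
daily = [3, 5, 4, 6, 8, 7, 9, 10, 8, 11]
print([None if v is None else round(v, 2) for v in centered_ma(daily, 7)])
# [None, None, None, 6.0, 7.0, 7.43, 8.43, None, None, None]
```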

SELECT 
ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
period_date,
SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
value,

CASE 
WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
ELSE
	AVG(value) OVER (
		ORDER BY period_date ASC 
		ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
	) 
END AS value_12_ma,

FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 
WHERE period_date > "2000-01-01"
AND datatype_id = 1
AND is_adjusted = 0
AND category_id = 6
AND geo_id = 1

Lines 7-14 of the query provide a method to calculate the moving average with a 12-month window. It is necessary to point out:

ROWS BETWEEN 11 PRECEDING AND CURRENT ROW

The expression assigns to a row the moving average of that row's value and the 11 values before it. Nevertheless, it is flexible to position the window differently. For example, if you want to assign to a row the moving average of its value, the 5 values before and the 6 values after, you can change the expression to:

ROWS BETWEEN 5 PRECEDING AND 6 FOLLOWING

However, wherever the window is positioned, we need to pay attention to the rows whose windows do not contain as many observations as designed. In our example, we implemented a condition to null out those moving averages whose window is shorter than 12 months:

WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL

This lets the first 11 values in the moving average column be NULL, because at the first 11 rows the window does not yet contain the 12 values it was designed for. For example, without this condition the moving average of the third row would be the mean of only the first 3 values, a much shorter 3-month window.
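The window mechanics can be sketched in Python. The helper below mirrors ROWS BETWEEN p PRECEDING AND f FOLLOWING, except that it returns None directly for incomplete windows, where BigQuery instead silently averages the truncated window (hence the extra CASE condition in the query above):

```python
def windowed_ma(xs, preceding, following):
    # Average over the window [i - preceding, i + following].
    # None whenever the full window does not fit inside the series,
    # mimicking the CASE ... THEN NULL cleanup from the query.
    m = preceding + following + 1
    out = []
    for i in range(len(xs)):
        lo, hi = i - preceding, i + following
        out.append(None if lo < 0 or hi >= len(xs)
                   else sum(xs[lo:hi + 1]) / m)
    return out

xs = list(range(1, 13))            # 12 toy observations: 1..12
print(windowed_ma(xs, 11, 0)[11])  # 11 PRECEDING AND CURRENT ROW -> 6.5
print(windowed_ma(xs, 5, 6)[5])    # 5 PRECEDING AND 6 FOLLOWING  -> 6.5
```

Both window positions cover the same 12 observations here; only the row that the average is assigned to changes.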

By this step, we've set up the time series for decomposition.

t | period_date | t_month | value | value_12_ma
1 | 2000-01-15 | 01 | 946.6 | null
2 | 2000-02-15 | 02 | 948.4 | null
3 | 2000-03-15 | 03 | 952.5 | null
4 | 2000-04-15 | 04 | 979.5 | null
5 | 2000-05-15 | 05 | 994.6 | null
6 | 2000-06-15 | 06 | 1006.8 | null
7 | 2000-07-15 | 07 | 1015.2 | null
8 | 2000-08-15 | 08 | 1009.9 | null
9 | 2000-09-15 | 09 | 998.9 | null
10 | 2000-10-15 | 10 | 997.3 | null
11 | 2000-11-15 | 11 | 980.2 | null
12 | 2000-12-15 | 12 | 933.8 | 980.31
13 | 2001-01-15 | 01 | 940.8 | 979.83
14 | 2001-02-15 | 02 | 939.1 | 979.05
15 | 2001-03-15 | 03 | 967.0 | 980.26
16 | 2001-04-15 | 04 | 996.2 | 981.65
17 | 2001-05-15 | 05 | 1027.6 | 984.40
Query result: a table with the added row number, month number and 12-MA (moving averages rounded to 2 decimals)

STEP 3. CALCULATE THE TREND COMPONENT

As mentioned above, if the order of the moving average is an odd number (e.g. weekly data with a 7-day window), the m-MA can directly be treated as the trend component. However, when dealing with monthly data the order of the moving average is 12, and we need to calculate one more moving average of order 2 on the 12-MA to obtain the trend component, denoted 2 × 12-MA.
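As a sanity check of the 2 × 12-MA logic (in Python rather than SQL), using the first 13 observations of the series as shown in the query results:

```python
# First 13 monthly observations of the series (2000-01 .. 2001-01)
values = [946.6, 948.4, 952.5, 979.5, 994.6, 1006.8,
          1015.2, 1009.9, 998.9, 997.3, 980.2, 933.8, 940.8]

def trailing_ma(xs, m):
    # Trailing m-MA; None while fewer than m observations are available
    return [None if i < m - 1 else sum(xs[i - m + 1:i + 1]) / m
            for i in range(len(xs))]

ma12 = trailing_ma(values, 12)                          # value_12_ma
trend = [None if i < 12 else (ma12[i - 1] + ma12[i]) / 2
         for i in range(len(values))]                   # 2 x 12-MA

print(round(ma12[11], 4))   # 980.3083 (12-MA at t = 12)
print(round(ma12[12], 4))   # 979.825  (12-MA at t = 13)
print(round(trend[12], 4))  # 980.0667 (trend_value at t = 13)
```

The three printed numbers match value_12_ma at t = 12 and t = 13 and trend_value at t = 13 in the query results.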

SELECT *,
CASE WHEN t < 13 THEN NULL 
ELSE 
	AVG(value_12_ma) OVER (
		ORDER BY period_date asc
		ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
	) 
END AS trend_value,

FROM (
	SELECT 
	ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
	period_date,
	SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
	value,

	CASE 
	WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
	ELSE
		AVG(value) OVER (
			ORDER BY period_date ASC 
			ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
		) 
	END AS value_12_ma,

	FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 

	WHERE period_date > "2000-01-01"
	AND datatype_id = 1
	AND is_adjusted = 0
	AND category_id = 6
	AND geo_id = 1
)

Lines 2-8 calculate the moving average of order 2 on the already-calculated moving average of order 12; this newly generated value is the trend component of the time series. The table below shows the result.

t | period_date | t_month | value | value_12_ma | trend_value
1 | 2000-01-15 | 01 | 946.6 | null | null
2 | 2000-02-15 | 02 | 948.4 | null | null
3 | 2000-03-15 | 03 | 952.5 | null | null
4 | 2000-04-15 | 04 | 979.5 | null | null
5 | 2000-05-15 | 05 | 994.6 | null | null
6 | 2000-06-15 | 06 | 1006.8 | null | null
7 | 2000-07-15 | 07 | 1015.2 | null | null
8 | 2000-08-15 | 08 | 1009.9 | null | null
9 | 2000-09-15 | 09 | 998.9 | null | null
10 | 2000-10-15 | 10 | 997.3 | null | null
11 | 2000-11-15 | 11 | 980.2 | null | null
12 | 2000-12-15 | 12 | 933.8 | 980.31 | null
13 | 2001-01-15 | 01 | 940.8 | 979.83 | 980.07
14 | 2001-02-15 | 02 | 939.1 | 979.05 | 979.44
15 | 2001-03-15 | 03 | 967.0 | 980.26 | 979.65
16 | 2001-04-15 | 04 | 996.2 | 981.65 | 980.95
17 | 2001-05-15 | 05 | 1027.6 | 984.40 | 983.03
18 | 2001-06-15 | 06 | 1038.3 | 987.03 | 985.71
19 | 2001-07-15 | 07 | 1056.3 | 990.45 | 988.74
20 | 2001-08-15 | 08 | 1046.2 | 993.48 | 991.96
Query result: a table with the added row number, month number, 12-MA and the trend value (2 × 12-MA), rounded to 2 decimals

STEP 4. CALCULATE THE SEASONAL COMPONENT

The seasonal component is the mean of the differences between the observed values and the trend component over the repeating periods. For example, if you have a monthly time series with a length of 3 years (36 observations), the seasonal component of January is the same in all 3 years and equals:

[(Value_year1,Jan – Trend_year1,Jan) + (Value_year2,Jan – Trend_year2,Jan) + (Value_year3,Jan – Trend_year3,Jan)] ÷ 3
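The same per-month averaging can be sketched in Python (hypothetical detrended numbers, 3 Januaries and 3 Julys):

```python
from collections import defaultdict

# Hypothetical (value - trend) rows for 3 years of Januaries and Julys
months    = ["01", "07", "01", "07", "01", "07"]
detrended = [-20.0, 30.0, -24.0, 34.0, -22.0, 32.0]

# Mirrors AVG(value - trend_value) OVER (PARTITION BY t_month):
# average the detrended values within each calendar month
buckets = defaultdict(list)
for m, d in zip(months, detrended):
    buckets[m].append(d)
seasonality = {m: sum(v) / len(v) for m, v in buckets.items()}

print(seasonality)  # {'01': -22.0, '07': 32.0}
```

Every January row then gets the same seasonal value (-22.0), and every July row gets 32.0, exactly as the PARTITION BY produces one average per month number.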

Lines 3-7 in the following query execute the logic above to obtain the seasonal component.

Lines 9-13 in the following query calculate the residual component, which equals:

Value Yt – Trend Tt – Seasonality St = Residual Rt

Lines 15-19 in the following query calculate the de-seasonalized time series:

Value Yt – Seasonality St = Trend Tt + Residual Rt = the time series without the seasonal component

SELECT *,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	AVG(value - trend_value) OVER (PARTITION BY t_month)
END AS seasonality_value,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	value - trend_value - (AVG(value - trend_value) OVER (PARTITION BY t_month))
END AS residual_value,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	value - (AVG(value - trend_value) OVER (PARTITION BY t_month))
END AS value_deseasonality,

FROM 
(
	SELECT *,
	CASE WHEN t < 13 THEN NULL 
	ELSE 
		AVG(value_12_ma) OVER (
			ORDER BY period_date asc
			ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
		) 
	END AS trend_value,

	FROM (
		SELECT 
		ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
		period_date,
		SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
		value,

		CASE 
		WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
		ELSE
			AVG(value) OVER (
				ORDER BY period_date ASC 
				ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
			) 
		END AS value_12_ma,

		FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 

		WHERE period_date > "2000-01-01"
		AND datatype_id = 1
		AND is_adjusted = 0
		AND category_id = 6
		AND geo_id = 1
	)
)

ORDER BY period_date ASC

Now we are ready to connect Google Sheets to this query and use the final table for visualization.

t | period_date | t_month | value | value_12_ma | trend_value | seasonality_value | residual_value | value_deseasonality
1 | 2000-01-15 | 01 | 946.6 | null | null | null | null | null
2 | 2000-02-15 | 02 | 948.4 | null | null | null | null | null
3 | 2000-03-15 | 03 | 952.5 | null | null | null | null | null
4 | 2000-04-15 | 04 | 979.5 | null | null | null | null | null
5 | 2000-05-15 | 05 | 994.6 | null | null | null | null | null
6 | 2000-06-15 | 06 | 1006.8 | null | null | null | null | null
7 | 2000-07-15 | 07 | 1015.2 | null | null | null | null | null
8 | 2000-08-15 | 08 | 1009.9 | null | null | null | null | null
9 | 2000-09-15 | 09 | 998.9 | null | null | null | null | null
10 | 2000-10-15 | 10 | 997.3 | null | null | null | null | null
11 | 2000-11-15 | 11 | 980.2 | null | null | null | null | null
12 | 2000-12-15 | 12 | 933.8 | 980.31 | null | null | null | null
13 | 2001-01-15 | 01 | 940.8 | 979.83 | 980.07 | -22.26 | -17.01 | 963.06
14 | 2001-02-15 | 02 | 939.1 | 979.05 | 979.44 | -26.13 | -14.21 | 965.23
15 | 2001-03-15 | 03 | 967.0 | 980.26 | 979.65 | -17.87 | 5.21 | 984.87
16 | 2001-04-15 | 04 | 996.2 | 981.65 | 980.95 | -2.68 | 17.93 | 998.88
17 | 2001-05-15 | 05 | 1027.6 | 984.40 | 983.03 | 11.86 | 32.71 | 1015.74
18 | 2001-06-15 | 06 | 1038.3 | 987.03 | 985.71 | 19.52 | 33.06 | 1018.78
19 | 2001-07-15 | 07 | 1056.3 | 990.45 | 988.74 | 30.93 | 36.64 | 1025.37
20 | 2001-08-15 | 08 | 1046.2 | 993.48 | 991.96 | 26.47 | 27.77 | 1019.73
21 | 2001-09-15 | 09 | 1043.3 | 997.18 | 995.33 | 26.89 | 21.09 | 1016.41
22 | 2001-10-15 | 10 | 1035.6 | 1000.37 | 998.77 | 20.94 | 15.89 | 1014.66
23 | 2001-11-15 | 11 | 1018.4 | 1003.55 | 1001.96 | 11.04 | 5.41 | 1007.36
24 | 2001-12-15 | 12 | 959.4 | 1005.68 | 1004.62 | -21.11 | -24.11 | 980.51
25 | 2002-01-15 | 01 | 957.3 | 1007.06 | 1006.37 | -22.26 | -26.82 | 979.56
Query result: a table with the added row number, month number, 12-MA, the trend component (2 × 12-MA), the seasonal component, the residual component and the de-seasonalized time series (values rounded to 2 decimals)

STEP 5. VISUALIZATION

The picture below illustrates the original time series, the trend and the time series without the seasonal component:

  • Housing Units Under Construction: value
  • Trend: trend_value
  • Housing Units Under Construction (De-seasonality): value_deseasonality

Here you can see that the adjusted time series (without seasonality), the blue curve, is much smoother than the original time series, the red one.

The picture below demonstrates the seasonal component and the residual that we separated from the original time series, where you can clearly observe the seasonal peak.

  • Seasonality: seasonality_value
  • Residual: residual_value

Hope you have a fun and smooth de-seasonality process!

Notes & References

[1] The analysis is for business analysis purposes only. Therefore, I try to avoid jargon and extra analyses that are common in econometrics, such as white noise, lags, stationarity, etc., although time series analysis has a strict academic structure.

[2] Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts. Chapter 6.3, Classical decomposition. https://otexts.com/fpp2/

Other reference: Verbeek, M. (2017). A Guide to Modern Econometrics, 5th Edition. Wiley Global Education.
