Time series de-seasonality with BigQuery

As a frequent R user in school, I was used to applying the time series packages directly for decomposition when it came to de-seasonality. I've done several time series analyses with BigQuery in my current job, and found that the step-by-step "recipe" actually helped me understand the mechanism at a deeper level. I'd like to share some experience on how to manually perform time series decomposition with the query language, for business analysis purposes[1].

Before coding, you first need to understand what seasonality is and why we need to remove it for certain analyses. Let's consider a simple case:

  • An ice cream vendor increased the price per ice cream by 3% in May
  • Revenue from a 3-month window Feb – Apr before the price change was 1000 €
  • Revenue from a 3-month window Jun – Aug after the price change was 3000 €
  • Can he conclude that the 3% price change tripled revenue?

The answer is no, because ice cream sales are seasonally higher in the summer months than in the rest of the year, and those "extra sales" that repeat every summer are what we call "seasonality". To evaluate how much the 3% price change contributed to the revenue growth, you need to remove this "foam" (seasonality) from the time series to obtain a relatively smooth curve, and then compare the 3-month revenue before and after.

Secondly, you need to understand a simple formula for a time series, from an econometric perspective.

  • A time series Yt, which can be sales, temperature, population, etc. over a certain period with daily/weekly/monthly/quarterly/yearly frequency, is composed of 3 elements:
    • Trend Tt
    • Seasonality St
    • Noises / Residual / Remainder component Rt
  • Yt = Tt + St + Rt
  • The purpose of de-seasonality is to isolate and remove the seasonal component St from the time series Yt
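The additive model above can be illustrated with a minimal numeric sketch (in Python rather than SQL, with purely hypothetical numbers):

```python
# Toy additive time series: Y_t = T_t + S_t + R_t
trend = [100, 102, 104, 106, 108, 110]    # T_t: slowly rising level
seasonality = [-5, 0, 5, -5, 0, 5]        # S_t: repeats every 3 periods
residual = [1, -2, 0, 2, -1, 0]           # R_t: irregular noise

# The observed series is the element-wise sum of the 3 components
y = [t + s + r for t, s, r in zip(trend, seasonality, residual)]
print(y)  # [96, 100, 109, 103, 107, 115]

# De-seasonality = subtracting S_t, leaving T_t + R_t
y_deseasonalized = [yt - st for yt, st in zip(y, seasonality)]
print(y_deseasonalized)  # [101, 100, 104, 108, 107, 110]
```

Note how the de-seasonalized series tracks the trend much more closely than the observed one.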

The example of time series de-seasonality uses a BigQuery public dataset. To access the project, this guidance will help you set it up. The classic decomposition method will be applied, and the output will be presented and visualized in a connected Google Sheet.

STEP 1. DATA SHORTLIST

The monthly time series new_residential_construction is used as the example. The table is in the dataset census_bureau_construction, in the project bigquery-public-data. The earliest period dates back to 1959-01-15. To simplify the analysis, we shortlist the dimensions and the date range by:

  • Choosing the period after year 2000, period_date > "2000-01-01"
  • Reporting at “Total Units” level, datatype_id = 1
  • Using the unadjusted data, is_adjusted = 0
  • Selecting the category as “Housing Units Under Construction”, category_id = 6
  • Constraining the geographic dimension to "United States" (ignoring the regional dimensions, e.g. South, West, Northeast) to see the full picture, geo_id = 1

The query result is the time series that we will use for decomposition.

SELECT 
period_date,
category_desc,
datatype_desc,
geo_desc,
is_adjusted,
value
FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 
WHERE period_date > "2000-01-01"
AND datatype_id = 1
AND is_adjusted = 0
AND category_id = 6
AND geo_id = 1
period_date | category_desc | datatype_desc | geo_desc | is_adjusted | value
2000-01-15 | Housing Units Under Construction | Total Units | United States | 0 | 946.6
2000-02-15 | Housing Units Under Construction | Total Units | United States | 0 | 948.4
2000-03-15 | Housing Units Under Construction | Total Units | United States | 0 | 952.5
2000-04-15 | Housing Units Under Construction | Total Units | United States | 0 | 979.5
2000-05-15 | Housing Units Under Construction | Total Units | United States | 0 | 994.6
Query result: customized time series

STEP 2. TIME SERIES SETTING AND MOVING AVERAGE

Before finding the seasonal component St, 3 columns are added to assist the calculation:

  • t: a running number as an alias for the period, from the earliest to the most recent
  • t_month: the month number extracted from the corresponding date. We need the month number because the seasonal component repeats in the same month every year
  • value_12_ma: the moving average of the observations (column: value) over a 12-month window. In this case, the 12-month window is the 11 preceding months (11 rows before) plus the current month (the current row).

Note that a moving average with a window of m observations is usually called an m-MA, or a moving average of order m. If m is an even number (for example, monthly data with an MA of order 12, or quarterly data with an MA of order 4), the trend component is based on a further moving average of the moving average (introduced later), which makes the even-order moving average symmetric[2]. If m is an odd number, for example weekly data with an MA of order 7, the 7-MA can directly be treated as the trend component.
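The odd-order case can be sketched in Python (toy numbers, not the construction data): with an odd m, the window centers exactly on one observation, so the smoothed value can directly serve as the trend.

```python
def centered_ma(xs, m):
    # Centered m-MA for odd m: (m-1)/2 values before, the current value,
    # and (m-1)/2 values after; None where the full window does not fit.
    k = (m - 1) // 2
    return [None if i < k or i + k >= len(xs)
            else sum(xs[i - k:i + k + 1]) / m
            for i in range(len(xs))]

# Toy daily series smoothed with a 7-MA
daily = [3, 5, 4, 6, 8, 7, 9, 10, 8, 11]
print([None if v is None else round(v, 2) for v in centered_ma(daily, 7)])
# [None, None, None, 6.0, 7.0, 7.43, 8.43, None, None, None]
```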

SELECT 
ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
period_date,
SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
value,

CASE 
WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
ELSE
	AVG(value) OVER (
		ORDER BY period_date ASC 
		ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
	) 
END AS value_12_ma,

FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 
WHERE period_date > "2000-01-01"
AND datatype_id = 1
AND is_adjusted = 0
AND category_id = 6
AND geo_id = 1

Lines 7-14 of the query provide a method to calculate the moving average with a 12-month window. It is necessary to point out:

ROWS BETWEEN 11 PRECEDING AND CURRENT ROW

The expression assigns to a row the moving average of that row's value and the 11 values before it. Nevertheless, it is flexible to position the window differently. For example, if you want to assign to a row the moving average of its value, the 5 values before and the 6 values after, you can change the expression to:

ROWS BETWEEN 5 PRECEDING AND 6 FOLLOWING

However, wherever the window is positioned, we need to pay attention to the rows whose windows do not contain as many observations as designed. In our example, we implemented a condition to null out those moving averages whose window is shorter than 12 months:

WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL

This lets the first 11 values in the moving average column be NULL, because at the first 11 rows the window does not yet contain the 12 values it was designed for. For example, without this condition the moving average of the third row would be the mean of only the first 3 values, a much shorter 3-month window.
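The window mechanics can be sketched in Python. The helper below mirrors ROWS BETWEEN p PRECEDING AND f FOLLOWING, except that it returns None directly for incomplete windows, where BigQuery instead silently averages the truncated window (hence the extra CASE condition in the query above):

```python
def windowed_ma(xs, preceding, following):
    # Average over the window [i - preceding, i + following].
    # None whenever the full window does not fit inside the series,
    # mimicking the CASE ... THEN NULL cleanup from the query.
    m = preceding + following + 1
    out = []
    for i in range(len(xs)):
        lo, hi = i - preceding, i + following
        out.append(None if lo < 0 or hi >= len(xs)
                   else sum(xs[lo:hi + 1]) / m)
    return out

xs = list(range(1, 13))            # 12 toy observations: 1..12
print(windowed_ma(xs, 11, 0)[11])  # 11 PRECEDING AND CURRENT ROW -> 6.5
print(windowed_ma(xs, 5, 6)[5])    # 5 PRECEDING AND 6 FOLLOWING  -> 6.5
```

Both window positions cover the same 12 observations here; only the row that the average is assigned to changes.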

By this step, we've set up the time series for decomposition.

t | period_date | t_month | value | value_12_ma
1 | 2000-01-15 | 01 | 946.6 | null
2 | 2000-02-15 | 02 | 948.4 | null
3 | 2000-03-15 | 03 | 952.5 | null
4 | 2000-04-15 | 04 | 979.5 | null
5 | 2000-05-15 | 05 | 994.6 | null
6 | 2000-06-15 | 06 | 1006.8 | null
7 | 2000-07-15 | 07 | 1015.2 | null
8 | 2000-08-15 | 08 | 1009.9 | null
9 | 2000-09-15 | 09 | 998.9 | null
10 | 2000-10-15 | 10 | 997.3 | null
11 | 2000-11-15 | 11 | 980.2 | null
12 | 2000-12-15 | 12 | 933.8 | 980.31
13 | 2001-01-15 | 01 | 940.8 | 979.83
14 | 2001-02-15 | 02 | 939.1 | 979.05
15 | 2001-03-15 | 03 | 967.0 | 980.26
16 | 2001-04-15 | 04 | 996.2 | 981.65
17 | 2001-05-15 | 05 | 1027.6 | 984.40
Query result: a table with the added row number, month number and 12-MA (moving averages rounded to 2 decimals)

STEP 3. CALCULATE THE TREND COMPONENT

As mentioned above, if the order of the moving average is an odd number (e.g. weekly data with a 7-day window), the m-MA can directly be treated as the trend component. However, when dealing with monthly data the order of the moving average is 12, and we need to calculate one more moving average of order 2 on the 12-MA to obtain the trend component, denoted 2 × 12-MA.
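As a sanity check of the 2 × 12-MA logic (in Python rather than SQL), using the first 13 observations of the series as shown in the query results:

```python
# First 13 monthly observations of the series (2000-01 .. 2001-01)
values = [946.6, 948.4, 952.5, 979.5, 994.6, 1006.8,
          1015.2, 1009.9, 998.9, 997.3, 980.2, 933.8, 940.8]

def trailing_ma(xs, m):
    # Trailing m-MA; None while fewer than m observations are available
    return [None if i < m - 1 else sum(xs[i - m + 1:i + 1]) / m
            for i in range(len(xs))]

ma12 = trailing_ma(values, 12)                          # value_12_ma
trend = [None if i < 12 else (ma12[i - 1] + ma12[i]) / 2
         for i in range(len(values))]                   # 2 x 12-MA

print(round(ma12[11], 4))   # 980.3083 (12-MA at t = 12)
print(round(ma12[12], 4))   # 979.825  (12-MA at t = 13)
print(round(trend[12], 4))  # 980.0667 (trend_value at t = 13)
```

The three printed numbers match value_12_ma at t = 12 and t = 13 and trend_value at t = 13 in the query results.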

SELECT *,
CASE WHEN t < 13 THEN NULL 
ELSE 
	AVG(value_12_ma) OVER (
		ORDER BY period_date asc
		ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
	) 
END AS trend_value,

FROM (
	SELECT 
	ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
	period_date,
	SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
	value,

	CASE 
	WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
	ELSE
		AVG(value) OVER (
			ORDER BY period_date ASC 
			ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
		) 
	END AS value_12_ma,

	FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 

	WHERE period_date > "2000-01-01"
	AND datatype_id = 1
	AND is_adjusted = 0
	AND category_id = 6
	AND geo_id = 1
)

Lines 2-8 calculate the moving average of order 2 on the already-calculated moving average of order 12; this newly generated value is the trend component of the time series. The table below shows the result.

t | period_date | t_month | value | value_12_ma | trend_value
1 | 2000-01-15 | 01 | 946.6 | null | null
2 | 2000-02-15 | 02 | 948.4 | null | null
3 | 2000-03-15 | 03 | 952.5 | null | null
4 | 2000-04-15 | 04 | 979.5 | null | null
5 | 2000-05-15 | 05 | 994.6 | null | null
6 | 2000-06-15 | 06 | 1006.8 | null | null
7 | 2000-07-15 | 07 | 1015.2 | null | null
8 | 2000-08-15 | 08 | 1009.9 | null | null
9 | 2000-09-15 | 09 | 998.9 | null | null
10 | 2000-10-15 | 10 | 997.3 | null | null
11 | 2000-11-15 | 11 | 980.2 | null | null
12 | 2000-12-15 | 12 | 933.8 | 980.31 | null
13 | 2001-01-15 | 01 | 940.8 | 979.83 | 980.07
14 | 2001-02-15 | 02 | 939.1 | 979.05 | 979.44
15 | 2001-03-15 | 03 | 967.0 | 980.26 | 979.65
16 | 2001-04-15 | 04 | 996.2 | 981.65 | 980.95
17 | 2001-05-15 | 05 | 1027.6 | 984.40 | 983.03
18 | 2001-06-15 | 06 | 1038.3 | 987.03 | 985.71
19 | 2001-07-15 | 07 | 1056.3 | 990.45 | 988.74
20 | 2001-08-15 | 08 | 1046.2 | 993.48 | 991.96
Query result: a table with the added row number, month number, 12-MA and the trend value (2 × 12-MA), rounded to 2 decimals

STEP 4. CALCULATE THE SEASONAL COMPONENT

The seasonal component is the mean of the differences between the observed values and the trend component over the repeating periods. For example, if you have a monthly time series with a length of 3 years (36 observations), the seasonal component of January is the same in all 3 years and equals:

[(Value_year1,Jan – Trend_year1,Jan) + (Value_year2,Jan – Trend_year2,Jan) + (Value_year3,Jan – Trend_year3,Jan)] ÷ 3
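The same per-month averaging can be sketched in Python (hypothetical detrended numbers, 3 Januaries and 3 Julys):

```python
from collections import defaultdict

# Hypothetical (value - trend) rows for 3 years of Januaries and Julys
months    = ["01", "07", "01", "07", "01", "07"]
detrended = [-20.0, 30.0, -24.0, 34.0, -22.0, 32.0]

# Mirrors AVG(value - trend_value) OVER (PARTITION BY t_month):
# average the detrended values within each calendar month
buckets = defaultdict(list)
for m, d in zip(months, detrended):
    buckets[m].append(d)
seasonality = {m: sum(v) / len(v) for m, v in buckets.items()}

print(seasonality)  # {'01': -22.0, '07': 32.0}
```

Every January row then gets the same seasonal value (-22.0), and every July row gets 32.0, exactly as the PARTITION BY produces one average per month number.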

Lines 3-7 in the following query execute the logic above to obtain the seasonal component.

Lines 9-13 in the following query calculate the residual component, which equals:

Value Yt – Trend Tt – Seasonality St = Residual Rt

Lines 15-19 in the following query calculate the de-seasonalized time series:

Value Yt – Seasonality St = Trend Tt + Residual Rt = the time series without the seasonal component

SELECT *,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	AVG(value - trend_value) OVER (PARTITION BY t_month)
END AS seasonality_value,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	value - trend_value - (AVG(value - trend_value) OVER (PARTITION BY t_month))
END AS residual_value,

CASE 
WHEN t < 13 THEN NULL
ELSE 
	value - (AVG(value - trend_value) OVER (PARTITION BY t_month))
END AS value_deseasonality,

FROM 
(
	SELECT *,
	CASE WHEN t < 13 THEN NULL 
	ELSE 
		AVG(value_12_ma) OVER (
			ORDER BY period_date asc
			ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
		) 
	END AS trend_value,

	FROM (
		SELECT 
		ROW_NUMBER() OVER (ORDER BY period_date ASC) AS t,
		period_date,
		SUBSTRING(CAST(period_date AS STRING),6,2) AS t_month,
		value,

		CASE 
		WHEN ROW_NUMBER() OVER (ORDER BY period_date ASC) < 12 THEN NULL 
		ELSE
			AVG(value) OVER (
				ORDER BY period_date ASC 
				ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
			) 
		END AS value_12_ma,

		FROM `bigquery-public-data.census_bureau_construction.new_residential_construction` 

		WHERE period_date > "2000-01-01"
		AND datatype_id = 1
		AND is_adjusted = 0
		AND category_id = 6
		AND geo_id = 1
	)
)

ORDER BY period_date ASC

Now we are ready to connect Google Sheets to this query and use the final table for visualization.

t | period_date | t_month | value | value_12_ma | trend_value | seasonality_value | residual_value | value_deseasonality
1 | 2000-01-15 | 01 | 946.6 | null | null | null | null | null
2 | 2000-02-15 | 02 | 948.4 | null | null | null | null | null
3 | 2000-03-15 | 03 | 952.5 | null | null | null | null | null
4 | 2000-04-15 | 04 | 979.5 | null | null | null | null | null
5 | 2000-05-15 | 05 | 994.6 | null | null | null | null | null
6 | 2000-06-15 | 06 | 1006.8 | null | null | null | null | null
7 | 2000-07-15 | 07 | 1015.2 | null | null | null | null | null
8 | 2000-08-15 | 08 | 1009.9 | null | null | null | null | null
9 | 2000-09-15 | 09 | 998.9 | null | null | null | null | null
10 | 2000-10-15 | 10 | 997.3 | null | null | null | null | null
11 | 2000-11-15 | 11 | 980.2 | null | null | null | null | null
12 | 2000-12-15 | 12 | 933.8 | 980.31 | null | null | null | null
13 | 2001-01-15 | 01 | 940.8 | 979.83 | 980.07 | -22.26 | -17.01 | 963.06
14 | 2001-02-15 | 02 | 939.1 | 979.05 | 979.44 | -26.13 | -14.21 | 965.23
15 | 2001-03-15 | 03 | 967.0 | 980.26 | 979.65 | -17.87 | 5.21 | 984.87
16 | 2001-04-15 | 04 | 996.2 | 981.65 | 980.95 | -2.68 | 17.93 | 998.88
17 | 2001-05-15 | 05 | 1027.6 | 984.40 | 983.03 | 11.86 | 32.71 | 1015.74
18 | 2001-06-15 | 06 | 1038.3 | 987.03 | 985.71 | 19.52 | 33.06 | 1018.78
19 | 2001-07-15 | 07 | 1056.3 | 990.45 | 988.74 | 30.93 | 36.64 | 1025.37
20 | 2001-08-15 | 08 | 1046.2 | 993.48 | 991.96 | 26.47 | 27.77 | 1019.73
21 | 2001-09-15 | 09 | 1043.3 | 997.18 | 995.33 | 26.89 | 21.09 | 1016.41
22 | 2001-10-15 | 10 | 1035.6 | 1000.37 | 998.77 | 20.94 | 15.89 | 1014.66
23 | 2001-11-15 | 11 | 1018.4 | 1003.55 | 1001.96 | 11.04 | 5.41 | 1007.36
24 | 2001-12-15 | 12 | 959.4 | 1005.68 | 1004.62 | -21.11 | -24.11 | 980.51
25 | 2002-01-15 | 01 | 957.3 | 1007.06 | 1006.37 | -22.26 | -26.82 | 979.56
Query result: a table with the added row number, month number, 12-MA, the trend component (2 × 12-MA), the seasonal component, the residual component and the de-seasonalized time series (values rounded to 2 decimals)

STEP 5. VISUALIZATION

The picture below illustrates the original time series, the trend and the time series without the seasonal component:

  • Housing Units Under Construction: value
  • Trend: trend_value
  • Housing Units Under Construction (De-seasonality): value_deseasonality

Here you can see that the adjusted time series (without seasonality), the blue curve, is much smoother than the original time series, the red one.

The picture below demonstrates the seasonal component and the residual that we separated from the original time series, where you can clearly observe the seasonal peak.

  • Seasonality: seasonality_value
  • Residual: residual_value

Hope you have a fun and smooth de-seasonality process!

Notes & References

[1] The analysis is for business analysis purposes only. Therefore, I try to avoid jargon and extra analyses that are common in econometrics, such as white noise, lags, stationarity, etc., although time series analysis has a strict academic structure.

[2] Hyndman, R.J., & Athanasopoulos, G. (2018). Forecasting: Principles and Practice. OTexts. Chapter 6.3, Classical decomposition. https://otexts.com/fpp2/

Other reference: Verbeek, M. (2017). A Guide to Modern Econometrics, 5th Edition. Wiley Global Education.
