Marketing Promotion Campaign Analysis Using Feature Based Time Series Clustering
Feature based time series analysis method is used to make recommendations to expand to new regions
- Overview
- Data Quality
- Exploratory Data Analysis
- Trend Analysis:
- Influence of Macroeconomic Factors
- Cross Correlation Analysis
- Clustering
- Modeling methodology:
Overview
The goal of this notebook is to show how to do exploratory data analysis of time series data and use feature based clustering method to identify attributes that are similar in time domain. The novelty of this based is, instead of relying on shape of the time series, feature based approach enables more interpretable solution. I used some of these techniques in a real project so I wanted to showcase this for my own future reference.
The data here is alcohol consumption across 80+ regions of Russia. I am not sure if this is real data or not. Hypothetically, an alcohol company has been operating in St. Petersburg. Their campaign was successful in St.Petersburg so we want to identify other regions where they can potentially run similar campaigns. We have alcohol consumption for differrent alcohol types per capita from 1998 to 2016.
I don't know anything about Russia's geography or economy, so I will make reasonable assumptions in my analysis.
#!pip install altair
#!pip install catch22
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
import statsmodels.api as sm
import scipy.stats as st
import statsmodels.stats.api as sms
from sklearn.metrics import r2_score
from catch22 import catch22_all
from sklearn.cluster import KMeans
alt.data_transformers.disable_max_rows()
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()
df.isna().any()
Wine, beer, vodka, champagne and brandy have null values. Let's find if any of the columns have more than 5% null values (i.e in this at least 1).
# Which region has null values?
null_regions = []
print("% of Values missing \n========================================")
for r in df.region.unique():
nulls = (df.query('region == @r').isna().mean())
if nulls.max() > 0.05:
print(r,",",nulls.idxmax(), ":", 100 * round(nulls.max(),2),"%")
null_regions.append(r)
There are four regions with more than 79% of the values missing. If few values were missing, we could have used interpolation methods to impute the missing values, but in this we don't have enough values to do so. Let's look at which values are missing.
#alcohol types
alcohols = list(df.columns[2:])
#Find number of missing values and location of the missing values each of the 4 regions
null_regions_df = (df[df['region']
.isin(null_regions)]
.sort_values(by=['region','year'])
.groupby(['year','region'])
.apply(lambda x : x.isna().sum())
.drop(['year','region'],axis=1)
.assign(nulls = lambda x: x.sum(axis=1))
.reset_index()
)
null_regions_df.head(5)
plt.title("Fig 1: Heatmap showing number of alcohol types missing by year and region \n")
(
sns.heatmap
(pd.crosstab(
null_regions_df['region'],
null_regions_df['year'],
values = null_regions_df['nulls'],
aggfunc='max'
),
cmap="magma_r",
annot=True,
cbar=False,
square=True
)
);
Observations:
In Fig 1., the heatmap shows, out of the 5 alcohol types, how many types have missing values.
- Crimea and Sevastopol has values missing for all 5 alcohol types for years 1998 to 2013 and none missing for the last 3 years. This suggests that likely the data for first 16 years either wasn't collected or wasn't available.
- Chechen Republic has values missing except for one alcohol type in the last 3 years
- Ingushetia has missing values at random
Since our goal is to compare the regions using alcohol consumption data for all years, it won't be possible to include these 4 regions in to the analysis. We will drop all these 4 regions from further analysis.
#Are there any duplicate values?
if max((df[~df.region.isin(null_regions)].region.value_counts())) == min((df[~df.region.isin(null_regions)].region.value_counts())):
print("All Regions have 19 years of data")
else:
print("Some of the regions of more/less data")
None of the regions have duplicate data across all years
#Are there any observations with 0 or negative values? Consumption can be 0 but not negative
df[~df.region.isin(null_regions)].describe().loc['min']
There is at least one region with 0 lts per capita consumption of brandy. Let's look at that closely.
df[~df.region.isin(null_regions)].query('brandy == 0')
In 2004, Tuva Republic did not record any consumption of brandy.
(df.query('region == "Tuva Republic"')
.set_index('year')['brandy']
.plot(title="Fig 2: Brandy consumption for Tuva Republic \n 0 consumption in 2004", legend=True)
)
(plt.axhline(df.query('region == "Tuva Republic"')
.set_index('year')['brandy']
.median(), color='red')
);
In 2004, the brandy consumption in Tuva dropped from 0.1 ltr to 0. The median consumption over the entire period was 0.2 ltr. So, considering overall low consumption, 0 does not seem like an outlier. We will keep the value as is.
# Max values
df[~df.region.isin(null_regions)].describe().iloc[:,1:]
Except for beer, the max values for all other alcohols seem reasonable. In case of beer, the 75% value is 67.5 lts and max is 207, which is 5 standard deviations away from 75th percentile. We need to understand which region has this value and how should we treat it.
Below, we will drop the regions with null values and create a new dataframe with percentage share of each alcohol type for the given year and region.
df1 = df[~df.region.isin(null_regions)]
df1['total']=(df1[alcohols].sum(axis=1))
df1 = (df1
.assign(winepct = df1['wine']/df1['total'])
.assign(beerpct = df1['beer']/df1['total'])
.assign(vodkapct = df1['vodka']/df1['total'])
.assign(champct = df1['champagne']/df1['total'])
.assign(brandypct = df1['brandy']/df1['total'])
.assign(year = pd.to_datetime(df1.year,format='%Y'))
.round(3)
)
df1.head(5)
Exploratory Data Analysis
To better understand the pattern in the data and compare the regions quickly over the 19 year period, decile heatmap as show in Fig 3 is created. In this visual, for each year, the total alcohol conumption is binned into 10 bins (i.e deciles). For example, the region with the highest consumption will be in 10th decile while the region with the lowest consumption will be in the 1st decile. The lighter color indicates high decile (and high consumption) while dark color shows low consumption. The regions are sorted by total consumption so the regions with high consumption are at the top.
plt.figure(figsize = (10,22))
plt.title("Fig 3: Decile Heat Map \n Shows distribution of binned alcohol consumption by year and region \n Lighter color: High consumption, darker color: low consumption", fontsize =14)
sns.heatmap(df1[['year','region','total']]
.assign(year=df1.year.dt.year)
.set_index(['year','region'])
.groupby(level=0)
.transform(lambda x: pd.qcut(x, q = np.linspace(0,1,11), labels=False))
.unstack(level=0)
.assign(t = lambda x: x.sum(axis=1))
.sort_values(by='t', ascending=False)
.drop('t',axis=1),
cmap='magma',cbar_kws={"orientation": "horizontal"});
Observations:
- Moscow has light color all throughout the 19 years showing it had the highest alcohol consumption continuoulsy for all years.
- Saint Petersburg is towards the top, thus one of the regions with high alcohol consumptions
- Saint Petersburg was in high deciles until 2010 and then the alcohol sumption dropped drammatically in the last 5-6 years.
- A visual comparison shows Chelyabinsk Oblask, Yaroslavi Oblast, Perm also show a similar trend as that of Saint Petersburg
- Some regions such as Karaychay, Dagestan, Chukotka, Adyega etc which are at the bottom, had very low and flat trend in consumption compared to Saint Petersburg
- Some regions such as Tver, Nizhny, Sakhlin showed pick up in consumption in later years, opposite to Saint Petersburg
Let's compare the consumption further by looking at box plot and by individual alcohol type.
df1_melt = df1.iloc[:,:7].melt(['year','region'],var_name='Alcohol', value_name='Consumption')
df1_melt.head(3)
Trend Analysis:
Fig 4 shows distribution of total alcohol consumption for each of the year and by region. Values that are within the box plot are within the inter-quartile range ,where as values outside are potential outliers. The thick blue line shows the median consumption over the period. Light orange color line shows total consumption of Saint Petersburg for comaprison purposes. Each dot in fig 5 is a region. You can hover over the circle to get breakdown of consumption by alcohol type, volume and percentage.
Fig 5 shows the consumption for each selected region over the 19 year period. Select the region fro teh dropdown menu to get the consumption by type.
You can zoom-in/out and pan to explore the data in fig 4 & 5.
median_line = alt.Chart(df1).mark_line().encode(
x = 'year:T',
y = 'median(total):Q'
).interactive()
total_box = alt.Chart(df1).mark_boxplot(opacity=0.3).encode(
x = 'year:T',
y = 'total:Q',
).interactive()
total_scatter = alt.Chart(df1).mark_circle(opacity=0.8).encode(
x = 'year:T',
y = 'total:Q',
color='region',
tooltip = list(df1.columns)).interactive()
total_line = alt.Chart(df1).mark_line(opacity=0.1).encode(
x = 'year:T',
y = 'total:Q',
color='region',
tooltip = list(df1.columns),
strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
).interactive()
total_line_stp = alt.Chart(df1.query('region == "Saint Petersburg"')).mark_line(opacity=0.4, strokeDash=[1,1]).encode(
x = 'year:T',
y = 'total:Q',
color=alt.value("#FFAA00"),
strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
tooltip = list(df1.columns)
).interactive()
# (total_scatter + total_line_stp).properties(width=700, height = 400, title = "Total Alcohol consumption for all regions vs. Year")
top = (total_scatter + total_box + median_line + total_line + total_line_stp).properties(width=700, height = 300, title = "Fig 4: Total Alcohol consumption for all regions vs. Year")
region_dropdown = alt.binding_select(options= sorted(list(df1_melt.region.unique())))
region_select = alt.selection_single(fields=['region'], bind=region_dropdown, name="Select")
bottom = (alt.Chart(df1_melt)
.mark_line().encode(alt.X('year:T'),
alt.Y('mean(Consumption):Q', scale=alt.Scale(domain=[0,210])),
color = 'Alcohol:N',
tooltip=['year','Consumption','Alcohol'])
).add_selection(
region_select
).transform_filter(
region_select
).properties(width=700, height = 200, title = "Fig 5:Select Region: Alcohol consumption by Year, Alcohol Type ")
(top & bottom.interactive())
Observations:
Above plot shows the distribution of total consumption over the years for all non-null regions. Some observations:
-
Total alcohol consumption in Russia increased from 1998 to 2008, levelled off over the next 4 years and then started declining after 2012. Global recession started in ~2008. It is highly likely that that may have affected the consumption.
-
The box plot shows the distribution of total alcohol consumption for all regions. Overall, the the total consumption for all regions was within the interquartile range, except in years 2004-2006 for Zabalkalsky Krai (three blue dots in the chart that are away from all other points). For Zabaykalsky Krai, the consumption was significantly higher from 2004 to 2006. Selecting Zabaykalsky Krai in fig 5 shows thatthe beer consumption in 2006 was 207 lts, which is the outlier we found in the analysis above. Interestingly, consumption of vodka and other alcohols also increased over that period indicating it's not an error in values. The consumption dropped precipitously after 2006. It stayed the same before, during and after the global recession, contrary to overall trend in Russia.
-
Moscow shows trend similar to overall trend, while Moscow Oblast is exactly opposite. Beer consumption of Moscow Oblast increased rammatically in 2006 and was not affected by the Recession.
-
Selecting Saint Petersburg in fig 5 shows that beer consumption increased steadily from 1998 to 2006 and dropped significantly thereafter and continued its decline.
-
Saint Petersburg is one of the most visted cities of Russia and tourism is a big part of the economy. Global recession in 2008 might have led to the decline tourism and thus copnsumption of alcohol.
-
There is no seasonality observed in this data and data are non-stationary.
Comparing Saint Petersburg with Rest of Russia
Fig 6 and 7 show breakdown of alcohol consumption by volume and percentage for all of Russia. Fig 8 & 9 show similar metrics for Saint Petersburg. This will help us compare the trends and identify insights. All 4 charts are zoomable and can be panned. Fig 10 shows the alcohol consumption for Saint Petersburg by rank order.
total_n = alt.Chart(df1_melt).mark_area(size=20).encode(
y=alt.Y('sum(Consumption)', stack="normalize", axis=alt.Axis(title='%Alcohol consumed')),
x='year:T',
color='Alcohol',
tooltip=['sum(Consumption):Q','year']
).interactive().properties(title = 'Fig 7: All Regions:%Alcohol Consumption by Type & Year')
total_z = alt.Chart(df1_melt).mark_area(size=20).encode(
y=alt.Y('sum(Consumption)', stack="zero", axis=alt.Axis(title='Alcohol consumed by vol')),
x='year:T',
color='Alcohol',
tooltip=['sum(Consumption):Q','year']
).interactive().properties(title = 'Fig 6: All Regions:Total Alcohol Consumption by Type & Year by vol')
total_n_stp = alt.Chart(df1_melt.query('region =="Saint Petersburg"')).mark_area(size=20).encode(
y=alt.Y('sum(Consumption)', stack="normalize", axis=alt.Axis(title='%Alcohol consumed')),
x='year:T',
color='Alcohol',
tooltip=['sum(Consumption):Q','year']
).interactive().properties(title = 'Fig 9:St Petersburg: %Alcohol Consumption by Type & Year')
total_z_stp = alt.Chart(df1_melt.query('region =="Saint Petersburg"')).mark_area(size=20).encode(
y=alt.Y('sum(Consumption)', stack="zero", axis=alt.Axis(title='Alcohol consumed by volume')),
x='year:T',
color='Alcohol',
tooltip=['sum(Consumption):Q','year']
).interactive().properties(title = 'Fig 8: St Petersburg: Total Alcohol Consumption by Type & Year by vol')
( total_z | total_n) & ( total_z_stp | total_n_stp)
(df1[['wine','beer','vodka','champagne','brandy','region','year']]
.query('region == "Saint Petersburg"')
.drop('region',axis=1)
.set_index('year')
.rank(axis=1, ascending=False)
.astype('int')
.plot(figsize=(10,6),
label=True,
title="Fig 10: Alcohol Consumption Rank: Saint Petersburg \n Beer is #1, Wine took over Vodka in 2013 as #2")
);
df1['weak'] = df1['beer'] + df1['wine'] + df1['champagne']
df1['hard'] = df1['vodka'] + df1['brandy']
df1['hardpct'] = df1['hard']/df1['total']
df1['weakpct'] = df1['weak']/df1['total']
(alt.Chart(df1.query('region == "Saint Petersburg"')[['year','hard','weak']].melt(['year'], var_name='type'))
.mark_area()
.encode(
x='year:T',
y= alt.Y('sum(value)', stack='normalize'),
color='type',
tooltip=['year','type','value'])
).properties(title='Fig 11: Hard vs. Weak alcohol by % for Saint Petersburg')
Observations:
1. Total consumption: Overall consumption of alcohol per capita for Russia and Saint Petersburg follow similar trends. Both show decline in consumption in the last 5-10 years but at differrent rates. The decline started much earlier (2009) in Saint Petersburg while it remained steady from 2009-2012 for Russia and then started decreasing. Saint Peterberg's consumption dropped by ~70% compared to only 25% for Russia. Saint Peterburg accounted for 2% of total consumption at its peak in 2009 and only 0.08% in 2016.
2. Beer Consumption: Beer consumption decreased by volume across Russia and Saint Petersburg in the last few years. But interestingly, while the overall consumption dropped in volume, preference for beer increased from ~ 55% in the late 1990s to ~ 75% by 2016. In contrast, consumers in Saint Petersburg always preferred beer in the 1990s (~63%), it grew to 87% at its peak in 2009 and dropped to 67% by 2016. While beer was still the preferred beverage, its preference has been decreasing in Saint Petersburg in constrast to the national average. My hypothesis is that tourism is a big part of Saint Petersburg's economy, so recession in 2008-09 may have caused reduction in tourism and consumption of beer.
3.Vodka: Vodka is synonymous with Russia and yet vodka isn't the most consumed alcohol. Consumption of alcohol decreased in Russia and Saint Petersburg. Vodka remained the second most consumed alcohol in Saint Petersburg until 2013. As show in fig 10, vodka became the third most consumed alcohol by volume.
4.Wine: Preference for wine over vodka steadily increased in the first 16 years and then took over vodka as the second most consumed alcohol by volume in Saint Petersburg. While some reasearch has been done to identify the reasons, it's not conclusive. It's postulated that women prefer wine over beer and vodka so perhaps change in Saint Petersburg's demographic mixture may have led to wine's popularity but not much data is available to confirm that. As show in fig 10, wine became more popular than vodka in 2014. Consumers in Saint Petersburg drink 5% more wine per capita than national average (14% vs. 9%).
5. Champagne & Brandy: These two are lumped together because historically these two alcohols haven't been consumed much, neither in Russia nor in Saint Petersburg. For example, in 2016 Saint Petersburg consumers drank 41 ltr beer on average and only 3.2 lts champagne and brandy. Although the percentage of these two has been increasing, at the expense of beer, but still the volumes are very low.
6. Hard vs. weak alcohol: We can group the alcohols by the percentage of alcohol present in them. Vodka, brandy contain ~40% alcohol whereas beer, champagne & wine contain 4-12% alcohol. If we group the alcohols this way as hard vs. weak based on alcohol content, we see a differrent picture. As shown in fig 11, although by volume the amount of hard alcohol consumed is low, by percentage consumers in Saint Petersburg are actually preferring hard over weak alcohols at a rapid pace.
Saint Petersburg Vs. Other Regions:
We compared Saint Petersburg against the national average. Now let's compare Saint Petersburg against the regions over two differrent periods by median alcohol consumption. As mentioned in the observations above (and fig 10), Saint Peterburg's alcohol consumption pattern changed in 2014. To put it in perspective, fig 12-15 compare Saint Petersburg's median alcohol consumption over two differrent periods, 1998-2013 and 2013-2016.
Fig 12: Median consumption of wine vs. beer from 1998-2013. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows vodka consumption.
Fig 13: Median consumption of wine vs. beer from 2014-2016. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows vodka consumption.
Fig 14: Median consumption of wine vs. vodka from 1998-2013. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows beer consumption.
Fig 15: Median consumption of wine vs. beer from 2014-2016. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows beer consumption.
All figures are zoomable and pannable and have tooltips.
L3Y_mean = df1.set_index('year').loc['2014':].iloc[:,:4].groupby('region').median().reset_index()
L16Y_mean = df1.set_index('year').loc[:'2013'].iloc[:,:4].groupby('region').median().reset_index()
(alt.Chart(L16Y_mean).mark_circle(size=100).encode(
alt.X('beer', scale=alt.Scale(domain=[0,150])),
alt.Y('wine',scale=alt.Scale(domain=[0,14])),
tooltip=['region','wine','beer', 'vodka'],
color = alt.condition("datum.region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")),
size='vodka'
).properties(title ='Fig 12:Median Consumption Wine vs. Beer (1998-2013)')).interactive() | (alt.Chart(L3Y_mean).mark_circle(size=100).encode(
alt.X('beer', scale=alt.Scale(domain=[0,150])),
alt.Y('wine', scale=alt.Scale(domain=[0,14])),
tooltip=['region','wine','beer','vodka'],
color = alt.condition("datum.region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")),
size='vodka'
).properties(title ='Fig 13:Median Consumption Wine vs. Beer (2014-2016)')).interactive()
(alt.Chart(L16Y_mean).mark_circle(size=100).encode(
alt.X('wine', scale=alt.Scale(domain=[0,20])),
alt.Y('vodka',scale=alt.Scale(domain=[0,22])),
tooltip=['region','wine','beer', 'vodka'],
color = alt.condition("datum.region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")),
size='beer'
).properties(title ='Fig 14:Median Consumption Wine vs. Vodka(1998-2013)')).interactive() |((alt.Chart(L3Y_mean).mark_circle(size=100).encode(
alt.X('wine', scale=alt.Scale(domain=[0,20])),
alt.Y('vodka', scale=alt.Scale(domain=[0,22])),
tooltip=['region','wine','beer','vodka'],
color = alt.condition("datum.region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")),
size='beer'
).properties(title ='Fig 15:Median Consumption Wine vs. Vodka(2014-2016)')).interactive())
Observations:
- In fig 12, we can clearly see beer consumption in Saint Petersburg was the highest on average over the 16 year period. It's the only region with beer consumption over 100 ltr per person on average ! There are more regions in the 40-60 lts range.Size of the circle also shows, all regions cosumed roughly similar amount of vodka. If we compare that with fig 12, not only beer but vodka consumption also reduced significantly. Moscow Oblast consumers started drinking more beer compared to national average. The shape of the scatterplot is also interesting. Circle in fig 12 are dispersed horizontally whereas circles in fig 13 are distributed vertically more, showing shift towards wine. Also, if you can imagine a line of best fit in these two scatterplots, in fig 12 the beer and wine preference have strong correlation, but in fig 13 that correlation seems weaker. Saint Petersburg is around in the middle of the scatter plot vertically, showing wine consumption is around the national average range.
- In fig 14 & 15, the vertical shift in circle can be easily observed, showing reduction in vodka consumption. Saint Petersburg is also towards the top third in wine consumption in 2014-2016 period.
Influence of Macroeconomic Factors
Research has shown that macroeconomic factors play a role in alcohol consumption pattern. One of the ways to gauge macroeconomic factor is by using gross domestic product (GDP) o fthe country. GDP is a measure of monetary value of goods produced in the country ref. We can use GDP of Russia from 1998 to 2016 and try to see if GDP has any correlation with consumption of alcohol. Data for GDP numbers is taken from here.
russia_gdp = pd.read_table('https://raw.githubusercontent.com/pawarbi/datasets/master/russia_gdp').sort_values(by='Year ')
russia_gdp.columns = russia_gdp.columns.str.replace(' ','').str.replace(' ','').str.lower()
russia_gdp['percapita'] = pd.to_numeric([x.replace('$','').replace(',','').rstrip().lstrip() for x in russia_gdp['percapita']])
russia_gdp['year']=pd.to_datetime(russia_gdp['year'],format='%Y')
russia_gdp.set_index('year',inplace=True)
russia_gdp = russia_gdp['1998':'2016']
russia_gdp.head(3)
russia_gdp2 = ((df1[['year','total']]
.join(russia_gdp,on='year')
.set_index('year')[['total','percapita']]
.reset_index().groupby('year')['total','percapita']
.mean()
))
russia_gdp2.head(3)
print("The correlation coefficient between total alcohol consumption and GDP per capita is: ", round(russia_gdp2['total'].corr(russia_gdp2['percapita']),2))
total_vs_gdp = (alt.Chart(russia_gdp2.reset_index())
.mark_circle(size=60)
.encode(
x='percapita',
y='total',
tooltip=['year','total','percapita'])
.properties(title='Fig 16Total Alcohol Consumption per capita vs. GDP per capita (1998-2016)')
)
(total_vs_gdp + total_vs_gdp.transform_regression('percapita', 'total', method='linear').mark_line()).interactive() | (total_vs_gdp + total_vs_gdp.transform_regression('percapita', 'total', method='log').mark_line()).interactive()
mod = sm.OLS(russia_gdp2.total.iloc[:-4], np.log(russia_gdp2.percapita.iloc[:-4]))
res = mod.fit()
print(res.summary())
russia_gdp2['pred'] = (res.predict(np.log(russia_gdp2.percapita)))
(russia_gdp2['total']
.plot(legend=True,
label='Total Consumption',
title='Fig. 17: Actual vs. Predicted Total Consumption (All Regions)')
)
(russia_gdp2['pred']
.plot(legend=True,
label='Predicted Consumption',
color='red',
style='--',
figsize=(10,6))
);
Observations:
-
In fig 16, total alcohol consumption per year and the GDP numbers for Russia for that year are plotted. For visual comparison, two lines are fitted, the one on the left is a linear fit and the one on the right si with log of GDP. We can clearly see that log-linear model fits better visually.
-
For a more accuarate analysis, linear regression model using log of GDP is fit to predict the consumption of alcohol. Adjusted R2 values is 97% with P value significantly less than the critical 5% threshold. P value of Jarque Bera test on residuals also shows that the residuals are normally distributed thus no bias in the model.
-
This shows that macroeconomic conditions of the country play a significant role in alcohol consumption behavior of the consumers. The practical significance of that is before launching the marketing campaign across the nation, the company should asess the macroeconomic factors to maximize the success of the campaign and sales.
-
In fig 17, the linear model is used to predict the total alcohol consumption. As we can see, although the numbers are very accurate, it captured the trend very well.
Regional GDP comparison
We are interested in regional pattern in alcohol consumption, so lets check regional GDP numbers and compare them with Saint Peterburg's GDP. We can use the regional GDP numbers from here
ru_gdp_regional = pd.read_table('https://raw.githubusercontent.com/sapawar4/datasets/main/russia_gdp_regional')
ru_gdp_regional.head(3)
print("This data has GDP of", len(ru_gdp_regional), "regions of Russia, for year 2018")
This is the GDP of differrent regions of Russia for year 2018. Since the alcohol data is from 1996-2016, we can't use these number directly but we can normalize the numbers by Saint Petersburg's GDP to get an idea of relative gdp. Regions with GDP >= 1 indicate GDP of those regions are equal to greater than that of Saint Petersburg's.
Russia's average inflation between 2016-2018 was ~3.5% (source: https://www.macrotrends.net/countries/RUS/russia/inflation-rate-cpi). Since the alcohol consumption data is only up to 2016, we can adjust the 2018 GDP numbers by inflation to get an approximate estimtion of 2016 GDP numbers.
ru_gdp_regional['GDP'] = np.round(np.divide(ru_gdp_regional['GDP'],1.035**2),0)
ru_gdp_regional.head(3)
ru_gdp_regional['rgdp'] = (np.divide(
ru_gdp_regional['GDP'],
ru_gdp_regional.query('Region == "Saint Petersburg"')['GDP'].max())
)
(alt.Chart(ru_gdp_regional).mark_bar()
.encode(x='Region:N',
y = 'rgdp:Q',
tooltip=['Region','GDP','rgdp'],
color = alt.condition("datum.rgdp < 1 ",
alt.value("steelblue"),
alt.value("orange")))
.properties(width=900, title = 'Fig 18: Relative adjusted GDP of differrent regions of Russia')
.interactive()
)
yr_median = pd.DataFrame(df1.set_index('year')['2014':].groupby('region')['total'].median())
ru_gdp_regional=ru_gdp_regional.set_index('Region').join(yr_median).reset_index()
ru_gdp_regional= ru_gdp_regional.dropna()
ru_gdp_regional.head(3)
regional_gdp_chart = (alt.Chart(ru_gdp_regional)
.mark_circle(size=60)
.encode(
x='GDP',
y='total',
tooltip=['Region','total','GDP'],
color = alt.condition("datum.Region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")))
)
box_lin = alt.Chart(ru_gdp_regional).mark_boxplot().encode(x='GDP', tooltip=['Region']).properties(height = 100,title = 'Distribution of GDP w Outliers')
combo1 = (regional_gdp_chart + regional_gdp_chart.transform_regression('GDP', 'total', method='log').mark_line() & box_lin).properties(title='Fig 19 Median Comsumption vs GDP with Outliers')
ru_gdp_regional.sort_values('GDP',ascending=False).iloc[:3]
regional_gdp_chart2 = (alt.Chart(ru_gdp_regional.sort_values('GDP',ascending=False).iloc[3:])
.mark_circle(size=60)
.encode(
x='GDP',
y='total',
tooltip=['Region','total','GDP'],
color = alt.condition("datum.Region == 'Saint Petersburg'",
alt.value("orange"),
alt.value("steelblue")))
)
box_lin2 = alt.Chart(ru_gdp_regional.sort_values('GDP',ascending=False).iloc[3:]).mark_boxplot().encode(x='GDP', tooltip=['Region']).properties(height = 100, title = 'Distribution of GDP w/o Outliers')
combo2 = (regional_gdp_chart2 + regional_gdp_chart2.transform_regression('GDP', 'total', method='log').mark_line() & box_lin2).properties(title='Fig 20 Median Comsumption vs GDP without Outliers')
(combo1 | combo2).configure_view(strokeWidth=0)
print("The correlation coefficient between regional GDP and median alcohol consumption across differrent regions is",round((ru_gdp_regional.iloc[6:].total).corr(np.log(ru_gdp_regional.iloc[6:].GDP)),2))
Observations:
- Fig 18 compares the regional GDP numbers relative to Saint Petersburg. There are 11 regions that have GDP equal to or more than that of Saint Peterburg's.
- Fig 19 and 20 show scatterplot of GDP and median alcohol consumption for year 2016 for 80 regions in Russia. The boxplot at the bottom of the scatterplot show the distribution of GDP and presence of outliers. For example the left scatterplot shows there are there are at least 2 significant outliers. Fig 20 shows the same plot with outliers removed. Orange circle is Saint Petersburg. The log linear model similar to one used earlier also fits very well in this case, showing overall the GDP vs. consumption model holds good even at regional level.
- The correlation coefficient between regional GDP and alcohol consumption is 0.52, showing moderate correlation. There could be other factors such as gender mix, inflation, personal income, supply chain factors, government policies etc that may influence consumption not just GDP.
- The significance of this is that regional GDP should be considered as one of the factors to determine the regions that are similar to Saint Petersburg. It's definitely not the only factor but could be one of the main factors based on the analyses above.
def get_ccf(s1,s2,lag, topN):
ccf_dict = {}
for r in list(nonstp.region.unique()):
s2 = (nonstp.query('region == @r')['total'].values)
ccf = sm.tsa.stattools.ccf(stp_total, s2)[lag]
ccf_dict[r]=[ccf]
return pd.DataFrame(ccf_dict).T.nlargest(topN, [0])
stp_total = df1.query('region == "Saint Petersburg"')['total'].values
nonstp = df1[['year','region','total']].query('region != "Saint Petersburg"')
get_ccf(stp_total,nonstp, lag=0,topN=20).plot.barh(figsize=(12,8), title='Fig 21: Top 20 regions by cross-correlation', label=False);
(pd.Series(
sm.tsa.stattools.ccf(stp_total, df1.query('region == "Republic of Dagestan"')['total'].values))[:12]
.plot.bar(title='Fig 22:Cross Correlation: Saint Peterburg vs. Republic of Dagestan', figsize=(12,6))
);
Observations:
- Fig 21 shows the top 20 regions that have high cross correlation coefficient with Saint Petersburg. Region with highest correlation by total consumption is Republic of Dagestan. Only the first lag is considered in this figure.
- In fig 22, for republic of Dagestan, first 12 lags are plotted. It shows that first and second lags are quite significant. This implies total cosumption in Dagestan from two years ago can be used to predict the alcohol consumption in Saint Petersburg. Although that doesn't help practically, we at least know now which regions are similar to Saint Petersburg based on total consumption.
Clustering
Time series data can be clustered using various ways:
-
Distance based clustering : Instead of using euclidean distance, we can use Dynamic Time Warping (DTW) to find similarity between two time series. This is a shape based approach meaning the two series are considered similar if their shapes match based on DTW metric.
-
Functional data analysis: In this case, the basis function of the time series is obtained, and time series are clustered based on the similarity of the function.
-
Feature based : Features or characteristics of the time series based on its temporal structure are calculated and then those are used for clustering.
The main downside of the first two methods is that they do not provide much explainable insights into why and how the series are clustered. We will use the feature based approach instead. In fetaure based approach, various metrics such as mean, mode, skewness, lags, ACF, PACF etc. along with many other metrics are used. In this case, we will use catch22 features which have been shown to capture temporal structure of the series adequately without sacrificing the accuracy. Ref.
From our initial analysis we know that, in Saint Petersburg:
- Overall alcohol consumption has decreased, especially in the last ~5 years
- Consumption of beer has decreased drastically
- There is an uptick in wine consumption and decrease in vodka consumption
- Although more weak alcohol is consumed, preference for hard liquer has been increasing
- Macroeconomic factors such as GDP influence alcohol consumption
We want to capture these features in the time series and find regions that show similar characteristics.
Modeling methodology:
- Combine consumption of alcohols in two groups based on alcohol content - hard (vodka + brandy) and week (beer + wine + champagne)
- Obtain 'catc22' features for both of these alcohols types
- Cluster the data using K-means algorithm
- Identify regions that are common in both clusters
df1_l_hard = df1[['region','hard']].set_index('region')
df1_pivot_hard = df1[['year','region','hard']].pivot(index='year',columns='region')
c22_h = pd.DataFrame()
for region in set(df1.region.unique()):
c22_h[region] = catch22_all(df1_l_hard.loc[region]['hard'].values)['values']
c22_h.index = catch22_all(df1_l_hard.loc[region]['hard'].values)['names']
## Time series features for hard alcohols
c22_hard = c22_h.T
c22_hard
df1_l_wk = df1[['region','weak']].set_index('region')
df1_pivot_wk = df1[['year','region','weak']].pivot(index='year',columns='region')
c22_w = pd.DataFrame()
for region in df1.region.unique():
c22_w[region] = catch22_all(df1_l_wk.loc[region]['weak'].values)['values']
c22_w.index = catch22_all(df1_l_wk.loc[region]['weak'].values)['names']
#Time series features for weak alcohols
c22_weak= c22_w.T
c22_weak
kmeans1 = KMeans(n_clusters=4, random_state=1).fit(c22_hard)
kmeans2 = KMeans(n_clusters=4, random_state=1).fit(c22_weak)
c22_final = pd.DataFrame()
c22_final['regions'] = list(c22_weak.index)
c22_final['hard_cluster'] = kmeans1.labels_
c22_final['weak_cluster'] = kmeans2.labels_
c22_final
c22_final.query('regions == "Saint Petersburg"')
h__ = int(c22_final.query('regions == "Saint Petersburg"')['hard_cluster'].values)
w__ = int(c22_final.query('regions == "Saint Petersburg"')['weak_cluster'].values)
Now let's get similar regions for both alcohol types.
c22_final.query('hard_cluster == @h__')['regions']
c22_final.query('weak_cluster == @w__')['regions']
#Find regions that are common in both hard & weak alcohol clusters
similar = set(c22_final.query('weak_cluster == @w__')['regions'].values).intersection(set(c22_final.query('hard_cluster == @h__')['regions'].values))
similar
Out of 81 regions, 10 regions show patters that show trends for hard & weak alcohols that are similar to that of Saint Petersburg. Let's look at those closely. You can in the charts below, feature based clustering method identified regions that follow similar time series pattern.
for r in similar:
df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "beer"').set_index('year').query('region == @r')['Consumption'].plot()
for r in similar:
df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "wine"').set_index('year').query('region == @r')['Consumption'].plot()
for r in similar:
df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "vodka"').set_index('year').query('region == @r')['Consumption'].plot()
Conclusion:
- It is recommended to run the next campaign in following 10 regions: 'Altai Krai', 'Bryansk Oblast', 'Chelyabinsk Oblast', 'Irkutsk Oblast', 'Novosibirsk Oblast', 'Penza Oblast', 'Sverdlovsk Oblast', 'Tomsk Oblast', 'Tyumen Oblast', 'Yamalo-Nenets Autonomous Okrug'
-
In the analysis it was found that consumers in Saint Petersburg have been consuming less alcohol compared to the past, especially beer. They have also been preferring wine over vodka in the last 3 years. It was also found that although the consumers have been drinking less alcohol, the preference towards hard alcohols (vodka & brandy) has been picking up over weak alcohols (beer, wine and champagne)
-
To find regions similar to that of Saint Petersburg, feature based time series clustering technique was used to find regions that show similar trends for both hard and weak alcohols.
-
Analysis showed that macroeconomic factors play an important rule in alcohol consumption. We recommend that the company assess macroeconomics (CPI, GDP, international markets etc.) to decide when to launch the campaign. We could not include GDP data in clustering due to unavailability of data for all regions, but we recommend carrying out additional clustering with inclusion of GDP to find similar regions.
Limitations:
-
I did not focus on any geospatial analysis. Russia is a vast country and many regions operate autonomously so there are likely to be strong differrences among some regions despite being similar in alchol volume consumption.
-
I have made assumptions regarding GDP numbers
-
One limitation of K-Means is its sensitivity to initialization. Although I spent some time tuning parameters, I did not include it in the analysis here to keep it short. In my real project, I used GMM instead of K-Means.
-
Catch22 is one of the many libraries that allow time series feature extraction. You can also take a look at
tsfresh
andtsfeat