Overview

The goal of this notebook is to show how to do exploratory data analysis of time series data and use feature based clustering method to identify attributes that are similar in time domain. The novelty of this based is, instead of relying on shape of the time series, feature based approach enables more interpretable solution. I used some of these techniques in a real project so I wanted to showcase this for my own future reference.

The data here is alcohol consumption across 80+ regions of Russia. I am not sure if this is real data or not. Hypothetically, an alcohol company has been operating in St. Petersburg. Their campaign was successful in St.Petersburg so we want to identify other regions where they can potentially run similar campaigns. We have alcohol consumption for differrent alcohol types per capita from 1998 to 2016.

I don't know anything about Russia's geography or economy, so I will make reasonable assumptions in my analysis.

Importing Libraries

#!pip install altair

#!pip install catch22
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

import seaborn as sns
import statsmodels.api as sm
import scipy.stats as st
import statsmodels.stats.api as sms
from sklearn.metrics import r2_score
from catch22 import catch22_all
from sklearn.cluster import KMeans 


alt.data_transformers.disable_max_rows()


import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()
year region wine beer vodka champagne brandy
0 1998 Republic of Adygea 1.9 8.8 3.4 0.3 0.1
1 1998 Altai Krai 3.3 19.2 11.3 1.1 0.1
2 1998 Amur Oblast 2.1 21.2 17.3 0.7 0.4
3 1998 Arkhangelsk Oblast 4.3 10.6 11.7 0.4 0.3
4 1998 Astrakhan Oblast 2.9 18.0 9.5 0.8 0.2

Data Quality

It is important that before we analyze the data, we should check the quality of the data. If there are any missing values, outliers, unusual observations etc. those should be treated before analysing the data.

Null Values

df.isna().any()
year         False
region       False
wine          True
beer          True
vodka         True
champagne     True
brandy        True
dtype: bool

Wine, beer, vodka, champagne and brandy have null values. Let's find if any of the columns have more than 5% null values (i.e in this at least 1).

# Which region has null values?
null_regions = []
print("% of Values missing \n========================================")
for r in df.region.unique():
    
    nulls = (df.query('region == @r').isna().mean())
    if nulls.max() > 0.05:
        print(r,",",nulls.idxmax(), ":", 100 * round(nulls.max(),2),"%")
        null_regions.append(r)
% of Values missing 
========================================
Republic of Ingushetia , brandy : 79.0 %
Republic of Crimea , wine : 84.0 %
Sevastopol , wine : 84.0 %
Chechen Republic , wine : 100.0 %

There are four regions with more than 79% of the values missing. If few values were missing, we could have used interpolation methods to impute the missing values, but in this we don't have enough values to do so. Let's look at which values are missing.

#alcohol types
alcohols = list(df.columns[2:])                                   

#Find number of missing values and location of the missing values each of the 4 regions
null_regions_df = (df[df['region']
 .isin(null_regions)]
 .sort_values(by=['region','year'])
 .groupby(['year','region'])
 .apply(lambda x : x.isna().sum())
 .drop(['year','region'],axis=1)
 .assign(nulls = lambda x: x.sum(axis=1))
 .reset_index()
)
null_regions_df.head(5)
year region wine beer vodka champagne brandy nulls
0 1998 Chechen Republic 1 1 1 1 1 5
1 1998 Republic of Crimea 1 1 1 1 1 5
2 1998 Republic of Ingushetia 1 1 1 1 1 5
3 1998 Sevastopol 1 1 1 1 1 5
4 1999 Chechen Republic 1 1 1 1 1 5
plt.title("Fig 1: Heatmap showing number of alcohol types missing by year and region \n") 

(
sns.heatmap    
     (pd.crosstab(
      null_regions_df['region'],
      null_regions_df['year'],
      values = null_regions_df['nulls'],
      aggfunc='max'
    
      ),
      cmap="magma_r", 
      annot=True, 
      cbar=False, 
      square=True
     )
);    

Observations:

In Fig 1., the heatmap shows, out of the 5 alcohol types, how many types have missing values.

  1. Crimea and Sevastopol has values missing for all 5 alcohol types for years 1998 to 2013 and none missing for the last 3 years. This suggests that likely the data for first 16 years either wasn't collected or wasn't available.
  2. Chechen Republic has values missing except for one alcohol type in the last 3 years
  3. Ingushetia has missing values at random

Since our goal is to compare the regions using alcohol consumption data for all years, it won't be possible to include these 4 regions in to the analysis. We will drop all these 4 regions from further analysis.

Duplicate Values

#Are there any duplicate values? 


if max((df[~df.region.isin(null_regions)].region.value_counts())) == min((df[~df.region.isin(null_regions)].region.value_counts())):
    print("All Regions have 19 years of data")
else:
    print("Some of the regions of more/less data")
All Regions have 19 years of data

None of the regions have duplicate data across all years

Unusual Data

For all alcohol types we will look at min and max values to see if there are any unusual values such as 0 or negative consumption.

#Are there any observations with 0 or negative values? Consumption can be 0 but not negative

df[~df.region.isin(null_regions)].describe().loc['min']
year         1998.0
wine            0.1
beer            1.0
vodka           0.4
champagne       0.1
brandy          0.0
Name: min, dtype: float64

There is at least one region with 0 lts per capita consumption of brandy. Let's look at that closely.

df[~df.region.isin(null_regions)].query('brandy == 0')
year region wine beer vodka champagne brandy
565 2004 Tuva Republic 2.1 25.8 15.2 0.2 0.0

In 2004, Tuva Republic did not record any consumption of brandy.

(df.query('region == "Tuva Republic"')
 .set_index('year')['brandy']
 .plot(title="Fig 2: Brandy consumption for Tuva Republic \n 0 consumption in 2004", legend=True)
)


(plt.axhline(df.query('region == "Tuva Republic"')
             .set_index('year')['brandy']
             .median(), color='red')
 
);

In 2004, the brandy consumption in Tuva dropped from 0.1 ltr to 0. The median consumption over the entire period was 0.2 ltr. So, considering overall low consumption, 0 does not seem like an outlier. We will keep the value as is.

# Max values
df[~df.region.isin(null_regions)].describe().iloc[:,1:]
wine beer vodka champagne brandy
count 1539.000000 1539.000000 1539.000000 1539.000000 1539.000000
mean 5.637544 51.722190 11.902404 1.315172 0.524185
std 2.811555 25.115577 5.078808 0.798462 0.399331
min 0.100000 1.000000 0.400000 0.100000 0.000000
25% 3.550000 32.700000 8.400000 0.800000 0.200000
50% 5.400000 50.300000 11.600000 1.200000 0.400000
75% 7.400000 67.500000 15.000000 1.660000 0.700000
max 18.100000 207.300000 40.600000 5.560000 2.300000

Except for beer, the max values for all other alcohols seem reasonable. In case of beer, the 75% value is 67.5 lts and max is 207, which is 5 standard deviations away from 75th percentile. We need to understand which region has this value and how should we treat it.

Below, we will drop the regions with null values and create a new dataframe with percentage share of each alcohol type for the given year and region.

df1 = df[~df.region.isin(null_regions)]
df1['total']=(df1[alcohols].sum(axis=1))
df1 = (df1
       .assign(winepct = df1['wine']/df1['total'])
       .assign(beerpct = df1['beer']/df1['total'])
       .assign(vodkapct = df1['vodka']/df1['total'])
       .assign(champct = df1['champagne']/df1['total'])
       .assign(brandypct = df1['brandy']/df1['total'])
       .assign(year = pd.to_datetime(df1.year,format='%Y'))
       .round(3)

       
      )
df1.head(5)
year region wine beer vodka champagne brandy total winepct beerpct vodkapct champct brandypct
0 1998-01-01 Republic of Adygea 1.9 8.8 3.4 0.3 0.1 14.5 0.131 0.607 0.234 0.021 0.007
1 1998-01-01 Altai Krai 3.3 19.2 11.3 1.1 0.1 35.0 0.094 0.549 0.323 0.031 0.003
2 1998-01-01 Amur Oblast 2.1 21.2 17.3 0.7 0.4 41.7 0.050 0.508 0.415 0.017 0.010
3 1998-01-01 Arkhangelsk Oblast 4.3 10.6 11.7 0.4 0.3 27.3 0.158 0.388 0.429 0.015 0.011
4 1998-01-01 Astrakhan Oblast 2.9 18.0 9.5 0.8 0.2 31.4 0.092 0.573 0.303 0.025 0.006

Exploratory Data Analysis

To better understand the pattern in the data and compare the regions quickly over the 19 year period, decile heatmap as show in Fig 3 is created. In this visual, for each year, the total alcohol conumption is binned into 10 bins (i.e deciles). For example, the region with the highest consumption will be in 10th decile while the region with the lowest consumption will be in the 1st decile. The lighter color indicates high decile (and high consumption) while dark color shows low consumption. The regions are sorted by total consumption so the regions with high consumption are at the top.

plt.figure(figsize = (10,22))
plt.title("Fig 3: Decile Heat Map \n Shows distribution of binned alcohol consumption by year and region \n Lighter color: High consumption, darker color: low consumption", fontsize =14)


sns.heatmap(df1[['year','region','total']]
 .assign(year=df1.year.dt.year)
 .set_index(['year','region'])
 .groupby(level=0)
 .transform(lambda x: pd.qcut(x, q = np.linspace(0,1,11), labels=False))
 .unstack(level=0)
 .assign(t = lambda x: x.sum(axis=1))
 .sort_values(by='t', ascending=False)
 .drop('t',axis=1),
            cmap='magma',cbar_kws={"orientation": "horizontal"});    

Observations:

  1. Moscow has light color all throughout the 19 years showing it had the highest alcohol consumption continuoulsy for all years.
  2. Saint Petersburg is towards the top, thus one of the regions with high alcohol consumptions
  3. Saint Petersburg was in high deciles until 2010 and then the alcohol sumption dropped drammatically in the last 5-6 years.
  4. A visual comparison shows Chelyabinsk Oblask, Yaroslavi Oblast, Perm also show a similar trend as that of Saint Petersburg
  5. Some regions such as Karaychay, Dagestan, Chukotka, Adyega etc which are at the bottom, had very low and flat trend in consumption compared to Saint Petersburg
  6. Some regions such as Tver, Nizhny, Sakhlin showed pick up in consumption in later years, opposite to Saint Petersburg

Let's compare the consumption further by looking at box plot and by individual alcohol type.

df1_melt = df1.iloc[:,:7].melt(['year','region'],var_name='Alcohol', value_name='Consumption')
df1_melt.head(3)
year region Alcohol Consumption
0 1998-01-01 Republic of Adygea wine 1.9
1 1998-01-01 Altai Krai wine 3.3
2 1998-01-01 Amur Oblast wine 2.1

Trend Analysis:

Fig 4 shows distribution of total alcohol consumption for each of the year and by region. Values that are within the box plot are within the inter-quartile range ,where as values outside are potential outliers. The thick blue line shows the median consumption over the period. Light orange color line shows total consumption of Saint Petersburg for comaprison purposes. Each dot in fig 5 is a region. You can hover over the circle to get breakdown of consumption by alcohol type, volume and percentage.

Fig 5 shows the consumption for each selected region over the 19 year period. Select the region fro teh dropdown menu to get the consumption by type.

You can zoom-in/out and pan to explore the data in fig 4 & 5.

median_line = alt.Chart(df1).mark_line().encode(

    x = 'year:T',
    y = 'median(total):Q'
).interactive()

total_box = alt.Chart(df1).mark_boxplot(opacity=0.3).encode(

    x = 'year:T',
    y = 'total:Q',
    
).interactive()

total_scatter = alt.Chart(df1).mark_circle(opacity=0.8).encode(

    x = 'year:T',
    y = 'total:Q',
    color='region',
    tooltip = list(df1.columns)).interactive()
    
total_line = alt.Chart(df1).mark_line(opacity=0.1).encode(

    x = 'year:T',
    y = 'total:Q',
    color='region',
    tooltip = list(df1.columns),
    strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
    
).interactive()

total_line_stp = alt.Chart(df1.query('region == "Saint Petersburg"')).mark_line(opacity=0.4, strokeDash=[1,1]).encode(

    x = 'year:T',
    y = 'total:Q',
    color=alt.value("#FFAA00"),
    strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
    
    tooltip = list(df1.columns)    
    
).interactive()

# (total_scatter + total_line_stp).properties(width=700, height = 400, title = "Total Alcohol consumption for all regions vs. Year")
top = (total_scatter + total_box + median_line + total_line + total_line_stp).properties(width=700, height = 300, title = "Fig 4: Total Alcohol consumption for all regions vs. Year")


region_dropdown = alt.binding_select(options= sorted(list(df1_melt.region.unique())))
region_select = alt.selection_single(fields=['region'], bind=region_dropdown, name="Select")


bottom = (alt.Chart(df1_melt)
 .mark_line().encode(alt.X('year:T'),  
                     alt.Y('mean(Consumption):Q', scale=alt.Scale(domain=[0,210])), 
                     color = 'Alcohol:N', 
                     tooltip=['year','Consumption','Alcohol'])
).add_selection(
    region_select
).transform_filter(
    region_select
).properties(width=700, height = 200, title = "Fig 5:Select Region: Alcohol consumption by Year, Alcohol Type ")


(top & bottom.interactive())

Observations:

Above plot shows the distribution of total consumption over the years for all non-null regions. Some observations:

  1. Total alcohol consumption in Russia increased from 1998 to 2008, levelled off over the next 4 years and then started declining after 2012. Global recession started in ~2008. It is highly likely that that may have affected the consumption.

  2. The box plot shows the distribution of total alcohol consumption for all regions. Overall, the the total consumption for all regions was within the interquartile range, except in years 2004-2006 for Zabalkalsky Krai (three blue dots in the chart that are away from all other points). For Zabaykalsky Krai, the consumption was significantly higher from 2004 to 2006. Selecting Zabaykalsky Krai in fig 5 shows thatthe beer consumption in 2006 was 207 lts, which is the outlier we found in the analysis above. Interestingly, consumption of vodka and other alcohols also increased over that period indicating it's not an error in values. The consumption dropped precipitously after 2006. It stayed the same before, during and after the global recession, contrary to overall trend in Russia.

  3. Moscow shows trend similar to overall trend, while Moscow Oblast is exactly opposite. Beer consumption of Moscow Oblast increased rammatically in 2006 and was not affected by the Recession.

  4. Selecting Saint Petersburg in fig 5 shows that beer consumption increased steadily from 1998 to 2006 and dropped significantly thereafter and continued its decline.

  5. Saint Petersburg is one of the most visted cities of Russia and tourism is a big part of the economy. Global recession in 2008 might have led to the decline tourism and thus copnsumption of alcohol.

  6. There is no seasonality observed in this data and data are non-stationary.

Comparing Saint Petersburg with Rest of Russia

Fig 6 and 7 show breakdown of alcohol consumption by volume and percentage for all of Russia. Fig 8 & 9 show similar metrics for Saint Petersburg. This will help us compare the trends and identify insights. All 4 charts are zoomable and can be panned. Fig 10 shows the alcohol consumption for Saint Petersburg by rank order.

total_n = alt.Chart(df1_melt).mark_area(size=20).encode(
    y=alt.Y('sum(Consumption)', stack="normalize", axis=alt.Axis(title='%Alcohol consumed')),
    x='year:T',
    color='Alcohol',
    tooltip=['sum(Consumption):Q','year']
    
).interactive().properties(title = 'Fig 7: All Regions:%Alcohol Consumption by Type & Year')

total_z = alt.Chart(df1_melt).mark_area(size=20).encode(
    y=alt.Y('sum(Consumption)', stack="zero", axis=alt.Axis(title='Alcohol consumed by vol')),
    x='year:T',
    color='Alcohol',
    tooltip=['sum(Consumption):Q','year']
    
).interactive().properties(title = 'Fig 6: All Regions:Total Alcohol Consumption by Type & Year by vol')

total_n_stp = alt.Chart(df1_melt.query('region =="Saint Petersburg"')).mark_area(size=20).encode(
    y=alt.Y('sum(Consumption)', stack="normalize", axis=alt.Axis(title='%Alcohol consumed')),
    x='year:T',
    color='Alcohol',
    tooltip=['sum(Consumption):Q','year']
    
).interactive().properties(title = 'Fig 9:St Petersburg: %Alcohol Consumption by Type & Year')

total_z_stp = alt.Chart(df1_melt.query('region =="Saint Petersburg"')).mark_area(size=20).encode(
    y=alt.Y('sum(Consumption)', stack="zero", axis=alt.Axis(title='Alcohol consumed by volume')),
    x='year:T',
    color='Alcohol',
    tooltip=['sum(Consumption):Q','year']
    
).interactive().properties(title = 'Fig 8: St Petersburg: Total Alcohol Consumption by Type & Year by vol')

(  total_z | total_n) & (  total_z_stp | total_n_stp)
(df1[['wine','beer','vodka','champagne','brandy','region','year']]
 .query('region == "Saint Petersburg"')
 .drop('region',axis=1)
 .set_index('year')
 .rank(axis=1, ascending=False)
 .astype('int')
 .plot(figsize=(10,6), 
       label=True, 
       title="Fig 10: Alcohol Consumption Rank: Saint Petersburg \n Beer is #1, Wine took over Vodka in 2013 as #2")
 
);
df1['weak'] = df1['beer'] + df1['wine'] + df1['champagne']
df1['hard'] = df1['vodka']  + df1['brandy']
df1['hardpct'] = df1['hard']/df1['total']
df1['weakpct'] = df1['weak']/df1['total']


(alt.Chart(df1.query('region == "Saint Petersburg"')[['year','hard','weak']].melt(['year'], var_name='type'))
 .mark_area()
 .encode(
     x='year:T', 
     y= alt.Y('sum(value)', stack='normalize'), 
     color='type',
     tooltip=['year','type','value'])
 
).properties(title='Fig 11: Hard vs. Weak alcohol by % for Saint Petersburg')

Observations:

1. Total consumption: Overall consumption of alcohol per capita for Russia and Saint Petersburg follow similar trends. Both show decline in consumption in the last 5-10 years but at differrent rates. The decline started much earlier (2009) in Saint Petersburg while it remained steady from 2009-2012 for Russia and then started decreasing. Saint Peterberg's consumption dropped by ~70% compared to only 25% for Russia. Saint Peterburg accounted for 2% of total consumption at its peak in 2009 and only 0.08% in 2016.

2. Beer Consumption: Beer consumption decreased by volume across Russia and Saint Petersburg in the last few years. But interestingly, while the overall consumption dropped in volume, preference for beer increased from ~ 55% in the late 1990s to ~ 75% by 2016. In contrast, consumers in Saint Petersburg always preferred beer in the 1990s (~63%), it grew to 87% at its peak in 2009 and dropped to 67% by 2016. While beer was still the preferred beverage, its preference has been decreasing in Saint Petersburg in constrast to the national average. My hypothesis is that tourism is a big part of Saint Petersburg's economy, so recession in 2008-09 may have caused reduction in tourism and consumption of beer.

3.Vodka: Vodka is synonymous with Russia and yet vodka isn't the most consumed alcohol. Consumption of alcohol decreased in Russia and Saint Petersburg. Vodka remained the second most consumed alcohol in Saint Petersburg until 2013. As show in fig 10, vodka became the third most consumed alcohol by volume.

4.Wine: Preference for wine over vodka steadily increased in the first 16 years and then took over vodka as the second most consumed alcohol by volume in Saint Petersburg. While some reasearch has been done to identify the reasons, it's not conclusive. It's postulated that women prefer wine over beer and vodka so perhaps change in Saint Petersburg's demographic mixture may have led to wine's popularity but not much data is available to confirm that. As show in fig 10, wine became more popular than vodka in 2014. Consumers in Saint Petersburg drink 5% more wine per capita than national average (14% vs. 9%).

5. Champagne & Brandy: These two are lumped together because historically these two alcohols haven't been consumed much, neither in Russia nor in Saint Petersburg. For example, in 2016 Saint Petersburg consumers drank 41 ltr beer on average and only 3.2 lts champagne and brandy. Although the percentage of these two has been increasing, at the expense of beer, but still the volumes are very low.

6. Hard vs. weak alcohol: We can group the alcohols by the percentage of alcohol present in them. Vodka, brandy contain ~40% alcohol whereas beer, champagne & wine contain 4-12% alcohol. If we group the alcohols this way as hard vs. weak based on alcohol content, we see a differrent picture. As shown in fig 11, although by volume the amount of hard alcohol consumed is low, by percentage consumers in Saint Petersburg are actually preferring hard over weak alcohols at a rapid pace.

Saint Petersburg Vs. Other Regions:

We compared Saint Petersburg against the national average. Now let's compare Saint Petersburg against the regions over two differrent periods by median alcohol consumption. As mentioned in the observations above (and fig 10), Saint Peterburg's alcohol consumption pattern changed in 2014. To put it in perspective, fig 12-15 compare Saint Petersburg's median alcohol consumption over two differrent periods, 1998-2013 and 2013-2016.

Fig 12: Median consumption of wine vs. beer from 1998-2013. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows vodka consumption.

Fig 13: Median consumption of wine vs. beer from 2014-2016. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows vodka consumption.

Fig 14: Median consumption of wine vs. vodka from 1998-2013. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows beer consumption.

Fig 15: Median consumption of wine vs. beer from 2014-2016. Each circle is a region, orange circle is Saint Petersburg. Size of the circle shows beer consumption.

All figures are zoomable and pannable and have tooltips.

L3Y_mean = df1.set_index('year').loc['2014':].iloc[:,:4].groupby('region').median().reset_index()
L16Y_mean = df1.set_index('year').loc[:'2013'].iloc[:,:4].groupby('region').median().reset_index()

(alt.Chart(L16Y_mean).mark_circle(size=100).encode(

    alt.X('beer', scale=alt.Scale(domain=[0,150])),
    alt.Y('wine',scale=alt.Scale(domain=[0,14])),
    tooltip=['region','wine','beer', 'vodka'],
    color = alt.condition("datum.region == 'Saint Petersburg'",
            alt.value("orange"),  
            alt.value("steelblue")),
    size='vodka'

  ).properties(title ='Fig 12:Median Consumption Wine vs. Beer (1998-2013)')).interactive()   | (alt.Chart(L3Y_mean).mark_circle(size=100).encode(

    alt.X('beer', scale=alt.Scale(domain=[0,150])),
    alt.Y('wine', scale=alt.Scale(domain=[0,14])),
    tooltip=['region','wine','beer','vodka'],
    color = alt.condition("datum.region == 'Saint Petersburg'",
            alt.value("orange"),  
            alt.value("steelblue")),
    size='vodka'

).properties(title ='Fig 13:Median Consumption Wine vs. Beer (2014-2016)')).interactive()
(alt.Chart(L16Y_mean).mark_circle(size=100).encode(

    alt.X('wine', scale=alt.Scale(domain=[0,20])),
    alt.Y('vodka',scale=alt.Scale(domain=[0,22])),
    tooltip=['region','wine','beer', 'vodka'],
    color = alt.condition("datum.region == 'Saint Petersburg'",
            alt.value("orange"),  
            alt.value("steelblue")),
    size='beer'

).properties(title ='Fig 14:Median Consumption Wine vs. Vodka(1998-2013)')).interactive() |((alt.Chart(L3Y_mean).mark_circle(size=100).encode(

    alt.X('wine', scale=alt.Scale(domain=[0,20])),
    alt.Y('vodka', scale=alt.Scale(domain=[0,22])),
    tooltip=['region','wine','beer','vodka'],
    color = alt.condition("datum.region == 'Saint Petersburg'",
            alt.value("orange"),  
            alt.value("steelblue")),
    size='beer'

).properties(title ='Fig 15:Median Consumption Wine vs. Vodka(2014-2016)')).interactive())

Observations:

  1. In fig 12, we can clearly see beer consumption in Saint Petersburg was the highest on average over the 16 year period. It's the only region with beer consumption over 100 ltr per person on average ! There are more regions in the 40-60 lts range.Size of the circle also shows, all regions cosumed roughly similar amount of vodka. If we compare that with fig 12, not only beer but vodka consumption also reduced significantly. Moscow Oblast consumers started drinking more beer compared to national average. The shape of the scatterplot is also interesting. Circle in fig 12 are dispersed horizontally whereas circles in fig 13 are distributed vertically more, showing shift towards wine. Also, if you can imagine a line of best fit in these two scatterplots, in fig 12 the beer and wine preference have strong correlation, but in fig 13 that correlation seems weaker. Saint Petersburg is around in the middle of the scatter plot vertically, showing wine consumption is around the national average range.
  1. In fig 14 & 15, the vertical shift in circle can be easily observed, showing reduction in vodka consumption. Saint Petersburg is also towards the top third in wine consumption in 2014-2016 period.

Influence of Macroeconomic Factors

Research has shown that macroeconomic factors play a role in alcohol consumption pattern. One of the ways to gauge macroeconomic factor is by using gross domestic product (GDP) o fthe country. GDP is a measure of monetary value of goods produced in the country ref. We can use GDP of Russia from 1998 to 2016 and try to see if GDP has any correlation with consumption of alcohol. Data for GDP numbers is taken from here.

russia_gdp = pd.read_table('https://raw.githubusercontent.com/pawarbi/datasets/master/russia_gdp').sort_values(by='Year ')

russia_gdp.columns = russia_gdp.columns.str.replace(' ','').str.replace(' ','').str.lower()
russia_gdp['percapita'] = pd.to_numeric([x.replace('$','').replace(',','').rstrip().lstrip() for x in russia_gdp['percapita']])


russia_gdp['year']=pd.to_datetime(russia_gdp['year'],format='%Y')
russia_gdp.set_index('year',inplace=True)


russia_gdp = russia_gdp['1998':'2016']
russia_gdp.head(3)
gdp percapita growthrate
year
1998-01-01 $270.96B 1835 -5.30%
1999-01-01 $195.91B 1331 6.40%
2000-01-01 $259.71B 1772 10.00%
russia_gdp2 = ((df1[['year','total']]
 .join(russia_gdp,on='year')
 .set_index('year')[['total','percapita']]
 .reset_index().groupby('year')['total','percapita']
 .mean()
 ))

russia_gdp2.head(3)
total percapita
year
1998-01-01 37.260988 1835.0
1999-01-01 43.042963 1331.0
2000-01-01 48.173210 1772.0
print("The correlation coefficient between total alcohol consumption and GDP per capita is: ",  round(russia_gdp2['total'].corr(russia_gdp2['percapita']),2))
The correlation coefficient between total alcohol consumption and GDP per capita is:  0.75
total_vs_gdp = (alt.Chart(russia_gdp2.reset_index())
 .mark_circle(size=60)
 .encode(
     x='percapita',
     y='total', 
     tooltip=['year','total','percapita'])
 .properties(title='Fig 16Total Alcohol Consumption per capita vs. GDP per capita (1998-2016)')

 
)

(total_vs_gdp + total_vs_gdp.transform_regression('percapita', 'total', method='linear').mark_line()).interactive() | (total_vs_gdp + total_vs_gdp.transform_regression('percapita', 'total', method='log').mark_line()).interactive()

Regression Analysis, GDP vs. Alcohol consumption

mod = sm.OLS(russia_gdp2.total.iloc[:-4], np.log(russia_gdp2.percapita.iloc[:-4]))
res = mod.fit()
print(res.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                  total   R-squared (uncentered):                   0.979
Model:                            OLS   Adj. R-squared (uncentered):              0.977
Method:                 Least Squares   F-statistic:                              646.3
Date:                Sat, 05 Feb 2022   Prob (F-statistic):                    4.07e-13
Time:                        02:49:34   Log-Likelihood:                         -56.716
No. Observations:                  15   AIC:                                      115.4
Df Residuals:                      14   BIC:                                      116.1
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
percapita      8.4538      0.333     25.422      0.000       7.741       9.167
==============================================================================
Omnibus:                        5.388   Durbin-Watson:                   0.126
Prob(Omnibus):                  0.068   Jarque-Bera (JB):                3.462
Skew:                          -1.176   Prob(JB):                        0.177
Kurtosis:                       3.063   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
russia_gdp2['pred'] = (res.predict(np.log(russia_gdp2.percapita)))
(russia_gdp2['total']
 .plot(legend=True, 
       label='Total Consumption', 
       title='Fig. 17: Actual vs. Predicted Total Consumption (All Regions)')
)
(russia_gdp2['pred']
 .plot(legend=True, 
       label='Predicted Consumption',
       color='red', 
       style='--',
       figsize=(10,6))
);

Observations:

  1. In fig 16, total alcohol consumption per year and the GDP numbers for Russia for that year are plotted. For visual comparison, two lines are fitted, the one on the left is a linear fit and the one on the right si with log of GDP. We can clearly see that log-linear model fits better visually.

  2. For a more accuarate analysis, linear regression model using log of GDP is fit to predict the consumption of alcohol. Adjusted R2 values is 97% with P value significantly less than the critical 5% threshold. P value of Jarque Bera test on residuals also shows that the residuals are normally distributed thus no bias in the model.

  3. This shows that macroeconomic conditions of the country play a significant role in alcohol consumption behavior of the consumers. The practical significance of that is before launching the marketing campaign across the nation, the company should asess the macroeconomic factors to maximize the success of the campaign and sales.

  4. In fig 17, the linear model is used to predict the total alcohol consumption. As we can see, although the numbers are very accurate, it captured the trend very well.

Regional GDP comparison

We are interested in regional pattern in alcohol consumption, so lets check regional GDP numbers and compare them with Saint Peterburg's GDP. We can use the regional GDP numbers from here

ru_gdp_regional = pd.read_table('https://raw.githubusercontent.com/sapawar4/datasets/main/russia_gdp_regional')
ru_gdp_regional.head(3)
Region GDP
0 Republic of Adygea 3790
1 Altai Krai 3730
2 Amur Oblast 6000
print("This data has GDP of", len(ru_gdp_regional), "regions of Russia, for year 2018")
This data has GDP of 80 regions of Russia, for year 2018

This is the GDP of differrent regions of Russia for year 2018. Since the alcohol data is from 1996-2016, we can't use these number directly but we can normalize the numbers by Saint Petersburg's GDP to get an idea of relative gdp. Regions with GDP >= 1 indicate GDP of those regions are equal to greater than that of Saint Petersburg's.

Russia's average inflation between 2016-2018 was ~3.5% (source: https://www.macrotrends.net/countries/RUS/russia/inflation-rate-cpi). Since the alcohol consumption data is only up to 2016, we can adjust the 2018 GDP numbers by inflation to get an approximate estimtion of 2016 GDP numbers.

ru_gdp_regional['GDP'] = np.round(np.divide(ru_gdp_regional['GDP'],1.035**2),0)
ru_gdp_regional.head(3)
Region GDP
0 Republic of Adygea 3538.0
1 Altai Krai 3482.0
2 Amur Oblast 5601.0
ru_gdp_regional['rgdp'] = (np.divide(
    ru_gdp_regional['GDP'],
    ru_gdp_regional.query('Region == "Saint Petersburg"')['GDP'].max())
                          )
(alt.Chart(ru_gdp_regional).mark_bar()
 .encode(x='Region:N', 
         y = 'rgdp:Q', 
         tooltip=['Region','GDP','rgdp'],
         color = alt.condition("datum.rgdp < 1 ",
          alt.value("steelblue"),  
          alt.value("orange")))
 .properties(width=900, title = 'Fig 18: Relative adjusted GDP of differrent regions of Russia')
 .interactive()

) 
yr_median = pd.DataFrame(df1.set_index('year')['2014':].groupby('region')['total'].median())
ru_gdp_regional=ru_gdp_regional.set_index('Region').join(yr_median).reset_index()

ru_gdp_regional= ru_gdp_regional.dropna()
ru_gdp_regional.head(3)
Region GDP rgdp total
0 Republic of Adygea 3538.0 0.305632 36.0
1 Altai Krai 3482.0 0.300795 55.9
2 Amur Oblast 5601.0 0.483846 65.4
regional_gdp_chart = (alt.Chart(ru_gdp_regional)
 .mark_circle(size=60)
 .encode(
     x='GDP',
     y='total', 
     tooltip=['Region','total','GDP'],
          color = alt.condition("datum.Region == 'Saint Petersburg'",
          alt.value("orange"),  
          alt.value("steelblue")))

 
)

box_lin = alt.Chart(ru_gdp_regional).mark_boxplot().encode(x='GDP', tooltip=['Region']).properties(height = 100,title = 'Distribution of GDP w Outliers')

combo1 = (regional_gdp_chart + regional_gdp_chart.transform_regression('GDP', 'total', method='log').mark_line() & box_lin).properties(title='Fig 19 Median Comsumption vs GDP with Outliers')
ru_gdp_regional.sort_values('GDP',ascending=False).iloc[:3]
Region GDP rgdp total
40 Nenets Autonomous Okrug 102985.0 8.896424 85.9
78 Yamalo-Nenets Autonomous Okrug 84613.0 7.309347 99.1
59 Sakhalin Oblast 35679.0 3.082153 103.2
regional_gdp_chart2 = (alt.Chart(ru_gdp_regional.sort_values('GDP',ascending=False).iloc[3:])
 .mark_circle(size=60)
 .encode(
     x='GDP',
     y='total', 
     tooltip=['Region','total','GDP'],
          color = alt.condition("datum.Region == 'Saint Petersburg'",
          alt.value("orange"),  
          alt.value("steelblue")))

 
)

box_lin2 = alt.Chart(ru_gdp_regional.sort_values('GDP',ascending=False).iloc[3:]).mark_boxplot().encode(x='GDP', tooltip=['Region']).properties(height = 100, title = 'Distribution of GDP w/o Outliers')

combo2 = (regional_gdp_chart2 + regional_gdp_chart2.transform_regression('GDP', 'total', method='log').mark_line() & box_lin2).properties(title='Fig 20 Median Comsumption vs GDP without Outliers')
(combo1 | combo2).configure_view(strokeWidth=0)
print("The correlation coefficient between regional GDP and median alcohol consumption across differrent regions is",round((ru_gdp_regional.iloc[6:].total).corr(np.log(ru_gdp_regional.iloc[6:].GDP)),2))
The correlation coefficient between regional GDP and median alchol consumption across differrent regions is 0.52

Observations:

  1. Fig 18 compares the regional GDP numbers relative to Saint Petersburg. There are 11 regions that have GDP equal to or more than that of Saint Peterburg's.
  1. Fig 19 and 20 show scatterplot of GDP and median alcohol consumption for year 2016 for 80 regions in Russia. The boxplot at the bottom of the scatterplot show the distribution of GDP and presence of outliers. For example the left scatterplot shows there are there are at least 2 significant outliers. Fig 20 shows the same plot with outliers removed. Orange circle is Saint Petersburg. The log linear model similar to one used earlier also fits very well in this case, showing overall the GDP vs. consumption model holds good even at regional level.
  1. The correlation coefficient between regional GDP and alcohol consumption is 0.52, showing moderate correlation. There could be other factors such as gender mix, inflation, personal income, supply chain factors, government policies etc that may influence consumption not just GDP.
  1. The significance of this is that regional GDP should be considered as one of the factors to determine the regions that are similar to Saint Petersburg. It's definitely not the only factor but could be one of the main factors based on the analyses above.

Cross Correlation Analysis

Cross correlation analysis allows us to compare two time series. It measures a degree of similarity between a time series and lagged verion of another series. This will help us find regions similar to Saint Petersburg based on total alcohol consumption.

def get_ccf(s1,s2,lag, topN):
    ccf_dict = {}
    for r in list(nonstp.region.unique()):
        s2 = (nonstp.query('region == @r')['total'].values)
        ccf = sm.tsa.stattools.ccf(stp_total, s2)[lag]
        ccf_dict[r]=[ccf]
        
    return pd.DataFrame(ccf_dict).T.nlargest(topN, [0])
stp_total = df1.query('region == "Saint Petersburg"')['total'].values
nonstp = df1[['year','region','total']].query('region != "Saint Petersburg"')
get_ccf(stp_total,nonstp, lag=0,topN=20).plot.barh(figsize=(12,8), title='Fig 21: Top 20 regions by cross-correlation', label=False);
(pd.Series(
    sm.tsa.stattools.ccf(stp_total, df1.query('region == "Republic of Dagestan"')['total'].values))[:12]
    .plot.bar(title='Fig 22:Cross Correlation: Saint Peterburg vs. Republic of Dagestan', figsize=(12,6))
);

Observations:

  1. Fig 21 shows the top 20 regions that have high cross correlation coefficient with Saint Petersburg. Region with highest correlation by total consumption is Republic of Dagestan. Only the first lag is considered in this figure.
  1. In fig 22, for republic of Dagestan, first 12 lags are plotted. It shows that first and second lags are quite significant. This implies total cosumption in Dagestan from two years ago can be used to predict the alcohol consumption in Saint Petersburg. Although that doesn't help practically, we at least know now which regions are similar to Saint Petersburg based on total consumption.

Clustering

Time series data can be clustered using various ways:

  1. Distance based clustering : Instead of using euclidean distance, we can use Dynamic Time Warping (DTW) to find similarity between two time series. This is a shape based approach meaning the two series are considered similar if their shapes match based on DTW metric.

  2. Functional data analysis: In this case, the basis function of the time series is obtained, and time series are clustered based on the similarity of the function.

  3. Feature based : Features or characteristics of the time series based on its temporal structure are calculated and then those are used for clustering.

The main downside of the first two methods is that they do not provide much explainable insights into why and how the series are clustered. We will use the feature based approach instead. In fetaure based approach, various metrics such as mean, mode, skewness, lags, ACF, PACF etc. along with many other metrics are used. In this case, we will use catch22 features which have been shown to capture temporal structure of the series adequately without sacrificing the accuracy. Ref.

From our initial analysis we know that, in Saint Petersburg:

  • Overall alcohol consumption has decreased, especially in the last ~5 years
  • Consumption of beer has decreased drastically
  • There is an uptick in wine consumption and decrease in vodka consumption
  • Although more weak alcohol is consumed, preference for hard liquer has been increasing
  • Macroeconomic factors such as GDP influence alcohol consumption

We want to capture these features in the time series and find regions that show similar characteristics.

Modeling methodology:

  1. Combine consumption of alcohols in two groups based on alcohol content - hard (vodka + brandy) and week (beer + wine + champagne)
  2. Obtain 'catc22' features for both of these alcohols types
  3. Cluster the data using K-means algorithm
  4. Identify regions that are common in both clusters
df1_l_hard = df1[['region','hard']].set_index('region')
df1_pivot_hard = df1[['year','region','hard']].pivot(index='year',columns='region')
c22_h = pd.DataFrame()
for region in set(df1.region.unique()):
    c22_h[region] = catch22_all(df1_l_hard.loc[region]['hard'].values)['values']
    c22_h.index = catch22_all(df1_l_hard.loc[region]['hard'].values)['names']
## Time series features for hard alcohols
c22_hard = c22_h.T
c22_hard 
DN_HistogramMode_5 DN_HistogramMode_10 CO_f1ecac CO_FirstMin_ac CO_HistogramAMI_even_2_5 CO_trev_1_num MD_hrv_classic_pnn40 SB_BinaryStats_mean_longstretch1 SB_TransitionMatrix_3ac_sumdiagcov PD_PeriodicityWang_th0_01 ... FC_LocalSimple_mean1_tauresrat DN_OutlierInclude_p_001_mdrmd DN_OutlierInclude_n_001_mdrmd SP_Summaries_welch_rect_area_5_1 SB_BinaryStats_diff_longstretch0 SB_MotifThree_quantile_hh SC_FluctAnal_2_rsrangefit_50_1_logi_prop_r1 SC_FluctAnal_2_dfa_50_1_2_logi_prop_r1 SP_Summaries_welch_rect_centroid FC_LocalSimple_mean3_stderr
Nizhny Novgorod Oblast 0.732248 0.547634 3.0 8.0 0.485987 -0.050743 1.000000 11.0 0.062500 3.0 ... 0.250000 0.078947 0.894737 0.653610 8.0 1.798106 0.0 0.0 0.392699 0.735469
Magadan Oblast -0.614018 -0.423923 2.0 4.0 0.318822 0.026515 0.888889 4.0 0.046296 2.0 ... 0.333333 -0.631579 0.052632 0.185831 7.0 1.878967 0.0 0.0 0.589049 0.982601
Lipetsk Oblast 0.380434 0.160263 2.0 10.0 0.475436 -1.053809 0.944444 13.0 0.166667 0.0 ... 0.142857 -0.210526 0.894737 0.409234 6.0 1.827175 0.0 0.0 0.589049 0.472617
Republic of North Ossetia-Alania -0.400949 -0.557504 4.0 14.0 0.763766 -0.034504 0.888889 7.0 0.074074 3.0 ... 0.333333 -0.684211 0.842105 0.660165 7.0 1.611158 0.0 0.0 0.196350 0.311650
Orenburg Oblast 0.327689 0.483742 4.0 10.0 0.665650 -0.087016 1.000000 11.0 0.166667 5.0 ... 0.142857 -0.473684 0.894737 0.658906 4.0 1.644073 0.0 0.0 0.196350 0.488831
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Kostroma Oblast 0.568742 0.736033 4.0 10.0 0.625786 -0.075374 1.000000 10.0 0.166667 0.0 ... 0.125000 -0.421053 0.894737 0.715419 5.0 1.846053 0.0 0.0 0.196350 0.517566
Kursk Oblast 0.817754 0.676729 2.0 7.0 0.549905 -0.429980 0.944444 12.0 0.062500 2.0 ... 1.250000 0.289474 0.789474 0.502553 4.0 1.849898 0.0 0.0 0.392699 0.876202
Stavropol Krai -0.496139 -0.496139 4.0 14.0 0.841241 -0.028304 0.944444 8.0 0.166667 4.0 ... 0.285714 -0.684211 0.789474 0.703954 5.0 1.611158 0.0 0.0 0.196350 0.303376
Kaliningrad Oblast 0.402788 0.402788 2.0 6.0 0.431913 -0.128262 0.944444 8.0 0.062500 3.0 ... 0.250000 -0.315789 0.894737 0.441250 4.0 2.003931 0.0 0.0 0.785398 0.882366
Voronezh Oblast 0.358083 -0.452264 2.0 8.0 0.621714 -0.240525 0.888889 9.0 0.074074 3.0 ... 0.400000 -0.157895 0.894737 0.530539 4.0 1.692020 0.0 0.0 0.392699 0.689238

81 rows × 22 columns

df1_l_wk = df1[['region','weak']].set_index('region')
df1_pivot_wk = df1[['year','region','weak']].pivot(index='year',columns='region')
c22_w = pd.DataFrame()
for region in df1.region.unique():
    c22_w[region] = catch22_all(df1_l_wk.loc[region]['weak'].values)['values']
    c22_w.index = catch22_all(df1_l_wk.loc[region]['weak'].values)['names']
#Time series features for weak alcohols
c22_weak= c22_w.T
c22_weak
DN_HistogramMode_5 DN_HistogramMode_10 CO_f1ecac CO_FirstMin_ac CO_HistogramAMI_even_2_5 CO_trev_1_num MD_hrv_classic_pnn40 SB_BinaryStats_mean_longstretch1 SB_TransitionMatrix_3ac_sumdiagcov PD_PeriodicityWang_th0_01 ... FC_LocalSimple_mean1_tauresrat DN_OutlierInclude_p_001_mdrmd DN_OutlierInclude_n_001_mdrmd SP_Summaries_welch_rect_area_5_1 SB_BinaryStats_diff_longstretch0 SB_MotifThree_quantile_hh SC_FluctAnal_2_rsrangefit_50_1_logi_prop_r1 SC_FluctAnal_2_dfa_50_1_2_logi_prop_r1 SP_Summaries_welch_rect_centroid FC_LocalSimple_mean3_stderr
Republic of Adygea 0.595243 0.775842 2.0 5.0 0.252250 0.463747 1.000000 11.0 0.062500 4.0 ... 0.250000 0.315789 -0.842105 0.442748 3.0 1.812143 0.0 0.0 0.589049 0.616533
Altai Krai 0.578897 0.387142 3.0 6.0 0.625786 -0.026379 1.000000 12.0 0.074074 0.0 ... 0.600000 0.473684 -0.789474 0.579394 4.0 1.849898 0.0 0.0 0.392699 0.750386
Amur Oblast 1.194978 1.036788 4.0 10.0 0.897133 -0.009958 0.944444 10.0 0.074074 0.0 ... 0.833333 0.421053 -0.684211 0.842913 3.0 1.549173 0.0 0.0 0.196350 0.552741
Arkhangelsk Oblast -0.483755 -0.246369 2.0 4.0 0.296951 -0.712265 0.888889 4.0 0.046296 4.0 ... 0.333333 -0.368421 -0.789474 0.103270 6.0 1.878967 0.0 0.0 0.785398 1.149274
Astrakhan Oblast 0.932883 1.088741 3.0 11.0 0.509388 -0.102313 0.944444 12.0 0.074074 3.0 ... 1.000000 0.421053 -0.789474 0.668325 4.0 1.688174 0.0 0.0 0.392699 0.726105
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Chuvash Republic 0.829517 0.647259 4.0 8.0 0.679860 0.127388 0.888889 12.0 0.074074 0.0 ... 0.166667 0.421053 -0.736842 0.595159 4.0 1.611158 0.0 0.0 0.392699 0.470437
Chukotka Autonomous Okrug -0.514813 -0.342103 4.0 10.0 0.792951 0.063017 1.000000 6.0 0.166667 3.0 ... 0.125000 0.789474 -0.736842 0.619796 3.0 1.812143 0.0 0.0 0.196350 0.411099
Sakha (Yakutia) Republic 0.833131 0.959731 5.0 13.0 0.987765 0.017513 0.722222 10.0 0.166667 0.0 ... 0.571429 0.631579 -0.631579 0.824523 2.0 1.398986 0.0 0.0 0.196350 0.274629
Yamalo-Nenets Autonomous Okrug 0.576638 0.735361 3.0 8.0 0.373232 0.095685 0.944444 13.0 0.074074 0.0 ... 1.000000 0.210526 -0.736842 0.583341 5.0 1.812143 0.0 0.0 0.392699 0.654370
Yaroslavl Oblast 0.362071 0.884441 2.0 4.0 0.237669 -0.264563 0.833333 10.0 0.046296 5.0 ... 0.666667 0.315789 -0.315789 0.323033 3.0 1.937106 0.0 0.0 0.785398 1.014128

81 rows × 22 columns

kmeans1 = KMeans(n_clusters=4, random_state=1).fit(c22_hard)
kmeans2 = KMeans(n_clusters=4, random_state=1).fit(c22_weak)
c22_final = pd.DataFrame()
c22_final['regions'] = list(c22_weak.index)
c22_final['hard_cluster'] = kmeans1.labels_
c22_final['weak_cluster'] = kmeans2.labels_
c22_final
regions hard_cluster weak_cluster
0 Republic of Adygea 3 2
1 Altai Krai 0 0
2 Amur Oblast 2 3
3 Arkhangelsk Oblast 1 1
4 Astrakhan Oblast 3 2
... ... ... ...
76 Chuvash Republic 2 0
77 Chukotka Autonomous Okrug 3 2
78 Sakha (Yakutia) Republic 1 3
79 Yamalo-Nenets Autonomous Okrug 0 0
80 Yaroslavl Oblast 3 1

81 rows × 3 columns

c22_final.query('regions == "Saint Petersburg"')
regions hard_cluster weak_cluster
57 Saint Petersburg 0 0
h__ = int(c22_final.query('regions == "Saint Petersburg"')['hard_cluster'].values)
w__ = int(c22_final.query('regions == "Saint Petersburg"')['weak_cluster'].values)

Now let's get similar regions for both alcohol types.

c22_final.query('hard_cluster == @h__')['regions']
1                         Altai Krai
6                    Belgorod Oblast
7                     Bryansk Oblast
12                   Voronezh Oblast
15                  Zabaykalsky Krai
16                    Ivanovo Oblast
17                    Irkutsk Oblast
18         Kabardino-Balkar Republic
20              Republic of Kalmykia
24               Republic of Karelia
27                   Kostroma Oblast
28                    Krasnodar Krai
29                  Krasnoyarsk Krai
32                  Leningrad Oblast
37                            Moscow
39                   Murmansk Oblast
43                Novosibirsk Oblast
46                      Oryol Oblast
47                      Penza Oblast
49                    Primorsky Krai
53                     Tuva Republic
54                     Rostov Oblast
55                     Ryazan Oblast
56                     Samara Oblast
57                  Saint Petersburg
59                   Sakhalin Oblast
60                 Sverdlovsk Oblast
65             Republic of Tatarstan
67                      Tomsk Oblast
68                       Tula Oblast
69                     Tyumen Oblast
71                  Ulyanovsk Oblast
72                   Khabarovsk Krai
75                Chelyabinsk Oblast
79    Yamalo-Nenets Autonomous Okrug
Name: regions, dtype: object
c22_final.query('weak_cluster == @w__')['regions']
1                                Altai Krai
5                 Republic of Bashkortostan
7                            Bryansk Oblast
8                      Republic of Buryatia
11                           Vologda Oblast
13                     Republic of Dagestan
17                           Irkutsk Oblast
25                          Kemerovo Oblast
31                             Kursk Oblast
35                         Mari El Republic
41                   Nizhny Novgorod Oblast
43                       Novosibirsk Oblast
45                          Orenburg Oblast
47                             Penza Oblast
57                         Saint Petersburg
60                        Sverdlovsk Oblast
67                             Tomsk Oblast
69                            Tyumen Oblast
74    Khanty–Mansi Autonomous Okrug – Yugra
75                       Chelyabinsk Oblast
76                         Chuvash Republic
79           Yamalo-Nenets Autonomous Okrug
Name: regions, dtype: object
#Find regions that are common in both hard & weak alcohol clusters
similar = set(c22_final.query('weak_cluster == @w__')['regions'].values).intersection(set(c22_final.query('hard_cluster == @h__')['regions'].values))
similar
{'Altai Krai',
 'Bryansk Oblast',
 'Chelyabinsk Oblast',
 'Irkutsk Oblast',
 'Novosibirsk Oblast',
 'Penza Oblast',
 'Saint Petersburg',
 'Sverdlovsk Oblast',
 'Tomsk Oblast',
 'Tyumen Oblast',
 'Yamalo-Nenets Autonomous Okrug'}

Out of 81 regions, 10 regions show patters that show trends for hard & weak alcohols that are similar to that of Saint Petersburg. Let's look at those closely. You can in the charts below, feature based clustering method identified regions that follow similar time series pattern.

Beer Consumption

for r in similar:
    df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "beer"').set_index('year').query('region == @r')['Consumption'].plot()

Wine Consumption

for r in similar:
    df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "wine"').set_index('year').query('region == @r')['Consumption'].plot()

Vodka Consumption

for r in similar:
    df1_melt[(df1_melt.region.isin(similar))].query('Alcohol == "vodka"').set_index('year').query('region == @r')['Consumption'].plot()

Conclusion:

  1. It is recommended to run the next campaign in following 10 regions: 'Altai Krai', 'Bryansk Oblast', 'Chelyabinsk Oblast', 'Irkutsk Oblast', 'Novosibirsk Oblast', 'Penza Oblast', 'Sverdlovsk Oblast', 'Tomsk Oblast', 'Tyumen Oblast', 'Yamalo-Nenets Autonomous Okrug'
  1. In the analysis it was found that consumers in Saint Petersburg have been consuming less alcohol compared to the past, especially beer. They have also been preferring wine over vodka in the last 3 years. It was also found that although the consumers have been drinking less alcohol, the preference towards hard alcohols (vodka & brandy) has been picking up over weak alcohols (beer, wine and champagne)

  2. To find regions similar to that of Saint Petersburg, feature based time series clustering technique was used to find regions that show similar trends for both hard and weak alcohols.

  3. Analysis showed that macroeconomic factors play an important rule in alcohol consumption. We recommend that the company assess macroeconomics (CPI, GDP, international markets etc.) to decide when to launch the campaign. We could not include GDP data in clustering due to unavailability of data for all regions, but we recommend carrying out additional clustering with inclusion of GDP to find similar regions.

Limitations:

  1. I did not focus on any geospatial analysis. Russia is a vast country and many regions operate autonomously so there are likely to be strong differrences among some regions despite being similar in alchol volume consumption.

  2. I have made assumptions regarding GDP numbers

  3. One limitation of K-Means is its sensitivity to initialization. Although I spent some time tuning parameters, I did not include it in the analysis here to keep it short. In my real project, I used GMM instead of K-Means.

  4. Catch22 is one of the many libraries that allow time series feature extraction. You can also take a look at tsfresh and tsfeat