Overview

The goal of this notebook is to show how to do exploratory data analysis of time series data and use feature based clustering method to identify attributes that are similar in time domain. The novelty of this based is, instead of relying on shape of the time series, feature based approach enables more interpretable solution. I used some of these techniques in a real project so I wanted to showcase this for my own future reference.

The data here is alcohol consumption across 80+ regions of Russia. I am not sure if this is real data or not. Hypothetically, an alcohol company has been operating in St. Petersburg. Their campaign was successful in St.Petersburg so we want to identify other regions where they can potentially run similar campaigns. We have alcohol consumption for differrent alcohol types per capita from 1998 to 2016.

I don't know anything about Russia's geography or economy, so I will make reasonable assumptions in my analysis.

Importing Libraries

#!pip install altair

#!pip install catch22
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

import seaborn as sns
import statsmodels.api as sm
import scipy.stats as st
import statsmodels.stats.api as sms
from sklearn.metrics import r2_score
from catch22 import catch22_all
from sklearn.cluster import KMeans 


alt.data_transformers.disable_max_rows()


import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()
year region wine beer vodka champagne brandy
0 1998 Republic of Adygea 1.9 8.8 3.4 0.3 0.1
1 1998 Altai Krai 3.3 19.2 11.3 1.1 0.1
2 1998 Amur Oblast 2.1 21.2 17.3 0.7 0.4
3 1998 Arkhangelsk Oblast 4.3 10.6 11.7 0.4 0.3
4 1998 Astrakhan Oblast 2.9 18.0 9.5 0.8 0.2

Data Quality

It is important that before we analyze the data, we should check the quality of the data. If there are any missing values, outliers, unusual observations etc. those should be treated before analysing the data.

Null Values

df.isna().any()
year         False
region       False
wine          True
beer          True
vodka         True
champagne     True
brandy        True
dtype: bool

Wine, beer, vodka, champagne and brandy have null values. Let's find if any of the columns have more than 5% null values (i.e in this at least 1).

# Which region has null values?
null_regions = []
print("% of Values missing \n========================================")
for r in df.region.unique():
    
    nulls = (df.query('region == @r').isna().mean())
    if nulls.max() > 0.05:
        print(r,",",nulls.idxmax(), ":", 100 * round(nulls.max(),2),"%")
        null_regions.append(r)
% of Values missing 
========================================
Republic of Ingushetia , brandy : 79.0 %
Republic of Crimea , wine : 84.0 %
Sevastopol , wine : 84.0 %
Chechen Republic , wine : 100.0 %

There are four regions with more than 79% of the values missing. If few values were missing, we could have used interpolation methods to impute the missing values, but in this we don't have enough values to do so. Let's look at which values are missing.

#alcohol types
alcohols = list(df.columns[2:])                                   

#Find number of missing values and location of the missing values each of the 4 regions
null_regions_df = (df[df['region']
 .isin(null_regions)]
 .sort_values(by=['region','year'])
 .groupby(['year','region'])
 .apply(lambda x : x.isna().sum())
 .drop(['year','region'],axis=1)
 .assign(nulls = lambda x: x.sum(axis=1))
 .reset_index()
)
null_regions_df.head(5)
year region wine beer vodka champagne brandy nulls
0 1998 Chechen Republic 1 1 1 1 1 5
1 1998 Republic of Crimea 1 1 1 1 1 5
2 1998 Republic of Ingushetia 1 1 1 1 1 5
3 1998 Sevastopol 1 1 1 1 1 5
4 1999 Chechen Republic 1 1 1 1 1 5
plt.title("Fig 1: Heatmap showing number of alcohol types missing by year and region \n") 

(
sns.heatmap    
     (pd.crosstab(
      null_regions_df['region'],
      null_regions_df['year'],
      values = null_regions_df['nulls'],
      aggfunc='max'
    
      ),
      cmap="magma_r", 
      annot=True, 
      cbar=False, 
      square=True
     )
);    

Observations:

In Fig 1., the heatmap shows, out of the 5 alcohol types, how many types have missing values.

  1. Crimea and Sevastopol has values missing for all 5 alcohol types for years 1998 to 2013 and none missing for the last 3 years. This suggests that likely the data for first 16 years either wasn't collected or wasn't available.
  2. Chechen Republic has values missing except for one alcohol type in the last 3 years
  3. Ingushetia has missing values at random

Since our goal is to compare the regions using alcohol consumption data for all years, it won't be possible to include these 4 regions in to the analysis. We will drop all these 4 regions from further analysis.

Duplicate Values

#Are there any duplicate values? 


if max((df[~df.region.isin(null_regions)].region.value_counts())) == min((df[~df.region.isin(null_regions)].region.value_counts())):
    print("All Regions have 19 years of data")
else:
    print("Some of the regions of more/less data")
All Regions have 19 years of data

None of the regions have duplicate data across all years

Unusual Data

For all alcohol types we will look at min and max values to see if there are any unusual values such as 0 or negative consumption.

#Are there any observations with 0 or negative values? Consumption can be 0 but not negative

df[~df.region.isin(null_regions)].describe().loc['min']
year         1998.0
wine            0.1
beer            1.0
vodka           0.4
champagne       0.1
brandy          0.0
Name: min, dtype: float64

There is at least one region with 0 lts per capita consumption of brandy. Let's look at that closely.

df[~df.region.isin(null_regions)].query('brandy == 0')
year region wine beer vodka champagne brandy
565 2004 Tuva Republic 2.1 25.8 15.2 0.2 0.0

In 2004, Tuva Republic did not record any consumption of brandy.

(df.query('region == "Tuva Republic"')
 .set_index('year')['brandy']
 .plot(title="Fig 2: Brandy consumption for Tuva Republic \n 0 consumption in 2004", legend=True)
)


(plt.axhline(df.query('region == "Tuva Republic"')
             .set_index('year')['brandy']
             .median(), color='red')
 
);

In 2004, the brandy consumption in Tuva dropped from 0.1 ltr to 0. The median consumption over the entire period was 0.2 ltr. So, considering overall low consumption, 0 does not seem like an outlier. We will keep the value as is.

# Max values
df[~df.region.isin(null_regions)].describe().iloc[:,1:]
wine beer vodka champagne brandy
count 1539.000000 1539.000000 1539.000000 1539.000000 1539.000000
mean 5.637544 51.722190 11.902404 1.315172 0.524185
std 2.811555 25.115577 5.078808 0.798462 0.399331
min 0.100000 1.000000 0.400000 0.100000 0.000000
25% 3.550000 32.700000 8.400000 0.800000 0.200000
50% 5.400000 50.300000 11.600000 1.200000 0.400000
75% 7.400000 67.500000 15.000000 1.660000 0.700000
max 18.100000 207.300000 40.600000 5.560000 2.300000

Except for beer, the max values for all other alcohols seem reasonable. In case of beer, the 75% value is 67.5 lts and max is 207, which is 5 standard deviations away from 75th percentile. We need to understand which region has this value and how should we treat it.

Below, we will drop the regions with null values and create a new dataframe with percentage share of each alcohol type for the given year and region.

df1 = df[~df.region.isin(null_regions)]
df1['total']=(df1[alcohols].sum(axis=1))
df1 = (df1
       .assign(winepct = df1['wine']/df1['total'])
       .assign(beerpct = df1['beer']/df1['total'])
       .assign(vodkapct = df1['vodka']/df1['total'])
       .assign(champct = df1['champagne']/df1['total'])
       .assign(brandypct = df1['brandy']/df1['total'])
       .assign(year = pd.to_datetime(df1.year,format='%Y'))
       .round(3)

       
      )
df1.head(5)
year region wine beer vodka champagne brandy total winepct beerpct vodkapct champct brandypct
0 1998-01-01 Republic of Adygea 1.9 8.8 3.4 0.3 0.1 14.5 0.131 0.607 0.234 0.021 0.007
1 1998-01-01 Altai Krai 3.3 19.2 11.3 1.1 0.1 35.0 0.094 0.549 0.323 0.031 0.003
2 1998-01-01 Amur Oblast 2.1 21.2 17.3 0.7 0.4 41.7 0.050 0.508 0.415 0.017 0.010
3 1998-01-01 Arkhangelsk Oblast 4.3 10.6 11.7 0.4 0.3 27.3 0.158 0.388 0.429 0.015 0.011
4 1998-01-01 Astrakhan Oblast 2.9 18.0 9.5 0.8 0.2 31.4 0.092 0.573 0.303 0.025 0.006

Exploratory Data Analysis

To better understand the pattern in the data and compare the regions quickly over the 19 year period, decile heatmap as show in Fig 3 is created. In this visual, for each year, the total alcohol conumption is binned into 10 bins (i.e deciles). For example, the region with the highest consumption will be in 10th decile while the region with the lowest consumption will be in the 1st decile. The lighter color indicates high decile (and high consumption) while dark color shows low consumption. The regions are sorted by total consumption so the regions with high consumption are at the top.

plt.figure(figsize = (10,22))
plt.title("Fig 3: Decile Heat Map \n Shows distribution of binned alcohol consumption by year and region \n Lighter color: High consumption, darker color: low consumption", fontsize =14)


sns.heatmap(df1[['year','region','total']]
 .assign(year=df1.year.dt.year)
 .set_index(['year','region'])
 .groupby(level=0)
 .transform(lambda x: pd.qcut(x, q = np.linspace(0,1,11), labels=False))
 .unstack(level=0)
 .assign(t = lambda x: x.sum(axis=1))
 .sort_values(by='t', ascending=False)
 .drop('t',axis=1),
            cmap='magma',cbar_kws={"orientation": "horizontal"});    

Observations:

  1. Moscow has light color all throughout the 19 years showing it had the highest alcohol consumption continuoulsy for all years.
  2. Saint Petersburg is towards the top, thus one of the regions with high alcohol consumptions
  3. Saint Petersburg was in high deciles until 2010 and then the alcohol sumption dropped drammatically in the last 5-6 years.
  4. A visual comparison shows Chelyabinsk Oblask, Yaroslavi Oblast, Perm also show a similar trend as that of Saint Petersburg
  5. Some regions such as Karaychay, Dagestan, Chukotka, Adyega etc which are at the bottom, had very low and flat trend in consumption compared to Saint Petersburg
  6. Some regions such as Tver, Nizhny, Sakhlin showed pick up in consumption in later years, opposite to Saint Petersburg

Let's compare the consumption further by looking at box plot and by individual alcohol type.

df1_melt = df1.iloc[:,:7].melt(['year','region'],var_name='Alcohol', value_name='Consumption')
df1_melt.head(3)
year region Alcohol Consumption
0 1998-01-01 Republic of Adygea wine 1.9
1 1998-01-01 Altai Krai wine 3.3
2 1998-01-01 Amur Oblast wine 2.1

Trend Analysis:

Fig 4 shows distribution of total alcohol consumption for each of the year and by region. Values that are within the box plot are within the inter-quartile range ,where as values outside are potential outliers. The thick blue line shows the median consumption over the period. Light orange color line shows total consumption of Saint Petersburg for comaprison purposes. Each dot in fig 5 is a region. You can hover over the circle to get breakdown of consumption by alcohol type, volume and percentage.

Fig 5 shows the consumption for each selected region over the 19 year period. Select the region fro teh dropdown menu to get the consumption by type.

You can zoom-in/out and pan to explore the data in fig 4 & 5.

median_line = alt.Chart(df1).mark_line().encode(

    x = 'year:T',
    y = 'median(total):Q'
).interactive()

total_box = alt.Chart(df1).mark_boxplot(opacity=0.3).encode(

    x = 'year:T',
    y = 'total:Q',
    
).interactive()

total_scatter = alt.Chart(df1).mark_circle(opacity=0.8).encode(

    x = 'year:T',
    y = 'total:Q',
    color='region',
    tooltip = list(df1.columns)).interactive()
    
total_line = alt.Chart(df1).mark_line(opacity=0.1).encode(

    x = 'year:T',
    y = 'total:Q',
    color='region',
    tooltip = list(df1.columns),
    strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
    
).interactive()

total_line_stp = alt.Chart(df1.query('region == "Saint Petersburg"')).mark_line(opacity=0.4, strokeDash=[1,1]).encode(

    x = 'year:T',
    y = 'total:Q',
    color=alt.value("#FFAA00"),
    strokeWidth = alt.condition("datum.region == 'Saint Petersburg'",alt.value(3),alt.value(1)),
    
    tooltip = list(df1.columns)    
    
).interactive()

# (total_scatter + total_line_stp).properties(width=700, height = 400, title = "Total Alcohol consumption for all regions vs. Year")
top = (total_scatter + total_box + median_line + total_line + total_line_stp).properties(width=700, height = 300, title = "Fig 4: Total Alcohol consumption for all regions vs. Year")


region_dropdown = alt.binding_select(options= sorted(list(df1_melt.region.unique())))
region_select = alt.selection_single(fields=['region'], bind=region_dropdown, name="Select")


bottom = (alt.Chart(df1_melt)
 .mark_line().encode(alt.X('year:T'),  
                     alt.Y('mean(Consumption):Q', scale=alt.Scale(domain=[0,210])), 
                     color = 'Alcohol:N', 
                     tooltip=['year','Consumption','Alcohol'])
).add_selection(
    region_select
).transform_filter(
    region_select
).properties(width=700, height = 200, title = "Fig 5:Select Region: Alcohol consumption by Year, Alcohol Type ")


(top & bottom.interactive())