This notebook generates common key performance indicators (KPI) in gaming for a subset of the World of Warcraft (WOW) dataset on Kaggle:

In [859]:
# Load all packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from datetime import datetime
from matplotlib import colors
from matplotlib import pyplot
from bokeh.io import output_notebook
from bokeh.charts import TimeSeries, show,Scatter
from bokeh.models import BoxSelectTool

output_notebook()

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Loading BokehJS ...

The following code preprocesses the dataset for further analysis

In [860]:
# create dataframe table
ww=pd.read_csv("./warcraft-avatar-history/wowah_data.csv")
In [861]:
# Inspect
ww.head()
Out[861]:
char level race charclass zone guild timestamp
0 59425 1 Orc Rogue Orgrimmar 165 01/01/08 00:02:04
1 65494 9 Orc Hunter Durotar -1 01/01/08 00:02:04
2 65325 14 Orc Warrior Ghostlands -1 01/01/08 00:02:04
3 65490 18 Orc Hunter Ghostlands -1 01/01/08 00:02:04
4 2288 60 Orc Hunter Hellfire Peninsula -1 01/01/08 00:02:09
In [862]:
# Rename columns to remove extra space
ww.rename(columns={' level':'level',' race':'race',' charclass': 'charclass', ' zone':'zone',' guild':'guild',' timestamp':'timestamp'}, inplace=True)
In [863]:
# Summarize data types of columns 
ww.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10826734 entries, 0 to 10826733
Data columns (total 7 columns):
char         int64
level        int64
race         object
charclass    object
zone         object
guild        int64
timestamp    object
dtypes: int64(3), object(4)
memory usage: 578.2+ MB
In [864]:
# Take a subsample of users and their associated values from the original dataframe to reduce processing 
# time of data analysis.
# Note: The following analysis will produce different results from similar analyses on the entire dataframe, especially
# for very small subsamples.
unique_char=pd.Series(ww.char.unique())
sample_unique_char=list(unique_char.sample(frac=0.03))
ww_char_sub=ww[ww['char'].isin(sample_unique_char)]
In [865]:
# Convert timestamp and date to datetime format and generate columns for date, month, day, hour
ww_char_sub.loc[:,'timestamp']=pd.to_datetime(ww_char_sub['timestamp'])

ww_char_sub.loc[:,'date']=ww_char_sub['timestamp'].dt.date
ww_char_sub.loc[:,'month_int']=ww_char_sub['timestamp'].dt.month
ww_char_sub.loc[:,'day']=ww_char_sub['timestamp'].dt.day
ww_char_sub.loc[:,'hour']=ww_char_sub['timestamp'].dt.hour

m={1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
ww_char_sub.loc[:,'month']=ww_char_sub['month_int'].map(m)

ww_char_sub=ww_char_sub.sort_values('timestamp', ascending=True)

ww_char_sub.loc[:,'date']=pd.to_datetime(ww_char_sub['date'])
In [866]:
# Inspect
ww_char_sub.head()
Out[866]:
char level race charclass zone guild timestamp date month_int day hour month
25 753 70 Orc Hunter Orgrimmar 53 2008-01-01 00:02:24 2008-01-01 1 1 0 Jan
97 3275 70 Tauren Shaman Sethekk Halls 19 2008-01-01 00:03:09 2008-01-01 1 1 0 Jan
211 24674 70 Undead Mage Eye of the Storm 101 2008-01-01 00:04:26 2008-01-01 1 1 0 Jan
280 23248 70 Undead Warrior Shattrath City 243 2008-01-01 00:04:46 2008-01-01 1 1 0 Jan
293 65477 15 Blood Elf Paladin Ghostlands -1 2008-01-01 00:04:57 2008-01-01 1 1 0 Jan
In [867]:
# Inpsect 
print ww_char_sub.shape
(312993, 12)
In [868]:
# Include the date that a user likely downloaded the game
add=ww_char_sub.groupby('char').agg({'timestamp':np.min}).reset_index()
add=add.rename(columns={'timestamp':'beg_date'})
ww_char_sub=ww_char_sub.merge(add, how="left", on="char")
print ww_char_sub.shape
(312993, 13)
In [871]:
# Include the days since install (dsi) as a column and convert install date to datetime format
ww_char_sub['dsi']=pd.Series([i.days for i in (ww_char_sub.timestamp-ww_char_sub.beg_date)])
ww_char_sub.loc[:,'beg_date']=ww_char_sub['beg_date'].dt.date
In [872]:
# Inspect
ww_char_sub.head()
Out[872]:
char level race charclass zone guild timestamp date month_int day hour month beg_date dsi
0 753 70 Orc Hunter Orgrimmar 53 2008-01-01 00:02:24 2008-01-01 1 1 0 Jan 2008-01-01 0
1 3275 70 Tauren Shaman Sethekk Halls 19 2008-01-01 00:03:09 2008-01-01 1 1 0 Jan 2008-01-01 0
2 24674 70 Undead Mage Eye of the Storm 101 2008-01-01 00:04:26 2008-01-01 1 1 0 Jan 2008-01-01 0
3 23248 70 Undead Warrior Shattrath City 243 2008-01-01 00:04:46 2008-01-01 1 1 0 Jan 2008-01-01 0
4 65477 15 Blood Elf Paladin Ghostlands -1 2008-01-01 00:04:57 2008-01-01 1 1 0 Jan 2008-01-01 0
In [873]:
# Inspect
ww_char_sub.shape
Out[873]:
(312993, 14)
In [874]:
# Break levels into ranges
ww_char_sub['level_range']=pd.cut(ww_char_sub['level'],[0,60,70,80])

The following code prepares the data to plot total daily active users (DAU) over time

In [875]:
# Count unique users by date
dau=ww_char_sub.groupby(['date',])['char'].nunique().reset_index()
In [876]:
# Inpsect
dau.head()
Out[876]:
date char
0 2008-01-01 81
1 2008-01-02 68
2 2008-01-03 62
3 2008-01-04 68
4 2008-01-05 85
In [877]:
# Set date as index
dau.set_index('date',inplace=True)

Plot comments:

A long-term negative trend overall is observed in the DAU count over the year 2008.

The most noticable drop in DAU's is observed in early May, but quick, sharp drops also occur in late January, late March, early April, middle of June, and the middle of November.

In [878]:
# Plot DAU
colors=sns.color_palette("Set2", 10)

fig, ax = plt.subplots(1,1, figsize=(20,15))

dau.loc[:,['char']].plot(color=colors[0],ax=ax,legend=False)

ax.set_ylabel("DAU Count",style='italic',fontsize=14)
ax.set_xlabel("Date",style='italic',fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=12)

pyplot.show()

The following code prepares the data to plot total daily active users over time by level of user

In [879]:
# Count unique users by date and range of levels
dau_level=ww_char_sub.groupby(['date','level_range',])['char'].nunique().reset_index()
In [880]:
# Add a column to the dau_level dataframe that includes the sum of all unique users by date
print dau_level.shape
add=dau_level.groupby('date')['char'].agg({'char':np.sum}).reset_index()
add=add.rename(columns={'char':'percent_dau'})
dau_level=dau_level.merge(add,how='left',on="date")
print dau_level.shape
(747, 3)
(747, 4)
In [881]:
# Inspect
dau_level.head()
Out[881]:
date level_range char percent_dau
0 2008-01-01 (0, 60] 29 81
1 2008-01-01 (60, 70] 52 81
2 2008-01-02 (0, 60] 19 68
3 2008-01-02 (60, 70] 49 68
4 2008-01-03 (0, 60] 17 62
In [882]:
# Calculate the percent of DAU out of all DAU for the day
dau_level.loc[:,'percent_dau']=(dau_level['char']*100.0)/dau_level['percent_dau']

dau_level=dau_level.rename(columns={'char':'dau'})
In [883]:
# Inspect
dau_level.head()
Out[883]:
date level_range dau percent_dau
0 2008-01-01 (0, 60] 29 35.802469
1 2008-01-01 (60, 70] 52 64.197531
2 2008-01-02 (0, 60] 19 27.941176
3 2008-01-02 (60, 70] 49 72.058824
4 2008-01-03 (0, 60] 17 27.419355

Plot comments:

The DAU counts for the highest levels (70-80) begin in late Nov 2008.

High levels have relatively high DAU's. Also, the long-term negative trend as observed above occurs mostly due to a long-term decline in DAU at high levels (60-70).

The most noticable drop in DAU's is observed for all levels in early May, but quick, sharp drops also occur in late January, late March, early April, middle of June, and middle of November.

In [884]:
# Plot DAU 
# Note: DAU percents that were similar over time were grouped. The grouping resulted in three distinct groups. 
fig, ax = plt.subplots(1,1, figsize=(20,15))


levels=list(dau_level['level_range'].unique())

for i,item in enumerate(levels):
    temp=dau_level[dau_level['level_range']==item]
    temp.set_index('date',inplace=True)
    temp.loc[:,['dau']].plot(color=colors[i],ax=ax)

ax.legend(levels,loc="best",fontsize='x-large')


# # plt.setp(labels, rotation=90) 
ax.set_ylabel("DAU Count",style='italic',fontsize=14)
ax.set_xlabel("Date",style='italic',fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=12)

pyplot.show()