PUBG Analysis

Winston Ma Zhao Yang

Battle royale games grew increasingly popular in 2018. Well-known recent titles include Fortnite, Realm Royale and Call of Duty: Black Ops 4. The trend was arguably started by an earlier title, PlayerUnknown's Battlegrounds (PUBG). In this data analysis project, I work with official game data made public by the PUBG team. I explore the data in depth to find out how different players play the game, and then try to develop a simplified machine learning model to predict players' rankings in each match.

I have clocked a substantial number of hours in PUBG and would love to gain insight into how other people play their way to victory. In battle royale games, there are countless ways to win. Whether by hiding or by seeking frequent combat, there is no fixed way to play. However, if there is an unspoken formula for winning, I am confident that it can be uncovered by analysing the game's data.

The dataset comes from kaggle.com (https://www.kaggle.com/c/pubg-finish-placement-prediction).

The dataset records the statistics and decisions of each player during a match, including their placements, kills and items used. It also contains a few other fields that might not be useful for this analysis.

I chose this dataset not only because I find it interesting, but also because I want to put the skills I have learnt to use on something I can relate to closely, rather than hunting for deep insights in national statistics. Especially for my first big data analysis project, I want to explore what I can do with a dataset that offers many different possibilities.

The dataset comes in two parts: the training data and the test data. I will use the training data for the data analysis and for fitting a machine learning model that could then be applied to the test data. Both parts were provided in CSV form and contain several million rows of anonymised player data.

Overall, this dataset is fairly clean, with no empty cells. All numeric columns are stored as integers or floats, and the remaining columns are proper strings. Thus, data cleaning is not needed here.

However, one major problem with the dataset is that many cells are 0 due to the nature of the game. Many players do not manage to kill or knock anyone out before losing the game. Other factors also contribute to the number of zeros.
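
As a rough check of both observations, a sketch like the one below counts missing cells and the share of zeros in each numeric column. It runs on a small made-up frame called `sample`, which is only a stand-in for `train`; the values are illustrative and not taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
sample = pd.DataFrame({
    'kills': [0, 0, 2, 0],
    'DBNOs': [0, 1, 0, 0],
    'walkDistance': [0.0, 350.5, 1200.0, 80.2],
    'matchType': ['solo', 'duo', 'squad-fpp', 'solo'],
})

# Number of missing cells in each column.
missing_counts = sample.isnull().sum()

# Fraction of rows that are exactly 0 in each numeric column.
numeric = sample.select_dtypes(include='number')
zero_share = (numeric == 0).mean()

print(missing_counts)
print(zero_share)
```

Running the same two lines on `train` should show which columns are dominated by zeros before any plotting is done.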

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

sns.set()

We first read the training and test CSV files into their own variables.

In [11]:
train = pd.read_csv('train_V2.csv')
test = pd.read_csv('test_V2.csv')
In [13]:
header = ['DBNOs','assists','boosts','damageDealt','headshotKills','heals','Id','killPlace','killPoints','killStreaks','kills','longestKill','matchDuration','matchId','matchType','rankPoints','revives','rideDistance','roadKills','swimDistance','teamKills','vehicleDestroys','walkDistance','weaponsAcquired','winPoints','groupId','numGroups','maxPlace','winPlacePerc']
explanation = [
    'Number of enemy players knocked',
    'Number of enemy players this player damaged that were killed by teammates',
    'Number of boost items used',
    'Total damage dealt. Note: Self inflicted damage is subtracted',
    'Number of enemy players killed with headshots',
    'Number of healing items used',
    'Player’s Id',
    'Ranking in match of number of enemy players killed',
    'Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”',
    'Max number of enemy players killed in a short amount of time',
    'Number of enemy players killed',
    'Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat',
    'Duration of match in seconds',
    'ID to identify match. There are no matches that are in both the training and testing set',
    'String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches',
    'Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”',
    'Number of times this player revived teammates',
    'Total distance traveled in vehicles measured in meters',
    'Number of kills while in a vehicle',
    'Total distance traveled by swimming measured in meters',
    'Number of times this player killed a teammate',
    'Number of vehicles destroyed',
    'Total distance traveled on foot measured in meters',
    'Number of weapons picked up',
    'Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”',
    'ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time',
    'Number of groups we have data for in the match',
    'Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements',
    'The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match'
]
In [38]:
pd.set_option('display.max_colwidth', 500)
fields = pd.DataFrame({'name':header,'description':explanation})
fields.set_index('name')
Out[38]:
description
name
DBNOs Number of enemy players knocked
assists Number of enemy players this player damaged that were killed by teammates
boosts Number of boost items used
damageDealt Total damage dealt. Note: Self inflicted damage is subtracted
headshotKills Number of enemy players killed with headshots
heals Number of healing items used
Id Player’s Id
killPlace Ranking in match of number of enemy players killed
killPoints Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”
killStreaks Max number of enemy players killed in a short amount of time
kills Number of enemy players killed
longestKill Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat
matchDuration Duration of match in seconds
matchId ID to identify match. There are no matches that are in both the training and testing set
matchType String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches
rankPoints Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”
revives Number of times this player revived teammates
rideDistance Total distance traveled in vehicles measured in meters
roadKills Number of kills while in a vehicle
swimDistance Total distance traveled by swimming measured in meters
teamKills Number of times this player killed a teammate
vehicleDestroys Number of vehicles destroyed
walkDistance Total distance traveled on foot measured in meters
weaponsAcquired Number of weapons picked up
winPoints Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”
groupId ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time
numGroups Number of groups we have data for in the match
maxPlace Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements
winPlacePerc The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match
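
The descriptions of rankPoints, killPoints and winPoints say that some values stand in for "None". A minimal sketch of how those placeholders could be turned into real missing values is shown below. The frame `ratings` and its numbers are hypothetical, and this step is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the three rating columns matter here.
ratings = pd.DataFrame({
    'rankPoints': [-1, 1500, 1480],
    'killPoints': [1200, 0, 0],
    'winPoints': [1400, 0, 0],
})

# Work on a float copy so that NaN can be stored.
cleaned = ratings.astype(float)

# A 0 in killPoints/winPoints means "None" when rankPoints holds a real value.
has_rank = ratings['rankPoints'] != -1
for col in ['killPoints', 'winPoints']:
    cleaned.loc[has_rank & (ratings[col] == 0), col] = np.nan

# A -1 in rankPoints takes the place of "None".
cleaned.loc[ratings['rankPoints'] == -1, 'rankPoints'] = np.nan

print(cleaned)
```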

Data Exploration

We can first take a look at which game modes are the most popular.

In [40]:
game_modes = train['matchType'].value_counts()
game_modes = game_modes.reset_index()
game_modes.columns = ['matchType','count']

fig,ax=plt.subplots(figsize=(15,5))
sns.barplot(data=game_modes.head(),x='matchType',y='count',palette='ocean')

ax.set_title('Top 5 most popular game modes',fontsize=15)
ax.set_xlabel('Game Mode')
ax.set_ylabel('Count')
plt.show()

Squads of four in first-person perspective (squad-fpp) seem to be how most people like to play. Teamwork is very important in such co-op games.

We can also look at how long matches in these game modes usually last.

In [41]:
top5 = train[train['matchType'].isin(['squad-fpp','duo-fpp','squad','solo-fpp','duo'])]
fig,ax=plt.subplots(figsize=(15,5))
sns.violinplot(x='matchType',y='matchDuration',data=top5,ax=ax)
ax.set_title('Distribution of match duration of top 5 game modes',fontsize=15)
ax.set_xlabel('Game Mode')
ax.set_ylabel('Match Duration')
plt.show()

Most games last either about 25 minutes or about 30 minutes. This is when most people kill or get killed. A possible reason is the closing of the circle boundary around the playing area. In solo, many people die early in the game because no one is there to revive them, so their game ends as soon as their health bar is fully depleted.
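
Each row in the data is a player rather than a match, so matches with more players contribute more rows to the plot above. One way to look at typical match length per mode is to keep a single row per matchId first. The sketch below does this on a made-up frame called `players`, which is only a stand-in for `train`.

```python
import pandas as pd

# Hypothetical stand-in for `train`: one row per player, durations in seconds.
players = pd.DataFrame({
    'matchId': ['a', 'a', 'a', 'b', 'b', 'c'],
    'matchType': ['solo', 'solo', 'solo', 'solo', 'solo', 'duo'],
    'matchDuration': [1800, 1800, 1800, 1500, 1500, 1320],
})

# Keep one row per match so that each match is counted once.
matches = players.drop_duplicates(subset='matchId')

# Median match length in minutes for each game mode.
median_minutes = matches.groupby('matchType')['matchDuration'].median() / 60
print(median_minutes)
```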

We can now take a look at combat in the game. Do players rack up kills in one area, or do they move around the map looking for them?

In [42]:
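def kill_category(kills):
    # Map a kill count to a labelled range.
    if kills == 0:
        return '0 kills'
    elif kills <= 5:
        return '1-5 kills'
    elif kills <= 10:
        return '6-10 kills'
    elif kills <= 20:
        return '11-20 kills'
    else:
        return 'More than 20 kills'

# Copy the filtered rows so the new column is not written to a slice of train.
cond = train['walkDistance'] > 0
kill_analysis = train[cond].copy()
kill_analysis['kill_category'] = kill_analysis['kills'].apply(kill_category)

fig,ax=plt.subplots(figsize=(15,5))
cond = kill_analysis['walkDistance'] < 15000
# Fix the order of the boxes so the groups run from fewest to most kills.
order = ['0 kills','1-5 kills','6-10 kills','11-20 kills','More than 20 kills']
sns.boxplot(data=kill_analysis[cond],x='kill_category',y='walkDistance',order=order,ax=ax)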

ax.set_title('Distance travelled for each kill group',fontsize=15)
ax.set_xlabel('Number of kills')
ax.set_ylabel('Distance walked')
plt.show()

Players who get 5 to 20 kills tend to travel farther than those with fewer kills. However, a lucky player may not need to travel far to get a lot of kills if the area they are in is filled with new players who have little awareness of their surroundings.
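
The same kind of grouping can also be built with `pd.cut`, which assigns each kill count to a bin in one vectorised call. This is only a sketch on a made-up frame called `sample`; the bin edges follow the thresholds used above, and the label wording is my own choice.

```python
import pandas as pd

# Hypothetical stand-in for `train`.
sample = pd.DataFrame({
    'kills': [0, 1, 5, 6, 10, 11, 20, 21],
    'walkDistance': [50.0, 900.0, 2100.0, 2500.0, 3000.0, 2800.0, 3500.0, 1200.0],
})

# Right-inclusive bins: (-1, 0], (0, 5], (5, 10], (10, 20], (20, inf].
bins = [-1, 0, 5, 10, 20, float('inf')]
labels = ['0 kills', '1-5 kills', '6-10 kills', '11-20 kills', 'More than 20 kills']
sample['kill_category'] = pd.cut(sample['kills'], bins=bins, labels=labels)

# Median distance walked in each kill group.
median_walk = sample.groupby('kill_category', observed=True)['walkDistance'].median()
print(median_walk)
```

Because the result is an ordered categorical, plots and group summaries keep the groups in bin order without extra sorting.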

There are many ways to get around the map, be it running on foot, driving or swimming. We can look at how players move and compare them with those who finished in first place to see whether they differ.

In [76]:
def sec_to_min(sec):
    return sec//60

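# Copy the filtered rows so the new column is not written to a slice of train.
cond = train['walkDistance'] > 0
cond2 = train['winPlacePerc'] != 1
dist_analysis = train[cond & cond2].copy()
# Players who finished first; defined here because the code below relies on it.
better_players = train[train['winPlacePerc'] == 1].copy()
dist_analysis['minutes'] = dist_analysis['matchDuration'].apply(sec_to_min)
better_players['minutes'] = better_players['matchDuration'].apply(sec_to_min)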
def mean_of_each_minute(data,column):
    new_data = data.groupby('minutes')[column].mean().reset_index()
    return new_data

ride_distance = mean_of_each_minute(dist_analysis,'rideDistance')
walk_distance = mean_of_each_minute(dist_analysis,'walkDistance')
swim_distance = mean_of_each_minute(dist_analysis,'swimDistance')

better_ride_distance = mean_of_each_minute(better_players,'rideDistance')
better_walk_distance = mean_of_each_minute(better_players,'walkDistance')
better_swim_distance = mean_of_each_minute(better_players,'swimDistance')

fig,ax=plt.subplots(nrows=2,figsize=(15,10))

ride_distance.plot(kind='line',x='minutes',y='rideDistance',ax=ax[0],color='green')
walk_distance.plot(kind='line',x='minutes',y='walkDistance',ax=ax[0],color='red')
swim_distance.plot(kind='line',x='minutes',y='swimDistance',ax=ax[0],color='blue')

better_ride_distance.plot(kind='line',x='minutes',y='rideDistance',ax=ax[1],color='green')
better_walk_distance.plot(kind='line',x='minutes',y='walkDistance',ax=ax[1],color='red')
better_swim_distance.plot(kind='line',x='minutes',y='swimDistance',ax=ax[1],color='blue')

fig.suptitle('Mean distance travelled at different match durations',fontsize=20)
ax[0].set_title('Normal Players',fontsize=15)
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Distance')
ax[1].set_title('Better Players',fontsize=15)
ax[1].set_xlabel('Time')
ax[1].set_ylabel('Distance')
plt.show()

Overall, the patterns are similar, but it is worth noting that better players do not move as much around the times when the boundary closes, so as not to attract attention.
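
The per-minute means can also be computed for all three distance columns in a single groupby call instead of one call per column. The sketch below uses a made-up frame called `sample` as a stand-in for the filtered training data.

```python
import pandas as pd

# Hypothetical stand-in for the filtered training data.
sample = pd.DataFrame({
    'matchDuration': [1210, 1250, 1805, 1850],
    'walkDistance': [100.0, 300.0, 1000.0, 2000.0],
    'rideDistance': [0.0, 0.0, 500.0, 1500.0],
    'swimDistance': [0.0, 10.0, 0.0, 0.0],
})

# Whole minutes of match duration, the same floor division as sec_to_min.
sample['minutes'] = sample['matchDuration'] // 60

# Mean of all three distance columns for each minute in one call.
columns = ['walkDistance', 'rideDistance', 'swimDistance']
mean_distance = sample.groupby('minutes')[columns].mean()
print(mean_distance)
```

Calling `mean_distance.plot()` on the result draws one line per column, indexed by minute.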

We can also apply a similar analysis to the items they use throughout the game.

In [48]:
boosts = mean_of_each_minute(dist_analysis,'boosts')
heals = mean_of_each_minute(dist_analysis,'heals')
weapons = mean_of_each_minute(dist_analysis,'weaponsAcquired')

better_boosts = mean_of_each_minute(better_players,'boosts')
better_heals = mean_of_each_minute(better_players,'heals')
better_weapons = mean_of_each_minute(better_players,'weaponsAcquired')

fig,ax=plt.subplots(nrows=2,figsize=(15,10))

boosts.plot(kind='line',x='minutes',y='boosts',ax=ax[0],color='green')
heals.plot(kind='line',x='minutes',y='heals',ax=ax[0],color='red')
weapons.plot(kind='line',x='minutes',y='weaponsAcquired',ax=ax[0],color='blue')

better_boosts.plot(kind='line',x='minutes',y='boosts',ax=ax[1],color='green')
better_heals.plot(kind='line',x='minutes',y='heals',ax=ax[1],color='red')
better_weapons.plot(kind='line',x='minutes',y='weaponsAcquired',ax=ax[1],color='blue')

fig.suptitle('Mean item usage at different match durations',fontsize=20)
ax[0].set_title('Normal Players',fontsize=15)
ax[0].set_xlabel('Time')
ax[0].set_ylabel('Number')
ax[1].set_title('Better Players',fontsize=15)
ax[1].set_xlabel('Time')
ax[1].set_ylabel('Number')
plt.show()

The data suggests that to survive longer in the game, one should use more heals and boosts. More importantly, do not acquire too many weapons; be content with what you have and move on.

To find out how successful players play the game, we can analyse the movement of players who finished first in their respective matches.

In [44]:
first = train['winPlacePerc'] == 1
better_players = train[first]
In [91]:
def distance_analyzer(distance_type):
    # Split the winners into three bands: below the median, from the median up to
    # 1.5 times the 75th percentile, and above that bound.
    seventy_five = np.percentile(better_players[distance_type],75)
    median = np.percentile(better_players[distance_type],50)
    upper_bound = seventy_five * 1.5
    cond1 = better_players[distance_type] < median
    # Use >= and <= so rows exactly on a boundary are not dropped.
    cond2 = better_players[distance_type] >= median
    cond3 = better_players[distance_type] <= upper_bound
    cond4 = better_players[distance_type] > upper_bound
    below_median = better_players[cond1][distance_type].count()
    above_median = better_players[cond2 & cond3][distance_type].count()
    above_upper_bound = better_players[cond4][distance_type].count()
    summary = pd.DataFrame({'Condition':['Below '+str(median)+'m','Above '+str(median)+'m','Above '+str(upper_bound)+'m'],
                            'Count':[below_median,above_median,above_upper_bound]})
    return summary
In [110]:
cond1 = better_players['swimDistance'] == 0
cond2 = better_players['swimDistance'] > 0
zero = better_players[cond1]['swimDistance'].count()
above = better_players[cond2]['swimDistance'].count()

walk_distance = distance_analyzer('walkDistance')
ride_distance = distance_analyzer('rideDistance')
swim_distance = pd.DataFrame({'Condition':['0m','Above 0m'],'Count':[zero,above]})

distances = [walk_distance,ride_distance,swim_distance]

fig,ax = plt.subplots(ncols=3,figsize=(16,6))
# x must be passed by keyword in recent seaborn releases.
for col, distance in enumerate(distances):
    sns.barplot(x='Condition',y='Count',data=distance,ax=ax[col],palette='plasma')
ax[0].set_title('Walking')
ax[1].set_title('Driving')
ax[2].set_title('Swimming')
fig.suptitle('Distance Analysis',fontsize=20)
plt.show()

To have a higher chance of winning the game, follow the pros and try not to swim. Drive and walk more, but not too much.

In [114]:
def item_analyzer(item_type):
    # Same three bands as distance_analyzer, applied to item counts.
    seventy_five = np.percentile(better_players[item_type],75)
    median = np.percentile(better_players[item_type],50)
    upper_bound = seventy_five * 1.5
    cond1 = better_players[item_type] < median
    # Use >= and <= so rows exactly on a boundary are not dropped.
    cond2 = better_players[item_type] >= median
    cond3 = better_players[item_type] <= upper_bound
    cond4 = better_players[item_type] > upper_bound
    below_median = better_players[cond1][item_type].count()
    above_median = better_players[cond2 & cond3][item_type].count()
    above_upper_bound = better_players[cond4][item_type].count()
    summary = pd.DataFrame({'Condition':['Below '+str(median),'Above '+str(median),'Above '+str(upper_bound)],
                            'Count':[below_median,above_median,above_upper_bound]})
    return summary
In [122]:
boost = item_analyzer('boosts')
heal = item_analyzer('heals')
weapon = item_analyzer('weaponsAcquired')
items = [boost,heal,weapon]
fig,ax=plt.subplots(ncols=3,figsize=(16,6))
# x must be passed by keyword in recent seaborn releases.
for col, item in enumerate(items):
    sns.barplot(x='Condition',y='Count',data=item,ax=ax[col],palette='plasma')
ax[0].set_title('Boosts')
ax[1].set_title('Heals')
ax[2].set_title('Weapons')
fig.suptitle('Items Analysis',fontsize=20)
plt.show()

Like the pros, use healing items more often than boosts, but remember to use both. Picking up around 5 weapons over the course of a match also seems to go with a higher chance of winning.

In [74]:
cond = better_players['kills'] <= 15
cond2 = better_players['kills'] > 15
cond3 = better_players['kills'] <=30
cond4 = better_players['kills'] > 30
cond5 = better_players['kills'] < 60
less_combat_players = better_players[cond]
less_combat_players = less_combat_players.sample(int(less_combat_players.shape[0]/50))
mid_combat_players = better_players[cond2 & cond3]
good_combat_players = better_players[cond4 & cond5]
fig,ax=plt.subplots(figsize=(15,8))
x = np.linspace(0, 40, 1000)
plt.plot(x, x + 0, linestyle='dotted',color='black')
less_combat_players.plot(kind='scatter',x='kills',y='headshotKills',ax=ax)
mid_combat_players.plot(kind='scatter',x='kills',y='headshotKills',ax=ax,color='red')
good_combat_players.plot(kind='scatter',x='kills',y='headshotKills',ax=ax,color='green')


ax.set_title('Accuracy of good players',fontsize=15)
ax.set_xlabel('Kills')
ax.set_ylabel('Headshot kills')

plt.show()

Looking at the accuracy of the better players, I would guess that not everyone goes for the headshot; many bring down the enemy's health with body shots instead. A good tactic, but it also shows that the pros are not that pro after all.
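
To put a number on this, one could compute the share of each winner's kills that were headshots. The sketch below does so on a made-up frame called `winners`, which is only a stand-in for `better_players`; the figures are illustrative and not results from the real data.

```python
import pandas as pd

# Hypothetical stand-in for `better_players`.
winners = pd.DataFrame({
    'kills': [0, 4, 10, 20],
    'headshotKills': [0, 1, 5, 4],
})

# The ratio is undefined for players with no kills, so leave them out.
with_kills = winners[winners['kills'] > 0]

# Share of kills that were headshots for each remaining player.
headshot_rate = with_kills['headshotKills'] / with_kills['kills']
print(headshot_rate.describe())
```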

We can try to use a decision tree to predict the match placements.

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
In [25]:
FEATURES = ['DBNOs','assists','boosts','kills','heals','matchDuration','rideDistance','walkDistance','weaponsAcquired']
OUTCOME = 'winPlacePerc'
In [19]:
x_train,x_test,y_train,y_test = train_test_split(train[FEATURES],train[OUTCOME],test_size=0.2)
In [13]:
def create_tree(x_train, y_train, x_test, max_depth=None):
    decision_tree = DecisionTreeClassifier(criterion="entropy",max_depth=max_depth)

    decision_tree.fit(x_train, y_train)
    
    result = decision_tree.predict(x_test)
    
    return (result, decision_tree)
In [26]:
results, decision_tree = create_tree(x_train,y_train,x_test)

results_pruned, decision_tree_pruned = create_tree(x_train,y_train,x_test,max_depth=3)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-edeac936128b> in <module>()
----> 1 results, decision_tree = create_tree(x_train,y_train,x_test)
      2 
      3 results_pruned, decision_tree_pruned = create_tree(x_train,y_train,x_test,max_depth=3)

<ipython-input-13-bd8472d20915> in create_tree(x_train, y_train, x_test, max_depth)
      2     decision_tree = DecisionTreeClassifier(criterion="entropy",max_depth=max_depth)
      3 
----> 4     decision_tree.fit(x_train, y_train)
      5 
      6     result = decision_tree.predict(x_test)

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    788             sample_weight=sample_weight,
    789             check_input=check_input,
--> 790             X_idx_sorted=X_idx_sorted)
    791         return self
    792 

~\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    138 
    139         if is_classification:
--> 140             check_classification_targets(y)
    141             y = np.copy(y)
    142 

~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'

The ValueError arises because winPlacePerc is a continuous value, while DecisionTreeClassifier only accepts discrete class labels, so this is really a regression problem. I tried other machine learning algorithms as well, but there seem to be no conclusive results. If more time was given, I would definitely like to do more exploratory data analysis and switch to a regression model to get predictions.
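
As a sketch of how the ValueError above could be avoided, the example below swaps in `DecisionTreeRegressor`, which accepts a continuous target such as winPlacePerc. It runs on synthetic data rather than the PUBG features, so the arrays `X` and `y`, the depth of 3 and the printed error are assumptions for illustration only and not results of this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train[FEATURES] and train[OUTCOME]; not real PUBG data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = np.clip(0.6 * X[:, 0] + 0.4 * X[:, 1], 0, 1)  # continuous target in [0, 1]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A regressor accepts continuous targets, unlike DecisionTreeClassifier.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(x_train, y_train)

predictions = tree.predict(x_test)
print(mean_absolute_error(y_test, predictions))
```

With the real data, `X` and `y` would be replaced by `train[FEATURES]` and `train[OUTCOME]`, and the depth would need tuning.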

From this exercise, I realised that even with huge amounts of data, one needs strong analytical skills to sift out relevant and useful information and turn it into insights. Although this is my first attempt at EDA, that does not deter me from continuing, because the process is very rewarding. I hope that one day I can achieve what other data scientists and EDA experts are able to accomplish. It is really amazing how they come up with such wonderful kernels on www.kaggle.com.

August 12, 2020 Published by Winston Ma Zhao Yang
