# Road Safety in Great Britain

Koh Yee Shin

Road Accidents Analyses - Lum Yao Jun & Koh Yee Shin

# Analysis of Road Accidents in Great Britain¶

What variables and factors contribute to the frequency and severity of a road traffic accident in the UK?
For years, the UK’s traffic police have been accumulating data on road traffic injuries and accidents. As such accidents tend to be fatal and life-threatening, analysis of the factors contributing to an accident would prove useful for good knowledge, and could possibly contribute towards saving lives.

Analysis Objective:
Generate insights regarding the frequency and severity of accidents

How does the frequency of accidents change across time?
How do different factors (time, type of roads, etc) affect the rate and severity of accidents?
Which factors overall give a better prediction of accident severity?

Names: Lum Yao Jun, Koh Yee Shin

# The Data¶

Data Source: The Department for Transport, data.gov.uk
This dataset is published by the UK’s Department for Transport, and contains detailed road safety data about the circumstances of personal injury road accidents in Great Britain. Data in this dataset was collected from personal injury reports filed to the police, and thus only covers accidents which were reported. Variables of interest in this dataset include the severity of accident, the type of roads, junctions, geographical features and the lighting conditions when the accident occurred. The dataset is provided in csv format, and is downloadable from this link. Accident data is presented daily, from years 2005 to 2014.

### Methodology¶

In our exploratory analysis, data was transformed in several ways to generate insights. Transformations include data aggregation by groups, pivot tables, and creation of new features. The usage of graphical visualizations are used to visualize univariate relationships between dataset features and accident frequency/severity, in an attempt to explain their causes.

Apart from graphs, simple linear regression was used to estimate trends and seasonalities, as the algorithm generates easily interpretable results. An ensemble ML algorithm, xgboost, was also used to determine feature importance. This algorithm was chosen due to its performance in the industry (it is known to be accurate).

All missing datapoints in the original dataset are denoted by '-1'. We replace them with NaN values, and exclude them from graphs accordingly.

### Construct Working Dataset¶

In [1]:
# Import relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import datetime as dt
import seaborn as sns

In [2]:
# Import dataset


C:\Users\Yee Shin\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (31) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

In [3]:
# display(data.head())
display(data.shape)
# display(data.dtypes)

(1640597, 32)
In [4]:
# Filter data to years 2012 - 2014 only (Dataset is too big)
df = data[(data['Date'] >= '2012-01-01') & (data['Date'] < '2015-01-01')]

df.shape

Out[4]:
(430553, 32)
In [5]:
# Filter unneeded columns
selected = ['Accident_Severity',
'Number_of_Vehicles',
'Number_of_Casualties',
'Date','Day_of_Week',
'Time',
'Speed_limit',
'Junction_Detail',
'Junction_Control',
'Pedestrian_Crossing-Human_Control',
'Pedestrian_Crossing-Physical_Facilities',
'Light_Conditions',
'Weather_Conditions',
'Urban_or_Rural_Area']
df = df[selected]
display(df.dtypes)

Accident_Severity Number_of_Vehicles Number_of_Casualties Date Day_of_Week Time 1st_Road_Class Road_Type Speed_limit Junction_Detail Junction_Control Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities Light_Conditions Weather_Conditions Road_Surface_Conditions Urban_or_Rural_Area
1210044 3 2 1 2012-01-19 5 20:35 3 6 30 6 2 0 5 4 1 1 1
1210045 3 2 1 2012-04-01 4 17:00 4 6 30 3 4 0 0 4 1 1 1
1210046 3 2 1 2012-10-01 3 10:07 3 2 30 6 4 0 4 1 1 1 1
1210047 3 1 1 2012-01-18 4 12:20 5 6 30 3 4 0 0 1 1 1 1
1210048 3 1 1 2012-01-17 3 20:24 4 6 30 3 4 0 0 4 1 1 1
Accident_Severity                                   int64
Number_of_Vehicles                                  int64
Number_of_Casualties                                int64
Date                                       datetime64[ns]
Day_of_Week                                         int64
Time                                               object
Speed_limit                                         int64
Junction_Detail                                     int64
Junction_Control                                    int64
Pedestrian_Crossing-Human_Control                   int64
Pedestrian_Crossing-Physical_Facilities             int64
Light_Conditions                                    int64
Weather_Conditions                                  int64
Urban_or_Rural_Area                                 int64
dtype: object
In [6]:
# Missing values in the data are denoted as '-1'. We replace them with nan from the numpy package
df[df==-1]= np.nan
df[df=='-1']= np.nan


### Conduct analysis on frequency of accidents across time (Trend analysis)¶

In [7]:
# Aggregate by date
dailydf = df.groupby('Date').size()
type(dailydf)
dailydf = dailydf.reset_index(name = 'daycount')

Out[7]:
Date daycount
0 2012-01-01 249
1 2012-01-02 424
2 2012-01-03 439
3 2012-01-04 329
4 2012-01-05 376
In [8]:
# Calculate summary statistics
dailydf.describe()

Out[8]:
daycount
count 1096.000000
mean 392.840328
std 75.366399
min 128.000000
25% 344.750000
50% 399.000000
75% 444.000000
max 617.000000
In [9]:
# Conduct simple linear regression of the time series of accident counts. The predicted
# fitted values are for the mapping of the trendline. The "tindex" column is for the trendline,
# since

x = pd.Series(np.arange(len(dailydf))+1)
y= dailydf['daycount']
regression = sm.OLS(y,x).fit()
display(regression.summary())
display(regression.params)
dailydf['tindex'] = x = pd.Series(np.arange(len(dailydf))+1)
dailydf['fitted'] = regression.fittedvalues

Dep. Variable: R-squared: daycount 0.002 OLS 0.001 Least Squares 2.578 Fri, 23 Nov 2018 0.109 22:27:16 -6290.7 1096 1.259e+04 1094 1.260e+04 1 nonrobust
coef std err t P>|t| [0.025 0.975] 386.5086 4.553 84.893 0.000 377.575 395.442 0.0115 0.007 1.605 0.109 -0.003 0.026
 Omnibus: Durbin-Watson: 15.457 1.326 0 15.931 -0.295 0.000347 2.956 1270

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.27e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
const    386.508624
0          0.011544
dtype: float64
Out[9]:
Date daycount tindex fitted
0 2012-01-01 249 1 386.520168
1 2012-01-02 424 2 386.531711
2 2012-01-03 439 3 386.543255
3 2012-01-04 329 4 386.554799
4 2012-01-05 376 5 386.566342
5 2012-01-06 415 6 386.577886
6 2012-01-07 347 7 386.589430
7 2012-01-08 359 8 386.600974
8 2012-01-09 297 9 386.612517
9 2012-01-10 434 10 386.624061
10 2012-01-11 453 11 386.635605
11 2012-01-12 412 12 386.647148
12 2012-01-13 499 13 386.658692
13 2012-01-14 372 14 386.670236
14 2012-01-15 281 15 386.681779
15 2012-01-16 543 16 386.693323
16 2012-01-17 465 17 386.704867
17 2012-01-18 420 18 386.716410
18 2012-01-19 443 19 386.727954
19 2012-01-20 487 20 386.739498
In [10]:
FIG_SIZE = (15, 6)
fig,ax = plt.subplots(figsize = FIG_SIZE)
dailydf.plot(kind = 'line',x = 'Date',y='daycount',ax=ax, color='green')
fig.suptitle('Accident counts over time')
ax.set_xlabel('Date')
ax.set_ylabel('Number of Accidents')
dailydf.plot(kind = 'line',x = 'Date',y='fitted',ax=ax, color='red')

Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e9c006ec50>

#### Report:¶

From the data, we see that reported accidents in GB average about 393 per day, with a standard deviation of around 75. A linear regression of accident counts against the time trend indicate that average accident rates per day has remained somewhat constant, seen by the statistically insignificant coefficient of the time trend value (p-value is 0.109).

This would mean that accident rates in GB has not decreased across these 3 years, and any efforts by the traffic police to solve this issue has had little impact on the frequency of accidents

### Seasonality analysis¶

From our previous plot, accident rates seem to display some cyclic behavior, with lower accident rates at the beginning of the year, compared to the end of the year.

We shall attempt to dwelve deeper into this behavior by running a seasonality analysis. To do so, each month will be transformed into dummy variables. A regression with each month as their respective dummy variables will be conducted to ascertain if there is a seasonality effect. Lastly, the relative seasonality for each months will be plotted.

In [11]:
# Construct dummy Variables for each month
dailydf['Date'] =pd.to_datetime(dailydf['Date'])
dailydf['month'] =dailydf['Date'].dt.month
dailydf['Jan']= np.where(dailydf['month']==1, 1, 0)
dailydf['Feb']= np.where(dailydf['month']==2, 1, 0)
dailydf['Mar']= np.where(dailydf['month']==3, 1, 0)
dailydf['Apr']= np.where(dailydf['month']==4, 1, 0)
dailydf['May']= np.where(dailydf['month']==5, 1, 0)
dailydf['Jun']= np.where(dailydf['month']==6, 1, 0)
dailydf['Jul']= np.where(dailydf['month']==7, 1, 0)
dailydf['Aug']= np.where(dailydf['month']==8, 1, 0)
dailydf['Sep']= np.where(dailydf['month']==9, 1, 0)
dailydf['Oct']= np.where(dailydf['month']==10, 1, 0)
dailydf['Nov']= np.where(dailydf['month']==11, 1, 0)
dailydf['Dec']= np.where(dailydf['month']==12, 1, 0)
# dailydf['month'] =dailydf['month'].apply(str)

Out[11]:
Date daycount tindex fitted month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 2012-01-01 249 1 386.520168 1 1 0 0 0 0 0 0 0 0 0 0 0
1 2012-01-02 424 2 386.531711 1 1 0 0 0 0 0 0 0 0 0 0 0
2 2012-01-03 439 3 386.543255 1 1 0 0 0 0 0 0 0 0 0 0 0
3 2012-01-04 329 4 386.554799 1 1 0 0 0 0 0 0 0 0 0 0 0
4 2012-01-05 376 5 386.566342 1 1 0 0 0 0 0 0 0 0 0 0 0
In [12]:
# Conduct Regression with month dummy variables to find seasonality effect
varselect = ['tindex','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
#January is left out as the objective of the analysis is to
#determine the relative mean accident rate of each month compared to January

x = dailydf[varselect]
y = dailydf['daycount']
regression = sm.OLS(y,x).fit()
display(regression.summary())
season = regression.params.reset_index(name = 'Seasonality')
season = season.iloc[2:,:]
display(season)

Dep. Variable: R-squared: daycount 0.053 OLS 0.043 Least Squares 5.088 Fri, 23 Nov 2018 2.59e-08 22:27:17 -6261.9 1096 1.255e+04 1083 1.261e+04 12 nonrobust
coef std err t P>|t| [0.025 0.975] 381.7614 8.160 46.787 0.000 365.751 397.772 0.0042 0.007 0.559 0.576 -0.010 0.019 -13.4490 11.066 -1.215 0.224 -35.162 8.264 -6.6348 10.822 -0.613 0.540 -27.869 14.599 -2.3299 10.923 -0.213 0.831 -23.763 19.103 14.4008 10.850 1.327 0.185 -6.889 35.690 17.6821 10.960 1.613 0.107 -3.824 39.188 25.0924 10.897 2.303 0.021 3.710 46.474 3.3286 10.928 0.305 0.761 -18.115 24.772 17.4647 11.052 1.580 0.114 -4.221 39.151 25.8158 11.004 2.346 0.019 4.225 47.407 42.0767 11.136 3.778 0.000 20.226 63.927 -18.6109 11.097 -1.677 0.094 -40.385 3.163
 Omnibus: Durbin-Watson: 7.745 1.393 0.021 7.865 -0.205 0.0196 2.939 7890

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
index Seasonality
2 Feb -13.448988
3 Mar -6.634814
4 Apr -2.329894
5 May 14.400833
6 Jun 17.682098
7 Jul 25.092394
8 Aug 3.328560
9 Sep 17.464664
10 Oct 25.815820
11 Nov 42.076655
12 Dec -18.610898

From the above regression, it can be seen that July, October and November are statistically significant months (as their p-value is <0.05).

In [13]:
# Plot seasonality
colors = {'Feb':'r','Mar':'r','Apr':'r','May':'r','Jun':'r','Jul':'b','Aug':'r','Sep':'r','Oct':'b','Nov':'b','Dec':'r'}

fig,ax = plt.subplots(figsize = FIG_SIZE)
bar = season.plot(kind = 'bar',x = 'index',y='Seasonality',ax=ax, color=[colors[t] for t in season['index']])
fig.suptitle('Relative seasonality of months (with January as a base, blue bars are statistically significant)')
ax.set_xlabel('Month')
ax.set_ylabel('Relative Seasonality')

Out[13]:
Text(0,0.5,'Relative Seasonality')

This graph shows the relative accident counts compared to January. Blue bars are statistically significant at the 95% level. We can see that the relative number of accidents are lower during December - Apr, and higher during May to November, with the highest during the July and November periods.

To support this observation, we proceeded to plot the mean and standard deviation of the number of accidents for each month

In [14]:
mean = dailydf.groupby('month')['daycount'].mean()
std = dailydf.groupby('month')['daycount'].std()
fig,ax = plt.subplots(figsize = FIG_SIZE)
plt.errorbar(mean.index, mean, xerr=0.5, yerr=std, linestyle='')

plt.suptitle('Mean and Standard Deviation of accidents by months')
ax.set_xlabel('Month')
ax.set_ylabel('Accident Count')

Out[14]:
Text(0,0.5,'Accident Count')

This plot shows the mean and standard deviation of accidents per month. We once again see a pattern with accident rates rising through the months from December onwards till November.

We thus can conclude that the accident data does exhibit seasonality, with more accidents occuring during the later half of the year (excluding december) .

### Time of Day Analysis¶

We now look at how frequent accidents are across the days. We hypothesize that accidents are more prevalent at night, as lighting conditions are worse. We also expect accidents at night to be more severe, since drivers may be unable to see pedestrians or other drivers clearly before the collision. They may hence find it difficult to swerve or otherwise take steps to reduce the severity of the impact before collision.

To test this hypothesis, we decided to plot a stacked area chart. We chose to plot a stacked area chart due to the time values being continous. The stacked area chart will also display the "density" of the frequency of accidents (analogous to a probability distribution) in their respective categories (Slight, Serious and Fatal), and will allow us to determine if the frequency of accidents is higher at certain times of the day.

In [15]:
# Construct an 'hour accident happened' variable using the time accident happened variable

df['hour'] = df['Time'].str[0:2]
df['hour'] = df['hour'].apply(float)
df.dtypes

Out[15]:
Accident_Severity                                   int64
Number_of_Vehicles                                  int64
Number_of_Casualties                                int64
Date                                       datetime64[ns]
Day_of_Week                                         int64
Time                                               object
Speed_limit                                         int64
Junction_Detail                                   float64
Junction_Control                                  float64
Pedestrian_Crossing-Human_Control                   int64
Pedestrian_Crossing-Physical_Facilities             int64
Light_Conditions                                    int64
Weather_Conditions                                  int64
Urban_or_Rural_Area                                 int64
hour                                              float64
dtype: object
In [16]:
houroccur = (df.groupby(['hour','Accident_Severity']).size())
houroccur = houroccur.reset_index(name = 'count')
houroccur = pd.pivot_table(houroccur,
index = "hour",
columns = 'Accident_Severity',
values = 'count')
houroccur = houroccur.reset_index()
houroccur['hour'] = houroccur['hour'].apply(int)
houroccur

Out[16]:
Accident_Severity hour 1 2 3
0 0 145 1144 4718
1 1 122 845 3374
2 2 95 739 2638
3 3 101 622 2194
4 4 87 454 1863
5 5 113 695 2935
6 6 134 1307 6664
7 7 195 2615 16210
8 8 171 3633 27958
9 9 179 2628 19088
10 10 246 2693 17001
11 11 262 3025 19594
12 12 221 3394 22028
13 13 266 3475 22067
14 14 254 3604 22236
15 15 316 4696 28302
16 16 339 5024 29764
17 17 353 5319 33440
18 18 282 4358 25961
19 19 266 3241 18336
20 20 178 2493 13213
21 21 186 2027 9944
22 22 211 1779 8507
23 23 180 1385 6408
In [17]:
fig,ax = plt.subplots(figsize = FIG_SIZE)
x=houroccur['hour']
y=[ houroccur[1], houroccur[2], houroccur[3] ]

colors =  sns.color_palette("Set1")
plt.stackplot(x,y, labels=['Slight','Serious','Fatal'])
plt.legend(loc='upper left')
plt.title('Frequency of Accidents by Time of day (24hrs)')
plt.ylabel('Frequency of Accidents')
plt.xlabel('Hour of Day (24hrs)')

Out[17]:
Text(0.5,0,'Hour of Day (24hrs)')

This stacked area plot of the accident frequency indicates that our hypothesis is incorrect. Not only are there fewer accidents during the night (defined as 8pm-6am), but these accidents are also less severe (as indicated by the area for the frequency of fatal accidents taking up a greater proportion of whole during the day).

It seems that accidents are more frequent and severe during certain periods of the day: Around 7am in the morning, and around 5pm in the evening. This suggests that accidents are likely to be caused more by congestion during rush hours, than by poor environment conditions.

### Frequency of accidents on roads and speed limits¶

We decided to further look at the frequency of accidents for each speed limit to determine if reckless driving at high speeds are a factor in causing more accidents.

In [18]:
# Create dataset to sort on speed
speedlimit = (df.groupby(['Date','Speed_limit']).size())
speedlimit = speedlimit.reset_index(name = 'count')


Out[18]:
Date Speed_limit count
0 2012-01-01 20 2
1 2012-01-01 30 157
2 2012-01-01 40 23
3 2012-01-01 50 8
4 2012-01-01 60 38
In [19]:
# Check averages
speedavg = speedlimit.groupby('Speed_limit').mean()
speedavg = speedavg.reset_index()
speedavg

Out[19]:
Speed_limit count
0 10 1.000000
1 20 7.634280
2 30 254.991788
3 40 32.318431
4 50 14.924270
5 60 55.819343
6 70 27.186131
In [20]:
# Graph Means
fig,ax = plt.subplots(figsize=FIG_SIZE)
speedavg.plot(kind = 'bar',x = 'Speed_limit', y = 'count', ax = ax, color = 'steelblue')
ax.set_title('Average accidents by speed limit each day')
ax.set_ylabel("Accident count")

Out[20]:
Text(0,0.5,'Accident count')

It would appear that roads with a speed limit of 30 have the most accidents. This is curious as we would expect roads with a higher speed limit to have more accidents due to the difficulty of reacting in time to any unexpected road conditions.

As the bar graph shows aggregated data, we also decided to plot a time series of the frequency of accidents for each speed limit to determine if the speed limit of 30 was due to a one-off mass collusion.

In [21]:
# Graph rate over time
speedTS = pd.pivot_table(speedlimit,
index = "Date",
columns = 'Speed_limit',
values = 'count')
speedTS = speedTS.reset_index()

Out[21]:
Speed_limit Date 10 20 30 40 50 60 70
0 2012-01-01 NaN 2.0 157.0 23.0 8.0 38.0 21.0
1 2012-01-02 NaN 6.0 290.0 30.0 10.0 62.0 26.0
2 2012-01-03 NaN 11.0 314.0 28.0 14.0 49.0 23.0
3 2012-01-04 NaN 5.0 211.0 30.0 14.0 43.0 26.0
4 2012-01-05 NaN 4.0 254.0 26.0 15.0 56.0 21.0
In [22]:
# Graph accident frequency over time by speed limit
x = speedTS['Date']
y = [speedTS[10],speedTS[20],speedTS[30],speedTS[40],speedTS[50],speedTS[60],speedTS[70]]
fig,ax = plt.subplots(figsize=FIG_SIZE)
for i in y:
plt.plot(x,i)
plt.legend()
plt.title('Accident rates by speed limit over time')
plt.xlabel('Dates')
plt.ylabel('Accident counts')

Out[22]:
Text(0,0.5,'Accident counts')

As shown by the graph, the higher accident rate of roads in speed limit 30 is consistent across the whole dataset. A possible cause may be the fact that 30 is the default speed limit in many UK roads.

To test this hypothesis, we decided to make a stacked bar chart of the count of accidents by road type, split by whether or not the speed limit is 30 for that particular type.

In [23]:
# Group by road types and whether the speed limit is 30 or not
df['speedlimit30']= np.where(df['Speed_limit']==30, 'is30', 'not30')

# Rename Road types according to website's provided data dictionary

# Pivot data for graph
columns = 'speedlimit30',
values = 'count')


0 1 is30 18811
1 1 not30 11096
2 2 is30 7530
3 2 not30 896
4 3 is30 17151
5 3 not30 43995
6 6 is30 233736
7 6 not30 91266
8 7 is30 1156
9 7 not30 3361
10 9 is30 1087
11 9 not30 468
Out[23]:
speedlimit30 is30 not30
Dual carriageway 17151 43995
One way street 7530 896
Single carriageway 233736 91266
Unknown 1087 468
In [24]:
# Create stacked bar chart
fig,ax = plt.subplots()

stacked = True,
ax=ax)

ax.set_title('Count of accidents by road type, split by speed limit 30 or not')
ax.set_ylabel("Accident count")

Out[24]:
Text(0,0.5,'Accident count')

The graph indicates that the 'Single carriageway' road type has the most accidents. This is understandable, as a single carriageway is a 2 way road with one lane in each direction, without any middle-of-road barriers. To overtake traffic on such roads, cars would have to overtake on the lane with oncoming traffic, which is dangerous.

We see that the majority of single carriageway roads have the speed limit of 30. This helps to explain why there is such a high rate of accidents on speed limit 30 roads.

### Feature Importance¶

We now attempt to see which features in the dataset proves useful for predicting accident severity.

In [25]:
# Import relevant models
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [26]:
# Filter unneeded columns from original dataset
selected = ['Accident_Severity',
'Number_of_Vehicles',
'Date','Day_of_Week',
'Time',
'Speed_limit',
'Junction_Detail',
'Junction_Control',
'Pedestrian_Crossing-Human_Control',
'Pedestrian_Crossing-Physical_Facilities',
'Light_Conditions',
'Weather_Conditions',
'Urban_or_Rural_Area']

df3 = data[selected]
display(df3.dtypes)
display(df3.shape)

Accident_Severity                                   int64
Number_of_Vehicles                                  int64
Date                                       datetime64[ns]
Day_of_Week                                         int64
Time                                               object
Speed_limit                                         int64
Junction_Detail                                     int64
Junction_Control                                    int64
Pedestrian_Crossing-Human_Control                   int64
Pedestrian_Crossing-Physical_Facilities             int64
Light_Conditions                                    int64
Weather_Conditions                                  int64
Urban_or_Rural_Area                                 int64
dtype: object
(1640597, 16)
In [27]:
import warnings
warnings.filterwarnings("ignore")

In [28]:
# One-hot-encode months
df3['month'] =df3['Date'].dt.month
df3['Jan']= np.where(df3['month']==1, 1, 0)
df3['Feb']= np.where(df3['month']==2, 1, 0)
df3['Mar']= np.where(df3['month']==3, 1, 0)
df3['Apr']= np.where(df3['month']==4, 1, 0)
df3['May']= np.where(df3['month']==5, 1, 0)
df3['Jun']= np.where(df3['month']==6, 1, 0)
df3['Jul']= np.where(df3['month']==7, 1, 0)
df3['Aug']= np.where(df3['month']==8, 1, 0)
df3['Sep']= np.where(df3['month']==9, 1, 0)
df3['Oct']= np.where(df3['month']==10, 1, 0)
df3['Nov']= np.where(df3['month']==11, 1, 0)
df3['Dec']= np.where(df3['month']==12, 1, 0)

# Construct hour variable
df3['hour'] = df3['Time'].str[0:2]
df3['hour'] = df3['hour'].apply(float)

# Set severity to be 'Death' or 'Non Death' only (instead of 3 levels "slight, serious, fatal")
df3['Severitypredict']= np.where(df3['Accident_Severity']==3, 1, 0)

# Drop unwanted variables
deselected = ['Accident_Severity',
'Time',
'Date']
df4 = df3.drop(deselected, axis=1)
display(df4.dtypes)

Number_of_Vehicles Day_of_Week 1st_Road_Class Road_Type Speed_limit Junction_Detail Junction_Control Pedestrian_Crossing-Human_Control Pedestrian_Crossing-Physical_Facilities Light_Conditions ... May Jun Jul Aug Sep Oct Nov Dec hour Severitypredict
0 1 3 3 6 30 0 -1 0 1 1 ... 0 0 0 0 0 0 0 0 17.0 0
1 1 4 4 3 30 6 2 0 5 4 ... 1 0 0 0 0 0 0 0 17.0 1
2 2 5 5 6 30 0 -1 0 0 4 ... 0 1 0 0 0 0 0 0 0.0 1
3 1 6 3 6 30 0 -1 0 0 1 ... 0 0 1 0 0 0 0 0 10.0 1
4 1 2 6 6 30 0 -1 0 0 7 ... 0 0 0 0 0 1 0 0 21.0 1

5 rows × 28 columns

Number_of_Vehicles                           int64
Day_of_Week                                  int64
Speed_limit                                  int64
Junction_Detail                              int64
Junction_Control                             int64
Pedestrian_Crossing-Human_Control            int64
Pedestrian_Crossing-Physical_Facilities      int64
Light_Conditions                             int64
Weather_Conditions                           int64
Urban_or_Rural_Area                          int64
month                                        int64
Jan                                          int32
Feb                                          int32
Mar                                          int32
Apr                                          int32
May                                          int32
Jun                                          int32
Jul                                          int32
Aug                                          int32
Sep                                          int32
Oct                                          int32
Nov                                          int32
Dec                                          int32
hour                                       float64
Severitypredict                              int32
dtype: object
In [29]:
# Split data into Y and X
y = df4['Severitypredict']
x = df4.drop('Severitypredict', axis=1)
display(x.shape)
display(y.shape)

(1640597, 27)
(1640597,)
In [30]:
# Split into Train/Test
seed = 123
testpercent = 0.33
X_train, X_test, Y_train, Y_test = train_test_split(x,y,test_size = testpercent)
display(X_train.shape)
display(X_test.shape)
display(Y_train.shape)
display(Y_test.shape)

(1099199, 27)
(541398, 27)
(1099199,)
(541398,)
In [31]:
#fit model
# model1 = xgb.XGBClassifier(silent=False)
# model1.fit(X_train,Y_train)

dtrain = xgb.DMatrix(X_train,Y_train)
dtest = xgb.DMatrix(X_test,Y_test)
evals = [ (dtrain,'train'),(dtest,'eval')]
# model1 = xgb.cv(DTrain)
# Train model
params = {"booster" : 'gbtree',
"learning_rate" : 0.3229541,
"n_estimators" : 1000,
"silent" : False,
"eta" : 0.3,
"min_child_weight" : 1,
"max_depth" : 4,
"gamma" : 0,
"subsample" : 0.3893429,
"colsample_bytree" : 0.4595015,
"colsample_bylevel" : 1,
"scale_pos_weight" : 1,
"objective" : 'binary:logistic',
"eval_metric" : 'auc',
"seed" : 123 ,
"verbose" : True
}
model1 = xgb.train(params,dtrain, num_boost_round = 100,evals=evals,early_stopping_rounds = 10,verbose_eval = True,
)

[22:27:32] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[0]	train-auc:0.574416	eval-auc:0.57395
Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.

Will train until eval-auc hasn't improved in 10 rounds.
[22:27:33] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[1]	train-auc:0.633635	eval-auc:0.633604
[22:27:34] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[2]	train-auc:0.637454	eval-auc:0.637541
[22:27:34] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[3]	train-auc:0.635835	eval-auc:0.635709
[22:27:35] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[4]	train-auc:0.636211	eval-auc:0.636162
[22:27:36] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[5]	train-auc:0.64365	eval-auc:0.643409
[22:27:37] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[6]	train-auc:0.646364	eval-auc:0.646394
[22:27:37] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[7]	train-auc:0.646933	eval-auc:0.647025
[22:27:38] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[8]	train-auc:0.648107	eval-auc:0.648338
[22:27:39] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[9]	train-auc:0.649441	eval-auc:0.649697
[22:27:40] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[10]	train-auc:0.650018	eval-auc:0.650293
[22:27:41] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[11]	train-auc:0.650387	eval-auc:0.650626
[22:27:41] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[12]	train-auc:0.651314	eval-auc:0.651442
[22:27:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[13]	train-auc:0.651935	eval-auc:0.652023
[22:27:43] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[14]	train-auc:0.652624	eval-auc:0.652658
[22:27:44] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[15]	train-auc:0.653519	eval-auc:0.653519
[22:27:44] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[16]	train-auc:0.654131	eval-auc:0.654209
[22:27:45] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[17]	train-auc:0.654586	eval-auc:0.654675
[22:27:46] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[18]	train-auc:0.654935	eval-auc:0.654956
[22:27:47] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[19]	train-auc:0.655142	eval-auc:0.655129
[22:27:48] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[20]	train-auc:0.655362	eval-auc:0.655271
[22:27:48] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[21]	train-auc:0.655519	eval-auc:0.655308
[22:27:49] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[22]	train-auc:0.655635	eval-auc:0.655344
[22:27:50] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[23]	train-auc:0.656031	eval-auc:0.655728
[22:27:51] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[24]	train-auc:0.656556	eval-auc:0.656247
[22:27:51] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[25]	train-auc:0.656662	eval-auc:0.656367
[22:27:52] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[26]	train-auc:0.656857	eval-auc:0.656519
[22:27:53] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[27]	train-auc:0.656929	eval-auc:0.656573
[22:27:54] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[28]	train-auc:0.657054	eval-auc:0.656699
[22:27:55] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[29]	train-auc:0.657179	eval-auc:0.656844
[22:27:56] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[30]	train-auc:0.657251	eval-auc:0.656847
[22:27:57] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[31]	train-auc:0.657354	eval-auc:0.656866
[22:27:57] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[32]	train-auc:0.657389	eval-auc:0.656871
[22:27:58] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[33]	train-auc:0.657592	eval-auc:0.657125
[22:27:59] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[34]	train-auc:0.657777	eval-auc:0.657311
[22:28:00] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[35]	train-auc:0.65785	eval-auc:0.657314
[22:28:01] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[36]	train-auc:0.658092	eval-auc:0.657456
[22:28:02] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[37]	train-auc:0.65818	eval-auc:0.657521
[22:28:03] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[38]	train-auc:0.658287	eval-auc:0.657529
[22:28:04] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[39]	train-auc:0.65838	eval-auc:0.657603
[22:28:04] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[40]	train-auc:0.658418	eval-auc:0.65761
[22:28:05] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[41]	train-auc:0.658645	eval-auc:0.657887
[22:28:06] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[42]	train-auc:0.658723	eval-auc:0.657911
[22:28:07] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[43]	train-auc:0.658838	eval-auc:0.657946
[22:28:08] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[44]	train-auc:0.658967	eval-auc:0.657986
[22:28:09] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[45]	train-auc:0.659114	eval-auc:0.658065
[22:28:09] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[46]	train-auc:0.659227	eval-auc:0.658114
[22:28:10] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[47]	train-auc:0.65937	eval-auc:0.65828
[22:28:11] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[48]	train-auc:0.659537	eval-auc:0.658421
[22:28:12] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[49]	train-auc:0.659556	eval-auc:0.658407
[22:28:13] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[50]	train-auc:0.659632	eval-auc:0.658404
[22:28:14] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[51]	train-auc:0.659674	eval-auc:0.65838
[22:28:15] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[52]	train-auc:0.65975	eval-auc:0.658453
[22:28:15] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=4
[53]	train-auc:0.659807	eval-auc:0.658503
[22:28:16] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=4
[54]	train-auc:0.65982	eval-auc:0.658515
[22:28:17] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[55]	train-auc:0.659967	eval-auc:0.658632
[22:28:18] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[56]	train-auc:0.660016	eval-auc:0.658686
[22:28:19] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[57]	train-auc:0.660137	eval-auc:0.658805
[22:28:19] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[58]	train-auc:0.660192	eval-auc:0.658802
[22:28:20] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[59]	train-auc:0.660231	eval-auc:0.658822
[22:28:21] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 0 pruned nodes, max_depth=4
[60]	train-auc:0.660297	eval-auc:0.658858
[22:28:22] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[61]	train-auc:0.660346	eval-auc:0.658906
[22:28:23] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[62]	train-auc:0.660376	eval-auc:0.658962
[22:28:24] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[63]	train-auc:0.66042	eval-auc:0.658944
[22:28:25] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[64]	train-auc:0.660513	eval-auc:0.659019
[22:28:26] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[65]	train-auc:0.660516	eval-auc:0.658981
[22:28:27] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[66]	train-auc:0.660633	eval-auc:0.659051
[22:28:27] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=4
[67]	train-auc:0.660662	eval-auc:0.659067
[22:28:28] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[68]	train-auc:0.660706	eval-auc:0.659072
[22:28:29] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[69]	train-auc:0.660742	eval-auc:0.659118
[22:28:30] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 0 pruned nodes, max_depth=4
[70]	train-auc:0.660755	eval-auc:0.659114
[22:28:31] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[71]	train-auc:0.660841	eval-auc:0.659119
[22:28:32] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[72]	train-auc:0.660901	eval-auc:0.659158
[22:28:32] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[73]	train-auc:0.660939	eval-auc:0.659191
[22:28:33] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[74]	train-auc:0.660965	eval-auc:0.659175
[22:28:34] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[75]	train-auc:0.661052	eval-auc:0.659226
[22:28:35] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[76]	train-auc:0.661109	eval-auc:0.659228
[22:28:36] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[77]	train-auc:0.661116	eval-auc:0.659231
[22:28:36] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[78]	train-auc:0.661136	eval-auc:0.659229
[22:28:37] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[79]	train-auc:0.66117	eval-auc:0.659251
[22:28:38] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[80]	train-auc:0.661213	eval-auc:0.659218
[22:28:39] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[81]	train-auc:0.661276	eval-auc:0.659219
[22:28:40] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=4
[82]	train-auc:0.661286	eval-auc:0.659216
[22:28:41] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 0 pruned nodes, max_depth=4
[83]	train-auc:0.661317	eval-auc:0.65919
[22:28:42] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[84]	train-auc:0.661373	eval-auc:0.659199
[22:28:43] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[85]	train-auc:0.661401	eval-auc:0.659205
[22:28:44] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[86]	train-auc:0.661476	eval-auc:0.659253
[22:28:45] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[87]	train-auc:0.661496	eval-auc:0.659204
[22:28:46] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[88]	train-auc:0.661553	eval-auc:0.659236
[22:28:46] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[89]	train-auc:0.661597	eval-auc:0.659245
[22:28:47] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[90]	train-auc:0.661669	eval-auc:0.659241
[22:28:48] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[91]	train-auc:0.661673	eval-auc:0.659263
[22:28:49] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=4
[92]	train-auc:0.661681	eval-auc:0.659259
[22:28:50] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[93]	train-auc:0.661717	eval-auc:0.659258
[22:28:51] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[94]	train-auc:0.661811	eval-auc:0.659337
[22:28:51] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[95]	train-auc:0.661837	eval-auc:0.659333
[22:28:52] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[96]	train-auc:0.661886	eval-auc:0.659338
[22:28:53] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 26 extra nodes, 0 pruned nodes, max_depth=4
[97]	train-auc:0.661914	eval-auc:0.65931
[22:28:54] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 0 pruned nodes, max_depth=4
[98]	train-auc:0.661943	eval-auc:0.659323
[22:28:55] C:\Users\Administrator\Desktop\xgboost\src\tree\updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 0 pruned nodes, max_depth=4
[99]	train-auc:0.661976	eval-auc:0.659302

In [32]:
#Make predictions
y_pred = model1.predict(dtest)
predictions = [round(value) for value in y_pred]

#Find accuracy
accuracy = accuracy_score(Y_test,predictions)
print("Accuracy %f" %accuracy)

Accuracy 0.851239

In [33]:
# plot
xgb.plot_importance(model1)
plt.show()


We observe that the number of vehicles involved in the accident is a major (if not the most important) contributor towards prediction if an accident causes death or not. We also note that the variables we have examined: hour, speed limit and road type, are also significant predictors of accident fatality.

## Conclusion¶

In our exploratory analysis, we first examined the frequency of accidents across time. We found that the rate of accidents has not decreased over the years, and concluded that any efforts to reduce accident rates have been ineffective. We also found that the rate of accidents exhibits a monthly seasonality, with a higher rate of accidents nearing the end of the year.

We then explored certain features which were hypothesized to cause accidents. We found that accidents are more frequent during peak rush hour periods in the morning and evening. This suggests that accidents are caused more by impatience and congestion, rather than poor lighting conditions. We also foung that accidents occur most frequently on roads with a speed limit of 30, compared to other roads. This may be a direct consequence of the higher number of speed limit 30 roads in Britain. However, we also found another possible cause: most of the accidents are also on single carriageway roads (these roads are quite dangerous), which also mostly have a speed limit of 30.

Lastly, we constucted an xgboost model in order to understand which features of our dataset explains an accident fatality the most. We found that the number of vehicles involved in the accident is a major contributor towards prediction. We also note that the variables we have examined: hour, speed limit and road type, are also significant predictors of accident fatality.

Given further time and efforts, the following analysis could be made to improve our understanding of the accidents in Britain:

• Further analysis into other features such as the number of cars, types of junctions, etc
• Multivariate analysis of various features together to see the partial effects of each feature
• Usage of the whole dataset (we limited most of the analyses to 2012-2014 due to data size limitations)
• Joining of our accidents dataset with multiple other datasets provided by the same data source, which would allow us to analyze more features

### Other unused code for further exploratory/predictive analysis¶

In [34]:
# !python -m pip install --upgrade pip

Requirement already up-to-date: pip in c:\users\eugene lum\anaconda3\lib\site-packages (18.1)

In [35]:
# Install keras and TensorFlow if necessary
# !pip install keras
# !pip install tensorflow

In [36]:
# Plain jane keras. This is not inside tensorflow, just to show us this exists.

# %matplotlib inline #
# import tensorflow.keras
# from tensorflow.keras.models import Model, Sequential
# from tensorflow.keras.layers import *
# from tensorflow.keras.optimizers import Adam, SGD
# from tensorflow.keras import backend as K

# import tensorflow as tf

In [37]:
# n_classes = 2

# print(X_train.shape, 'train samples')
# print(X_test.shape, 'test samples')

# # convert class vectors to binary One Hot Encoded
# y_train = tf.keras.utils.to_categorical(Y_train, n_classes)
# y_test = tf.keras.utils.to_categorical(Y_test, n_classes)
# y_train[0]

(1099199, 27) train samples
(541398, 27) test samples

Out[37]:
array([0., 1.], dtype=float32)
In [38]:
# # Training Parameters for model
# learning_rate = 0.001
# training_epochs = 10
# batch_size = 1000

# # Network Parameters
# n_input = 27 # 27 features
# n_hidden_1 = 200 # 1st layer number of neurons
# n_hidden_2 = 10 # 2nd layer number of neurons
# n_hidden_3 = 10 # 3rd layer number of neurons
# n_hidden_4 = 10 # 4th layer number of neurons
# n_classes = 2 # Binary classes death/not death

In [39]:
# # Create functional model components
# Inp = Input(shape=(n_input,))
# x = Dense(n_hidden_1, activation='relu', name = "Dense_1")(Inp)
# x = Dropout(0.3, name = "Dropout_01")(x)
# x = Dense(n_hidden_2, activation='relu', name = "Dense_2")(x)
# x = Dropout(0.3, name = "Dropout_02")(x)
# x = Dense(n_hidden_3, activation='relu', name = "Dense_3")(x)
# x = Dropout(0.3, name = "Dropout_03")(x)
# x = Dense(n_hidden_4, activation='relu', name = "Dense_4")(x)
# output = Dense(n_classes, activation='softmax', name = "Outputlayer")(x)

In [40]:
# # Create model and display structure
# model = Model(Inp, output)
# model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 27)                0
_________________________________________________________________
Dense_1 (Dense)              (None, 200)               5600
_________________________________________________________________
Dropout_01 (Dropout)         (None, 200)               0
_________________________________________________________________
Dense_2 (Dense)              (None, 10)                2010
_________________________________________________________________
Dropout_02 (Dropout)         (None, 10)                0
_________________________________________________________________
Dense_3 (Dense)              (None, 10)                110
_________________________________________________________________
Dropout_03 (Dropout)         (None, 10)                0
_________________________________________________________________
Dense_4 (Dense)              (None, 10)                110
_________________________________________________________________
Outputlayer (Dense)          (None, 2)                 22
=================================================================
Total params: 7,852
Trainable params: 7,852
Non-trainable params: 0
_________________________________________________________________

In [41]:
# # Set model optimizer, compile model
# opt = SGD(lr = learning_rate)
# model.compile(loss='categorical_crossentropy',
#               optimizer= opt,
#               metrics=['accuracy'])

In [42]:
# # Train model
# history = model.fit(X_train, y_train,
#                     batch_size=batch_size,
#                     epochs=training_epochs,
#                     verbose=1, # print model progress
#                     validation_data=(X_test, y_test))

Train on 1099199 samples, validate on 541398 samples
Epoch 1/10
1099199/1099199 [==============================] - 16s 14us/step - loss: nan - acc: 0.1492 - val_loss: nan - val_acc: 0.1488
Epoch 2/10
1099199/1099199 [==============================] - 14s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 3/10
1099199/1099199 [==============================] - 14s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 4/10
1099199/1099199 [==============================] - 13s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 5/10
1099199/1099199 [==============================] - 14s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 6/10
1099199/1099199 [==============================] - 14s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 7/10
1099199/1099199 [==============================] - 14s 13us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 8/10
1099199/1099199 [==============================] - 13s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 9/10
1099199/1099199 [==============================] - 13s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488
Epoch 10/10
1099199/1099199 [==============================] - 14s 12us/step - loss: nan - acc: 0.1482 - val_loss: nan - val_acc: 0.1488

In [ ]:
# It would seem that the data is insufficient for a deep learning network to train on.


# Related Topics

Obesity in America