Music Then and Now

Music Visualisation

Tammi Chng | DS102 | Course Assignment


Millennials destroyed music - or at least that's what baby boomers seem to think. But how has music actually changed over the past 50 years? Has it changed at all?

To find out, I compared music across the past 50 years in terms of:

  • Song Lyrics/Vocabulary
  • Song details like BPM (Beats per minute) and Duration


  1. Billboard Lyrics (1965-2015):

Columns: Rank, Song, Artist, Year, Lyrics, Source

  2. Tunebat music data scraped using Selenium:


  • Song
  • Key
  • Duration
  • Camelot - Represents compatibility of keys
  • BPM - Beats Per Minute
  • Energy - How intense/active a track is
  • Danceability - How appropriate the track is for dancing
  • Happiness - How cheerful/positive the track is
  • Loudness - The average decibel amplitude
  • Acousticness - How likely the track is acoustic
  • Liveness - How likely the track was recorded with a live audience


Given that the original CSV file contains 5,000 lines, some of the processes I wanted to run took too long. To solve this problem, I limited the dataset to the top ten songs per year.

In [4]:
# Import relevant packages

import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import matplotlib
import statsmodels.api as sm
import seaborn as sns
import numpy as np

nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Tammi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tammi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Tammi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Step 1:
Create a scraper to scrape Tunebat according to Billboard top ten song metadata

In [53]:
# import csv and limit to Top Ten songs per year

df = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding='latin-1')
top_ten_df = df[df['Rank'] <= 10]
top_ten_df = top_ten_df.reset_index(drop=True)

# create list of (song, artist) tuples from df

song_list = []
for i in range(len(top_ten_df['Song'])):
    song_list.append([top_ten_df['Song'][i].title(), top_ten_df['Artist'][i].title()])
In [ ]:
def song_link_maker(song, artist):
    """This function takes in the variables song and artist and returns a tuple
    containing the respective song link and artist link that the function
    song_details will need to determine the right link to click through."""
    song_link = ''
    artist_link = ''
    title_split = song.split(' ')
    artist_split = artist.split(' ')

    # create links by hyphenating the words of each name
    for i in range(len(title_split)):
        song_link += title_split[i] + '-'

    for j in range(len(artist_split)):
        artist_link += artist_split[j] + '-'

    # drop the trailing hyphen
    song_link = song_link[:-1]
    artist_link = artist_link[:-1]

    return (song_link, artist_link)

def song_details(song, artist):
    """This function takes in the variables song and artist and returns a list
    containing song details from song title and artist to danceability."""
    from selenium import webdriver

    # Gets the website
    driver = webdriver.Chrome(r'C:\Users\Tammi\Downloads\chromedriver_win32\chromedriver.exe')
    driver.get('https://tunebat.com')

    # Navigates to the right webpage
    id_box = driver.find_element_by_id('q')
    id_box.send_keys(song + ' ' + artist)

    search_button = driver.find_element_by_xpath("//button[@class='btn btn-default search-button']")
    search_button.click()

    song_link = song_link_maker(song, artist)
    links = [elem.get_attribute("href") for elem in driver.find_elements_by_tag_name('a')]

    # Click through to the first result whose URL matches both song and artist
    for i in links:
        if song_link[0] in i and song_link[1] in i:
            driver.get(i)
            break

    site = driver.current_url

    # Stores music data into a list
    data1 = [elem.text for elem in driver.find_elements_by_class_name('main-attribute-value')]
    data2 = []
    for elem in driver.find_elements_by_class_name('attribute-table-element'):
        if len(data2) == 7:
            break
        data2.append(elem.text)

    for i in range(len(data2)):
        data1.append(data2[i])
    data1.insert(0, song)
    final = data1
    return final
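As a quick sanity check on the link format, the helper above simply hyphenates the words of each name; a compact equivalent is shown below (the song/artist pair is illustrative):

```python
def song_link_maker(song, artist):
    # Compact equivalent of the helper above: hyphenate the words of each name
    return ('-'.join(song.split(' ')), '-'.join(artist.split(' ')))

print(song_link_maker('Uptown Funk', 'Mark Ronson'))  # ('Uptown-Funk', 'Mark-Ronson')
```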
In [ ]:
# Compiles list of song details
music_details = []

for row in song_list:
    song = row[0]
    artist = row[1]
    music_details.append(song_details(song, artist))
In [57]:
# Reads music details into a csv file to store the data

data = music_details
music_data_df = pd.DataFrame.from_records(data)
music_data_df = music_data_df.rename(columns={0:'Song', 1:'Key', 2:'Camelot', 3:'Duration', 4:'BPM', 5:'Energy', 6:'Danceability', 7:'Happiness', 8:'Loudness', 9:'Acousticness', 10:'Instrumentalness', 11:'Liveness'})
music_data_df = music_data_df.set_index('Song')


Step 2:
Clean Dataframes and set to top ten songs

In [55]:
# Function to filter and stem lyrics

def filter_and_stem(lyric_list):
    tokenized = word_tokenize(str(lyric_list))
    stop = set(stopwords.words('english'))  # build the stopword set once for speed
    filtered = [x for x in tokenized if x not in stop]
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in filtered]
    return stemmed
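For intuition, here is a dependency-free sketch of what filter_and_stem does (tokenize, drop stopwords, stem). The toy stopword list and the suffix-stripping "stemmer" are stand-ins for NLTK's versions, not the real Porter algorithm:

```python
def toy_filter_and_stem(text):
    stop = {'the', 'a', 'is', 'and'}  # toy stopword list (assumption)
    tokens = text.lower().split()
    filtered = [t for t in tokens if t not in stop]
    # crude stand-in for PorterStemmer: strip a trailing 'ing'
    return [t[:-3] if t.endswith('ing') else t for t in filtered]

print(toy_filter_and_stem('the singing is loud'))  # ['sing', 'loud']
```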
In [57]:
# Read data from csv file

df = pd.read_csv('billboard_lyrics_1964-2015.csv', encoding='Latin-1')
top_ten_df = df[df['Rank'] <= 10]
top_ten_df = top_ten_df.reset_index(drop=True)
top_ten_df['Stemmed'] = 0

# Filter and Stem lyrics of top ten songs per year

for i in range(len(top_ten_df)):
    top_ten_df.at[i, 'Stemmed'] = filter_and_stem(top_ten_df['Lyrics'].loc[i])
In [310]:
# Function to make song titles lowercase

def lower(song):
    song = song.lower()
    return song

# Combine music metadata with Tunebat data
top_copy = top_ten_df.copy()

music_data_df = pd.read_csv('Music_Data.csv')
music_data_df['Song'] = music_data_df['Song'].apply(lower)

final_df = top_copy.merge(music_data_df, on='Song', how='outer')
In [326]:
# Clean dataframe and remove NaN

final_df = final_df.fillna(0)
remove_list = []
for i in range(len(final_df)):
    if final_df['BPM'][i] == 0:
        remove_list.append(i)  # mark rows with no Tunebat match for removal
final_df = final_df.drop(remove_list)

final_df['Year'] = final_df['Year'].apply(int)
final_df['Loudness'] = final_df['Loudness'].map(lambda x: str(x)[0:3])
final_df['Loudness'] = final_df['Loudness'].apply(int)

final_df = final_df.reset_index(drop=True)

# Change Duration from an 'm:ss' string to minutes
dur_l = []
for k, v in final_df.iterrows():
    dur = v[9].split(':')
    dur_l.append(float(dur[0]) + float(dur[1]) / 60)
final_df['Duration in Minutes'] = dur_l
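The conversion above can be factored into a small helper; `mmss_to_minutes` is a hypothetical name, and it expresses an "m:ss" duration in minutes, matching the column name:

```python
def mmss_to_minutes(dur):
    # 'm:ss' string -> duration in minutes, e.g. '3:30' -> 3.5
    m, s = dur.split(':')
    return float(m) + float(s) / 60

print(mmss_to_minutes('3:30'))  # 3.5
```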


Step 3:
Visualise differences in music details over time

In [329]:
# Function to calculate average of music details per year
# Returns a dataframe containing year and category
final_df = pd.read_pickle("Final_Music_Data")

def return_df(category):
    # Sum the category per year, keeping a count so we can average later
    avg = {}
    for i in range(len(final_df)):
        if final_df['Year'][i] not in avg:
            avg[final_df['Year'][i]] = [final_df[category][i], 1]
        else:
            avg[final_df['Year'][i]][0] += final_df[category][i]
            avg[final_df['Year'][i]][1] += 1

    avg_d = {}
    for k, v in avg.items():
        avg_d[k] = (float(v[0]) / float(v[1]))
    df = pd.DataFrame.from_dict(avg_d, orient='index')
    df = df.reset_index()
    df = df.rename(columns = {'index' :'Year', 0: category + ' Avg'})
    return df

# Create dataframes for averages of all eight categories

avg_dur_df = return_df('Duration in Minutes')
avg_bpm_df = return_df('BPM')
avg_energy_df = return_df('Energy')
avg_dance_df = return_df('Danceability')
avg_happiness_df = return_df('Happiness')
avg_loudness_df = return_df('Loudness')
avg_acoustic_df = return_df('Acousticness')
avg_live_df = return_df('Liveness')

# Combine and clean dataframes

avg_df = pd.concat([avg_dur_df, avg_bpm_df, avg_energy_df, avg_dance_df, avg_happiness_df, avg_loudness_df, avg_acoustic_df, avg_live_df], axis=1)

avg_df = avg_df.drop(['Year'], axis=1)
avg_df['Year'] = avg_dur_df['Year']
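The hand-rolled averaging in return_df is equivalent to a pandas groupby mean; a minimal sketch on toy data (the years and BPM values below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1965, 1965, 1966],
                    'BPM':  [120, 100, 90]})

# Per-year mean of a category, which is what return_df computes by hand
avg = toy.groupby('Year', as_index=False)['BPM'].mean()
avg = avg.rename(columns={'BPM': 'BPM Avg'})
print(avg['BPM Avg'].tolist())  # [110.0, 90.0]
```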
In [332]:
# Plot Category averages over time 

fig = plt.figure(figsize=(10, 10)) 

cmap = ['palevioletred', 'mediumvioletred', 'm', 'purple', 'darkorchid', 'mediumpurple', 'blue', 'navy']
for i in range(len(avg_df.columns)):
    if 'Year' in avg_df.columns[i]:
        continue  # skip the Year column itself
    avg_df.plot(title=avg_df.columns[i] + ' Over Time', kind='scatter', x='Year', y=avg_df.columns[i], xlim=(1960, 2020), c=cmap[i])
    ax = sns.regplot('Year', avg_df.columns[i], data=avg_df, color=cmap[i], order=2, ci=None)
In [333]:
# To visualise relationships between these categories
# Create a pair-wise correlation matrix
final2_df = final_df.copy()
final2_df['Rank '] = final_df['Rank']
f, ax = plt.subplots(figsize=(10, 6))
corr = final2_df[final2_df.columns[11:]].iloc[0:260].corr()
hm = sns.heatmap(round(corr,2), annot=True, ax=ax, cmap='coolwarm',fmt='.2f',linewidths=.05)
t= f.suptitle('Music Attributes Correlation Heatmap 1965-1990', fontsize=14)

final2_df = final_df.copy()
final2_df['Rank '] = final_df['Rank']
f, ax = plt.subplots(figsize=(10, 6))
corr = final2_df[final2_df.columns[11:]].iloc[260:].corr()
hm = sns.heatmap(round(corr,2), annot=True, ax=ax, cmap='coolwarm',fmt='.2f',linewidths=.05)
t= f.suptitle('Music Attributes Correlation Heatmap 1990-2015', fontsize=14)

The correlation matrix above shows that attributes like energy remain closely linked to danceability, happiness and loudness to this day. When it comes to a song's ranking, however, these attributes played little part until after the 1990s, when popular music became louder, more "live" and more energetic, although even then the correlation is not especially strong.

This goes against the idea that today's music is becoming less "live" and more produced, an idea the earlier graphs support in showing the average "Liveness" of music decreasing slightly over time.

Step 4:
Visualise differences in lyrics over time

Here, a wordcloud is used to display some of the most prominent lyrics in songs across the past 50 years, in order to get a better idea of what kind of topics and ideas are typically explored in music.
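Before drawing the cloud, the underlying word frequencies can be checked directly with collections.Counter; the stemmed-lyrics string here is illustrative:

```python
from collections import Counter

long_txt = 'love love babi babi babi danc'  # illustrative stemmed lyrics
counts = Counter(long_txt.split())
print(counts.most_common(2))  # [('babi', 3), ('love', 2)]
```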

In [334]:
# Construct Word Cloud

from wordcloud import WordCloud, ImageColorGenerator
from os import path
import os
from PIL import Image

long_txt = ''
for song in top_ten_df['Stemmed']:
    for lyric in song:
        long_txt += lyric + ' '

d = path.dirname(__file__) if "__file__" in locals() else os.getcwd()
mask = np.array(Image.open(path.join(d, 'music.png')))

image = WordCloud(background_color='white',
                  mask=mask,  # remaining arguments were cut off; mask and max_words here are assumptions
                  max_words=200).generate(long_txt)

image_colors = ImageColorGenerator(mask)
fig = plt.figure(figsize=(100, 100))
plt.title("Wordcloud of Lyrics", fontsize=100)
ax = plt.axis('off')

plt.imshow(image.recolor(color_func=image_colors), interpolation='bilinear')