Sentiment Analysis and Naive Bayes Classification on E-commerce Reviews

Valerie Lim Yan Hui
Background

One of the challenges businesses face is gaining a better understanding of the voice of the customer. This can be addressed by extracting additional information from customer reviews using sentiment analysis. Customer feedback captures a huge range of customer-initiated reactions to the products they have purchased, and text analytics gives businesses a more holistic picture of customer satisfaction or dissatisfaction than ratings alone. Relying solely on numeric ratings is insufficient because reviews contain feedback that a Likert scale cannot capture: a five-star review, for instance, can still contain important requests for faster delivery or better customer support. Uncovering deeper insights from customer reviews enables businesses to respond more strategically and effectively, and to consider modifying their current systems and approaches so that they can better serve their customers.

Dataset characteristics

  • Clothing ID: Identifier of the specific product being reviewed.
  • Age: Reviewer's age.
  • Title: Title of the review.
  • Review Text: Body of the review.
  • Rating: Product score given by the customer, from 1 (worst) to 5 (best).
  • Recommended IND: Binary variable indicating whether the customer recommends the product (1 = recommended, 0 = not recommended).
  • Positive Feedback Count: Number of other customers who found this review positive.
  • Division Name: Categorical name of the product's high-level division.
  • Department Name: Categorical name of the product's department.
  • Class Name: Categorical name of the product's class.

Import relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import sys

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from wordcloud import WordCloud

nltk.download('vader_lexicon')

#Settings
pd.options.display.float_format = '{:.2f}'.format
np.set_printoptions(threshold=sys.maxsize)   # print arrays in full; np.nan is rejected here by newer NumPy
sns.set()
DIMS=(20, 10)
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/admin/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
In [2]:
print("Downloading corpora...")    
nltk.download('punkt')                        # Sentence Tokenizer
nltk.download('stopwords')                    # For stopwords in all languages
nltk.download('averaged_perceptron_tagger')   # For Sentiment analysis
nltk.download('wordnet')                      # For Lemmas
print("Downloads complete.")
Downloading corpora...
[nltk_data] Downloading package punkt to /Users/admin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/admin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/admin/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/admin/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Downloads complete.
In [3]:
reviews_df = pd.read_csv('Womens-Clothing-E-Commerce-Reviews.csv',index_col=0)
reviews_df.head()
Out[3]:
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
0 767 33 NaN Absolutely wonderful - silky and sexy and comf... 4 1 0 Initmates Intimate Intimates
1 1080 34 NaN Love this dress! it's sooo pretty. i happene... 5 1 4 General Dresses Dresses
2 1077 60 Some major design flaws I had such high hopes for this dress and reall... 3 0 0 General Dresses Dresses
3 1049 50 My favorite buy! I love, love, love this jumpsuit. it's fun, fl... 5 1 0 General Petite Bottoms Pants
4 847 47 Flattering shirt This shirt is very flattering to all due to th... 5 1 6 General Tops Blouses

Exploratory analysis

In [10]:
reviews_df['Division Name'].unique()
Out[10]:
array(['Initmates', 'General', 'General Petite', nan], dtype=object)
In [11]:
reviews_df['Department Name'].unique()
Out[11]:
array(['Intimate', 'Dresses', 'Bottoms', 'Tops', 'Jackets', 'Trend', nan],
      dtype=object)

Check for null values

In [6]:
reviews_df.isnull().sum()
Out[6]:
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64
In [10]:
845/23486
Out[10]:
0.035978881035510515

Since null values in the Review Text column make up only about 3.6% of the dataset, we'll exclude incomplete records. Note that dropna() without a subset argument also removes rows that are missing a Title or a category label, which is why the cleaned dataset has 19,662 rows rather than 23,486 - 845.
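The 3.6% figure generalises to every column; a minimal sketch using the same dataframe:

# Fraction of missing values per column (Review Text is roughly 0.036)
reviews_df.isnull().mean().round(4)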

In [4]:
reviews_df_drop = reviews_df.copy()
# Drop every row with a missing value in any column (including Title), not just Review Text
reviews_df_drop.dropna(axis=0, inplace=True)
In [33]:
reviews_df_drop.isnull().sum()
Out[33]:
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
Age group                  8
dtype: int64
In [32]:
reviews_df_drop.describe()
Out[32]:
Clothing ID Age Rating Recommended IND Positive Feedback Count
count 19662.00 19662.00 19662.00 19662.00 19662.00
mean 921.30 43.26 4.18 0.82 2.65
std 200.23 12.26 1.11 0.39 5.83
min 1.00 18.00 1.00 0.00 0.00
25% 861.00 34.00 4.00 1.00 0.00
50% 936.00 41.00 5.00 1.00 1.00
75% 1078.00 52.00 5.00 1.00 3.00
max 1205.00 99.00 5.00 1.00 122.00
In [8]:
reviews_df.dtypes
Out[8]:
Clothing ID                 int64
Age                         int64
Title                      object
Review Text                object
Rating                      int64
Recommended IND             int64
Positive Feedback Count     int64
Division Name              object
Department Name            object
Class Name                 object
dtype: object
In [8]:
reviews_df_drop['Clothing ID'].nunique()
Out[8]:
1095
In [72]:
reviews_df.shape
Out[72]:
(23486, 11)

The number of unique values in Clothing ID is significantly smaller than the number of rows, which shows that Clothing IDs are not unique. This is expected, given that Clothing ID identifies the product being reviewed: the same product was purchased and reviewed by multiple customers.

However, this also means we can identify the popular items, i.e. those that appear more than n times, that customers have commonly purchased and reviewed. Here I set an arbitrary n of 100.

In [11]:
vc = reviews_df_drop['Clothing ID'].value_counts()
vc[vc > 100]
Out[11]:
1078    871
862     658
1094    651
1081    487
829     452
872     450
1110    419
868     370
895     336
867     291
936     289
1095    287
850     280
1077    251
1059    247
863     243
1086    241
1080    241
860     237
1083    214
861     203
873     197
1033    190
927     187
1092    187
828     181
1056    180
820     177
836     172
1022    172
1072    166
1008    163
1104    147
984     144
831     138
877     133
1020    133
833     132
854     130
834     125
864     125
835     121
1082    119
1099    115
1035    115
940     113
1087    109
865     108
907     106
909     102
875     101
Name: Clothing ID, dtype: int64

Given that the reviewers span a wide range of ages, we'll categorise them into bins: (10, 20], (20, 30] and so on. This makes subsequent analysis more meaningful: based on the ratings and sentiments of each age group, the company can tailor its marketing strategies accordingly or provide a more distinctive shopping experience for each group, assuming that women in the same age group tend to share some similarities.

In [5]:
bins = np.arange(0,100,10)   # bin edges 0, 10, ..., 90; the handful of ages above 90 fall outside and become NaN
reviews_df_drop['Age group'] = pd.cut(reviews_df_drop['Age'], bins)
reviews_df_drop.head()
Out[5]:
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name Age group
2 1077 60 Some major design flaws I had such high hopes for this dress and reall... 3 0 0 General Dresses Dresses (50, 60]
3 1049 50 My favorite buy! I love, love, love this jumpsuit. it's fun, fl... 5 1 0 General Petite Bottoms Pants (40, 50]
4 847 47 Flattering shirt This shirt is very flattering to all due to th... 5 1 6 General Tops Blouses (40, 50]
5 1080 49 Not for the very petite I love tracy reese dresses, but this one is no... 2 0 4 General Dresses Dresses (40, 50]
6 858 39 Cagrcoal shimmer fun I aded this in my basket at hte last mintue to... 5 1 1 General Petite Tops Knits (30, 40]
In [6]:
fig = plt.figure(figsize=DIMS)
ax = fig.add_subplot(111)


# Count ratings per (Rating, Age group) pair, then pivot for a stacked bar chart
ratings_count_df = reviews_df_drop.groupby(['Rating', 'Age group']).size().reset_index(name='n')

ratings_count_df_pivot = pd.pivot_table(ratings_count_df, index=["Age group"],
               values=["n"],
               columns=["Rating"],
               aggfunc=[np.sum])

ratings_count_df_pivot.plot(kind = 'bar', stacked=True, fontsize = 16, ax=ax)
ax.set_ylabel("No. of ratings", fontsize=16)
ax.set_xlabel("Age group", fontsize=16)
fig.suptitle('Stacked bar chart of number of ratings across age groups', fontsize=20)
plt.show()

Conclusion: Women in their 30s to 40s leave the largest number of online reviews, followed by women in their 40s to 50s, then 50s to 60s, and then young women in their 20s to 30s. This is somewhat surprising, given that e-commerce sites tend to be more popular among younger women. Perhaps younger shoppers buy online but do not leave reviews, so they are not captured in this dataset.

The most common rating is 5 across women from ages 20 to 70. While this could mean the company is doing a good job of providing a positive shopping experience, analysing the sentiment of the customers' reviews will tell a more holistic story.

Text normalisation on Review Text
In [6]:
def clean_string(r):
    # Remove non-alphabetic characters and change to lowercase
    r1 = re.sub('[^A-Za-z]+', ' ', str(r))
    r1 = r1.strip().lower()

    # Note: this check compares the entire cleaned review against the stopword
    # list rather than individual words, so in practice no stopwords are removed
    no_sw = []
    if r1 not in stopwords.words():
        no_sw.append(r1)

    # Likewise, the stemmer is applied to the whole review string rather than
    # to individual tokens
    stemmer = PorterStemmer()
    final_list = []
    for i in no_sw:
        final_list.append(stemmer.stem(str(i)))

    return final_list
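For reference, a token-level variant of the cleaning step might look like the sketch below. This is illustrative only; the 'Filtered Reviews' used in the rest of the notebook were produced by clean_string above.

def clean_string_tokens(r):
    # Keep letters only and lowercase
    r1 = re.sub('[^A-Za-z]+', ' ', str(r)).strip().lower()
    # Tokenise, drop English stopwords, then stem each remaining token
    stemmer = PorterStemmer()
    sw = set(stopwords.words('english'))
    return [stemmer.stem(w) for w in word_tokenize(r1) if w not in sw]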
In [7]:
only_reviews_df = reviews_df_drop.copy()
only_reviews_df= only_reviews_df[['Clothing ID','Review Text', 'Age group', 'Recommended IND']]
only_reviews_df['Filtered Reviews'] = only_reviews_df['Review Text'].apply(clean_string)
only_reviews_df.head()
Out[7]:
Clothing ID Review Text Age group Recommended IND Filtered Reviews
2 1077 I had such high hopes for this dress and reall... (50, 60] 0 [i had such high hopes for this dress and real...
3 1049 I love, love, love this jumpsuit. it's fun, fl... (40, 50] 1 [i love love love this jumpsuit it s fun flirt...
4 847 This shirt is very flattering to all due to th... (40, 50] 1 [this shirt is very flattering to all due to t...
5 1080 I love tracy reese dresses, but this one is no... (40, 50] 0 [i love tracy reese dresses but this one is no...
6 858 I aded this in my basket at hte last mintue to... (30, 40] 1 [i aded this in my basket at hte last mintue t...
Sentiment analysis on Filtered reviews
In [8]:
sid = SentimentIntensityAnalyzer()

# Score each (single-element) filtered review with VADER and keep the compound polarity
compound_scores = [sid.polarity_scores(review[0])['compound']
                   for review in only_reviews_df['Filtered Reviews']]
only_reviews_df['Compound score'] = compound_scores

only_reviews_df.head()
Out[8]:
Clothing ID Review Text Age group Recommended IND Filtered Reviews Compound score
2 1077 I had such high hopes for this dress and reall... (50, 60] 0 [i had such high hopes for this dress and real... 0.94
3 1049 I love, love, love this jumpsuit. it's fun, fl... (40, 50] 1 [i love love love this jumpsuit it s fun flirt... 0.72
4 847 This shirt is very flattering to all due to th... (40, 50] 1 [this shirt is very flattering to all due to t... 0.92
5 1080 I love tracy reese dresses, but this one is no... (40, 50] 0 [i love tracy reese dresses but this one is no... 0.94
6 858 I aded this in my basket at hte last mintue to... (30, 40] 1 [i aded this in my basket at hte last mintue t... 0.46
Compare sentiment scores across age groups
In [44]:
fig = plt.figure(figsize=DIMS)
ax = fig.add_subplot(111)

sns.boxplot(data=only_reviews_df, x='Age group', y='Compound score', ax=ax)
ax.tick_params(labelsize=16)
ax.set_title('Boxplot of compound sentiment score across age groups', fontsize=30)

plt.show()

Conclusion: Women across all age groups express very positive sentiment about their purchases, with median compound scores around 0.9. The age groups between 20 and 70 show more low-score outliers, which is expected since they contribute the bulk of the reviews.
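The medians behind the boxplot can be checked directly; a minimal sketch:

# Median compound score per age group
only_reviews_df.groupby('Age group')['Compound score'].median()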

Classify sentiments into positive, negative, neutral
In [9]:
category = []

for i in only_reviews_df['Compound score']: 
    
    if i > 0:
        category.append('Positive')

    elif i < 0:
        category.append('Negative')

    else:
        category.append('Neutral')

only_reviews_df['Sentiment category'] = category
only_reviews_df.tail()
Out[9]:
Clothing ID Review Text Age group Recommended IND Filtered Reviews Compound score Sentiment category
23481 1104 I was very happy to snag this dress at such a ... (30, 40] 1 [i was very happy to snag this dress at such a... 0.91 Positive
23482 862 It reminds me of maternity clothes. soft, stre... (40, 50] 1 [it reminds me of maternity clothes soft stret... 0.67 Positive
23483 1104 This fit well, but the top was very see throug... (30, 40] 0 [this fit well but the top was very see throug... 0.93 Positive
23484 1084 I bought this dress for a wedding i have this ... (20, 30] 1 [i bought this dress for a wedding i have this... 0.82 Positive
23485 1104 This dress in a lovely platinum is feminine an... (50, 60] 1 [this dress in a lovely platinum is feminine a... 0.93 Positive
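Note that the loop above treats any non-zero compound score as positive or negative. The VADER authors suggest a neutral band of roughly ±0.05 around zero; a hedged variant using those thresholds (the column name 'Sentiment category (0.05)' is illustrative) could be:

# Alternative binning using the conventional VADER thresholds of +/-0.05
cutoffs = [-1.0, -0.05, 0.05, 1.0]
labels = ['Negative', 'Neutral', 'Positive']
only_reviews_df['Sentiment category (0.05)'] = pd.cut(
    only_reviews_df['Compound score'], bins=cutoffs, labels=labels, include_lowest=True)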
Number of reviews in each sentiment category
In [15]:
num_sentiment_df = only_reviews_df.copy()
num_sentiment_df = num_sentiment_df[['Clothing ID', 'Sentiment category']]
num_sentiment_agg = num_sentiment_df.groupby('Sentiment category').size().reset_index(name='No. of sentiments')
num_sentiment_agg
Out[15]:
Sentiment category No. of sentiments
0 Negative 1099
1 Neutral 121
2 Positive 18442
In [144]:
num_sentiment_agg.columns
Out[144]:
Index(['Sentiment category', 'No. of sentiments'], dtype='object')
In [150]:
fig, ax = plt.subplots()
num_sentiment_agg.plot(kind='bar', x='Sentiment category', y='No. of sentiments', figsize = DIMS, ax=ax, rot=0, fontsize = 20)
fig.suptitle('Number of reviews per sentiment category', fontsize=16)
ax.legend(['No. of sentiments'], fontsize=12)
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a21001fd0>

Given that positive reviews dominate, the popular items identified earlier are likely to carry mostly positive sentiment, and the company can consider manufacturing or stocking more of these types of apparel.

The large proportion of positive reviews also corresponds with the large proportion of maximum (5/5) ratings.
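One way to check this correspondence directly is to cross-tabulate sentiment category against rating; a minimal sketch, assuming only_reviews_df and reviews_df_drop still share the same index:

# Counts of each sentiment category per star rating
pd.crosstab(only_reviews_df['Sentiment category'], reviews_df_drop['Rating'])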

Wordcloud for each category of sentiments

In [123]:
positive_reviews = only_reviews_df.loc[only_reviews_df['Sentiment category']=='Positive', 'Filtered Reviews']

# Concatenate all positive reviews into one string for the wordcloud
pos_rev_desc = ''
for i in positive_reviews:
    for x in i:
        pos_rev_desc = pos_rev_desc + ' ' + x

fig = plt.figure(figsize=DIMS)
desc_wordcloud = WordCloud(
    width=400, height=150,
    background_color="white",
    max_words=150, relative_scaling=1.0).generate(pos_rev_desc)
plt.imshow(desc_wordcloud)
plt.axis("off")
plt.title("Wordcloud of positive reviews", fontsize=30)
plt.show()
In [142]:
negative_reviews = only_reviews_df.loc[only_reviews_df['Sentiment category']=='Negative', 'Filtered Reviews']

# Concatenate all negative reviews into one string for the wordcloud
neg_rev_desc = ''
for i in negative_reviews:
    for x in i:
        neg_rev_desc = neg_rev_desc + ' ' + x
In [122]:
fig = plt.figure(figsize=DIMS)
desc_wordcloud = WordCloud(
    width=400, height=150,
    background_color="white", 
    max_words=150, relative_scaling = 1.0).generate(neg_rev_desc)
plt.imshow(desc_wordcloud)
plt.axis("off")
plt.title("Wordcloud of negative reviews", fontsize=30)
plt.show()

Conclusion:

The dominant words in both the positive and negative reviews are "top" and "dress". However, words like "fit" and "size" are more commonly associated with negative reviews, while words such as "fabric" and "color" appear more often in positive reviews. Although tops and dresses attract mixed reviews overall, the company can reasonably continue with its current choices of fabric and colour, given that positive reviews outnumber negative reviews by a wide margin.
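One way to quantify this impression is to count the most frequent words directly, reusing the concatenated strings built for the wordclouds; a minimal sketch for the negative reviews:

from collections import Counter

sw = set(stopwords.words('english'))
neg_word_counts = Counter(w for w in neg_rev_desc.split() if w not in sw)
neg_word_counts.most_common(15)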

In [141]:
neutral_reviews = only_reviews_df.loc[only_reviews_df['Sentiment category']=='Neutral', 'Filtered Reviews']

# Concatenate all neutral reviews into one string for the wordcloud
neut_rev_desc = ''
for i in neutral_reviews:
    for x in i:
        neut_rev_desc = neut_rev_desc + ' ' + x
In [125]:
fig = plt.figure(figsize=DIMS)
desc_wordcloud = WordCloud(
    width=400, height=150,
    background_color="white", 
    max_words=150, relative_scaling = 1.0).generate(neut_rev_desc)
plt.imshow(desc_wordcloud)
plt.axis("off")
plt.title("Wordcloud of neutral reviews", fontsize=30)
plt.show()
Train
In [10]:
X = only_reviews_df['Sentiment category']   # single-word label: Positive / Negative / Neutral
y = only_reviews_df['Recommended IND']

# No random_state is fixed, so the exact counts in the outputs below vary between runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25)

Apply CountVectorizer and TfidfTransformer transformations

In [11]:
count_vect = CountVectorizer()
tfidf_tfm = TfidfTransformer()
train_df_counts = count_vect.fit_transform(X_train)
train_df_tfidf = tfidf_tfm.fit_transform(train_df_counts)

Fit the model using Naïve Bayes Classification.

In [12]:
clf = MultinomialNB().fit(train_df_tfidf, y_train)
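Equivalently, the vectoriser, tf-idf transformer and classifier can be chained in a scikit-learn Pipeline, which guarantees that the test data receives exactly the same transformations as the training data; a minimal sketch:

from sklearn.pipeline import Pipeline

nb_pipeline = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
nb_pipeline.fit(X_train, y_train)
# nb_pipeline.predict(X_test) then applies the fitted transformations automatically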
Test
In [13]:
# Apply the same count and tf-idf transformations that were fitted on the training data
test_counts = count_vect.transform(X_test)
test_tfidf = tfidf_tfm.transform(test_counts)
y_predict = clf.predict(test_tfidf)
Evaluate Model

Compare the predicted labels against the actual labels.

In [18]:
y_confusion_df = pd.DataFrame({"y_test": y_test, "y_predict" : y_predict} )
y_confusion_df.head()
Out[18]:
y_test y_predict
20491 1 1
9659 1 1
5894 1 1
17413 1 1
9049 1 1
In [15]:
cm = confusion_matrix(y_test, y_predict)
cm
Out[15]:
array([[ 168,  690],
       [  94, 3964]])
In [72]:
fig = plt.figure()
ax1 = fig.add_subplot(111)
sns.heatmap(cm, annot=True, fmt='.0f', annot_kws={"size": 25}, ax=ax1)
ax1.set_title("Confusion matrix of Naïve Bayes Classification of Recommend IND")
plt.show()
In [88]:
print("True negatives:", (cm[0][0]))
print("True positives:", (cm[1][1]))
print("False negatives:", (cm[0][1]))
print("Total:", (sum(sum(cm))))
print("Accuracy:", ((cm[0][0] + cm[1][1])/sum(sum(cm))))
print("Proportion of False negatives:", (cm[0][1] /sum(sum(cm))))
True negatives: 171
True positives: 3949
False negatives: 700
Total: 4916
Accuracy: 0.8380797396257119
Proportion of False negatives: 0.14239218877135884

Conclusion: The accuracy of the model is 83.8%.

More importantly, 14.2% of the time the model wrongly predicts that a customer would recommend the product based on her sentiment when she actually would not have (a false positive).
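These rates can also be read off scikit-learn's classification report; a minimal sketch:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_predict,
                            target_names=['Not recommended', 'Recommended']))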

Inspect the wrongly classified reviews.
In [21]:
only_reviews_df['id'] = only_reviews_df.index
y_confusion_df['id'] = y_confusion_df.index

wrong_indices = y_confusion_df[
    y_confusion_df['y_test'] != y_confusion_df['y_predict']].index

mismatch_df = only_reviews_df.merge(y_confusion_df, on='id')
mismatch_df = mismatch_df[mismatch_df['id'].isin(wrong_indices)].reset_index()
mismatch_df.head()
Out[21]:
index Clothing ID Review Text Age group Recommended IND Filtered Reviews Compound score Sentiment category id y_test y_predict
0 1 1080 I love tracy reese dresses, but this one is no... (40, 50] 0 [i love tracy reese dresses but this one is no... 0.94 Positive 5 0 1
1 12 822 Why do designers keep making crop tops??!! i c... (30, 40] 0 [why do designers keep making crop tops i can ... 0.92 Positive 71 0 1
2 20 923 I was so excited to try out this top since it ... (40, 50] 0 [i was so excited to try out this top since it... 0.69 Positive 124 0 1
3 36 1020 This skirt looks exactly as pictured and fits ... (40, 50] 0 [this skirt looks exactly as pictured and fits... 0.78 Positive 199 0 1
4 37 1020 I love the rich deep color and the style but o... (40, 50] 0 [i love the rich deep color and the style but ... 0.90 Positive 205 0 1
In [30]:
mismatch_sample = mismatch_df.iloc[4]['Filtered Reviews']
mismatch_sample
Out[30]:
['i love the metallic colors of this top and figured i could wear it under a ruched jacket and circle skirt for work welp that s out the window this design is poor for one this is not a piece for a petite woman with no torso and i don t know how anyone with a longer torso wears t his this hits above my belly botton on and i got apetite i have no torso so without a jacket i would never wear this it s very low cut the back is very low it s a little loose but i run between a and a']
In [22]:
spl = mismatch_df.sample()
print(spl[['Clothing ID', 'id','Sentiment category','y_test','y_predict']])
print()
print(spl.iloc[0]['Filtered Reviews'])
    Clothing ID   id Sentiment category  y_test  y_predict
24          937  745           Positive       0          1

['yikes quite a smell off of this one like wet hot wool the color was beautiful but the sweater is enormous strange fit under the arms as well this one went back the same day']

The sample above shows the model wrongly predicting that a customer would recommend the product, based on her positive-sounding sentiment, when she actually would not have. Hence the model should be interpreted and used with caution.

Future Work

Regarding the list of popular items, the company can decide the threshold n above which an item counts as commonly purchased. It can then find out which items are most popular within each age group and tailor its promotions accordingly, as sketched below.
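A minimal sketch of how that could be surfaced from the current dataframes, assuming vc from the earlier value_counts cell is still in scope and using the illustrative n = 100:

# Review counts per (Age group, Clothing ID) for items reviewed more than 100 times overall
popular_ids = vc[vc > 100].index
(reviews_df_drop[reviews_df_drop['Clothing ID'].isin(popular_ids)]
    .groupby(['Age group', 'Clothing ID']).size()
    .sort_values(ascending=False)
    .head(10))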

Given that this is a general dataset, the current analysis is limited to identifying trends and insights that support broad business decisions. To build a model that customises recommendations to individual customers based on their history, a query into the customers' purchase-history database would be necessary. With that data, one could employ collaborative filtering to personalise customers' online shopping experiences. Such recommendation systems are gaining traction across e-commerce sites, so this company could consider adopting one to enhance its customers' web experience and boost revenue.

August 12, 2020. Published by Valerie Lim Yan Hui.
