Advertising Analysis Project

This data set contains the following features:

  • ‘Daily Time Spent on Site’: consumer time on site in minutes
  • ‘Age’: customer age in years
  • ‘Area Income’: Avg. Income of geographical area of consumer
  • ‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
  • ‘Ad Topic Line’: Headline of the advertisement
  • ‘City’: City of consumer
  • ‘Male’: Whether or not consumer was male
  • ‘Country’: Country of consumer
  • ‘Timestamp’: Time at which consumer clicked on Ad or closed window
  • ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad

Checking Out the Data

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report,confusion_matrix
%matplotlib inline

Reading in Data

ad = pd.read_csv('advertising.csv')
ad.head()
|   | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |

Data Summary

ad.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
ad.describe()
|       | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean | 65.000200 | 36.009000 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |

Exploratory Data Analysis

Here I’m doing a general exploration of the data to see if anything stands out or if there are any noticeable trends.

sns.set_context('notebook')
sns.set_style('white')
ad['Age'].plot.hist(bins=40)
plt.xlabel('Age')

[figure: histogram of Age]

The age counts follow a roughly normal distribution, with a spike around 40-42 years old. Let’s see if age correlates with ad clicks.

sns.countplot('Age',data=ad,hue='Clicked on Ad')

[figure: countplot of Age by Clicked on Ad]

sns.factorplot(x='Clicked on Ad',y='Age',data=ad,kind='swarm')

[figure: swarm plot of Age by Clicked on Ad]

Age doesn’t seem to have a high correlation with ad clicks, but there is some grouping on the extreme ends. Now to check for general trends.
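One way to put a number on that impression is the Pearson correlation between Age and the click label. This is a minimal sketch using a small synthetic frame in place of the real `ad` data:

```python
import pandas as pd

# Synthetic stand-in for the real `ad` DataFrame (illustrative values only)
ad = pd.DataFrame({
    'Age':           [25, 30, 35, 40, 45, 50, 55, 60],
    'Clicked on Ad': [0, 1, 0, 1, 1, 0, 1, 0],
})

# Pearson correlation between Age and the click label; values near 0
# indicate little linear relationship
r = ad['Age'].corr(ad['Clicked on Ad'])
print(round(r, 3))
```

On the real data, a value near zero would support the weak-relationship reading, though correlation only captures linear association.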

sns.pairplot(ad,hue='Clicked on Ad',kind='scatter')

[figure: pairplot of all features colored by Clicked on Ad]

sns.heatmap(ad.corr())

[figure: correlation heatmap]

sns.factorplot(x='Clicked on Ad',y='Daily Time Spent on Site',data=ad,kind='swarm')

[figure: swarm plot of Daily Time Spent on Site by Clicked on Ad]

sns.factorplot(x='Clicked on Ad',y='Area Income',data=ad,kind='swarm')

[figure: swarm plot of Area Income by Clicked on Ad]

sns.factorplot(x='Clicked on Ad',y='Daily Internet Usage',data=ad,kind='swarm')

[figure: swarm plot of Daily Internet Usage by Clicked on Ad]


Findings

Most of the groupings have overlap with another group when analyzing who clicked on an ad relative to each feature. The highest levels of grouping were found between Clicked on Ad and:

  • Daily Internet Usage
  • Daily Time Spent on Site
  • Area Income

Being male didn’t seem to have much effect on whether an ad was clicked. I’m going to model the data twice: once on all columns and once on just the most relevant columns.
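These groupings can also be cross-checked numerically by ranking each feature's absolute correlation with the label. The sketch below uses a small hypothetical frame with the same column names, not the real data:

```python
import pandas as pd

# Hypothetical stand-in with the same numeric columns as the real data
ad = pd.DataFrame({
    'Daily Time Spent on Site': [80.2, 75.1, 45.0, 40.3, 85.6, 38.9],
    'Age':                      [31, 48, 40, 35, 28, 47],
    'Area Income':              [68441.8, 59785.9, 41000.0, 39000.5, 71000.2, 35500.0],
    'Daily Internet Usage':     [256.1, 236.5, 120.4, 110.9, 249.8, 105.3],
    'Male':                     [1, 0, 1, 0, 0, 1],
    'Clicked on Ad':            [0, 0, 1, 1, 0, 1],
})

# Absolute correlation of each feature with the click label, strongest first
corrs = (ad.corr()['Clicked on Ad']
           .drop('Clicked on Ad')
           .abs()
           .sort_values(ascending=False))
print(corrs)
```

On the real data, a ranking like this would back up (or challenge) what the swarm plots suggest visually.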

all_columns = ['Male','Age','Daily Internet Usage', 'Daily Time Spent on Site', 'Area Income']
relevant_columns = ['Daily Internet Usage', 'Daily Time Spent on Site', 'Area Income']

Prediction and Modeling

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(ad[all_columns])
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_features = scaler.transform(ad[all_columns])
scaled_features[:5]
array([[-0.96269532, -0.11490498,  1.73403   ,  0.24926659,  0.50969109],
       [ 1.03875025, -0.57042523,  0.31380538,  0.96113227,  1.00253021],
       [-0.96269532, -1.13982553,  1.28758905,  0.28208309,  0.35694859],
       [ 1.03875025, -0.79818535,  1.50157989,  0.57743162, -0.01445564],
       [-0.96269532, -0.11490498,  1.03873069,  0.21266356,  1.40886751]])
scaled_ad = pd.DataFrame(scaled_features,columns=all_columns)
scaled_ad.head()
|   | Male | Age | Daily Internet Usage | Daily Time Spent on Site | Area Income |
|---|---|---|---|---|---|
| 0 | -0.962695 | -0.114905 | 1.734030 | 0.249267 | 0.509691 |
| 1 | 1.038750 | -0.570425 | 0.313805 | 0.961132 | 1.002530 |
| 2 | -0.962695 | -1.139826 | 1.287589 | 0.282083 | 0.356949 |
| 3 | 1.038750 | -0.798185 | 1.501580 | 0.577432 | -0.014456 |
| 4 | -0.962695 | -0.114905 | 1.038731 | 0.212664 | 1.408868 |

Setting up data

I’m creating two training/testing splits on the standardized data: one with all columns and one with only the most relevant columns.

X_all = scaled_ad
X_rel = scaled_ad[relevant_columns]
y = ad['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.3, random_state=42)
X_rel_train, X_rel_test, y_rel_train, y_rel_test = train_test_split(X_rel, y, test_size=0.3, random_state=42)
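One thing worth flagging: the scaler above was fit on the full dataset before splitting, so test-set statistics leak into the training features. A common alternative, sketched here on hypothetical data rather than the `ad` frame, is to fit the scaler on the training split only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and labels standing in for ad[all_columns]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y = pd.Series(rng.integers(0, 2, size=100))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Training features are now (exactly) zero-mean, unit-variance;
# test features are close to it but computed with training statistics
print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
```

With only 1000 rows and well-behaved features the difference here is likely small, but the train-only pattern is the safer default.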

Logistic Regression

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
pred = logmodel.predict(X_test)
print(classification_report(y_test,pred))
print('\n')
print(confusion_matrix(y_test,pred))
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       146
          1       0.98      0.96      0.97       154

avg / total       0.97      0.97      0.97       300


[[143   3]
 [  6 148]]

Findings

Training the logistic model on all the data actually led to pretty good results. I want to see if narrowing down the columns to the most relevant ones can boost the accuracy even more.

logmodel.fit(X_rel_train,y_rel_train)
pred_rel = logmodel.predict(X_rel_test)
print(classification_report(y_rel_test,pred_rel))
print('\n')
print(confusion_matrix(y_rel_test,pred_rel))
             precision    recall  f1-score   support

          0       0.93      0.97      0.95       146
          1       0.97      0.93      0.95       154

avg / total       0.95      0.95      0.95       300


[[141   5]
 [ 11 143]]

It turns out that the accuracy was actually slightly worse. This is likely because there were few columns to begin with, so the extra features added signal rather than noise in this case. It could also be an artifact of the dataset not being based on real-world data.
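The drop can be read straight off the two confusion matrices: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count.

```python
import numpy as np

# Confusion matrices from the two logistic models above:
# all columns vs. relevant columns only
cm_all = np.array([[143, 3],
                   [6, 148]])
cm_rel = np.array([[141, 5],
                   [11, 143]])

# Accuracy = correct predictions (the diagonal) / total predictions
acc_all = np.trace(cm_all) / cm_all.sum()
acc_rel = np.trace(cm_rel) / cm_rel.sum()
print(round(acc_all, 3), round(acc_rel, 3))
```

That works out to 291/300 correct with all columns versus 284/300 with the reduced set.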

K-Nearest Neighbors

knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(classification_report(y_test,pred))
print('\n')
print(confusion_matrix(y_test,pred))
             precision    recall  f1-score   support

          0       0.92      0.97      0.95       146
          1       0.97      0.92      0.95       154

avg / total       0.95      0.95      0.95       300


[[142   4]
 [ 12 142]]
knn.fit(X_rel_train,y_rel_train)
pred_rel = knn.predict(X_rel_test)
print(classification_report(y_rel_test,pred_rel))
print('\n')
print(confusion_matrix(y_rel_test,pred_rel))
             precision    recall  f1-score   support

          0       0.92      0.96      0.94       146
          1       0.96      0.92      0.94       154

avg / total       0.94      0.94      0.94       300


[[140   6]
 [ 12 142]]

Here I was expecting K-Nearest Neighbors to do a little better than the logistic regression based on past experience. Just goes to show you have to treat each dataset as its own unique case.

The model still worked better having more features to work with.

Optimizing K-Nearest Neighbors

Previously, I used the default number of neighbors for the classifier. Here I’m going to find the best number of neighbors by minimizing the error rate.

error_rate = []

# Will take some time
for i in range(1,40):

    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

[figure: error rate vs. K value]

Based on the graph, there were several points that had the minimum error rate. I chose the smallest such value, 14.
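Instead of reading the best k off the plot, the same choice can be made programmatically. A sketch, using a hypothetical `error_rate` list in place of the one computed above (one entry per k, starting at k=1):

```python
import numpy as np

# Hypothetical error rates per k (stand-in for the list computed above)
error_rate = [0.10, 0.08, 0.07, 0.05, 0.05, 0.04, 0.04, 0.04, 0.05, 0.06]

# np.argmin returns the FIRST index of the minimum, which corresponds to
# the smallest k achieving the lowest error (indices start at 0, k at 1)
best_k = int(np.argmin(error_rate)) + 1
print(best_k)
```

Because `argmin` breaks ties toward the first occurrence, this reproduces the "smallest k with minimum error" rule used here.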

knn = KNeighborsClassifier(n_neighbors=14)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

knn.fit(X_rel_train,y_rel_train)
pred_rel = knn.predict(X_rel_test)
print('FOR ALL COLUMNS:')
print(classification_report(y_test,pred))
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')

print('FOR RELEVANT COLUMNS:')
print(classification_report(y_rel_test,pred_rel))
print('\n')
print(confusion_matrix(y_rel_test,pred_rel))
FOR ALL COLUMNS:
             precision    recall  f1-score   support

          0       0.92      0.99      0.96       146
          1       0.99      0.92      0.96       154

avg / total       0.96      0.96      0.96       300


[[145   1]
 [ 12 142]]


FOR RELEVANT COLUMNS:
             precision    recall  f1-score   support

          0       0.92      0.97      0.94       146
          1       0.97      0.92      0.94       154

avg / total       0.94      0.94      0.94       300


[[141   5]
 [ 12 142]]

Using 14 for the n_neighbors parameter improved the accuracy as expected, but not enough to beat the logistic regression.

Conclusion

Overall I was able to get a 97% accuracy in predicting whether a person will click an ad based on some user data. This information could be used to determine the best users to target for an ad campaign.

While ad clicks were measured, it would likely be more helpful to determine who is more likely to purchase a product given a set of user data. Since advertisements are meant to drive purchases, being able to directly predict user value given data would decrease costs spent on ads while increasing revenue.
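As a sketch of what that targeting step might look like in practice, the snippet below trains on entirely hypothetical data and uses scikit-learn's `predict_proba` to rank new users by estimated click probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical standardized features; label correlated with the first
# feature so the model has something to learn
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Score a batch of new users and pick the 3 most likely to click
X_new = rng.normal(size=(10, 3))
click_prob = model.predict_proba(X_new)[:, 1]   # P(Clicked on Ad = 1)
top3 = np.argsort(click_prob)[::-1][:3]
print(top3, click_prob[top3].round(2))
```

On the real model, the same ranking over the full user base would surface the highest-value targets for a campaign.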