This data set contains the following features:

• ‘Daily Time Spent on Site’: consumer time on site in minutes
• ‘Age’: consumer age in years
• ‘Area Income’: average income of the consumer’s geographical area
• ‘Daily Internet Usage’: average minutes per day the consumer spends on the internet
• ‘Ad Topic Line’: headline of the advertisement
• ‘City’: city of the consumer
• ‘Male’: whether or not the consumer was male
• ‘Country’: country of the consumer
• ‘Timestamp’: time at which the consumer clicked on the ad or closed the window
• ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad

# Checking Out the Data

## Importing Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline
```

```python
ad = pd.read_csv('advertising.csv')
ad.head()
```
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |

## Data Summary

```python
ad.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
```
```python
ad.describe()
```
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean | 65.000200 | 36.009000 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |

# Exploratory Data Analysis

Here I’m doing a general exploration of the data to see if anything stands out or if there are any noticeable trends.

```python
sns.set_context('notebook')
sns.set_style('white')
```
```python
ad['Age'].plot.hist(bins=40)
plt.xlabel('Age')
```

The age counts are roughly normally distributed, with a spike around 40-42 years old. Let’s see if age correlates with ad clicks.

```python
sns.countplot(x='Age', data=ad, hue='Clicked on Ad')
```

```python
sns.factorplot(x='Clicked on Ad', y='Age', data=ad, kind='swarm')
```

Age doesn’t seem to have a high correlation with ad clicks, but there is some grouping on the extreme ends. Now to check for general trends.
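To put a number on that impression, the pairwise correlation can be computed directly; this quick check wasn’t part of the original analysis:

```python
# Pearson correlation between Age and the 0/1 click label:
# values near 0 mean little linear relationship, values near +/-1 a strong one
ad['Age'].corr(ad['Clicked on Ad'])
```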

```python
sns.pairplot(ad, hue='Clicked on Ad', kind='scatter')
```

```python
sns.heatmap(ad.corr())
```

```python
sns.factorplot(x='Clicked on Ad', y='Daily Time Spent on Site', data=ad, kind='swarm')
```

```python
sns.factorplot(x='Clicked on Ad', y='Area Income', data=ad, kind='swarm')
```

```python
sns.factorplot(x='Clicked on Ad', y='Daily Internet Usage', data=ad, kind='swarm')
```

```python
sns.factorplot(x='Clicked on Ad', y='Age', data=ad, kind='swarm')
```

## Findings

Most of the groupings overlap with one another when analyzing who clicked on an ad relative to each feature. The clearest separation was found between Clicked on Ad and:

• Daily Internet Usage
• Daily Time Spent on Site
• Area Income

Being male didn’t seem to have much effect on whether an ad was clicked. I’m going to model the data twice: once with all of the numeric columns and once with just the most relevant ones.
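To back that up, the click rate can be compared across the ‘Male’ flag directly; this quick check wasn’t part of the original notebook:

```python
# Mean of the 0/1 'Clicked on Ad' label within each gender group;
# near-identical values suggest gender carries little signal on its own
ad.groupby('Male')['Clicked on Ad'].mean()
```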

```python
all_columns = ['Male', 'Age', 'Daily Internet Usage', 'Daily Time Spent on Site', 'Area Income']
relevant_columns = ['Daily Internet Usage', 'Daily Time Spent on Site', 'Area Income']
```

# Prediction and Modeling

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(ad[all_columns])
```
```
StandardScaler(copy=True, with_mean=True, with_std=True)
```
```python
scaled_features = scaler.transform(ad[all_columns])
scaled_features[:5]
```
```
array([[-0.96269532, -0.11490498,  1.73403   ,  0.24926659,  0.50969109],
       [ 1.03875025, -0.57042523,  0.31380538,  0.96113227,  1.00253021],
       [-0.96269532, -1.13982553,  1.28758905,  0.28208309,  0.35694859],
       [ 1.03875025, -0.79818535,  1.50157989,  0.57743162, -0.01445564],
       [-0.96269532, -0.11490498,  1.03873069,  0.21266356,  1.40886751]])
```
```python
scaled_ad = pd.DataFrame(scaled_features, columns=all_columns)
scaled_ad.head()
```
| | Male | Age | Daily Internet Usage | Daily Time Spent on Site | Area Income |
|---|---|---|---|---|---|
| 0 | -0.962695 | -0.114905 | 1.734030 | 0.249267 | 0.509691 |
| 1 | 1.038750 | -0.570425 | 0.313805 | 0.961132 | 1.002530 |
| 2 | -0.962695 | -1.139826 | 1.287589 | 0.282083 | 0.356949 |
| 3 | 1.038750 | -0.798185 | 1.501580 | 0.577432 | -0.014456 |
| 4 | -0.962695 | -0.114905 | 1.038731 | 0.212664 | 1.408868 |

## Setting Up the Data

I’m creating two train/test splits on the standardized data: one with all of the columns and one with just the relevant ones.

```python
X_all = scaled_ad
X_rel = scaled_ad[relevant_columns]
y = ad['Clicked on Ad']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.3, random_state=42)
X_rel_train, X_rel_test, y_rel_train, y_rel_test = train_test_split(X_rel, y, test_size=0.3, random_state=42)
```
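One caveat worth noting: the scaler above was fit on the full dataset before splitting, so the test rows contribute to the scaling statistics. A leak-free variant, sketched here with hypothetical variable names rather than the ones used above, would fit the scaler on the training portion only:

```python
# Split the raw features first, then fit the scaler on the training rows only
raw_train, raw_test, y_tr, y_te = train_test_split(
    ad[all_columns], ad['Clicked on Ad'], test_size=0.3, random_state=42)

leakfree_scaler = StandardScaler().fit(raw_train)   # mean/std from training data alone
X_tr = leakfree_scaler.transform(raw_train)         # scale the training features
X_te = leakfree_scaler.transform(raw_test)          # apply the same transform to the test features
```

With only 1,000 rows and well-behaved features the difference here is tiny, but it keeps the evaluation honest.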

## Logistic Regression

```python
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
```
```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```
```python
pred = logmodel.predict(X_test)
```
```python
print(classification_report(y_test, pred))
print('\n')
print(confusion_matrix(y_test, pred))
```
```
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       146
          1       0.98      0.96      0.97       154

avg / total       0.97      0.97      0.97       300


[[143   3]
 [  6 148]]
```

## Findings

Training the logistic model on all of the columns actually led to pretty good results. I want to see if narrowing them down to the most relevant ones can boost the accuracy even more.
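Because the features were standardized, the fitted coefficients give a rough sense of each column’s relative influence. A quick way to inspect them, assuming logmodel is still the model fit on X_train above:

```python
# Pair each coefficient with its column name and sort by absolute size;
# larger magnitudes indicate a stronger pull on the click prediction
coefs = pd.Series(logmodel.coef_[0], index=X_all.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```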

```python
logmodel.fit(X_rel_train, y_rel_train)
pred_rel = logmodel.predict(X_rel_test)
```
```python
print(classification_report(y_rel_test, pred_rel))
print('\n')
print(confusion_matrix(y_rel_test, pred_rel))
```
```
             precision    recall  f1-score   support

          0       0.93      0.97      0.95       146
          1       0.97      0.93      0.95       154

avg / total       0.95      0.95      0.95       300


[[141   5]
 [ 11 143]]
```

It turns out that the accuracy was actually slightly worse. This is likely because there were very few columns to begin with, so dropping some threw away useful signal rather than noise. It could also be because the dataset isn’t based on real-world data.

## K-Nearest Neighbors

```python
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
```
```python
print(classification_report(y_test, pred))
print('\n')
print(confusion_matrix(y_test, pred))
```
```
             precision    recall  f1-score   support

          0       0.92      0.97      0.95       146
          1       0.97      0.92      0.95       154

avg / total       0.95      0.95      0.95       300


[[142   4]
 [ 12 142]]
```
```python
knn.fit(X_rel_train, y_rel_train)
pred_rel = knn.predict(X_rel_test)
```
```python
print(classification_report(y_rel_test, pred_rel))
print('\n')
print(confusion_matrix(y_rel_test, pred_rel))
```
```
             precision    recall  f1-score   support

          0       0.92      0.96      0.94       146
          1       0.96      0.92      0.94       154

avg / total       0.94      0.94      0.94       300


[[140   6]
 [ 12 142]]
```

Here I was expecting K-Nearest Neighbors to do a little better than the logistic regression based on past experience. Just goes to show you have to treat each dataset as its own unique case.

As with logistic regression, the model still performed better with more features to work with.

## Optimizing K-Nearest Neighbors

Previously, I used the default number of neighbors for the classifier. Here I’m going to find the value that works best by minimizing the error rate on the test set.

```python
error_rate = []

# Will take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
```
```python
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
```

Based on the graph, several values of K tied for the minimum error rate. I chose the smallest of them, 14.
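The same choice can be read off programmatically instead of eyeballed from the plot; a small check, not in the original:

```python
# Index of the first (smallest) K achieving the minimum error rate;
# +1 because the loop above started at K=1
best_k = int(np.argmin(error_rate)) + 1
print(best_k, min(error_rate))
```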

```python
knn = KNeighborsClassifier(n_neighbors=14)

knn.fit(X_train, y_train)
pred = knn.predict(X_test)

knn.fit(X_rel_train, y_rel_train)
pred_rel = knn.predict(X_rel_test)
```
```python
print('FOR ALL COLUMNS:')
print(classification_report(y_test, pred))
print('\n')
print(confusion_matrix(y_test, pred))
print('\n')
print('FOR RELEVANT COLUMNS:')
print(classification_report(y_rel_test, pred_rel))
print('\n')
print(confusion_matrix(y_rel_test, pred_rel))
```
```
FOR ALL COLUMNS:
             precision    recall  f1-score   support

          0       0.92      0.99      0.96       146
          1       0.99      0.92      0.96       154

avg / total       0.96      0.96      0.96       300


[[145   1]
 [ 12 142]]


FOR RELEVANT COLUMNS:
             precision    recall  f1-score   support

          0       0.92      0.97      0.94       146
          1       0.97      0.92      0.94       154

avg / total       0.94      0.94      0.94       300


[[141   5]
 [ 12 142]]
```

Using 14 for the n_neighbors parameter improved the accuracy as expected, but not enough to do better than the logistic regression.
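One caution: K was tuned by minimizing error on the same test set used to report the final scores, which slightly overstates how well the chosen value generalizes. A cross-validated search over K on the training split only would be a more robust way to pick it; a minimal sketch using scikit-learn’s GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search for the best K, using only the training split
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 40))},
                    cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```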

## Conclusion

Overall I was able to get a 97% accuracy in predicting whether a person will click an ad based on some user data. This information could be used to determine the best users to target for an ad campaign.
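For reference, that headline figure corresponds to the logistic regression on all of the scaled columns; it can be reproduced with accuracy_score, assuming logmodel is refit on X_train (it was last fit on the reduced feature set above):

```python
from sklearn.metrics import accuracy_score

# Refit on the full-feature training split and score the held-out test set
logmodel.fit(X_train, y_train)
print(accuracy_score(y_test, logmodel.predict(X_test)))  # ~0.97 per the report above
```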

While ad clicks were measured, it would likely be more helpful to determine who is more likely to purchase a product given a set of user data. Since advertisements are meant to drive purchases, being able to directly predict user value given data would decrease costs spent on ads while increasing revenue.