Sales Prediction｜Coding Crossroads

In the competitive world of e-commerce, accurately predicting sales is crucial for optimizing marketing strategies, managing inventory, and maximizing revenue.

By leveraging the Google Analytics Sample Dataset in BigQuery, we can gain valuable insights into how traffic sources and user demographics influence purchase behavior.

The goal of this analysis is to build predictive models using BigQuery ML to forecast e-commerce sales and identify key factors driving these sales.

Data Preparation
Exploratory Data Analysis
- Traffic Sources and Sales Impact
- User Demographics and Purchase Behavior
Feature Engineering
- Categorical Features
- Normalization
Model Selection
- Logistic Regression
- Random Forest
- XGBoost
- Deep Neural Network (DNN)
Model Implementation
- Model Creation and Training
- Model Evaluation
Model Comparison
- Performance Metrics (AUC, Precision, Recall, Accuracy, F1 Score)
- Result
- Discussion
Impact of Additional Features on Model Performance
Adjustments and Their Impact on Deep Neural Network Performance
Conclusion
- Advantages of Using BigQuery ML
- Disadvantages of Using BigQuery ML

1. Data Preparation

We begin by preparing our dataset from the Google Analytics Sample Dataset. The following SQL query creates a table that consolidates traffic source information, user demographics, and sales data.

SQL

-- Data Preparation
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section3.sales_prediction_data` AS
SELECT
  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS session_id,
  trafficSource.source AS traffic_source,
  trafficSource.medium AS traffic_medium,
  device.deviceCategory AS device_type,
  geoNetwork.country AS country,
  totals.pageviews AS pageviews,
  totals.timeOnSite AS time_on_site,
  totals.transactions AS transactions,
  IFNULL(totals.transactionRevenue, 0) / 1000000 AS revenue
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170731';

2. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a critical step in understanding the underlying patterns and relationships in your dataset. For this blog post, we’ll perform EDA to gain insights into how different traffic sources and user demographics impact sales. We’ll also create visualizations to illustrate these insights effectively.

Traffic Sources and Sales Impact

Objective: Analyze how different traffic sources contribute to sales.

Approach

Identify Top Traffic Sources: Determine which traffic sources (e.g., organic search, direct, referral) are driving the most visits and transactions.
Calculate Conversion Rates: Measure the conversion rate for each traffic source to understand their effectiveness in driving sales.
Revenue Contribution: Assess the total revenue generated by each traffic source to identify the most profitable sources.

SQL

-- Traffic Sources and Sales Impact
SELECT
  traffic_source,
  COUNT(*) AS total_visits,
  SUM(transactions) AS total_transactions,
  SUM(transactions) / COUNT(*) AS conversion_rate,
  SUM(revenue) AS total_revenue
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data`
GROUP BY
  traffic_source
ORDER BY
  total_visits DESC;

Explanation:

total_visits: The number of visits from each traffic source.
total_transactions: The total number of transactions attributed to each traffic source.
conversion_rate: The ratio of transactions to total visits, indicating the effectiveness of the traffic source.
total_revenue: The total revenue generated from each traffic source.

Result:

The table below summarizes the total 10 visits, total transactions, conversion rates, and total revenue generated by traffic source:

The analysis reveals interesting insights into the performance of various traffic sources in driving sales and generating revenue.

Top Traffic Sources

1) Direct
– Direct traffic is the highest in terms of total visits and total revenue, indicating that users who visit the site directly are more likely to make purchases.

2) google
– Google search traffic also contributes significantly to the site’s revenue. Although the conversion rate is not high, the sheer volume of visits results in a substantial total revenue.

3) youtube.com
– YouTube traffic generates a high volume of visits, but the conversion rate and total revenue are relatively low. This indicates that while YouTube is effective in driving traffic, it may not be as effective in converting that traffic into sales.

4) Partners
– Partner referrals show a higher conversion rate compared to other sources, but the total revenue remains modest due to the lower volume of visits.

5) DoubleClick for Advertisers (DFA)
– DoubleClick for Advertisers (DFA) traffic shows a high conversion rate and significant total revenue, highlighting its effectiveness in driving sales.

Traffic Source	Total Visits	Total Transactions	Conversion Rate	Total Revenue
Direct	369,303	9,116	0.0028%	$1,181,882
google	241,280	2,441	0.0043%	$229,022
youtube.com	212,422	11	0.0047%	$314
Partners	16,359	9	0.0611%	$597
analytics.google.com	16,115	0	0.0000%	$0
DFA	5,671	132	0.0188%	$76,928
google.com	4,657	5	0.0215%	$347
m.facebook.com	3,360	11	0.0546%	$187
baidu	3,351	0	0.0000%	$0
sites.google.com	2,975	43	0.0344%	$4,391

User Demographics and Purchase Behavior

Objective: Examine the relationship between user demographics (e.g., device type, country) and purchase behavior.

Approach

Device Type Analysis: Analyze how different device types (e.g., desktop, mobile, tablet) impact user behavior and sales.
Geographical Analysis: Examine sales performance across different countries to identify high-value markets.

SQL Query for Device Type Analysis:

SQL

-- Device Type Analysis
SELECT
  device_type,
  COUNT(*) AS total_visits,
  SUM(transactions) AS total_transactions,
  SUM(transactions) / COUNT(*) AS conversion_rate,
  SUM(revenue) AS total_revenue
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data`
GROUP BY
  device_type
ORDER BY
  total_visits DESC;

SQL Query for Geographical Analysis:

SQL

-- Geographical Analysis
SELECT
  country,
  COUNT(*) AS total_visits,
  SUM(transactions) AS total_transactions,
  SUM(transactions) / COUNT(*) AS conversion_rate,
  SUM(revenue) AS total_revenue
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data`
GROUP BY
  country
ORDER BY
  total_revenue DESC
LIMIT 10;

Explanation:

total_visits: The number of visits by device type or country.
total_transactions: The total number of transactions by device type or country.
conversion_rate: The ratio of transactions to total visits by device type or country.
total_revenue: The total revenue generated by device type or country.

Result:

1) Device Type Analysis

The table below summarizes the total visits, total transactions, conversion rates, and total revenue generated by each device type:

Device Type	Total Visits	Total Transactions	Conversion Rate	Total Revenue
Desktop	662,737	11,028	0.0001587%	$1,472,575
Mobile	208,000	864	0.0004864%	$49,769
Tablet	30,360	178	0.0034899%	$9,421

Desktop: Desktop users account for the highest number of visits and total revenue. Despite having a low conversion rate, the high volume of visits leads to substantial total revenue, making desktop the most profitable device type.

Mobile: Mobile users show a higher conversion rate compared to desktop users, but the total revenue generated is significantly lower. This indicates that while mobile users are more likely to convert, their purchase amounts may be smaller, or there may be fewer high-value transactions.

Tablet: Tablet users have the highest conversion rate among the three device types. However, the total revenue from tablet users is the lowest. This suggests that tablet users, although fewer in number, are more likely to make purchases when they visit the site.

2) Geographical Analysis

The table below summarizes the total visits, total transactions, conversion rates, and total revenue generated by the top 10 countries:

Country	Total Visits	Total Transactions	Conversion Rate	Total Revenue
United States	363,457	11,396	0.0002864%	$1,444,138
Canada	25,784	199	0.0040408%	$32,824
Venezuela	2,128	153	0.0114125%	$13,374
Japan	19,690	18	0.0053775%	$6,728
Kenya	769	3	0.0130039%	$5,268
Nigeria	1,440	2	0.0006944%	$3,302
Taiwan	12,975	19	0.0077071%	$1,920
Indonesia	9,258	18	0.0017675%	$1,840
Australia	12,654	13	0.0079026%	$1,745
United Kingdom	37,251	19	0.0031878%	$1,689

United States: The United States leads in total visits, transactions, and revenue, making it the most valuable market. Despite a low conversion rate, the large volume of visits results in significant total revenue.

Canada: Canada shows a higher conversion rate compared to the United States, indicating that Canadian users are more likely to make purchases. The total revenue is also notable, making Canada a high-value market.

Venezuela: Venezuela stands out with the highest conversion rate among the top 10 countries. Although the total visits are relatively low, the high conversion rate leads to substantial revenue, indicating a highly engaged user base.

Japan, Kenya, Nigeria, Taiwan, Indonesia, Australia, and United Kingdom: These countries exhibit varying conversion rates and total revenues. Japan and Kenya have higher conversion rates, suggesting strong purchase behavior among users from these countries. However, the total revenue from these countries is lower compared to the United States and Canada, likely due to fewer total visits.

3. Feature Engineering

Feature engineering is a crucial step in preparing the dataset for predictive modeling. It involves transforming raw data into meaningful features that can be used by machine learning algorithms to improve model performance. Here, we’ll cover three main aspects of feature engineering: creating categorical features, deriving additional features, and normalizing numerical features.

Categorical Features

Objective: Convert traffic sources and user demographics into categorical features that can be used in machine learning models.

Approach

Traffic Source: Convert the traffic source information (e.g., trafficSource.source) into categorical features. This allows the model to differentiate between various traffic sources like ‘google’, ‘direct’, ‘referral’, etc.
User Demographics: Convert user demographics such as device type (device.deviceCategory) and country (geoNetwork.country) into categorical features. This helps the model understand the impact of these demographics on purchase behavior.

SQL

-- Feature Engineering: Categorical Variables
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section3.sales_prediction_data_with_categorical` AS
SELECT
  *,
  CAST(traffic_source AS STRING) AS traffic_source_cat,
  CAST(device_type AS STRING) AS device_type_cat,
  CAST(country AS STRING) AS country_cat
FROM
   `predictive-behavior-analytics.Section3.sales_prediction_data`;

Explanation:

traffic_source_cat: Converts the trafficSource.source to a categorical string feature.
device_type_cat: Converts the device.deviceCategory to a categorical string feature.
country_cat: Converts the geoNetwork.country to a categorical string feature.

Normalization

Objective: Normalize numerical features like pageviews and time on site to ensure they contribute effectively to the model and are on a comparable scale.

Approach

Pageviews: Normalize the number of pageviews.
Time on Site: Normalize the time spent on site.

SQL

-- Feature Engineering: Normalized Variables
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section3.sales_prediction_data_normalized` AS
SELECT
  *,
  (pageviews - (SELECT AVG(pageviews) FROM `predictive-behavior-analytics.Section3.sales_prediction_data`)) / (SELECT STDDEV(pageviews) FROM `predictive-behavior-analytics.Section3.sales_prediction_data`) AS normalized_pageviews,
  (time_on_site - (SELECT AVG(time_on_site) FROM `predictive-behavior-analytics.Section3.sales_prediction_data`)) / (SELECT STDDEV(time_on_site) FROM `predictive-behavior-analytics.Section3.sales_prediction_data`) AS normalized_time_on_site
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data`;

Explanation

normalized_pageviews: Normalizes the pageviews feature by subtracting the mean and dividing by the standard deviation.
normalized_time_on_site: Normalizes the time_on_site feature by subtracting the mean and dividing by the standard deviation.

4. Model Selection

In this section, we will showcase four different models for predicting whether a sale occurs. Each model has its own strengths and weaknesses, and by comparing them, we can determine the best approach for our predictive task. The models we will use are Logistic Regression, Random Forest Classifier, XGBoost, and Deep Neural Network (DNN).

Initial Feature Selection Rationale

In the initial analysis, only the following features were included:

traffic_source_cat
device_type_cat
country_cat
normalized_pageviews
normalized_time_on_site

Explanation: These features were selected based on the results of the Exploratory Data Analysis (EDA) and the limitations of BigQuery ML to handle a large number of features efficiently. The EDA revealed that traffic sources, device types, and user demographics such as country were significant factors influencing purchase behavior. Meanwhile, BigQuery ML’s limitations in managing a vast feature set necessitated focusing on the most impactful variables to ensure model efficiency and performance.

1. Logistic Regression

Overview:

Type: Linear Model
Purpose: Acts as a baseline model to compare against more complex models.
Strengths: Simple, interpretable, and computationally efficient.
Weaknesses: Assumes a linear relationship between the features and the target variable, which may not capture complex patterns.

2. Random Forest Classifier

Overview:

Type: Ensemble Learning Model
Purpose: Combines multiple decision trees in parallel to create a “forest” of trees for classification tasks.
Strengths: Handles high-dimensional data well, robust to overfitting due to averaging of multiple trees, can capture non-linear relationships, provides feature importance rankings
Weaknesses: Can be computationally intensive for very large datasets, less interpretable than single decision trees, ,may struggle with very sparse datasets

3. XGBoost

Overview:

Type: Gradient Boosting Model
Purpose: Uses boosting to create a strong predictive model by sequentially adding trees to correct errors of previous trees.
Strengths: High predictive accuracy, handles missing data well, and provides feature importance.
Weaknesses: Can be sensitive to hyperparameters and prone to overfitting if not properly tuned.

4. Deep Neural Network (DNN)

Overview:

Type: Neural Network
Purpose: Captures complex patterns in data by using multiple layers of neurons.
Strengths: Capable of modeling complex, non-linear relationships, and performs well on large datasets.
Weaknesses: Requires more computational resources, can be difficult to interpret, and needs careful tuning of hyperparameters.

5. Model Implementation

For each model, we will follow these steps:

Model Creation and Training:

Use BigQuery ML to define and train the model using SQL.
Specify the model type and the target label column (made_purchase).

SQL

-- Logistic Regression Model
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section3.log_reg_sales_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['made_purchase']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  IF(transactions > 0, 1, 0) AS made_purchase
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data_final`;

SQL

-- Random Forest Model with Specific Features
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section3.random_forest_sales_model_revised`
OPTIONS(
  model_type='random_forest_classifier',
  input_label_cols=['made_purchase']
) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  IF(transactions > 0, 1, 0) AS made_purchase
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data_final`;

SQL

-- XGBoost Model
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section3.xgboost_sales_model`
OPTIONS(model_type='boosted_tree_classifier', input_label_cols=['made_purchase']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  IF(transactions > 0, 1, 0) AS made_purchase
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data_final`;

SQL

-- Deep Neural Network Model
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section3.dnn_sales_model`
OPTIONS(model_type='dnn_classifier', hidden_units=[128, 64, 32], input_label_cols=['made_purchase']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  IF(transactions > 0, 1, 0) AS made_purchase
FROM
  `predictive-behavior-analytics.Section3.sales_prediction_data_final`;

Model Evaluation:

Evaluate the model’s performance using metrics such as AUC, precision, recall, and accuracy.
Use BigQuery ML’s built-in evaluation functions to assess the model.

Example Evaluation Query:

SQL

-- Evaluate the Logistic Regression Model and save the results
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section3.log_reg_sales_model_evaluation` AS
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `predictive-behavior-analytics.Section3.log_reg_sales_model`,
    (
      SELECT
        traffic_source_cat,
        device_type_cat,
        country_cat,
        normalized_pageviews,
        normalized_time_on_site,
        IF(transactions > 0, 1, 0) AS made_purchase
      FROM
        `predictive-behavior-analytics.Section3.sales_prediction_data_final`
    )
  );

You can find the complete code in my GitHub repository.

6. Model Comparison

In this section, we will compare the performance of the four models we built for predicting e-commerce sales: Logistic Regression, Random Forest Classifier, XGBoost, and Deep Neural Network (DNN). We will use several performance metrics to evaluate and contrast these models, and discuss the trade-offs between their complexity and interpretability.

Performance Metrics

To determine which model provides the best predictive performance, I evaluated each model using the following metrics:

AUC (Area Under the ROC Curve):

Definition: AUC measures the ability of the model to distinguish between positive and negative classes. It ranges from 0 to 1, with 1 indicating perfect classification and 0.5 indicating no discriminatory power.
Usage: AUC is useful for understanding the overall performance of a model regardless of the classification threshold.

Precision:

Definition: Precision is the ratio of true positive predictions to the total number of positive predictions (both true positives and false positives). It indicates how many of the predicted positive instances are actually positive.
Formula: Precision = TP / (TP + FP)
Usage: Precision is important when the cost of false positives is high, and we want to minimize incorrect positive predictions.

Recall:

Definition: Recall (or Sensitivity) is the ratio of true positive predictions to the total number of actual positive instances. It indicates how many of the actual positive instances are correctly predicted.
Formula: Recall = TP / (TP + FN)
Usage: Recall is important when the cost of false negatives is high, and we want to capture as many true positives as possible.

Accuracy:

Definition: Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total number of predictions.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Usage: Accuracy provides a general measure of the model’s performance, but it can be misleading in imbalanced datasets.

F1 Score:

Definition: F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall.
Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Usage: F1 Score is useful when we need to balance precision and recall.

Results

In this section, we compare the performance of the four models built for predicting e-commerce sales: Logistic Regression, Random Forest Classifier, XGBoost, and Deep Neural Network (DNN). We evaluated these models using AUC, precision, recall, accuracy, and F1 score. Below is a summary of the results:

Model	AUC	Precision	Recall	Accuracy	F1 Score
Logistic Regression	0.98	0.41	0.17	0.99	0.24
Random Forest Classifier	0.98	0.55	0.09	0.99	0.16
XGBoost	0.98	0.00	0.00	0.99	0.00
Deep Neural Network	0.98	1.00	0.00	0.99	0.00

AUC (Area Under the ROC Curve):

All four models achieve a high AUC of 0.98, indicating that they are similarly effective at distinguishing between classes (purchases vs. non-purchases).

Precision:

Precision varies significantly across the models. The Deep Neural Network achieves perfect precision (1.00), meaning that when it predicts a purchase, it is always correct.
However, this comes at the cost of an extremely low recall, as the model fails to identify most purchasers.
The Random Forest Classifier and Logistic Regression models have lower precision (0.55 and 0.41, respectively), but they may offer a better balance between precision and recall.
XGBoost, with a precision of 0.00, fails to make any correct purchase predictions, indicating potential issues with its configuration or suitability for this dataset.

Recall:

Recall measures the model’s ability to identify all positive instances. The recall values were very low across all models.
The Deep Neural Network model had a recall of 0.00, indicating it captured very few actual positive instances.
The Logistic Regression model had a recall of 0.17, which is higher than the other models but still indicates room for improvement.
Both the Random Forest Classifier and XGBoost models had a recall of 0.0, similar to their precision, indicating a failure to identify positive instances.

Accuracy:

All models have a high accuracy of 0.99, but this metric is potentially misleading due to the class imbalance in the dataset (with far more non-purchasers than purchasers).
High accuracy in this context means that the models are primarily predicting non-purchasers correctly, which aligns with the low recall and F1 scores.

F1 Score:

The F1 Score, which balances precision and recall, further illustrates the challenges these models face.
Logistic Regression achieves an F1 Score of 0.24, the highest among the four, but still reflects poor performance in identifying true positives.
The Random Forest Classifier has an F1 Score of 0.16, indicating even worse performance.
XGBoost and the Deep Neural Network both have an F1 Score of 0.00, confirming their inability to predict actual purchasers despite their high accuracy and AUC.

Discussion

While all models show high AUC and accuracy, these metrics alone do not capture the true performance of the models in this use case.

The primary issue across all models is the poor recall, which means they are not effectively identifying actual purchasers.

Logistic Regression shows the best balance among the models, but its performance is still inadequate for practical use.

XGBoost and the Deep Neural Network, despite their potential, fail to deliver meaningful predictions due to extremely low recall and F1 scores.

Impact of Additional Features

In the process of refining Logistic Regression, Random Forest Classifier and XGBoost models, I expanded the feature set to all available features. This enhancement led to significant improvements in model performance, as evidenced by the comparison between the old and new results.

Old Results

Model	AUC	Precision	Recall	Accuracy	F1 Score
Logistic Regression	0.98	0.41	0.17	0.99	0.24
Random Forest Classifier	0.98	0.55	0.09	0.99	0.16
XGBoost	0.98	0.00	0.00	0.99	0.00

New Results

Model	AUC	Precision	Recall	Accuracy	F1 Score
Logistic Regression	1.00	0.99	0.99	1.00	0.99
Random Forest Classifier	1.00	1.00	1.00	1.00	1.00
XGBoost	1.00	1.00	1.00	1.00	1.00

Key Observations

Logistic Regression:

With the expanded feature set, Logistic Regression achieved near-perfect performance (AUC, precision, recall, accuracy, and F1 Score close to 1.0).
This improvement highlights that Logistic Regression, a simpler model, can significantly benefit from a richer feature set, improving its ability to capture relationships within the data.

Random Forest Classifier and XGBoost Models:

After incorporating all available features, these models achieved perfect scores (AUC, precision, recall, accuracy, and F1 Score all equal to 1.0).
This drastic improvement suggests that the additional features provided the necessary information for the models to distinguish between positive and negative instances effectively.
However, the perfect performance could also indicate potential overfitting, where the model performs exceptionally well on the training data but might not generalize to new data as effectively.

Adjustments and Their Impact on Deep Neural Network Performance

In the course of refining the models, it was necessary to make adjustments to the Deep Neural Network (DNN) due to BigQuery’s capacity limitations. While the other models benefited from the inclusion of all available features, the DNN was restricted to using the original feature set.

To mitigate the impact of this limitation and improve the DNN’s performance, several changes were made to the model’s configuration:

Changes to DNN Configuration:

Reduction in Hidden Units:
- Old Configuration: [128, 64, 32]
- New Configuration: [64, 32]
- Rationale: The number of units in each hidden layer was reduced to simplify the model. This adjustment helps in managing the model size within BigQuery’s capacity constraints and can also prevent overfitting, particularly when the dataset is relatively small or when the model is complex.
Auto Class Weights:
- New Addition: auto_class_weights=true
- Rationale: Automatically adjusting class weights helps in addressing class imbalance, ensuring that the model does not become biased towards the majority class. This is particularly important in scenarios where there are significantly more instances of one class than another, as it helps the model learn to recognize minority class instances more effectively.
Auto Category Hash:
- New Addition: auto_category_hash=true
- Rationale: This option reduces the number of input features by automatically hashing categorical variables. This not only helps in reducing the model size but also enables the model to handle high-cardinality categorical features more efficiently.

Performance Comparison

Old Results

Model	AUC	Precision	Recall	Accuracy	F1 Score
Deep Neural Network	0.98	1.00	0.00	0.99	0.00

New Results

Model	AUC	Precision	Recall	Accuracy	F1 Score
Deep Neural Network	0.98	0.16	0.98	0.93	0.27

Improved Recall:
The new configuration led to a significant improvement in recall (from 0.00 to 0.98). This suggests that the model is now much better at identifying positive instances (purchases) than before. The use of auto_class_weights=true likely played a key role in this improvement by ensuring that the model paid adequate attention to the minority class during training.

Reduced Precision:
While recall improved, precision dropped from 1.0 to 0.16. This drop indicates that while the model became better at identifying true positives, it also increased the number of false positives. This trade-off is common when dealing with imbalanced datasets and highlights the importance of balancing precision and recall.

F1 Score Improvement:
The F1 Score, which balances precision and recall, improved from 0.00 to 0.27. This improvement indicates that the new configuration provides a better overall balance between identifying true positives and minimizing false positives.

AUC and Accuracy:
The AUC remained high, indicating that the model maintains strong discriminatory power. The accuracy decreased slightly to 0.93, but this is expected given the trade-off between precision and recall. A decrease in accuracy is not necessarily negative in this context, as it can reflect a model’s improved ability to correctly classify positive instances in an imbalanced dataset.

Conclusion

I explored the process of predicting e-commerce sales using BigQuery ML, leveraging the Google Analytics Sample Dataset.

The analysis aimed to understand how traffic sources and user demographics influence purchase behavior and to build predictive models to forecast sales. The results provided valuable insights into model performance, particularly the strengths and weaknesses of different machine learning approaches.

Advantages of Using BigQuery ML:

Ease of Integration:

BigQuery ML offers seamless integration with SQL, making it accessible for data analysts who are familiar with SQL but not necessarily with more complex machine learning frameworks. This lowers the barrier to entry for building and deploying models.

Scalability:

BigQuery ML is designed to handle large datasets efficiently, making it ideal for processing and analyzing massive e-commerce datasets. The ability to train models directly on data stored in BigQuery without needing to export it to another platform is a significant advantage.

Speed and Performance:

By leveraging Google Cloud’s infrastructure, BigQuery ML can train and evaluate models quickly, even on large datasets. This speed enables faster iteration and experimentation, which is crucial in a competitive e-commerce environment.

Built-in Model Evaluation:

BigQuery ML provides built-in functions for model evaluation, such as calculating AUC, precision, recall, accuracy, and F1 score. This simplifies the process of assessing model performance and comparing different models.

Disadvantages of Using BigQuery ML:

Model Complexity and Size Limitations:

As seen with the Deep Neural Network (DNN) model, BigQuery ML imposes limits on model size. Complex models with many features or layers can exceed these limits, requiring simplifications that might impact model performance.

Limited Customization:

While BigQuery ML is convenient, it offers less flexibility compared to dedicated machine learning frameworks like TensorFlow or PyTorch. Advanced users might find the lack of detailed customization options limiting, particularly when tuning hyperparameters or experimenting with novel architectures.

Feature Engineering Constraints:

Some advanced feature engineering techniques, such as manual feature interactions or custom transformations, are more challenging to implement in BigQuery ML compared to Python-based environments like Jupyter Notebooks. This can limit the ability to fully optimize models.

Potential Overfitting:

The perfect scores achieved by models like Random Forest Classifier and XGBoost after expanding the feature set suggest possible overfitting, where the model performs exceptionally well on the training data but may not generalize to new data as effectively. This highlights the importance of cross-validation and careful model tuning, which can be more challenging in the BigQuery ML environment.

Final Thoughts

Overall, BigQuery ML is a powerful tool for building and deploying machine learning models directly within a data warehouse environment. Its ease of use, scalability, and integration with SQL make it an excellent choice for data analysts and businesses looking to quickly derive insights and predictions from large datasets.

However, users must be aware of its limitations, particularly concerning model complexity and customization. For more advanced use cases or when working with highly complex models, it might be necessary to complement BigQuery ML with other machine learning tools or platforms.

Contents

1. Data Preparation

2. Exploratory Data Analysis

Traffic Sources and Sales Impact

Approach

Explanation:

Result:

User Demographics and Purchase Behavior

Approach

Explanation:

Result:

3. Feature Engineering

Categorical Features

Approach

Explanation:

Normalization

Approach

Explanation

4. Model Selection

Initial Feature Selection Rationale

1. Logistic Regression

2. Random Forest Classifier

3. XGBoost

4. Deep Neural Network (DNN)

5. Model Implementation

Model Creation and Training:

Model Evaluation:

6. Model Comparison

Performance Metrics

Results

Discussion

Impact of Additional Features

Old Results

New Results

Key Observations

Logistic Regression:

Random Forest Classifier and XGBoost Models:

Adjustments and Their Impact on Deep Neural Network Performance

Changes to DNN Configuration:

Performance Comparison

Old Results

New Results

Conclusion

Final Thoughts