Predicting User Conversion

In today’s data-driven marketing landscape, predicting which users are likely to convert is crucial for optimizing strategies and improving ROI. In this blog post, we’ll walk through the process of building user conversion prediction models using BigQuery ML and the Google Analytics Sample Dataset. We’ll cover feature development, model building, and evaluation, all using SQL.

Contents

  1. Feature Development
  2. Building and Training Predictive Models
    • Logistic Regression Model
    • Random Forest Model
    • XGBoost Model
  3. Confusion Matrix Results
  4. Performance Comparison
  5. Feature Importance
  6. Conclusion

1. Feature Development

Let’s create features that could be indicative of user conversion:

SQL
-- Aggregate session-level GA data into one feature row per user.
CREATE OR REPLACE TABLE `your-project.your-dataset.user_features` AS
WITH user_sessions AS (
  SELECT
    fullVisitorId,
    PARSE_DATE('%Y%m%d', date) AS visit_date,
    totals.transactions,
    totals.timeOnSite,
    totals.pageviews,
    device.deviceCategory,
    geoNetwork.country,
    trafficSource.medium
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE
    _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
)
SELECT
  fullVisitorId,
  -- Label: 1 if the user transacted in any session, 0 otherwise.
  MAX(CASE WHEN transactions > 0 THEN 1 ELSE 0 END) AS has_converted,
  COUNT(DISTINCT visit_date) AS num_visits,
  AVG(timeOnSite) AS avg_time_on_site,
  AVG(pageviews) AS avg_pageviews,
  -- MAX() simply picks one value per user when sessions disagree.
  MAX(deviceCategory) AS device_category,
  MAX(country) AS country,
  MAX(medium) AS traffic_medium,
  SUM(pageviews) AS total_pageviews,
  SUM(timeOnSite) AS total_time_on_site,
  DATE_DIFF(MAX(visit_date), MIN(visit_date), DAY) AS days_since_first_visit
FROM
  user_sessions
GROUP BY
  fullVisitorId;

In this query, we’ve created several features that might be predictive of conversion:

  1. has_converted: Whether the user has made a purchase (our target variable)
  2. num_visits: Number of visits by the user
  3. avg_time_on_site: Average time spent on the site per visit
  4. avg_pageviews: Average number of pages viewed per visit
  5. device_category: The user’s device type
  6. country: The user’s country
  7. traffic_medium: The traffic source medium
  8. total_pageviews: Total number of pages viewed across all visits
  9. total_time_on_site: Total time spent on the site across all visits
  10. days_since_first_visit: Number of days between the user’s first and last visit
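Before training anything, it's worth checking how imbalanced the label is: conversions are rare events, which will matter later when we interpret accuracy. A quick sanity check on the feature table:

SQL
-- Inspect the class balance of the label. Conversion is typically rare,
-- so accuracy alone will look deceptively high.
SELECT
  has_converted,
  COUNT(*) AS num_users,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 4) AS fraction_of_users
FROM
  `your-project.your-dataset.user_features`
GROUP BY
  has_converted;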

2. Building and Training Predictive Models

Let’s build three different types of models to predict user conversion: Logistic Regression, Random Forest, and XGBoost.

Logistic Regression Model

SQL
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_logistic`
OPTIONS(model_type='logistic_reg', input_label_cols=['has_converted']) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM
  `your-project.your-dataset.user_features`;
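A note on evaluation data: by default, BigQuery ML automatically holds out part of the input for evaluation when the dataset is large enough. If you prefer to make the split explicit, the data_split_method and data_split_eval_fraction options control it. Here is a sketch (the 20% holdout is an illustrative choice, not something the models in this post specify):

SQL
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_logistic`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['has_converted'],
  data_split_method='RANDOM',      -- hold out a random evaluation set
  data_split_eval_fraction=0.2     -- illustrative 20% holdout
) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM
  `your-project.your-dataset.user_features`;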

Random Forest Model

SQL
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_random_forest`
OPTIONS(model_type='random_forest_classifier', input_label_cols=['has_converted']) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM
  `your-project.your-dataset.user_features`;

XGBoost Model

SQL
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_xgboost`
OPTIONS(model_type='boosted_tree_classifier', input_label_cols=['has_converted']) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM
  `your-project.your-dataset.user_features`;
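All three models above rely entirely on BigQuery ML's default hyperparameters. Boosted trees in particular usually benefit from explicit tuning; the sketch below shows the kind of options available (the model name user_conversion_xgboost_tuned and the specific values are illustrative assumptions, not the settings behind the results that follow):

SQL
CREATE OR REPLACE MODEL `your-project.your-dataset.user_conversion_xgboost_tuned`
OPTIONS(
  model_type='boosted_tree_classifier',
  input_label_cols=['has_converted'],
  max_iterations=50,     -- number of boosting rounds (illustrative)
  learn_rate=0.1,        -- shrinkage per round (illustrative)
  max_tree_depth=6,      -- cap tree depth to limit overfitting
  early_stop=TRUE        -- stop when the holdout metric stops improving
) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM
  `your-project.your-dataset.user_features`;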

3. Confusion Matrix Results

Let’s evaluate the performance of all three models:

SQL
-- Evaluate all three models and stack the metrics into one result set.
-- (BigQuery ML model references cannot be parameterized with a string,
-- so each model is evaluated explicitly.)
SELECT
  'user_conversion_logistic' AS model,
  precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM
  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_logistic`,
    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))
UNION ALL
SELECT
  'user_conversion_random_forest',
  precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM
  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_random_forest`,
    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))
UNION ALL
SELECT
  'user_conversion_xgboost',
  precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM
  ML.EVALUATE(MODEL `your-project.your-dataset.user_conversion_xgboost`,
    (SELECT * EXCEPT(fullVisitorId) FROM `your-project.your-dataset.user_features`))
ORDER BY model;

Confusion matrices provide a detailed breakdown of the classification results by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Here’s an analysis of the confusion matrices for Logistic Regression, Random Forest, and XGBoost models.
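The counts below can be reproduced with ML.CONFUSION_MATRIX. A minimal sketch for the logistic regression model, using a 0.5 threshold (an assumption, since the post doesn't state which threshold produced these counts); the same pattern applies to the other two models:

SQL
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `your-project.your-dataset.user_conversion_logistic`,
    (
    SELECT
      * EXCEPT(fullVisitorId)
    FROM
      `your-project.your-dataset.user_features`
    ),
    STRUCT(0.5 AS threshold));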

Logistic Regression

Logistic Regression shows a high number of true negatives (59,339) and a low number of false positives (184), which indicates that it correctly identifies non-converters most of the time. However, it struggles with correctly identifying converters, as shown by the relatively high number of false negatives (791) and the low number of true positives (209).

                           Predicted: 0 (not converted)   Predicted: 1 (converted)
Actual: 0 (not converted)  (TN) 59,339                    (FP) 184
Actual: 1 (converted)      (FN) 791                       (TP) 209

Random Forest

The Random Forest model improves significantly over Logistic Regression in terms of detecting true positives (639 vs. 209), which indicates that it is better at identifying actual converters. The number of false negatives is also reduced to 361, meaning fewer actual converters are missed. The model also maintains a low false positive rate (126), which means it remains accurate in not falsely predicting conversions when there are none. Overall, Random Forest offers a better balance between detecting converters and avoiding false alarms.

                           Predicted: 0 (not converted)   Predicted: 1 (converted)
Actual: 0 (not converted)  (TN) 59,397                    (FP) 126
Actual: 1 (converted)      (FN) 361                       (TP) 639

XGBoost

XGBoost delivers mixed results compared to Random Forest. It produces nearly twice as many false positives (246 vs. 126) and a higher number of false negatives (512 vs. 361), meaning it misses more actual converters. Despite having a higher true positive count (488) than Logistic Regression, XGBoost underperforms Random Forest in detecting actual converters.

                           Predicted: 0 (not converted)   Predicted: 1 (converted)
Actual: 0 (not converted)  (TN) 59,277                    (FP) 246
Actual: 1 (converted)      (FN) 512                       (TP) 488

Based on the confusion matrix analysis, Random Forest emerges as the top-performing model for predicting user conversion due to its superior balance between identifying actual conversions and minimizing false predictions.

XGBoost is a viable alternative but may require adjustments, while Logistic Regression is the least effective in this context, particularly due to its high false negative rate.

4. Performance Comparison

            Logistic Regression   Random Forest   XGBoost
Precision   0.532                 0.835           0.665
Recall      0.209                 0.639           0.488
Accuracy    0.984                 0.992           0.987
F1 Score    0.300                 0.724           0.563
Log Loss    0.046                 0.144           0.028
AUC         0.986                 0.986           0.993

When evaluating predictive models, key metrics include precision, recall, accuracy, F1 score, Log Loss, and AUC.

Precision measures how often positive predictions are correct, while recall measures how many of the actual positives the model finds. Accuracy gives an overall correctness measure but can be misleading on imbalanced datasets like this one, where non-converters dominate. The F1 score balances precision and recall, log loss penalizes confident but wrong probability estimates, and AUC reflects the model's ability to rank converters above non-converters.
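To make these definitions concrete, plug the Random Forest confusion matrix from above into them: precision = TP / (TP + FP) = 639 / (639 + 126) ≈ 0.835, recall = TP / (TP + FN) = 639 / (639 + 361) = 0.639, and F1 = 2 × (0.835 × 0.639) / (0.835 + 0.639) ≈ 0.724, all of which match the table.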

1. Precision

  • Logistic Regression (0.532): This model shows moderate precision, meaning that 53.2% of the users it predicted as converters were indeed converters. However, this also indicates that nearly half of the positive predictions were false positives.
  • Random Forest (0.835): Random Forest excels in precision with 83.5%, making it very effective in minimizing false positives. This model is ideal if the goal is to avoid wasting resources on users who are unlikely to convert.
  • XGBoost (0.665): XGBoost offers a precision of 66.5%, which is higher than Logistic Regression but lower than Random Forest.

2. Recall

  • Logistic Regression (0.209): The recall of 20.9% suggests that Logistic Regression misses a significant portion of actual converters. While it is conservative in predicting conversions, it could lead to lost opportunities by not identifying potential converters.
  • Random Forest (0.639): With a recall of 63.9%, Random Forest is much better at capturing true positives. It balances between identifying converters and maintaining precision, making it a strong candidate for conversion prediction.
  • XGBoost (0.488): XGBoost’s recall of 48.8% falls between Logistic Regression and Random Forest. It is less aggressive than Random Forest but still better than Logistic Regression in identifying potential converters.

3. Accuracy

  • Logistic Regression (0.984): Despite its lower recall and precision, Logistic Regression shows high accuracy at 98.4%. However, accuracy can be misleading in imbalanced datasets, where the majority class dominates.
  • Random Forest (0.992): Random Forest’s accuracy of 99.2% indicates it performs well overall, reducing both false positives and false negatives.
  • XGBoost (0.987): XGBoost also shows high accuracy at 98.7%, slightly below Random Forest but still indicating strong overall performance.

4. F1 Score

  • Logistic Regression (0.300): The F1 score of 30.0% is low, reflecting the model’s struggle to balance precision and recall. This low score indicates that the model is not well-suited for conversion prediction, especially if both precision and recall are important.
  • Random Forest (0.724): Random Forest achieves a high F1 score of 72.4%, suggesting a good balance between precision and recall. This makes it a robust choice for conversion prediction where both metrics are crucial.
  • XGBoost (0.563): XGBoost’s F1 score of 56.3% suggests it provides a better balance than Logistic Regression but is slightly less effective than Random Forest.

5. Log Loss

  • Logistic Regression (0.046): Logistic Regression achieves a low log loss of 0.046, second only to XGBoost, indicating that its predicted probabilities are generally well calibrated, even though its overall effectiveness is limited by low recall.
  • Random Forest (0.144): Random Forest has a higher log loss of 0.144, indicating that while it is accurate, it is less confident in its predictions compared to Logistic Regression.
  • XGBoost (0.028): XGBoost outperforms the other models with the lowest log loss of 0.028, indicating high confidence in its predictions. This, combined with its balanced performance, makes it a strong contender.

6. AUC

  • Logistic Regression (0.986): The AUC of 98.6% indicates that Logistic Regression is very good at distinguishing between converters and non-converters, despite its lower recall.
  • Random Forest (0.986): Random Forest also achieves an AUC of 98.6%, matching Logistic Regression and showing strong overall performance in distinguishing between classes.
  • XGBoost (0.993): XGBoost tops the AUC at 99.3%, indicating that it is the most effective model at distinguishing between converters and non-converters, making it particularly powerful for this task.
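The AUC values above compress the entire ROC curve into a single number. If you want the underlying curve, for example to choose a classification threshold other than the default 0.5, ML.ROC_CURVE exposes it; a sketch for the XGBoost model:

SQL
-- Recall and false positive rate at a range of thresholds.
SELECT
  threshold,
  recall,
  false_positive_rate
FROM
  ML.ROC_CURVE(MODEL `your-project.your-dataset.user_conversion_xgboost`,
    (
    SELECT
      * EXCEPT(fullVisitorId)
    FROM
      `your-project.your-dataset.user_features`
    ))
ORDER BY
  threshold;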

Summary

  • Logistic Regression: Offers high accuracy and low log loss, but its low recall and F1 score suggest it is less effective for predicting user conversions when both precision and recall are essential.
  • Random Forest: Strikes a strong balance between precision, recall, and overall accuracy. Its high F1 score and solid AUC make it a very reliable model for conversion prediction.
  • XGBoost: Delivers the best AUC and the lowest log loss, indicating strong predictive power and confidence. It provides a good balance, though it may require fine-tuning to optimize precision and recall fully.

For predicting user conversion, XGBoost stands out as a robust model, particularly for scenarios where distinguishing between converters and non-converters is critical.

However, Random Forest remains a strong alternative, offering a slightly better balance between precision and recall.

5. Feature Importance

Given that both the Confusion Matrix results and the performance metrics comparison indicate the Random Forest model outperforms the other two models, I will focus on the Random Forest model to discuss feature importance.

Understanding the key features driving the Random Forest model’s success is crucial. The model highlights “avg_time_on_site” and “total_pageviews” as the top factors, indicating that user engagement and browsing behavior are pivotal in predicting conversion potential within a non-linear context.
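In BigQuery ML, these values can be pulled from the trained model with ML.FEATURE_IMPORTANCE, which is supported for tree-based models. It returns several measures per feature (importance_weight, importance_gain, and importance_cover); the table below doesn't state which one it reports, so the sketch orders by importance_weight as one reasonable choice:

SQL
SELECT
  feature,
  importance_weight
FROM
  ML.FEATURE_IMPORTANCE(MODEL `your-project.your-dataset.user_conversion_random_forest`)
ORDER BY
  importance_weight DESC;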

Feature                   Importance
avg_time_on_site          14436.0
total_pageviews           14121.0
avg_pageviews             7509.0
total_time_on_site        7108.0
days_since_first_visit    7085.0
num_visits                5667.0
traffic_medium            3484.0
device_category           1852.0
country                   825.0

While the Random Forest model stands out in this analysis, it’s also worth noting that Logistic Regression and XGBoost, while slightly less effective, provide valuable insights.

However, they may not fully capture the complexity of user behavior that the Random Forest model does, particularly in terms of how engagement metrics and traffic sources interact to influence conversion outcomes.

6. Conclusion

In the competitive landscape of digital marketing, accurately predicting user conversion is essential for optimizing strategies and maximizing ROI.

I built and compared three models: Logistic Regression, Random Forest, and XGBoost.

Each model provided valuable insights, with Random Forest emerging as the most balanced in terms of precision, recall, and overall accuracy.

XGBoost, while showing exceptional AUC and low log loss, also proved to be a strong contender, especially in distinguishing between converters and non-converters.

The analysis of feature importance within the Random Forest model underscored the significance of user engagement metrics like average time on site and total page views in predicting conversions.

These findings highlight the importance of understanding user behavior holistically to develop more effective marketing strategies.
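Finally, to put the chosen model to work, ML.PREDICT scores each user with a predicted label and class probabilities. A sketch using the Random Forest model to surface the users most likely to convert:

SQL
SELECT
  fullVisitorId,
  predicted_has_converted,
  -- Probability assigned to the positive (converted) class.
  (SELECT prob FROM UNNEST(predicted_has_converted_probs) WHERE label = 1)
    AS conversion_probability
FROM
  ML.PREDICT(MODEL `your-project.your-dataset.user_conversion_random_forest`,
    (
    SELECT
      *
    FROM
      `your-project.your-dataset.user_features`
    ))
ORDER BY
  conversion_probability DESC
LIMIT 100;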

You can find the complete code in my GitHub repository.
