In the highly competitive world of e-commerce, understanding who your high-value customers are can make a significant difference in optimizing your marketing strategies, improving customer retention, and maximizing revenue.
High-value customers are those who contribute the most to your revenue, not just in terms of transaction size but also in terms of frequency and loyalty.
By identifying and profiling these customers, businesses can tailor their marketing efforts to enhance customer satisfaction and loyalty, ultimately driving sustained growth.
In this post, we’ll explore how to use BigQuery ML and the Google Analytics Sample Dataset to identify and profile high-value customers based on their transaction history.
We’ll also analyze patterns that distinguish high-value customers from others, providing insights into how to better engage with this critical segment of your customer base.
Contents
- Data Preparation
- Defining High-Value Customers
- Model Building and Training
  - Logistic Regression
  - K-Means Clustering
  - Random Forest Classifier
- Model Evaluation
- Conclusion
1. Data Preparation
Before we can accurately identify and profile high-value customers, it’s crucial to prepare our dataset in a way that maximizes the effectiveness of our predictive models.
The Google Analytics Sample Dataset offers a wealth of data, including transaction details, customer behavior, and demographic information. However, to truly harness this data, we must first consolidate it into a structured format that allows for meaningful analysis of customer transactions.
A key step in this process is feature normalization. Normalization adjusts the scale of our features, ensuring that all variables contribute equally to the models.
Without normalization, distance- and gradient-based models like Logistic Regression and K-Means Clustering can be skewed by features with larger numerical ranges, leading to biased predictions. (Tree-based models such as Random Forest are largely insensitive to feature scale, but training all three models on the same normalized feature set keeps the comparison consistent.)
For instance, in K-Means Clustering, features with higher magnitudes could disproportionately influence the clustering process, making it harder to detect meaningful patterns in the data.
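As an illustration, min-max scaling can be done directly in BigQuery with analytic (window) functions. This is only a sketch: the column names below follow the tables defined later in this post, and the full feature table in the repository also includes encoded device and country columns.

```sql
-- Min-max normalization sketch using analytic (window) functions.
-- Column names are assumed to match the tables built later in this post.
SELECT
  customer_id,
  SAFE_DIVIDE(
    transaction_count - MIN(transaction_count) OVER (),
    MAX(transaction_count) OVER () - MIN(transaction_count) OVER ()
  ) AS normalized_transaction_count,
  SAFE_DIVIDE(
    avg_revenue_per_transaction - MIN(avg_revenue_per_transaction) OVER (),
    MAX(avg_revenue_per_transaction) OVER ()
      - MIN(avg_revenue_per_transaction) OVER ()
  ) AS normalized_avg_revenue
FROM
  `predictive-behavior-analytics.Section5.high_value_customers`;
```

`SAFE_DIVIDE` avoids a division-by-zero error in the degenerate case where a feature is constant across all customers.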
You can find the complete code in my GitHub repository.
2. Defining High-Value Customers
High-value customers can be defined in various ways, depending on the business model and objectives.
For this analysis, I’ll define high-value customers as those who are in the top 20% of revenue contributors. This definition will help us focus on customers who have a significant impact on the business’s bottom line.
```sql
-- Define High-Value Customers
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section5.high_value_customers` AS
WITH customer_revenue AS (
  SELECT
    customer_id,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS transaction_count,
    AVG(revenue) AS avg_revenue_per_transaction
  FROM
    `predictive-behavior-analytics.Section5.customer_transaction_data`
  GROUP BY
    customer_id
),
percentile_80th AS (
  SELECT
    APPROX_QUANTILES(total_revenue, 100)[OFFSET(80)] AS p80_revenue
  FROM
    customer_revenue
)
SELECT
  cr.customer_id,
  cr.total_revenue,
  cr.transaction_count,
  cr.avg_revenue_per_transaction,
  IF(cr.total_revenue > p.p80_revenue, 1, 0) AS high_value_status
FROM
  customer_revenue cr
CROSS JOIN
  percentile_80th p;
```
3. Model Building and Training
In this section, we will walk through the process of building and training three different machine learning models: Logistic Regression, K-Means Clustering, and Random Forest Classifier.
Logistic Regression
The Logistic Regression model is used to predict whether a customer is a high-value customer or not. This model is particularly useful for binary classification tasks and helps in understanding which features are most influential in determining high-value customers.
```sql
-- Logistic Regression
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section5.logistic_regression_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['high_value_status']) AS
SELECT
  normalized_transaction_count,
  normalized_avg_revenue,
  device_type_cat,
  country_cat,
  high_value_status  -- Include the target column
FROM
  `predictive-behavior-analytics.Section5.customer_features`;
```
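Once trained, the model can score customers with `ML.PREDICT`. A minimal sketch; the output column names follow BigQuery ML's `predicted_<label>` convention:

```sql
-- Sketch: score each customer with the trained logistic regression model.
SELECT
  predicted_high_value_status,
  predicted_high_value_status_probs
FROM
  ML.PREDICT(
    MODEL `predictive-behavior-analytics.Section5.logistic_regression_model`,
    (SELECT
       normalized_transaction_count,
       normalized_avg_revenue,
       device_type_cat,
       country_cat
     FROM
       `predictive-behavior-analytics.Section5.customer_features`));
```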
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm used to segment customers into distinct groups based on their purchasing behavior and demographics. This model can help identify natural groupings within the customer base, some of which may correspond to high-value customers.
```sql
-- K-Means Clustering
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section5.kmeans_model`
OPTIONS(model_type='kmeans', num_clusters=5) AS
SELECT
  normalized_transaction_count,
  normalized_avg_revenue,
  device_type_cat,
  country_cat
FROM
  `predictive-behavior-analytics.Section5.customer_features`;
```
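To inspect the resulting segments, `ML.PREDICT` assigns each customer to its nearest centroid; for k-means models the output includes a `CENTROID_ID` column. A sketch that counts customers per cluster:

```sql
-- Sketch: assign each customer to a cluster and count cluster sizes.
SELECT
  CENTROID_ID AS cluster,
  COUNT(*) AS customers
FROM
  ML.PREDICT(
    MODEL `predictive-behavior-analytics.Section5.kmeans_model`,
    (SELECT
       normalized_transaction_count,
       normalized_avg_revenue,
       device_type_cat,
       country_cat
     FROM
       `predictive-behavior-analytics.Section5.customer_features`))
GROUP BY
  cluster
ORDER BY
  customers DESC;
```

Joining cluster assignments back to `high_value_status` shows which clusters, if any, correspond to the high-value segment.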
Random Forest Classifier
The Random Forest Classifier is an ensemble learning method used to predict high-value customer status. This model builds multiple decision trees and merges them to get a more accurate and stable prediction. Additionally, it provides insights into feature importance, highlighting which factors most influence the prediction.
```sql
-- Random Forest Classifier
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section5.random_forest_model`
OPTIONS(model_type='random_forest_classifier', input_label_cols=['high_value_status']) AS
SELECT
  normalized_transaction_count,
  normalized_avg_revenue,
  device_type_cat,
  country_cat,
  high_value_status  -- Include the target column
FROM
  `predictive-behavior-analytics.Section5.customer_features`;
```
4. Model Evaluation
The evaluation of the different machine learning models—Logistic Regression, K-Means Clustering, and Random Forest Classifier—yields important insights into their effectiveness in identifying high-value customers.
Logistic Regression vs. Random Forest Classifier
The two supervised models can be compared directly on key evaluation metrics.
Results
| Model | Precision | Recall | Accuracy | F1 Score | Log Loss | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 1.000 | 0.649 | 0.996 | 0.788 | 0.014 | 0.998 |
| Random Forest Classifier | 1.000 | 0.985 | 1.000 | 0.992 | 0.1273 | 0.988 |
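Metrics like these come from `ML.EVALUATE`; for a classification model it returns precision, recall, accuracy, f1_score, log_loss, and roc_auc. A sketch for the logistic regression model (the same call works for the random forest):

```sql
-- Sketch: evaluation metrics for a classification model.
SELECT
  precision, recall, accuracy, f1_score, log_loss, roc_auc
FROM
  ML.EVALUATE(
    MODEL `predictive-behavior-analytics.Section5.logistic_regression_model`);
```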
The evaluation of the Logistic Regression and Random Forest Classifier models reveals distinct strengths and weaknesses in identifying high-value customers. Both models achieve perfect precision, meaning every customer they flag as high-value actually is one. However, their performance diverges significantly on recall, accuracy, F1 score, log loss, and AUC.
Precision
Both the Logistic Regression and Random Forest Classifier models achieve a precision of 1.000, indicating that every customer predicted as high-value by these models is indeed high-value. This high precision is crucial for scenarios where false positives are costly, as it ensures that resources are not wasted on incorrectly identified customers.
Recall
The Random Forest Classifier (0.985) far outperforms Logistic Regression (0.649) on recall, meaning it identifies a much higher proportion of the actual high-value customers. A higher recall is critical when the goal is to capture as many high-value customers as possible, making the Random Forest model more reliable in this regard.
Accuracy
In terms of accuracy, the Random Forest Classifier (1.000) also slightly edges out Logistic Regression (0.996). While both models perform well, Random Forest's near-perfect accuracy underscores its robustness in classifying customers correctly across the entire dataset.
F1 Score
The F1 score, which balances precision and recall, is notably higher for the Random Forest model (0.992) than for Logistic Regression (0.788). This indicates that Random Forest not only predicts high-value customers with precision but also captures far more of the actual high-value customers, making it the more balanced model overall.
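As a quick sanity check, the F1 values in the table follow from F1 = 2·P·R / (P + R). Recomputing from the rounded precision and recall values gives approximately 0.787 and 0.992; the tiny gap versus the reported 0.788 for Logistic Regression comes from rounding of the inputs.

```sql
-- F1 = 2 * precision * recall / (precision + recall),
-- recomputed from the (rounded) table values.
SELECT
  ROUND(2 * 1.000 * 0.649 / (1.000 + 0.649), 3) AS f1_logistic_regression,  -- ~0.787
  ROUND(2 * 1.000 * 0.985 / (1.000 + 0.985), 3) AS f1_random_forest;        -- ~0.992
```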
Log Loss
Log loss, which penalizes confident but incorrect probability estimates, is much lower for Logistic Regression (0.014) than for Random Forest (0.1273). This suggests that Logistic Regression's predicted probabilities are better calibrated: when it does classify a customer as high-value, it does so with well-placed confidence. However, this comes at the cost of missing a significant portion of actual high-value customers, as indicated by its lower recall.
AUC (Area Under the Curve)
The AUC, which measures the model's ability to rank high-value customers above non-high-value ones, is slightly higher for Logistic Regression (0.998) than for Random Forest (0.988). Although both models perform exceptionally well, the slightly higher AUC of Logistic Regression indicates a marginally better overall capacity to differentiate between customer classes. However, this advantage is offset by its lower recall.
Feature Importance
The feature importance analysis indicates that average revenue is by far the most critical feature, while others, such as device type and transaction count, appear to have almost no impact. This may suggest that these features are redundant in this context, or that the model is heavily dependent on a single feature.
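The importance values themselves can be pulled with `ML.FEATURE_IMPORTANCE`, which BigQuery ML supports for tree-based models; a sketch:

```sql
-- Sketch: per-feature importance for the random forest model.
SELECT
  feature, importance_weight, importance_gain
FROM
  ML.FEATURE_IMPORTANCE(
    MODEL `predictive-behavior-analytics.Section5.random_forest_model`)
ORDER BY
  importance_gain DESC;
```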
K-Means Clustering
Summary of Results:
- Davies-Bouldin Index: 1.473
- Mean Squared Distance: 2.413
The Davies-Bouldin Index is a metric used to evaluate the quality of clustering. It measures the average similarity ratio of each cluster with its most similar cluster, where a lower DBI indicates better clustering quality.
A DBI of 1.473 suggests that the clusters formed by the K-Means model are moderately well-separated but could potentially be improved. While this score does not indicate poor clustering, it suggests that there is room for further optimization to enhance the distinctness between clusters.
The Mean Squared Distance metric provides an average measure of the distance between data points and their respective cluster centroids.
A lower value indicates that data points are closely clustered around their centroids, implying tight and well-defined clusters.
In this case, a Mean Squared Distance of 2.413 indicates a reasonable clustering performance, with data points being moderately close to their respective centroids.
However, there is an indication that some data points might be spread out within their clusters, suggesting that the model may benefit from further refinement to achieve tighter clusters.
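For reference, both clustering metrics come from the same `ML.EVALUATE` call; for k-means models it returns `davies_bouldin_index` and `mean_squared_distance`:

```sql
-- Sketch: clustering quality metrics for the k-means model.
SELECT
  davies_bouldin_index, mean_squared_distance
FROM
  ML.EVALUATE(MODEL `predictive-behavior-analytics.Section5.kmeans_model`);
```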
5. Conclusion
The evaluation of Logistic Regression, Random Forest, and K-Means Clustering models reveals key insights.
While Logistic Regression excels in precision, it struggles with recall, missing many high-value customers.
Random Forest outperforms in recall, accuracy, and F1 score, making it a more balanced model for identifying high-value customers.
K-Means Clustering shows moderate clustering quality with potential for improvement. Overall, Random Forest is the most effective, but further optimization could enhance results.