Revenue Prediction

In today’s fast-paced e-commerce landscape, accurate revenue prediction is a game-changer for businesses. It enables companies to optimize marketing strategies, manage inventory efficiently, and maximize profits. By leveraging data-driven insights, we can develop predictive models that forecast revenue based on various factors such as traffic sources, user demographics, and behavioral patterns.

This post will guide you through the process of building a revenue prediction model using BigQuery ML and the Google Analytics Sample Dataset. We’ll cover data preparation, feature engineering, model building and training, and performance evaluation. By the end, you’ll have a clear understanding of how to implement and assess different regression models for revenue prediction.

Contents

  1. Data Preparation
  2. Feature Engineering
  3. Model Building and Training
    • Linear Regression
    • Ridge Regression
    • Lasso Regression
    • Random Forest
  4. Model Evaluation
  5. Feature Importance
  6. Conclusion

1. Data Preparation

Before diving into model building, we need to prepare our dataset. The Google Analytics Sample Dataset provides a rich source of data, including traffic sources, user demographics, and transaction information. Our goal is to consolidate this data into a structured format suitable for training a regression model.

SQL
-- Data Preparation
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section4.revenue_prediction_data` AS
SELECT
  CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS session_id,
  trafficSource.source AS traffic_source,
  trafficSource.medium AS traffic_medium,
  device.deviceCategory AS device_type,
  geoNetwork.country AS country,
  totals.pageviews AS pageviews,
  totals.timeOnSite AS time_on_site,
  IFNULL(totals.transactionRevenue, 0) / 1000000 AS revenue  -- transactionRevenue is stored in micros
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170731';
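
Before engineering features, it is worth sanity-checking the consolidated table. The query below is a quick sketch against the table created above; it counts sessions and shows how many of them actually generated revenue:

SQL
-- Sanity check: row count and share of revenue-generating sessions
SELECT
  COUNT(*) AS total_sessions,
  COUNTIF(revenue > 0) AS sessions_with_revenue,
  ROUND(COUNTIF(revenue > 0) / COUNT(*) * 100, 2) AS pct_with_revenue,
  ROUND(AVG(revenue), 2) AS avg_revenue_per_session
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data`;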

2. Feature Engineering

Exploratory Data Analysis (EDA) Insights

In our previous analysis, we explored several key aspects of the data:

  • Traffic Source Analysis: We found that organic search and referral traffic were the top revenue drivers.
  • User Demographics: Desktop users and visitors from the United States contributed significantly to overall revenue.
  • Behavioral Patterns: Higher pageviews and longer time on site generally correlated with increased revenue.

These insights inform our feature selection and engineering process.
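
As a rough illustration of how the traffic-source insight can be reproduced from the prepared table, a query along the following lines aggregates revenue by source (a sketch, not the exact query from the earlier analysis):

SQL
-- Sketch of the traffic-source EDA: sessions and revenue by source
SELECT
  traffic_source,
  COUNT(*) AS sessions,
  ROUND(SUM(revenue), 2) AS total_revenue,
  ROUND(AVG(revenue), 2) AS avg_revenue_per_session
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data`
GROUP BY
  traffic_source
ORDER BY
  total_revenue DESC
LIMIT 10;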

Feature Engineering

Feature engineering is crucial for improving model performance. We’ll transform raw data into meaningful features that can help our model make accurate predictions. Our focus will be on:

  • Traffic Source: This can indicate user intent and likelihood of purchase.
  • User Demographics: Device type and country provide insights into user behavior and purchasing power.
  • User Behavior: Metrics like pageviews and time on site indicate engagement levels, which may correlate with revenue.

We also create categorical features and normalize the numerical features to improve model performance.

You can find the complete code in my GitHub repository.

SQL
-- Feature Engineering: Categorical Variables
CREATE OR REPLACE TABLE `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features` AS
SELECT
  *,
  CAST(traffic_source AS STRING) AS traffic_source_cat,
  CAST(device_type AS STRING) AS device_type_cat,
  CAST(country AS STRING) AS country_cat,
  (pageviews - (SELECT AVG(pageviews) FROM `predictive-behavior-analytics.Section4.revenue_prediction_data`)) / (SELECT STDDEV(pageviews) FROM `predictive-behavior-analytics.Section4.revenue_prediction_data`) AS normalized_pageviews,
  (time_on_site - (SELECT AVG(time_on_site) FROM `predictive-behavior-analytics.Section4.revenue_prediction_data`)) / (SELECT STDDEV(time_on_site) FROM `predictive-behavior-analytics.Section4.revenue_prediction_data`) AS normalized_time_on_site
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data`;
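
As a quick check that the scaling worked, the normalized columns should have a mean close to 0 and a standard deviation close to 1 (NULL values are ignored by the aggregates). A minimal verification query:

SQL
-- Verify that normalization produced roughly zero mean and unit variance
SELECT
  ROUND(AVG(normalized_pageviews), 4) AS mean_pageviews,
  ROUND(STDDEV(normalized_pageviews), 4) AS stddev_pageviews,
  ROUND(AVG(normalized_time_on_site), 4) AS mean_time_on_site,
  ROUND(STDDEV(normalized_time_on_site), 4) AS stddev_time_on_site
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;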

3. Model Building and Training

With our features ready, we can now build and train regression models to predict revenue.

Linear Regression

SQL
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section4.revenue_prediction_model`
OPTIONS(model_type='linear_reg', input_label_cols=['revenue']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  revenue
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;
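
Once a model is trained, predictions can be generated with ML.PREDICT. As a sketch, the query below scores the feature table with the linear model and compares predicted and actual revenue for a few sessions (in practice you would predict on new, unseen sessions):

SQL
-- Generate revenue predictions with the trained linear regression model
SELECT
  session_id,
  predicted_revenue,
  revenue AS actual_revenue
FROM
  ML.PREDICT(
    MODEL `predictive-behavior-analytics.Section4.revenue_prediction_model`,
    (SELECT * FROM `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`))
LIMIT 10;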

Ridge Regression

SQL
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section4.ridge_reg_sales_model`
OPTIONS(model_type='linear_reg', l2_reg=0.1, input_label_cols=['revenue']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  revenue
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;

Lasso Regression

SQL
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section4.lasso_reg_sales_model`
OPTIONS(model_type='linear_reg', l1_reg=0.1, input_label_cols=['revenue']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  revenue
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;

Random Forest Model

SQL
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section4.random_forest_sales_model`
OPTIONS(model_type='random_forest_regressor', input_label_cols=['revenue']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  revenue
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;
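
The random forest above relies on BigQuery ML's default settings. If you want more control, options such as num_parallel_tree, max_tree_depth, and subsample can be set explicitly. The variant below is only a sketch with illustrative values (and a hypothetical model name), not the configuration behind the results reported in the next section:

SQL
-- Illustrative only: a random forest with explicit hyperparameters
CREATE OR REPLACE MODEL `predictive-behavior-analytics.Section4.random_forest_sales_model_tuned`
OPTIONS(
  model_type='random_forest_regressor',
  num_parallel_tree=100,  -- number of trees in the forest
  max_tree_depth=10,      -- limit tree depth to curb overfitting
  subsample=0.8,          -- fraction of rows sampled per tree
  input_label_cols=['revenue']) AS
SELECT
  traffic_source_cat,
  device_type_cat,
  country_cat,
  normalized_pageviews,
  normalized_time_on_site,
  revenue
FROM
  `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`;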

4. Model Evaluation

After training the models, it’s essential to evaluate their performance using the following metrics:

Mean Absolute Error (MAE): Measures the average magnitude of errors in a set of predictions, without considering their direction. It provides a straightforward interpretation of how far off predictions are on average.

Mean Squared Error (MSE): Like MAE, but squares the errors before averaging them. This metric penalizes larger errors more heavily, which can be useful if large errors are particularly undesirable.

Mean Squared Log Error (MSLE): Similar to MSE, but applies a logarithm transformation to the predicted and actual values before calculating the error. This is useful when the target variable has a wide range of values and you want to penalize underestimations more than overestimations.

Median Absolute Error (MedAE): Provides the median of the absolute errors, which can be more robust to outliers compared to MAE. This is useful if your dataset contains significant outliers.

R-Squared (R²): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It gives a general sense of the model’s explanatory power.

Root Mean Squared Error (RMSE): RMSE is a critical metric in regression analysis because it gives a sense of how large the prediction errors are on average, taking into account their magnitude. Lower RMSE values indicate better model performance, as they suggest that the model is making predictions that are closer to the actual observed values, with fewer large errors.
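
In BigQuery ML, most of these metrics are returned directly by ML.EVALUATE; RMSE can be derived as the square root of the MSE. A minimal sketch, shown here for the random forest model (the same pattern applies to each of the models above):

SQL
-- Evaluate a trained model; RMSE is derived from the MSE
SELECT
  mean_absolute_error,
  mean_squared_error,
  mean_squared_log_error,
  median_absolute_error,
  r2_score,
  SQRT(mean_squared_error) AS root_mean_squared_error
FROM
  ML.EVALUATE(
    MODEL `predictive-behavior-analytics.Section4.random_forest_sales_model`,
    (SELECT * FROM `predictive-behavior-analytics.Section4.revenue_prediction_data_with_features`));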

Results

| Metric                         | Linear Regression | Lasso Regression | Ridge Regression | Random Forest |
|--------------------------------|-------------------|------------------|------------------|---------------|
| Mean Absolute Error (MAE)      | 4.532             | 3.927            | 4.532            | 2.420         |
| Mean Squared Error (MSE)       | 2711.89           | 2719.70          | 2710.97          | 2503.89       |
| Mean Squared Log Error (MSLE)  | 2.573             | 1.768            | 2.568            | 0.373         |
| Median Absolute Error (MedAE)  | 1.554             | 1.083            | 1.554            | 4.571         |
| R-Squared (R²)                 | 0.025             | 0.022            | 0.025            | 0.100         |
| Root Mean Squared Error (RMSE) | 52.075            | 52.150           | 52.067           | 50.039        |

Ridge Regression

Ridge Regression shows a relatively high MAE and MSE, with a moderate RMSE. The R² score is quite low, indicating that the model explains very little of the variance in the data. The high MSLE suggests significant differences between predicted and actual values, especially for smaller revenue predictions.

Linear Regression

Linear Regression performs almost identically to Ridge Regression, with very similar MAE, MSE, RMSE, and R² scores. This similarity suggests that Ridge Regression’s regularization did not significantly alter the model’s predictions, indicating that the features might not require strong regularization.

Lasso Regression

Lasso Regression achieves a lower MAE than Ridge and Linear Regression, indicating better overall predictive accuracy. The MedAE is also lower, suggesting that Lasso Regression performs better on the majority of the dataset. However, its R² score is slightly lower, and the RMSE remains high, indicating that while it may perform better on average, it struggles with larger errors.

Random Forest

Random Forest outperforms the other models across almost all metrics. It has the lowest MAE and RMSE, indicating that it makes fewer and smaller errors overall, and its much lower MSLE shows that it handles predictions for lower revenue amounts more effectively. Its R² of 0.100 is the highest of the four models, meaning it explains more of the variance in the data than the others. Its MedAE, however, is the highest of the group, and the still-modest R² indicates there is room for improvement in capturing the full complexity of the data.

Conclusion

Overall, Random Forest is the best-performing model in this evaluation, particularly in terms of minimizing errors and explaining the variance in the data.

The regularized models (Ridge and Lasso Regression) offer marginal benefits over simple Linear Regression, but none match the performance of Random Forest. This suggests that a more complex, non-linear approach like Random Forest is better suited for this dataset, likely due to its ability to capture more intricate patterns in the data.

5. Feature Importance

The results from the Random Forest model provide valuable insights into the predictive power of the features used in the Revenue Prediction analysis. By examining the feature importance metrics—particularly Importance Gain and Importance Weight—we can identify which factors most significantly influence the model’s performance.
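
These scores can be retrieved from the trained tree-based model with ML.FEATURE_IMPORTANCE; a minimal query, assuming the random forest model created in Section 3:

SQL
-- Retrieve feature importance scores from the random forest model
SELECT
  feature,
  importance_gain,
  importance_weight
FROM
  ML.FEATURE_IMPORTANCE(MODEL `predictive-behavior-analytics.Section4.random_forest_sales_model`)
ORDER BY
  importance_gain DESC;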

| Feature        | Importance Gain | Importance Weight |
|----------------|-----------------|-------------------|
| Pageviews      | 143,755         | 65,971            |
| Traffic Source | 66,043          | 54,100            |
| Time on Site   | 58,552          | 77,223            |
| Country        | 41,105          | 41,129            |
| Device Type    | 35,015          | 21,065            |

Below is a discussion of these key features based on the provided metrics.

Pageviews

  • Pageviews stand out as the most influential feature in the model. This indicates that the number of pageviews has a substantial impact on improving the accuracy of the revenue predictions. The high importance weight also suggests that this feature is frequently used in decision splits, reaffirming its critical role in the model.

Traffic Source

  • Traffic Source is another significant factor. This shows that where the traffic originates (e.g., direct, referral, search engines) plays a crucial role in predicting revenue. The high importance weight indicates that this feature is consistently used across the trees in the Random Forest, further emphasizing its relevance.

Time on Site

  • Time on Site has a high importance weight, making it the most frequently utilized feature in the model. Although its importance gain is lower than that of Pageviews and Traffic Source, the frequent use of this feature suggests that the duration a user spends on the site is a key determinant in revenue prediction. It may have a broad but moderate impact on the model’s accuracy.

Country

  • The country from which a user accesses the site is also an important factor. This suggests that geographic location significantly contributes to revenue prediction. The Importance Weight of 41,129 indicates that this feature is moderately used in the model, underscoring its role in influencing revenue outcomes.

Device Type

  • Device Type highlights the relevance of the device category (e.g., desktop, mobile, tablet) in predicting revenue. Although it has the lowest Importance Weight among the features, its contribution to the model is still noteworthy, indicating that the type of device used by a customer can influence revenue predictions, albeit to a lesser extent than other factors.

Conclusion

The analysis highlights that Pageviews and Traffic Source are the most critical factors in predicting revenue, both in terms of their ability to improve model accuracy (Importance Gain) and their frequent usage in decision splits (Importance Weight).

Time on Site also plays a significant role, particularly due to its broad application across the model. Country and Device Type, while slightly less impactful, still contribute valuable information to the revenue prediction process. Understanding these key factors allows for better targeting and optimization strategies in e-commerce, ultimately driving more accurate revenue forecasts.

6. Conclusion

This analysis demonstrates that Random Forest is the most effective model for predicting e-commerce revenue, outperforming linear and regularized regression models across nearly all key metrics. The model’s ability to capture complex, non-linear relationships in the data made it particularly suited for this task.

Key features like Pageviews, Traffic Source, and Time on Site emerged as the most critical predictors of revenue. Pageviews had the highest impact, highlighting its importance in driving accurate revenue predictions. Traffic Source and Time on Site also played significant roles, indicating that both user engagement and the origin of traffic are crucial factors in forecasting revenue.

Overall, the insights gained from this analysis can help businesses better understand the drivers of revenue and optimize their strategies accordingly, leading to more accurate and actionable revenue forecasts.
