Revenue Prediction

Accurate revenue prediction is crucial for businesses to optimize their strategies and maximize profits.

Previously, we explored revenue prediction using BigQuery ML, which provided a solid foundation but left room for improvement.

In this post, we’ll dive into how we can enhance our revenue prediction model using Python, leveraging its flexibility and powerful libraries to achieve better results.

Table of Contents

  1. Motivation: Limitations of BigQuery ML
  2. Data Preparation
  3. Feature Engineering
  4. Model Selection
  5. Results and Comparison
  6. Conclusion

Motivation: Limitations of BigQuery ML

Revenue prediction is a crucial aspect of understanding customer behavior and optimizing business strategies. While BigQuery ML offers an accessible platform for building models directly within your data warehouse, it can have limitations, particularly in handling complex interactions and feature engineering tasks.

In my analysis using BigQuery ML, I implemented several models: Linear Regression, Lasso Regression, Ridge Regression, and Random Forest.

The results revealed some critical insights:

Results of the BigQuery ML analysis:

Metric                            Linear Regression   Lasso Regression   Ridge Regression   Random Forest
Mean Absolute Error (MAE)         4.532               3.927              4.532              2.420
Mean Squared Error (MSE)          2711.89             2719.70            2710.97            2503.89
Mean Squared Log Error (MSLE)     2.573               1.768              2.568              0.373
Median Absolute Error (MedAE)     1.554               1.083              1.554              4.571
R-Squared (R²)                    0.025               0.022              0.025              0.100
Root Mean Squared Error (RMSE)    52.075              52.150             52.067             50.039

Mean Absolute Error (MAE): The Random Forest model in BigQuery ML significantly outperformed the linear models, with an MAE of 2.420 compared to 3.9–4.5 for the others.
Mean Squared Error (MSE): Again, Random Forest had the lowest MSE, indicating better overall predictive accuracy.
Mean Squared Log Error (MSLE): This metric showed a stark difference, with Random Forest achieving an MSLE of 0.373, while the linear models were significantly higher.
R-Squared (R²): Despite the improvements, the R² values were relatively low across the board, with the highest being 0.100 for Random Forest, indicating that the models were not capturing much of the variance in the data.
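
For reference, here is a minimal sketch of how these same six metrics can be computed in Python with scikit-learn; y_true and y_pred stand for the actual and predicted revenue arrays:

Python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    mean_squared_log_error, median_absolute_error, r2_score,
)

def evaluate(y_true, y_pred):
    """Compute the metrics reported in the BigQuery ML comparison."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        # MSLE requires non-negative values, so clip predictions at zero
        "MSLE": mean_squared_log_error(y_true, np.clip(y_pred, 0, None)),
        "MedAE": median_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
    }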

Data Preparation

Working with the Google Analytics Sample Dataset presents challenges due to its size and complexity. To overcome these, we implemented the following strategies:

Chunked Data Processing: Instead of loading the entire dataset at once, we processed it in smaller, manageable chunks. This approach significantly reduced memory usage and allowed for more efficient processing.

Data Cleaning and Flattening: The raw data contains nested structures that needed to be flattened for analysis. We performed this process in chunks, saving intermediate results to disk to manage memory effectively.

Imputation and Scaling: We handled missing values through imputation and normalized numeric features through scaling, again processing in chunks to keep memory usage manageable (see the sketch after this list).

Leveraging Colab and an A100: To handle the large dataset and intensive computations, I used Google Colab paired with an A100 GPU. This setup provided the computational power needed to process and analyze the data efficiently.
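
To make the chunk-wise idea concrete, here is a minimal sketch of incremental scaling using scikit-learn's StandardScaler.partial_fit; the file name, chunk size, and column list are hypothetical, and missing values are assumed to be imputed already:

Python
import pandas as pd
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["visits", "pageviews", "timeOnSite"]  # hypothetical columns
scaler = StandardScaler()

# Pass 1: fit the scaler incrementally, one chunk at a time
for chunk in pd.read_csv("ga_sessions_flat.csv", chunksize=100_000):
    scaler.partial_fit(chunk[NUMERIC_COLS])

# Pass 2: transform each chunk and persist intermediate results to disk
for i, chunk in enumerate(pd.read_csv("ga_sessions_flat.csv", chunksize=100_000)):
    chunk[NUMERIC_COLS] = scaler.transform(chunk[NUMERIC_COLS])
    chunk.to_parquet(f"scaled_chunk_{i}.parquet")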

Here’s a snippet of our data preparation process:

Python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)

def clean_and_engineer_data(df):
    logger.info("Starting data cleaning and feature engineering...")

    # Flatten any nested columns (helper defined elsewhere in the notebook)
    df_cleaned = flatten_nested_columns(df)
    logger.info(f"Flattened DataFrame shape: {df_cleaned.shape}")

    # Identify columns with complex data types (arrays, lists, etc.)
    complex_columns = []
    for col in df_cleaned.columns:
        sample = df_cleaned[col].head(100)
        if sample.apply(lambda x: isinstance(x, (list, np.ndarray))).any():
            logger.warning(f"Column '{col}' contains complex data types. Dropping this column.")
            complex_columns.append(col)

    # Drop complex columns
    df_cleaned = df_cleaned.drop(columns=complex_columns)

    # Optimize data types to reduce memory usage (helper defined elsewhere)
    df_cleaned = optimize_dtypes(df_cleaned)

    # Parse YYYYMMDD date strings and split remaining columns by type
    df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%d')
    numeric_columns = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
    categorical_columns = df_cleaned.select_dtypes(exclude=[np.number, 'datetime64']).columns.tolist()

    # Drop columns that are entirely NaN
    all_nan_columns = df_cleaned.columns[df_cleaned.isna().all()].tolist()
    if all_nan_columns:
        df_cleaned = df_cleaned.drop(columns=all_nan_columns)
        numeric_columns = [col for col in numeric_columns if col not in all_nan_columns]
        categorical_columns = [col for col in categorical_columns if col not in all_nan_columns]

    # Imputation: mean for numeric columns, mode for categorical columns
    for col in numeric_columns:
        df_cleaned[col] = df_cleaned[col].astype('float32')  # Ensure the column is float
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mean())

    for col in categorical_columns:
        df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode().iloc[0])

    # Feature engineering: calendar features derived from the session date
    df_cleaned['day_of_week'] = df_cleaned['date'].dt.dayofweek
    df_cleaned['is_weekend'] = df_cleaned['day_of_week'].isin([5, 6]).astype(int)
    df_cleaned['month'] = df_cleaned['date'].dt.month
    df_cleaned['quarter'] = df_cleaned['date'].dt.quarter

    return df_cleaned
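
Tying this back to the chunked workflow described above, the cleaning function can be applied one chunk at a time, with intermediate results persisted to disk. A sketch with hypothetical file names (note that mean/mode imputation then uses per-chunk statistics, which is acceptable for illustration):

Python
cleaned_paths = []
for i, chunk in enumerate(pd.read_csv("ga_sessions_raw.csv", chunksize=100_000)):
    cleaned = clean_and_engineer_data(chunk)
    path = f"cleaned_chunk_{i}.parquet"
    cleaned.to_parquet(path)  # persist to disk to keep memory usage low
    cleaned_paths.append(path)

# Recombine the cleaned chunks for modeling
df_model = pd.concat(pd.read_parquet(p) for p in cleaned_paths)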

Model Selection: Random Forest

For our revenue prediction task, we chose the Random Forest Regressor. This decision was based on several factors:

  1. Ability to handle high-dimensional data: Random Forest can effectively manage the large number of features we engineered.
  2. Robustness: It’s less prone to overfitting compared to single decision trees.
  3. Feature importance: Random Forest provides insights into feature importance, helping us understand which factors most influence revenue.
  4. Non-linear relationships: It can capture complex, non-linear relationships in the data.
  5. Superior performance: In the BigQuery ML analysis shown in the table above, Random Forest outperformed the other models on key metrics such as MAE, MSE, and R² (Median Absolute Error being the one exception), demonstrating its effectiveness in predicting revenue. A minimal training sketch follows this list.
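Below is a minimal sketch of training and evaluating the Random Forest Regressor with scikit-learn, reusing the df_model DataFrame and the evaluate helper from the earlier sketches; the target column name and hyperparameters are assumptions, not the exact values used in the project:

Python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

TARGET = "totalTransactionRevenue"  # assumed name of the revenue target column

# One-hot encode remaining categorical columns so the model sees numeric input
X = pd.get_dummies(df_model.drop(columns=[TARGET, "date"]))
y = df_model[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(
    n_estimators=100,  # illustrative hyperparameters, not tuned values
    n_jobs=-1,
    random_state=42,
)
model.fit(X_train, y_train)

# Evaluate with the same metrics used in the BigQuery ML comparison
print(evaluate(y_test, model.predict(X_test)))

# Feature importances: which factors most influence predicted revenue
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:10]
print(top)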


Results and Comparison

As described in the conclusion below, the Python results for the full dataset are still pending; a side-by-side comparison with the BigQuery ML numbers will be added here once those runs complete.

Conclusion

Due to the large size of the dataset, I am currently encountering challenges in obtaining prediction results. Stay tuned for updates as I work to resolve these issues!
