Random Forest

Safety by Design Expert’s Note

For safety experts, understanding Random Forest algorithms is crucial for developing robust and fair AI systems:

  1. Robustness: The ensemble nature of Random Forest provides resilience against adversarial attacks and data perturbations.
  2. Uncertainty Quantification: Random Forest can provide measures of prediction uncertainty, which is crucial for safety-critical applications (see the sketch after this list).
  3. Feature Importance: Understanding which features drive predictions is vital for ensuring model fairness and identifying potential safety risks.
  4. Non-linear Relationships: Random Forest can capture complex, non-linear relationships in data, potentially revealing safety-relevant patterns that simpler models might miss.
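
On point 2, one simple uncertainty estimate is the spread of the individual trees' predictions. A minimal sketch on synthetic data (illustrative only; not from the original post):

Python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=300)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = np.array([[2.5], [9.5]])
# Each fitted tree is available in rf.estimators_; the spread of their
# predictions is a rough per-point uncertainty estimate for the ensemble.
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
print("mean prediction:", per_tree.mean(axis=0))
print("per-tree std:   ", per_tree.std(axis=0))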

Random Forest is a versatile and robust machine learning algorithm that combines the power of multiple decision trees to create a more accurate and stable predictive model.

Known for its high accuracy, ability to handle large datasets with higher dimensionality, and resistance to overfitting, Random Forest has become a go-to method for both classification and regression tasks.

Unlike linear models, which assume a linear relationship between features and the target variable, Random Forest is a non-parametric method that can capture complex interactions and non-linear patterns in the data.

In this post, we’ll explore the fundamentals of Random Forest, its advantages, and how it performs in practice, particularly in the context of our house price prediction task.

You can find the complete code in my GitHub repository.

Contents

  1. Understanding Random Forest
  2. Implementing Random Forest for House Price Prediction
  3. Model Performance and Evaluation
  4. Feature Importance in Random Forest
  5. Conclusion

Random Forest
Random Forest is an ensemble learning algorithm that creates multiple decision trees during training and combines their outputs to make a final prediction. This method improves accuracy and reduces the risk of overfitting by averaging the results of all the trees, which helps to handle complex, non-linear relationships in data.

1. Understanding Random Forest

Random Forest is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction (for regression) or the mode of the classes (for classification) of the individual trees.

Key aspects of Random Forest include:

  • Bootstrap Aggregating (Bagging): Each tree is trained on a random subset of the training data, sampled with replacement.
  • Feature Randomness: At each split in the tree, only a random subset of features is considered, which adds an additional layer of randomness.
  • Ensemble Decision: The final prediction is made by averaging the predictions of all trees (for regression) or by majority vote (for classification).

These characteristics contribute to Random Forest’s ability to reduce overfitting, handle high-dimensional data, and provide robust predictions.
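
In scikit-learn, each of these ideas corresponds directly to a constructor parameter. A minimal sketch (the parameter values here are illustrative, not tuned choices):

Python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,      # Ensemble Decision: predictions are averaged over 100 trees
    bootstrap=True,        # Bagging: each tree trains on a bootstrap sample of the rows
    max_features="sqrt",   # Feature Randomness: only sqrt(n_features) candidates per split
    random_state=42,
)
# For regression, rf.predict(X) returns the mean of the individual tree predictions.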

Ensemble Learning
Ensemble learning is a technique in machine learning where multiple models (e.g., decision trees) are combined to produce a stronger overall model. The idea is that by averaging or voting on predictions from several models, the ensemble can correct the mistakes of individual models, leading to better performance.

Bootstrap Aggregating (Bagging)
Bagging is a technique used in ensemble learning where multiple versions of a model are trained on different subsets of the training data. These subsets are created by randomly sampling the data with replacement. By combining the predictions of these models, bagging reduces the model’s variance and improves its accuracy.
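
A minimal sketch of the bootstrap sampling that bagging relies on (illustrative only, using NumPy directly rather than scikit-learn's internals):

Python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10
data_indices = np.arange(n_samples)

# Each bootstrap sample draws n_samples indices *with replacement*, so some
# rows appear multiple times and others are left out ("out-of-bag").
for tree_id in range(3):
    bootstrap_sample = rng.choice(data_indices, size=n_samples, replace=True)
    out_of_bag = np.setdiff1d(data_indices, bootstrap_sample)
    print(f"Tree {tree_id}: sample={bootstrap_sample}, out-of-bag={out_of_bag}")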

Feature Randomness
In Random Forest, feature randomness refers to the process of selecting a random subset of features for each split in a decision tree. This technique helps to make the trees more diverse and reduces the chances of overfitting, ensuring that no single feature dominates the model’s decisions.

Overfitting
Overfitting occurs when a model learns the details and noise in the training data to the extent that it performs well on the training data but poorly on unseen data. In other words, the model becomes too complex and captures random fluctuations rather than the underlying pattern, leading to poor generalization.
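
A minimal sketch of how overfitting shows up in practice, using a single fully grown decision tree on synthetic data (illustrative; not from the original post):

Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)  # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training noise
deep_tree = DecisionTreeRegressor(max_depth=None).fit(X_train, y_train)
print(f"train R^2: {deep_tree.score(X_train, y_train):.3f}")  # close to 1.0
print(f"test  R^2: {deep_tree.score(X_test, y_test):.3f}")   # noticeably lower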

2. Implementing Random Forest for House Price Prediction

Let’s implement a Random Forest model for our house price prediction task. We’ll use scikit-learn’s RandomForestRegressor class:

Python
# Imports (assuming X_processed and y were prepared in an earlier preprocessing step)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Split the processed data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.2, random_state=42
)

# Initialize the model with 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Cross-validated performance metrics (scikit-learn returns negated error
# scores, so we flip the sign to get positive error values)
mae = -cross_val_score(rf_model, X_processed, y, cv=5, scoring='neg_mean_absolute_error')
mse = -cross_val_score(rf_model, X_processed, y, cv=5, scoring='neg_mean_squared_error')
rmse = np.sqrt(mse)
mape = -cross_val_score(rf_model, X_processed, y, cv=5, scoring='neg_mean_absolute_percentage_error')
r2 = cross_val_score(rf_model, X_processed, y, cv=5, scoring='r2')
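
The "+/-" figures reported in the table in the next section are the mean and standard deviation of these per-fold scores. A short snippet for printing them (assuming the arrays computed above):

Python
for name, scores in [("MAE", mae), ("MSE", mse), ("RMSE", rmse),
                     ("MAPE", mape), ("R2", r2)]:
    print(f"{name}: {scores.mean():,.3f} (+/- {scores.std():,.3f})")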

3. Model Performance and Evaluation

The Random Forest regression model’s performance can be evaluated by analyzing the results obtained through cross-validation and testing on a separate test set. Let’s break down the key metrics.

Cross-Validation
Cross-validation is a method used to evaluate a model’s performance by dividing the data into several subsets or “folds.” The model is trained on some folds and tested on the others. This process is repeated multiple times, and the results are averaged to provide a more reliable estimate of the model’s ability to generalize to new data.

Cross-Validation Results

Cross-Validation results give a robust estimate of the model’s performance across different subsets of the data. They show how the model performs on average and how much variation there is in its performance.

The table summarizes the performance metrics of four different models: Linear Regression, Ridge Regression, Elastic Net Regression, and Random Forest. Here’s a detailed comparison of Random Forest with the other three models based on these metrics:

| Metric | Linear Regression | Ridge Regression | Elastic Net Regression | Random Forest |
| --- | --- | --- | --- | --- |
| Mean Absolute Error (MAE) | 18,647 (+/- 2,643) | 16,753 (+/- 2,554) | 16,613 (+/- 2,387) | 17,508 (+/- 2,042) |
| Mean Squared Error (MSE) | 1,377,330,606 (+/- 982,155,554) | 853,323,765 (+/- 614,762,731) | 834,494,733 (+/- 611,226,242) | 927,232,894 (+/- 506,114,186) |
| Root Mean Squared Error (RMSE) | 36,411 (+/- 14,357) | 28,730 (+/- 10,565) | 28,388 (+/- 10,703) | 30,177 (+/- 8,145) |
| Mean Absolute Percentage Error (MAPE) | 11.036% (+/- 0.977%) | 9.799% (+/- 1.089%) | 9.860% (+/- 1.127%) | 0.100 (+/- 0.010)* |
| Median Absolute Error (MedAE) | 11,607 (+/- 1,154) | 11,458 (+/- 1,770) | 11,318 (+/- 1,634) | 11,101 (+/- 2,006) |
| R-squared (R²) | 0.780 (+/- 0.178) | 0.867 (+/- 0.073) | 0.870 (+/- 0.074) | 0.853 (+/- 0.065) |

*Reported by scikit-learn as a fraction rather than a percentage; 0.100 corresponds to roughly 10.0% (see the MAPE discussion below).

1. Mean Absolute Error (MAE)

Random Forest performs better than Linear Regression in terms of MAE, indicating more accurate predictions on average. However, it slightly underperforms compared to Ridge and Elastic Net Regression, suggesting that the regularization techniques help to improve prediction accuracy.

2. Mean Squared Error (MSE)

Random Forest reduces MSE significantly compared to Linear Regression, showcasing its strength in minimizing larger errors. However, Ridge and Elastic Net Regression achieve even lower MSE values, indicating they are better at controlling large prediction errors, likely due to their ability to regularize coefficients effectively.

3. Root Mean Squared Error (RMSE)

The RMSE metric shows that Random Forest outperforms Linear Regression by a notable margin, producing predictions with smaller average error magnitudes. However, both Ridge and Elastic Net still perform slightly better than Random Forest, indicating they might offer more consistent predictions with fewer outliers.

4. Mean Absolute Percentage Error (MAPE)

At first glance, Random Forest's reported MAPE of 0.100 appears to far outperform all other models. This is almost certainly a units artifact rather than a real advantage: scikit-learn's neg_mean_absolute_percentage_error returns a fraction rather than a percentage, so 0.100 corresponds to roughly 10.0%. Read on a like-for-like basis, Random Forest's relative error falls between Linear Regression (11.0%) and the regularized models (about 9.8%), consistent with its ranking on the other metrics.

5. Median Absolute Error (MedAE)

The MedAE metric indicates that Random Forest has the lowest median absolute error among all models, suggesting that it is particularly effective at making typical predictions close to the actual values. Random Forest’s slight edge in MedAE suggests it may be better at handling outliers or variations in the data, leading to more consistent predictions closer to the median.

6. R-squared (R²)

R² indicates how well the model explains the variance in the target variable. Random Forest shows a strong R², outperforming Linear Regression by a significant margin. However, Ridge and Elastic Net have slightly higher R² values, indicating they explain even more variance, likely due to their regularization techniques that reduce overfitting and improve generalization.

Cross-Validation Summary

While Random Forest is highly effective, and posts the lowest median absolute error of the four models, Ridge and Elastic Net Regression provide slightly better performance across most other metrics. The choice of model may therefore depend on the specific goals: prioritizing overall accuracy and generalization (Ridge/Elastic Net), or robustness to non-linear patterns and consistent typical predictions (Random Forest).

Test Set Results

The test set results provide crucial insights into how well the Random Forest model generalizes to unseen data. Here are the key metrics and their interpretations:

Test Mean Absolute Error (MAE): 17,107

The test MAE of $17,107 indicates that, on average, the model’s predictions deviate from the actual house prices by this amount.

Notably, this error is slightly lower than the cross-validated MAE, which suggests that the model performs slightly better on the test set than during cross-validation.

This consistency between the cross-validated and test MAE reinforces the reliability of the model in predicting house prices with reasonable accuracy.

Test Mean Squared Error (MSE): 887,917,562

The test MSE is 887,917,562 (measured in squared dollars, so it is not directly comparable to the price scale), which is also slightly lower than the cross-validated MSE.

This consistency in MSE values between the test set and cross-validation indicates that the model is not overfitting and is performing robustly across different subsets of data.

The MSE reflects the model’s ability to handle larger errors effectively, ensuring that significant deviations from actual house prices are kept under control.

Test Root Mean Squared Error (RMSE): 29,798

The RMSE for the test set is $29,798, aligning closely with the cross-validated RMSE.

This metric provides a direct interpretation of the average magnitude of error in the same units as the target variable (house prices).

The similarity between the test RMSE and the cross-validated RMSE underscores the model’s ability to generalize well to new, unseen data, maintaining a typical prediction error around this value.

Test R-squared (R²): 0.884

The test set R² score of 0.884 suggests that the model explains approximately 88.4% of the variance in house prices on unseen data.

This score is slightly higher than the cross-validated R², indicating that the model performs slightly better when predicting on the test set.

A high R² value on the test set confirms the model’s effectiveness in capturing the underlying relationships in the data and delivering accurate predictions on house prices.

These test set results collectively demonstrate that the Random Forest model not only performs well during cross-validation but also generalizes effectively to new data, making it a reliable choice for house price prediction tasks.
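
A minimal sketch of how these test-set metrics can be computed with scikit-learn, assuming the rf_model, X_test, and y_test from Section 2 (the printed values will match those above only with the same data and preprocessing):

Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = rf_model.predict(X_test)

print(f"Test MAE:  {mean_absolute_error(y_test, y_pred):,.0f}")
print(f"Test MSE:  {mean_squared_error(y_test, y_pred):,.0f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):,.0f}")
print(f"Test R2:   {r2_score(y_test, y_pred):.3f}")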

Mean Absolute Error (MAE)
MAE measures the average magnitude of errors in a set of predictions, without considering their direction. It is calculated as the average of the absolute differences between the predicted and actual values. A lower MAE indicates better model accuracy.

Mean Squared Error (MSE)
MSE measures the average of the squared differences between predicted and actual values. Because it squares the errors, MSE penalizes larger errors more than MAE, making it useful for identifying models that produce large prediction errors.

Root Mean Squared Error (RMSE)
RMSE is the square root of the Mean Squared Error. It provides an error metric in the same units as the target variable, making it easier to interpret. RMSE is useful for understanding the average magnitude of error in the model’s predictions.

Mean Absolute Percentage Error (MAPE)
MAPE measures the accuracy of predictions as a percentage by comparing the absolute differences between predicted and actual values to the actual values themselves. It is useful for understanding how large the errors are relative to the actual values, but it can be misleading if actual values are close to zero. Note that scikit-learn's implementation returns this value as a fraction (e.g., 0.10 for 10%), not multiplied by 100.

R-squared (R²)
R² is a statistical measure that represents the proportion of the variance in the dependent variable (e.g., house prices) that is explained by the independent variables (e.g., house features) in the model. An R² value close to 1 indicates that the model explains most of the variance in the target variable, while a value close to 0 suggests that the model does not explain much of the variance.
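
For reference, the standard formulas behind these metrics, with $y_i$ the actual values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the actual values, and $n$ the number of observations (MAPE is shown as a fraction; multiply by 100 to express it as a percentage):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert \qquad \mathrm{MedAE} = \operatorname{median}_{i}\,\lvert y_i - \hat{y}_i\rvert \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$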

4. Feature Importance in Random Forest

The feature importance results from the Random Forest model reveal which predictors most strongly influence house prices based on the dataset used. The top features and their corresponding importance scores are as follows (a sketch of how to extract this ranking follows the list):

  1. TotalSF_OverallQual (0.799): This feature dominates the model, accounting for nearly 80% of the total importance. It combines total square footage with overall quality, suggesting that the size and quality of a house are by far the most critical factors in determining its price.
  2. TotalSF_OverallCond (0.017): This feature combines total square footage with overall condition, further emphasizing the importance of both size and the state of the property.
  3. YearBuilt_YearRemodAdd (0.012): This interaction between the year built and year remodeled suggests that the age of the house and any updates made to it play a role in determining its value.
  4. LotFrontage (0.010): This refers to the linear feet of street connected to the property, indicating that the property’s street presence is a factor in its price.
  5. BsmtFinSF1 (0.008): This represents the primary finished square footage of the basement (type 1 finished area), suggesting that finished basement space adds value to a home.
  6. LotArea (0.008): The total square footage of the lot is also a factor, though less important than the house’s square footage.
  7. GarageArea_GarageCars (0.008): This interaction feature suggests that both the size of the garage and its capacity (number of cars) influence the house price.
  8. BsmtUnfSF (0.008): Unfinished square feet in the basement also contribute to the house price, though less than finished basement space.
  9. TotalBathrooms (0.005): The total number of bathrooms in the house is a factor in determining its price.
  10. TotalSF (0.005): The total square footage of the house appears again, reinforcing the importance of the house’s size.
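
A minimal sketch of how such an importance ranking can be extracted from the fitted model (assuming rf_model from Section 2, and that X_processed is a pandas DataFrame; otherwise pass a list of feature names as the index):

Python
import pandas as pd

# Impurity-based importances; they sum to 1.0 across all features.
importances = pd.Series(rf_model.feature_importances_, index=X_processed.columns)
print(importances.sort_values(ascending=False).head(10))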

Comparison with Ridge and Elastic Net Results

Similarities:
  • All three models identify TotalSF_OverallQual as the most important feature, emphasizing the crucial role of a house’s size and quality in determining its price.
  • GarageArea_GarageCars appears in all three models, indicating the consistent importance of garage features.
  • All models consider some aspect of the basement (BsmtFinSF1, BsmtUnfSF, BsmtQual_Gd, BsmtExposure_Gd) as important, highlighting the value of basement space.
Differences:
  • Feature Importance Distribution: The Random Forest model assigns a much higher importance (about 80%) to the top feature (TotalSF_OverallQual) compared to Ridge and Elastic Net, which show a more even distribution of importance across features.
  • Neighborhood Features: Ridge and Elastic Net highlight specific neighborhood features (e.g., Neighborhood_CollgCr) that don’t appear in the top 10 for Random Forest.
  • Kitchen Quality: KitchenQual_Gd is important in Ridge and Elastic Net but doesn’t appear in the top 10 for Random Forest.
  • Specific Materials: Elastic Net identifies RoofMatl_WdShngl as important, a level of specificity not seen in the Random Forest or Ridge results.
  • Interaction Features: While all models use some interaction features, Random Forest seems to rely more heavily on them (e.g., TotalSF_OverallQual, TotalSF_OverallCond).
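
For context, interaction features like TotalSF_OverallQual are typically built by multiplying the underlying columns. A hedged sketch of such a construction (the DataFrame df and the exact recipe are assumptions for illustration, not taken from the original preprocessing code):

Python
import pandas as pd

# Hypothetical reconstruction: df is assumed to be a pandas DataFrame
# containing the base columns referenced throughout this post.
def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["TotalSF_OverallQual"] = df["TotalSF"] * df["OverallQual"]
    df["TotalSF_OverallCond"] = df["TotalSF"] * df["OverallCond"]
    df["GarageArea_GarageCars"] = df["GarageArea"] * df["GarageCars"]
    return df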
Key Takeaways:
  1. The Random Forest model puts a very strong emphasis on the combination of total square footage and overall quality, suggesting it might be capturing non-linear relationships between these features and house prices.
  2. Ridge and Elastic Net models provide a more granular view of feature importance, highlighting specific qualities (like kitchen quality) and neighborhood effects that don’t appear explicitly in the Random Forest top features.
  3. The Random Forest model might be better at capturing complex interactions between features, as evidenced by the high importance of several interaction terms.
  4. The more even distribution of feature importance in Ridge and Elastic Net models might make them more interpretable and potentially more stable when applied to slightly different datasets.

In conclusion, while all three models agree on the general importance of size, quality, and certain features like garages and basements, they differ in how they weight these features and in the specific aspects they highlight.

This suggests that using multiple models can provide a more comprehensive understanding of the factors influencing house prices.

5. Conclusion

In this exploration of Random Forest for house price prediction, we’ve seen how this powerful ensemble method can capture complex relationships in real estate data.

The model demonstrated strong performance, explaining 88.4% of the variance in house prices on the test set. Its strength lies in capturing non-linear relationships and interactions between features, evident in the dominance of the TotalSF_OverallQual feature.

However, the comparison with Ridge and Elastic Net Regression reveals that while Random Forest offers strong accuracy and the lowest median absolute error, the regularized linear models provide slightly better overall performance in terms of Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).

This suggests that while Random Forest is robust and flexible, especially in handling non-linear patterns, Ridge and Elastic Net may offer more stable and interpretable results, particularly when the dataset exhibits multicollinearity.

The feature importance analysis highlights that Random Forest places overwhelming emphasis on certain key features, particularly the interaction between size and quality, while the Ridge and Elastic Net models distribute importance more evenly, giving attention to neighborhood effects, kitchen quality, and specific materials.

In practice, the choice between these models depends on the specific needs of the task. If the goal is to capture complex interactions and keep typical prediction errors low, Random Forest is a strong candidate.

However, for tasks requiring more balanced and interpretable models that generalize well across different datasets, Ridge and Elastic Net may be preferable.

Combining insights from multiple models can provide a more comprehensive understanding of the factors influencing house prices, leading to more informed decision-making in real estate analytics.
