Support Vector Machines

Safety by Design Expert’s Note

For safety experts, understanding Support Vector Machines (SVMs) is crucial for developing robust and reliable AI systems:

  1. Handling Non-linearity: SVMs can capture complex, non-linear relationships in data, essential for modeling intricate safety scenarios.
  2. Margin Maximization: The SVM’s principle of maximizing the margin between classes can lead to more robust models, reducing the risk of misclassification in safety-critical applications.
  3. Kernel Selection: Understanding different kernels allows for tailoring the model to specific safety-related data structures, potentially improving performance in diverse safety contexts.
  4. Outlier Handling: SVMs can be less sensitive to outliers, which is valuable when dealing with noisy or anomalous data in safety-critical systems.
  5. High-dimensional Data: SVMs’ effectiveness in high-dimensional spaces makes them suitable for complex safety problems with numerous variables.
  6. Model Interpretability: While more complex than linear models, SVMs still offer ways to interpret their decisions, crucial for auditing AI systems in safety-critical contexts.

Support Vector Machines (SVMs) are powerful and versatile machine learning algorithms that have gained significant popularity in recent years.

Known for their effectiveness in high-dimensional spaces and their ability to handle non-linear decision boundaries, SVMs have become a go-to method for both classification and regression tasks in various domains, including image recognition, text classification, and bioinformatics.

In this post, we’ll explore the fundamentals of Support Vector Machines, with a particular focus on the kernel trick and how SVMs handle high-dimensional data. We’ll also implement an SVM for our ongoing house price prediction task to see how it performs in practice.

You can find the complete code in my GitHub repository.

Margin Maximization:
A principle in SVMs where the model seeks to maximize the distance (margin) between the decision boundary and the closest data points from each class.

Contents

  1. Understanding Support Vector Machines
  2. The Kernel Trick: Unveiling Hidden Patterns
  3. SVMs in High-Dimensional Spaces
  4. Implementing SVM for House Price Prediction
  5. Model Performance and Evaluation
  6. SVM’s Poor Performance and Improvement Strategies
  7. Hyperparameter Tuning
  8. Kernel Selection
  9. Conclusion

1. Understanding Support Vector Machines

Support Vector Machines are a class of algorithms that aim to find the optimal hyperplane that separates different classes in a dataset. In the case of regression, SVMs try to find the hyperplane that best fits the data points while allowing for some margin of error.

Key aspects of Support Vector Machines include:

  • Maximizing Margin: SVMs seek to maximize the margin between classes, which often leads to better generalization.
  • Support Vectors: The data points closest to the decision boundary, which are crucial in defining the hyperplane.
  • Soft Margin: Allows for some misclassification to achieve better overall performance.
  • Kernel Functions: Enable SVMs to handle non-linear relationships in the data.

2. The Kernel Trick: Unveiling Hidden Patterns

The kernel trick is a fundamental concept in SVMs that allows them to operate in high-dimensional feature spaces without explicitly computing the coordinates of the data in that space. This is particularly useful when dealing with non-linear relationships in the data.

Here’s how the kernel trick works:

  1. Instead of explicitly mapping the data into a high-dimensional space and computing dot products there, we use a kernel function that returns the value of that high-dimensional dot product directly from the original features.
  2. This allows SVMs to find non-linear decision boundaries in the original feature space, which correspond to linear boundaries in the higher-dimensional space.

Common kernel functions include:

  • Linear Kernel: K(x, y) = x · y
  • Polynomial Kernel: K(x, y) = (γx · y + r)^d
  • Radial Basis Function (RBF) Kernel: K(x, y) = exp(-γ||x – y||^2)

The choice of kernel function depends on the nature of your data and the problem you’re trying to solve.

Kernel Trick:
A method used in SVMs to transform data into a higher-dimensional space to make it easier to separate using a linear decision boundary.
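
To make these formulas concrete, here is a small illustrative sketch (not from the original post) that evaluates the three kernels above with scikit-learn’s pairwise kernel helpers; the toy vectors and the gamma value are arbitrary:

Python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two toy feature vectors (arbitrary values, shaped as 1-sample matrices)
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[2.0, 0.5, 1.0]])

# Linear kernel: K(x, y) = x . y
print(linear_kernel(x, y))

# Polynomial kernel: K(x, y) = (gamma * x . y + r)^d
print(polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0))

# RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2), computed manually and via scikit-learn
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - y) ** 2))
print(manual_rbf, rbf_kernel(x, y, gamma=gamma))  # both values should match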

3. SVMs in High-Dimensional Spaces

SVMs are particularly well-suited for handling high-dimensional data due to several key characteristics:

  1. Curse of Dimensionality Mitigation: SVMs are less affected by the curse of dimensionality compared to many other algorithms.
  2. Effective Feature Selection: SVMs implicitly perform feature selection by assigning weights to features.
  3. Regularization: The soft margin concept in SVMs acts as a form of regularization, helping to prevent overfitting in high-dimensional spaces.
  4. Sparsity of Solution: SVMs often produce sparse solutions, effectively using only a subset of the training data (the support vectors) to define the decision boundary.

These properties make SVMs an excellent choice for problems with many features, such as text classification or image recognition.

Curse of Dimensionality:
The phenomenon where the performance of certain models deteriorates as the number of features increases.

4. Implementing SVM for House Price Prediction

Let’s implement a Support Vector Machine for our house price prediction task using scikit-learn’s SVR (Support Vector Regression) class.

Python
# Import the Support Vector Regression class
from sklearn.svm import SVR

# Initialize the model with an RBF kernel and default-style C and epsilon
svm_model = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Fit the model to the (already scaled) training data
svm_model.fit(X_train, y_train)
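
The post does not show the evaluation code, but the cross-validated metrics reported in the next section could be computed along the following lines. This is a minimal sketch assuming X_train and y_train are the already scaled training data (the MAPE scorer requires scikit-learn 0.24 or newer):

Python
from sklearn.model_selection import cross_validate

# Scorer names follow scikit-learn's built-in scoring strings
scoring = {
    "MAE": "neg_mean_absolute_error",
    "MSE": "neg_mean_squared_error",
    "RMSE": "neg_root_mean_squared_error",
    "MAPE": "neg_mean_absolute_percentage_error",
    "MedAE": "neg_median_absolute_error",
    "R2": "r2",
}

# 5-fold cross-validation on the training data
cv_results = cross_validate(svm_model, X_train, y_train, cv=5, scoring=scoring)

for name in scoring:
    scores = cv_results[f"test_{name}"]
    sign = 1 if name == "R2" else -1  # error scorers are negated (higher is better)
    print(f"{name}: {sign * scores.mean():.3f} (+/- {scores.std():.3f})")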

5. Model Performance and Evaluation

| Model | MAE (thousand) | MSE (million) | RMSE (thousand) | MAPE | MedAE (thousand) | R² |
|---|---|---|---|---|---|---|
| Linear | 18.6 (±2.6) | 1,377 (±982) | 36.4 (±14.4) | 11.04% (±0.98%) | 11.6 (±1.2) | 0.780 (±0.178) |
| Ridge | 16.8 (±2.6) | 853 (±615) | 28.7 (±10.6) | 9.80% (±1.09%) | 11.5 (±1.8) | 0.867 (±0.073) |
| Elastic Net | 16.6 (±2.4) | 834 (±611) | 28.4 (±10.7) | 9.86% (±1.13%) | 11.3 (±1.6) | 0.870 (±0.074) |
| Random Forest | 17.5 (±2.0) | 927 (±506) | 30.2 (±8.1) | 0.10% (±0.01%) | 11.1 (±2.0) | 0.853 (±0.065) |
| GBDT | 16.8 (±2.4) | 953 (±595) | 30.5 (±9.9) | 0.09% (±0.02%) | 10.3 (±1.8) | 0.839 (±0.094) |
| XGBoost | 17.8 (±2.4) | 1,177 (±748) | 33.9 (±11.1) | 0.10% (±0.02%) | 10.8 (±2.2) | 0.806 (±0.078) |
| LightGBM | 16.7 (±3.4) | 838 (±513) | 28.6 (±9.1) | 0.09% (±0.02%) | 10.8 (±0.5) | 0.860 (±0.071) |
| SVM | 54.6 (±5.0) | 6,245 (±1,706) | 78.8 (±10.7) | 31.18% (±6.35%) | 37.9 (±5.5) | 0.053 (±0.067) |

1. Mean Absolute Error (MAE)

The MAE for SVM is significantly higher than all the other models.

2. Mean Squared Error (MSE)

The SVM’s MSE is substantially higher than the other models. This indicates that SVM struggles more with large prediction errors.

3. Root Mean Squared Error (RMSE)

The RMSE for SVM is the highest among all models, highlighting that its predictions have a larger magnitude of error.

4. Mean Absolute Percentage Error (MAPE)

The MAPE for SVM is significantly higher than the others.

5. Median Absolute Error (MedAE)

SVM’s MedAE is much higher, indicating that even for the median prediction, SVM is less accurate.

6. R-squared (R²)

SVM’s R² is very low, indicating that it explains only about 5.3% of the variance in house prices. This is in stark contrast to the other models, with Ridge and Elastic Net both explaining around 87% of the variance.

Key Takeaways

The SVM model underperforms significantly compared to the other models across all metrics. It shows the highest errors (MAE, MSE, RMSE, MAPE, MedAE) and the lowest R², indicating that it is not capturing the relationship between features and house prices effectively.

6. SVM’s Poor Performance and Improvement Strategies

In our initial experiments with Support Vector Machines (SVMs) for house price prediction, we encountered surprisingly poor performance.

Given that feature scaling has already been addressed, here are the possible reasons for the poor performance, along with several strategies that could potentially enhance the SVM model’s performance.

Possible Reasons for Initial Poor Performance

Suboptimal Hyperparameters:

SVMs are highly sensitive to their hyperparameters, particularly C, epsilon, and gamma.
Initial poor performance was likely due to using default parameters that were not well-suited to our specific dataset.

Unsuitable Kernel for Data Structure:

Our initial tests used the RBF (Radial Basis Function) kernel, which is often a default choice but may not always be optimal.

High Dimensionality of the Feature Space

SVMs can struggle with high-dimensional data, especially when the number of features is large relative to the number of samples.
This “curse of dimensionality” can lead to overfitting and poor generalization, particularly with non-linear kernels.

Presence of Irrelevant or Redundant Features

SVMs consider all input features, which can be problematic if many features are irrelevant or redundant.
This can lead to increased noise in the decision function and poorer generalization.

Non-Uniform Importance of Features

SVMs treat all features equally by default, which may not be optimal if some features are much more important than others in predicting house prices.

Strategies for Improving SVM Performance

Given the SVM’s initial poor results, several strategies were explored to enhance its performance:

Detailed Hyperparameter Tuning

Implementing a more comprehensive hyperparameter tuning strategy, such as grid search or random search, could help in finding the optimal values for the C, epsilon, and gamma parameters.

Experiment with Different Kernels

Testing different kernels, such as linear, polynomial or sigmoid, may better capture the underlying relationships in the data. While RBF is often a good default, it may not be optimal for all datasets.

Feature Selection

Using feature selection techniques to reduce the number of features may help prevent overfitting and improve generalization; a minimal sketch follows below.
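
One way to do this (not shown in the original post) is to place a univariate feature selector in front of the SVR inside a scikit-learn pipeline; the choice of k below is an arbitrary placeholder:

Python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# Keep only the k features with the strongest univariate relationship to the target
svm_with_selection = make_pipeline(
    SelectKBest(score_func=f_regression, k=20),
    SVR(kernel='rbf'),
)
svm_with_selection.fit(X_train, y_train)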

Hyperparameter Tuning:
The process of optimizing the settings of a model to improve its performance.

7. Hyperparameter Tuning

I implemented hyperparameter tuning to optimize our Support Vector Machine (SVM) model. This process involves fine-tuning key parameters such as C (the regularization parameter), epsilon (the width of the error-insensitive tube in SVR), and gamma (the kernel coefficient for the RBF kernel).

I employed RandomizedSearchCV, which efficiently samples the parameter space, making it suitable for our high-dimensional problem.

This method searches for the best combination of hyperparameters that optimizes the model’s performance, measured by mean squared error (MSE).
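
The exact search space is not listed in the post, so the distributions below are assumptions; this sketch only shows the general shape of such a RandomizedSearchCV run:

Python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Illustrative search distributions for C, epsilon, and gamma
param_distributions = {
    'C': loguniform(1e-1, 1e4),
    'epsilon': uniform(0.0, 1.0),
    'gamma': loguniform(1e-5, 1e-1),
}

search = RandomizedSearchCV(
    SVR(kernel='rbf'),
    param_distributions=param_distributions,
    n_iter=50,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)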

The best hyperparameters found were:

  • C: 834.3
  • epsilon: 0.772
  • gamma: 0.0006

Using these optimized parameters, I configured our SVM model and subjected it to 5-fold cross-validation. This thorough evaluation process, using metrics such as MSE, RMSE, and R², ensures that our tuned model’s performance is robustly validated across different subsets of our data.
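
A minimal sketch of that step, plugging in the tuned values listed above (the variable names are assumptions):

Python
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# SVR configured with the best hyperparameters found by the randomized search
tuned_svm = SVR(kernel='rbf', C=834.3, epsilon=0.772, gamma=0.0006)

# 5-fold cross-validation with the same error metrics as before
tuned_results = cross_validate(
    tuned_svm, X_train, y_train, cv=5,
    scoring=('neg_mean_squared_error', 'neg_root_mean_squared_error', 'r2'),
)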

Performance After Tuning

| Model | MAE (thousand) | MSE (million) | RMSE (thousand) | MAPE | MedAE (thousand) | R² |
|---|---|---|---|---|---|---|
| Linear | 18.6 (±2.6) | 1,377 (±982) | 36.4 (±14.4) | 11.04% (±0.98%) | 11.6 (±1.2) | 0.780 (±0.178) |
| Ridge | 16.8 (±2.6) | 853 (±615) | 28.7 (±10.6) | 9.80% (±1.09%) | 11.5 (±1.8) | 0.867 (±0.073) |
| Elastic Net | 16.6 (±2.4) | 834 (±611) | 28.4 (±10.7) | 9.86% (±1.13%) | 11.3 (±1.6) | 0.870 (±0.074) |
| Random Forest | 17.5 (±2.0) | 927 (±506) | 30.2 (±8.1) | 0.10% (±0.01%) | 11.1 (±2.0) | 0.853 (±0.065) |
| GBDT | 16.8 (±2.4) | 953 (±595) | 30.5 (±9.9) | 0.09% (±0.02%) | 10.3 (±1.8) | 0.839 (±0.094) |
| XGBoost | 17.8 (±2.4) | 1,177 (±748) | 33.9 (±11.1) | 0.10% (±0.02%) | 10.8 (±2.2) | 0.806 (±0.078) |
| LightGBM | 16.7 (±3.4) | 838 (±513) | 28.6 (±9.1) | 0.09% (±0.02%) | 10.8 (±0.5) | 0.860 (±0.071) |
| SVM | 54.6 (±5.0) | 6,245 (±1,706) | 78.8 (±10.7) | 31.18% (±6.35%) | 37.9 (±5.5) | 0.053 (±0.067) |
| SVM (after tuning) | 40.8 (±0.8) | 4,108 (±310) | 79.0 (±2.4) | 22.04% (±0.97%) | 25.9 (±1.0) | 0.311 (±0.006) |

1. Improvement from Tuning

The tuning process has significantly improved the SVM’s performance across nearly all metrics:

  • MAE decreased from 54.6 to 40.8 (25% improvement)
  • MSE decreased from 6,245 million to 4,108 million (34% improvement)
  • RMSE remained similar (78.8 to 79.0)
  • MAPE decreased from 31.18% to 22.04% (29% improvement)
  • MedAE decreased from 37.9 to 25.9 (32% improvement)
  • R² increased from 0.053 to 0.311 (486% improvement)

This demonstrates the importance of hyperparameter tuning for SVMs. The model’s ability to explain variance in the data (R²) has improved dramatically, though it’s still lower than other models.

2. Comparison to Other Models

Despite the improvements, the tuned SVM still underperforms compared to other models:

  • MAE (40.8) is more than twice that of the best-performing models.
  • MSE (4,108 million) is significantly higher than other models (mostly under 1,000 million).
  • RMSE (79.0) is more than twice that of other models (mostly around 28-36).
  • MAPE (22.04%) is much higher than other models (mostly under 10%, with tree-based models showing suspiciously low MAPEs).
  • MedAE (25.9) is more than twice that of other models (mostly around 10-11).
  • R² (0.311) is significantly lower than other models (mostly above 0.8).

3. Consistency of Performance

The standard errors for the tuned SVM are generally lower than before tuning, indicating more consistent performance across different subsets of the data. However, the standard errors are still higher than most other models for several metrics.

8. Kernel Selection

When working with Support Vector Machines (SVM), it’s crucial to experiment with different kernel functions to identify the most suitable one for your dataset.

While the Radial Basis Function (RBF) kernel is often a strong default choice, it may not always be optimal.

Exploring other kernels such as linear, polynomial, and sigmoid can yield better results depending on the data structure.

By iterating over these kernels and evaluating their performance using cross-validation, you can compare the mean MSE for each, allowing you to select the kernel that minimizes error and best captures the underlying patterns in your data.
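
A sketch of that comparison loop follows; the post does not say which C, epsilon, and gamma values were reused for each kernel, so carrying over the tuned ones below is an assumption:

Python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Compare kernels by mean cross-validated MSE
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = SVR(kernel=kernel, C=834.3, epsilon=0.772, gamma=0.0006)
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring='neg_mean_squared_error')
    print(f"{kernel}: mean MSE = {-scores.mean():,.0f}")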

Kernel Selection in SVM: Which One Works Best?

| Kernel | Mean MSE (million) | RMSE (thousand) |
|---|---|---|
| Linear | 800 | 28.3 |
| Poly | 22,241,597,288 | 149,136.2 |
| RBF | 4,108 | 64.1 |
| Sigmoid | 6,245 | 79.0 |

The results of experimenting with different kernel functions for the SVM model reveal significant variations in performance, highlighting the importance of kernel selection.

The linear kernel produced a Mean Squared Error (MSE) of approximately 799.6 million, with a corresponding Root Mean Squared Error (RMSE) of 28,276.75, by far the lowest error of the four kernels tested.

These results suggest that the linear kernel is the most suitable for this dataset, offering a balance between complexity and performance.

This analysis underscores the importance of evaluating multiple kernels to identify the most effective one for the specific characteristics of your data.

Performance After Kernel Selection

| Model | MAE (thousand) | MSE (million) | RMSE (thousand) | MAPE | MedAE (thousand) | R² |
|---|---|---|---|---|---|---|
| Linear | 18.6 (±2.6) | 1,377 (±982) | 36.4 (±14.4) | 11.04% (±0.98%) | 11.6 (±1.2) | 0.780 (±0.178) |
| Ridge | 16.8 (±2.6) | 853 (±615) | 28.7 (±10.6) | 9.80% (±1.09%) | 11.5 (±1.8) | 0.867 (±0.073) |
| Elastic Net | 16.6 (±2.4) | 834 (±611) | 28.4 (±10.7) | 9.86% (±1.13%) | 11.3 (±1.6) | 0.870 (±0.074) |
| Random Forest | 17.5 (±2.0) | 927 (±506) | 30.2 (±8.1) | 0.10% (±0.01%) | 11.1 (±2.0) | 0.853 (±0.065) |
| GBDT | 16.8 (±2.4) | 953 (±595) | 30.5 (±9.9) | 0.09% (±0.02%) | 10.3 (±1.8) | 0.839 (±0.094) |
| XGBoost | 17.8 (±2.4) | 1,177 (±748) | 33.9 (±11.1) | 0.10% (±0.02%) | 10.8 (±2.2) | 0.806 (±0.078) |
| LightGBM | 16.7 (±3.4) | 838 (±513) | 28.6 (±9.1) | 0.09% (±0.02%) | 10.8 (±0.5) | 0.860 (±0.071) |
| SVM | 54.6 (±5.0) | 6,245 (±1,706) | 78.8 (±10.7) | 31.18% (±6.35%) | 37.9 (±5.5) | 0.053 (±0.067) |
| SVM (after tuning) | 40.8 (±3.7) | 4,108 (±310) | 79.0 (±2.4) | 22.04% (±0.97%) | 25.9 (±1.0) | 0.311 (±0.006) |
| SVM (after kernel selection) | 15.9 (±1.2) | 800 (±296) | 28.3 (±5.2) | 9.19% (±0.85%) | 9.9 (±0.8) | 0.869 (±0.037) |

1. Improvement from Initial SVM to Final SVM

The progression of the SVM model shows dramatic improvements:

Initial SVM → Tuned SVM → SVM after kernel selection

  • MAE: 54.6 → 40.8 → 15.9 (71% total improvement)
  • MSE: 6,245 → 4,108 → 800 (87% total improvement)
  • RMSE: 78.8 → 79.0 → 28.3 (64% total improvement)
  • MAPE: 31.18% → 22.04% → 9.19% (71% total improvement)
  • MedAE: 37.9 → 25.9 → 9.9 (74% total improvement)
  • R²: 0.053 → 0.311 → 0.869 (1540% total improvement)

This progression demonstrates the critical importance of both hyperparameter tuning and appropriate kernel selection in SVM models.

2. SVM After Kernel Selection vs. Other Models

The SVM with linear kernel and optimized parameters now performs exceptionally well:

  • MAE: Now the best performer, slightly better than Elastic Net and LightGBM.
  • MSE: The lowest mean MSE, narrowly ahead of Elastic Net.
  • RMSE: Best performer, slightly better than Elastic Net.
  • MAPE: Better than Ridge and Elastic Net, and much better than linear regression.
  • MedAE: Best performer, outperforming even GBDT and LightGBM.
  • R²: Second best, just behind Elastic Net and marginally ahead of Ridge.

3. Consistency of Performance

The standard errors for the final SVM model are generally lower than or comparable to other models, indicating consistent performance across different subsets of the data:

  • MAE SE: Among the lowest
  • MSE SE: Lower than most models
  • RMSE SE: Lower than many models
  • MAPE SE: Comparable to other linear models
  • MedAE SE: Among the lowest of all models
  • R² SE: Lower than most models

4. Comparison with Tree-based Models

The final SVM outperforms tree-based models in most metrics:

  • Better MAE, MSE, and RMSE than Random Forest, GBDT, XGBoost, and LightGBM
  • Comparable or better R² than tree-based models
  • MAPE is much higher than for the tree-based models, though the tree models’ near-zero MAPE values look suspiciously low.

5. Linear vs. Non-linear Models

The success of the linear kernel SVM suggests that the relationships between features and house prices in this dataset are predominantly linear.

This is further supported by the strong performance of other linear models like Ridge and Elastic Net.

Key Takeaways

Importance of Model Tuning

The dramatic improvement from the initial SVM to the final model underscores the crucial role of hyperparameter tuning and kernel selection in achieving optimal performance.

Competitive Performance

The final SVM model is now among the top performers, often outperforming or matching the best models across various metrics.

Consistency

The relatively low standard errors indicate that the model’s performance is stable across different subsets of the data.

Linear Relationships

The success of the linear kernel SVM, along with the strong performance of other linear models, suggests that the feature-price relationships in this dataset are largely linear.

9. Conclusion

The SVM model, after proper tuning and kernel selection, has emerged as one of the top-performing models for this house price prediction task. Its performance is particularly impressive given its initial poor results, highlighting the importance of thorough model optimization.

The success of the linear kernel provides valuable insights into the nature of the data, suggesting that linear models or models that can effectively capture linear relationships (like the optimized SVM) are well-suited for this particular problem.

This case study in SVM optimization serves as an excellent example of how a seemingly underperforming model can become highly competitive with proper tuning and configuration. It reinforces the importance of not dismissing a model based on initial poor performance and the value of systematic optimization in machine learning projects.
