Feature Selection

In our previous posts, we covered data cleaning and feature engineering for user behavior analytics. Now we’ll focus on feature selection, a crucial step in preparing our data for machine learning models.

Feature selection helps us identify the most relevant features, reducing noise and improving model performance.

Why Feature Selection Is Important

Feature selection is a key step in machine learning, essential for:

1. Reducing Overfitting

By focusing on the most relevant features, feature selection helps prevent the model from learning noise, thereby reducing overfitting.

2. Improving Accuracy

Selecting the right features ensures the model captures the most important patterns, leading to more accurate predictions.

3. Speeding Up Training

With fewer features, models train faster, saving computational resources and time.

4. Enhancing Interpretability

A simplified model with fewer features is easier to understand and explain, which is crucial for gaining trust and ensuring transparency.

In essence, feature selection boosts model performance, efficiency, and clarity.

Feature Selection Techniques

We’ll explore three main categories of feature selection methods:

  1. Filter Methods
  2. Wrapper Methods
  3. Embedded Methods

1. Filter Methods

Filter methods select features based on statistical measures, independent of any machine learning algorithm.

Correlation Analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def correlation_analysis(df, threshold=0.8):
    # Compute the Spearman correlation matrix (assumes all columns are numeric)
    corr_matrix = df.corr(method='spearman')

    # Create a heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
    plt.title('Feature Correlation Heatmap')
    plt.show()

    # Find highly correlated pairs (upper triangle only, so each pair appears once)
    high_corr_vars = np.where(np.abs(corr_matrix) > threshold)
    high_corr_vars = [(corr_matrix.index[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_vars) if x < y]

    return high_corr_vars

high_corr_features = correlation_analysis(df_engineered)
print("Highly correlated feature pairs:")
for feat1, feat2 in high_corr_features:
    print(f"{feat1} - {feat2}")

Mutual Information

from sklearn.feature_selection import mutual_info_regression

def mutual_information_analysis(X, y, top_n=10):
    mi_scores = mutual_info_regression(X, y)
    mi_scores = pd.Series(mi_scores, index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    mi_scores[:top_n].plot(kind='bar')
    plt.title('Top Features by Mutual Information')
    plt.xlabel('Features')
    plt.ylabel('Mutual Information Score')
    plt.tight_layout()
    plt.show()

    return mi_scores

# Assuming 'total_revenue' is our target variable
X = df_engineered.drop('total_revenue', axis=1)
y = df_engineered['total_revenue']

mi_scores = mutual_information_analysis(X, y)
print("Top 10 features by Mutual Information:")
print(mi_scores[:10])
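
If we simply want to keep the k highest-scoring features rather than inspect the full ranking, scikit-learn’s SelectKBest wraps the same mutual information scorer; a minimal sketch:

from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Keep the 10 features with the highest mutual information scores
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_selected = selector.fit_transform(X, y)
print("Features kept:", list(X.columns[selector.get_support()]))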

2. Wrapper Methods

Wrapper methods use a predictive model to score feature subsets and select the best-performing ones.

Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def recursive_feature_elimination(X, y, n_features_to_select=20):
    # Hold out a test set for later evaluation; RFE is fit on the training data only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    rfe = RFE(estimator=model, n_features_to_select=n_features_to_select)
    rfe = rfe.fit(X_train, y_train)

    selected_features = X.columns[rfe.support_]

    print("Selected features:")
    for feature in selected_features:
        print(feature)

    return selected_features

selected_features_rfe = recursive_feature_elimination(X, y)
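
A practical caveat: n_features_to_select is a guess. If we’d rather let cross-validation pick the number of features, scikit-learn’s RFECV does exactly that; a sketch under the same setup (note this can be slow with a 100-tree forest):

from sklearn.feature_selection import RFECV

# Cross-validation chooses how many features to keep
rfecv = RFECV(estimator=RandomForestRegressor(n_estimators=100, random_state=42),
              step=1, cv=5, scoring='r2')
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")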

3. Embedded Methods

Embedded methods perform feature selection as part of the model training process.
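
The classic example is L1 (lasso) regularization, which drives the coefficients of uninformative features to exactly zero during training. It isn’t part of our pipeline below, but here is a minimal sketch for reference:

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Lasso needs scaled inputs; the L1 penalty zeroes out weak coefficients
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)
print(f"Lasso kept {(lasso.coef_ != 0).sum()} of {X.shape[1]} features")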

Random Forest Feature Importance

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

def random_forest_feature_importance(X, y, top_n=20):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)

    feature_importance = pd.Series(model.feature_importances_, index=X.columns)
    feature_importance = feature_importance.sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    feature_importance[:top_n].plot(kind='bar')
    plt.title('Top Features by Random Forest Importance')
    plt.xlabel('Features')
    plt.ylabel('Importance')
    plt.tight_layout()
    plt.show()

    return feature_importance

rf_importance = random_forest_feature_importance(X, y)
print("Top 20 features by Random Forest Importance:")
print(rf_importance[:20])
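
Impurity-based importances like these can be biased toward high-cardinality features. A useful cross-check is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. A minimal sketch, refitting on a train split for illustration:

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the validation set and measure the score drop
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
perm_importance = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_importance.head(10))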

Combining Feature Selection Methods

To get a more robust set of features, we can combine the results from the different selection methods:

def combine_feature_selection_methods(mi_scores, rf_importance, rfe_features, top_n=20):
    # Normalize both score series to [0, 1] so they contribute equally
    mi_scores = (mi_scores - mi_scores.min()) / (mi_scores.max() - mi_scores.min())
    rf_importance = (rf_importance - rf_importance.min()) / (rf_importance.max() - rf_importance.min())

    # Combine scores, aligning on feature names (missing entries count as 0)
    combined_scores = mi_scores.add(rf_importance, fill_value=0)

    # Give each RFE-selected feature an extra vote
    rfe_bonus = pd.Series(1.0, index=list(rfe_features))
    combined_scores = combined_scores.add(rfe_bonus, fill_value=0)

    # Select the top-scoring features
    top_features = combined_scores.nlargest(top_n)

    return top_features

top_features = combine_feature_selection_methods(mi_scores, rf_importance, selected_features_rfe)
print("Top 20 features after combining selection methods:")
print(top_features)
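
With the combined ranking in hand, the last step is to subset the data so the modeling stage only sees the selected features:

X_selected = X[top_features.index]
print(f"Selected feature matrix: {X_selected.shape}")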

Conclusion

In this blog post, we explored various feature selection techniques for user behavior analytics. We used filter methods like correlation analysis and mutual information, wrapper methods like recursive feature elimination, and embedded methods like random forest feature importance. By combining these methods, we can identify the most relevant features for our predictive models.

In the next post, we’ll use these selected features to build predictive models for sales and revenue prediction.
