For safety experts, understanding data cleaning and preparation is crucial. Poor data quality can lead to biased or unreliable AI models, potentially causing harm in high-stakes applications.
By mastering these techniques, you can ensure that the foundation of AI systems is robust, reducing risks associated with data-driven decision-making.
Welcome to the first step in our house price prediction journey!
Before we can build sophisticated models, we need to ensure our data is clean, consistent, and ready for analysis.
In this post, we’ll dive into the critical process of data cleaning and preparation using the “House Prices – Advanced Regression Techniques” dataset from Kaggle.
Contents
- Why Data Cleaning and Preparation?
- Step-by-Step Guide
- Feature and Target Separation with Data Type Identification
- Constructing a Comprehensive Preprocessing Pipeline
- Handling Missing Values
- Dealing with Outliers
- Feature Scaling
- Handling Categorical Variables
- Data Type Conversion
- Next Steps
Why Data Cleaning and Preparation?
Clean data is crucial for model accuracy and reliability, as raw data often contains issues such as missing values, outliers, and inconsistencies that need to be addressed.
We’ll cover essential techniques and best practices to prepare your data for analysis and modeling, setting a solid foundation for accurate predictions and reliable insights.
Step-by-Step Guide
The “House Prices – Advanced Regression Techniques” dataset from Kaggle is a rich collection of data on residential properties in Ames, Iowa.
It includes 79 explanatory variables describing various aspects of homes, such as:
- Overall quality and condition
- Year built and remodeled
- Living area square footage
- Number of bedrooms and bathrooms
- Neighborhood information
- Various area calculations (basement, garage, etc.)
Our target variable is the sale price of each house.
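If you want to follow along, the short sketch below shows one way to load the training data, assuming you have downloaded train.csv from the competition page into your working directory (the file name and path are assumptions, not part of the original post):

import pandas as pd

# Load the Kaggle training data (assumes train.csv is in the working directory)
df = pd.read_csv('train.csv')

# The training set has 1,460 rows and 81 columns: Id, 79 features, and SalePrice
print(df.shape)
print(df['SalePrice'].describe())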
1. Feature and Target Separation with Data Type Identification
We begin by separating the features from the target variable in our dataset. The target variable, SalePrice, represents the house prices we aim to predict. By dropping this column from the original DataFrame and storing it in a separate variable y, we isolate the features in a new DataFrame X.
This separation is crucial for building machine learning models, as it allows us to independently process the input features and the target variable, ensuring that the model focuses on learning the relationships between them.
Next, we identify the numeric and categorical columns within our features. Using the select_dtypes method, we filter the columns based on their data types:
- Numeric features are stored in the numeric_features variable.
- Categorical features (object or category data) are stored in the categorical_features variable.
This distinction is important as it allows us to apply different preprocessing techniques tailored to the nature of the data, ensuring that both numeric and categorical features are appropriately handled before they are used in our predictive models.
# Separate features and target
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
# Identify numeric, categorical, and year columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.drop(['YearBuilt', 'YearRemodAdd', 'YrSold'])
categorical_features = X.select_dtypes(include=['object']).columns
year_features = ['YearBuilt', 'YearRemodAdd', 'YrSold']
2. Constructing a Comprehensive Preprocessing Pipeline
Next, we construct a robust preprocessing pipeline that ensures our data is clean, consistent, and ready for modeling.
The Pipeline class from sklearn.pipeline is used to create a sequence of data processing steps.
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),
    ('outlier_capper', OutlierCapper()),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')),
])

year_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('converter', YearConverter()),
])
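These three pipelines can then be combined into a single preprocessor that routes each group of columns through the right steps. The sketch below is one plausible way to wire them together with ColumnTransformer, assuming the column lists from step 1 and the custom OutlierCapper and YearConverter transformers defined later in this post; the repository may combine them differently:

from sklearn.compose import ColumnTransformer

# Send each group of columns through its dedicated pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features),
    ('year', year_transformer, year_features),
])

# Fit on the training features and transform them in one step
X_processed = preprocessor.fit_transform(X)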
3. Handling Missing Values
Numeric Features
For the numeric features, the pipeline addresses missing values using a KNNImputer, which imputes missing data points by leveraging the information from the nearest neighbors.
There are several strategies for dealing with missing values:
- Delete rows with missing data (if few)
- Impute missing values (mean, median, mode, or more advanced methods)
- Use model-based imputation, such as k-Nearest Neighbors (KNNImputer), which estimates missing values from similar observations
I used k-Nearest Neighbors because it offers a more sophisticated approach to imputing missing values by considering the similarity between data points. Instead of simply replacing missing values with a mean, median, or mode, KNNImputer looks at the nearest neighbors in the feature space and fills in the missing values based on the most similar observations.
This method preserves the underlying structure of the data, leading to potentially more accurate imputations and helping to maintain the relationships between features that could be crucial for the predictive model’s performance.
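To see what KNN imputation does in isolation, here is a minimal, self-contained example on made-up numbers:

import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data with one missing value in the second column
toy = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [3.0, 300.0],
])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(toy))
# The NaN is replaced with 200.0, the mean of its two nearest neighbors' values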
Categorical Features
For the categorical features, the pipeline imputes missing values using SimpleImputer with a strategy of filling in with a constant value, specifically the string ‘missing’.
Choosing to impute missing values in categorical features with a constant value like ‘missing’ is a deliberate decision aimed at explicitly marking and tracking where data was absent.
By using this approach, we ensure that the imputed values are easily identifiable, which can be particularly useful if missing data holds some significance or if we want the model to learn that these instances represent a distinct, separate category.
This method avoids assumptions about the nature of the missing data and ensures that the imputed values do not interfere with the existing distribution of categories. Instead, it clearly differentiates imputed entries from the original categories, maintaining the integrity of the dataset while allowing the model to handle missing data effectively.
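As a quick illustration of the constant-fill strategy (the column name and values below are just toy data):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with a missing entry
garage = pd.DataFrame({'GarageType': ['Attchd', np.nan, 'Detchd']})

imputer = SimpleImputer(strategy='constant', fill_value='missing')
print(imputer.fit_transform(garage))
# [['Attchd'] ['missing'] ['Detchd']]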
4. Dealing with Outliers
Detecting outliers is crucial in data cleaning and preparation because outliers can significantly impact the performance and accuracy of machine learning models. By identifying and handling outliers, we ensure that our data is more representative of typical conditions, leading to more robust and reliable models.
The steps outlined in the OutlierCapper class below are designed to cap outliers by setting upper and lower bounds based on specified quantiles, ensuring that extreme values do not distort the model’s performance.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierCapper(BaseEstimator, TransformerMixin):
    def __init__(self, lower_quantile=0.01, upper_quantile=0.99):
        self.lower_quantile = lower_quantile
        self.upper_quantile = upper_quantile
        self.lower_bounds = None
        self.upper_bounds = None

    def fit(self, X, y=None):
        # Learn per-column capping bounds from the training data
        self.lower_bounds = np.quantile(X, self.lower_quantile, axis=0)
        self.upper_bounds = np.quantile(X, self.upper_quantile, axis=0)
        return self

    def transform(self, X):
        # Clip values that fall outside the learned bounds
        return np.clip(X, self.lower_bounds, self.upper_bounds)
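A quick sanity check of the capper on synthetic data (assuming the class definition above):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
data[0, 0] = 50.0  # inject one extreme outlier

capper = OutlierCapper(lower_quantile=0.01, upper_quantile=0.99)
capped = capper.fit_transform(data)

# The 50.0 is pulled down to the 99th-percentile value of the column
print(data.max(), capped.max())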
5. Feature Scaling
Scaling features is crucial for several machine learning algorithms to ensure optimal performance.
Algorithms that Require Feature Scaling
1. Gradient Descent-Based Algorithms:
- Linear Regression: When using gradient descent for optimization, scaling ensures that the algorithm converges more quickly and avoids getting stuck in local minima.
- Logistic Regression: Similar to linear regression, scaling helps in achieving faster convergence during gradient descent optimization.
- Neural Networks: Scaling inputs can lead to faster training times and better performance. It helps in maintaining the stability of the network during training.
- Support Vector Machines (SVMs): Scaling is crucial for the kernel functions to work properly. SVMs are sensitive to the relative scales of input features.
- Gradient Boosting Machines (GBM): Although less sensitive than SVMs and neural networks, scaling can still improve the performance and convergence speed of gradient boosting models.
2. Distance-Based Algorithms:
- K-Nearest Neighbors (KNN): KNN is heavily influenced by the scale of the features since it relies on distance metrics like Euclidean distance. Features on a larger scale can disproportionately influence the distance calculations (see the short demo after this list).
- K-Means Clustering: The algorithm uses distance measures to assign data points to clusters. Scaling ensures that no single feature dominates the clustering process.
- Principal Component Analysis (PCA): PCA involves variance calculation, which can be skewed by features on different scales. Scaling ensures that all features contribute equally to the principal components.
3. Regularization Algorithms:
- Ridge Regression: Involves an L2 penalty which is sensitive to the scale of the features.
- Lasso Regression: Similar to ridge regression but uses an L1 penalty. Scaling helps in balanced regularization across all features.
Gradient Boosting Machines are generally more robust to differences in feature scales compared to the other algorithms mentioned.
Tree-based methods like Random Forests are generally invariant to feature scaling.
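To make the distance-based point concrete, the toy calculation below (with made-up house values) shows how a feature measured in square feet dominates one measured in bathroom counts until both are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two houses: [living area in sq ft, number of bathrooms]
houses = np.array([
    [1500.0, 1.0],
    [1550.0, 3.0],
])

# Raw Euclidean distance is driven almost entirely by square footage
print(np.linalg.norm(houses[0] - houses[1]))  # about 50.04

# After standardization, both features contribute comparably
scaled = StandardScaler().fit_transform(houses)
print(np.linalg.norm(scaled[0] - scaled[1]))  # about 2.83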
The Choice of Scaling Method
A general guideline on which scaling method works best for various models:
StandardScaler
Standardizes features by removing the mean and scaling to unit variance.
Best for:
- Linear Models: Linear Regression, Logistic Regression
- SVM: Support Vector Machines
- PCA: Principal Component Analysis
- K-Means: Clustering
Data Distribution: Works best when the data is normally distributed (Gaussian distribution).
MinMaxScaler
Scales features to a fixed range, usually [0, 1].
Best for:
- Neural Networks: Ensures that all input features have the same scale.
- KNN: k-Nearest Neighbors (sensitive to the distance between points).
- Tree-based models: Sometimes beneficial to keep features within a specific range, although not as critical as for other models.
Data Distribution: Works well when the data does not have outliers or when the data is uniformly distributed.
RobustScaler
Scales features using statistics that are robust to outliers (it uses the median and interquartile range).
Best for:
- Any model where the data contains many outliers: Robust to the presence of outliers which can skew the results of other scalers.
Data Distribution: Especially useful for data with many outliers.
I chose StandardScaler for the analysis because many machine learning algorithms, especially those that involve optimization or distance measures (e.g., support vector machines, linear regression, k-nearest neighbors), perform better when the data is centered around zero with unit variance.
StandardScaler ensures that all features contribute equally to the model by placing them on a comparable scale.
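The short comparison below (toy values, for illustration only) shows how the three scalers treat the same column when it contains one extreme value:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# A single feature with one extreme value
values = np.array([[100.0], [120.0], [130.0], [140.0], [1000.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(values).ravel()
    print(type(scaler).__name__, np.round(scaled, 2))
# StandardScaler and MinMaxScaler squeeze the four typical values together,
# while RobustScaler (median and IQR based) keeps them spread on a sensible scale.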
6. Handling Categorical Variables
Machine learning models require numerical input, so we need to convert categorical variables into numbers. We do this with a technique called one-hot encoding, which transforms each categorical variable into multiple binary columns, one for each unique category in the original feature. In each binary column, a value of 1 indicates the presence of that category, while a 0 indicates its absence.
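Here is a minimal sketch of one-hot encoding on a toy neighborhood column, using the same encoder settings as the pipeline above:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Neighborhood': ['NAmes', 'CollgCr', 'NAmes']})

encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(toy)

print(encoder.get_feature_names_out())  # ['Neighborhood_NAmes'] (CollgCr dropped as the first category)
print(encoded)                          # [[1.] [0.] [1.]]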
7. Data Type Conversion
Ensuring correct data types is crucial for proper analysis.
Converting to datetime allows for more advanced date-time operations such as extracting the year, month, day, and performing time-series analysis.
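For genuinely date-like columns, a typical pandas pattern looks like this (illustrative only; the year columns in this dataset are plain integers, which the YearConverter below handles directly):

import pandas as pd

# Parse date strings into datetime and pull out components
dates = pd.to_datetime(pd.Series(['2006-07-15', '2008-01-02']))
print(dates.dt.year)   # 2006, 2008
print(dates.dt.month)  # 7, 1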
The YearConverter class below is a custom transformer designed to handle and preprocess year-related data in a dataset.
The YearConverter ensures that year-related data falls within a valid range, which is particularly important for datasets where incorrect year values could distort analysis or model performance.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class YearConverter(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.min_year = None
        self.max_year = None

    def fit(self, X, y=None):
        # Learn the valid year range from the training data,
        # capping the upper bound at the current calendar year
        X_flat = X.ravel() if X.ndim > 1 else X
        self.min_year = np.min(X_flat)
        self.max_year = min(np.max(X_flat), pd.Timestamp.now().year)
        return self

    def transform(self, X):
        # Coerce values to numeric and clip them to the learned range
        X_numeric = pd.to_numeric(X.ravel() if X.ndim > 1 else X, errors='coerce')
        X_clipped = np.clip(X_numeric, self.min_year, self.max_year)
        return X_clipped.reshape(X.shape)
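A quick check of the converter on made-up years (assuming the class definition above):

import numpy as np

# One value is clearly invalid: a year in the future
years = np.array([[1890.0], [2005.0], [2150.0]])

converter = YearConverter()
print(converter.fit_transform(years).ravel())
# The future year is capped at the current calendar year; the others pass through unchanged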
Code Repository
You can find the complete code for this data cleaning process in my GitHub repository.
Conclusion
Data cleaning and preparation are the foundational steps in any successful machine learning project. By meticulously handling missing values, outliers, and scaling features, we set the stage for building robust and reliable models. This process ensures that our data is accurate, consistent, and ready for the advanced techniques we’ll be exploring in future posts.
For safety experts, mastering these skills is critical—not just for achieving technical accuracy but for ensuring that the AI systems we develop are fair, transparent, and safe for all users. By building a strong foundation in data preparation, we reduce the risks associated with biased or faulty models, paving the way for ethical and responsible AI development.
Next Steps
In this post, we’ve tackled the essential steps of data cleaning and preparation: separating features from the target, handling missing values and outliers, and scaling. These steps ensure our data is primed for accurate modeling, reducing the risk of biases or errors.
With our data now clean and structured, the next crucial step is visualization. Data visualization will allow us to explore relationships, identify patterns, and gain insights that are key for feature selection and model building.
Join me in the next post, where we’ll dive into data visualization techniques to bring your data to life and guide your modeling decisions.