Data cleaning is a crucial step in the data science pipeline. It involves preparing and transforming raw data into a format suitable for analysis.
In this post, we’ll explore how to clean and prepare data from the Google Analytics Sample Dataset in BigQuery for various machine learning tasks.
We’ll go through key steps such as handling missing values, feature scaling, and encoding categorical variables, using Python and popular data processing libraries.
You can find the complete code in my GitHub repository.
The Importance of Data Cleaning
Data cleaning transforms raw, messy data into a structured format that’s ready for analysis. By flattening nested structures, handling missing data, converting data types, and encoding categorical variables, we address many common data quality issues. Clean data allows us to:
- Perform more accurate analyses
- Build more reliable machine learning models
- Make data-driven decisions with confidence
- Save time in the long run by avoiding errors and inconsistencies
The Dataset
We’re working with Google Analytics data, which is stored in BigQuery. This dataset contains rich information about website visits, including user interactions, traffic sources, and transaction data.
However, it comes with its own set of challenges, including nested data structures.
The Data Cleaning Process
Let’s break down our data cleaning process into several key steps.
1. Data Extraction
First, we extract the data from BigQuery using a SQL query:
from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are configured

query = """
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
LIMIT 10000
"""
df = client.query(query).to_dataframe()
2. Flattening Nested Columns
One of the challenges with this dataset is that some columns contain nested data. We flatten these nested structures to make the data easier to work with:
import pandas as pd

def flatten_nested_columns(df):
    # These GA columns hold nested records; expand each into its own columns
    nested_columns = ['totals', 'trafficSource', 'device', 'geoNetwork']
    flattened_dfs = []
    for col in nested_columns:
        flattened = pd.json_normalize(df[col])
        flattened.columns = [f'{col}_{subcol}' for subcol in flattened.columns]
        flattened_dfs.append(flattened)
    # Drop the original nested columns and attach the flattened ones
    df_flattened = pd.concat([df.drop(nested_columns, axis=1)] + flattened_dfs, axis=1)
    return df_flattened
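To see what this pattern does, here it is run on a two-row toy frame (the column and field names below are invented for illustration; the real dataset has far more of both):

```python
import pandas as pd

df = pd.DataFrame({
    "visitId": [1, 2],
    "totals": [{"pageviews": 3}, {"pageviews": 7}],
    "device": [{"browser": "Chrome"}, {"browser": "Safari"}],
})

nested_columns = ["totals", "device"]
flattened_dfs = []
for col in nested_columns:
    # json_normalize turns a list of dicts into a DataFrame, one key per column
    flattened = pd.json_normalize(df[col].tolist())
    flattened.columns = [f"{col}_{subcol}" for subcol in flattened.columns]
    flattened_dfs.append(flattened)

# Nested dicts become prefixed top-level columns like 'totals_pageviews'
df_flattened = pd.concat([df.drop(nested_columns, axis=1)] + flattened_dfs, axis=1)
```

Prefixing each flattened column with its parent name avoids collisions when two nested records share a key.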
3. Handling Missing Data
Missing data can skew our analysis, so we handle it using imputation techniques:
from sklearn.impute import SimpleImputer

# Fill numeric gaps with the column median, categorical gaps with the mode
numeric_imputer = SimpleImputer(strategy='median')
df_cleaned[numeric_columns] = numeric_imputer.fit_transform(df_cleaned[numeric_columns])
categorical_imputer = SimpleImputer(strategy='most_frequent')
df_cleaned[categorical_columns] = categorical_imputer.fit_transform(df_cleaned[categorical_columns])
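The snippet above assumes `numeric_columns` and `categorical_columns` are already defined. One common way to derive them (a sketch with invented column names, not necessarily what the accompanying repository does) is pandas’ `select_dtypes`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column, each with a gap
df_cleaned = pd.DataFrame({
    "pageviews": [3.0, np.nan, 7.0],
    "browser": ["Chrome", np.nan, "Chrome"],
})

numeric_columns = df_cleaned.select_dtypes(include="number").columns
categorical_columns = df_cleaned.select_dtypes(include="object").columns

# Median fills the numeric gap; the most frequent value fills the categorical one
df_cleaned[numeric_columns] = SimpleImputer(strategy="median").fit_transform(df_cleaned[numeric_columns])
df_cleaned[categorical_columns] = SimpleImputer(strategy="most_frequent").fit_transform(df_cleaned[categorical_columns])
```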
4. Data Type Conversion
We ensure that our data types are consistent and appropriate for analysis:
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%d')
for col in df_cleaned.columns:
    if col in numeric_columns:
        df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors='coerce')
    elif col in categorical_columns:
        df_cleaned[col] = df_cleaned[col].astype(str)
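One thing worth noting about `errors='coerce'`: values that cannot be parsed become NaN rather than raising, so they flow into the imputation step instead of crashing the pipeline. A tiny illustration with made-up values:

```python
import pandas as pd

raw = pd.Series(["10", "3.5", "not_set"])
# "not_set" cannot be parsed as a number, so it becomes NaN
clean = pd.to_numeric(raw, errors="coerce")
```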
5. Feature Scaling
Scaling is essential for machine learning algorithms that are sensitive to the magnitude of features, such as SVMs, k-nearest neighbors, and regularized linear models. We’ll standardize our numeric features using StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_cleaned[numeric_columns] = scaler.fit_transform(df_cleaned[numeric_columns])
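As a quick sanity check that standardization behaves as expected, a toy column should come out with zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
# StandardScaler subtracts the mean and divides by the population standard deviation
scaled = StandardScaler().fit_transform(X)
```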
6. Encoding Categorical Variables
Categorical variables need to be converted into numerical format before they can be used in machine learning models. We’ll use LabelEncoder for this purpose.
from sklearn.preprocessing import LabelEncoder

for col in categorical_columns:
    le = LabelEncoder()  # fit a fresh encoder per column so each mapping is independent
    df_cleaned[col] = le.fit_transform(df_cleaned[col])
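Note that scikit-learn documents LabelEncoder for encoding target labels; for feature columns, OrdinalEncoder or one-hot encoding is often preferred. If you do label-encode features, keeping each fitted encoder makes the integer codes invertible later. A minimal sketch with invented column names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome"],
    "country": ["US", "DE", "US"],
})

encoders = {}
for col in ["browser", "country"]:
    le = LabelEncoder()              # one encoder per column
    df[col] = le.fit_transform(df[col])
    encoders[col] = le               # keep it so the encoding stays invertible

# Classes are sorted alphabetically, so 'Chrome' -> 0 and 'Safari' -> 1
recovered = encoders["browser"].inverse_transform(df["browser"])
```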
7. Extracting Hit-Level Data
In addition to session-level data, we can extract hit-level data, which provides more granular information about user interactions.
def extract_hit_level_data(df):
    # Give each session a stable id, then expand the 'hits' list into one row per hit
    sessions = df.reset_index().rename(columns={'index': 'session_id'})
    hits_data = sessions[['session_id', 'fullVisitorId', 'visitId', 'date', 'hits']].explode('hits')
    hits_data = hits_data.reset_index(drop=True)
    # Flatten each hit record into columns such as 'page.pagePath'
    hit_details = pd.json_normalize(hits_data['hits'].tolist())
    hits_data = pd.concat([hits_data.drop(columns='hits'), hit_details], axis=1)
    hits_data['date'] = pd.to_datetime(hits_data['date'], format='%Y%m%d')
    # Keep the columns relevant for hit-level analysis
    columns_to_keep = [
        'session_id', 'fullVisitorId', 'visitId', 'date',
        'hitNumber', 'time', 'hour', 'minute',
        'isEntrance', 'isExit', 'page.pagePath', 'page.pageTitle',
        'eventInfo.eventCategory', 'eventInfo.eventAction', 'eventInfo.eventLabel',
        'transaction.transactionId', 'transaction.transactionRevenue',
        'item.productName', 'item.productCategory', 'item.productSKU', 'item.itemRevenue'
    ]
    hits_df = hits_data[columns_to_keep].copy()
    return hits_df

hit_level_df = extract_hit_level_data(df)
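The key move here, turning a list-valued hits column into one row per hit, can be seen on a tiny invented frame:

```python
import pandas as pd

sessions = pd.DataFrame({
    "visitId": [1, 2],
    "hits": [[{"hitNumber": 1}, {"hitNumber": 2}], [{"hitNumber": 1}]],
})

# explode() repeats the session row once per hit in its list
hits = sessions.explode("hits").reset_index(drop=True)
# json_normalize then flattens each hit dict into columns
hit_details = pd.json_normalize(hits["hits"].tolist())
hit_level = pd.concat([hits.drop(columns="hits"), hit_details], axis=1)
```

The first session had two hits and the second had one, so two rows become three while the session identifiers repeat alongside them.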
Conclusion
In this blog post, we walked through the process of cleaning and preparing the Google Analytics Sample Dataset for machine learning tasks.
By flattening nested columns, handling missing values, scaling features, and encoding categorical variables, we prepared the data for advanced analysis, such as predicting e-commerce sales, customer segmentation, and more.
This cleaned dataset can now be used to train and evaluate machine learning models, leading to actionable insights and data-driven decisions.