Data cleaning is a crucial step in the data science pipeline. It involves preparing and transforming raw data into a format suitable for analysis.
In this post, we’ll explore how to clean and prepare data from the Google Analytics Sample Dataset in BigQuery for various machine learning tasks.
We’ll go through key steps such as handling missing values, feature scaling, and encoding categorical variables, using Python and popular data processing libraries.
You can find the complete code in my GitHub repository.
The Importance of Data Cleaning
Data cleaning transforms raw, messy data into a structured format that’s ready for analysis. By flattening nested structures, handling missing data, converting data types, and encoding categorical variables, we address many common data quality issues. Clean data allows us to:
- Perform more accurate analyses
- Build more reliable machine learning models
- Make data-driven decisions with confidence
- Save time in the long run by avoiding errors and inconsistencies
The Dataset
We’re working with Google Analytics data, which is stored in BigQuery. This dataset contains rich information about website visits, including user interactions, traffic sources, and transaction data.
However, it comes with its own set of challenges, including nested data structures.
The Data Cleaning Process
Let’s break down our data cleaning process into several key steps.
1. Data Extraction
First, we extract the data from BigQuery using a SQL query:
from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are configured

query = """
SELECT *
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`
LIMIT 10000
"""
df = client.query(query).to_dataframe()
2. Flattening Nested Columns
One of the challenges with this dataset is that some columns contain nested data. We flatten these nested structures to make the data easier to work with:
import pandas as pd

def flatten_nested_columns(df):
    # These GA columns hold nested records; expand each into its own columns
    nested_columns = ['totals', 'trafficSource', 'device', 'geoNetwork']
    flattened_dfs = []
    for col in nested_columns:
        flattened = pd.json_normalize(df[col])
        flattened.columns = [f'{col}_{subcol}' for subcol in flattened.columns]
        flattened_dfs.append(flattened)
    # Drop the original nested columns and attach the flattened ones
    df_flattened = pd.concat([df.drop(nested_columns, axis=1)] + flattened_dfs, axis=1)
    return df_flattened
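To see what this pattern does, here it is run on a two-row toy frame (the column and field names below are invented for illustration; the real dataset has far more of both):

```python
import pandas as pd

df = pd.DataFrame({
    "visitId": [1, 2],
    "totals": [{"pageviews": 3}, {"pageviews": 7}],
    "device": [{"browser": "Chrome"}, {"browser": "Safari"}],
})

nested_columns = ["totals", "device"]
flattened_dfs = []
for col in nested_columns:
    # json_normalize turns a list of dicts into a DataFrame, one key per column
    flattened = pd.json_normalize(df[col].tolist())
    flattened.columns = [f"{col}_{subcol}" for subcol in flattened.columns]
    flattened_dfs.append(flattened)

# Nested dicts become prefixed top-level columns like 'totals_pageviews'
df_flattened = pd.concat([df.drop(nested_columns, axis=1)] + flattened_dfs, axis=1)
```

Prefixing each flattened column with its parent name avoids collisions when two nested records share a key.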
3. Handling Missing Data
Missing data can skew our analysis, so we handle it using imputation techniques:
from sklearn.impute import SimpleImputer

# Fill numeric gaps with the column median, categorical gaps with the mode
numeric_imputer = SimpleImputer(strategy='median')
df_cleaned[numeric_columns] = numeric_imputer.fit_transform(df_cleaned[numeric_columns])
categorical_imputer = SimpleImputer(strategy='most_frequent')
df_cleaned[categorical_columns] = categorical_imputer.fit_transform(df_cleaned[categorical_columns])
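The snippet above assumes `numeric_columns` and `categorical_columns` are already defined. One common way to derive them (a sketch with invented column names, not necessarily what the accompanying repository does) is pandas’ `select_dtypes`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column, each with a gap
df_cleaned = pd.DataFrame({
    "pageviews": [3.0, np.nan, 7.0],
    "browser": ["Chrome", np.nan, "Chrome"],
})

numeric_columns = df_cleaned.select_dtypes(include="number").columns
categorical_columns = df_cleaned.select_dtypes(include="object").columns

# Median fills the numeric gap; the most frequent value fills the categorical one
df_cleaned[numeric_columns] = SimpleImputer(strategy="median").fit_transform(df_cleaned[numeric_columns])
df_cleaned[categorical_columns] = SimpleImputer(strategy="most_frequent").fit_transform(df_cleaned[categorical_columns])
```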
4. Data Type Conversion
We ensure that our data types are consistent and appropriate for analysis:
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'], format='%Y%m%d')
for col in df_cleaned.columns:
    if col in numeric_columns:
        df_cleaned[col] = pd.to_numeric(df_cleaned[col], errors='coerce')
    elif col in categorical_columns:
        df_cleaned[col] = df_cleaned[col].astype(str)
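One thing worth noting about `errors='coerce'`: values that cannot be parsed become NaN rather than raising, so they flow into the imputation step instead of crashing the pipeline. A tiny illustration with made-up values:

```python
import pandas as pd

raw = pd.Series(["10", "3.5", "not_set"])
# "not_set" cannot be parsed as a number, so it becomes NaN
clean = pd.to_numeric(raw, errors="coerce")
```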
5. Feature Scaling
Scaling is essential for machine learning algorithms that are sensitive to the magnitude of features, such as SVMs, k-nearest neighbors, and regularized linear models. We’ll standardize our numeric features using StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_cleaned[numeric_columns] = scaler.fit_transform(df_cleaned[numeric_columns])
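As a quick sanity check that standardization behaves as expected, a toy column should come out with zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
# StandardScaler subtracts the mean and divides by the population standard deviation
scaled = StandardScaler().fit_transform(X)
```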
6. Encoding Categorical Variables
Categorical variables need to be converted into numerical format before they can be used in machine learning models. We’ll use LabelEncoder for this purpose.
from sklearn.preprocessing import LabelEncoder

for col in categorical_columns:
    le = LabelEncoder()  # fit a fresh encoder per column so each mapping is independent
    df_cleaned[col] = le.fit_transform(df_cleaned[col])
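Note that scikit-learn documents LabelEncoder for encoding target labels; for feature columns, OrdinalEncoder or one-hot encoding is often preferred. If you do label-encode features, keeping each fitted encoder makes the integer codes invertible later. A minimal sketch with invented column names:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "browser": ["Chrome", "Safari", "Chrome"],
    "country": ["US", "DE", "US"],
})

encoders = {}
for col in ["browser", "country"]:
    le = LabelEncoder()              # one encoder per column
    df[col] = le.fit_transform(df[col])
    encoders[col] = le               # keep it so the encoding stays invertible

# Classes are sorted alphabetically, so 'Chrome' -> 0 and 'Safari' -> 1
recovered = encoders["browser"].inverse_transform(df["browser"])
```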
7. Extracting Hit-Level Data
In addition to session-level data, we can extract hit-level data, which provides more granular information about user interactions.
def extract_hit_level_data(df):
    # Give each session a stable id, then expand the 'hits' list into one row per hit
    sessions = df.reset_index().rename(columns={'index': 'session_id'})
    hits_data = sessions[['session_id', 'fullVisitorId', 'visitId', 'date', 'hits']].explode('hits')
    hits_data = hits_data.reset_index(drop=True)
    # Flatten each hit record into columns such as 'page.pagePath'
    hit_details = pd.json_normalize(hits_data['hits'].tolist())
    hits_data = pd.concat([hits_data.drop(columns='hits'), hit_details], axis=1)
    hits_data['date'] = pd.to_datetime(hits_data['date'], format='%Y%m%d')
    # Keep the columns relevant for hit-level analysis
    columns_to_keep = [
        'session_id', 'fullVisitorId', 'visitId', 'date',
        'hitNumber', 'time', 'hour', 'minute',
        'isEntrance', 'isExit', 'page.pagePath', 'page.pageTitle',
        'eventInfo.eventCategory', 'eventInfo.eventAction', 'eventInfo.eventLabel',
        'transaction.transactionId', 'transaction.transactionRevenue',
        'item.productName', 'item.productCategory', 'item.productSKU', 'item.itemRevenue'
    ]
    hits_df = hits_data[columns_to_keep].copy()
    return hits_df

hit_level_df = extract_hit_level_data(df)
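The key move here, turning a list-valued hits column into one row per hit, can be seen on a tiny invented frame:

```python
import pandas as pd

sessions = pd.DataFrame({
    "visitId": [1, 2],
    "hits": [[{"hitNumber": 1}, {"hitNumber": 2}], [{"hitNumber": 1}]],
})

# explode() repeats the session row once per hit in its list
hits = sessions.explode("hits").reset_index(drop=True)
# json_normalize then flattens each hit dict into columns
hit_details = pd.json_normalize(hits["hits"].tolist())
hit_level = pd.concat([hits.drop(columns="hits"), hit_details], axis=1)
```

The first session had two hits and the second had one, so two rows become three while the session identifiers repeat alongside them.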
Conclusion
In this blog post, we walked through the process of cleaning and preparing the Google Analytics Sample Dataset for machine learning tasks.
By flattening nested columns, handling missing values, scaling features, and encoding categorical variables, we prepared the data for advanced analysis, such as predicting e-commerce sales, customer segmentation, and more.
This cleaned dataset can now be used to train and evaluate machine learning models, leading to actionable insights and data-driven decisions.