In the previous post, we focused on Data Cleaning, ensuring that our data from the Google Analytics Sample Dataset was ready for further analysis.
Now, we move on to Feature Engineering, a crucial step in the data science process that involves creating new variables or transforming existing ones to improve the performance of our machine learning models.
Contents
- Why Feature Engineering?
- Overview of the Data
- Explanation of Each Feature Category
- Code Walkthrough: Feature Engineering
- Conclusion
Why Feature Engineering?
Feature engineering is all about extracting more meaningful information from your raw data. By creating new features or transforming existing ones, you can provide your machine learning models with the right inputs to make accurate predictions.
This step is critical because the quality and relevance of the features often determine the success of the model.
Overview of the Data
The Google Analytics Sample Dataset is structured into two main levels: session-level and hit-level data.
- Session-Level Data: This aggregates information about each user’s session on the website, such as session duration, pageviews, transactions, and traffic source. Each row represents a single session, summarizing the user’s interactions within that timeframe.
- Hit-Level Data: This provides detailed information about individual interactions (or “hits”) within a session, including pageviews, events, transactions, and product details. Hit-level data is more granular and helps us extract features that reflect specific user behaviors during each session.
Understanding both levels is essential for feature engineering, as it enables us to capture both the overall session characteristics and the finer details of user interactions.
Explanation of Each Feature Category
To engineer features effectively, we group the existing columns by the type of information they convey:
- Time-Based Features: These include variables like date, visitStartTime, hour, and minute, capturing patterns such as peak visit times and seasonal trends.
- Engagement Features: Metrics like pageviews, hits, transactions, and transactionRevenue measure user interaction with the site. Features such as “Pages per Session” or “Transaction Rate” help capture the depth and quality of user engagement.
- Device Features: These describe the technology used to access the site, including browser, deviceCategory, and operatingSystem.
- Traffic Source Features: These provide insight into how users arrived at the site, covering variables like source, medium, and campaign. Traffic source features are crucial for analyzing the effectiveness of marketing campaigns and user acquisition channels.
- Geographical Features: Features like country, region, and city capture user location, helping to analyze regional differences in behavior and tailor marketing strategies accordingly.
- Custom Dimensions: Custom dimensions are business-specific attributes defined in Google Analytics, such as loyalty program participation or product categories viewed. They let the model use signals that are specific to the business.
Understanding these categories helps us strategically engineer new features, enhancing the predictive power of our models.
Code Walkthrough: Feature Engineering
1. Time-Based Features
df_engineered['day_of_week'] = df_engineered['date'].dt.dayofweek
df_engineered['is_weekend'] = df_engineered['day_of_week'].isin([5, 6]).astype(int)
df_engineered['month'] = df_engineered['date'].dt.month
df_engineered['quarter'] = df_engineered['date'].dt.quarter
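Time-based features capture the temporal aspects of user behavior, such as weekday versus weekend traffic and seasonal patterns. Note that dt.dayofweek numbers Monday as 0, so 5 and 6 are Saturday and Sunday, and that the .dt accessor requires date to be a datetime column.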
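As a minimal sketch of the step above, the following runs the same transformations on a toy table. The `sessions` frame and its values are hypothetical, and the snippet assumes `date` arrives as a YYYYMMDD string and `visitStartTime` as POSIX seconds; adjust the parsing if your cleaned data differs.

```python
import pandas as pd

# Toy session rows (hypothetical values, not taken from the real dataset).
sessions = pd.DataFrame({
    "date": ["20170801", "20170805", "20170806"],
    "visitStartTime": [1501583974, 1501929600, 1502016000],
})

# Parse the date string so the .dt accessor becomes available.
sessions["date"] = pd.to_datetime(sessions["date"], format="%Y%m%d")
sessions["day_of_week"] = sessions["date"].dt.dayofweek  # Monday=0 ... Sunday=6
sessions["is_weekend"] = sessions["day_of_week"].isin([5, 6]).astype(int)
sessions["month"] = sessions["date"].dt.month
sessions["quarter"] = sessions["date"].dt.quarter

# Hour of day in UTC; convert to the site's time zone first if local time matters.
start = pd.to_datetime(sessions["visitStartTime"], unit="s", utc=True)
sessions["hour"] = start.dt.hour
```

The hour feature is not part of the walkthrough code, but it corresponds to the hour variable mentioned in the time-based category earlier.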
2. Session Duration
df_engineered['session_duration_seconds'] = df_engineered['totals_timeOnSite']
The session duration is a key indicator of user engagement. Longer sessions may indicate higher interest or deeper exploration of the site.
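Depending on how the cleaning step treated missing values, totals_timeOnSite may still contain gaps or strings. The sketch below is one possible way to handle that, assuming a missing duration should count as zero seconds; the `sessions` frame and the log-scaled column are illustrative additions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): durations stored as strings, one of them missing.
sessions = pd.DataFrame({"totals_timeOnSite": ["35", None, "1820"]})

# Coerce to numbers; anything unparseable or missing becomes NaN.
duration = pd.to_numeric(sessions["totals_timeOnSite"], errors="coerce")

# Assumption: a missing duration is treated as a zero-second session.
sessions["session_duration_seconds"] = duration.fillna(0)

# Optional: log1p compresses the long right tail typical of duration data.
sessions["log_session_duration"] = np.log1p(sessions["session_duration_seconds"])
```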
3. Page Views per Session
df_engineered['pageviews_per_session'] = df_engineered['totals_pageviews']
This feature measures how many pages a user viewed within a session. A higher page view count can suggest greater engagement.
4. Bounce Rate
df_engineered['is_bounce'] = (df_engineered['totals_bounces'] > 0).astype(int)
The is_bounce flag marks sessions in which the user left the site after viewing just one page. Averaging this flag across sessions gives the bounce rate; a high rate may suggest that the site is not engaging enough.
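To make the distinction between the per-session flag and the aggregate rate concrete, here is a small sketch on hypothetical data. It assumes totals_bounces is missing (NaN) for sessions that did not bounce.

```python
import pandas as pd

# Toy rows (hypothetical): NaN in totals_bounces means the session did not bounce.
sessions = pd.DataFrame({
    "device_deviceCategory": ["mobile", "mobile", "desktop", "desktop"],
    "totals_bounces": [1, 1, None, 1],
})

# Per-session flag: 1 if the session bounced, 0 otherwise.
sessions["is_bounce"] = (sessions["totals_bounces"].fillna(0) > 0).astype(int)

# The bounce rate is the mean of the flag, overall or per group.
bounce_rate = sessions["is_bounce"].mean()
bounce_rate_by_device = sessions.groupby("device_deviceCategory")["is_bounce"].mean()
```

In this toy table the overall rate is 0.75, with 1.0 for mobile and 0.5 for desktop.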
5. Device Features
df_engineered['is_mobile'] = (df_engineered['device_deviceCategory'] == 'mobile').astype(int)
df_engineered['is_tablet'] = (df_engineered['device_deviceCategory'] == 'tablet').astype(int)
df_engineered['is_desktop'] = (df_engineered['device_deviceCategory'] == 'desktop').astype(int)
Device features help understand how user experience varies by device. For instance, mobile users might have different interaction patterns compared to desktop users.
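An equivalent way to build these flags is one-hot encoding with pd.get_dummies, which creates one column per category without listing them by hand. This is a sketch on a hypothetical `sessions` frame, not the code used in the walkthrough.

```python
import pandas as pd

# Toy rows (hypothetical values).
sessions = pd.DataFrame(
    {"device_deviceCategory": ["mobile", "desktop", "tablet", "desktop"]}
)

# One 0/1 column per device category: is_desktop, is_mobile, is_tablet.
device_flags = pd.get_dummies(
    sessions["device_deviceCategory"], prefix="is", dtype=int
)
sessions = pd.concat([sessions, device_flags], axis=1)
```

The explicit comparisons in the walkthrough have one advantage: the output columns stay fixed even if a category is absent from a given batch of data.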
6. Traffic Source Features
df_engineered['is_organic_search'] = (df_engineered['trafficSource_medium'] == 'organic').astype(int)
df_engineered['is_paid_search'] = (df_engineered['trafficSource_medium'] == 'cpc').astype(int)
df_engineered['is_referral'] = (df_engineered['trafficSource_medium'] == 'referral').astype(int)
These features categorize traffic sources, helping to analyze the effectiveness of different marketing channels, such as organic search, paid search, and referrals.
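If a single categorical channel column is also useful, the same medium values can be mapped to channel names first and the flags derived from that. The mapping below is an illustrative assumption (including treating the medium "(none)" as direct traffic), and the `sessions` frame is hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values).
sessions = pd.DataFrame(
    {"trafficSource_medium": ["organic", "cpc", "referral", "(none)", "affiliate"]}
)

# Assumed mapping from medium to a coarse channel label.
medium_to_channel = {
    "organic": "organic_search",
    "cpc": "paid_search",
    "referral": "referral",
    "(none)": "direct",
}

# Media not covered by the mapping fall into a catch-all bucket.
sessions["channel"] = (
    sessions["trafficSource_medium"].map(medium_to_channel).fillna("other")
)

# The binary flags from the walkthrough follow directly from the channel column.
sessions["is_organic_search"] = (sessions["channel"] == "organic_search").astype(int)
sessions["is_paid_search"] = (sessions["channel"] == "paid_search").astype(int)
sessions["is_referral"] = (sessions["channel"] == "referral").astype(int)
```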
7. Geographical Features
df_engineered['is_us'] = (df_engineered['geoNetwork_country'] == 'United States').astype(int)
The country feature was derived from the geoNetwork_country column during the data cleaning process.
It represents the country from which the user accessed the website. The is_us feature specifically identifies sessions originating from the United States.
Understanding geographical distribution is crucial for region-specific analysis and strategy development.
By focusing on sessions from the U.S., we can tailor marketing and user experience strategies to a key demographic.
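A single is_us flag discards every other country. One possible extension, sketched here on hypothetical data, keeps the most frequent countries and groups the rest under "Other"; the cutoff of two countries and the country_grouped column name are arbitrary choices for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values).
sessions = pd.DataFrame({
    "geoNetwork_country": [
        "United States", "India", "United States",
        "Chile", "India", "United States",
    ]
})

sessions["is_us"] = (sessions["geoNetwork_country"] == "United States").astype(int)

# Keep the two most frequent countries; label everything else "Other".
top_countries = sessions["geoNetwork_country"].value_counts().nlargest(2).index
sessions["country_grouped"] = sessions["geoNetwork_country"].where(
    sessions["geoNetwork_country"].isin(top_countries), "Other"
)
```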
8. Engagement Features from Hit-Level Data
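# Named aggregation gives each output column the feature name used below
hit_level_features = hit_level_df.groupby('fullVisitorId').agg(
    total_hits=('time', 'count'),
    avg_time_per_hit=('time', 'mean'),
    max_time_per_hit=('time', 'max'),
    num_entrance_pages=('isEntrance', 'sum'),
    num_exit_pages=('isExit', 'sum'),
    num_unique_event_categories=('eventInfo.eventCategory', 'nunique'),
    num_transactions=('transaction.transactionId', 'nunique'),
    total_revenue=('transaction.transactionRevenue', 'sum'),
    num_unique_products_viewed=('item.productName', 'nunique'),
).reset_index()
# Attach the aggregates to the main table (assumes df_engineered has a fullVisitorId column)
df_engineered = df_engineered.merge(hit_level_features, on='fullVisitorId', how='left')
These features aggregate hit-level data to capture detailed user interactions:
- total_hits, avg_time_per_hit, max_time_per_hit provide insights into interaction intensity.
- num_entrance_pages, num_exit_pages help understand session flow.
- num_unique_event_categories reflects the variety of tracked events a user triggered.
- num_transactions, total_revenue capture e-commerce activity.
- num_unique_products_viewed indicates the diversity of user interests.
Because the grouping key is fullVisitorId, these values summarize each visitor across all of their sessions. If per-session values are needed and hit_level_df retains a session key such as visitId, group by both columns instead.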
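To show how the hit-level aggregates end up as columns next to the session-level data, here is a self-contained sketch on toy tables. It uses a subset of the columns, named aggregation for flat column names, and a left merge; the data, the `sessions` frame, and the choice to fill missing aggregates with zero are assumptions for illustration.

```python
import pandas as pd

# Toy hit-level rows (hypothetical): visitor A has three hits, visitor B has one.
hit_level_df = pd.DataFrame({
    "fullVisitorId": ["A", "A", "A", "B"],
    "time": [0, 12000, 30000, 0],
    "isEntrance": [1, 0, 0, 1],
    "transaction.transactionId": ["T1", "T1", None, None],
    "transaction.transactionRevenue": [0.0, 25.0, 0.0, 0.0],
})

# Named aggregation: output_name=(input_column, function).
hit_level_features = hit_level_df.groupby("fullVisitorId").agg(
    total_hits=("time", "count"),
    max_time_per_hit=("time", "max"),
    num_entrance_pages=("isEntrance", "sum"),
    num_transactions=("transaction.transactionId", "nunique"),
    total_revenue=("transaction.transactionRevenue", "sum"),
).reset_index()

# Toy session-level table; visitor C has no hit-level rows at all.
sessions = pd.DataFrame({"fullVisitorId": ["A", "B", "C"]})
sessions = sessions.merge(hit_level_features, on="fullVisitorId", how="left")

# Assumption: a visitor without hits has zero hits, transactions, and revenue.
count_cols = ["total_hits", "num_transactions", "total_revenue"]
sessions[count_cols] = sessions[count_cols].fillna(0)
```

A left merge keeps every row of the session table, so visitors without hit-level rows are retained rather than silently dropped.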
9. Derived Features
df_engineered['avg_pageviews_per_session'] = df_engineered['pageviews_per_session'] / df_engineered['totals_visits']
df_engineered['conversion_rate'] = df_engineered['num_transactions'] / df_engineered['totals_visits']
df_engineered['avg_revenue_per_session'] = df_engineered['total_revenue'] / df_engineered['totals_visits']
These features are derived to provide more granular insights:
- avg_pageviews_per_session shows engagement depth per visit.
- conversion_rate is critical for assessing e-commerce success.
- avg_revenue_per_session helps evaluate the revenue efficiency of sessions.
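Ratio features like these break when the denominator is zero or missing: pandas returns inf for x/0 and NaN for 0/0 rather than raising an error. A minimal guarded version, assuming a zero or missing visit count should yield a ratio of zero, could look like the sketch below. The `sessions` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): the second row has no recorded visits.
sessions = pd.DataFrame({
    "num_transactions": [2, 0, 1],
    "total_revenue": [50.0, 0.0, 20.0],
    "totals_visits": [4, 0, 1],
})

# Turn zero denominators into NaN so the division never produces inf.
visits = sessions["totals_visits"].replace(0, np.nan)

# Assumption: rows without visits get a ratio of 0.
sessions["conversion_rate"] = (sessions["num_transactions"] / visits).fillna(0)
sessions["avg_revenue_per_session"] = (sessions["total_revenue"] / visits).fillna(0)
```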
10. Behavioral Segmentation
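# qcut raises ValueError if quantile edges coincide (e.g. many users with zero revenue);
# ranking with method='first' breaks ties by row order so the bin edges stay unique
df_engineered['user_value_segment'] = pd.qcut(df_engineered['total_revenue'].rank(method='first'), q=4, labels=['Low', 'Medium', 'High', 'VIP'])
df_engineered['engagement_segment'] = pd.qcut(df_engineered['total_hits'].rank(method='first'), q=3, labels=['Low', 'Medium', 'High'])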
Segmentation based on behavior, such as user_value_segment and engagement_segment, helps classify users into distinct groups for targeted analysis and strategies.
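Quantile binning deserves care when many users share the same value, which is common for revenue because most visitors never purchase. The sketch below, on a hypothetical revenue series, shows that plain pd.qcut fails on such data and that ranking the values first is one workaround.

```python
import pandas as pd

# Toy revenue values (hypothetical): most users have zero revenue.
revenue = pd.Series([0, 0, 0, 0, 0, 0, 10, 40], name="total_revenue")
labels = ["Low", "Medium", "High", "VIP"]

# Plain qcut fails because several quantile edges are all 0.
try:
    pd.qcut(revenue, q=4, labels=labels)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Workaround: rank with method="first" so every value gets a distinct rank.
segments = pd.qcut(revenue.rank(method="first"), q=4, labels=labels)
```

The trade-off is that tied users are split across segments by row order, so some zero-revenue users land in "Medium" or "High". If that is undesirable, explicit thresholds with pd.cut are an alternative.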
Conclusion
In this post, we explored the essential process of feature engineering, transforming raw data into meaningful features to enhance our machine learning models.
By understanding the structure of session-level and hit-level data, we were able to extract and create features that capture both high-level session attributes and detailed user interactions.
Additionally, we derived new features like conversion rates and behavioral segments, which are crucial for specific objectives such as boosting conversions or identifying high-value customers.
These engineered features form the backbone of our upcoming models, setting the stage for more accurate predictions and data-driven decisions.
In the next post, we’ll dive into feature selection to further optimize our models.