What Are Advanced Feature Engineering Techniques Like PCA and LDA?
You’re handed a dataset with dozens of features — some useful, some redundant, and some pure noise. Your task? Find what matters, simplify the data, and get your model to shine. Overwhelming? Not when you’ve got tools like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
Both PCA and LDA are dimensionality reduction techniques, but they work differently. PCA focuses on summarizing the data, while LDA prioritizes separating classes. In this article, we’ll explore both techniques using a single dataset and compare them side by side for better intuition.
Introducing the Dataset
We’ll use a simple dataset to keep things relatable. Imagine you’re working with a flower classification problem: each flower is described by four measurements (sepal length, sepal width, petal length, petal width) and belongs to one of three species (Setosa, Versicolor, or Virginica). This is exactly the classic Iris dataset.
Here’s a sample of what the data looks like:
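A minimal sketch of how you might load such a sample, using scikit-learn’s built-in copy of the Iris data (pandas is assumed for the tabular view):

```python
# Minimal sketch: scikit-learn ships a copy of the Iris dataset,
# which has exactly these four measurements and three species.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)          # requires scikit-learn >= 0.23 and pandas
df = iris.frame                          # features plus a numeric 'target' column
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.head())                         # a few sample rows
```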
Goal:
- Use PCA to simplify the dataset by reducing the number of features while retaining variance.
- Use LDA to create new features that maximize class separability (e.g., better distinguish Setosa, Versicolor, and Virginica).
Principal Component Analysis (PCA): Simplifying Data
PCA is like organizing your closet by finding the most popular colors and arranging everything along those shades. It identifies the directions (called principal components) where the data varies the most, reducing the dimensionality while preserving as much variability as possible.
How PCA Works (Step-by-Step)
1. Standardize the Data:
Since PCA is influenced by scale, all features are standardized (e.g., z-score transformation).
2. Find Principal Components:
PCA identifies directions in the data where the variance is maximized. Each principal component (PC) is a linear combination of the original features.
3. Rank Components by Variance:
The first principal component explains the most variance, the second explains the next most, and so on.
4. Transform the Data:
Project the dataset onto the top components, reducing its dimensionality.
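Putting these four steps together, here’s a minimal NumPy sketch (it assumes the `df` and `iris` objects from the loading snippet above):

```python
import numpy as np

# 1. Standardize the data (z-score each feature).
X = df[iris.feature_names].to_numpy()
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Find principal components: eigenvectors of the covariance matrix.
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric

# 3. Rank components by explained variance (largest eigenvalue first).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# 4. Transform the data: project onto the top 2 components.
X_pca_manual = X_std @ eigvecs[:, :2]
# Note: with standardized features the variance spreads more evenly
# across components than with the raw measurements.
print(explained[:2])   # share of variance carried by the first two components
```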
PCA Example: Simplifying the Flower Dataset
Using PCA, let’s reduce the 4 original features (sepal length, sepal width, petal length, petal width) into 2 principal components.
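Here’s the same reduction sketched with scikit-learn. One caveat worth hedging: the exact split of variance across components depends on whether you standardize first. The roughly 90% / 5% figures quoted below correspond to running PCA on the raw centimeter measurements; standardizing spreads the variance more evenly.

```python
from sklearn.decomposition import PCA

# X is the (150, 4) feature matrix from the snippets above.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)              # raw measurements; all four features are in cm

print(pca.explained_variance_ratio_)      # roughly [0.92, 0.05] on raw Iris data
```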
Here:
- PC1 captures 90% of the variance in the dataset.
- PC2 adds another 5%.
So, by keeping just 2 components, we’ve reduced the dataset’s dimensionality from 4 to 2 while retaining 95% of the variance!
What PCA Tells Us
PCA focuses solely on summarizing data. While it helps simplify datasets for tasks like clustering or visualization, it doesn’t care about the flower species (the target variable). If you plotted PC1 vs. PC2, you might see some clusters, but PCA doesn’t guarantee they’ll align with the species.
Linear Discriminant Analysis (LDA): Separating Classes
LDA, on the other hand, is like organizing your closet by finding colors that separate work clothes from casual wear. It creates a new feature space where the classes (e.g., Setosa, Versicolor, Virginica) are as distinct as possible.
How LDA Works (Step-by-Step)
1. Compute Class Means:
Calculate the mean of each feature for each class.
2. Maximize Between-Class Variance:
LDA maximizes the distance between the class means to separate them clearly.
3. Minimize Within-Class Variance:
Simultaneously, it minimizes the spread of data within each class.
4. Transform the Data:
The original dataset is projected onto the directions (linear discriminants) that maximize class separability.
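Here’s a minimal from-scratch sketch of those scatter-matrix computations (again assuming the standardized features `X_std` and the `df` frame from earlier):

```python
import numpy as np

y = df["target"].to_numpy()               # 0 = Setosa, 1 = Versicolor, 2 = Virginica
classes = np.unique(y)
overall_mean = X_std.mean(axis=0)

# 1. Compute class means.
class_means = {c: X_std[y == c].mean(axis=0) for c in classes}

# 2. Between-class scatter: how far each class mean sits from the overall mean.
S_B = sum(
    (y == c).sum() * np.outer(class_means[c] - overall_mean, class_means[c] - overall_mean)
    for c in classes
)

# 3. Within-class scatter: how spread out each class is around its own mean.
S_W = sum(
    (X_std[y == c] - class_means[c]).T @ (X_std[y == c] - class_means[c])
    for c in classes
)

# 4. Transform: the linear discriminants are the top eigenvectors of S_W^-1 S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real            # keep the top 2 discriminant directions
X_lda_manual = X_std @ W
```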
LDA Example: Separating the Flower Dataset
Using LDA, let’s reduce the 4 features into 2 linear discriminants (LD1, LD2).
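With scikit-learn, the same reduction is just a few lines (the `X` array and `df` frame from the earlier snippets are assumed):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

y = df["target"].to_numpy()               # LDA needs the class labels, unlike PCA
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(lda.explained_variance_ratio_)      # how the class-separating variance splits across LD1 and LD2
```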
Here:
- LD1 captures the separation between Setosa and the other two species.
- LD2 captures the separation between Versicolor and Virginica.
When plotted, flowers from different species form distinct clusters. This makes LDA especially powerful for classification tasks.
What LDA Tells Us
Unlike PCA, LDA uses the target variable (species) to guide the feature transformation. It ensures that the new features (LD1, LD2) make it easier for your model to classify the flowers correctly.
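Because LDA is supervised, it slots naturally into a classification pipeline. A hedged sketch (this is just one reasonable setup, not the only way to use LDA):

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Reduce to 2 class-separating features, then classify on them.
clf = make_pipeline(
    LinearDiscriminantAnalysis(n_components=2),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())     # cross-validated accuracy on the LDA-reduced features
```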
PCA vs. LDA: Key Differences
Here’s how PCA and LDA compare when applied to the same flower dataset:
- Supervision: PCA is unsupervised and ignores the species labels; LDA is supervised and uses them.
- Objective: PCA keeps the directions with the most variance; LDA keeps the directions that best separate the classes.
- New features: PCA produces principal components (PC1, PC2, …); LDA produces linear discriminants (LD1, LD2, …).
- Number of components: PCA can keep up to as many components as there are original features; LDA is limited to at most the number of classes minus one, so 2 for the three species here.
- Typical use: PCA for exploration, visualization, and clustering; LDA as a preprocessing step for classification.
Visualizing the Difference
If you plot the results of PCA and LDA side by side (see the sketch after this list):
- PCA: The clusters (species) may overlap because it’s only concerned with variance.
- LDA: The clusters are more distinct because it’s explicitly designed for class separation.
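As a sketch of that side-by-side plot (assuming `X_pca`, `X_lda`, `y`, and `iris` from the snippets above, plus matplotlib):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, Z, title in [(axes[0], X_pca, "PCA"), (axes[1], X_lda, "LDA")]:
    for label, name in enumerate(iris.target_names):
        # One scatter per species so each gets its own color and legend entry.
        ax.scatter(Z[y == label, 0], Z[y == label, 1], label=name, alpha=0.7)
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.show()
```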
When to Use Which?
- Use PCA when:
- You have no target variable.
- You’re exploring data or performing clustering.
- You want to visualize high-dimensional data.
- Use LDA when:
- You’re working on a classification task.
- You need to reduce dimensions while maintaining class separability.
Conclusion: PCA and LDA Are Your Power Tools
PCA and LDA are like two sides of the same coin. While PCA simplifies data by focusing on variance, LDA transforms it to highlight class distinctions. By understanding when and how to use these techniques, you can turn messy datasets into structured, insightful inputs for your models.
Cohorte Team
December 6, 2024