Feature scaling is the process of normalizing or standardizing the range of independent variables (features) in a dataset. It ensures that no single feature dominates others due to differences in magnitude, which can significantly affect the performance of many machine learning algorithms.
Why is Feature Scaling Important?
Consider a dataset with two features:
- Age: ranges from 20 to 60
- Income: ranges from 30,000 to 150,000
Without scaling, algorithms that rely on distances or gradients will treat Income as far more important simply because its values are numerically larger — not because it’s actually more informative. Scaling puts all features on a level playing field.
Types of Feature Scaling
1. Min-Max Normalization (Rescaling)
Transforms features to a fixed range, typically [0, 1].
Formula:
X′ = (X − Xmin) / (Xmax − Xmin)
Example:
| Age (original) | Age (scaled) |
|---|---|
| 20 | 0.0 |
| 40 | 0.5 |
| 60 | 1.0 |
When to use: When you know the distribution is not Gaussian and the algorithm requires bounded inputs (e.g., neural networks, image pixel values).
Drawback: Sensitive to outliers. A single extreme value compresses all other values.
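A minimal sketch of Min-Max normalization in plain Python (the helper name `min_max_scale` is illustrative, not a library API):

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

ages = [20, 40, 60]
print(min_max_scale(ages))  # [0.0, 0.5, 1.0]
```

This reproduces the Age table above: the minimum maps to 0.0, the maximum to 1.0, and everything else falls proportionally in between.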
2. Standardization (Z-Score Normalization)
Centers data around mean = 0 and scales to standard deviation = 1.
Formula:
X′ = (X − μ) / σ
Example:
Salaries: [40k, 50k, 60k, 70k, 80k] → Mean = 60k, Std = 14.14k
| Salary | Z-Score |
|---|---|
| 40,000 | -1.41 |
| 60,000 | 0.00 |
| 80,000 | +1.41 |
When to use: When the algorithm benefits from zero-centered, unit-variance features or assumes roughly Gaussian inputs (e.g., Linear/Logistic Regression, SVM, PCA).
Advantage: Less affected by outliers compared to Min-Max.
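The salary example above can be sketched with the standard library's `statistics` module (using the population standard deviation, which matches the ≈14.14k figure):

```python
import statistics

def z_score_scale(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

salaries = [40_000, 50_000, 60_000, 70_000, 80_000]
scaled = z_score_scale(salaries)
print([round(z, 2) for z in scaled])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```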
3. Robust Scaling
Uses the median and Interquartile Range (IQR) instead of mean and std — making it robust to outliers.
Formula:
X′ = (X−median) / IQR
where IQR = Q3 − Q1
Example: In a salary dataset containing a 10,000,000 outlier, the other values won’t be compressed into a tiny band, because the median and IQR are barely affected by the extreme point.
When to use: When your dataset contains significant outliers.
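A sketch of robust scaling using the standard library (`robust_scale` is an illustrative helper; `statistics.quantiles` with `n=4` returns the quartiles):

```python
import statistics

def robust_scale(values):
    """Scale using the median and IQR (Q3 - Q1), so outliers barely move the bulk."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return [(x - med) / iqr for x in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one extreme outlier
scaled = robust_scale(data)
# The nine ordinary values land roughly within [-1, 1];
# only the outlier itself sits far away.
```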
4. MaxAbs Scaling
Scales each feature by its maximum absolute value, resulting in values in [-1, 1].
Formula:
X′ = X / max(∣X∣)
When to use: Works well with sparse data and preserves zero entries (common in text/NLP data).
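A sketch of MaxAbs scaling in plain Python (helper name illustrative). Note that zeros remain exactly zero, which is why this scaler preserves sparsity:

```python
def max_abs_scale(values):
    """Divide each value by the maximum absolute value; results lie in [-1, 1]."""
    m = max(abs(x) for x in values)
    return [x / m for x in values]

print(max_abs_scale([-4, 0, 2]))  # [-1.0, 0.0, 0.5]
```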
5. Log Transformation
Applies a logarithm to compress wide-ranging values.
Formula:
X′ = log(X+1)
Example: Website traffic [100, 1000, 10000, 1000000] → after a base-10 log: ≈ [2.0, 3.0, 4.0, 6.0] (the +1 in the formula is negligible at these magnitudes; it exists so that zeros remain valid inputs).
When to use: Highly skewed distributions (e.g., income, population, prices).
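The traffic example can be reproduced with a base-10 log (the formula above writes `log(X+1)` without fixing a base; base 10 is what yields the 2.0/3.0/4.0/6.0 values):

```python
import math

traffic = [100, 1_000, 10_000, 1_000_000]
# log10(x + 1): six orders of magnitude compress into the range [2, 6]
compressed = [math.log10(x + 1) for x in traffic]
print([round(v, 2) for v in compressed])  # [2.0, 3.0, 4.0, 6.0]
```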
Which Algorithms Need Feature Scaling?
| Algorithm | Needs Scaling? | Reason |
|---|---|---|
| Linear Regression | ✅ Yes | Gradient descent converges faster |
| Logistic Regression | ✅ Yes | Gradient descent converges faster; regularization assumes comparable scales |
| SVM | ✅ Yes | Maximizes margin using distances |
| K-Nearest Neighbors | ✅ Yes | Purely distance-based |
| Neural Networks | ✅ Yes | Gradient sensitivity |
| PCA | ✅ Yes | Variance-based; large scales dominate |
| Decision Trees | ❌ No | Split-based, not distance-sensitive |
| Random Forests | ❌ No | Ensemble of trees |
| Naive Bayes | ❌ No | Probability-based |
| Gradient Boosting (XGBoost) | ❌ No | Tree-based, scale-invariant |
Practical Example: KNN Without vs. With Scaling
Dataset:
| Person | Age | Income ($) | Bought? |
|---|---|---|---|
| A | 25 | 40,000 | No |
| B | 45 | 80,000 | Yes |
| C | 30 | 60,000 | ? |
Without scaling, the Euclidean distance between C and A:
d = sqrt( (30−25)² + (60000−40000)² ) = sqrt(25 + 400,000,000) ≈ 20,000
The Income completely dominates Age — Age contributes almost nothing.
With Min-Max scaling (Age → [0,1], Income → [0,1]):
d = sqrt( (0.25−0)² + (0.5−0)² ) = sqrt(0.0625 + 0.25) ≈ 0.56
Now both features contribute meaningfully to the distance.
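The before/after distances can be checked in a few lines of plain Python (the helper names are illustrative; scaling uses the [25, 45] Age range and [40k, 80k] Income range from persons A and B, as in the text):

```python
import math

# Raw feature vectors: (age, income)
A, B, C = (25, 40_000), (45, 80_000), (30, 60_000)

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Unscaled: income dominates the distance entirely
print(round(euclidean(C, A)))  # 20000

def scale_point(p, lo, hi):
    """Min-max scale each coordinate of a point given per-feature ranges."""
    return tuple((v - l) / (h - l) for v, l, h in zip(p, lo, hi))

lo, hi = (25, 40_000), (45, 80_000)
A_s, C_s = scale_point(A, lo, hi), scale_point(C, lo, hi)
print(round(euclidean(C_s, A_s), 2))  # 0.56
```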
Key Takeaways
- Always fit the scaler on training data only, then transform both train and test sets — to prevent data leakage.
- Standardization is the most general-purpose choice.
- Min-Max is best when you need bounded outputs.
- Robust Scaling is best with outliers.
- Tree-based models are naturally immune to feature scale.
- Feature scaling doesn’t change the information content — it just reframes the numeric range for algorithms to interpret fairly.
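The first takeaway, fitting on training data only, can be sketched with a hand-rolled one-dimensional scaler (illustrative class; scikit-learn's `StandardScaler` follows the same fit/transform pattern):

```python
import statistics

class StandardScaler1D:
    """Learns mean/std from one split, then applies them to any split."""
    def fit(self, values):
        self.mu = statistics.fmean(values)
        self.sigma = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(x - self.mu) / self.sigma for x in values]

train = [40_000, 50_000, 60_000, 70_000, 80_000]
test = [55_000, 90_000]

scaler = StandardScaler1D().fit(train)   # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # reuses the train mean/std
```

Calling `fit` on the test set (or on the full dataset before splitting) would leak test statistics into preprocessing, which is exactly the data-leakage problem the first takeaway warns about.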