8 Dimensionality Reduction Techniques Every Data Scientist Should Know

Essential guide to various dimensionality reduction techniques in Python

Checklist:
1. Missing Value Filter
2. Correlation Filter
3. Variance Filter
4. Forward / Backward Feature Selection Techniques
5. PCA (Principal Component Analysis)
6. t-SNE (t-distributed Stochastic Neighbor Embedding)
7. UMAP (Uniform Manifold Approximation and Projection)
8. Auto-Encoders

(1.) Missing Value Filter:

A real-world dataset often contains many missing records, caused by data corruption or failures while recording the data. One can try various imputation techniques to fill in the missing records, but this works only when a limited number of records are missing for a given feature. A feature whose missing value ratio exceeds a chosen threshold adds little information and can be dropped, as in the sketch below.

(Image by Author) Visualization of missing values: white lines denote missing records
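A minimal sketch of this filter using pandas; the DataFrame, its columns, and the 20% threshold are illustrative choices:

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with one sparse feature
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "c": [5.0, 6.0, np.nan, 8.0],        # 25% missing
})

threshold = 0.2                      # drop features missing more than 20% of records
missing_ratio = df.isnull().mean()   # fraction of missing values per column
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist())   # ['a']
```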

(2.) Correlation Filter:

Correlation between two or more features can result in the problem of multicollinearity, which undermines the statistical significance of an independent variable. The idea is to drop features that are highly correlated with other independent features; one can also drop features that are not correlated with the target class label.
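A minimal sketch of the first filter, assuming a numeric pandas DataFrame; the data and the 0.9 correlation cutoff are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
X["f4"] = X["f1"] * 2 + rng.normal(scale=0.01, size=100)  # nearly duplicates f1

cutoff = 0.9
corr = X.corr().abs()
# Keep only the upper triangle so each feature pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
X_reduced = X.drop(columns=to_drop)
print(to_drop)  # ['f4']
```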

(3.) Variance Filter:

A categorical feature with only one category, or a numerical feature with very low variance, can be excluded from the training sample, since such features contribute almost nothing to model training.
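A minimal sketch using scikit-learn's VarianceThreshold; the synthetic data and the 0.01 variance cutoff are illustrative choices:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.random((100, 5))
X[:, 0] = 1.0  # a constant feature that carries no information

selector = VarianceThreshold(threshold=0.01)  # drops near-constant features
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (100, 4)
```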

(4.) Forward / Backward Feature Selection:

The forward feature selection technique is a wrapper method for selecting the best subset of features. It is a step-wise process in which features are selected based on the inference from the previous step. The steps of the forward feature selection technique are as follows (a scikit-learn sketch appears after the list):

  1. Train a separate model on each individual feature and select the feature with the best performance.
  2. Retrain the models, pairing each remaining feature with the features already selected, and add the feature that yields the best performance to the selected list.
  3. Repeat Step 2 until you reach the desired number of features.
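A sketch of these steps using scikit-learn's SequentialFeatureSelector on a built-in dataset; the estimator and the number of features to select are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",  # use "backward" for backward elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the original features
```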

(5.) Principal Component Analysis:

Principal Component Analysis (PCA) is a classical dimensionality reduction technique. PCA projects the feature vectors onto a lower-dimensional space while preserving as much of the variance of the features as possible; it finds the directions of maximum variance to derive the new set of features. The steps are:

  1. Compute the covariance matrix of the standardized dataset.
  2. Compute the eigenvalues and eigenvectors of the covariance matrix.
  3. Take the dot product of the feature matrix with the eigenvectors having the highest eigenvalues.
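A NumPy sketch of these three steps; the synthetic data and the number of retained components k = 2 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# 1. Covariance matrix of the standardized data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# 2. Eigen-decomposition of the covariance matrix (eigh: cov is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Project onto the eigenvectors with the largest eigenvalues
k = 2
top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]
X_pca = X_std @ top
print(X_pca.shape)  # (100, 2)
```

In practice, scikit-learn's PCA class wraps these steps in a single fit_transform call.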

(6.) t-SNE:

t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique mostly used for data visualization. t-SNE converts a higher-dimensional dataset into a two- or three-dimensional embedding that can then be visualized.
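A minimal sketch using scikit-learn's TSNE on the built-in digits dataset; perplexity and random_state are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional feature vectors
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1797, 2), ready to scatter-plot
```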

(7.) UMAP:

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that works similarly to t-SNE, projecting a higher-dimensional dataset into a comparable lower-dimensional space.
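A minimal sketch, assuming the third-party umap-learn package is installed (pip install umap-learn); n_neighbors and random_state are illustrative choices:

```python
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
X_2d = umap.UMAP(n_components=2, n_neighbors=15, random_state=42).fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```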

(8.) Auto-Encoders:

An autoencoder is a neural network-based dimensionality reduction approach; in its simplest form, it is a single-hidden-layer network with two components: compression (encoder) and expansion (decoder). The number of nodes in the input and output layers is the same, whereas the middle (bottleneck) layer has fewer neurons than the input and output layers.
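A minimal Keras sketch of this encoder/bottleneck/decoder layout, assuming TensorFlow is installed; the layer sizes, synthetic data, and training settings are illustrative choices:

```python
import numpy as np
from tensorflow import keras

input_dim, bottleneck_dim = 30, 5
inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(bottleneck_dim, activation="relu")(inputs)  # compression
decoded = keras.layers.Dense(input_dim, activation="linear")(encoded)    # expansion

# The autoencoder is trained to reconstruct its own input
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim)
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# The trained encoder alone yields the reduced representation
encoder = keras.Model(inputs, encoded)
X_reduced = encoder.predict(X, verbose=0)
print(X_reduced.shape)  # (1000, 5)
```

The decoder exists only during training, to force the bottleneck layer to retain enough information to reconstruct the input.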

Conclusion:

In this article, we have discussed feature selection-based dimensionality reduction approaches (missing value, correlation, and variance filters, plus forward/backward feature selection), a component-based technique (PCA), projection-based approaches (t-SNE and UMAP), and finally neural network-based autoencoders.
