Essential guide to various dimensionality reduction techniques in Python
Exploratory Data Analysis is an important component of the data science model development pipeline. A data scientist spends most of their time on data cleaning, feature engineering, and other data wrangling tasks. Dimensionality Reduction is one of the techniques used by data scientists while performing feature engineering.
Dimensionality Reduction is the process of transforming a higher-dimensional dataset to a comparable lower-dimensional space. A real-world dataset often has a lot of redundant features. Dimensionality reduction techniques can be used to get rid of such redundant features or convert the n-dimensional datasets to 2 or 3 dimensions for visualization.
In this article, we will discuss 8 such dimensionality reduction techniques that can be used for various use cases to reduce the dimensionality of a dataset:
1. Missing Value
2. Correlation Filter
3. Variance Filter
4. Forward / Backward Feature Selection Techniques
5. PCA (Principal Component Analysis)
6. t-SNE (t-distributed Stochastic Neighbor Embedding)
7. UMAP (Uniform Manifold Approximation and Projection)
8. Auto Encoders
(1.) Missing Value:
A real-world dataset often contains a lot of missing records, which may be caused by data corruption or by a failure while recording the data. One can try various data imputation techniques to fill in the missing records, but this only works if a limited number of records are missing for a feature.
If the share of missing values for a feature is greater than a chosen threshold, then it’s better to remove the feature from the training data. One can remove all the features whose share of missing records is greater than a threshold (say 50%), thus reducing the dimensionality of the data.
A missing-values plot of the Titanic data, generated with the missingno package, shows that the features ‘Age’ and ‘Cabin’ have a large number of missing records. A feature whose share of missing records exceeds the chosen threshold can be removed from the training sample.
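The threshold rule described above can be sketched in a few lines of Pandas. This is only an illustration: the helper name `drop_sparse_features` and the toy values are made up here, and the 50% threshold is the example value from the text, not a recommendation.

```python
import numpy as np
import pandas as pd


def drop_sparse_features(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of `df` without columns whose missing share exceeds `threshold`."""
    missing_fraction = df.isna().mean()  # per-column share of missing values
    keep = missing_fraction[missing_fraction <= threshold].index
    return df[keep].copy()


# Toy frame standing in for a real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10],
    "age": [22.0, np.nan, 26.0, 35.0],         # 25% missing, stays
    "cabin": [np.nan, "C85", np.nan, np.nan],  # 75% missing, removed
})

reduced = drop_sparse_features(toy, threshold=0.5)
print(list(reduced.columns))
```

Features that stay below the threshold, such as `age` in the toy frame, would still need imputation before most models can use them.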
(2.) Correlation Filter:
Correlation between two or more features may result in the problem of multicollinearity. Multicollinearity can undermine the statistical significance of an independent variable. The idea is to drop the features that are highly correlated with other independent features. One can also drop the features that are not correlated with the target class label.
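A minimal sketch of such a filter, assuming numerical features and Pearson correlation, might look like this. The function name `drop_correlated_features`, the 0.9 threshold, and the synthetic data are illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df, threshold=0.9, method="pearson"):
    """Drop one feature from every pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr(method=method).abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic data: "a_copy" is almost a rescaled copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated_features(toy, threshold=0.9)
print(dropped)
```

Which member of a correlated pair is dropped depends on column order here; in practice one may prefer to keep the feature that is more strongly related to the target.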
There are various techniques to measure the relationship between the independent features, including the Pearson, Spearman, and Kendall correlation coefficients for numerical features and the Chi-square test for categorical features.
Figure: heatmap of the correlation matrix for the Titanic dataset.
(3.) Variance Filter:
A categorical feature with only one category, or a numerical feature with very low variance, can be excluded from the training sample, since such features carry little information for model training.
DataFrame.var() computes the variance of the numerical features of a Pandas data frame.
Series.value_counts(), called on a single column (for example df['feature'].value_counts()), computes the distribution of the values of that feature.
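As a rough sketch, both checks can be combined into one helper. The name `low_information_columns` and the threshold value are assumptions made for this example. Note that variance depends on the scale of a feature, so a fixed threshold is only meaningful for features on comparable scales.

```python
import pandas as pd


def low_information_columns(df, var_threshold=1e-3):
    """List numerical columns with variance below `var_threshold`
    and non-numerical columns that contain a single category."""
    variances = df.select_dtypes(include="number").var()
    low_variance = variances[variances < var_threshold].index.tolist()

    categorical = df.select_dtypes(exclude="number")
    single_category = [
        col for col in categorical.columns
        if categorical[col].nunique(dropna=True) <= 1
    ]
    return low_variance + single_category


# Toy frame with invented values.
toy = pd.DataFrame({
    "height": [1.60, 1.75, 1.82, 1.68],
    "sensor_bias": [0.5, 0.5, 0.5, 0.5],  # constant, zero variance
    "country": ["IN", "IN", "IN", "IN"],  # only one category
    "sex": ["m", "f", "f", "m"],
})

to_drop = low_information_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```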
(4.) Forward / Backward Feature Selection:
The forward feature selection technique is a wrapper technique to select the best set of features. It’s a step-wise process, and the features are selected based on the inference from the previous step. The steps of the forward feature selection technique are:
- Train a machine learning model using each of the d features individually, and measure the performance of each model.
- Take the feature with the best performance and retrain individual models that combine it with each of the remaining features, one at a time.
- The feature that gives the best performance is added to the list of features selected in the previous step.
- Repeat steps 2 and 3 until you get the desired number of features.
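The steps above can be sketched as a simple greedy loop. This is an illustrative implementation, not a reference one: the function name `forward_selection`, the choice of logistic regression, 5-fold cross-validated accuracy as the performance measure, and the iris dataset are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, n_features, model=None, cv=5):
    """Greedily add the feature that most improves the cross-validated score."""
    if model is None:
        model = LogisticRegression(max_iter=1000)
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        # Score every candidate together with the features chosen so far.
        scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


X, y = load_iris(return_X_y=True)
chosen = forward_selection(X, y, n_features=2)
print("Selected column indices:", chosen)
```

Because every candidate requires retraining the model, the cost grows quickly with the number of features, which is the usual trade-off of wrapper techniques.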
The backward feature selection technique is similar to forward feature selection but works in the opposite direction: initially all the features are selected, and the most redundant feature is removed in each step.
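Scikit-learn (version 0.24 and later) ships a `SequentialFeatureSelector` that supports both directions. The snippet below is a small usage sketch for the backward variant; the estimator, the dataset, and the target of 2 features are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from all features and remove one per step until 2 are left.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="backward",
    cv=5,
)
selector.fit(X, y)

mask = selector.get_support()      # boolean mask of the kept features
X_reduced = selector.transform(X)  # dataset restricted to the kept features
print(mask, X_reduced.shape)
```

Setting `direction="forward"` runs the forward variant with the same interface.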
(5.) Principal Component Analysis:
Principal Component Analysis (PCA) is one of the oldest dimensionality reduction techniques. PCA projects the feature vectors onto a lower-dimensional space while preserving as much of the variance of the data as possible. It finds the directions of maximum variance (the principal components) and uses them as a new, smaller set of features.
PCA can be used to project very high-dimensional data into a desired dimensionality. The steps of the PCA algorithm are:
- Column-standardize the dataset.
- Compute the covariance matrix of the standardized dataset.
- Compute the eigenvalues and eigenvectors of the covariance matrix.
- Take the dot product of the feature vectors with the eigenvectors that have the highest eigenvalues.
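These four steps translate almost line by line into NumPy. The sketch below assumes a purely numerical dataset with no constant columns (otherwise the standardization would divide by zero); the function name `pca_from_scratch` and the random data are illustrative.

```python
import numpy as np


def pca_from_scratch(X, n_components):
    """Project X onto its top `n_components` principal components."""
    # 1. Column-standardize the dataset.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized data (features in columns).
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition; eigh is suited to symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # sort by decreasing eigenvalue
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 4. Dot product with the eigenvectors that have the highest eigenvalues.
    projected = X_std @ eigenvectors[:, :n_components]
    return projected, eigenvalues


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)  # a redundant feature

projected, eigenvalues = pca_from_scratch(X, n_components=2)
print(projected.shape)
print(eigenvalues / eigenvalues.sum())  # share of variance per component
```

The ratio of each eigenvalue to their sum indicates how much variance the corresponding component retains, which helps when choosing the number of components.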
The Scikit-learn package comes with an implementation of PCA; read the documentation to learn more about the implementation.
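A short usage sketch of that implementation is shown below. The iris dataset and the choice of 2 components are assumptions for the example; `PCA` centers the data itself but does not scale it, so the standardization step is done explicitly.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so that every feature contributes on the same scale.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # variance share kept by each component
```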
(6.) t-SNE:
t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique mostly used for data visualization. t-SNE converts a higher-dimensional dataset into 2 or 3-dimensional vectors, which can then be visualized.
For visualization, t-SNE often reveals cluster structure better than PCA because it is non-linear and preserves the local structure of the data: it embeds each data point from the higher-dimensional space into a lower-dimensional space so that neighboring points stay close together.
The Scikit-learn package comes with an implementation of t-SNE; read the documentation to learn more about the implementation.
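As a hedged usage sketch, the snippet below embeds a subset of the digits dataset into 2 dimensions. The dataset, the subset size, and the parameter values (`perplexity=30`, PCA initialization) are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # a small subset keeps the example fast

# perplexity roughly controls how many neighbors each point attends to;
# it must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)
```

The two columns of `X_embedded` can be passed to any scatter plot, colored by `y`. Unlike PCA, this estimator has no `transform` method for new data, so the embedding is computed for the given points only.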
(7.) UMAP:
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that works similarly to t-SNE by projecting the higher-dimensional dataset to a comparable lower-dimensional space.
UMAP builds a neighbor graph in the original space of the data and tries to find a similar graph in a lower-dimensional space.
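The first half of that idea, a neighbor graph in the original space, can be illustrated with scikit-learn's `kneighbors_graph`. To be clear, this is not UMAP itself: UMAP (available in the separate umap-learn package) additionally weights the edges and optimizes a low-dimensional layout with a similar graph. The dataset and the value of 5 neighbors are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X, _ = load_iris(return_X_y=True)

# Sparse adjacency matrix: entry (i, j) is 1 if j is among the
# 5 nearest neighbors of i in the original 4-dimensional space.
graph = kneighbors_graph(X, n_neighbors=5, mode="connectivity", include_self=False)

print(graph.shape)  # one row and one column per data point
print(graph.nnz)    # number of stored edges
```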
(8.) Auto Encoders:
An Auto Encoder is a neural network-based dimensionality reduction approach; in its simplest form it has a single hidden layer. It has 2 components: compression (encoder) and expansion (decoder). The number of nodes in the input and output layers is the same, whereas the middle layer has fewer neurons than the input and output layers.
The dataset is passed to the autoencoder neural network and is encoded into the lower-dimensional hidden layer. The decoder then tries to reconstruct, from this reduced encoding, a representation as close as possible to the original input. The activations of the middle layer are the reduced, lower-dimensional representation of the input.
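Autoencoders are usually built with a deep learning framework. To stay within scikit-learn, the sketch below swaps in `MLPRegressor` trained to reproduce its own input through a 2-neuron hidden layer, and then reads the encoding from the fitted weights. The helper names `encode` and `decode`, the dataset, and the hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# 4 inputs -> 2 hidden neurons (bottleneck) -> 4 outputs, trained with target = input.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,), activation="tanh", max_iter=2000, random_state=0
)
autoencoder.fit(X_std, X_std)


def encode(model, data):
    """Encoder half: input layer to bottleneck activations."""
    return np.tanh(data @ model.coefs_[0] + model.intercepts_[0])


def decode(model, codes):
    """Decoder half: bottleneck to output layer (linear output for regression)."""
    return codes @ model.coefs_[1] + model.intercepts_[1]


codes = encode(autoencoder, X_std)            # the 2-dimensional representation
reconstruction = decode(autoencoder, codes)   # attempt to rebuild the input

print(codes.shape, reconstruction.shape)
print("Mean squared reconstruction error:", np.mean((reconstruction - X_std) ** 2))
```

The `codes` array plays the role of the middle layer described above; a lower reconstruction error means the bottleneck retains more of the original information.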
In this article, we have discussed feature selection-based dimensionality reduction approaches, component-based techniques, projection-based approaches, and finally neural network-based autoencoders.