Feature Engineering Techniques

What is Feature Engineering?

Feature engineering is an essential step in the data science pipeline. One of the most important questions facing data scientists is which features or variables to focus on when building high-performing machine learning models. Poor feature selection, or a lack of attention to feature engineering, is a major reason models underperform.

As data scientists, we must avoid a garbage-in, garbage-out approach when choosing features. Not all features are relevant or useful for a particular algorithm or problem. There are several reasons to apply feature engineering techniques to select the best features, including:

  • Not all features are correlated with the target, so uncorrelated features should not be selected for model building.
  • Some of the features may be correlated with other features and hence may not play a constructive role in improving model performance.
  • Some of the features may have too many missing values, and may need to be discarded altogether. 
  • Some of the features may require further processing because they are categorical, and many algorithms cannot use categorical values directly.
  • In some cases, assumptions about data distribution may be violated. 
  • In other cases, we may need to reduce the dimensionality of the feature space to deal with noisy behavior caused by the curse of dimensionality.

Feature Engineering Approaches

Feature Engineering is typically accomplished using one of the following approaches:

  1. Feature Elimination

You can reduce the feature space by eliminating features. This has a key disadvantage as you gain no information from those features that have been dropped. For example, recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
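Below is a minimal sketch of recursive feature elimination with scikit-learn; the synthetic dataset, the logistic regression estimator, and the choice of four retained features are illustrative assumptions, not part of the original article.

```python
# Recursive feature elimination (RFE) sketch: repeatedly fit a model
# and drop the weakest feature until the requested number remains.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: 10 features, only 4 of which are informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=42)

selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=4)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)
```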

  2. Feature Selection

Feature selection is the process of reducing the number of input features when building a data science model, particularly when the feature space is large.

One goal in data science is to reduce the number of input variables (or features), both to lower the computational cost of modeling and, in some cases, to improve model performance.

Statistical-based feature selection methods involve evaluating the relationship between each input feature and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable.

You apply statistical tests to rank features according to their importance and then select a subset of features for building your model. This approach also suffers from information loss and is less stable, since different tests assign different importance scores to the features.
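As a rough sketch of this statistical (filter-based) approach, the example below scores each feature against the target with an ANOVA F-test and keeps the top k; the dataset and k=4 are illustrative assumptions.

```python
# Filter-based feature selection sketch: rank features by a univariate
# statistical test and keep only the highest-scoring ones.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative synthetic data, as in the earlier example.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=42)

selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print("Per-feature scores:", selector.scores_)
print("Shape after selection:", X_selected.shape)
```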

  3. Feature Extraction

You create new independent features, where each new feature is a combination of the old independent features. These techniques can be further divided into linear and non-linear dimensionality reduction techniques.
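A minimal sketch of feature extraction using PCA, a linear dimensionality reduction technique, is shown below; the Iris dataset and the choice of two components are illustrative assumptions.

```python
# Feature extraction sketch with PCA: project the original features
# onto new features that are linear combinations of all of them.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardise first so no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_new = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Shape after extraction:", X_new.shape)
```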

By Hassan Amin

Dr. Syed Hassan Amin holds a PhD in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his PhD, he worked on Image Processing, Computer Vision, and Machine Learning. He has carried out research and development in many areas, including Urdu and local-language Optical Character Recognition, Retail Analysis, Affiliate Marketing, Fraud Prediction, 3D reconstruction of faces from 2D images, and Retinal Image analysis.