What is Feature Engineering?
Feature engineering is an important step in the data science pipeline. One of the central questions facing data scientists is which features, or variables, to focus on when building high-performing machine learning models. Poor feature selection, or a lack of attention to feature engineering, is one of the most common causes of poor model performance.
As data scientists, we must avoid a garbage-in, garbage-out outcome when choosing features. Not all features are relevant or useful for a particular algorithm or problem. There are several reasons for applying feature engineering techniques to select the best features, including:
- Not all features are correlated with the target, so uncorrelated features should not be selected for model building (a quick check is sketched after this list).
- Some features may be highly correlated with other features, making them redundant and unlikely to improve model performance.
- Some features may have too many missing values and may need to be discarded altogether.
- Some features are categorical and require further processing, such as encoding, before many algorithms can use them.
- In some cases, a model's assumptions about the data distribution may be violated, so features may need to be transformed.
- In other cases, we may need to reduce the dimensionality of the feature space to cope with noise and the curse of dimensionality.
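As a rough illustration of the first three points, the sketch below uses pandas to check each feature's correlation with the target, pairwise correlations between features, and the fraction of missing values per feature. The file name `data.csv` and the column name `target` are hypothetical placeholders.

```python
# A minimal sketch of quick feature checks with pandas.
# Assumes a hypothetical CSV with a numeric "target" column.
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset

# 1. Correlation of each numeric feature with the target
target_corr = df.corr(numeric_only=True)["target"].drop("target")
print(target_corr.sort_values(ascending=False))

# 2. Pairwise correlation between features (to spot redundant ones)
feature_corr = df.drop(columns="target").corr(numeric_only=True)
print(feature_corr)

# 3. Fraction of missing values per feature
missing_ratio = df.isna().mean()
print(missing_ratio[missing_ratio > 0.5])  # candidates for removal
```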
Feature Engineering Approaches
Feature Engineering is typically accomplished using one of the following approaches:
- Feature Elimination
You can reduce the feature space by eliminating features. The key disadvantage is that you retain no information from the features that are dropped. For example, recursive feature elimination (RFE) is a feature selection method that repeatedly fits a model and removes the weakest feature (or features) until the specified number of features is reached.
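A minimal, self-contained sketch of RFE using scikit-learn, assuming a logistic regression estimator and synthetic data (these choices are illustrative, not prescribed by the method):

```python
# Recursive feature elimination: repeatedly fit the estimator and drop
# the weakest feature until only n_features_to_select remain.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier
```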
- Feature Selection
Feature selection is the process of reducing the number of input features used when building a data science model.
One of the goals in data science is to reduce the number of input variables (or features), both to reduce the computational cost of modeling and, in some cases, to improve the performance of the model.
Statistics-based feature selection methods evaluate the relationship between each input feature and the target variable using statistical tests, and select the input features that have the strongest relationship with the target.
In practice, you apply statistical tests to rank features by importance and then select a subset of them for building your model. This again suffers from information loss, and it can be unstable because different tests assign different importance scores to the same features.
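A minimal sketch of this idea, assuming scikit-learn's SelectKBest with an ANOVA F-test as the ranking statistic and synthetic data:

```python
# Score each feature against the target with an ANOVA F-test
# and keep only the k highest-scoring features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)                    # importance score per feature
print(selector.get_support(indices=True))  # indices of the retained features
print(X_selected.shape)                    # (500, 5)
```

Swapping in a different scoring function (for example, mutual information) can change which features are kept, which illustrates the instability mentioned above.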
- Feature Extraction
You create new independent features, where each new feature is a combination of the original features. These techniques can be divided into linear and non-linear dimensionality reduction techniques.
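As one common linear example, the sketch below applies PCA from scikit-learn; non-linear alternatives such as kernel PCA follow the same fit/transform pattern:

```python
# Linear feature extraction with PCA: each principal component is a
# linear combination of all the original features.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so no single feature dominates the components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_new = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance captured by each component
print(X_new.shape)                    # (150, 2)
```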