Solving data science problems requires systematic thinking and a structured approach. Here are some of the key concepts and ideas to apply when solving a data science problem:
Step 1 | Identify Type of Problem
Typical problems include classification, regression, recommendation systems, and reinforcement learning. Before getting started, you should have a clear idea of the type of problem you are going to solve.
Step 2 | Basic Understanding of the Training Set
The training set may have too few samples, or it may be very large. It may or may not contain the features or information needed to solve the problem at hand. It may also be imbalanced, meaning examples of one class are far more or far less common than those of another.
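A quick first pass with pandas surfaces most of these issues. The sketch below is a minimal example assuming a hypothetical train.csv file with a label target column:

```python
import pandas as pd

# Hypothetical training file and target column; adjust to your dataset.
df = pd.read_csv("train.csv")

print(df.shape)                                   # samples vs. features
print(df.isna().sum())                            # missing values per column
print(df["label"].value_counts(normalize=True))   # class balance (imbalance check)
```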
Step 3 | Accuracy
We have to weigh the accuracy an algorithm can achieve against the risks involved when choosing it: some applications tolerate an approximate answer, while others demand the best accuracy achievable.
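One practical way to gauge accuracy before committing to an algorithm is cross-validation. A minimal sketch with scikit-learn, using a bundled toy dataset so it is self-contained:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy: the mean estimates performance,
# the standard deviation hints at how stable that estimate is.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```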
Step 4 | Training Time
Depending on the scenario, we may or may not have the luxury of long training cycles, so an algorithm's training time can be a deciding factor.
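Measuring a training cycle directly is straightforward. A minimal sketch, assuming scikit-learn estimators and a synthetic dataset so the comparison is self-contained:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

# Time a full fit for a cheap model and a more expensive one.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=300)):
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{type(model).__name__}: {time.perf_counter() - start:.2f}s")
```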
Step 5 | Linearity
Some algorithms can only solve a problem well if it is linearly separable. If that's not the case, they can give poor accuracy.
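The classic two-moons dataset illustrates this. A minimal sketch comparing a linear SVM with a kernelized one on data that is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-circles: no straight line separates the classes.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

linear = SVC(kernel="linear")   # restricted to a straight decision boundary
rbf = SVC(kernel="rbf")         # can bend the boundary around the moons
print("linear:", cross_val_score(linear, X, y, cv=5).mean())
print("rbf:   ", cross_val_score(rbf, X, y, cv=5).mean())
```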
Step 6 | Number of Parameters
Parameters affect an algorithm's behavior, such as its error tolerance or number of iterations. Typically, algorithms with large numbers of parameters require the most trial and error (and time) to find a good combination.
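In scikit-learn, every estimator exposes its tunable parameters through get_params(), which gives a quick sense of how large the search space is. A minimal sketch:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Fewer knobs usually means less trial and error to tune.
print(len(LogisticRegression().get_params()))             # relatively few parameters
print(len(GradientBoostingClassifier().get_params()))     # considerably more
print(sorted(GradientBoostingClassifier().get_params()))  # the parameter names
```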
Step 7 | Optimize Hyperparameters
There are three common options for optimizing hyperparameters: grid search, random search, and Bayesian optimization.
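Grid search and random search are built into scikit-learn; Bayesian optimization requires a third-party library (for example scikit-optimize or Optuna) and is omitted here. A minimal sketch of the first two:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
params = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}

# Grid search tries every combination; random search samples a subset.
grid = GridSearchCV(SVC(), params, cv=5).fit(X, y)
rand = RandomizedSearchCV(SVC(), params, n_iter=5, cv=5, random_state=0).fit(X, y)
print(grid.best_params_)
print(rand.best_params_)
```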
Step 8 | Number of Features
The number of features in some datasets can be very large compared to the number of data points. This is often the case with genetic or textual data. A large number of features can bog down some learning algorithms, making training time infeasibly long. Some algorithms, such as Support Vector Machines, are particularly well suited to this case.
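Text data makes the point concretely: TF-IDF turns a modest corpus into tens of thousands of sparse features, and a linear SVM still trains quickly. A minimal sketch (note that fetch_20newsgroups downloads the corpus on first use):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Two newsgroups, vectorized into a very high-dimensional sparse matrix.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
print(cross_val_score(clf, data.data, data.target, cv=3).mean())
```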
Step 9 | Feature Engineering
Feature engineering is about creating new input features from your existing ones.
In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition. Deep learning algorithms can deduce features implicitly and therefore may not require the creation of new features, whereas classical machine learning algorithms may benefit from additional derived features.
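A minimal sketch of derived features in pandas, using a small hypothetical transactions table; the three new columns are classic examples of interaction, date decomposition, and domain-knowledge flags:

```python
import pandas as pd

# Hypothetical transactions table used only for illustration.
df = pd.DataFrame({
    "price": [100.0, 250.0, 80.0],
    "quantity": [2, 1, 5],
    "signup_date": pd.to_datetime(["2021-01-10", "2020-06-01", "2021-03-15"]),
})

df["total"] = df["price"] * df["quantity"]         # interaction feature
df["signup_month"] = df["signup_date"].dt.month    # date decomposition
df["is_bulk"] = (df["quantity"] >= 3).astype(int)  # domain-knowledge flag
print(df)
```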
Step 10 | Tree Ensembles
In many cases, no single algorithm gives us the desired results. Ensembles are machine learning methods that combine the predictions of multiple separate models.
While bagging and boosting are both ensemble methods, they approach the problem from opposite directions: bagging uses complex base models and tries to “smooth out” their predictions, while boosting uses simple base models and tries to “boost” their aggregate complexity.
Ensembling is a general term, but when the base models are decision trees, they have special names: random forests and boosted trees!
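A minimal sketch of both, using scikit-learn's implementations and a bundled toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Random forest: bagging of deep trees. Boosted trees: boosting of shallow ones.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
boosted = GradientBoostingClassifier(random_state=0)
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
print("boosted trees:", cross_val_score(boosted, X, y, cv=5).mean())
```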