Solving Data Science Problems

Solving data science problems requires systematic thinking and a structured approach. Here are some of the key concepts and ideas to apply when tackling a data science problem:

Step 1 Identify the Type of Problem

Typical problems include classification, regression, recommendation systems, reinforcement learning, and so on. Before you start, you should have a clear idea of the type of problem you are trying to solve.

Step 2 Basic Understanding of the Training Set

The training set may have too few samples, or it may be very large. It may or may not contain the features or information needed to solve the problem at hand. It may also be imbalanced, meaning that examples of one class are far more numerous than examples of another.
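A quick first pass over the training set can surface these issues early. A minimal sketch, assuming a pandas DataFrame with a categorical target column named "label" (both the data and the column names here are hypothetical):

```python
import pandas as pd

# Toy training set with a heavily imbalanced target column.
df = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
    "label":   ["a", "a", "a", "a", "a", "b"],
})

print(df.shape)          # how many samples and columns?
print(df.isna().sum())   # any missing values per column?

# Class balance: a large ratio between the biggest and smallest
# class signals an imbalanced training set.
counts = df["label"].value_counts()
imbalance_ratio = counts.max() / counts.min()
print(counts)
print(imbalance_ratio)
```

Checks like these take seconds and often decide whether you need resampling, more data collection, or different evaluation metrics before any modeling begins.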

Step 3 Accuracy

When choosing an algorithm, be conscious of the accuracy it can achieve and the risks involved if its predictions are wrong.

Step 4 Training Time

Depending on the scenario, we may or may not have the luxury of long training cycles, and algorithms differ widely in how long they take to train.

Step 5 Linearity

Some algorithms can only solve problems that are linearly separable. If the data is not linearly separable, then such algorithms can yield poor accuracy.
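The classic illustration is the XOR pattern, which no straight line can separate. A minimal sketch using scikit-learn's SVC (the data here is a toy XOR set built for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Four XOR points, repeated so the classifiers have a little data.
# No straight line can separate class 0 from class 1 here.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)

# A linear SVM cannot fit XOR; an RBF-kernel SVM can.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(linear_acc, rbf_acc)
```

The kernelized model reaches perfect training accuracy on this data, while the linear model cannot, which is exactly the gap to watch for when your problem is not linearly separable.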

Step 6 Number of Parameters

Parameters affect an algorithm's behavior, such as its error tolerance or number of iterations. Typically, algorithms with large numbers of parameters require the most trial and error (and time) to find a good combination.

Step 7 Optimize Hyperparameters

There are three common options for optimizing hyperparameters: grid search, random search, and Bayesian optimization.
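The first two are built into scikit-learn; Bayesian optimization requires a separate library (for example scikit-optimize or Optuna) and is omitted from this sketch. The dataset and parameter ranges below are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: every combination is tried (3 x 2 = 6 candidates,
# each cross-validated on 5 folds).
grid = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: a fixed budget of draws from a distribution,
# useful when the grid would be too large to try exhaustively.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2)}, n_iter=10, cv=5, random_state=0
)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```

Grid search cost grows multiplicatively with each added parameter, which is why random search is often preferred once more than two or three hyperparameters are in play.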

Step 8 Number of Features 

The number of features in some datasets can be very large compared to the number of data points. This is often the case with genetic or textual data. A large number of features can bog down some learning algorithms, making training time infeasibly long. Other algorithms, such as Support Vector Machines, are particularly well suited to this case.
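Text data makes this concrete: even a tiny corpus produces more TF-IDF features than documents, and a linear SVM handles the resulting sparse, high-dimensional matrix comfortably. The documents and labels below are toy assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "cheap meds buy now limited offer",
    "win a free prize claim your reward today",
    "meeting moved to thursday afternoon",
    "please review the attached quarterly report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham (toy labels)

# Every distinct word becomes a feature, so the feature count
# quickly exceeds the sample count.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (n_documents, n_vocabulary_terms)

clf = LinearSVC().fit(X, labels)
```

With real corpora the vocabulary easily reaches tens of thousands of features, which is exactly the regime where linear SVMs remain fast while many other algorithms stall.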

Step 9 Feature Engineering

Feature engineering is about creating new input features from your existing ones.

In general, you can think of data cleaning as a process of subtraction and feature engineering as a process of addition. Deep learning algorithms can deduce features implicitly and therefore may not require the creation of new features, whereas classical machine learning algorithms often benefit from additional derived features.
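Two common patterns are ratio features and date decomposition. A minimal sketch on a toy housing-style table (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [300000, 450000, 150000],
    "area_sqft": [1500, 2250, 1000],
    "sale_date": pd.to_datetime(["2021-06-01", "2021-12-15", "2022-03-10"]),
})

# Ratio feature: price per square foot is often more informative
# than price and area separately.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Date decomposition: month and year can carry seasonal signal
# that a raw timestamp hides from many models.
df["sale_month"] = df["sale_date"].dt.month
df["sale_year"] = df["sale_date"].dt.year
```

Each new column is pure addition: the original features stay in place, and the model simply gains extra, more directly useful inputs.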

Step 10 Tree Ensembles

In many cases, no single algorithm gives the desired results. Ensembles are machine learning methods that combine the predictions of multiple separate models.

While bagging and boosting are both ensemble methods, they approach the problem from opposite directions. Bagging uses complex base models and tries to “smooth out” their predictions while boosting uses simple base models and tries to “boost” their aggregate complexity.

Ensembling is a general term, but when the base models are decision trees, they have special names: random forests and boosted trees!
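Both tree-ensemble families are available in scikit-learn; the dataset below is a synthetic stand-in for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging of deep, high-variance trees: a random forest averages
# their predictions to smooth them out.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting of shallow, high-bias trees: gradient boosting adds
# trees sequentially, each correcting the previous ones' errors.
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(forest.score(X, y), boosted.score(X, y))
```

Note how the two constructors encode the bagging-versus-boosting contrast described above: the forest's trees are grown independently and averaged, while the boosted trees are kept shallow and built one after another.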

By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his PhD, he worked on Image Processing, Computer Vision, and Machine Learning. He has done research and development in many areas, including Urdu and local-language Optical Character Recognition, Retail Analysis, Affiliate Marketing, Fraud Prediction, 3D reconstruction of faces from 2D images, and Retinal Image analysis.