Data Science Interview Questions and Answers

Q. What are the differences between Supervised and Unsupervised Learning?

Answer: Supervised learning requires labeled training data. For example, in order to do classification (a supervised learning task), you first need to label the data you will use to train the model to classify data into your labeled groups. Unsupervised learning, in contrast, does not require labeling the data explicitly.

| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Uses known and labelled data as input | Uses unlabelled data as input |
| Has a feedback mechanism | Has no feedback mechanism |
| Most commonly used algorithms: decision trees, logistic regression, and support vector machines | Most commonly used algorithms: k-means clustering, hierarchical clustering, and the apriori algorithm |

Q. How can you avoid Overfitting your model?

Overfitting refers to a model that fits its training data so closely that it works well only on that data and fails to generalize to anything else. There are three main methods to avoid overfitting:

  • Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
  • Use cross-validation techniques, such as k-fold cross-validation
  • Use regularization techniques, such as LASSO, that penalize model parameters likely to cause overfitting (the last two tactics are sketched in code below)
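
A minimal sketch of those two tactics with scikit-learn; the synthetic dataset and the alpha value are illustrative assumptions, not prescriptions:

```python
# Sketch: cross-validation + LASSO regularization to guard against overfitting.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data with many noisy features (illustrative only)
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# LASSO penalizes coefficients, driving uninformative ones to zero
model = Lasso(alpha=1.0)

# 5-fold cross-validation estimates out-of-sample performance
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean R^2 across folds:", scores.mean())
```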

Q. What is the Trade-Off between Bias and Variance?

Answer: Bias is error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially breaks down the expected learning error of any algorithm into three parts: total error = bias² + variance + irreducible error, where the irreducible part comes from noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain some variance; to reach the optimally reduced amount of error, you’ll have to trade off bias and variance. You don’t want either high bias or high variance in your model.

Q. How is KNN different from K-means Clustering?

Answer: K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this really means is that in order for K-Nearest Neighbors to work, you need labeled data into which to classify an unlabeled point (hence the nearest-neighbor part). K-means clustering requires only a set of unlabeled points and a chosen number of clusters k: the algorithm iteratively assigns points to the nearest cluster centroid and recomputes each centroid as the mean of its assigned points.

The critical difference here is that KNN needs labeled points and is thus supervised learning, while k-means doesn’t—and is thus unsupervised learning.
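
A minimal scikit-learn sketch of the contrast; the toy blobs and parameter choices are illustrative assumptions:

```python
# Sketch: KNN needs labels y; k-means needs only X and a chosen k.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# KNN is supervised: it uses the labels y to classify a new point
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("KNN prediction:", knn.predict([[0.0, 0.0]]))

# k-means is unsupervised: it sees only X and the number of clusters k
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster assignment:", km.predict([[0.0, 0.0]]))
```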

Q. Explain how a ROC curve works?

Answer: The ROC curve is a graphical representation of the true positive rate plotted against the false positive rate at various classification thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) and the fall-out, or the probability it will trigger a false alarm (false positives).
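
A minimal sketch of computing the points of an ROC curve with scikit-learn; the synthetic dataset and the logistic regression classifier are illustrative assumptions:

```python
# Sketch: trace fpr/tpr pairs as the decision threshold varies.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

# One (fpr, tpr) point per threshold; plotting these gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```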

Q. Define Precision and Recall?

Answer: Recall is also known as the true positive rate: the amount of positives your model claims compared to the actual number of positives there are throughout the data. Precision is also known as the positive predictive value, and it is a measure of the amount of accurate positives your model claims compared to the number of positives it actually claims. It can be easier to think of recall and precision in the context of a case where you’ve predicted that there were 10 apples and 5 oranges in a case of 10 apples. You’d have perfect recall (there are actually 10 apples, and you predicted there would be 10) but 66.7% precision because out of the 15 events you predicted, only 10 (the apples) are correct.
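
Working that example through the formulas (plain Python; the counts come straight from the apples-and-oranges scenario above):

```python
# Sketch: recall and precision for the basket example.
tp = 10   # predicted apples that really are apples
fp = 5    # predicted oranges that are not actually there (false alarms)
fn = 0    # actual apples that were missed

recall = tp / (tp + fn)       # 10 / 10 = 1.0   -> perfect recall
precision = tp / (tp + fp)    # 10 / 15 ≈ 0.667 -> 66.7% precision
print(f"recall={recall:.3f}, precision={precision:.3f}")
```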

Q. What is deep learning, and how does it contrast with other machine learning algorithms?

Answer: Deep learning is a subset of machine learning that is concerned with neural networks: how to use backpropagation and certain principles from neuroscience to more accurately model large sets of unlabelled or semi-structured data. What sets deep learning apart from other machine learning algorithms is that it learns representations of the data itself through layered neural nets, and it can be applied in supervised as well as unsupervised settings.

Q. How would you differentiate between univariate, bivariate, and multivariate analysis ?

Univariate

Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it. 

Example: height of students 

Bivariate

Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.

Example: temperature and ice cream sales in the summer season

Multivariate

When data involves three or more variables, it is categorized as multivariate. It is similar to bivariate analysis but contains more than one dependent variable.

Example: house price prediction using area, location, condition of the house, and other features 

Q. What are the feature selection methods used to select the right variables?

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods

This involves: 

  • Linear discriminant analysis
  • ANOVA
  • Chi-Square

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in. 

Wrapper Methods

This involves: 

  • Forward Selection: We test one feature at a time and keep adding them until we get a good fit
  • Backward Selection: We test all the features and start removing them to see what works better
  • Recursive Feature Elimination: Recursively looks through all the different features and how they pair together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method. 
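
A minimal sketch of one filter method (the chi-square test) and one wrapper method (recursive feature elimination) with scikit-learn; the iris dataset and the choice of keeping two features are illustrative assumptions:

```python
# Sketch: filter vs. wrapper feature selection.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: score each feature independently of any model
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Filter kept features:", filter_selector.get_support())

# Wrapper: repeatedly fit a model and drop the weakest feature
wrapper_selector = RFE(LogisticRegression(max_iter=1000),
                       n_features_to_select=2).fit(X, y)
print("Wrapper kept features:", wrapper_selector.get_support())
```

Note how the wrapper method has to refit the model on every elimination step, which is exactly why it is so much more computationally expensive than the filter approach.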

Q. You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

The following are ways to handle missing data values:

If the data set is large, we can simply remove the rows with missing data values. This is the quickest way, and the remaining data is still large enough to build the model on.

For smaller data sets, we can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, e.g., df.fillna(df.mean()).
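
A minimal pandas sketch of both tactics, on a made-up toy DataFrame:

```python
# Sketch: dropping vs. mean-imputing missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "income": [50, 60, np.nan, 80, 90]})

# Large data set: drop rows containing missing values
df_dropped = df.dropna()

# Small data set: fill missing values with each column's mean
df_filled = df.fillna(df.mean())
print(df_filled)
```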

Q. What are dimensionality reduction and its benefits?

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) that conveys similar information concisely. 

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches). 
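
One common concrete technique is principal component analysis (PCA), which the question doesn’t name but which illustrates the idea well; a minimal scikit-learn sketch follows, where the digits dataset and the choice of 10 components are illustrative assumptions:

```python
# Sketch: reduce 64 dimensions to 10 while retaining most of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per sample
pca = PCA(n_components=10).fit(X)            # keep 10 components
X_reduced = pca.transform(X)

print("Original shape:", X.shape)            # (1797, 64)
print("Reduced shape:", X_reduced.shape)     # (1797, 10)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```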

Q. What are Recommender Systems?

A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:

Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”

Content-based Filtering

As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.
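
As a minimal illustrative sketch (not how Last.fm or Pandora actually implement it), item-based collaborative filtering can be reduced to computing similarities between items’ rating vectors; the tiny rating matrix below is a made-up assumption:

```python
# Sketch: item-based collaborative filtering via cosine similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated" (toy data)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]])

# Similarity between items, based on which users rated them alike
item_sim = cosine_similarity(ratings.T)
print("Similarity of item 0 to all items:", item_sim[0].round(2))
```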

Q. How can you select k for k-means? 

We use the elbow method to select k for k-means clustering. The idea is to run k-means on the data set for a range of candidate values of k and, for each run, compute the within-cluster sum of squares (WSS): the sum of the squared distances between each member of a cluster and its centroid. Plotting WSS against k, the “elbow” where the curve stops dropping sharply marks a good choice of k, as sketched below.
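
A minimal sketch of the elbow method with scikit-learn (which exposes WSS as inertia_); the blob dataset is an illustrative assumption:

```python
# Sketch: compute WSS for a range of k and look for the bend.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  WSS={km.inertia_:.1f}")  # drops sharply until the true k
```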

Q. What is the Significance of p-value?

p-value typically ≤ 0.05

This indicates strong evidence against the null hypothesis; so you reject the null hypothesis.

p-value typically > 0.05

This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

p-value at cutoff 0.05

This is considered to be marginal, meaning it could go either way.
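
As a minimal sketch of where a p-value comes from in practice, here is a two-sample t-test with SciPy; the simulated samples and the 0.05 cutoff are illustrative assumptions:

```python
# Sketch: obtaining and interpreting a p-value from a t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)   # control sample
b = rng.normal(loc=0.5, scale=1.0, size=50)   # shifted sample

t_stat, p_value = stats.ttest_ind(a, b)
if p_value <= 0.05:
    print(f"p={p_value:.4f}: reject the null hypothesis")
else:
    print(f"p={p_value:.4f}: fail to reject the null hypothesis")
```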

Q. How can outlier values be treated?

You can drop an outlier outright only if it is a garbage value.

Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value. In this case, outliers can be removed.

If the outliers have extreme values, they can be removed. For example, if all the data points are clustered between zero to 10, but one point lies at 100, then we can remove this point.

If you cannot drop outliers, you can try the following:

Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.

Try normalizing the data. This way, the extreme data points are pulled to a similar range.

You can use algorithms that are less affected by outliers; an example would be random forests.
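
One common concrete rule for the “extreme values” case is the interquartile-range (IQR) rule; a minimal sketch follows, where the toy data and the conventional 1.5× multiplier are illustrative assumptions:

```python
# Sketch: flag and remove outliers with the IQR rule.
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
print("Outliers removed:", outliers)   # [100]
```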

Q. You are part of a data science team that is working for a national fast-food chain. You create a simple report that shows a trend: customers who visit the store more often and buy smaller meals spend more than customers who visit less frequently and buy larger meals. What is the most likely diagram that your team created?

  •  multiclass classification diagram
  •  linear regression and scatter plots
  •  pivot table
  •  K-means cluster diagram

Answer: Linear regression and scatter plots

Q. You work for an organization that sells a spam filtering service to large companies. Your organization wants to transition its product to use machine learning. It currently has a list of 250,000 keywords. If a message contains more than a few of these keywords, then it is identified as spam. What would be one advantage of transitioning to machine learning?

  •  The product would look for new patterns in spam messages.
  •  The product could go through the keyword list much more quickly.
  •  The product could have a much longer keyword list.
  •  The product could find spam messages using far fewer keywords.

Answer: The product could find spam messages using far fewer keywords.

Q. You work for a music streaming service and want to use supervised machine learning to classify music into different genres. Your service has collected thousands of songs in each genre, and you used this as your training data. Now you pull out a small random subset of all the songs in your service. What is this subset called?

  •  Data cluster
  •  Supervised set
  •  big data
  •  test data

Answer: Test data

Q. In traditional computer programming, you input commands. What do you input with machine learning?

  •  patterns
  •  programs
  •  rules
  •  data

Answer: Data

Q. Your company wants to predict whether existing automotive insurance customers are more likely to buy homeowners insurance. It created a model to better predict the best customers to contact about homeowners insurance, and the model had a low variance but high bias. What does that say about the data model?

  •  It was consistently wrong.
  •  It was inconsistently wrong.
  •  It was consistently right.
  •  It was equally right and wrong.

Answer: It was consistently wrong.

Q. You want to identify global weather patterns that may have been affected by climate change. To do so, you want to use machine learning algorithms to find patterns that would otherwise be imperceptible to a human meteorologist. What is the best place to start?

  •  Find labelled data of sunny days so that the machine will learn to identify bad weather.
  •  Use unsupervised learning to have the machine look for anomalies in a massive weather database.
  •  Create a training set of unusual patterns and ask the machine learning algorithms to classify them.
  •  Create a training set of normal weather and have the machine look for similar patterns.

Answer: Use unsupervised learning to have the machine look for anomalies in a massive weather database.

Q. You work in a data science team that wants to improve the accuracy of its K-nearest neighbor result by running on top of a naive Bayes result. What is this an example of?

  •  regression
  •  boosting
  •  bagging
  •  stacking

Answer: Stacking

Q. What’s the difference between a Generative and Discriminative model?

Answer: A generative model learns how the data was generated, i.e., the distribution of each category of data, while a discriminative model simply learns the boundary between different categories of data. Discriminative models will generally outperform generative models on classification tasks.

Q. What Cross-Validation technique would you use on a time series dataset?

Answer: Instead of using standard k-fold cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data; it is inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn’t hold in earlier years!

You’ll want to do something like forward chaining, where you model on past data and then test on forward-facing data (see the sketch after the folds below):

  • Fold 1 : training [1], test [2]
  • Fold 2 : training [1 2], test [3]
  • Fold 3 : training [1 2 3], test [4]
  • Fold 4 : training [1 2 3 4], test [5]
  • Fold 5 : training [1 2 3 4 5], test [6]
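
scikit-learn’s TimeSeriesSplit implements exactly this forward-chaining scheme; the six-sample toy array below is an illustrative assumption:

```python
# Sketch: forward-chaining cross-validation on a time series.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six consecutive time periods

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print(f"training {train_idx + 1}, test {test_idx + 1}")
# Prints: training [1], test [2] ... training [1 2 3 4 5], test [6]
```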

Q. How is a Decision Tree Pruned?

Answer: Pruning is what happens in decision trees when branches that have weak predictive power are removed in order to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: starting from the leaves, replace each node with its most popular class, and if the change does not decrease predictive accuracy, keep it. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
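
Cost complexity pruning is exposed directly in scikit-learn through the ccp_alpha parameter; here is a minimal sketch, where the dataset and the alpha value are illustrative assumptions:

```python
# Sketch: cost complexity pruning shrinks the tree.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("Unpruned leaves:", unpruned.get_n_leaves())
print("Pruned leaves:  ", pruned.get_n_leaves())
```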

Q. Which is more important to you: model accuracy or model performance?

Answer: A question like this tests your grasp of the nuances of machine learning model performance! Machine learning interviews often focus on the details: there are models with higher accuracy that perform worse in predictive power. How does that make sense?

Well, it has everything to do with how model accuracy is only a subset of model performance, and a sometimes misleading one at that. For example, if you wanted to detect fraud in a massive dataset with millions of samples, the most accurate model would most likely predict no fraud at all if only a tiny minority of cases were fraud. However, this would be useless for a predictive model: a model designed to find fraud that asserted there was no fraud at all! Questions like this help you demonstrate that you understand model accuracy isn’t the be-all and end-all of model performance.

Q. What’s the F1 score? How would you use it?

Answer: The F1 score is a measure of a model’s performance. It is the harmonic mean of the model’s precision and recall, F1 = 2 × (precision × recall) / (precision + recall), with results tending to 1 being the best and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
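
A minimal sketch of the formula next to scikit-learn’s helper; the toy labels are illustrative assumptions:

```python
# Sketch: F1 by hand vs. sklearn's f1_score.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print("F1 by hand:   ", 2 * p * r / (p + r))
print("F1 (sklearn): ", f1_score(y_true, y_pred))
```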

Q. How would you handle an Imbalanced dataset?

Answer: An imbalanced dataset is when you have, for example, a classification task where 90% of the data is in one class. That leads to problems: an accuracy of 90% can be misleading if the model has no predictive power on the other category of data!

Here are a few tactics to get over the hump:

  • Collect more data to even the imbalances in the dataset.
  • Resample the dataset to correct for imbalances.
  • Try a different algorithm altogether on your dataset.

What’s important here is that you have a keen sense of the damage an imbalanced dataset can cause, and of how to balance it.
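
A minimal sketch of the resampling tactic, upsampling the minority class with scikit-learn’s resample utility; the synthetic 90/10 split is an illustrative assumption:

```python
# Sketch: upsample the minority class until the classes match.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
df = pd.DataFrame(X)
df["label"] = y

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample the minority class with replacement to equalize class counts
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```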


By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his PhD, he worked on image processing, computer vision, and machine learning. He has done research and development in many areas, including Urdu and local-language optical character recognition, retail analysis, affiliate marketing, fraud prediction, 3D reconstruction of faces from 2D images, and retinal image analysis.