Ten famous questions for a basic data science interview, Part III

Hicham AMAR · March 7, 2022

The data science interview is an essential step with the technical staff of the company you want to join. It is therefore important to prepare yourself to answer theoretical questions, to work through technical scenarios, and sometimes to write code in the language you have mastered (Python or R).

Can you explain the difference between a test set and a validation set?

The validation set can be considered part of the training data: it is held out during model building and used for hyperparameter selection and to detect overfitting. The test set, on the other hand, is used only to evaluate the performance of the final trained machine learning model, so it should play no part in training or tuning.
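As a rough illustration of the three roles, the sketch below shuffles a data set and cuts it into training, validation and test parts. The function name `train_val_test_split`, the 60/20/20 proportions and the fixed seed are assumptions made for this example, not a prescribed recipe.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # touched once, for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used to tune hyperparameters
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

The key design point is that the three lists never overlap, so a score measured on the test part is not influenced by any tuning decision.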

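How will you find the correlation between a categorical variable and a continuous variable?

You can use the analysis of covariance (ANCOVA) technique to measure the association between a categorical variable and a continuous variable. With no additional covariates this reduces to a one-way analysis of variance (ANOVA), which tests whether the mean of the continuous variable differs across the categories.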
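The simplest version of this idea, with no extra covariates, is a one-way ANOVA. The sketch below computes its F statistic together with eta squared (the share of variance explained by the category). The function name `one_way_anova` and the three groups of values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def one_way_anova(groups):
    """Return (F statistic, eta squared) for a dict {category: [values]}."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
    )
    # Variation of each value around its own group mean.
    ss_within = sum(
        (x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Hypothetical data: a continuous measurement recorded for three categories.
data = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
f_stat, eta_squared = one_way_anova(data)
print(f_stat, eta_squared)
```

A large F statistic and an eta squared close to 1 suggest that the category explains much of the variation in the continuous variable.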

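What are the basic assumptions to be made for linear regression?

Normality of the error distribution, statistical independence of the errors, constant variance of the errors (homoscedasticity), and linearity and additivity of the relationship between the predictors and the response.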

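What does the p-value signify about statistical data?

The p-value is used to determine the significance of results after a hypothesis test in statistics. It is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually obtained. The p-value helps readers draw conclusions and is always between 0 and 1: the smaller it is, the stronger the evidence against the null hypothesis.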
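To make the idea concrete, here is a minimal sketch of a two-sided p-value for a z-test, using only the standard library. The helper `two_sided_p_value` and the sample numbers (mean 52, null mean 50, known standard deviation 10, n = 100) are assumptions invented for the example.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: sample mean 52 from n = 100, null mean 50, known sd 10.
sample_mean, null_mean, sd, n = 52.0, 50.0, 10.0, 100
z = (sample_mean - null_mean) / (sd / math.sqrt(n))
p_value = two_sided_p_value(z)
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z = 2.00, p = 0.0455
```

With the common 0.05 threshold, this hypothetical result would be called statistically significant.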

What are eigenvalues and eigenvectors?

Eigenvalues and eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. An eigenvalue can be thought of as the strength of the transformation in the direction of its eigenvector, or the factor by which the compression or stretching occurs.
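One way to see the definition at work is power iteration, a simple method (chosen here for illustration, not mentioned in the answer above) that finds the eigenvector with the largest eigenvalue by applying the matrix repeatedly. The 2x2 matrix and the helper names are hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and its unit eigenvector."""
    v = [1.0] + [0.0] * (len(matrix) - 1)
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # keep only the direction
    # Rayleigh quotient: how strongly the matrix stretches along v.
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
dominant_value, dominant_vector = power_iteration(A)
print(dominant_value, dominant_vector)
```

For this matrix the direction converges to (1, 1) scaled to unit length, and the matrix stretches vectors along it by a factor of about 3.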

Can you write the formula to calculate R-squared?

R-squared can be calculated using the formula below: R² = 1 - (residual sum of squares / total sum of squares)
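The formula translates almost directly into code. In the sketch below, the function name `r_squared` and the observed and predicted values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual SS
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total SS
    return 1 - ss_res / ss_tot


# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]
score = r_squared(observed, predicted)
print(round(score, 4))  # 0.95
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so the model explains about 95% of the variance.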

What is the advantage of performing dimensionality reduction before fitting an SVM?

Support vector machine learning algorithms often perform better in the reduced space. It is beneficial to perform dimensionality reduction before fitting an SVM if the number of features is large compared to the number of observations.

How will you assess the statistical significance of an insight, that is, whether it is a real insight or just due to chance?

The statistical significance of an insight can be assessed using hypothesis testing.
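One simple form of hypothesis testing is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference as large as the one observed. The sketch below assumes two small hypothetical samples and an arbitrary choice of 5000 shuffles.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between label and value
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two variants of a product page.
variant_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
variant_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]
p_value = permutation_p_value(variant_a, variant_b)
print(p_value)
```

A small p-value here suggests the gap between the two variants is unlikely to be a product of chance alone.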

You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

This question has enough hints for you to start thinking. Since the data is spread around the median, let’s assume it follows a normal distribution. We know that in a normal distribution about 68% of the data lies within 1 standard deviation of the mean (which coincides with the mode and the median), which leaves about 32% of the data outside that range. Therefore about 32% of the data would remain unaffected by missing values.
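The 68% and 32% figures can be checked with the error function from the standard library. This is a quick sketch under the same normality assumption; the helper name `fraction_within` is invented for the example.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


within_one_sd = fraction_within(1)
unaffected = 1 - within_one_sd
print(f"{within_one_sd:.1%} within, {unaffected:.1%} outside")  # 68.3% within, 31.7% outside
```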

Why is naïve Bayes so “naive”?

Naïve Bayes is so ‘naïve’ because it assumes that all of the features in a data set are equally important and independent of one another given the class. As we know, these assumptions are rarely true in real-world scenarios.
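The independence assumption shows up in code as a plain product of per-feature probabilities. Below is a minimal sketch for binary categorical features with add-one smoothing; the function names, the toy spam data and the two features (contains a link, contains an offer) are all hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return the most likely label and the unnormalised score of each label."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # The "naive" step: multiply per-feature likelihoods as if the
            # features were independent given the class. Add-one smoothing
            # assumes each feature has exactly two possible values.
            score *= (feature_counts[label][i][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical messages described by (contains_link, contains_offer).
rows = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
label, scores = predict(("yes", "yes"), class_counts, feature_counts)
print(label, scores)
```

Each feature contributes one multiplicative factor with no interaction terms, which is exactly why correlated features can mislead the model.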

Hicham AMAR, Ing in Geomatic Sciences, Co-founder of Geoinfo4all.com
