Hicham AMAR
The data science interview is an essential step with the technical staff of the company you want to join. It is therefore important to prepare to answer theoretical questions, to work through technical scenarios, and sometimes to write code in the language you have mastered (Python or R).
You can find parts I, II, III and V in the Data science section of the Geoinfo4all.com blog.
- What is data science?
Data science involves using automated methods to analyse massive amounts of data and to extract knowledge from them. By combining aspects of statistics, computer science, applied mathematics, and visualisation, data science can turn vast amounts of data into new insights and knowledge.
- What are the various steps involved in an analytics project?
# Understand the business problem
# Explore the data and become familiar with it
# Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
# After data preparation, start running the model, analyse the results and tweak the approach. This is an iterative step, repeated until the best possible outcome is achieved
# Validate the model using a new data set
# Start implementing the model and track the results to analyse its performance over time
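The preparation, modelling and validation steps above can be sketched in plain Python. This is only a minimal illustration under assumptions: the toy data, the 1.5 * IQR rule for outliers, the choice of a straight-line model and the helper names (`prepare`, `fit_line`, `mean_abs_error`) are hypothetical and do not come from the article.

```python
from statistics import mean, quantiles

# Toy (x, y) observations; None marks a missing value, 500.0 is an obvious outlier.
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8),
       (6, 500.0), (7, 14.1), (8, 16.0), (9, 18.1), (10, 19.9)]

def prepare(rows):
    """Drop rows with a missing target, then drop outliers with the 1.5 * IQR rule."""
    rows = [(x, y) for x, y in rows if y is not None]
    q1, _, q3 = quantiles([y for _, y in rows], n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [(x, y) for x, y in rows if low <= y <= high]

def fit_line(rows):
    """Ordinary least squares for y = a + b * x."""
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    b = (sum((x - mx) * (y - my) for x, y in rows)
         / sum((x - mx) ** 2 for x, _ in rows))
    return my - b * mx, b

def mean_abs_error(rows, a, b):
    """Average absolute difference between observed and predicted y."""
    return mean(abs(y - (a + b * x)) for x, y in rows)

clean = prepare(raw)
train, holdout = clean[:-2], clean[-2:]   # keep the last two rows for validation
a, b = fit_line(train)
print(f"slope={b:.2f}, holdout MAE={mean_abs_error(holdout, a, b):.2f}")
```

In a real project the same loop would be repeated with different preparation choices and models until the error on the held-out data stops improving.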
- What is the difference between covariance and correlation?
Correlation is the standardised form of covariance. Covariances are difficult to compare because they depend on the units of the variables. For example, the covariance of salary ($) and age (years) changes when salary is expressed in a different unit. To overcome this, we calculate the correlation, which always lies between -1 and +1, irrespective of the scales of the variables.
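The scale effect can be checked with a small sketch. The ages and salaries below are made-up numbers, and `covariance` and `correlation` are illustrative helpers written for this example (sample covariance and Pearson correlation).

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

age_years = [25, 32, 41, 50, 58]
salary_usd = [30_000, 42_000, 55_000, 61_000, 80_000]
salary_kusd = [s / 1000 for s in salary_usd]   # same salaries, in thousands of dollars

# The covariance changes by a factor of 1000 with the unit; the correlation does not.
print(covariance(age_years, salary_usd), covariance(age_years, salary_kusd))
print(correlation(age_years, salary_usd), correlation(age_years, salary_kusd))
```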
- How are true positive rate and recall related?
The true positive rate and recall are the same quantity. Both are given by the formula TP / (TP + FN),
where TP is the number of true positives and FN is the number of false negatives.
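As a quick check of the formula, the sketch below counts the confusion-matrix cells for a small set of made-up binary labels. The function names `confusion_counts` and `recall` are illustrative and assume that 1 marks the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def recall(y_true, y_pred):
    """Recall, also called true positive rate: TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
print(recall(y_true, y_pred))  # 3 true positives out of 5 actual positives
```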
- Is it possible to capture the correlation between continuous and categorical variables? If yes, how?
Yes, we can use the ANCOVA (analysis of covariance) technique to capture the association between continuous and categorical variables.
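A full analysis of covariance needs a regression library, so the sketch below swaps in a simpler, related measure rather than the technique named above: the correlation ratio (eta squared) from a one-way analysis of variance. It gives the share of the variance of a continuous variable that is explained by a categorical one. The department and salary values are made up, and `correlation_ratio` is an illustrative helper.

```python
from statistics import mean

def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares over total sum of squares."""
    grand = mean(values)
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_total = sum((v - grand) ** 2 for v in values)
    return ss_between / ss_total if ss_total else 0.0

department = ["ops", "ops", "ops", "eng", "eng", "eng", "sales", "sales", "sales"]
salary = [40, 42, 41, 60, 62, 61, 50, 49, 51]

# Close to 1: the category explains most of the variation in salary.
print(correlation_ratio(department, salary))
```

A value near 0 means the group means are almost identical, so the categorical variable tells us little about the continuous one.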
- Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?
For better prediction, a categorical variable can be treated as a continuous variable only when it is ordinal in nature.
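A minimal sketch of the distinction, using hypothetical variables: an ordered education level is mapped to integer ranks, while an unordered colour is expanded into indicator columns. The names `EDUCATION_ORDER`, `encode_ordinal` and `encode_one_hot` are assumptions made for this example, and integer ranks additionally assume that the steps between levels are roughly comparable.

```python
# Hypothetical ordinal variable: the categories have a natural order.
EDUCATION_ORDER = ["primary", "secondary", "bachelor", "master", "doctorate"]
education_rank = {level: rank for rank, level in enumerate(EDUCATION_ORDER)}

def encode_ordinal(values, ranks):
    """Replace each ordered category with its integer rank."""
    return [ranks[v] for v in values]

def encode_one_hot(values, categories):
    """For nominal variables, use one indicator column per category instead."""
    return [[1 if v == c else 0 for c in categories] for v in values]

education = ["bachelor", "primary", "master"]
colour = ["red", "blue", "red"]   # no natural order, so integer codes would mislead

print(encode_ordinal(education, education_rank))
print(encode_one_hot(colour, ["blue", "red"]))
```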
- You are working on a time series data set. Your manager has asked you to build a high-accuracy model, so you start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than with the decision tree model. Can this happen? Why?
Yes, this can happen. Time series data often exhibits a linear trend, whereas a decision tree algorithm works best at detecting non-linear interactions. The decision tree failed to provide robust predictions because it could not capture the linear relationship as well as a regression model. A linear regression model can therefore provide robust predictions, provided the data satisfies its linearity assumptions.
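The effect can be reproduced with a toy example. The sketch below fits a least-squares line and a one-split regression tree (a "stump", used here as a simplified stand-in for a full decision tree) to a purely linear trend, then predicts a future time step. The data and the helpers `fit_line` and `fit_stump` are illustrative assumptions, and `fit_stump` expects the time values to be sorted.

```python
from statistics import mean

def fit_line(ts, ys):
    """Least-squares line y = a + b * t."""
    mt, my = mean(ts), mean(ys)
    b = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
         / sum((t - mt) ** 2 for t in ts))
    return my - b * mt, b

def fit_stump(ts, ys):
    """One-split regression tree: pick the threshold that minimises squared error."""
    best = None
    for k in range(1, len(ts)):
        left, right = ys[:k], ys[k:]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (ts[k - 1] + ts[k]) / 2, ml, mr)
    _, threshold, ml, mr = best
    return lambda t: ml if t <= threshold else mr

train_t = list(range(10))
train_y = [2.0 * t + 1.0 for t in train_t]   # a purely linear trend
a, b = fit_line(train_t, train_y)
stump = fit_stump(train_t, train_y)

future_t = 15
print("truth:", 2.0 * future_t + 1.0)
print("linear regression:", a + b * future_t)
print("tree stump:", stump(future_t))
```

A deeper tree would follow the training points more closely, but its predictions are still averages of training targets, so it cannot go beyond the largest value seen during training.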
- What are the important Python skills for data analysis?
# Good understanding of the built-in data types, especially lists, dictionaries, tuples and sets
# Mastery of N-dimensional NumPy arrays
# Mastery of pandas data frames
# Familiarity with scikit-learn
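As a small illustration of the first point, the sketch below uses only built-in types and the standard library on a made-up list of orders. The variable names are hypothetical.

```python
from collections import Counter

# list: ordered, mutable sequence of records (here, tuples).
orders = [("alice", "book", 12.5), ("bob", "pen", 1.2),
          ("alice", "pen", 1.2), ("carol", "book", 12.5)]

# dict: map each customer to their total spend.
spend = {}
for customer, _, price in orders:       # tuple unpacking
    spend[customer] = spend.get(customer, 0.0) + price

# set: distinct products, with fast membership tests.
products = {product for _, product, _ in orders}

# Counter (a dict subclass): how often each product was ordered.
product_counts = Counter(product for _, product, _ in orders)

print(spend)
print(sorted(products))
print(product_counts)
```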
- What is the difference between “long” and “wide” format data?
In the wide format, a subject's repeated responses are in a single row, and each response is in a separate column.
In the long format, each row is one time point per subject.
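The two layouts can be converted into each other. The sketch below does this with plain dictionaries. The subjects, the column names `t1` to `t3` and the helpers `wide_to_long` and `long_to_wide` are assumptions made for illustration.

```python
# Wide format: one row per subject, one column per time point.
wide = [
    {"subject": "s1", "t1": 5.0, "t2": 5.5, "t3": 6.1},
    {"subject": "s2", "t1": 4.2, "t2": 4.0, "t3": 4.8},
]

def wide_to_long(rows, id_key="subject"):
    """Turn each (subject, time point) cell into its own row."""
    return [
        {id_key: row[id_key], "time": key, "value": value}
        for row in rows
        for key, value in row.items()
        if key != id_key
    ]

def long_to_wide(rows, id_key="subject"):
    """Collect all rows of one subject back into a single row."""
    by_subject = {}
    for row in rows:
        target = by_subject.setdefault(row[id_key], {id_key: row[id_key]})
        target[row["time"]] = row["value"]
    return list(by_subject.values())

long_rows = wide_to_long(wide)
for row in long_rows:
    print(row)
```

With pandas data frames, the same reshaping is usually done with the library's own reshaping functions rather than by hand.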
- What is univariate analysis?
Descriptive statistical analysis techniques can be differentiated by the number of variables involved at a given point in time. Univariate analysis examines a single variable on its own, for example its distribution, central tendency and spread.
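A minimal sketch of a univariate summary, assuming one made-up numeric variable (`ages`) and one made-up categorical variable (`territory`). The helper `describe` is illustrative and uses only the standard library.

```python
from collections import Counter
from statistics import mean, median, stdev

def describe(values):
    """Summary statistics for one numeric variable."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "min": min(values),
        "max": max(values),
    }

ages = [23, 25, 25, 31, 38, 41, 52]                        # numeric variable
territory = ["north", "south", "north", "east", "north"]   # categorical variable

print(describe(ages))
print(Counter(territory))   # frequency table for the categorical variable
```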
Hicham AMAR, Ing in Geomatic Sciences, Co-founder of Geoinfo4all.com