
Sweat more during peace, bleed less during war.
Sun Tzu, The Art of War
Data scientist is one of the most coveted jobs of this century. It’s lucrative and challenging at the same time. In interviews, aspirants are scrutinized on so many fronts that it is easy to lose track of what to focus on. So for a data science interview you should prepare all year round and, moreover, keep track of your preparation.
Whether you already have a job in the data field or are looking for one, this guide will help you decide what to prepare. This is not a regular “data science interview questions” blog, where you get some questions and their answers. Instead, we’ve covered some topics with sample questions and answers, plus resource links that you can look at for your preparation. If you are a complete beginner and want a complete list of topics, you can check out this blog. So let’s get started.
Topics to focus on for data science interview
- Probability and Statistics
- Linear Algebra
- Calculus
- Data Structures and Algorithms
- Machine learning algorithms
- Deep learning basics
Probability and Statistics
Probability and statistics are the brain of a data scientist. They are crucial for making inferences, presenting data, and finding actionable insights and significant relations in the data. Although it’s a whole course in itself, we have tried to break it down for you and list some important topics and references, assuming you are already familiar with them and just need to brush up. If the situation is quite the opposite, you can check this out.
First off, you need to be familiar with distributions: Normal, Exponential, Bernoulli, Binomial, etc. Moreover, their mathematical equations, expected values and variances must be at your fingertips.
Then come covariance, correlation, the central limit theorem and the law of large numbers. You are definitely going to get a question from here. But that’s not all: next are maximum likelihood estimation, confidence intervals, hypothesis testing (t-test, ANOVA, chi-square test), p-values, Bayesian inference, prior, posterior, log-likelihood and maximum a posteriori estimation.
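To see the central limit theorem and the law of large numbers at work, here is a minimal sketch that uses only Python’s standard library. The choice of the Exponential(1) distribution, the sample sizes and the helper name `sample_means` are assumptions made for illustration.

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Draw n_samples samples from Exponential(rate=1) and return their means."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=2000, sample_size=50)

# Exponential(1) has mean 1 and variance 1, so the sample means should
# cluster around 1 with a standard deviation close to 1 / sqrt(50).
print(statistics.fmean(means), statistics.stdev(means))
```

Even though the underlying distribution is strongly skewed, a histogram of `means` would look roughly bell-shaped, which is the point of the theorem.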
Bonus Tips:
This blog is a complete course on Probability and Statistics, with sample practice questions, but going through it will take a lot of time. You could just skim through the above topics for your preparation.
Sample Questions:
- What is the difference between likelihood and probability?
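Probability is the chance of an event occurring under a stochastic process whose parameters are known. Likelihood is a function of those parameters: it tells you how well a given parameter value explains the data you have already observed, and it is what you maximize to estimate the parameters of an unknown stochastic process. Here is a reference for better understanding.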
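A small sketch may make the distinction concrete. Here the data are fixed (7 successes in 10 trials, a made-up example) and we vary the parameter p to see which value explains the data best. The function name `bernoulli_log_likelihood` and the grid search are illustrative assumptions, not a standard API.

```python
import math

def bernoulli_log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p given the observed counts."""
    failures = trials - successes
    return successes * math.log(p) + failures * math.log(1 - p)

# Candidate values of p strictly between 0 and 1.
grid = [i / 100 for i in range(1, 100)]

# The data stay fixed; we look for the p that maximizes the likelihood.
best_p = max(grid, key=lambda p: bernoulli_log_likelihood(p, 7, 10))
print(best_p)
```

The maximizer coincides with the observed success rate, which is the maximum likelihood estimate for a Bernoulli parameter.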
- What is the difference between Bernoulli and Binomial distribution?
A random variable follows a Bernoulli distribution if it has two possible outcomes, with probabilities p and (1-p) respectively. When a Bernoulli experiment is repeated n times independently, it’s called a Binomial experiment, and the distribution of the number of successes k out of n is called the Binomial distribution. Check this for more details.
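As a quick sketch of the relation, the Binomial probability mass function can be written directly from its formula. The helper name `binomial_pmf` is hypothetical; `math.comb` is from the standard library (Python 3.8+).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# With n = 1 the Binomial distribution reduces to the Bernoulli case.
print(binomial_pmf(1, 1, 0.3))

# Probability of exactly 3 successes in 10 fair coin flips.
print(binomial_pmf(3, 10, 0.5))
```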
- What are independent and dependent random variables?
Let’s say there are two events A and B. They are independent if the probability of A and B occurring together is the product of the probability of A and the probability of B; otherwise they’re dependent. Two random variables are independent when this holds for every pair of events defined on them. This is an easy one. One hint, though: look at Bayes’ rule for more information.
- What is the difference between a priori and a posteriori?
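Let’s say we observed some data (X) from a random process and we want to model this process (with parameters θ). Then the likelihood function is P(X|θ), the prior is P(θ) (our belief about θ before seeing the data), and the posterior is P(θ|X) (our belief after seeing it). P(X) is the evidence, which normalizes the posterior in Bayes’ rule.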
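For a concrete case, assume θ is the success probability of a coin and we put a Beta distribution over θ as the prior. With Bernoulli observations the posterior is again a Beta distribution, so the update is just addition. This is a sketch under that conjugate-prior assumption; the helper names and the counts are made up.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Update a Beta(alpha, beta) prior over theta with Bernoulli observations."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1, 1)  # Beta(1, 1) is the uniform prior over theta
posterior = beta_posterior(*prior, successes=7, failures=3)

# Belief about theta before and after seeing the data.
print(beta_mean(*prior), beta_mean(*posterior))
```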
- What is Pareto distribution? Can you give examples?
The Pareto distribution is a heavy-tailed power-law distribution. It is closely associated with the Pareto principle, also known as the 80-20 rule, i.e. roughly 80% of the outcomes are due to 20% of the causes. One example is the distribution of wealth in society.
- What is the relation between mean, median and mode for left-skewed and right-skewed distributions?
For a left-skewed distribution, typically mean < median < mode; for a right-skewed one, mode < median < mean.
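You can check the ordering of mean, median and mode on a tiny made-up dataset with the standard `statistics` module. The numbers below are chosen only to produce a long right tail, and the left-skewed list is simply its mirror image.

```python
import statistics

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]  # long tail to the right
left_skewed = [15 - x for x in right_skewed]    # mirror image, long tail to the left

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, statistics.mode(data), statistics.median(data), statistics.mean(data))
```

For the right-skewed list the mode is smallest and the mean is largest; for the mirrored list the order flips.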
Linear Algebra
Linear algebra is crucial for representing data and doing mathematical operations on it. It is a somewhat abstract study of the properties of the data. Whether the data is tabular, image, video or text, it is represented as a matrix. Key topics in linear algebra for a data scientist are vector spaces, linear independence, spanning and basis vectors, vector subspaces, rank of a matrix, nullity of a matrix, systems of linear equations, dot product and its interpretations, orthogonality, projections, least squares approximation, vector and matrix norms, determinant of a matrix, inverse of a matrix, matrix decomposition (LU, QR, eigenvalue decomposition), eigenvalues and eigenvectors and their interpretation, singular value decomposition, diagonalizing a matrix, principal component analysis and convolution.
Resources:
Sample Questions:
- What is the difference between singular value decomposition and eigendecomposition?
- What is the relation between and eigenvalues?
- What are positive definite and positive semi-definite matrices?
- What is Frobenius norm?
- How do you compress images using matrix decomposition? Is there a better approach to solve this problem?
- How do you compute word embedding using linear algebra?
- What is the relation between the covariance matrix and PCA?
- What do you mean by a broadcasting operation in NumPy?
- How do you diagonalize a matrix?
- What is Hadamard product?
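Two of the terms above are easy to pin down in a few lines. This is a plain-Python sketch on nested lists (in practice you would reach for a numerical library); the function names are my own.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries of the matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def hadamard(a, b):
    """Element-wise product of two matrices with the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(frobenius_norm(A))  # sqrt(1 + 4 + 9 + 16)
print(hadamard(A, B))
```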
Calculus
Calculus is a fundamental tool in various fields of science, engineering, economics etc. It helps us model and solve problems involving change and motion, optimization, and understanding complex systems. It provides a framework for understanding how things change and how to analyze and describe complex curves and surfaces. Calculus is typically divided into two main branches:
- Differential Calculus: The derivative of a function represents the rate of change of that function at a given point. It allows you to find slopes of curves, calculate instantaneous velocities, and analyze how functions behave locally.
- Integral Calculus: The integral of a function represents the accumulation of quantities or the area under a curve. It allows you to compute the total change or accumulated quantity over a specified interval.
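Both ideas can be approximated numerically, which is a handy sanity check when you brush up. The sketch below uses a central difference for the derivative and the trapezoidal rule for the integral; the step sizes and function names are assumptions chosen for illustration.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def integral(f, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

def square(x):
    return x * x

print(derivative(square, 3.0))     # close to 6, the slope of x^2 at x = 3
print(integral(square, 0.0, 3.0))  # close to 9, the area under x^2 on [0, 3]
```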
Calculus is generally left out of interviews, but it still deserves a place on this list owing to its usefulness in the day-to-day work of a data scientist. Key topics worth looking into are derivatives and integrals of well-known functions (exponential, logarithmic, trigonometric, etc.), multivariate calculus, the chain rule, Taylor series and optimization (gradient descent).
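Gradient descent is where calculus shows up most often in practice, so here is a minimal one-dimensional sketch. The objective f(x) = (x - 2)^2, the learning rate and the step count are assumptions picked to keep the example small.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 2)^2, whose derivative is 2 * (x - 2).
minimum = gradient_descent(lambda x: 2 * (x - 2), start=10.0)
print(minimum)  # approaches 2, the minimizer of f
```

With this learning rate the distance to the minimum shrinks by a constant factor each step; a rate that is too large would make the iterates diverge instead.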
Resources:
Here is a link to a calculus cheat sheet from the University of Colorado.
Data Structures and Algorithms
Data structures and algorithms are fundamental concepts of computer science. Data structures are just ways to represent data, while algorithms are sets of instructions to get the job done. It may sound trivial, but it’s not. It will take years of practice to master this one.
They are more relevant for software development, but some companies do ask questions on them to filter out candidates. Moreover, in my opinion it’s a good skill to have, since at the end of the day you are going to write task-specific scripts at your job.
Although it’s a vast topic, I recommend getting familiar with data structures and their properties first; then you can dig into different types of algorithms. Here is a list of data structures to get you started.
- Arrays
- Heaps
- Hash maps
- Stacks and Queues
- Trees (binary tree, binary search tree)
- Linked lists
- Graphs
Even if you prepare only the first four, you will cover most of the questions asked in data science interviews, but for software development you’ll have to ace all of them.
Resources:
Here is a list of 500+ questions with solutions. It will take a lot of time to cover all of them, so I suggest starting with Arrays. If you are willing to pay a bit, here is a course on Udemy that you can check out. Trust me, it will be worth it.
Sample Questions:
- Find a pair with given sum
- Find Kth largest element in an array
- Find element in a 2D sorted array
- Delete a node from BST
- Find duplicate parenthesis in an expression
- Inorder tree traversal | iterative and recursive
- In-place merge of two sorted arrays
- Sort an array containing only 0’s and 1’s
- Find all distinct combination of given length
- Rain water trapping problem
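To show how the data structures listed earlier pay off, here is one possible approach to the first two questions: a hash set for the pair-sum problem and a min-heap of size k for the Kth largest element. These are sketches of common solutions, not the only ones, and the function names are my own.

```python
import heapq

def find_pair_with_sum(nums, target):
    """Return one pair of values adding up to target, or None if there is none."""
    seen = set()
    for value in nums:
        if target - value in seen:
            return target - value, value
        seen.add(value)
    return None

def kth_largest(nums, k):
    """Return the k-th largest element using a min-heap that holds k items."""
    heap = []
    for value in nums:
        heapq.heappush(heap, value)
        if len(heap) > k:
            heapq.heappop(heap)  # drop the smallest, keeping the k largest
    return heap[0]

print(find_pair_with_sum([8, 7, 2, 5, 3, 1], 10))
print(kth_largest([7, 4, 6, 3, 9, 1], 2))
```

The pair search runs in a single pass over the array, and the heap version avoids sorting the whole array when k is small.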
Machine Learning Algorithms
Now let’s get to the meat of the matter. What we have covered up to now are prerequisites for this topic and the next one. A gap in any one of these skills can be devastating.
Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to learn from available data and make predictions without being explicitly programmed. It is used for tasks such as image recognition, language translation, fraud detection and personalized content recommendations.
Resources:
- Intro to ML from IIT Kanpur. Here is the link with lecture notes. Best for quick revision of the concepts.
- Machine learning MOOC by IIT Madras. It’s a bit hard to follow but worth the time and effort.
- Here is a git repository with practical notebooks that you might find handy.
Sample Questions:
- What are the different types of machine learning algorithms?
- What is the difference between supervised and reinforcement learning?
- What are decision trees? Explain how we can create decision trees for a classification problem.
- Explain how K-means clustering works.
- What are some of the popular clustering algorithms?
- What is the difference between generative and discriminative models?
- What is the ordinary least squares method? Are there any other ways to solve regression problems?
- Why does L1 regularization converge faster than L2 regularization?
- What is the difference between random forests and decision trees?
- What are kernels in SVMs?
- Explain parametric vs non-parametric machine learning algorithms.
- How is decision tree regression different from decision tree classification?
- What is the bias-variance trade-off? Is low bias, low variance desirable?
- How is Linear discriminant analysis different from PCA?
- What is the difference between model parameters and hyper-parameters?
- What are precision and recall? Which one is preferred?
- What are the steps involved in model evaluation?
- What are some of the feature selection techniques?
- What are the stages in the life cycle of a machine learning project?
- What is covariate shift?
- What is the difference between model drift and data drift?
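As a starting point for the precision and recall question, here is a minimal sketch for binary labels. The labels and predictions are made up, and returning 0.0 when a denominator is zero is a convention I chose for the example.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))
```

Precision asks how many predicted positives were right; recall asks how many actual positives were found. Which one matters more depends on the cost of false positives versus false negatives in your problem.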
Deep Learning Basics
Deep learning is a sub-field of machine learning that focuses on artificial neural networks, particularly deep neural networks with many layers. Some key features of deep learning algorithms are:
- Automatic feature learning
- Requires a lot of data and compute to train
- Has outperformed traditional machine learning algorithms in certain tasks, such as speech and image recognition.
- Can be used as both a discriminative and a generative model.
- Model explainability is a bit of a challenge.
- Transfer learning, that is fine tuning of pre-trained models for specific problems.
Resources:
- Deep Learning course from IIT Madras. It’s pretty long and demanding.
- Lecture notes for the above course
- NLP using deep learning – Stanford
Sample Questions:
- Explain how gradient descent works.
- What is the difference between stochastic gradient descent and batch gradient descent?
- How do we introduce non-linearity in neural networks?
- What are vanishing and exploding gradients? When are they observed?
- What is the difference between objective function and loss function?
- What is batch normalization? Why is it used?
- What is the difference between CNNs and RNNs?
- Explain the architecture of transformers. Also explain how attention works.
- What is a residual connection? How does it work? What are its advantages?
- What is pre-training with respect to transformers?
- Explain the encoder-decoder architecture.
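As a hint for the vanishing gradient question, here is a toy sketch. It multiplies the sigmoid derivative across a chain of layers, assuming every weight is 1 and every pre-activation is 0, which is the most favourable case for the sigmoid. These simplifications and the function names are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, which never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1 - s)

def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights fixed at 1)."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product

print(chained_gradient([0.0] * 5))   # 0.25 multiplied 5 times
print(chained_gradient([0.0] * 20))  # 0.25 multiplied 20 times, far smaller
```

Because each factor is at most 0.25, the gradient reaching the early layers shrinks geometrically with depth.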
Conclusions
We have come a long way here. I hope all this information will serve you well. I have left some of the questions unanswered on purpose. If you don’t know the answers, they are just a Google search away, and I encourage you to find them on your own. Don’t hesitate to post your questions in the comments section, though.
Leave a comment; I’d really appreciate it. Also, let me know if you want me to cover any topic in detail. Thank you!