Lean path to become a successful data scientist

Why become a data scientist?

A career in data science offers a unique combination of high demand, high salaries, opportunities for growth, and the chance to work on meaningful problems. If you’re interested in mathematics, computer science, and solving real-world problems, becoming a data scientist is the way to go.

Required Skills for becoming a data scientist

Math

Mathematics is the language of data. Data is represented through numbers, tables, matrices, series, and so on, and mathematical operations and equations are the key to getting information out of it. But we don’t need to learn scary stuff to get started; high school math will suffice. Here is a comprehensive list of topics to get you started:

math4ml.pdf

Probability and Statistics

Preface

For years, I have been joking with my students that I would teach probability with the same level of excitement even if I were woken up in the middle of the night and asked to teach it. Years later, as a new father, I started writing this book when it became clear to me that I would not be sleeping…

Linear Algebra

Lecture Notes for Linear Algebra

6.4 Solve Linear Differential Equations

Calculus

Lecture Notes | Multivariable Calculus | Mathematics | MIT OpenCourseWare

This section provides summaries of the lectures as written by Professor Auroux to the recitation instructors.

Transform Theory (optional)

Lecture Notes | Fourier Analysis – Theory and Applications | Mathematics | MIT OpenCourseWare

This section provides the schedule of lecture topics, lecture notes for each session, and notes for the entire course as a single file.

Information theory (optional)

Lecture Notes | Information Theory | Electrical Engineering and Computer Science | MIT OpenCourseWare

This section provides the lecture notes used for the course.

Programming

Programming is the key skill required to apply the mathematical concepts you have learnt to the underlying data. Here is a list of topics and courses to get you started.

Choose a programming Language

I prefer Python, as it is widely adopted, versatile, easy to learn, and has an amazing community.

How to Get Started With Python?

In this tutorial, you will learn to install and run Python on your computer. Once we do that, we will also write our first Python program.

OOP Concepts

Python Object Oriented Programming (With Examples)

In this tutorial, we’ll learn about Object-Oriented Programming (OOP) in Python with the help of examples.

Data Structure and Algorithms

500+ Data Structures and Algorithms Interview Questions & Practice Problems

Array

Data Mining and Analysis

So far we have looked at the prerequisites for data science. Now comes the core application part. The first step in the data science journey is data mining and analysis.

Data Processing

Data Analysis Using Pandas | Guide to Pandas Data Analysis

Pandas is one of the most famous data science tools and it’s definitely a game-changer for cleaning, manipulating, and data analysis.

Exploratory data analysis and visualizations

Step-by-Step Exploratory Data Analysis (EDA) using Python

EDA is performed on the datasets to explore the data and extract all possible insights helping in model building and better decision making.
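To make the idea concrete, here is a minimal sketch of the first thing most EDA passes do: count missing values and summarise a numeric column. The `ages` data is made up for illustration, and I am using only the standard library here; in practice you would likely reach for pandas (for example `DataFrame.describe()`) instead.

```python
from statistics import mean, median, stdev

# Hypothetical column of customer ages; None marks a missing value
ages = [23, 35, None, 41, 29, None, 35]

# Separate the values we can actually compute on
present = [a for a in ages if a is not None]

summary = {
    "count": len(present),
    "missing": len(ages) - len(present),
    "mean": mean(present),
    "median": median(present),
    "std": stdev(present),  # sample standard deviation
    "min": min(present),
    "max": max(present),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Even a summary this small tells you how much data is missing and whether the mean and median disagree, which is often the first hint of skew or outliers.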

Association rule mining and clustering

#1 Introduction To Data Mining, Types Of Data |DM|

If you are with me so far, you will be competent enough to become a data analyst, which is an entry-level position in the field of data science.

Machine Learning Algorithms

Introduction to Machine Learning

The above course covers, in great detail, all the machine learning algorithms you need to know as a beginner. But for the sake of saving some time, I’ll list resources for individual algorithms as well.

Linear regression

Linear Regression Algorithm To Make Predictions Easily

Linear regression is a statistical regression method used for predictive analysis and shows the relationship between the continuous variables.
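As a rough illustration of what "shows the relationship between continuous variables" means, here is a minimal sketch of ordinary least squares with a single feature. The function name `fit_line` and the toy data are my own assumptions, chosen so the points lie exactly on y = 2x + 1; real projects would normally use a library such as scikit-learn.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the co-variation of x and y divided by the variation of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical toy data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [3, 5, 7, 9, 11]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # prediction for 6 hours of study
print(slope, intercept, predicted)
```

Once the slope and intercept are known, predicting for a new input is just plugging it into the line.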

Logistic regression

Logistic Regression- Supervised Learning Algorithm for Classification

This article will talk about Logistic Regression, a method for classifying the data in Machine Learning. Logistic regression is generally used where we have to classify the data into two or more classes.
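Here is a minimal sketch of the two-class case, assuming one feature and plain batch gradient descent on the log loss. The names `train_logistic`, `lr`, and `epochs`, and the toy data, are illustrative choices of mine and not taken from the linked article.

```python
import math

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, labels, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: small values belong to class 0, large values to class 1
xs = [0.5, 1.0, 1.5, 3.5, 4.0, 4.5]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(xs, labels)
p_low = sigmoid(w * 1.0 + b)   # probability of class 1 for x = 1.0
p_high = sigmoid(w * 4.0 + b)  # probability of class 1 for x = 4.0
print(p_low, p_high)
```

The model outputs a probability, and classifying is just a matter of thresholding it, typically at 0.5.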

Decision tree

Decision Tree Algorithm, Explained – KDnuggets

All you need to know about decision trees and how to build and optimize decision tree classifier.

Support vector machines

svm.pdf

Naive bayes

Naïve Bayes Algorithm: Everything You Need to Know – KDnuggets

Naïve Bayes is a probabilistic machine learning algorithm based on the Bayes Theorem, used in a wide variety of classification tasks. In this article, we will understand the Naïve Bayes algorithm and all essential concepts so that there is no room for doubts in understanding.
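For reference, the rule behind the classifier can be written out in one line. This is the standard textbook form, where $C$ is a class and $x_1, \dots, x_n$ are the features (symbols of my choosing, not from the linked article):

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The “naive” part is the assumption that the features are conditionally independent given the class, which is what lets the likelihood factor into a simple product.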

Random forest

Random Forest | Introduction to Random Forest Algorithm

Random forest is a Supervised Machine Learning Algorithm. This is an introduction to understanding random forest, its working and features.

XGBoost

An End-to-End Guide to Understand the Math behind XGBoost

Ever since its introduction in 2014, XGBoost has been lauded as the holy grail of machine learning hackathons and competitions. From predicting ad click-through rates to classifying high energy physics events, XGBoost has proved its mettle in terms of performance – and speed.

Dimensionality reduction

A One-Stop Shop for Principal Component Analysis

At the beginning of the textbook I used for my graduate stat theory class, the authors (George Casella and Roger Berger) explained in the…

What is LDA: Linear Discriminant Analysis for Machine Learning

Understand Linear Discriminant Analysis (LDA) in Machine Learning, Dimensionality Reduction, limitations of Logistic Regression. Learn practical approach to an LDA model.

Singular Value Decomposition | SVD in Python

Singular Value Decomposition (SVD) is a common dimensionality reduction technique in data science. Read about the common applications of SVD in data science.

The why and how of nonnegative matrix factorization | the morning paper

The why and how of nonnegative matrix factorization Gillis, arXiv 2014 from: ‘Regularization, Optimization, Kernels, and Support Vector Machines.’

Clustering

What is Hierarchical Clustering? – KDnuggets

The article contains a brief introduction to various concepts related to Hierarchical clustering algorithm.

DBSCAN Clustering Algorithm in Machine Learning – KDnuggets

An introduction to the DBSCAN algorithm and its implementation in Python.

Balanced Iterative Reducing and Clustering using Hierarchies — BIRCH

The biggest challenge with clustering in real-life scenarios is the volume of the data and the consequential increase in the complexity…

K-means clustering: The ultimate guide

K-means clustering is a widely used method for cluster analysis where the aim is to partition a set of objects into K clusters in such a way …
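The partitioning idea can be sketched in a few lines. This is a simplified version of the usual assign-then-update loop (Lloyd's algorithm) on 2-D points; to keep it deterministic I seed the centroids with the first `k` points, which is an assumption of mine, since real implementations usually pick the starting centroids randomly or with k-means++. The data is made up.

```python
def kmeans(points, k, iterations=10):
    """Simplified K-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data: two obvious groups, one near (1, 1) and one near (8, 8)
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]

centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Note that you have to choose K up front, which is one of the practical difficulties the other clustering methods listed here try to address in different ways.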

Gaussian Mixture Model | Brilliant Math & Science Wiki

Gaussian mixture models are a probabilistic model for representing normally distributed subpopulations within an overall population. Mixture models in general don't require knowing which subpopulation a data point belongs to, allowing the model to learn the subpopulations automatically. Since s…

Gaussian Mixture Model

Recommendation system

An In-Depth Guide to How Recommender Systems Work

Recommender systems are the brains behind product and content recommendations on websites. Here’s how they work.

Model Deployment

Data storage

How to store Data for your Data Science Process

Learn how to develop an effective data storing strategy…

Data processing

In-Depth ETL in Machine Learning Tutorial – Case Study With Neptune – neptune.ai

Most of the time, as data scientists, we think that our core value is our ability to figure out a machine learning algorithm that solves a task. In reality, model training is just the final part of a large body of work, mainly with data, that’s required just to start building a model. Before ML…

Cloud for machine learning

Best Machine Learning as a Service Platforms (MLaaS) That You Want to Check as a Data Scientist – neptune.ai

The availability of tremendous computing power in the cloud was one of the factors behind the machine learning revolution. Thus, it is not surprising that there are cloud-based services emerging, aimed at machine learning specialists. But which one to pick? Cloud-based services are not that new in f…

Data versioning

Instant Experiment Tracking: Just Add DVC!

Experiment tracking in DVC with a few lines of Python.

Experiment tracking

Track ML experiments and models with MLflow – Azure Machine Learning

Use MLflow to log metrics and artifacts from machine learning runs

Model serving

Deploying Machine Learning Models using Flask

This tutorial will serve as an introduction on deploying Machine Learning models using Flask. We will go through various steps for building an end-to-end web application with inbuilt Machine Learning model using Flask.

How to Dockerize a Flask Application

These days, developers need to develop, ship, and run applications quicker than ever. And fortunately, there’s a tool that helps you do that – Docker. With Docker, you can now easily ship, test, and deploy your code quickly while maintaining full control over your infrastructure. It significantly re…

Conclusions

I know it’s a long way ahead if you are just starting your journey, but you don’t have to wait until you complete everything I listed here. There are several checkpoints where you can start your professional journey. For instance, you can start your career as a data analyst once you know the core math and basic programming, and then keep building your skills to move further ahead in your career.

One last tip though

You do not rise to the level of your goals. You fall to the level of your systems.
James Clear, Atomic Habits

So make a habit of learning!

Base Zero