Why You Still Need to Know Statistics

Posted March 20, 2023

In the realm of data science, there has been a long-standing debate over whether machine learning, AI, or statistics reigns supreme. The argument is ultimately unnecessary: all three methodologies have plenty of data and problems to address. Although models operate differently in each of those fields, statistics is a crucial component of all of them. Data science professionals need a firm grasp of foundational mathematics and statistics, as well as the computational and computer science principles that underpin machine learning and AI. Only by understanding the theoretical underpinnings, strengths, and limitations of each approach can the right methodology be chosen for a given problem. This blog will examine why data scientists require a solid understanding of statistics and explore the specific applications and methodologies where that knowledge is indispensable.

Designing Experiments

One of the fundamental aspects of statistics used to support data science is experimental design. Classic statistical experimental design involves carefully planning an experiment to ensure that it is properly controlled and that any variation observed is due to the factor(s) being tested rather than extraneous factors. This approach is regularly applied when conducting A/B tests, where a new intervention or treatment is compared to a control group. Experimental design is even more important when multiple factors are being assessed at once as opposed to a true, simple A/B test. By using classic statistical experimental design, it is possible to ensure that the experiment is properly designed, the data is collected correctly, and the results are analyzed using the appropriate statistical methods. This, in turn, can help to ensure that the results of the experiment are reliable and can be used to make informed decisions.
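To make this concrete, here is a minimal sketch of analyzing a simple A/B test with a two-proportion z-test in Python. The conversion counts are hypothetical, and the choice of statsmodels is an assumption made for illustration; the underlying test is standard.

```python
# Hypothetical A/B test: did the treatment variant change the conversion rate?
from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]  # converting users in control (A) and treatment (B)
samples = [5000, 5000]    # users exposed to each variant

# Null hypothesis: both variants share the same underlying conversion rate
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

Note that the test only supports a causal interpretation if the experiment was properly randomized; no amount of after-the-fact analysis can rescue a confounded design.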

Handling Outliers and Missing Values

Outliers and missing values can significantly affect the analysis of data, leading to biased results and incorrect conclusions. Outliers, in particular, can distort statistical parameters and can also cause machine learning models to lose accuracy. Identifying and addressing outliers is therefore critical to ensuring that a data model of any type is as accurate as possible. Classic statistical methods such as box plots and z-scores can help to identify outliers. Once identified, outliers can be handled in a way that preserves the integrity of the data while improving the models. Similarly, missing values can be addressed using various statistical techniques, such as mean imputation and model-based imputation, among others.
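As a rough illustration, the snippet below uses the same fences a box plot draws (1.5 times the interquartile range beyond the quartiles) to flag an outlier, computes z-scores, and fills a missing value with mean imputation. The toy data and column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical revenue figures with one extreme value and one missing value
df = pd.DataFrame({"revenue": [120.0, 135.0, 128.0, 950.0, np.nan, 122.0]})

# Box-plot fences: values beyond 1.5 * IQR from the quartiles are outliers
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)]
print(outliers)  # flags the 950.0 row

# Z-scores: each value's distance from the mean in standard deviations
z_scores = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()

# Mean imputation: one simple way to fill the missing value
df["revenue_imputed"] = df["revenue"].fillna(df["revenue"].mean())
```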

Statistics in Deep Learning

Deep learning uses artificial neural networks to learn from data and is one of the most commonly used approaches to artificial intelligence. While deep learning is often associated with computer science, several mathematical and statistical constructs are embedded within these methods. For example, each node in a neural network has an activation function, a mathematical function that determines the output of the node based on the input it receives. One of the most commonly used activation functions is the sigmoid function, which is familiar to anyone who has done logistic regression. In addition to activation functions, other statistical constructs are used in deep learning, such as loss functions, regularization methods, and optimization algorithms. These constructs are essential for training neural networks and ensuring that they generalize well to new data. It's important to note that while statistics is an important component of deep learning, it's not the only one. Deep learning algorithms also rely heavily on computer science and mathematical concepts. Without a basic understanding of statistics, however, it would be difficult for a deep learning practitioner to fully comprehend how the algorithm works, which impacts the ability to build and deploy effective models.
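As a concrete example, here is a minimal sketch of a single node: a weighted sum of inputs passed through the sigmoid activation and scored with the binary cross-entropy loss. This is the same mathematics that underlies logistic regression; the inputs, weights, and bias are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood of the observed binary labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# One node: a weighted sum of inputs passed through the activation function
x = np.array([0.5, -1.2, 3.0])  # inputs to the node (arbitrary)
w = np.array([0.8, 0.1, -0.4])  # learned weights (arbitrary)
b = 0.2                         # bias term
p = sigmoid(np.dot(w, x) + b)   # the node's output: a probability

print(f"predicted probability: {p:.3f}")
print(f"loss if the true label is 1: "
      f"{binary_cross_entropy(np.array([1.0]), np.array([p])):.3f}")
```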

Assessing Model Accuracy

To evaluate the performance of a machine learning or deep learning algorithm, it is essential to examine the accuracy of its predictions. This is another area where statistics comes into play. In the simplest case of a binary outcome, such as predicting whether an image is a cat or not, it is necessary to create a 2x2 table of true yes/no and predicted yes/no. This table allows the calculation of metrics such as accuracy, precision, recall, and F1 score, which are widely used in binary classification problems. These metrics are based on statistical principles and provide a way to measure the performance of the algorithm objectively. They can also be generalized for models predicting more than two outcomes. Furthermore, there are several statistical methods for exploring and weighting prediction errors, such as false positive and false negative rates.
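As a worked example, the snippet below builds that 2x2 table from a handful of made-up labels and derives the standard metrics, including the false positive and false negative rates, from its four cells.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # actual cat / not-cat (made up)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])  # model predictions (made up)

# The four cells of the 2x2 table
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also known as the true positive rate
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(f"FPR={false_positive_rate:.2f} FNR={false_negative_rate:.2f}")
```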

Yes, You Do Still Need to Know Statistics

While machine learning and artificial intelligence may be the new kids on the block, they are built on a foundation of math and statistics. Understanding statistics is critical for anyone looking to pursue a career in data science: it provides the skills needed to analyze and interpret data accurately and to understand how models operate under the hood. This foundational understanding of statistics enables data scientists to maximize the accuracy and impact of the processes they create.

This post was developed with input from Bill Franks, internationally recognized thought leader, speaker, and author focused on data science & analytics.