
Fundamentals of Statistics

Open in Colab

Before starting with the details of machine learning, let us first recap some fundamental concepts of statistics. These are terms that you will use constantly if you stay in the business of machine learning. Machine learning is a science as much as it is a form of painting, where statistics and mathematics are the paint and the brushstrokes.

Probability distribution function

Let us first jump into the definition of a probability distribution function (PDF). They come in two flavours, discrete and continuous, and to qualify as a probability distribution both of them have to obey certain properties.

Discrete probability distribution

The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible outcomes. It is also sometimes called the probability mass function. Suppose a random variable $X$ may take $k$ different values, with the probability that $X = x_i$ defined to be $P(X = x_i) = p_i$. Then the probabilities $p_i$ must satisfy the following:

1: $0 < p_i < 1$ for each $i$

2: $p_1 + p_2 + \dots + p_k = 1$.
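As a quick sanity check, both properties can be tested in a few lines of Python. The helper `is_valid_pmf` below is hypothetical (not part of this notebook), and for practicality it accepts probabilities of exactly 0 or 1:

```python
import math

def is_valid_pmf(probs):
    """Check the two defining properties of a discrete probability distribution."""
    # Property 1: every probability lies between 0 and 1
    in_range = all(0 <= p <= 1 for p in probs)
    # Property 2: the probabilities sum to 1 (up to floating-point tolerance)
    sums_to_one = math.isclose(sum(probs), 1.0)
    return in_range and sums_to_one

print(is_valid_pmf([0.2, 0.3, 0.5]))  # True
print(is_valid_pmf([0.5, 0.6]))       # False: sums to 1.1
```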

Binomial distribution

This is the distribution of the number of successes in $n$ independent trials, where each trial has only two possible outcomes, success and failure, with probabilities $p$ and $1-p$. The probability of $k$ successes in $n$ trials is

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
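To confirm the formula term by term, a minimal sketch comparing a hand-computed value against `scipy.stats.binom.pmf` (the values of $n$, $p$, and $k$ are arbitrary):

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.3, 4
# Apply the formula above directly: C(n, k) * p^k * (1-p)^(n-k)
manual = comb(n, k) * p**k * (1 - p)**(n - k)
print(manual)              # ≈ 0.2001
print(binom.pmf(k, n, p))  # matches the hand-computed value
```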
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, binom
import ipywidgets as widgets
from IPython.display import display

def plot_binomial(n, p, k_highlight):
    k = np.arange(0, n + 1)  # Possible number of successes
    probs = binom.pmf(k, n, p)  # PMF values

    plt.figure(figsize=(10, 6))
    markerline, stemlines, baseline = plt.stem(k, probs)
    plt.setp(markerline, color='b', label="PMF")
    plt.setp(stemlines, color='b')

    # Highlight one point in red
    if 0 <= k_highlight <= n:
        plt.plot(k_highlight, binom.pmf(k_highlight, n, p), 'ro', label=f"P(X={k_highlight})")

    plt.title(f"Binomial Distribution PMF (n={n}, p={p})")
    plt.xlabel("Number of Successes (k)")
    plt.ylabel("Probability")
    plt.legend()
    plt.grid(True)
    plt.show()

widgets.interact(
    plot_binomial,
    n=widgets.IntSlider(value=10, min=1, max=50, step=1, description='Trials (n)'),
    p=widgets.FloatSlider(value=0.5, min=0.01, max=1.0, step=0.01, description='Success Prob (p)'),
    k_highlight=widgets.IntSlider(value=5, min=0, max=50, step=1, description='k Highlight')
    )

❓ Exercise

Q1: For a binomial distribution with $n = 30$ and success probability $p = 0.5$, what is the probability of getting 10 successes?

Click to show answer

Answer: The result is 0.02798. You can check this by using the function binom.pmf(10, 30, 0.5).

❓ Exercise

Q2: When is the binomial distribution most symmetric?

Click to show answer

Answer: A binomial distribution is most symmetric when p = 0.5.

Continuous probability distribution

As the name suggests, in this case the outcomes can take any continuous value, so one can only talk about outcomes lying between one number and another. For example, it is fair to ask what the probability is of some random outcome $x$ lying in the range $(a, b)$. The curve, which represents a function $p(x)$, must satisfy the following:

1: The curve has no negative values ($p(x) \ge 0$ for all values of $x$).

2: The total area under the curve is equal to 1.

Gaussian Distribution

The Gaussian, or normal, distribution is something you will find everywhere in data science. It is also one of the underlying assumptions of many data science algorithms.

A normal distribution has a bell-shaped density curve described by its mean $\mu$ and standard deviation $\sigma$. The density curve is symmetrical and centered about its mean, with its spread determined by the standard deviation, showing that data near the mean are more frequent in occurrence than data far from the mean. The probability density function of a normal curve with mean $\mu$ and standard deviation $\sigma$ at a given point $x$ is given by:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

For the more mathematically inclined, you can plug this distribution into the integrals below and confirm

  • $\int x f(x)\, dx = \mu$

  • $\int (x - \mu)^2 f(x)\, dx = \sigma^2$
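These integrals can also be checked numerically; a minimal sketch using `scipy.integrate.quad`, with arbitrarily chosen values of $\mu$ and $\sigma$:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 2.0

def f(x):
    # Normal density with the mean and standard deviation chosen above
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# First moment: should recover mu
mean, _ = quad(lambda x: x * f(x), -np.inf, np.inf)
# Second central moment: should recover sigma**2
var, _ = quad(lambda x: (x - mu)**2 * f(x), -np.inf, np.inf)
print(mean)  # ≈ 1.5  (= mu)
print(var)   # ≈ 4.0  (= sigma**2)
```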

def plot_normal(mu, sigma):
    x = np.linspace(mu - 8*sigma, mu + 8*sigma, 1000)
    y = norm.pdf(x, mu, sigma)
    plt.figure(figsize=(8, 4))
    plt.plot(x, y)
    plt.xlim(-20, 20) 
    #plt.ylim(0, 20)
    plt.title("Gaussian Distribution")
    plt.grid(True)
    plt.show()

widgets.interact(plot_normal, mu=(-5, 5, 0.5), sigma=(0.1, 5.0, 0.1))

Likelihood vs Probability

  • Probability: Given parameters, what’s the chance of observing the data?

  • Likelihood: Given data, how likely are the parameters?

Example:

  • Probability: “Given $p = 0.7$, what’s the probability of 3 heads in 5 tosses?”

  • Likelihood: “Given 3 heads in 5 tosses, what is the most likely value of $p$?”

# Likelihood visualization
obs_heads = 7
total_flips = 10
p_vals = np.linspace(0.01, 0.99, 100)
likelihoods = binom.pmf(obs_heads, total_flips, p_vals)

plt.figure(figsize=(8, 4))
plt.plot(p_vals, likelihoods)
plt.title("Likelihood Function")
plt.xlabel("p")
plt.ylabel("Likelihood")
plt.grid(True)
plt.show()
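The peak of the likelihood curve can also be read off numerically: for 7 heads in 10 flips, the maximizing value is the observed fraction $\hat{p} = 7/10$. A small sketch (the grid spacing is an arbitrary choice):

```python
import numpy as np
from scipy.stats import binom

obs_heads, total_flips = 7, 10
# Fine grid over candidate values of p (step 0.001, so 0.7 is on the grid)
p_vals = np.linspace(0.01, 0.99, 981)
likelihoods = binom.pmf(obs_heads, total_flips, p_vals)

# The grid point with the highest likelihood is the (approximate) MLE
p_hat = p_vals[np.argmax(likelihoods)]
print(p_hat)  # ≈ 0.7, i.e. obs_heads / total_flips
```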

❓ Exercise

Q3: Given 8 heads out of 10 tosses, sketch or estimate the maximum likelihood estimate (MLE) for $p$.

Click to show answer

Answer: The MLE for $p$ is $\frac{8}{10} = 0.8$, which is simply the observed fraction of heads. On a likelihood plot like the one above, this is where the curve peaks (its mode).

Histograms and Distribution Approximation

A histogram approximates the probability distribution of the data. With more samples, it more closely resembles the true distribution.

Key Concepts:

  • Histogram shape depends on sample size and bin width.

  • More data yields a smoother distribution.

  • In HEP, most of the time we will be looking at histograms of observed results from the detector. The goal of ML in this case is essentially to find a function that fits this empirical distribution.
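This convergence can be made quantitative. The sketch below (not in the original notebook; binning and sample sizes are arbitrary) measures the largest deviation between a normalised histogram and the true normal density, which shrinks as the sample grows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
errs = {}
for N in [100, 100_000]:
    data = rng.normal(0, 1, N)
    # Normalised histogram (density=True) over a fixed range and binning
    density, edges = np.histogram(data, bins=30, range=(-4, 4), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Largest deviation between the histogram and the true normal density
    errs[N] = np.max(np.abs(density - norm.pdf(centers)))
    print(N, errs[N])  # the error decreases with N
```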

np.random.seed(42)
for N in [10, 100, 1000, 10000]:
    data = np.random.normal(0, 1, N)
    sns.histplot(data, kde=True, stat="density", bins=30) 

    # A kernel density estimate (KDE) plot is a method for visualizing the distribution 
    # of observations in a dataset, analogous to a histogram. KDE represents the data 
    # using a continuous probability density curve in one or more dimensions.

    plt.title(f"Histogram with N={N}")
    plt.grid(True)
    plt.show()

❓ Exercise

Q4: Why does the histogram with $N = 10$ look so different from the one with $N = 10000$?

Click to show answer

Answer: With N=10, there are too few samples to capture the underlying distribution, resulting in high variance and noise.

Surprise, entropy, and Gini index

Surprise (Self-Information)

Surprise is a measure of how “unexpected” an event is: the more probable the event, the less surprising it should be. For an event with probability $p$, one could simply use $\frac{1}{p}$, but to allow “zero” surprise for a certain event, the surprise (or self-information) is defined as

$$I(p) = -\log_2(p)$$

Shannon entropy

  • Entropy is the average surprise across all possible outcomes.

  • For a random variable $X$ with outcomes $x_i$ and probabilities $p_i$, Shannon entropy is defined as

$$S(X) = -\sum_i p_i \log_2(p_i)$$

  • Higher entropy → more uncertainty; lower entropy → less uncertainty

Gini index

  • Taking $\log$ is computationally more expensive, so other functions are often used to quantify the same notion as entropy.

  • To measure the impurity of a dataset, we use the Gini index:

$$\text{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$$

  • Here $C$ is the number of unique classes in the dataset, and $p_i$ is the proportion of samples belonging to class $i$ in the dataset $D$.

❓ Exercise

Q5: If all samples belong to the same class, what is the Gini index?

Click to show answer

Answer: With a single class, $p_1 = 1$, so $\text{Gini} = 1 - 1^2 = 0$: the dataset is completely pure.

import math 

def calculate_surprise(probability):
    """Calculates the surprise (negative log probability) for a given event."""
    if probability == 0:
        return float('inf')  # Handle the case where probability is zero to avoid log(0) error
    return -math.log2(probability)

def calculate_gini(probabilities):
    """Calculates the Gini index of a probability distribution."""
    gini = 1
    for probability in probabilities:
        gini -= probability**2
    return gini

def calculate_entropy(probabilities):
    """Calculates the entropy of a probability distribution."""
    entropy = 0
    for probability in probabilities:
        if probability > 0:  # Handle the case where probability is zero to avoid log(0) error
            entropy -= probability * math.log2(probability)
    return entropy

def visualize_surprise_and_entropy(probabilities, title="Surprise and Entropy"):
    """Visualizes surprise and entropy for different probability distributions."""
    n_colors = len(probabilities)
    colors = ['red', 'green', 'blue', 'orange', 'purple', 'brown', 'pink', 'cyan', 'magenta', 'black']  # Add more colors if needed
    
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    
    # Plot Surprise
    plt.title("Surprise (Negative Log Probability)")
    plt.bar(range(n_colors), [calculate_surprise(p) for p in probabilities], color=colors[:n_colors])
    plt.xlabel("Color (Event)")
    plt.ylabel("Surprise")
    
    # Plot Entropy
    plt.subplot(1, 2, 2)
    plt.title("Entropy")
    plt.pie(probabilities, labels=[f"Color {i+1}" for i in range(n_colors)], autopct='%1.1f%%', startangle=140)
    plt.ylabel("Entropy")
    
    plt.suptitle(title, fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
probabilities_1 = [0.2, 0.2, 0.2, 0.2, 0.2]
print("The entropy in this case: % .2f" % calculate_entropy(probabilities_1))
print("The Gini index: % .2f" % calculate_gini(probabilities_1))
visualize_surprise_and_entropy(probabilities_1, title="High Entropy, Low Surprise (Equal Probabilities)")
    
The entropy in this case:  2.32
The Gini index:  0.80
# Example 2: Unequal probabilities (lower entropy, higher surprise for rarer events)
probabilities_2 = [0.7, 0.1, 0.1, 0.05, 0.05]
print("The entropy in this case: % .2f" % calculate_entropy(probabilities_2))
print("The Gini index: % .2f" % calculate_gini(probabilities_2))
visualize_surprise_and_entropy(probabilities_2, title="Low Entropy, High Surprise (Unequal Probabilities)")
The entropy in this case:  1.46
The Gini index:  0.48
# Example 3: One event with high probability, others with zero (zero entropy, infinite surprise for zero probability events)
probabilities_3 = [0.99, 0.005, 0.005, 0, 0]
print("The entropy in this case: % .2f" % calculate_entropy(probabilities_3))
print("The Gini index: % .2f" % calculate_gini(probabilities_3))
visualize_surprise_and_entropy(probabilities_3, title="Almost zero Entropy, High Surprise (One High Probability)")
The entropy in this case:  0.09
The Gini index:  0.02

❓ Exercise

Q6: For different situations that you can think of, compare the Gini index and Shannon entropy quantitatively.
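As a starting point for this exercise, the two impurity measures can be compared on a two-class distribution with class probabilities $p$ and $1-p$. The sketch below (not part of the original notebook) shows that both measures vanish for a pure dataset and peak at $p = 0.5$, where the Gini index reaches $0.5$ and the entropy reaches 1 bit:

```python
import numpy as np

# Sweep the probability of class 1 for a two-class dataset
p = np.linspace(0.001, 0.999, 999)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p**2 + (1 - p)**2)

# Both curves are maximised at the most impure split, p = 0.5
i = np.argmax(gini)
print(p[i], gini[i], entropy[i])  # 0.5, Gini = 0.5, entropy = 1 bit
```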