This article is the sixth in my series of math notes. The series records my notes and insights from math courses during my undergraduate studies, interspersed with the deep connections between probability and statistics and modern AI technology; I hope it is helpful to you. The theme this time is probability theory and mathematical statistics.
Why Learn Probability and Statistics
Probability and statistics are essential for machine learning: statistical methods are used to select and optimize models such as linear regression, logistic regression, and support vector machines, while Bayesian methods are applied to parameter estimation and model selection, for example when constructing Bayesian networks and hidden Markov models.
The core of generative AI is probability distribution modeling. Language models predict the probability distribution of the next word, which directly applies conditional probability theory. The model training process relies on statistical inference methods such as maximum likelihood estimation and Bayesian inference to learn parameters from data.
Core knowledge includes:
- Random events and probability
- One-dimensional discrete random variables and their distributions
- One-dimensional continuous random variables and their distributions
- Two-dimensional discrete random variables and their distributions
- Two-dimensional continuous random variables and their distributions
- Mathematical expectation of random variables
- Variance and covariance of random variables
At the end of the article, there are practice questions from real exams.
Random Events and Probability
a. Conditional Probability Formula
Conditional probability describes the probability of an event given that another event has already occurred. The formula is as follows:

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$
The basic task of a language model is to predict the probability distribution of the next word in a sequence. Mathematically, given the preceding word sequence $w_1, w_2, \ldots, w_{t-1}$, the model needs to calculate the conditional probability of the next word $w_t$:

$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$
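Here is a minimal sketch of this idea using a bigram count model; the toy corpus and the helper name `next_word_distribution` are illustrative, and a real language model replaces raw counts with a neural network over a much longer context:

```python
from collections import Counter, defaultdict

# A toy corpus; real models use vastly more data and longer contexts.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams: how often does `nxt` follow `prev`?
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Estimate P(next word | previous word) from bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_distribution("the"))
# e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```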
b. Bayes' Theorem
It uses prior probability and conditional probability to update the probability of an event occurring, and it is the core foundation of Bayesian statistics (inferring the cause from the observed result). Its formula is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
For example, suppose you took a COVID-19 test, and the result is positive. We want to know the probability that you are actually infected with COVID-19. Assume that in a certain city, 1 in 1000 people is infected with COVID-19 (prior probability); if a person is indeed infected, the probability of testing positive is 99% (likelihood), and among uninfected people, there is a 5% chance of testing positive.
Let testing positive be $B$ and infection be $A$. Combining the conditional probability formula with the law of total probability and substituting into Bayes' theorem gives:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B \mid A)P(A) + P(B \mid \bar{A})P(\bar{A})} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.0194$$
This means that even if your COVID test is positive, the probability that you have COVID is only 1.94%.
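As a quick check, the calculation above can be reproduced in a few lines of Python (the function name is just for illustration):

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(infected | positive) via Bayes' theorem with a two-case total probability."""
    # P(positive) = P(pos | infected)P(infected) + P(pos | healthy)P(healthy)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

print(bayes_posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.05))
# ≈ 0.0194, matching the 1.94% above
```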
Another example: a person can travel by plane, train, ship, or car, with probabilities of 5%, 20%, 30%, and 45%, respectively. The probabilities of arriving on time for these modes of transport are 100%, 70%, 50%, and 80%, respectively. Assuming the person arrives on time, what is the probability that they took the train?
Let $A$ be the event of taking the train and $B$ the event of arriving on time. According to Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$
Using the law of total probability, we can easily find that $P(B) = 0.05 \times 1 + 0.2 \times 0.7 + 0.3 \times 0.5 + 0.45 \times 0.8 = 0.7$. Thus, the result is $P(A \mid B) = \frac{0.7 \times 0.2}{0.7} = 0.2$.
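The same law-of-total-probability computation, sketched in Python with the numbers from the example:

```python
# Travel modes with their prior probabilities and on-time likelihoods.
priors = {"plane": 0.05, "train": 0.20, "ship": 0.30, "car": 0.45}
on_time = {"plane": 1.00, "train": 0.70, "ship": 0.50, "car": 0.80}

# Law of total probability: P(on time) = sum over modes of P(on time | mode) P(mode)
p_on_time = sum(priors[m] * on_time[m] for m in priors)  # 0.70

# Bayes: P(train | on time)
print(priors["train"] * on_time["train"] / p_on_time)  # 0.2
```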
Modern language models, while not directly using Bayes' theorem, operate similarly by inferring the most likely next word (posterior probability) based on the observed context (prior information).
c. Multiplication Formula
The multiplication formula states that $P(AB) = P(A)\,P(B \mid A)$, which for independent events reduces to $P(AB) = P(A)P(B)$. Suppose you flip a fair coin twice; what is the probability of getting heads both times?
- Event A: getting heads on the first flip, $P(A) = \frac{1}{2}$
- Event B: getting heads on the second flip, $P(B) = \frac{1}{2}$
- Since the two flips are independent, $P(AB) = P(A)P(B) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$
Thus, the probability is 1/4.
d. Independence of Events
If two events $A$ and $B$ are independent, they have the following important properties:
- The occurrence of one event does not change the probability of the other event occurring, i.e., $P(A \mid B) = P(A)$, and vice versa.
- There is no transitivity: $A$ being independent of $B$ and $B$ being independent of $C$ does not guarantee that $A$ is independent of $C$; whether $P(AC) = P(A)P(C)$ holds must be verified separately.
For example, in a printing shop app, the order processing time $X$ is related to the order amount $Y$, indicating a lack of independence: high-value orders (e.g., large posters) typically require longer processing times.
Joint Probability Distribution
Joint probability distribution is a complete description of all possible combinations of values of multiple random variables and their corresponding joint probabilities. For discrete random variables, joint probability distribution is usually represented using a table; for continuous random variables, it is described using a joint probability density function.
Marginal Probability
Marginal probability can be calculated by summing (for discrete variables, $P(X = x_i) = \sum_j P(X = x_i, Y = y_j)$) or integrating (for continuous variables, $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$) over the joint probability distribution.
One-Dimensional Discrete Random Variables and Their Distributions
The distribution of discrete random variables is described using a probability mass function.
The distribution of continuous random variables is described using a probability density function.
a. Geometric Distribution
In independent Bernoulli trials (where the outcomes are binary and the trials are independent), the number of trials $X$ until the first success follows a geometric distribution with parameter $p$. For example:
- A shooter fires at a target repeatedly, with probability $p$ of hitting, and stops after the first hit. What is the probability that $k$ shots are needed before stopping?
- A die is rolled repeatedly, stopping when a 6 is rolled. What is the probability that $k$ rolls are needed before stopping?
- A coin is flipped repeatedly, stopping when heads is flipped. What is the probability that $k$ flips are needed before stopping?
In these examples, $X$ (the number of trials, taking values 1, 2, 3, ...) is the random variable, and we say these random variables follow a geometric distribution.
For the example of rolling a die, suppose we want to find the probability of needing 3 trials; we can directly substitute into the probability mass function $P(X = k) = (1 - p)^{k-1} p$:

$$P(X = 3) = \left(\frac{5}{6}\right)^{2} \times \frac{1}{6} \approx 0.1157$$

Here, $p = \frac{1}{6}$ and $k = 3$, so the probability is 0.1157.
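A small sketch that evaluates the geometric PMF and cross-checks it by simulating die rolls (the helper names are illustrative):

```python
import random

def geometric_pmf(k, p):
    """P(exactly k trials until the first success): (1-p)^(k-1) * p."""
    return (1 - p) ** (k - 1) * p

print(geometric_pmf(3, 1 / 6))  # ≈ 0.1157, the die-rolling example above

# Cross-check by simulation: roll until a 6 appears, record how many rolls it took.
def rolls_until_six():
    n = 1
    while random.randint(1, 6) != 6:
        n += 1
    return n

samples = [rolls_until_six() for _ in range(100_000)]
print(sum(1 for n in samples if n == 3) / len(samples))  # ≈ 0.1157
```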
b. Binomial Distribution
If an experiment has only two possible outcomes (success or failure) and is repeated $n$ times with a success probability of $p$ each time, then the random variable $X$ represents the number of successes. For example:
- Suppose a coin is flipped 10 times, with the probability of heads being 0.5. What is the probability of getting heads $k$ times?
- A product has a defect rate of 10%. If 50 items are sampled, how many defective items are most likely to be found?
Here, $X$ is the random variable, and we say these random variables follow a binomial distribution, denoted $X \sim B(n, p)$.
For the coin flip example, suppose we want to find the probability of getting heads 6 times; we can directly substitute into the probability mass function $P(X = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}$:

$$P(X = 6) = \binom{10}{6} \times 0.5^{6} \times 0.5^{4} \approx 0.205$$

Here, $n = 10$, $p = 0.5$, and $k = 6$, so the probability is 0.205.
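The same substitution can be verified with `math.comb` (a minimal sketch):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials): C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(6, 10, 0.5))  # ≈ 0.205, the coin example above
```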
Now, consider a more complex example:
Let the random variable $X$ follow a binomial distribution $B(n, p)$ with $np = 4$. What value of $k$ maximizes $P(X = k)$?
For a binomial distribution, the probability mass function reaches its maximum at $k = np$ when $np$ is an integer. If $np$ is not an integer, the maximum occurs at the floor or ceiling of $np$ (more precisely, at $\lfloor (n + 1)p \rfloor$). Therefore, the answer to this question is 4.
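A brute-force scan over the PMF confirms the rule; the parameters $B(10, 0.4)$ are an illustrative assumption, chosen only because they give $np = 4$:

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# For an illustrative B(10, 0.4), np = 4 is an integer, so the mode should be 4.
n, p = 10, 0.4
mode = max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))
print(mode)  # 4
```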
c. Poisson Distribution
If the number of occurrences of an event per unit time is a random variable $X$ with average rate $\lambda$, then $X$ follows a Poisson distribution with parameter $\lambda$, denoted as $X \sim P(\lambda)$. For example:
- A website receives an average of 2 access requests per minute. What is the probability of receiving 3 access requests in one minute?
- A product has average monthly sales of 4 units. What is the probability that no more than 8 units are demanded in a month?
For the example of product sales, the probability mass function of the Poisson distribution is:

$$P(X = k) = \frac{\lambda^{k}}{k!} e^{-\lambda}$$

Here $\lambda = 4$, and the probability that demand does not exceed 8 units is $P(X \le 8) = \sum_{k=0}^{8} \frac{4^{k}}{k!} e^{-4}$. Consulting the Poisson distribution table gives a probability of 0.9786. In other words, if 8 units are stocked each month, there is a 97.86% probability that the product will not run out of stock.
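A short sketch that sums the Poisson PMF instead of consulting the table:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

# P(X <= 8) for lam = 4: the probability that monthly demand fits within 8 stocked units.
lam = 4
print(sum(poisson_pmf(k, lam) for k in range(9)))  # ≈ 0.9786
```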
One-Dimensional Continuous Random Variables and Their Distributions
Calculating continuous random variables involves definite integrals.
It is worth noting that the three distributions below each have their own notation: $N(\mu, \sigma^2)$ for the normal, $U(a, b)$ for the uniform, and $E(\lambda)$ for the exponential. Each distribution has a corresponding probability density function, and the integral of the probability density function is the probability distribution function, also known as the cumulative distribution function.
For a function $f(x)$ to be a density function, it must satisfy the following conditions:
- $f(x) \ge 0$ everywhere.
- Its definite integral over the whole range must equal 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
The corresponding distribution function $F(x) = \int_{-\infty}^{x} f(t)\,dt$ is then non-decreasing and continuous.
Modern generative models (such as VAE, GAN, and diffusion models) typically operate in continuous latent spaces. Each dimension of these latent vectors can be viewed as a continuous random variable following a specific probability distribution (such as a normal distribution).
a. Normal Distribution
The normal distribution is one of the most common continuous probability distributions, describing random variables in many natural phenomena. A random variable $X$ follows a normal distribution with parameters $\mu$ (mean) and $\sigma^2$ (variance), denoted as $X \sim N(\mu, \sigma^2)$.
Its probability density function is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

The mean $\mu$ determines the center of symmetry of the distribution, while the variance $\sigma^2$ determines the "width" of the distribution. Essentially, it is a Gaussian function.
For example:
- Suppose exam scores follow a normal distribution with a mean of 75 and a standard deviation of 10. What is the probability that a student scores above 85?
- A brand of light bulbs has a lifespan that follows a normal distribution, with an average lifespan of 1000 hours and a standard deviation of 200 hours. What is the probability that a light bulb lasts more than 1300 hours?
- The height of adult males in a certain country follows a normal distribution, with an average height of 175 cm and a standard deviation of 6 cm. What is the probability that a randomly selected adult male has a height between 180 and 190 cm?
For the exam example, we first standardize (the $z$-score represents the deviation of an observation from the mean, in units of the standard deviation):

$$z = \frac{x - \mu}{\sigma} = \frac{85 - 75}{10} = 1$$

According to the normal distribution table, $P(X > 85) = 1 - \Phi(1) = 1 - 0.8413 = 0.1587$.
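The table lookup can be reproduced with the error function from Python's standard library (a minimal sketch):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Phi((x - mu) / sigma), computed with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(score > 85) for N(75, 10^2): standardize to z = 1, then take the upper tail.
print(1 - normal_cdf(85, mu=75, sigma=10))  # ≈ 0.1587
```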
The normal distribution has some elegant properties:
- The graph is symmetric about $x = \mu$.
b. Uniform Distribution
The uniform distribution indicates that all values within a certain interval are equally likely. A random variable $X$ follows a uniform distribution on the interval $[a, b]$, denoted as $X \sim U(a, b)$. For example:
- Suppose the weight of a certain product is uniformly distributed between 50 grams and 100 grams. What is the probability that the product weighs between 60 grams and 80 grams?
Its probability density function is:

$$f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$
This example demonstrates an important property of the uniform distribution: probability is proportional to the length of the interval. We only need the ratio of the interval of interest to the total interval: $P(60 \le X \le 80) = \frac{80 - 60}{100 - 50} = 0.4$.
c. Exponential Distribution
The exponential distribution is commonly used to describe the time intervals between events. A random variable follows an exponential distribution with parameter $\lambda$ (the rate parameter), denoted as $E(\lambda)$. For example, $X \sim E(\lambda)$ indicates that $X$ follows an exponential distribution with rate $\lambda$. Its probability density function is:

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0 \\ 0, & x \le 0 \end{cases}$$
For example, let the random variable $X \sim E(\lambda)$. What is $P(X > s + t \mid X > s)$ for $s, t > 0$? Since $P(X > x) = e^{-\lambda x}$,

$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$$

Note that this uses the conditional probability formula; the result is the memoryless property of the exponential distribution.
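A simulation sketch of this memoryless property; the values of `lam`, `s`, and `t` are arbitrary choices for illustration:

```python
import random

# Empirically check P(X > s + t | X > s) = P(X > t) for an exponential distribution.
lam, s, t = 2.0, 0.5, 1.0
samples = [random.expovariate(lam) for _ in range(200_000)]

beyond_s = [x for x in samples if x > s]
cond = sum(1 for x in beyond_s if x > s + t) / len(beyond_s)
uncond = sum(1 for x in samples if x > t) / len(samples)
print(cond, uncond)  # both ≈ e^(-lam * t) ≈ 0.1353
```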
Two-Dimensional Discrete Random Variables and Their Distributions
Two-dimensional discrete random variables are pairs of discrete random variables, used to describe the joint distribution between two variables. Let $X$ and $Y$ be two discrete random variables, and define their joint probability distribution $P(X = x_i, Y = y_j) = p_{ij}$, which represents the probability that $X$ takes the value $x_i$ and $Y$ takes the value $y_j$.
We typically use a table to represent the joint distribution. The joint distribution has the following properties:
- The sum of all probabilities must equal 1.
- If $X$ and $Y$ are independent, this means $p_{ij} = P(X = x_i)\,P(Y = y_j)$ holds for each pair $(i, j)$ (see the sketch below).
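A minimal sketch of these checks on a small joint table (the numbers are made up for illustration):

```python
import numpy as np

# A joint distribution table for (X, Y): rows are values of X, columns values of Y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])
assert np.isclose(joint.sum(), 1.0)  # all probabilities sum to 1

p_x = joint.sum(axis=1)  # marginal of X: [0.3, 0.7]
p_y = joint.sum(axis=0)  # marginal of Y: [0.4, 0.6]

# Independence check: does p_ij = P(X=x_i) P(Y=y_j) hold for every cell?
print(np.allclose(joint, np.outer(p_x, p_y)))  # False: this table is not independent
```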
Two-Dimensional Continuous Random Variables and Their Distributions
Two-dimensional continuous random variables refer to a pair of random variables $(X, Y)$ whose values range continuously over some region of the two-dimensional real plane and whose probability distribution is described by a joint probability density function $f(x, y)$.
This distribution must also be normalized: the double integral of $f(x, y)$ over the whole plane equals 1. Two random variables $X$ and $Y$ are independent if and only if their joint probability density function can be factored into the product of their marginal density functions, $f(x, y) = f_X(x)\,f_Y(y)$.
For example, in the case of the printing shop app, let the processing time be $X$ and the order amount be $Y$. Given the joint density function, we can obtain the following information:
- The marginal density function $f_X(x)$ can show whether orders with longer processing times are more or less common.
- The marginal density function $f_Y(y)$ can show whether high-value or low-value orders are more common.
- The expected value $E(X)$ gives the average processing time.
- The expected value $E(Y)$ gives the average order amount.
Mathematical Expectation of Random Variables
The mathematical expectation of a random variable, also known as the expected value, is the weighted average of the values taken by the random variable, with the weights being the corresponding probabilities. Let $X$ be a random variable and $g(X)$ be a function of $X$; the calculation of the mathematical expectation differs depending on whether $X$ is discrete or continuous.
The expected value formulas for common distributions are as follows:
- Geometric distribution: $E(X) = \frac{1}{p}$
- Poisson distribution: $E(X) = \lambda$
- Uniform distribution: $E(X) = \frac{a + b}{2}$
- Exponential distribution: $E(X) = \frac{1}{\lambda}$
a. Discrete Case
The formula for calculating the expected value of a discrete random variable is:

$$E(X) = \sum_i x_i p_i$$

where $x_i$ is a value of the discrete random variable and $p_i$ is the corresponding probability.
For example, let $X$ be the outcome of a die roll, where $X$ takes values from 1 to 6, each with probability $\frac{1}{6}$. The expected value of $X$ is:

$$E(X) = \sum_{k=1}^{6} k \times \frac{1}{6} = 3.5$$
Now consider a more complex example: finding the expectation of a piecewise-defined function of a random variable on $[-2, 2]$. The implicit conditions are:
- the probabilities over $[-2, 2]$ must sum to 1,
- the interval $[-2, 2]$ can be divided into three segments, on each of which the function takes a different form.
Thus, evaluating the function segment by segment and substituting into the expectation formula gives the result.
b. Continuous Case
Suppose $X$ is a continuous random variable with probability density function $f(x)$; then the mathematical expectation (also known as the mean) of $X$ is defined as:

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
This is the continuous analogue of the discrete formula, with the sum replaced by a definite integral. The integral results for common distributions are worth memorizing directly; see the list at the beginning of this chapter.
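A numerical sketch, assuming SciPy is available: integrating $x f(x)$ for an exponential density should reproduce the closed form $1/\lambda$:

```python
from math import exp, inf
from scipy.integrate import quad

# E(X) = integral of x f(x) dx; for f(x) = lam * e^(-lam x) on (0, inf),
# the closed form is 1 / lam, so lam = 2 should give 0.5.
lam = 2.0
value, _ = quad(lambda x: x * lam * exp(-lam * x), 0, inf)
print(value)  # ≈ 0.5
```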
c. Chebyshev's Inequality
Chebyshev's inequality provides an upper bound on the probability that a random variable deviates from its expected value by a certain distance. This inequality applies to any random variable, regardless of its distribution shape.
Let $X$ be a random variable with expected value $\mu$ and variance $\sigma^2$. For any positive number $k > 0$, Chebyshev's inequality states:

$$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}$$
For example, suppose the average score in a math exam for a class is $\mu = 75$ points, and the standard deviation is $\sigma = 10$ points. We want to know the upper limit of the proportion of students whose scores deviate from the average by more than 20 points.
Here $k = 20$, and according to Chebyshev's inequality, $P(|X - 75| \ge 20) \le \frac{10^2}{20^2} = 0.25$, which means at most 25% of students may score below 55 or above 95.
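A simulation sketch comparing the empirical tail with the Chebyshev bound; modeling the scores as normally distributed is an assumption made purely for illustration:

```python
import random

# Compare the empirical tail P(|X - mu| >= k) with the Chebyshev bound sigma^2 / k^2.
mu, sigma, k = 75, 10, 20
scores = [random.gauss(mu, sigma) for _ in range(100_000)]

empirical = sum(1 for x in scores if abs(x - mu) >= k) / len(scores)
bound = sigma**2 / k**2
print(empirical, bound)  # ≈ 0.0455 vs 0.25: the bound holds but is loose
```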
Variance and Covariance of Random Variables
Variance and covariance are used to describe the distribution and relationships between random variables.
a. Variance
Variance measures the degree of dispersion of the values taken by a random variable, i.e., how far its values deviate from the expected value. Let $X$ be a random variable with expected value $E(X)$; then the variance of $X$, denoted $D(X)$ or $\mathrm{Var}(X)$, is defined as:

$$D(X) = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2$$
Additionally, variance has several important properties:
- For a constant $a$ and random variable $X$, $D(aX) = a^2 D(X)$
- For independent random variables $X$ and $Y$, $D(X + Y) = D(X) + D(Y)$
- For independent random variables $X$ and $Y$, $D(X - Y) = D(X) + D(Y)$
- For the geometric distribution, the variance equals $\frac{1 - p}{p^2}$
- For the binomial distribution, the variance equals $np(1 - p)$
- For the exponential distribution, the variance equals $\frac{1}{\lambda^2}$
- For the Poisson distribution, the variance equals $\lambda$
- For the uniform distribution, the variance equals $\frac{(b - a)^2}{12}$
For example, let $X$ be the outcome of a die roll, taking values from 1 to 6. The variance of $X$ is:

$$D(X) = E(X^2) - [E(X)]^2 = \frac{1^2 + 2^2 + \cdots + 6^2}{6} - 3.5^2 = \frac{91}{6} - 12.25 = \frac{35}{12} \approx 2.92$$
Now consider another example: let random variables $X$ and $Y$ be independent, with variances of 4 and 8, respectively. What is the variance of $X + Y$?
Using the properties mentioned above, we can easily find: $D(X + Y) = D(X) + D(Y) = 4 + 8 = 12$.
b. Covariance
Covariance is used to describe the linear relationship between two random variables. Let $X$ and $Y$ be two random variables with expected values $E(X)$ and $E(Y)$; then the covariance of $X$ and $Y$, denoted $\mathrm{Cov}(X, Y)$, is defined as:

$$\mathrm{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big] = E(XY) - E(X)E(Y)$$
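A quick numerical sketch of this formula on synthetic data (the linear relationship between `x` and `y` is constructed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # y depends linearly on x

# Cov(X, Y) = E[XY] - E[X]E[Y]; for this construction it should be ≈ 2.
print(np.mean(x * y) - np.mean(x) * np.mean(y))
print(np.cov(x, y)[0, 1])  # library version agrees
```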
✍️ Comprehensive Exercises
Examination of Continuous Random Variables
Given a density function of a continuous random variable $X$ containing an unknown constant, find the constant and the required probability.
The key is normalization: the integral of the density over its entire support must equal 1.
Calculating this integral and setting it equal to 1 gives an equation for the unknown constant; solving it determines the constant, and the required probability then follows by integrating the density over the interval in question.
Examination of the Law of Total Probability and Independence
This problem can be solved using the law of total probability, which implicitly involves knowledge of joint probabilities.
Examination of Two-Dimensional Random Variables
The key to solving this problem is understanding that the probability density function must satisfy the basic property that the integral over the entire sample space equals 1.
Thus, we need to compute the double integral and solve the resulting equation for the unknown constant.
Examination of Distribution Functions
The key to this problem is understanding that $F(x) = P(X \le x)$.
That is, the distribution function is defined as the probability that $X$ is less than or equal to $x$.
Examination of Normal Distribution
This problem requires understanding the two parameters of the normal distribution. We only need to determine the two parameters (expected value and variance) from the given conditions to obtain the answer.
Comprehensive Examination of Variance and Binomial Distribution
This problem requires immediate recognition that the random variable is essentially binomially distributed. First, we find the single-trial success probability $p$. The variance formula for the binomial distribution is $D(X) = np(1 - p)$, where $n$ is 3; substituting $p$ yields the answer, which is option C.
Understanding Random Variables
Each machine fails with probability 0.01 (independently of the others), and one person supervises 20 machines simultaneously. What is the probability that failed machines cannot be repaired in time?
Essentially, this asks for the probability that two or more machines fail simultaneously, since one person can only attend to one machine at a time.
The random variable in this problem, the number of failed machines $X$, follows a binomial distribution with $p = 0.01$ and $n = 20$. We are looking for:

$$P(X \ge 2) = 1 - P(X = 0) - P(X = 1) = 1 - 0.99^{20} - 20 \times 0.01 \times 0.99^{19} \approx 0.0169$$
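The tail probability can be computed directly (a minimal sketch):

```python
from math import comb

# P(at least 2 of 20 machines fail) = 1 - P(0 fail) - P(1 fails),
# reading "cannot be repaired in time" as two or more simultaneous failures.
n, p = 20, 0.01
p0 = (1 - p) ** n
p1 = comb(n, 1) * p * (1 - p) ** (n - 1)
print(1 - p0 - p1)  # ≈ 0.0169
```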
Comprehensive Examination of Distributions
Let $X$ follow a uniform distribution on an interval, and suppose we conduct 3 independent observations of $X$. What is the probability that at least two observations are greater than 3?
First, we use the uniform distribution to find the probability $p$ that a single observation is greater than 3: it is the ratio of the length of the sub-interval above 3 to the length of the whole interval.
Then we can use the binomial distribution with $n = 3$ to find the result: $P(\text{at least two}) = \binom{3}{2} p^{2}(1 - p) + p^{3}$, as computed in the sketch below.
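A sketch of the full calculation; since the original interval is not recoverable here, $X \sim U(2, 5)$ is assumed purely for illustration:

```python
from math import comb

# The interval for X is not given above; U(2, 5) is an assumption for
# illustration, giving p = P(X > 3) = (5 - 3) / (5 - 2) = 2/3.
a, b, threshold, n = 2, 5, 3, 3
p = (b - threshold) / (b - a)

# P(at least 2 of n observations exceed the threshold)
print(sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(2, n + 1)))
# 20/27 ≈ 0.7407 under the assumed interval
```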
Conclusion
The essence of probability and statistics is to describe, understand, and utilize uncertainty. It analyzes the regularities of random phenomena through mathematical tools and methods, providing a scientific basis for decision-making, prediction, and inference. It is a core discipline for dealing with an uncertain world, widely applied in the fields of natural sciences, social sciences, and engineering technology.