Probability

Probability is the mathematical language of uncertainty. Grounded in set theory and measure theory, it provides the foundation for statistics, machine learning, information theory, and every rational approach to inference.


Sample Spaces and Events

Definition — Sample Space

The sample space $\Omega$ is the set of all possible outcomes of a random experiment. An event is any subset $A \subseteq \Omega$.

Fair coin flip

$\Omega = \{H, T\}$
Event: $A = \{H\}$

Rolling a die

$\Omega = \{1,2,3,4,5,6\}$
Event: $A = \{2,4,6\}$ (even)

Since events are sets, set operations apply: $A \cup B$ is the event that $A$ or $B$ occurs; $A \cap B$ is the event that both occur; $A^c$ is the event that $A$ does not occur.
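As a minimal sketch, the die example can be modeled with Python sets. The second event `B` (a roll greater than 3) and the helper `prob` are illustrative assumptions that do not appear in the text above; `prob` assumes a fair die, so every outcome is equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die.
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability of an event, assuming all outcomes in omega are equally likely."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})  # the roll is even
B = frozenset({4, 5, 6})  # the roll is greater than 3 (illustrative second event)

a_or_b = A | B       # union: A or B occurs
a_and_b = A & B      # intersection: both occur
not_a = omega - A    # complement: A does not occur

for name, event in [("A or B", a_or_b), ("A and B", a_and_b), ("not A", not_a)]:
    print(name, sorted(event), prob(event))
```

Exact fractions are used instead of floats so that the probabilities can be compared without rounding error.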

Kolmogorov's Axioms

Andrey Kolmogorov (1933) placed probability on a rigorous axiomatic foundation. A probability measure is a function $P : \mathcal{F} \to [0,1]$, where $\mathcal{F}$ is the collection of events, satisfying:

1. Non-negativity

$$P(A) \geq 0 \text{ for all events } A$$

2. Normalization

$$P(\Omega) = 1$$

3. Additivity

$$P(A \cup B) = P(A) + P(B) \text{ if } A \cap B = \emptyset$$

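In Kolmogorov's formulation, additivity is required for countable collections of pairwise disjoint events, which includes the two-event case. From these three axioms, the rest of probability theory follows. Key consequences: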

$$P(\emptyset) = 0 \qquad P(A^c) = 1 - P(A) \qquad P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
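The three consequences can be derived in a few lines from the axioms alone. The following is one possible route, using only normalization and additivity for disjoint events:

```latex
\begin{align*}
% Complement rule: A and A^c are disjoint and their union is \Omega.
1 = P(\Omega) = P(A \cup A^c) &= P(A) + P(A^c)
  \;\Longrightarrow\; P(A^c) = 1 - P(A) \\
% Empty set: apply the complement rule with A = \Omega.
P(\emptyset) = P(\Omega^c) &= 1 - P(\Omega) = 0 \\
% Union: A \cup B is the disjoint union of A and B \cap A^c.
P(A \cup B) &= P(A) + P(B \cap A^c) \\
% B is the disjoint union of A \cap B and B \cap A^c.
P(B) &= P(A \cap B) + P(B \cap A^c) \\
% Solve the previous line for P(B \cap A^c) and substitute.
P(A \cup B) &= P(A) + P(B) - P(A \cap B)
\end{align*}
```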

Conditional Probability

Definition — Conditional Probability

The probability of event $A$ given that event $B$ has occurred (with $P(B) > 0$):
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Conditioning restricts the sample space to $B$ and re-normalizes. The multiplication rule follows directly:

$$P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$
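The definition and the multiplication rule can be checked on a fair die. This is a sketch under assumptions: the events `A` (even roll) and `B` (roll greater than 3) and the helpers `prob` and `cond_prob` are chosen here for illustration.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # one roll of a fair die

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b); only defined when P(b) > 0."""
    if not b:
        raise ValueError("conditioning event must have positive probability")
    return prob(a & b) / prob(b)

A = frozenset({2, 4, 6})  # even roll
B = frozenset({4, 5, 6})  # roll greater than 3 (illustrative)

p_a_given_b = cond_prob(A, B)
p_b_given_a = cond_prob(B, A)

# Multiplication rule: both products recover P(A and B).
print(p_a_given_b * prob(B), p_b_given_a * prob(A), prob(A & B))
```

Knowing that the roll exceeds 3 raises the probability of an even roll from 1/2 to 2/3, because two of the three remaining outcomes are even.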

Law of Total Probability

If $B_1, B_2, \ldots, B_n$ partition $\Omega$ (mutually exclusive, exhaustive), then for any event $A$:

$$P(A) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i)$$
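A small numeric illustration of the sum, with entirely hypothetical numbers: three machines produce 50%, 30%, and 20% of a factory's output, with defect rates of 1%, 2%, and 3%. The machines play the role of the partition $B_i$ and "defective" is the event $A$.

```python
from fractions import Fraction

# Hypothetical partition: P(B_i), the share of items made by each machine.
p_machine = {"M1": Fraction(1, 2), "M2": Fraction(3, 10), "M3": Fraction(1, 5)}

# Hypothetical conditional probabilities P(A | B_i): defect rate per machine.
p_defect_given = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(3, 100)}

# The B_i must be exhaustive, so their probabilities sum to 1.
assert sum(p_machine.values()) == 1

# Law of total probability: weight each conditional probability by P(B_i).
p_defect = sum(p_defect_given[m] * p_machine[m] for m in p_machine)
print(p_defect)  # 17/1000
```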

Independence

Definition — Independence

Events $A$ and $B$ are independent if knowing $B$ occurred gives no information about $A$:
$$P(A \cap B) = P(A)\, P(B) \qquad \text{equivalently,} \quad P(A \mid B) = P(A)$$

Independence and mutual exclusivity are very different concepts. If $P(A) > 0$ and $P(B) > 0$, then $A$ and $B$ cannot be both independent and mutually exclusive: mutual exclusivity forces $P(A \cap B) = 0$, while independence requires $P(A \cap B) = P(A)\,P(B) > 0$.
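The contrast can be seen by enumerating two rolls of a fair die. The events below are illustrative choices, not taken from the text: "first die is even" is independent of "sum is 7", but it is mutually exclusive with "sum is 2" and therefore not independent of it.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    """Check the product rule P(a and b) == P(a) * P(b) exactly."""
    return prob(a & b) == prob(a) * prob(b)

first_even = {w for w in omega if w[0] % 2 == 0}
sum_is_7 = {w for w in omega if sum(w) == 7}
sum_is_2 = {w for w in omega if sum(w) == 2}  # only (1, 1)

print(independent(first_even, sum_is_7))  # True
print(first_even & sum_is_2 == set())     # True: mutually exclusive
print(independent(first_even, sum_is_2))  # False
```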

Bayes' Theorem

Theorem — Bayes

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Using the law of total probability to expand the denominator:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$$

Bayes' theorem is the engine of Bayesian inference: it tells us how to update a prior belief $P(A)$ in light of new evidence $B$ to obtain a posterior $P(A \mid B)$.

Classic example: a medical test for a disease with 1% prevalence. Sensitivity (true positive rate) = 99%, specificity (true negative rate) = 95%. If you test positive, Bayes' theorem gives $P(\text{disease} \mid +) \approx 16.7\%$, far lower than intuition suggests, because the disease is rare.
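The 16.7% figure can be reproduced directly from the expanded form of Bayes' theorem. The function name `posterior` and its parameter names are illustrative; the three input values are the ones quoted in the example.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem with an expanded denominator."""
    true_pos = sensitivity * prior                # P(+ | disease) * P(disease)
    false_pos = (1 - specificity) * (1 - prior)   # P(+ | healthy) * P(healthy)
    return true_pos / (true_pos + false_pos)

p = posterior(
    prior=Fraction(1, 100),         # 1% prevalence
    sensitivity=Fraction(99, 100),  # true positive rate
    specificity=Fraction(95, 100),  # true negative rate
)
print(p, float(p))  # 1/6, about 16.7%
```

Out of every 10,000 people tested, about 99 are true positives and about 495 are false positives, so only 99 of the 594 positive results come from people who have the disease.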

Random Variables

Definition — Random Variable

A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that assigns a numerical value to each outcome. It is discrete if it takes countably many values; continuous if described by a probability density function (PDF).

Discrete: PMF

The probability mass function of a discrete random variable satisfies:

$$p(x) = P(X = x) \geq 0 \qquad \sum_x p(x) = 1$$

Continuous: PDF and CDF

A continuous random variable has a probability density function $f(x) \geq 0$ with:

$$P(a \leq X \leq b) = \int_a^b f(x)\,dx \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$

The cumulative distribution function $F(x) = P(X \leq x)$ is non-decreasing, right-continuous, with $F(-\infty)=0$ and $F(\infty)=1$.
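To illustrate how the PDF and CDF relate, the sketch below uses the Exponential density $f(x) = \lambda e^{-\lambda x}$, whose CDF is $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$, and compares a numerical integral of the PDF with the difference $F(b) - F(a)$. The rate value, the interval, and the midpoint-rule helper `integrate` are assumptions chosen for illustration.

```python
import math

RATE = 1.5  # hypothetical rate parameter (lambda)

def pdf(x):
    """Exponential density; zero for negative x."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def cdf(x):
    """Exponential CDF, F(x) = 1 - exp(-RATE * x) for x >= 0."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = 0.5, 2.0
by_integration = integrate(pdf, a, b)  # P(a <= X <= b) from the PDF
by_cdf = cdf(b) - cdf(a)               # the same probability from the CDF
print(by_integration, by_cdf)
```

The two values agree up to the discretization error of the midpoint rule, which shrinks as `n` grows.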

Expectation and Variance

Definition — Expected Value

The expected value (mean) of $X$ is its probability-weighted average:
$$E[X] = \sum_x x\, p(x) \quad \text{(discrete)} \qquad E[X] = \int_{-\infty}^{\infty} x\, f(x)\,dx \quad \text{(continuous)}$$

Definition — Variance

The variance measures spread around the mean $\mu = E[X]$:
$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$
The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.

Key properties

$$E[aX + b] = aE[X] + b \qquad \text{Var}(aX+b) = a^2\,\text{Var}(X)$$
$$E[X + Y] = E[X] + E[Y] \qquad \text{(always, by linearity of expectation)}$$
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \qquad \text{(if } X, Y \text{ independent)}$$
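These definitions and properties can be verified exactly for a fair die. The helpers `expectation` and `variance`, and the constants `a = 2`, `b = 3`, are illustrative assumptions; a PMF is represented here as a dictionary from values to probabilities.

```python
from fractions import Fraction
from itertools import product

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2]."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)

die = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expectation(die)                                   # 7/2
var = variance(die)                                       # 35/12
shortcut = expectation(die, lambda x: x * x) - mean ** 2  # E[X^2] - (E[X])^2

# PMF of aX + b: same probabilities, transformed values.
a, b = 2, 3
scaled = {a * x + b: p for x, p in die.items()}

# PMF of X + Y for two independent dice: multiply probabilities, add values.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(mean, var, variance(scaled), variance(two_dice))
```

The variance of the sum equals twice the variance of a single die only because the joint PMF was built under the independence assumption.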

Common Distributions

Bernoulli(p)

Single trial: success with probability p.

$$E[X] = p, \quad \text{Var}(X) = p(1-p)$$

Binomial(n, p)

Number of successes in n independent Bernoulli trials.

$$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}, \quad E[X] = np, \quad \text{Var}(X) = np(1-p)$$

Poisson(λ)

Number of rare events in a fixed interval; λ is the average rate.

$$P(X=k) = \dfrac{\lambda^k e^{-\lambda}}{k!}, \quad E[X] = \text{Var}(X) = \lambda$$

Uniform(a, b)

Equally likely over an interval.

$$f(x) = \dfrac{1}{b-a}\;(a \leq x \leq b), \quad E[X] = \dfrac{a+b}{2}, \quad \text{Var}(X) = \dfrac{(b-a)^2}{12}$$

Normal(μ, σ²)

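The bell curve; by the Central Limit Theorem, it approximates the distribution of suitably normalized sums of many independent, identically distributed random variables with finite variance.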

$$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right), \quad E[X] = \mu, \quad \text{Var}(X) = \sigma^2$$

Exponential(λ)

Waiting time between Poisson events; memoryless.

$$f(x) = \lambda e^{-\lambda x}\;(x\geq 0), \quad E[X] = 1/\lambda, \quad \text{Var}(X) = 1/\lambda^2$$
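The stated means and variances can be checked by summing over the PMFs. The sketch below does this for the Binomial (exactly, with fractions) and the Poisson (numerically, truncating the infinite sum at 60 terms, which is an approximation chosen here). The parameter values and the helper `mean_and_variance` are illustrative assumptions.

```python
import math
from fractions import Fraction

def mean_and_variance(pmf):
    """Mean and variance of a discrete PMF given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

# Binomial(n, p): exact arithmetic, compare with np and np(1 - p).
n, p = 10, Fraction(3, 10)
binomial = {k: math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
print(mean_and_variance(binomial), (n * p, n * p * (1 - p)))

# Poisson(lam): floating point, infinite sum truncated at k = 59.
lam = 4.0
poisson = {k: lam**k * math.exp(-lam) / math.factorial(k) for k in range(60)}
print(mean_and_variance(poisson))  # both values should be close to lam
```

`math.comb` requires Python 3.8 or later.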
