Probability

Probability is the mathematical language of uncertainty. Grounded in set theory and measure theory, it provides the foundation for statistics, machine learning, information theory, and the formal study of inference under uncertainty.

Sample Spaces and Events

Definition — Sample Space

The sample space $\Omega$ is the set of all possible outcomes of a random experiment. An event is any subset $A \subseteq \Omega$.

Fair coin flip

\Omega = \{H, T\}

Event:

A = \{H\}

Rolling a die

\Omega = \{1,2,3,4,5,6\}

Event:

A = \{2,4,6\}\text{ (even)}

Since events are sets, set operations apply: $A \cup B$ is the event that $A$ or $B$ occurs; $A \cap B$ is both; $A^c$ is the event that $A$ does not occur.
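Because events are ordinary sets, these operations map directly onto a programming language's set type. The sketch below uses the die example; the second event `at_least_five` is not from the text and is introduced only for illustration.

```python
# Sample space and events for one roll of a six-sided die.
omega = frozenset(range(1, 7))
even = frozenset({2, 4, 6})        # the event A from the die example
at_least_five = frozenset({5, 6})  # hypothetical event B, added for illustration

union = even | at_least_five         # A or B occurs
intersection = even & at_least_five  # both A and B occur
complement = omega - even            # A does not occur

print(sorted(union))         # [2, 4, 5, 6]
print(sorted(intersection))  # [6]
print(sorted(complement))    # [1, 3, 5]
```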

Kolmogorov's Axioms

Andrey Kolmogorov (1933) placed probability on a rigorous axiomatic foundation. A probability measure is a function $P : \mathcal{F} \to [0,1]$, where $\mathcal{F}$ is the collection of events (a $\sigma$-algebra of subsets of $\Omega$), satisfying:

Non-negativity

P(A) \geq 0 \text{ for all events } A

Normalization

P(\Omega) = 1

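Additivity

P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \text{ for pairwise disjoint events } A_1, A_2, \ldots

In particular, $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.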

From these three axioms, all of probability theory follows. Key consequences:

P(\emptyset) = 0 \qquad P(A^c) = 1 - P(A) \qquad P(A \cup B) = P(A) + P(B) - P(A \cap B)
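These consequences can be checked on a concrete finite model. The following is a minimal sketch, assuming a fair die with the uniform measure; the helper `prob` and the event `B` are hypothetical names chosen for this illustration. It verifies the identities for one pair of events and is not a proof.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def prob(event):
    """Uniform probability on a fair die: |event| / |omega| (an assumed model)."""
    return Fraction(len(event & omega), len(omega))

A = frozenset({2, 4, 6})  # even roll
B = frozenset({4, 5, 6})  # hypothetical second event: roll of at least 4

# The three consequences listed above, checked exactly with rational arithmetic.
assert prob(frozenset()) == 0
assert prob(omega - A) == 1 - prob(A)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(prob(A | B))  # 2/3
```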

Conditional Probability

Definition — Conditional Probability

The probability of event $A$ given that event $B$ has occurred (with $P(B) > 0$) is:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Conditioning restricts the sample space to $B$ and re-normalizes. The multiplication rule follows directly:

P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)

Law of Total Probability

If $B_1, B_2, \ldots, B_n$ partition $\Omega$ (mutually exclusive and exhaustive, each with $P(B_i) > 0$), then for any event $A$:

P(A) = \sum_{i=1}^n P(A \mid B_i)\, P(B_i)
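As a small worked check of the conditional probability definition and the law of total probability, the sketch below again assumes a fair die. The two-block partition and the helper names `prob` and `cond_prob` are hypothetical, picked only to keep the arithmetic easy to follow.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def prob(event):
    """Uniform probability on a fair die (an assumed model)."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(a & b) / prob(b)

A = frozenset({2, 4, 6})
partition = [frozenset({1, 2, 3}), frozenset({4, 5, 6})]  # hypothetical partition

# Sum of P(A | B_i) * P(B_i) over the partition.
total = sum(cond_prob(A, b) * prob(b) for b in partition)

print(cond_prob(A, partition[0]))  # 1/3
print(cond_prob(A, partition[1]))  # 2/3
print(total, prob(A))              # 1/2 1/2
```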

Independence

Definition — Independence

Events $A$ and $B$ are independent if knowing that $B$ occurred gives no information about $A$:

P(A \cap B) = P(A)\, P(B) \qquad \text{equivalently, when } P(B) > 0, \quad P(A \mid B) = P(A)

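Independence and mutual exclusivity are very different concepts. If $P(A) > 0$ and $P(B) > 0$, then $A$ and $B$ cannot be both independent and mutually exclusive: mutual exclusivity forces $P(A \cap B) = 0$, whereas independence requires $P(A \cap B) = P(A)\,P(B) > 0$.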
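The contrast between independence and mutual exclusivity can be seen on a fair die. In this sketch the events `B` and `C` and the helper `independent` are hypothetical choices made for illustration: `A` and `B` overlap and are independent, while `A` and `C` are disjoint and therefore not independent.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def prob(event):
    """Uniform probability on a fair die (an assumed model)."""
    return Fraction(len(event), len(omega))

def independent(a, b):
    """Product-rule test: P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = frozenset({2, 4, 6})     # even roll, P = 1/2
B = frozenset({1, 2, 3, 4})  # roll of at most 4, P = 2/3
C = frozenset({1, 3, 5})     # odd roll, disjoint from A

print(independent(A, B))  # True:  P(A and B) = 1/3 = (1/2)(2/3)
print(independent(A, C))  # False: P(A and C) = 0, but P(A)P(C) = 1/4
```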

Bayes' Theorem

Theorem — Bayes

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

Using the law of total probability to expand the denominator:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}

Bayes' theorem is the engine of Bayesian inference: it tells us how to update a prior belief $P(A)$ in light of new evidence $B$ to obtain a posterior $P(A \mid B)$.

Classic example: a medical test for a disease with 1% prevalence has sensitivity (true positive rate) 99% and specificity (true negative rate) 95%. If you test positive, Bayes' theorem gives $P(\text{disease} \mid +) \approx 16.7\%$, far lower than intuition suggests: because the disease is rare, false positives among the healthy 99% outnumber true positives.
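The arithmetic behind the medical test figure can be reproduced in a few lines. This is a sketch using the expanded form of Bayes' theorem with the numbers from the example; the function name `posterior` and its parameter names are hypothetical.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem with the expanded denominator."""
    true_pos = sensitivity * prior                 # P(+ | disease) * P(disease)
    false_pos = (1 - specificity) * (1 - prior)    # P(+ | healthy) * P(healthy)
    return true_pos / (true_pos + false_pos)

p = posterior(
    prior=Fraction(1, 100),
    sensitivity=Fraction(99, 100),
    specificity=Fraction(95, 100),
)

print(p)                   # 1/6
print(f"{float(p):.1%}")   # 16.7%
```

With these inputs the numerator is 0.0099 and the denominator is 0.0099 + 0.0495 = 0.0594, so the posterior is exactly 1/6.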

Random Variables

Definition — Random Variable

A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that assigns a numerical value to each outcome. It is discrete if it takes countably many values, and continuous if it is described by a probability density function (PDF).

Discrete: PMF

The probability mass function of a discrete random variable satisfies:

p(x) = P(X = x) \geq 0 \qquad \sum_x p(x) = 1

Continuous: PDF and CDF

A continuous random variable has a probability density function $f(x) \geq 0$ with:

P(a \leq X \leq b) = \int_a^b f(x)\,dx \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1

The cumulative distribution function $F(x) = P(X \leq x)$ is non-decreasing and right-continuous, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
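For a continuous random variable, the interval probability can be obtained either by integrating the density or as $F(b) - F(a)$. The sketch below compares the two for an exponential density; the rate value, the interval, and the simple midpoint-rule integrator are assumptions made for this illustration, not part of the text.

```python
import math

RATE = 2.0  # hypothetical rate parameter, chosen only for illustration

def pdf(x):
    """Exponential density: RATE * exp(-RATE * x) for x >= 0, else 0."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def cdf(x):
    """Closed-form CDF of the same distribution: 1 - exp(-RATE * x) for x >= 0."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

a, b = 0.5, 1.5
by_integral = integrate(pdf, a, b)  # P(a <= X <= b) from the density
by_cdf = cdf(b) - cdf(a)            # the same probability from the CDF

print(f"{by_integral:.4f} {by_cdf:.4f}")  # 0.3181 0.3181
```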

Expectation and Variance

Definition — Expected Value

The expected value (mean) of $X$ is its probability-weighted average:

E[X] = \sum_x x\, p(x) \quad \text{(discrete)} \qquad E[X] = \int_{-\infty}^{\infty} x\, f(x)\,dx \quad \text{(continuous)}

Definition — Variance

The variance measures spread around the mean $\mu = E[X]$:

\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2

The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$.

Key properties

E[aX + b] = aE[X] + b \qquad \text{Var}(aX+b) = a^2\,\text{Var}(X)

E[X + Y] = E[X] + E[Y] \qquad \text{(always — linearity of expectation)}

\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \qquad \text{(if } X, Y \text{ independent)}
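The properties above can be checked exactly on a small discrete example. This sketch assumes a fair die represented as a dictionary from value to probability; the helpers `expectation` and `variance` and the affine map $2X + 3$ are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

def expectation(pmf):
    """E[X] = sum of x * p(x) over the support."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2], computed directly from the PMF."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Affine transformation 2X + 3: same probabilities, transformed values.
affine = {2 * x + 3: p for x, p in die.items()}

# PMF of X + Y for two independent dice: joint probability is the product.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(expectation(die), variance(die))            # 7/2 35/12
print(expectation(affine), variance(affine))      # 10 35/3
print(expectation(two_dice), variance(two_dice))  # 7 35/6
```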

Common Distributions

Bernoulli(p)

Single trial: success with probability p.

E[X] = p, \quad \text{Var}(X) = p(1-p)

Binomial(n, p)

Number of successes in n independent Bernoulli trials.

P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}, \quad E[X] = np, \quad \text{Var}(X) = np(1-p)

Poisson(λ)

Number of rare events in a fixed interval; λ is the average rate.

P(X=k) = \dfrac{\lambda^k e^{-\lambda}}{k!}, \quad E[X] = \text{Var}(X) = \lambda

Uniform(a, b)

Equally likely over an interval.

f(x) = \dfrac{1}{b-a}\;(a \leq x \leq b), \quad E[X] = \dfrac{a+b}{2}, \quad \text{Var}(X) = \dfrac{(b-a)^2}{12}

Normal(μ, σ²)

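The bell curve; by the Central Limit Theorem, it approximates the distribution of sums and averages of many independent, identically distributed random variables with finite variance.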

f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right), \quad E[X] = \mu, \quad \text{Var}(X) = \sigma^2

Exponential(λ)

Waiting time between Poisson events; memoryless.

f(x) = \lambda e^{-\lambda x}\;(x\geq 0), \quad E[X] = 1/\lambda, \quad \text{Var}(X) = 1/\lambda^2
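Two of the entries above can be checked numerically. The sketch below builds the Binomial PMF from its formula and confirms the stated mean and variance, then checks the memoryless property of the exponential distribution using its survival function $P(X > x) = e^{-\lambda x}$. The parameter values and helper names are hypothetical, chosen only for illustration.

```python
import math
from fractions import Fraction

def binomial_pmf(n, p):
    """Return {k: P(X = k)} for X ~ Binomial(n, p), using the formula above."""
    return {k: math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

n, p = 10, Fraction(3, 10)  # hypothetical parameters
pmf = binomial_pmf(n, p)
total = sum(pmf.values())
mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

assert total == 1
assert mean == n * p            # E[X] = np
assert var == n * p * (1 - p)   # Var(X) = np(1 - p)

def survival(x, rate):
    """P(X > x) = exp(-rate * x) for X ~ Exponential(rate), x >= 0."""
    return math.exp(-rate * x)

rate, s, t = 0.5, 2.0, 3.0  # hypothetical values
# Memorylessness: P(X > s + t | X > s) equals P(X > t).
conditional = survival(s + t, rate) / survival(s, rate)
assert math.isclose(conditional, survival(t, rate))

print(mean, var)  # 3 21/10
```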
