Probability Distributions: The Stories Behind the Math
Every distribution has a story. Someone had a real problem — measuring star positions, counting horse kicks, testing beer quality — and invented a new mathematical tool to solve it. Here is how they happened.
Statistics · History of science · Clinical research
Why would you ever assume your data are normal? Good question. And it deserves a good answer.
Let us start with a man who was not thinking about medicine. Not thinking about people at all. Carl Friedrich Gauss, in 1809, was thinking about asteroids. Ceres — a small asteroid — had disappeared behind the Sun, and different telescopes were giving different coordinates. Who was wrong? Everyone, a little. Gauss asked: if every measurement carries a small error, what is the most probable true answer?
To answer that question, Gauss had to invent new mathematics. The curve he arrived at was symmetric, centred on the true value — errors go a little high, a little low, but the centre always holds. We call that curve the Gaussian — or normal — distribution. He did not name it after himself. Others did.
A few decades later, a Belgian named Adolphe Quetelet took Gauss's curve and did something bold: he applied it to human beings. He analysed the chest circumferences of 5,738 Scottish soldiers. The same pattern appeared. He had not planned for it; it simply showed up. And he concluded that nature tends toward the middle.
"The normal distribution is the most important distribution in statistics — not because everything is normal, but because it describes what happens when you stop looking at individuals and start looking at averages."
⚠ The trap
This does not mean your data are normal. It means that measurement errors are often approximately normal, and that sample means tend toward normality as the sample grows. That is an entirely different thing. CRP is not normal. Troponin is not normal. Length of hospital stay is not normal. We will understand why this still works when we reach the Central Limit Theorem.
68–95–99.7 rule: about 68% of values fall within ±1σ, 95% within ±2σ, and 99.7% within ±3σ.
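
The rule can be checked numerically. The following is a minimal sketch using Python's standard-library NormalDist; the helper name share_within is made up for this illustration.

```python
from statistics import NormalDist

def share_within(k_sigma, mu=0.0, sigma=1.0):
    """Probability that a N(mu, sigma^2) value lies within k_sigma standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k_sigma * sigma) - dist.cdf(mu - k_sigma * sigma)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.4f}")
```

The result does not depend on μ or σ, so the same three percentages apply to any normal variable, whatever its units.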
🔭 Astronomical measurement errors (Gauss, 1809)
🩺 Adult systolic BP ~ N(120, 15²)
📏 Adult height ~ N(175, 7²) cm
📊 Any sample mean with n ≥ 30
Think about it
IQ scores are designed as N(100, 15²). What percentage of people score between 70 and 130?
Gauss did not discover the normal distribution. A Frenchman named Abraham de Moivre did — seventy-six years earlier. And he found it by flipping coins.
Abraham de Moivre was studying what happens when you flip a coin not ten times, not a hundred times, but thousands of times. He noticed that the histogram of results (how many heads, how many tails) started taking a shape. A bell shape. The same curve Gauss would later derive from telescope errors.
De Moivre died poor and largely forgotten. Gauss became one of the most celebrated mathematicians in history. Science is not always fair.
But de Moivre left us something important. By studying coin flips, he gave us the binomial distribution — and showed that inside it, if you look long enough, the normal distribution is hiding.
The mathematics came from coins. But the philosophy came from Jacob Bernoulli, whose book Ars Conjectandi (The Art of Conjecture) was published in 1713, eight years after his death. Bernoulli was not interested in gambling. He was interested in justice. His question was moral: how many observations do you need before you can be confident in a conclusion? How many witnesses before you convict? How many experiments before you publish? His answer grew out of the binomial distribution: the law of large numbers, the insight that more trials make it ever more likely that the observed proportion lies close to the truth.
"Even the stupidest man knows by some instinct of nature that the more observations are made, the less danger there is of straying from one's goal."
— Jacob Bernoulli, Ars Conjectandi, 1713
🔗 The link to Normal: as n grows with p near 0.5, the binomial histogram becomes bell-shaped. This is exactly what de Moivre saw. The normal distribution was always there, inside the coins.
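
One way to see the bell emerge without a plot is to measure how far the binomial probabilities sit from the matching normal curve. This is only a sketch: the function names are illustrative, and the normal curve is assumed to have mean np and variance np(1 − p).

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def max_gap(n, p=0.5):
    """Largest absolute gap between the binomial pmf and the matching normal density."""
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    return max(abs(binom_pmf(k, n, p) - normal_pdf(k, mu, sigma)) for k in range(n + 1))

for n in (10, 100, 1000):
    print(f"n = {n:>4}: largest gap = {max_gap(n):.6f}")
```

The gap shrinks as n grows, which is de Moivre's observation in numerical form.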
⚠ The trap
Four conditions must all be true: (1) a fixed number of trials n, (2) only two possible outcomes, (3) the same probability p for every trial, (4) independent trials. In real clinical practice, this is never perfectly true, but it is often close enough to work.
🪙 Coin flips (de Moivre, 1733)
⚖️ Jury decisions (Bernoulli, 1713)
💊 n patients treated, p = response rate
🧬 Mendelian inheritance
Think about it
A drug has a 60% response rate. You treat 20 patients. Which distribution describes the number of responders?
In 1898, a statistician in Berlin published a dataset about horses. Specifically — about the number of Prussian cavalry soldiers killed by horse kicks, per year, across 14 corps, over 20 years.
It sounds morbid. It is. But Ladislaus Bortkiewicz was not morbid — he was pedantic. And in those numbers he saw something nobody else had: perfect order inside chaos. The death counts were random — but their distribution followed a mathematical formula to almost the decimal place.
That formula had been written sixty years earlier by Siméon Denis Poisson — and nobody had believed him. Poisson was a mathematician, not a statistician. His distribution was considered a theoretical curiosity. Bortkiewicz took it, applied it to horses and soldiers, and suddenly — everyone wanted to use it.
The Poisson distribution has one single parameter: lambda (λ) — the average number of events in your time period. If the average number of cardiac arrests per day is four — λ = 4. That is all you need to know. From that one number, the distribution tells you: what is the probability of zero arrests today? Of ten? Of exactly four?
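
Those three probabilities can be computed directly from the Poisson formula P(k) = e^(−λ) λ^k / k!. A small sketch, with poisson_pmf as an illustrative helper name:

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate of lam per period."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4  # average cardiac arrests per day, as in the example above
for k in (0, 4, 10):
    print(f"P({k} arrests today) = {poisson_pmf(k, lam):.4f}")
```

With λ = 4, a day with no arrests has a probability of just under 2%, a day with exactly four about 20%, and a day with ten about 0.5%.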
And here is something strange: the mean and the variance are always equal. Both are λ. If your data show a variance much larger than the mean — Poisson does not fit. Your data have more chaos than the distribution assumes. That is called overdispersion — and it is a signal that you need a different model.
In the commonly cited subset of 10 corps (200 corps-years), the average was 0.61 deaths per corps per year. The numbers of corps-years with 0, 1, 2, 3, 4 deaths matched the Poisson prediction with extraordinary accuracy. A formula written for abstract mathematics described real deaths with almost perfect precision.
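
The comparison can be reproduced in a few lines. The observed counts below are not given in this article; they are the tabulation usually quoted for the 200 corps-years that yield the 0.61 average, so treat them as an assumption of this sketch.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Commonly quoted tabulation (assumed here): deaths per corps-year -> number of corps-years.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

corps_years = sum(observed.values())
mean_rate = sum(k * count for k, count in observed.items()) / corps_years
expected = {k: corps_years * poisson_pmf(k, mean_rate) for k in observed}

print(f"mean deaths per corps-year: {mean_rate:.2f}")
for k in observed:
    print(f"{k} deaths: observed {observed[k]:>3}, Poisson expects {expected[k]:6.1f}")
```

The only input taken from the data is the mean. Everything in the "expects" column follows from that single number.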
🔗 The link to Binomial: Poisson is the limit of Binomial(n, p) as n→∞ and p→0, with np = λ fixed. When events are very rare and n is very large, Poisson is simply a more convenient Binomial.
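
The limit can be watched numerically by holding np = λ fixed while n grows. A rough sketch, with λ = 3 and k = 2 chosen arbitrarily for the demonstration:

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, k = 3.0, 2  # arbitrary demo values
target = poisson_pmf(k, lam)
gaps = {}
for n in (10, 100, 10_000):
    gaps[n] = abs(binom_pmf(k, n, lam / n) - target)
    print(f"n = {n:>6}: Binomial = {binom_pmf(k, n, lam / n):.5f}, gap to Poisson = {gaps[n]:.5f}")
print(f"Poisson limit: {target:.5f}")
```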
⚠ The trap
Mean = Variance = λ. Always check this in your data. If variance is much larger than the mean, you have overdispersion — use Negative Binomial instead. If variance is smaller, you have underdispersion — a different problem entirely.
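
A quick first check is the dispersion index, the sample variance divided by the sample mean. The sketch below uses two invented sets of daily counts purely for illustration; neither comes from real data.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)

# Hypothetical daily event counts, invented for illustration only.
poisson_like = [2, 5, 3, 1, 4, 6, 3, 2, 5, 3, 0, 2]
bursty = [0, 1, 0, 2, 9, 0, 0, 7, 1, 0, 12, 0]

print(f"Poisson-like counts: index = {dispersion_index(poisson_like):.2f}")
print(f"Bursty counts:       index = {dispersion_index(bursty):.2f}")
```

An index far above 1 points toward overdispersion, and an index well below 1 toward underdispersion. With small samples the index is noisy, so it is a screening step rather than a formal test.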
🐴 Horse kick deaths (Bortkiewicz, 1898)
🚑 ER arrivals per hour
🦠 Bacterial colonies per petri dish
💀 Rare disease cases per region per year
🧬 Mutations per genome per generation
Think about it
A hospital records an average of 4 cardiac arrests per day. The number on any given day most likely follows:
Poisson counts events. The exponential distribution asks a different question: how long do you wait between them?
In the early 1900s, Ernest Rutherford and Frederick Soddy were studying radioactive decay in Montreal. Particles were being emitted — but not at perfectly regular intervals. Sometimes two in a second, sometimes none for three seconds. Rutherford noticed that if the number of particles per second followed a Poisson distribution, then the waiting time between particles followed a precise mathematical pattern. That pattern is the exponential distribution.
The exponential distribution has one parameter — lambda (λ), the same rate as Poisson. The mean waiting time is 1/λ. If particles arrive at 3 per second on average, you wait an average of 1/3 of a second between them.
But the exponential has a strange property that took physicists a while to accept: it is memoryless. The machine does not remember how long it has been running. A radioactive atom that has existed for a thousand years is no more likely to decay in the next second than one created a moment ago. The past is irrelevant. Only the rate matters.
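
Memorylessness says that P(T > s + t | T > s) = P(T > t). A small simulation can illustrate it; the rate and the two time points below are arbitrary choices for this sketch.

```python
import math
import random

random.seed(1)
lam = 0.5  # arbitrary rate for the demonstration
waits = [random.expovariate(lam) for _ in range(200_000)]

def survival(times, t):
    """Share of waiting times longer than t."""
    return sum(x > t for x in times) / len(times)

s, t = 2.0, 3.0
remaining = [x - s for x in waits if x > s]  # time still to wait, given s has already passed

fresh = survival(waits, t)            # estimates P(T > t)
conditional = survival(remaining, t)  # estimates P(T > s + t | T > s)
print(f"P(T > {t})                 ≈ {fresh:.3f}")
print(f"P(T > {s + t} | T > {s})     ≈ {conditional:.3f}")
print(f"theory, exp(-lam * t)      = {math.exp(-lam * t):.3f}")
```

The two estimates agree up to simulation noise: having already waited two time units changes nothing about the wait still to come.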
🔗 Poisson ↔ Exponential, two sides of the same coin: if events arrive at rate λ per hour (Poisson counts), the time between arrivals follows Exp(λ). You cannot have one without the other. Rutherford saw both in the same experiment.
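
The link can be simulated in one direction: draw exponential gaps between events, then count how many events land in each hour. If the link holds, the hourly counts should have mean and variance both close to λ. The rate and duration below are assumptions for the demonstration.

```python
import random
from statistics import mean, variance

random.seed(2)
lam = 3.0       # events per hour (assumed rate for the demo)
hours = 20_000  # length of the simulated record

counts = [0] * hours
t = random.expovariate(lam)       # time of the first event
while t < hours:
    counts[int(t)] += 1           # the event falls in hour number int(t)
    t += random.expovariate(lam)  # exponential wait until the next event

print(f"mean events per hour:     {mean(counts):.3f}")
print(f"variance of hourly count: {variance(counts):.3f}")
```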
⚠ The trap in survival analysis
The exponential assumes a constant hazard rate — the risk of dying is the same regardless of how long the patient has already survived. This is almost never true in medicine. Use Weibull or Cox models instead, which allow the hazard to change over time.
☢️ Time between radioactive decays (Rutherford)
⏱️ Time between ER patient arrivals
🏥 Time until first adverse event
📞 Time between calls in a call centre
The Central Limit Theorem
You have now seen four distributions: Normal, Binomial, Poisson, Exponential. Each has a different shape. Some are symmetric. Some are skewed. Some are discrete. No two are the same.
Now here is the most remarkable fact in all of statistics.
Take almost any dataset, of any shape, from any distribution with a finite variance. Take a random sample of n observations and calculate their mean. Now repeat that a thousand times. Plot all those means.
The distribution of those means will be approximately normal.
Regardless of what the original data looked like, and ever more closely as n grows. This is the Central Limit Theorem, formally proved by Aleksandr Lyapunov in 1901, though Gauss, Laplace, and de Moivre had all glimpsed it before him.
This is why CRP can be log-normal and you can still use a t-test on it with a large enough sample. This is why income can be wildly skewed and economists still report confidence intervals. The test depends less on the shape of your raw data than on the behaviour of your sample mean, and with enough observations the mean is approximately normal.
There is a catch: n must be large enough. The rule of thumb is n ≥ 30. For very skewed data, you may need more. For roughly symmetric data, less.
Gauss found the normal distribution looking at stars. De Moivre found it looking at coins. Lyapunov proved that it was not a coincidence. It was a law. Every average, from every distribution with a finite variance, eventually approaches normal.
As n increases, no matter how strange the original shape, the means pull toward a bell curve. For moderately skewed data the fit at n = 30 is already close. At any finite n this is an approximation; in the limit it is a mathematical law.
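
The same experiment can be run in code. The sketch below assumes a strongly right-skewed Exp(1) population and tracks the skewness of the sample means, which should fall toward 0 (the value for a symmetric bell) as n grows. The helper names and the number of repetitions are choices made for this illustration.

```python
import random

random.seed(3)

def average(xs):
    return sum(xs) / len(xs)

def skewness(xs):
    """Standardised third moment: 0 for a symmetric shape, positive for a long right tail."""
    m = average(xs)
    sd = average([(x - m) ** 2 for x in xs]) ** 0.5
    return average([((x - m) / sd) ** 3 for x in xs])

def sample_means(n, reps=10_000):
    """Means of `reps` random samples of size n drawn from a skewed Exp(1) population."""
    return [average([random.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

skews = {n: skewness(sample_means(n)) for n in (1, 5, 30)}
for n, s in skews.items():
    print(f"n = {n:>2}: skewness of the sample means = {s:.2f}")
```

Even at n = 30 some skewness remains for a population this lopsided, which is why the n ≥ 30 rule of thumb is a starting point rather than a guarantee.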
In the mid-1940s, physicists at Los Alamos needed to simulate the behaviour of neutrons inside a nuclear bomb. The mathematics was too complex to solve analytically. So they invented a new method, and the uniform distribution was its engine.
John von Neumann and Stanislaw Ulam called it the Monte Carlo method — named after the casino, because it relied on random chance. The idea was simple: instead of solving the equations, simulate thousands of random scenarios and average the results. To generate randomness, you start with uniform numbers — every value equally likely — and transform them into whatever distribution you need.
Ulam reportedly got the idea while playing solitaire, wondering about the probability of a successful game. From solitaire to nuclear physics to clinical trial power calculations, the uniform distribution sits underneath nearly every simulation run on a computer.
The uniform distribution itself is simple: every value in the range [a, b] is equally likely. No value is more probable than any other. It is the flattest possible distribution — maximum uncertainty, no preference, no bias.
"I immediately thought of problems of neutron diffusion and other questions of mathematical physics, and it seemed to me that this could be done by means of random sampling."
— Stanislaw Ulam, on inventing Monte Carlo
🔗 The foundation of all other distributions: most random number generators produce uniform numbers first, then transform them mathematically into Normal, Poisson, Exponential, or whatever is needed. The uniform distribution is where simulated randomness begins.
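
One standard way to perform such a transformation is the inverse-CDF method. The sketch below turns Uniform(0, 1) numbers into Exp(λ) waiting times; it illustrates the idea and is not a description of how any particular library implements its generators.

```python
import math
import random

random.seed(4)

def exponential_from_uniform(u, lam):
    """Inverse-CDF transform: maps u in [0, 1) to an Exp(lam) waiting time."""
    return -math.log(1.0 - u) / lam

lam = 2.0  # arbitrary rate for the demonstration
draws = [exponential_from_uniform(random.random(), lam) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)

print(f"sample mean = {sample_mean:.3f}, theory 1/lam = {1 / lam:.3f}")
```

The formula comes from solving u = 1 − e^(−λx) for x, so each uniform number is read as a cumulative probability and mapped back to the waiting time that has it.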
⚛️ Monte Carlo simulations (Los Alamos, mid-1940s)
🎲 Rolling a fair die
🔢 Foundation of all random number generators
📐 Rounding errors
William Sealy Gosset was a chemist at the Guinness brewery in Dublin. His employer would not let him publish under his own name, to protect trade secrets. So he used a pseudonym. That pseudonym was "Student." That is why, more than a century later, we still call it Student's t-test.
Gosset's problem was practical: he had small samples — a few batches of barley, a few fermentation tanks — and he needed to make decisions about quality. The normal distribution gave wrong answers with small n, and he knew it. So in 1908, he derived the correct distribution for small-sample means.
The t-distribution looks like the normal — bell-shaped, symmetric — but with heavier tails. Those heavier tails reflect a simple truth: when your sample is small, you are less certain. Your estimate of the standard deviation is itself uncertain, estimated from the same small data you are trying to analyse. The t-distribution widens the tails to account for that extra layer of uncertainty.
The parameter that controls the heaviness of the tails is degrees of freedom (df). For a one-sample t-test, df = n − 1. For two independent groups with a pooled variance, df = n₁ + n₂ − 2. As df increases, as your sample grows, the t-distribution converges to the normal.
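
Gosset's problem can be reproduced by simulation. The sketch below draws many samples of n = 8 from a normal population and checks how often a nominal 95% interval around the sample mean contains the true mean, first with the normal critical value 1.960 and then with the t critical value for df = 7 (2.365, from standard tables). The population N(120, 15²) is only an assumption for the demonstration.

```python
import math
import random

random.seed(5)

def coverage(critical_value, n=8, reps=20_000, mu=120.0, sigma=15.0):
    """Share of intervals mean ± critical_value * s / sqrt(n) that contain the true mean mu."""
    hits = 0
    for _ in range(reps):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        m = sum(sample) / n
        s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))  # sample SD
        if abs(m - mu) <= critical_value * s / math.sqrt(n):
            hits += 1
    return hits / reps

z_cover = coverage(1.960)  # normal critical value
t_cover = coverage(2.365)  # t critical value for df = n - 1 = 7
print(f"coverage with normal critical value: {z_cover:.3f}")
print(f"coverage with t critical value:      {t_cover:.3f}")
```

With the normal value the interval misses too often; the wider t interval restores the promised 95%. That shortfall is the extra uncertainty from estimating the standard deviation on only eight observations.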
Gosset submitted his paper to Karl Pearson's journal Biometrika in 1908. Pearson published it, not because he fully believed it, but because he thought it was interesting. He had no idea it would become one of the most commonly used statistical tests in the world.
⚠ The trap
The t-test assumes normality of the data — or a large enough n for CLT to rescue you. With n < 30 and clearly non-normal data, use a non-parametric alternative (Mann-Whitney U). The t-test is robust to mild non-normality, but not to extreme skewness or outliers in small samples.
As df increases, the t-distribution and the normal curve merge. At df = 30, the difference is negligible.
🍺 Beer quality control (Gosset, 1908)
🔬 Small clinical trials
📊 Comparing two group means
📉 Regression coefficient testing
Think about it
You have 8 patients and want to test if their mean SBP differs from 120 mmHg. σ is unknown. Which distribution do you use?
Gregor Mendel spent years crossing pea plants in a monastery garden. He predicted that dominant to recessive traits should appear in a 3:1 ratio. His data matched almost perfectly. Suspiciously perfectly. In 1900, Karl Pearson invented the chi-squared test — and it was that test, applied decades later, that revealed Mendel's data were too good to be true.
Karl Pearson was not trying to expose Mendel. He was trying to solve a general problem: how do you know if an observed frequency distribution matches what your theory predicts? How far is "too far"? He needed a way to quantify the gap between observed and expected counts — and his answer, in 1900, was the chi-squared statistic.
The chi-squared distribution is always non-negative and right-skewed for small degrees of freedom — because it is built from squared values, which can never be negative. As degrees of freedom increase, it becomes more symmetric, eventually approaching normal.
It has two main uses in medicine. First: goodness-of-fit — do your observed frequencies match expected ones? Second: test of independence — are two categorical variables associated? Is smoking independent of lung cancer? Is blood type independent of disease risk? These are chi-squared questions.
The chi-squared test later revealed that Mendel's results were statistically improbable — the match between his data and his predictions was too perfect for random sampling. The most famous genetics dataset in history may have been unconsciously curated. The test built to validate theory ended up questioning the father of genetics.
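
A goodness-of-fit test against a 3:1 ratio takes only a few lines. The counts below are not given in this article; they are the figures usually quoted for one of Mendel's experiments on seed shape (5,474 round, 1,850 wrinkled), so treat them as an assumption of this sketch. With two categories df = 1, and the p-value can be written with the complementary error function.

```python
import math

def chi_squared_gof(observed, expected_ratio):
    """Chi-squared statistic comparing observed counts with counts expected under a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

def p_value_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

observed = [5474, 1850]  # round, wrinkled (commonly quoted counts, assumed here)
stat, expected = chi_squared_gof(observed, [3, 1])
p = p_value_df1(stat)

print(f"expected counts: {expected}")
print(f"chi-squared = {stat:.3f}, p = {p:.2f}")
```

A single experiment like this one is unremarkable on its own. The "too good to be true" argument rests on many experiments agreeing with theory this closely at once, not on any one table.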
⚠ The trap
As a rule of thumb, the expected frequency in every cell should be ≥ 5. If any cell has fewer expected observations, chi-squared gives unreliable results. Use Fisher's exact test instead. This is a very common mistake in published clinical papers.
For a contingency table with r rows and c columns: df = (r−1)(c−1). A 2×2 table has df = 1.
🌱 Mendel's pea genetics, too perfect (Pearson, 1900)
🩸 Blood type distribution testing
🏥 Is treatment response independent of sex?
📊 Any categorical contingency table
Nature does not always add. Sometimes it multiplies. And when it does, the log-normal distribution appears.
Francis Galton (Darwin's cousin, founder of eugenics, and one of the most brilliant and controversial scientists of the 19th century) noticed something that Quetelet had missed. Quetelet believed everything followed the normal distribution. Galton was sceptical. He pointed out that many real-world measurements, such as income, lifespan, species sizes, and the strength of inherited traits, were not symmetric. They were right-skewed, with a long tail of large values.
His explanation in 1879 was elegant: some phenomena are governed by additive processes — many small independent influences adding up, producing a normal distribution. But others are governed by multiplicative processes — influences that compound on each other. When you multiply many small random factors together, the result is not normally distributed. But its logarithm is.
This matters enormously in medicine. CRP does not increase by a fixed amount when inflammation rises: it doubles, triples, multiplies. Viral load is not additive; it grows exponentially. Drug concentrations in the body follow multiplicative pharmacokinetics. All of these are often well approximated by a log-normal distribution. And the key implication: log-transform them before analysis, and report the geometric mean, not the arithmetic mean, which is pulled far into the right tail by extreme values.
"The law of deviation from an average may be of two kinds: one where errors are cumulative, another where they are proportional."
— Francis Galton, 1879
🔗 The link to Normal: if X is log-normal, then log(X) is normal. Log-transform your data, run normal tests, back-transform your results. The geometric mean (e^μ) is the natural summary statistic. It is not distorted by the long right tail.
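
The difference between the two means is easy to see on simulated data. The sketch below draws log-normal values with assumed parameters μ = 1 and σ = 1 for the underlying normal; the numbers are simulated and stand in for no real biomarker.

```python
import math
import random

random.seed(6)
mu, sigma = 1.0, 1.0  # parameters of the underlying normal (assumed for the demo)
values = [random.lognormvariate(mu, sigma) for _ in range(50_000)]

arithmetic_mean = sum(values) / len(values)
log_mean = sum(math.log(v) for v in values) / len(values)  # mean on the log scale
geometric_mean = math.exp(log_mean)                        # back-transformed

print(f"arithmetic mean = {arithmetic_mean:.2f}  (theory e^(mu + sigma^2/2) = {math.exp(mu + sigma**2 / 2):.2f})")
print(f"geometric mean  = {geometric_mean:.2f}  (theory e^mu = {math.exp(mu):.2f})")
```

The geometric mean lands near e^μ, which is also the median of a log-normal variable, while the arithmetic mean is dragged upward by the right tail.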
⚠ The trap
Running a t-test on raw CRP, troponin, or viral load violates the normality assumption and inflates your means. Log-transform first. Report geometric means and geometric confidence intervals. When you see a paper reporting arithmetic mean CRP of 45 mg/L with a standard deviation of 80 — someone forgot to log-transform.
🩸 CRP, troponin, cytokine levels
💊 Drug concentrations (pharmacokinetics)
🦠 Viral load, bacterial counts
💰 Income and wealth distributions (Galton, 1879)
⏳ Survival times and latency periods
Think about it
CRP levels in your sample are heavily right-skewed. The correct approach is: