Statistical Inference II: Point Estimation

Mathematical Statistics
Big picture intuition for statistical inference under the random sampling framework.
Published October 3, 2025

Point Estimation

Point estimation is the first step of statistical inference, and involves constructing a “good guess” for a feature of the unknown data generating process. To be more precise, this feature is called an estimand \(\theta\) and is defined as a function of the data generating process \(F\):

\[ \theta = \theta(F). \]

An example of an estimand is the population mean of a random variable \(X\): \[ \mu = \mathbb{E}[X] = \int x \, dF(x). \]

Since the DGP is unknown, the best we can do is use the observed data to guess the value of the estimand. An estimator \(\hat \theta\) is a function of the sample that is intended to provide a guess of the estimand:

\[ \hat{\theta} = \hat{\theta}(\boldsymbol X_1, \ldots, \boldsymbol X_n). \]

When the estimator is evaluated at a specific realization of the sample, we obtain an estimate \(\hat\theta(\boldsymbol x_1, \ldots, \boldsymbol x_n)\) of the estimand. It’s worth emphasizing that the estimand is a fixed but unknown number, the estimator is a random variable, and the estimate is a fixed and known number.

In statistics, there are several estimation principles that provide systematic ways (i.e. rules) to construct estimators. One common method is the analog principle (or plug-in principle). The idea is to construct the estimator by replacing the population quantities in the estimand with their sample analogs.1 Thus, the analog estimator for the population mean is the sample mean, defined as

\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i. \]
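The analog principle is easy to demonstrate in code. Below is a minimal sketch (not from the original text; the choice of a normal DGP with mean 5 and the sample size are illustrative assumptions) that draws one realized sample and computes the plug-in estimate of the population mean:

```python
import numpy as np

# Simulate one realized sample from a DGP. In practice F is unknown;
# here we pick N(5, 4) so the true mean mu = 5 is known for comparison.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)

# Analog (plug-in) estimator of the population mean: the sample mean
mu_hat = x.mean()
```

Running this once produces a single estimate; rerunning with a different seed produces a different estimate, which is exactly the estimator-versus-estimate distinction made above.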

Estimator Properties

How do we know if an estimator is any good? To answer this question, statisticians study desirable properties that an estimator should ideally satisfy. A full treatment of estimator properties is typically the focus of a mathematical statistics course, but it is still valuable to briefly highlight some fundamental properties here.

The error of an estimator is defined as the difference between the estimate and the estimand:

\[ e(\boldsymbol x_1, \ldots, \boldsymbol x_n) = \hat\theta(\boldsymbol x_1, \ldots, \boldsymbol x_n) - \theta. \] The bias of an estimator is the average error of the estimator across all possible samples of size \(n\) from the DGP:

\[ B(\hat\theta) = \mathbb{E}_F[\hat{\theta}] - \theta. \] Intuitively, the bias captures the systematic error of the estimator: if the bias is positive, the estimator tends to overestimate the estimand, and if the bias is negative, the estimator tends to underestimate the estimand. We say an estimator \(\hat{\theta}\) is unbiased for \(\theta\) if its bias is zero:2

\[ \mathbb{E}_F[\hat{\theta}] - \theta = 0. \] Thus, the errors of an unbiased estimator are purely due to randomness in the data. As it turns out, the sample mean is an unbiased estimator of the population mean under the random sampling assumption:

\[ \mathbb{E}_F[\hat{\mu}] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_F[\boldsymbol X_i] = \frac{1}{n} \sum_{i=1}^n \mu = \mu. \] The first equality follows from the linearity of expectations, the second equality follows from the random sampling assumption (each \(\boldsymbol X_i\) has the same mean \(\mu\)), and the third equality holds because the average of \(n\) copies of \(\mu\) is \(\mu\).
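Unbiasedness is a statement about repeated sampling, so it can be checked by simulation. The sketch below (not from the original text; the exponential DGP, sample size, and replication count are illustrative assumptions) draws many samples of size \(n\) and verifies that the average of the sample-mean estimates is close to \(\mu\), even for a non-normal DGP:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 20_000
mu = 3.0  # true population mean (known here only because we simulate the DGP)

# Draw reps samples of size n from an Exponential(mean = 3) DGP
# and compute the sample mean of each one
estimates = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

# Monte Carlo estimate of the bias: average estimate minus the estimand
bias_hat = estimates.mean() - mu
```

The estimated bias should be near zero up to simulation noise, consistent with the derivation above.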

While bias quantifies how far the estimator’s average is from the estimand, the variance (or sampling variance) measures how much the estimator varies across repeated samples of size \(n\):

\[ \operatorname{Var}[\hat\theta] = \mathbb{E}_F[(\hat\theta - \mathbb{E}_F[\hat\theta])^2]. \] The variance of the sample mean under the random sampling assumption is given by

\[ \operatorname{Var} \left[\hat\mu\right] = \frac{1}{n^2}\operatorname{Var} \left[ \sum_{i=1}^n \boldsymbol X_i\right] = \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}[\boldsymbol X_i] = \frac{1}{n^2} \sum_{i=1}^n \sigma^2 = \frac{\sigma^2}{n}, \] where \(\sigma^2\) is the population variance \(\operatorname{Var}[X]\). The first equality uses the fact that scaling a random variable by \(1/n\) scales its variance by \(1/n^2\). The second equality follows because the \(\boldsymbol X_i\) are independent, hence uncorrelated, so the variance of their sum equals the sum of their variances. The third equality uses the fact that the \(\boldsymbol X_i\) are identically distributed and so share the same variance \(\sigma^2\), and the fourth equality is an algebraic simplification.
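The \(\sigma^2/n\) formula can likewise be verified by simulation. In this sketch (not from the original text; the normal DGP and the specific values of \(n\), \(\sigma^2\), and the replication count are illustrative assumptions), the variance of the sample mean across many repeated samples is compared to the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 50_000
sigma2 = 4.0  # population variance of the simulated DGP

# reps repeated samples of size n, one sample-mean estimate per row
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
estimates = samples.mean(axis=1)

empirical_var = estimates.var()   # variance across repeated samples
theoretical_var = sigma2 / n      # sigma^2 / n = 0.16
```

The empirical variance should match \(\sigma^2/n\) up to Monte Carlo error, and doubling \(n\) should roughly halve it.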

The Necessity for Statistical Models

The sampling distribution of an estimator is the probability distribution that describes how the estimator’s estimates vary across all possible samples of size \(n\) drawn from the DGP. Intuitively, it characterizes the behavior of the estimator under repeated sampling. Under the random sampling assumption, the sampling distribution is completely determined by the DGP \(F\), the sample size \(n\), and the functional form of the estimator \(\hat\theta\).

Let’s revisit the example of the sample mean estimator \(\hat\mu\) for the population mean \(\mu\). We have already established some features of the sampling distribution of \(\hat\mu\) despite knowing nothing about \(F\). In particular, the mean of \(\hat\mu\) is \(\mu\) and its variance is \(\sigma^2/n\). However, to say more about the distribution of \(\hat\mu\) — like its shape — we need to make assumptions about the DGP.

A statistical model is a set of assumptions about the general structure of the data generating process \(F\). Put differently, we can think of a statistical model as a family of possible distributions that \(F\) could belong to. To illustrate the added value of statistical models, suppose our sample \(\boldsymbol X_1, \ldots, \boldsymbol X_n\) is drawn iid from \(\mathcal{N}(\mu, \sigma^2)\). Since

\[ \hat\mu = \frac{1}{n} \sum_{i=1}^n \boldsymbol X_i, \]

is a linear combination of normally distributed random variables, it is also normally distributed. Moreover, we have previously established that \(\mathbb{E}_F[\hat\mu] = \mu\) and \(\operatorname{Var}[\hat\mu] = \sigma^2/n\) for any DGP \(F\). Thus, the assumption of a normal DGP allows us to completely characterize the sampling distribution of \(\hat\mu\) as \(\mathcal{N}(\mu, \sigma^2/n)\). This is powerful because we can use this sampling distribution to quantify the uncertainty in our estimates by constructing confidence intervals and conducting hypothesis tests.
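This characterization can be checked by simulation. The sketch below (not from the original text; the parameter values and replication count are illustrative assumptions) verifies a defining feature of the \(\mathcal{N}(\mu, \sigma^2/n)\) sampling distribution: about 95% of sample means fall within 1.96 standard errors of \(\mu\):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 30, 100_000
mu, sigma = 10.0, 3.0

# Sample means across many repeated samples from N(mu, sigma^2)
estimates = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Under N(mu, sigma^2/n), ~95% of estimates lie within 1.96 SEs of mu
se = sigma / np.sqrt(n)
coverage = np.mean(np.abs(estimates - mu) <= 1.96 * se)
```

The empirical coverage should be very close to 0.95, consistent with a normal sampling distribution centered at \(\mu\) with variance \(\sigma^2/n\).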

Constructing Confidence Intervals for \(\hat\mu\)
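As a sketch of how such an interval might be computed (this example is not from the original text; it assumes \(\sigma\) is known, which is rarely true in practice, and the sample size and DGP parameters are illustrative), the \(\mathcal{N}(\mu, \sigma^2/n)\) sampling distribution yields a 95% confidence interval of \(\hat\mu \pm 1.96\,\sigma/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu, sigma = 40, 0.0, 1.0  # sigma treated as known for this sketch

x = rng.normal(mu, sigma, size=n)
mu_hat = x.mean()
se = sigma / np.sqrt(n)  # standard error of the sample mean

# 95% confidence interval based on the normal sampling distribution
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)
```

When \(\sigma\) is unknown, the usual approach replaces it with the sample standard deviation and uses the Student-\(t\) distribution instead of the normal.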

Conclusion

```{mermaid}
flowchart LR
  classDef box fill:#f8f9fa,stroke:#444,stroke-width:1px,rx:10,ry:10;

  DGP["Data Generating Process"]:::box
  Data["Observed Data"]:::box
  Model["Statistical Model"]:::box

  %% Main flows
  DGP -- "Random Sampling" --> Data
  DGP -. "Assumptions" .-> Model
  Model -- "Probability" --> Data
  Data -- "Inference" --> Model
  Model -. "Approximate Reality" .-> DGP
```

It is important to note that statistical inference is valid only if the assumptions of the statistical model hold.

Footnotes

  1. To quote my Mathematical Statistics Professor Daniel Weiner: “Do to the sample to get your estimator, as you would do to your population to get your estimand.”↩︎

  2. To be more precise, \(\hat\theta\) is unbiased for \(\theta\) if \(\mathbb{E}[\hat\theta]=\theta\) for all \(F \in \mathcal{F}\), where \(\mathcal{F}\) is a class of distributions.↩︎