```mermaid
flowchart LR
    classDef box fill:#f8f9fa,stroke:#444,stroke-width:1px,rx:10,ry:10;
    DGP["Data Generating Process"]:::box
    Data["Observed Data"]:::box
    Model["Statistical Model"]:::box
    %% Main flows
    DGP -- "Randomness" --> Data
    DGP -. "Assumptions" .-> Model
    Model -- "Probability" --> Data
    Data -- "Inference" --> Model
    Model -. "Approximate Reality" .-> DGP
```
How to Think About Statistical Inference
Learning About the Unknown
We can think of our observed data as the result of some true, unknown Data Generating Process (DGP) that is inherently random.1 That is to say, if we repeated the data collection, we would almost certainly get different values. The sources of this randomness are varied, and include (i) natural variation in the phenomenon being measured, (ii) measurement error, and (iii) sampling error.
The presence of such randomness makes it challenging to learn about the unobserved characteristics of the DGP using the observed data. Statistical inference is the process of (i) using random observed data to make informed predictions about unobserved features of the DGP and (ii) using probability to quantify the uncertainty in those predictions.2
Models Make Inference Possible
A Set of Assumptions
Since the true DGP is unknown, we need to make assumptions about its general structure in order to make inference feasible. A statistical model is a mathematical characterization of how the data is generated using probability theory. Specifically, we view the observed data points as realizations of a collection of random variables with some joint probability distribution. The set of assumptions that define this probability distribution — such as its functional form and the structural relationships among the random variables — is exactly what constitutes the statistical model.
Forms of Inference
So far, I have referred to statistical inference as involving informed predictions about unobserved features of the DGP. This is a broad umbrella term that encompasses various forms of inference, and is therefore worth clarifying.
Perhaps the most common form of inference is point estimation. This is the process of using the observed data to produce a single “best guess” value (the point estimate) for an unknown numerical quantity that characterizes the probability distribution in the statistical model (the parameter). Another type of inference is hypothesis testing, where we use the data to assess specific claims about the model parameters. Finally, forecasting is a form of inference where the model is used to predict future realizations of the random variables.3
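These three forms of inference can be sketched in a few lines of code. This is a minimal illustration, not part of the original post: the data are simulated from a hypothetical DGP, and the parameter values, sample size, and the normal model itself are assumptions made purely for demonstration.

```python
import random
import statistics

random.seed(42)

# Simulated observed data: 100 draws from a hypothetical DGP
# (a normal distribution with mean 5 and standard deviation 2).
data = [random.gauss(5.0, 2.0) for _ in range(100)]
n = len(data)

# Point estimation: the sample mean as a "best guess" for the
# model's unknown mean parameter.
mu_hat = statistics.fmean(data)

# Hypothesis testing: a z-style statistic for the claim "mu = 5",
# with the standard error estimated from the sample.
se = statistics.stdev(data) / n ** 0.5
z = (mu_hat - 5.0) / se

# Forecasting: under the fitted model, predict the next realization
# by its estimated expected value.
forecast = mu_hat

print(f"estimate: {mu_hat:.2f}, test statistic: {z:.2f}, forecast: {forecast:.2f}")
```

Note that all three computations operate on the model's parameters and fitted distribution, a point the next section makes explicit.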
All Inference is Within Model
Initially, I had defined inference as the process of using observed data to make informed, probabilistic predictions about unknown features of the data generating process. However, notice that in our discussion of the different forms of inference, the DGP was never mentioned. In every case — point estimation, hypothesis testing, and forecasting — we use the observed data to learn more about the statistical model.
Any conclusions we make about the DGP are therefore indirect and entirely dependent on the model’s specification. If the model’s assumptions accurately approximate the true DGP, then inference can meaningfully describe reality. However, if the model is misspecified, our inference may be well-calculated but ultimately irrelevant: a precise answer to the wrong question. As Konishi and Kitagawa put it in Information Criteria and Statistical Modeling, “the majority of the problems in statistical inference can be considered to be problems related to statistical modeling.”
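A "precise answer to the wrong question" can be made concrete with a small simulation (again a hypothetical sketch, with all values chosen for illustration): fitting a normal model to data that actually come from a uniform DGP. The parameter estimates are perfectly computable, yet the fitted model confidently assigns probability to values the true DGP can never produce.

```python
import math
import random

random.seed(1)

# True DGP: Uniform(0, 1) -- but the analyst assumes X ~ N(mu, sigma^2).
data = [random.random() for _ in range(10_000)]
n = len(data)

# Well-defined, precisely computed estimates under the (wrong) normal model.
mu_hat = sum(data) / n
sigma_hat = (sum((x - mu_hat) ** 2 for x in data) / n) ** 0.5

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# The fitted model's probability that X < 0 -- under the true DGP,
# this event is impossible, so the true probability is exactly 0.
p_below_zero = normal_cdf(0.0, mu_hat, sigma_hat)
print(f"Model probability of X < 0: {p_below_zero:.3f}")
```

The estimates themselves are not "wrong" in any computational sense; it is the model's assumed shape that misrepresents the DGP.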
A Very Simple Model
The relationship between the data generating process, observed data, and statistical model is summarized in the concept map at the top of this post.
To make this figure more concrete, it’s useful to walk through a simple example. Suppose the true data generating process is a standard normal distribution with mean \(0\) and variance \(1\). However, we don’t actually know this DGP and thus need to make assumptions about its general structure. After analyzing the random observed data, one possible statistical model we could assert is
\[ X \sim \mathcal{N}(\mu, \sigma^2). \] In words, we are treating our observed data as a random variable \(X\) with a normal probability distribution with unknown mean \(\mu\) and variance \(\sigma^2\).4
With this model in hand, we can now perform statistical inference to learn about the unknown model parameters. For example, we can use the observed data to find point estimates for \(\mu\) and \(\sigma^2\), and quantify the uncertainty associated with those estimates.5 Since in this toy example our normal distribution assumption is correct, the inference would yield meaningful conclusions about the true DGP. However, if the DGP were something entirely different — say, a uniform distribution — then our estimated model parameters would provide limited insight, if any, into the underlying structure of the data generating process.
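The happy path, where the assumed model matches the true DGP, is easy to simulate. In the sketch below (sample size and seed are arbitrary choices for illustration), we draw data from the true standard-normal DGP and compute the maximum likelihood estimates, which for a normal model are the sample mean and the sample variance with a \(1/n\) divisor.

```python
import random

random.seed(0)

# True DGP: a standard normal distribution (unknown to the analyst).
data = [random.gauss(0.0, 1.0) for _ in range(1_000)]
n = len(data)

# Maximum likelihood estimates under the assumed model X ~ N(mu, sigma^2).
mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # MLE divides by n, not n - 1

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```

Because the model is correctly specified, the estimates land close to the true values \(\mu = 0\) and \(\sigma^2 = 1\), and they would keep improving as the sample grows.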
Footnotes
To be clear, this is a simplification of reality because (i) it assumes that the DGP exists and (ii) it assumes that the DGP is stable over time. In reality, some events can fundamentally alter the DGP (if it even exists). For example, consider the impacts of the widespread adoption of artificial intelligence. Nevertheless, we often do observe an approximately stable distribution of data over time. Thus, the fixed-DGP assumption is often reasonable in practice and, more importantly, provides a useful framework for applying probabilistic models to learn from the data we have.↩︎
Note that if we instead defined statistical inference as “the process of using sample data to learn about the population,” we would be describing only one specific case. It is entirely possible to have population-level data and still require statistical inference due to natural variation or measurement error.↩︎
When we think of the word “prediction” we usually think of forecasts. But notice how point estimation and hypothesis testing can also be viewed as a form of prediction, as we are extrapolating from observed data to the statistical model’s parameters. This guided the choice of using “informed predictions” as the umbrella term describing inference.↩︎
Since the number of parameters defining the distribution is finite, this is an example of a parametric statistical model.↩︎
One way to find such estimates is through maximum likelihood estimation.↩︎