Linear Regression as an Approximation Method

Linear Regression
Intuition for the perspective that linear regression is the process of approximating the conditional expectation function (CEF).
Published

October 9, 2025

Motivating the Conditional Expectation Function

Optimal Predictor

The heterogeneity in economic outcomes naturally motivates a prediction problem: given a random vector \(X\) of regressors, what function \(g(X)\) provides the best prediction of the outcome \(Y\)? One way to define “best” is to choose the function that minimizes the mean squared error (MSE):

\[ g^\star(X) := \underset{g:\, \mathbb{R}^k\rightarrow \mathbb{R}}{\arg\min} \, \mathbb{E} [Y - g(X)]^2. \tag{2}\]

As it turns out, the CEF is the solution to this problem! In other words, the CEF is the best predictor of \(Y\) given \(X\) in the sense that it minimizes the mean squared error between the predictions and the actual outcomes.

To see this, we need to first define the error term as the difference between the outcome and the CEF

\[ e := Y - m(X). \]

This allows us to decompose the outcome variable as

\[ Y = m(X) + e, \tag{3}\]

where \(m(X)\) is the systematic component of \(Y\) — the part that can be explained by \(X\) on average — and \(e\) is the remaining idiosyncratic variation in \(Y\).

Since the error is defined as a function of the outcome and regressors, it is also a random variable. Moreover, we can show that its distribution has two key properties. First, by construction, the error has zero conditional mean:

\[ \mathbb{E}[e \mid X] = \mathbb{E}[Y - m(X) \mid X] = \mathbb{E}[Y \mid X] - m(X) = 0. \tag{4}\]

Second, the error is uncorrelated with any function of the regressors:

\[ \mathbb{E}\big[h(X) e \big] \overset{(a)}{=} \mathbb{E}\big[\mathbb{E}[h(X) e | X]\big] \overset{(b)}{=} \mathbb{E}\big[h(X) \mathbb{E}[e | X]\big] \overset{(c)}{=} 0, \tag{5}\] where \((a)\) uses the law of iterated expectations, \((b)\) pulls \(h(X)\) out of the conditional expectation because it is known given \(X\), and \((c)\) applies Equation 4.

Now notice that for any function \(g(X)\), we have

\[ \begin{aligned} \mathbb{E}\big[\big(Y - g(X)\big)^2\big] &= \mathbb{E}\big[\big(e + m(X) - g(X)\big)^2\big] \\ &= \mathbb{E}\big[e^2\big] + 2\mathbb{E}\big[e\big(m(X) - g(X)\big)\big] + \mathbb{E}\big[\big(m(X) - g(X)\big)^2\big] \\ &= \mathbb{E}\big[e^2\big] + \mathbb{E}\big[\big(m(X) - g(X)\big)^2\big] \\ &\geq \mathbb{E}\big[e^2\big] \\ &= \mathbb{E}\big[\big(Y - m(X)\big)^2\big], \end{aligned} \] where the third equality follows from Equation 5. In words, the mean squared error of any predictor \(g(X)\) is always at least as large as the mean squared error of the CEF \(m(X)\). Thus, we have shown that the CEF is indeed the solution to Equation 2.
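As a quick numerical sanity check (my own illustration, not part of the original argument), we can simulate data with a known CEF and verify both that the error is uncorrelated with functions of the regressors and that the CEF attains the smallest MSE. The choice \(m(x) = x^2\) is arbitrary:

```r
# Simulate data with a known (hypothetical) CEF m(x) = x^2
set.seed(42)
x <- runif(1e5, -2, 2)
y <- x^2 + rnorm(1e5)       # E[Y | X = x] = x^2
e <- y - x^2                # the CEF error, e := Y - m(X)

# Equation 5: e is uncorrelated with any function of X, e.g. h(x) = exp(x)
mean(exp(x) * e)            # approximately zero

# The CEF beats an arbitrary competitor, e.g. g(x) = x, in MSE
mean((y - x^2)^2) < mean((y - x)^2)
```

The first quantity converges to zero as the sample grows, and the inequality holds because any gap between \(g(X)\) and \(m(X)\) only adds to the irreducible error variance.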

Best Linear Predictor and Regression


In most empirical cases, however, the functional form of the CEF is unknown. Thus, it is more practical to model the relationship between the outcome and covariates using a simpler function. One option is to find a linear approximation of the CEF, which takes the form

\[ \ell(x_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_k x_{ki} = x_i^T\beta, \tag{6}\] where \(x_i^T = (1, x_{1i}, x_{2i}, \ldots, x_{ki})\). Note that the word “linear” here refers to the simplifying assumption that the function is a linear combination of the covariates.
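One point worth emphasizing (my own aside): "linear" means linear in the parameters \(\beta\), so the regressors themselves may be nonlinear transformations of underlying variables. A minimal sketch with made-up data:

```r
# Hypothetical data: the outcome is quadratic in x
set.seed(1)
x <- runif(500)
y <- 2 + 3 * x + 5 * x^2 + rnorm(500)

# Still a "linear" regression, because y is modeled as a linear
# combination of the regressors (1, x, x^2)
fit <- lm(y ~ x + I(x^2))
coef(fit)   # intercept, x, and x^2 coefficients
```

Including the squared term does not make the model nonlinear in the sense that matters for Equation 6; it simply enlarges the set of covariates.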

How do we choose the specific form of our approximating linear function? Since we want our predictions to minimize the MSE function, we set the parameters in Equation 6 as

\[ \beta^\star = \underset{\beta}{\arg\min} \, \mathbb{E} [Y_i - (X_i^T \beta)]^2. \]

In words, we choose the linear function with the lowest MSE among all linear functions. The resulting predictor

\[x_i^T \beta^\star \approx \mathbb{E}[Y_i \mid X_i = x_i]\]

is what economists call the "linear regression of \(Y\) on \(X\)": the best linear approximation to the CEF, obtained by minimizing the mean squared error.
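Although not derived here in the original, the minimization above has a well-known closed form. Setting the derivative of the MSE with respect to \(\beta\) to zero gives the population normal equations,

\[ \mathbb{E}\big[X_i\big(Y_i - X_i^T \beta^\star\big)\big] = 0 \quad \Longrightarrow \quad \beta^\star = \mathbb{E}\big[X_i X_i^T\big]^{-1}\,\mathbb{E}\big[X_i Y_i\big], \]

provided \(\mathbb{E}[X_i X_i^T]\) is invertible. Replacing the population moments with their sample analogs yields the familiar ordinary least squares estimator.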

Linear Regression Can (Sometimes) Recover True Conditional Averages

So far, we have discussed linear regression as a tool to approximate the conditional expectation function. However, it can be shown that when the CEF is itself linear, the best linear predictor exactly equals the CEF. That is to say, linear regression recovers the true conditional averages in this case.1

To see this in practice, I simulate data on income using the following equation

\[ \begin{align} income_i &= \beta_0 + \beta_1 \times white_i + \beta_2 \times male_i \\ & \quad \quad + \beta_3 \times literacy_i^{(1+0.3 \times male_i)} + \beta_4 (male_i \times white_i) + \varepsilon_i. \end{align} \]

The code below simulates the dataset.
set.seed(123)

# Population size 
n <- 1000

# Noise 
noise <- rnorm(n, 0, 2000)

# Covariates 
male <- rbinom(n, 1, 0.5)
sex <- factor(male, labels = c("Female", "Male"))
white <- rbinom(n, 1, 0.7)
race <- factor(white, labels = c("Black", "White"))
literacy <- runif(n, 0, 100)

# Outcome 
white_female_base <- 30000                           # white females with 0 literacy earn 30000
literacy_effect <- 50
male_premium <- 200                                    # + 200 male premium 
white_premium <- 1500                                  # + 1500 white premium
white_male_premium <- 1000                              # + 1000 white male premium 

# Create income 
income <- 
  white_female_base +
  white_premium       * white +                          
  male_premium        * male +                            
  literacy_effect   * (literacy^(1 + 0.3 * male)) +    # literate males are better off than literate females 
  white_male_premium   * (male * white) +                  
  noise

# Combine into a dataset
sim_data <- data.frame(sex = sex,
                       race = race,
                       literacy = literacy,
                       income = income)

A Simple Univariate Example

Recall from Equation 1 that the CEF is a function of the covariates we use to predict the outcome variable. So, as a simple example of a linear CEF, let us consider the conditional expectation of \(income_i\) as a function of \(male_i\). The conditional expectation can be written as the step function \[ \mathbb{E}[Y_i|male_i] = \begin{cases} \mu_0 \quad \text{if} \quad male_i = 0 \\ \mu_1 \quad \text{if} \quad male_i = 1 \end{cases}. \tag{7}\] To illustrate that Equation 7 is a linear combination of \(male_i\), we can rewrite it as

\[ \mathbb{E}[Y_i |male_i] = \mu_0 + (\mu_1-\mu_0) \times male_i. \tag{8}\]

Now, we can use R to regress \(income_i\) on \(sex_i\) as follows.

library(fixest)

reg_model <- feols(
  income ~ sex,
  data = sim_data
)

summary(reg_model)
OLS estimation, Dep. Var.: income
Observations: 1,000
Standard-errors: IID 
            Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) 33423.97    207.948 160.7323 < 2.2e-16 ***
sexMale      7044.47    295.863  23.8099 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 4,673.0   Adj. R2: 0.361625

To verify that the linear regression above does in fact recover Equation 8, we need to check that (i) its intercept parameter equals the average income of females in the dataset and (ii) the coefficient on \(male_i\) is the difference in the average income of males and females in the dataset. As shown below, this is indeed the case.

library(dplyr)

sim_data %>%
  filter(sex == "Female") %>%
  summarise(female_avg = mean(income, na.rm = TRUE))
  female_avg
1   33423.97
sim_data %>%
  summarise(
    sex_diff = mean(income[sex == "Male"], na.rm = TRUE) -
                       mean(income[sex == "Female"], na.rm = TRUE)
  )
  sex_diff
1 7044.472

Saturated Regressions

The exercise of transforming the step function in Equation 7 into the linear combination in Equation 8 provides an important insight: a similar transformation works for any CEF whose covariates are discrete variables taking finitely many values.


In general, the idea is to include a separate parameter for each possible value of the discrete covariates. The linear regression corresponding to such a CEF is said to be saturated.
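To make this concrete, here is a sketch using the simulated data from above (the coefficient names assume fixest's default treatment coding). Regressing income on sex, race, and their interaction saturates the model, so the fitted values reproduce the four sex-by-race cell means exactly:

```r
library(fixest)

# Saturated regression: one parameter per sex-by-race cell
sat_model <- feols(income ~ sex * race, data = sim_data)

# Under treatment coding, the intercept is the Black-female mean;
# sexMale and raceWhite are main effects (deviations from that baseline);
# and sexMale:raceWhite is the interaction, i.e. how much the male-female
# gap differs between White and Black individuals
coef(sat_model)

# Because the model is saturated, its fitted values coincide with the
# conditional means of income given sex and race
cell_means <- aggregate(income ~ sex + race, data = sim_data, FUN = mean)
```

Comparing `cell_means` with the distinct fitted values of `sat_model` shows an exact match, which is the sense in which a saturated regression recovers the CEF. This also simplifies inference: each coefficient is a difference in (or combination of) group means, so a standard error on a coefficient is a standard error on a comparison of means.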


Footnotes

  1. Again, to abstract away from statistical inference, I do not distinguish between population and sample averages. In practice, however, it is important to note that linear regression recovers sample conditional averages rather than the “true” population conditional averages when the CEF is linear.↩︎