Intuition for the perspective that linear regression is the process of approximating the conditional expectation function (CEF).
Published
October 9, 2025
Motivating the Conditional Expectation Function
Finding Trends in Noise
Economic datasets record quantities that vary across observational units. For example, Figure 1 highlights the substantial variation in earnings across individuals in the March 2009 Current Population Survey (CPS). In order to mathematically characterize this variation, it is useful to view economic data as a collection of random variables generated from some underlying probability distribution. The empirical distribution of earnings in the CPS, shown in Figure 2, provides visual intuition for this abstraction.
Show code
# Packages used throughout
library(readxl)   # read_excel
library(here)     # here
library(ggplot2)  # plotting

# Read the CPS dataset
cps <- read_excel(
  here("hasen-econometrics-datasets", "cps09mar", "cps09mar.xlsx"),
  col_names = TRUE
)

# Plot earnings
ggplot(cps, aes(x = seq_along(earnings), y = earnings)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Observational Unit",
    y = "Earnings"
  ) +
  theme_classic()

# Plot the density of earnings
ggplot(cps, aes(x = earnings)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = "lightblue",
    color = "white"
  ) +
  geom_density(color = "black", linewidth = 0.5) +
  labs(x = "Earnings", y = "Density") +
  theme_classic()
Figure 1: Earnings Data from the March 2009 Current Population Survey (CPS).
Figure 2: Empirical Density of Earnings in the March 2009 Current Population Survey (CPS).
The variation in the data arises from a mix of observed (e.g. age) and unobserved factors (e.g. ability). We seek to satisfactorily explain the heterogeneity that arises in economic datasets. A simple first step is to stratify the data by some aggregate characteristics of the observational unit. In our example, we could plot the earnings density separately for men and women, as shown in Figure 3.
Show code
# Convert sex variable to factor
cps$female <- factor(cps$female, levels = c(0, 1), labels = c("Male", "Female"))

# Plot density by sex
ggplot(cps, aes(x = earnings, color = female)) +
  geom_density(linewidth = 0.8) +
  labs(x = "Earnings", y = "Density", color = "Sex") +
  theme_classic()
Figure 3: Empirical Density by Sex (CPS Data)
Density functions, while informative about the overall distributional shape, are often complex and can obscure key features of variation in the data. For example, Figure 3 shows that the earnings density for men is shifted to the right relative to that for women. Yet the plot alone does not reveal the magnitude of this difference. An alternative way to summarize the relationship between earnings and sex is to calculate the conditional expectations:
\[
\mathbb{E}[earnings \mid sex = male] = 63,147 \quad \text{and} \quad \mathbb{E}[earnings \mid sex = female] = 44,224.
\]
The functional form of the conditional expectations is useful in summarizing the overall relationship between two variables. To see this, consider the blue line in Figure 4. By plotting the conditional expectation of log earnings by education level, this curve reduces the substantial variation of earnings at each education level — illustrated by the gray density plots — into a single point. In doing so, it reveals the systematic upward trend in log earnings as education increases.
Show code
# Map numeric education codes to readable ordered labels (once)
edu_code_levels <- c(9, 10, 11, 12, 13, 14, 16, 18, 20)
edu_code_labels <- c(
  "9th grade",
  "10th grade",
  "11th grade / 12th no diploma",
  "High school graduate",
  "Some college, no degree",
  "Associate degree",
  "Bachelor's degree",
  "Master's degree",
  "Professional/Doctorate"
)

cps <- cps %>%
  mutate(
    education_cat = factor(
      education,
      levels = edu_code_levels,
      labels = edu_code_labels,
      ordered = TRUE
    )
  )

# Lock the desired display order explicitly (prevents reordering later)
edu_levels <- edu_code_labels

# Keep positive earnings and build log earnings
cps_pos <- cps %>%
  filter(earnings > 0, !is.na(education_cat)) %>%
  mutate(
    log_earnings = log(earnings),
    education_cat = factor(education_cat, levels = edu_levels, ordered = TRUE)
  )

# CEF: mean log earnings by education category (all categories)
cef_data <- cps_pos %>%
  group_by(education_cat) %>%
  summarise(mean_log_earnings = mean(log_earnings), .groups = "drop") %>%
  mutate(education_cat = factor(education_cat, levels = edu_levels, ordered = TRUE))

# Choose categories where you want the half-violins
selected_levels <- c("10th grade", "High school graduate", "Bachelor's degree")

violin_data <- cps_pos %>%
  filter(education_cat %in% selected_levels) %>%
  mutate(education_cat = factor(education_cat, levels = edu_levels, ordered = TRUE))

# Plot: overlay half-violins on the selected levels
ggplot() +
  ggdist::stat_halfeye(
    data = violin_data,
    aes(x = education_cat, y = log_earnings),
    side = "right",
    normalize = "groups",
    width = 0.9,
    slab_alpha = 0.35,
    slab_fill = "grey70",
    slab_color = NA,
    interval_size = 0,
    point_size = 0
  ) +
  geom_line(
    data = cef_data,
    aes(x = education_cat, y = mean_log_earnings, group = 1),
    color = "blue", linewidth = 0.8
  ) +
  geom_point(
    data = cef_data,
    aes(x = education_cat, y = mean_log_earnings),
    color = "blue", size = 2
  ) +
  scale_x_discrete(limits = edu_levels, drop = FALSE) +
  coord_cartesian(ylim = c(8, 12)) +
  labs(x = "Education Level", y = "Log Earnings") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
Figure 4: Conditional Expectation of Log Earnings Given Education Level (CPS Data)
More generally, we can summarize the relationship between an outcome \(Y\) and a set of regressors \(X = (X_1, \ldots, X_K)\) with the conditional expectation function (CEF) \[
m(X) = \mathbb{E}[Y \mid X].
\tag{1}\]
Optimal Predictor
The heterogeneity in economic outcomes naturally motivates a prediction problem: given a random vector \(X\) of regressors, what function \(g(X)\) provides the best prediction of the outcome \(Y\)? One way to define “best” is to choose the function that minimizes the mean squared error (MSE):
\[
\min_{g(\cdot)} \; \mathbb{E}\big[\big(Y - g(X)\big)^2\big].
\tag{2}\]
As it turns out, the CEF is the solution to this problem! In other words, the CEF is the best predictor of \(Y\) given \(X\) in the sense that it minimizes the mean squared error between the predictions and the actual outcomes.
To see this, we need to first define the error term as the difference between the outcome and the CEF
\[
e := Y - m(X).
\]
This allows us to decompose the outcome variable as
\[
Y = m(X) + e,
\tag{3}\]
where \(m(X)\) is the systematic component of \(Y\) — the part that can be explained by \(X\) on average — and \(e\) is the remaining idiosyncratic variation in \(Y\).
Since the error is defined as a function of the outcome and regressors, it is itself a random variable. Moreover, we can show that its distribution has two key properties. First, by construction, the error has zero conditional mean:
\[
\mathbb{E}[e \mid X] = \mathbb{E}\big[Y - m(X) \mid X\big] = \mathbb{E}[Y \mid X] - m(X) = 0.
\tag{4}\]
Second, the error is uncorrelated with any function of the regressors:
\[
\mathbb{E}\big[h(X) e \big] \overset{(a)}{=} \mathbb{E}\big[\mathbb{E}[h(X) e | X]\big] \overset{(b)}{=} \mathbb{E}\big[h(X) \mathbb{E}[e | X]\big] \overset{(c)}{=} 0,
\tag{5}\] where \((a)\) uses the law of iterated expectations, \((b)\) takes \(h(X)\) outside the conditional expectation since it is a function of \(X\), and \((c)\) applies Equation 4.
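Both properties can be checked numerically. Below is a minimal sketch on simulated data (the setup is purely illustrative and not from the CPS), where in-sample group means play the role of the CEF:

```r
set.seed(7)

# Simulate a discrete regressor and an outcome
n <- 5000
x <- sample(1:3, n, replace = TRUE)
y <- 2 * x + rnorm(n)

# Empirical CEF (within-group means of y) and the implied error term
m_hat <- ave(y, x)
e <- y - m_hat

# Property 1: zero conditional mean (within each value of x)
tapply(e, x, mean)   # all numerically zero

# Property 2: uncorrelated with any function of x, e.g. h(x) = x^2
mean(x^2 * e)        # numerically zero
```

Within each group, residuals from the group mean sum to exactly zero, which is why \(\mathbb{E}[h(X)e]\) vanishes for any \(h\): the sum factors group by group.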
Now notice that for any function \(g(X)\), we have
\[
\begin{aligned}
\mathbb{E}\big[\big(Y - g(X)\big)^2\big] &= \mathbb{E}\big[\big(e + m(X) - g(X)\big)^2\big] \\
&= \mathbb{E}\big[e^2\big] + 2\mathbb{E}\big[e\big(m(X) - g(X)\big)\big] + \mathbb{E}\big[\big(m(X) - g(X)\big)^2\big] \\
&= \mathbb{E}\big[e^2\big] + \mathbb{E}\big[\big(m(X) - g(X)\big)^2\big] \\
&\geq \mathbb{E}\big[e^2\big] \\
&= \mathbb{E}\big[\big(Y - m(X)\big)^2\big],
\end{aligned}
\] where the third equality follows from Equation 5 applied with \(h(X) = m(X) - g(X)\). In words, the mean squared error of any predictor \(g(X)\) is always at least as large as the mean squared error of the CEF \(m(X)\). Thus, we have shown that the CEF is indeed the solution to Equation 2.
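This optimality is easy to see in a simulation. The sketch below (an illustrative setup, not the CPS data) compares the in-sample MSE of the empirical CEF against a mis-specified linear fit when the true CEF is nonlinear:

```r
set.seed(1)

# Discrete regressor with a nonlinear true CEF: m(x) = (x + 1)^2
n <- 10000
x <- sample(0:2, n, replace = TRUE)
y <- (x + 1)^2 + rnorm(n)

# Predictor 1: sample conditional means (the empirical CEF)
cef_hat <- ave(y, x)
mse_cef <- mean((y - cef_hat)^2)

# Predictor 2: a mis-specified linear-in-x predictor
fit <- lm(y ~ x)
mse_lin <- mean((y - fitted(fit))^2)

mse_cef <= mse_lin   # the CEF-based predictor weakly wins
```

Because the sample conditional means minimize in-sample MSE over all functions of \(x\), the inequality holds by construction, mirroring the population argument above.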
Best Linear Predictor and Regression
In most empirical cases, however, the functional form of the CEF is unknown. Thus, it is more practical to model the relationship between the outcome and covariates using a simpler function. One option is to find a linear approximation of the CEF, which takes the form
\[
\ell(x_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_k x_{ki} = x_i^T\beta,
\tag{6}\] where \(x_i^T = (1, x_{1i}, x_{2i}, \ldots, x_{ki})\). Note that the word “linear” here refers to the simplifying assumption that the function is a linear combination of the covariates.
How do we choose the specific form of our approximating linear function? Since we want our predictions to minimize the MSE, we set the parameters in Equation 6 as
\[
\beta = \arg\min_{b \in \mathbb{R}^{k+1}} \; \mathbb{E}\big[\big(Y_i - x_i^T b\big)^2\big].
\]
This minimization problem is often called the “linear regression of \(Y\) on \(X\)” by economists. It simply refers to the process of finding the best linear predictor of \(Y_i\) given some realized values of \(X_i\) by minimizing the mean squared error function.
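For reference, the standard derivation (not spelled out in the text above) takes the first-order condition of this minimization, which says the prediction error is orthogonal to the regressors; whenever \(\mathbb{E}[x_i x_i^T]\) is invertible, this pins down the coefficients in closed form:
\[
\mathbb{E}\big[x_i\big(Y_i - x_i^T \beta\big)\big] = 0
\quad \Longrightarrow \quad
\beta = \big(\mathbb{E}[x_i x_i^T]\big)^{-1}\,\mathbb{E}[x_i Y_i].
\]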
Linear Regression Can (Sometimes) Recover True Conditional Averages
So far, we have discussed linear regression as a tool to approximate the conditional expectation function. However, it can be shown that when the CEF is itself linear, the best linear predictor exactly equals the CEF. That is to say linear regression recovers the true conditional averages in this case.1
To see this in practice, I simulate data on income using the following equation:
\[
income_i = 30{,}000 + 200\, male_i + 1{,}500\, white_i + 1{,}000\,(male_i \times white_i) + 50\, literacy_i^{\,1 + 0.3\, male_i} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\, 2000^2).
\]
set.seed(123)

# Population size
n <- 1000

# Noise
noise <- rnorm(n, 0, 2000)

# Covariates
male <- rbinom(n, 1, 0.5)
sex <- factor(male, labels = c("Female", "Male"))
white <- rbinom(n, 1, 0.7)
race <- factor(white, labels = c("Black", "White"))
literacy <- runif(n, 0, 100)

# Outcome parameters
white_female_base <- 30000   # white females with 0 literacy earn 30000
literacy_effect <- 50
male_premium <- 200          # + 200 male premium
white_premium <- 1500        # + 1500 white premium
white_male_premium <- 1000   # + 1000 white male premium

# Create income
income <- white_female_base +
  white_premium * white +
  male_premium * male +
  literacy_effect * (literacy^(1 + 0.3 * male)) +  # literate males are better off than literate females
  white_male_premium * (male * white) +
  noise

# Combine into a dataset
sim_data <- data.frame(sex = sex, race = race, literacy, income)
A Simple Univariate Example
Recall from Equation 1 that the CEF is a function of the covariates we use to predict the outcome variable. So, as a simple example of a linear CEF, let us consider the conditional expectation of \(income_i\) as a function of \(male_i\). The conditional expectation can be written as the step function \[
\mathbb{E}[Y_i|male_i] = \begin{cases} \mu_0 \quad \text{if} \quad male_i = 0 \\ \mu_1 \quad \text{if} \quad male_i = 1 \end{cases}.
\tag{7}\] To illustrate that Equation 7 is a linear combination of \(male_i\), we can rewrite it as
\[
\mathbb{E}[Y_i \mid male_i] = \mu_0 + (\mu_1 - \mu_0)\, male_i.
\tag{8}\]
To verify that a linear regression of income on the \(male_i\) dummy does in fact recover Equation 8, we need to check that (i) its intercept equals the average income of females in the dataset and (ii) the coefficient on \(male_i\) equals the difference between the average incomes of males and females in the dataset. Running this regression confirms that this is indeed the case.
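Here is a self-contained sketch of that check. It regenerates a simplified version of the simulated data rather than reusing the `sim_data` object above, so the specific numbers are illustrative:

```r
set.seed(123)

# A simplified simulated dataset: income depends only on the male dummy
n <- 1000
male <- rbinom(n, 1, 0.5)
income <- 30000 + 5000 * male + rnorm(n, 0, 2000)
dat <- data.frame(male, income)

# Linear regression of income on the male dummy
fit <- lm(income ~ male, data = dat)

# Sample conditional means
mu0 <- mean(dat$income[dat$male == 0])  # average female income
mu1 <- mean(dat$income[dat$male == 1])  # average male income

# (i) intercept = mu0 and (ii) slope = mu1 - mu0, up to numerical tolerance
all.equal(unname(coef(fit)), c(mu0, mu1 - mu0))
```

With a single binary regressor, this equality is exact OLS algebra, not an approximation: the regression line must pass through both group means.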
The exercise of transforming the step function in Equation 7 into a linear combination of the covariate in Equation 8 provides an important insight. It turns out that we can perform a similar transformation for any CEF whose covariates are discrete variables taking on a finite set of values.
In general, the idea is to include a separate parameter for each possible value of the discrete covariates. The linear regression corresponding to such a CEF is said to be saturated.
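As an illustration (again using simulated data with hypothetical premiums, not the `sim_data` object above), a regression that interacts two dummies is saturated in \((male_i, white_i)\): with one free parameter per cell, its coefficients reproduce every cell's conditional average exactly.

```r
set.seed(42)

# Simulate income from two binary covariates with an interaction
n <- 2000
male <- rbinom(n, 1, 0.5)
white <- rbinom(n, 1, 0.7)
income <- 30000 + 200 * male + 1500 * white + 1000 * male * white + rnorm(n, 0, 2000)
dat <- data.frame(male, white, income)

# Saturated regression: one free parameter per (male, white) cell
fit <- lm(income ~ male * white, data = dat)
b <- coef(fit)

# Coefficient-implied mean for the (male = 1, white = 1) cell
implied <- b["(Intercept)"] + b["male"] + b["white"] + b["male:white"]

# It matches the raw conditional average in that cell
cell_mean <- mean(dat$income[dat$male == 1 & dat$white == 1])
all.equal(unname(implied), cell_mean)
```

The intercept is the baseline cell's mean, the main effects shift one dummy at a time, and the interaction term absorbs whatever is left, so no cell mean is ever approximated away.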
Footnotes
Again, to abstract away from statistical inference, I do not distinguish between population and sample averages. In practice, however, it is important to note that linear regression recovers sample conditional averages rather than the “true” population conditional averages when the CEF is linear.↩︎