Regression analysis is a common statistical tool used to model relationships between variables and to explore the influencing factors underlying observed spatial data patterns. This entry focuses on the most basic form of regression model: linear regression. The notations, inference, assumptions, and diagnostics of linear regression are introduced, and interpretations of linear regression results are demonstrated using an empirical example in R software. The entry concludes with a brief discussion of the challenges of applying standard linear regression to spatial data.
Li, Z. (2024). Regression Fundamentals. The Geographic Information Science & Technology Body of Knowledge (2024 Edition). John P. Wilson (ed.). DOI: 10.22224/gistbok/2024.1.11.
Regression analysis is a fundamental statistical tool used by geographers and spatial scientists to study and model relationships in spatial data. Regression allows for the exploration of how one or more factors are correlated with an outcome of interest. The outcome of interest is commonly referred to as the dependent variable in a regression model, while the potential influencing factors are referred to as independent variables. Regression models are estimated from data in a tabular structure, providing insights into the processes and relationships behind spatial data. These models help researchers identify significant factors, quantify the strength and direction of relationships, and predict outcomes given new data. For example, geographers have used regression analysis to study social determinants of spatial health disparities (e.g., Anderson et al., 2023), examine physical and human factors associated with wildfire occurrence (e.g., Oliveira et al., 2012), explore socio-demographic factors in determining voting behavior (e.g., Fotheringham et al., 2021), and predict house prices using property and neighborhood-level attributes (e.g., Bourassa et al., 2007), among other applications. Regression modeling is arguably the most widely applied research method in quantitative geography and other disciplines.
A general regression model takes the form of:

$$y = f(x_1, x_2, \dots, x_k) + \epsilon$$

where $y$ is the dependent variable, $x_1, \dots, x_k$ are the independent variables, $f$ is the function relating them to $y$, and $\epsilon$ is a random error term.
The most basic form of a regression model is simple linear regression, which has one dependent variable and one independent variable. The function is represented by a straight line with an intercept and a slope:

$$y = \beta_0 + \beta_1 x + \epsilon$$
where $\beta_0$ is the intercept and $\beta_1$ is the slope of the regression line. As shown in Figure 1, the intercept is the value of $y$ when $x$ equals zero, and the slope measures the change in $y$ when $x$ changes by one unit. The intercept and the slope of the regression line are estimated from the sampled data based on the least squares criterion, which minimizes the sum of the squared differences between the observed values and the values predicted by the line.
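To make the least squares criterion concrete, the short R sketch below uses simulated data (the variable names and true parameter values are illustrative) to compute the slope and intercept directly from sample statistics and confirms that lm() returns the same estimates.

```r
# Simulated data: true intercept = 3, true slope = 2
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(50)

# Least squares estimates computed from sample statistics
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)
coef(lm(y ~ x))                          # lm() returns the same values
```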
Simple linear regression can be extended to the multivariate case when there are multiple ($k$) independent variables. Multiple linear regression is formulated as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$
where the intercept $\beta_0$ is the value of $y$ when all independent variables equal zero. Each independent variable $x_j$ is associated with a slope coefficient $\beta_j$ that quantifies the change in $y$ for a one-unit change in $x_j$, while holding all other variables constant. Following the same least squares criterion, all regression coefficients can be estimated by:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$
where $X$ is an $n \times (k+1)$ matrix of the independent variables, including a column of ones for the intercept term, $X^{T}$ is the transpose of the $X$ matrix, and $(X^{T}X)^{-1}$ is the inverse of the $X^{T}X$ matrix. The resulting $\hat{\beta}$ is a $(k+1) \times 1$ vector of the estimated coefficients, where the last element is the intercept estimate.
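As an illustration, the following R sketch (with simulated data and hypothetical variable names) applies the matrix formula directly and compares the result to lm().

```r
# Simulated data with two independent variables
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Closed-form least squares solution
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat
coef(lm(y ~ x1 + x2))                    # matches the closed-form estimates
```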
The data we collect and observe is a sample of a population and is subject to sampling variation. As a result, the regression coefficients estimated from the collected sample will also change when independent sampling from the population is repeated. Therefore, performing statistical inference such as hypothesis testing is critical to understanding the true underlying relationships in the population. The most common task is to determine the statistical significance of the independent variables, which indicates the strength of their association with the dependent variable. To test the significance of an individual coefficient $\beta_j$, the following hypothesis test is performed:
$$H_0: \beta_j = 0 \quad \text{(null hypothesis)}$$
$$H_1: \beta_j \neq 0 \quad \text{(alternative hypothesis)}$$
The null hypothesis states that the coefficient of the independent variable $x_j$ is equal to zero, which indicates that there is no relationship between the independent variable and the dependent variable in the population. The test statistic (t-value) for $\beta_j$ is given by:

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

where $\hat{\beta}_j$ is the estimated coefficient and $SE(\hat{\beta}_j)$ is the standard error of $\hat{\beta}_j$. This test statistic follows a t-distribution with $n - k - 1$ degrees of freedom, where $n$ is the sample size and $k$ is the number of independent variables. A p-value, which is the probability of obtaining the observed t-value when the null hypothesis is true, can be calculated from the area under the t-distribution curve that is more extreme than the observed t-value. If the p-value is smaller than a specified threshold (0.05 is a commonly referenced value), then we can reject the null hypothesis and conclude that the coefficient $\beta_j$ is significantly different from zero and that there is a statistically significant association between the independent variable $x_j$ and the dependent variable.
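The R sketch below, using R's built-in mtcars data set rather than spatial data, reproduces the t-value and p-value reported by summary() for a single coefficient.

```r
# Fit a model to R's built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Reproduce the t-test for the coefficient of wt
est  <- coef(fit)["wt"]                              # estimated coefficient
se   <- sqrt(diag(vcov(fit)))["wt"]                  # its standard error
tval <- est / se
dof  <- df.residual(fit)                             # n - k - 1
pval <- 2 * pt(abs(tval), df = dof, lower.tail = FALSE)

c(t_value = unname(tval), p_value = unname(pval))
summary(fit)$coefficients["wt", ]                    # same values from summary()
```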
In addition to testing an individual coefficient's significance, another common task is to evaluate the goodness-of-fit of a linear regression model. The most commonly used goodness-of-fit metric is the Coefficient of Determination ($R^2$), which measures the proportion of the variance in the dependent variable that is explained by the independent variables. It is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the overall mean of the dependent variable. An $R^2$ value of 1 indicates a model that perfectly explains the dependent variable, while an $R^2$ value of 0 indicates a model with no explanatory power. A higher $R^2$ value indicates greater explanatory power when the model assumptions are satisfied. Because $R^2$ will increase whenever more independent variables are added, an adjusted $R^2$ that penalizes model complexity and potential overfitting is often used; it accounts for the number of independent variables $k$ and the number of observations $n$:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
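The following R sketch (again using the built-in mtcars data for convenience) computes both quantities from the residual and total sums of squares.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

y_obs <- mtcars$mpg
rss   <- sum(residuals(fit)^2)           # residual sum of squares
tss   <- sum((y_obs - mean(y_obs))^2)    # total sum of squares

n <- nrow(mtcars)
k <- 2                                   # number of independent variables

r2     <- 1 - rss / tss
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

c(r_squared = r2, adj_r_squared = r2_adj)    # match the values in summary(fit)
```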
Another commonly used test statistic for overall model significance is the F-statistic, given by:

$$F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$$

This F-statistic follows an F-distribution with $k$ and $n - k - 1$ degrees of freedom. The null hypothesis of the F-test is that all regression coefficients are zero, indicating that the model with the included independent variables is not significantly different from a model with only an intercept. Accordingly, a p-value can be calculated based on the observed F-statistic and the F-distribution. A p-value smaller than a specified threshold (e.g., 0.05) rejects the null hypothesis and indicates that the model provides a better fit to the data than a model containing no independent variables.
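A minimal R sketch of this calculation, continuing the mtcars example above, is shown below; pf() supplies the corresponding p-value.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2  <- summary(fit)$r.squared
n   <- nrow(mtcars)
k   <- 2                                 # number of independent variables

f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(F_statistic = f_stat, p_value = p_val) # same F-test reported by summary(fit)
```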
Linear regression, as one of the most fundamental statistical tools, is available in popular GIS and statistical software, including ArcGIS Pro, R, Python, MATLAB, Microsoft Excel, Stata, and SAS, among others. Here, fitting a linear regression model is demonstrated using the open-source R programming language. The R code and output are shown in Figure 2. A county-level voting data set from the 2020 Presidential Election (Fotheringham and Li, 2023) is used as an example. The voting data are loaded as an R data frame. The dependent variable in the model is the percentage of people who voted for the Democratic party (pct_dem), and three independent variables are selected to create a simple model: the percentage of people who have a Bachelor's degree or higher (pct_bach), the ratio of males to females (sex_ratio), and the log-transformed population density (log_pop_den). A linear model can then be fitted using the lm() function.
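A minimal sketch of this workflow is given below; the file name is hypothetical and stands in for wherever the voting data are stored.

```r
# Load the county-level voting data (file name is hypothetical)
voting <- read.csv("county_voting_2020.csv")

# Fit the linear regression model with three independent variables
model <- lm(pct_dem ~ pct_bach + sex_ratio + log_pop_den, data = voting)

# Print the regression summary
summary(model)
```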
Once the model is fitted, the user can call the summary() function on the fitted model object to output a regression summary. The Residuals section provides descriptive statistics of the model residuals. The Coefficients section provides the estimated coefficients for the intercept and each independent variable, along with their standard errors, t-values, and p-values. The intercept of the model is 6.072487, suggesting that when all independent variables are zero, the baseline percentage of Democratic votes in a county is approximately 6.07%. The coefficient for pct_bach is 0.637875, indicating that for each 1% increase in the percentage of people with a Bachelor's degree, holding all other variables constant, the percentage of Democratic votes increases by about 0.64%. The p-value associated with pct_bach is less than 2e-16 (i.e., $2 \times 10^{-16}$), indicating that the coefficient is statistically different from zero. Similarly, the coefficient for log_pop_den is 3.372961, meaning a one-unit increase in the log-transformed population density is associated with a 3.37% increase in Democratic votes, holding all other variables constant, also with a highly significant p-value of less than 2e-16. In contrast, the sex_ratio coefficient is 0.009763 with a p-value of 0.6509, which is not statistically significant, indicating that the county-level sex ratio is not significantly associated with the Democratic vote share.
At the bottom of the summary output, the multiple R-squared value is 0.3871 and the adjusted R-squared value is 0.3865, indicating that approximately 38.7% of the variance in pct_dem is explained by the model. The F-statistic of the model is 653.5 with a p-value less than 2.2e-16, suggesting that the included independent variables collectively have a significant relationship with the percentage of Democratic votes.
In order to ensure that linear regression results (i.e., coefficient estimates and inference) are valid (meaning the OLS estimators are unbiased and have the lowest sampling variance according to the Gauss-Markov theorem when the assumptions hold), it is important to verify that certain statistical assumptions are met. For linear regression, it is convenient to refer to the "LINE" assumptions. These assumptions are:
Linearity, where the relationship between the dependent variable and the independent variables is linear;
Independence, where the error terms are independent of one another;
Normality, where the error terms are normally distributed; and
Equal variance, where the error terms have constant variance (homoscedasticity) across all values of the independent variables.
A couple of visual plots are helpful in diagnosing assumption violations. For example, a residuals vs. fitted values plot (Figure 3) is useful for checking non-linearity and heteroscedasticity. In a well-behaved plot, the residuals should "bounce randomly" and roughly form a "horizontal band" around the 0 line, suggesting that the relationship is linear and the variance is equal.
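Assuming a fitted model object named model (such as the one from the voting example above), a residuals vs. fitted values plot can be produced in base R as follows.

```r
# Residuals vs. fitted values for a fitted model object `model`
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)       # residuals should scatter randomly around this line

# plot(model, which = 1) produces base R's built-in version of the same plot
```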
Plotting histograms and Quantile-Quantile (Q-Q) plots of the residuals can check for residual normality (Figure 4). A Q-Q plot shows the quantiles of the residuals against the quantiles of the theoretical normal distribution. Often, a diagonal reference line is plotted on a Q-Q plot; if the residuals come from a normal distribution, the points should fall along this reference line. Dependence of errors often occurs in time-series and spatial data due to temporal and spatial dependency. A plot of the residuals following the temporal order or spatial order (i.e., a map) helps provide insight into the degree of dependency.
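A histogram and Q-Q plot of the residuals can be produced with a few lines of base R, again assuming a fitted model object named model.

```r
res <- residuals(model)

# Histogram of residuals
hist(res, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals")

# Q-Q plot of residuals against the theoretical normal distribution
qqnorm(res)
qqline(res, lty = 2)         # points should fall close to this reference line
```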
When multiple independent variables are used in a regression model, it is important to check for correlations among these variables to avoid the issue of multicollinearity, which occurs when one variable can be linearly (or almost linearly) explained by the others. This can happen, for example, when some independent variables sum to a constant, such as including all percentages of racial groups or land cover classes of a geographic unit in one model. The consequence of multicollinearity is that the estimated regression coefficients will have large uncertainties and reduced precision. Common checks include calculating bivariate correlation coefficients to identify and remove variables that are highly correlated (e.g., > 0.8) with others, and using the Variance Inflation Factor (VIF), which quantifies how much the variance of a coefficient estimate is inflated by its linear relationship with the other independent variables; VIF values above commonly used thresholds (e.g., 5 or 10) indicate problematic multicollinearity.
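A sketch of these checks is given below; it assumes the voting data frame and fitted model from the earlier example and uses the vif() function from the third-party car package.

```r
# Pairwise correlations among the independent variables
round(cor(voting[, c("pct_bach", "sex_ratio", "log_pop_den")]), 2)

# Variance Inflation Factors (requires the car package)
library(car)
vif(model)                   # values well above ~5-10 would signal multicollinearity
```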
The challenge of modeling spatial data using standard linear regression methods arises due to potential spatial effects that govern the data-generating processes, namely spatial autocorrelation and heterogeneity. Failing to account for these spatial effects will result in model residuals that are spatially correlated and heteroskedastic, which violates the Independence and Equal Variance assumptions mentioned in the above section. Consequently, the regression coefficients may be biased and have inflated variances (Anselin and Bera, 1998; Dormann et al., 2007). A common diagnostic is to calculate a spatial autocorrelation measure of the regression residuals, such as Moran’s I (its calculation can be found in AM-03-022 Global Measures of Spatial Association). If Moran’s I indicates substantial spatial autocorrelation in the residuals, spatial statistical models should be used instead, such as various forms of spatial econometric models (Anselin, 1988) and geographically weighted regression models (Fotheringham et al., 2023). Additionally, linear regression is the most basic form of a supervised machine learning model, and for more complicated non-linear processes, one should resort to more advanced statistical or machine learning methods. More details can be found in these entries: AM-32-032- Spatial Autoregressive Models, AM-34-034 - The Geographically Weighted Regression Framework, and AM-08-094 - Machine Learning Approaches.
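As a sketch, the code below tests for spatial autocorrelation in the residuals of the fitted model using the spdep package; the shapefile name, the use of queen contiguity, and the assumption that the county polygons are ordered to match the rows of the voting data are all illustrative.

```r
library(sf)
library(spdep)

# County polygons assumed to be in the same row order as the voting data
counties <- st_read("us_counties.shp")

# Queen-contiguity neighbors and row-standardized spatial weights
nb <- poly2nb(counties, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Moran's I test applied to the residuals of the fitted linear model
lm.morantest(model, lw, zero.policy = TRUE)
```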
Explain the general concepts of regression modeling.
Fit a linear regression model from data.
Interpret the results of a linear regression model.
Explain why spatial data may violate some assumptions of linear regression models.