When working with data as a data science or data analyst, regression analysis is very common and something that many industries and companies utilize to understand how different series of data are related.
There are many major companies and industries which use SAS (banking, insurance, etc.), but with the rise of open source and the popularity of languages such as Python and R, these companies are exploring converting their code to Python.
A commonly used procedure for regression analysis in SAS is the PROC REG procedure. In this article, you’ll learn the Python equivalent of PROC REG.
PROC REG Equivalent in Python
In SAS, when we do simple regression analysis on continuous variables, we use PROC REG. PROC REG performs Ordinary Least Squares (OLS).
Let’s say we have data such as the following:
In SAS, to do OLS on this data, for example, to look at the linear relationship between height and weight, we could simply do the following:
The output for this code looks like the following image:
We see here that the linear relationship between height and weight is significant (p_value of 0.0007).
To do this in Python, we can use the statsmodels package. Creating the model and fitting the model is very easy to do. After fitting the model, we print the results to verify we got the same coefficients and p_value as SAS.
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
model = 'height ~ weight'
results = ols(model,data=data).fit()
results.summary()
#output:
# OLS Regression Results
#==============================================================================
#Dep. Variable: height R-squared: 0.600
#Model: OLS Adj. R-squared: 0.569
#Method: Least Squares F-statistic: 19.46
#Date: Sat, 09 Jan 2021 Prob (F-statistic): 0.000703
#Time: 09:39:28 Log-Likelihood: -45.073
#No. Observations: 15 AIC: 94.15
#Df Residuals: 13 BIC: 95.56
#Df Model: 1
#Covariance Type: nonrobust
#==============================================================================
# coef std err t P>|t| [0.025 0.975]
#------------------------------------------------------------------------------
#Intercept 10.2043 4.112 2.481 0.028 1.320 19.088
#weight 0.3520 0.080 4.412 0.001 0.180 0.524
#==============================================================================
#Omnibus: 1.249 Durbin-Watson: 2.506
#Prob(Omnibus): 0.535 Jarque-Bera (JB): 0.334
#Skew: 0.357 Prob(JB): 0.846
#Kurtosis: 3.150 Cond. No. 157.
#==============================================================================
#
#Notes:
#[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Above we see that we obtained the same coefficient and p_value as SAS.
PROC REG Testing Residuals for Normality Equivalent in Python
When doing OLS and regression analysis, one of the main assumptions we need to test for is normality of the residuals.
To do this in SAS, we would do the following with proc univariate:
After running this code,we receive these results:
To do this in Python, we can use the scipy package to get the probability plot, and matplotlib to plot it. In SAS, we specified we wanted studentized residuals. To get these in Python, we need to get to a few more steps.
from scipy import stats
import matplotlib.pyplot as plt
influence = results.get_influence()
studentized_residuals = influence.resid_studentized_external
res = stats.probplot(studentized_residuals, plot=plt)
plt.show()
You can see that the chart is identical to the one produced in SAS.
To get the p_values for the different normality tests, we can use the Anderson and Shapiro functions from the stats package.
result = stats.anderson(studentized_residuals)
print(result)
#output:
#AndersonResult(statistic=0.5182987927026232, critical_values=array([0.498, 0.568, 0.681, 0.794, 0.945]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
stat, p = stats.shapiro(studentized_residuals)
print(stat)
print(p)
#output:
#0.9361917972564697
#0.336889386177063
We see we receive the same statistics from these tests as we received from SAS.
The full code for this example in Python is below:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from scipy import stats
import matplotlib.pyplot as plt
model = 'height ~ weight'
results = ols(model,data=data).fit()
results.summary()
influence = results.get_influence()
studentized_residuals = influence.resid_studentized_external
res = stats.probplot(studentized_residuals, plot=plt)
plt.show()
result = stats.anderson(studentized_residuals)
stat, p = stats.shapiro(studentized_residuals)
I hope that this example has helped you with translating your SAS PROC REG code into Python