Fitting models using R-style formulas
Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs.
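To get a feel for what patsy produces behind the scenes, you can call patsy.dmatrices directly. The following is a minimal sketch on a small made-up data frame (the column names y, x, and g are purely illustrative):

import pandas as pd
from patsy import dmatrices

toy = pd.DataFrame({'y': [1.0, 2.0, 3.0, 4.0],
                    'x': [0.5, 1.5, 2.5, 3.5],
                    'g': ['a', 'b', 'a', 'b']})

# dmatrices returns the response and the design matrix that a
# formula-based model would be fit against
y, X = dmatrices('y ~ x + g', data=toy, return_type='dataframe')
print(X.columns.tolist())  # an intercept, a dummy-coded g, and x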
Loading modules and functions
In [1]: import statsmodels.api as sm

In [2]: import statsmodels.formula.api as smf

In [3]: import numpy as np

In [4]: import pandas
Notice that we import statsmodels.formula.api in addition to the usual statsmodels.api. The formula.api module hosts many of the same functions found in api (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models. In general, lower case models accept formula and df arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula, and df takes a pandas data frame.

dir(smf) will print a list of available models.
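For example, the lower case ols and the upper case OLS fit the same model through the two calling conventions. The following sketch uses a small made-up data frame purely for illustration; nothing in it comes from the Guerry example below:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

toy = pd.DataFrame({'y': [1.0, 2.1, 2.9, 4.2],
                    'x': [1.0, 2.0, 3.0, 4.0]})

# lower case: a patsy formula string plus a data frame
res_formula = smf.ols(formula='y ~ x', data=toy).fit()

# upper case: explicit endog and exog design matrices
exog = sm.add_constant(toy[['x']])
res_matrix = sm.OLS(toy['y'], exog).fit()

print(res_formula.params)
print(res_matrix.params)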
Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)
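Reusing the toy data frame from the sketch above, subset takes an array-like of booleans, integers, or index labels that selects which rows of data enter the fit; the mask below is only a hypothetical example:

# fit on a subset of rows selected by a boolean mask
mask = toy['x'] > 1.5
res = smf.ols('y ~ x', data=toy, subset=mask).fit()
print(res.nobs)  # only the masked rows are used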
OLS regression using formulas
To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:
In [5]: df = sm.datasets.get_rdataset("Guerry", "HistData").data

In [6]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

In [7]: df.head()
Out[7]: 
   Lottery  Literacy  Wealth Region
0       41        37      73      E
1       38        51      22      N
2       66        13      61      C
3       80        46      76      E
4       79        69      83      E
Fit the model:
In [8]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)

In [9]: res = mod.fit()

In [10]: print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Mon, 14 May 2018   Prob (F-statistic):           1.07e-05
Time:                        21:46:27   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Categorical variables
Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy's default is also to include an intercept, so we automatically dropped one of the Region categories.
If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:
In [11]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()

In [12]: print(res.params)
Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64
Examples of more advanced features of patsy's categorical variables can be found here: Patsy: Contrast Coding Systems for categorical variables.
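As one small illustration of contrast coding (a sketch, not part of the original example), patsy's Treatment coding lets you choose which Region level serves as the reference category:

# assumes the Guerry data frame `df` from above; make 'E' the reference
# (omitted) level instead of the default first category
res = smf.ols(formula="Lottery ~ Literacy + Wealth + C(Region, Treatment(reference='E'))",
              data=df).fit()
print(res.params)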
Operators
We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.
Removing variables
The “-” sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:
In [13]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()

In [14]: print(res.params)
C(Region)[C]    38.651655
C(Region)[E]    23.223870
C(Region)[N]    28.634694
C(Region)[S]    34.103399
C(Region)[W]    28.560379
Literacy        -0.185819
Wealth           0.451475
dtype: float64
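patsy, following R, also accepts “+ 0” as an alternative way to drop the intercept; the following sketch should be equivalent to the “- 1” fit above:

# "+ 0" removes the intercept just like "- 1"
res_alt = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) + 0', data=df).fit()
print(res_alt.params)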
Multiplicative interactions
“:” adds a new column to the design matrix with the product of the two columns it joins. “*” will also include the individual columns that were multiplied together:
In [15]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()

In [16]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()

In [17]: print(res1.params)
Literacy:Wealth    0.018176
dtype: float64

In [18]: print(res2.params)
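“Literacy * Wealth” is shorthand for the two main effects plus their interaction, so the spelled-out formula below should reproduce res2 (a sketch, not part of the original page):

# "*" expands to the main effects plus the ":" interaction term
res3 = smf.ols(formula='Lottery ~ Literacy + Wealth + Literacy:Wealth - 1', data=df).fit()
print(res3.params)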