31 Statistics in Python

To install Python and these dependencies, we recommend that you downloadAnaconda PythonorEnthought Canopy, or preferably use the package manager if you are under Ubuntu or other linux.

: This chapter does not cover tools for Bayesian statistics. Of particular interest for Bayesian modelling isPyMC, which implements a probabilistic programming language in Python.

: TheThink statsbook is available as free PDF or in print and is a great introduction to statistics.

R is a language dedicated to statistics. Python is a general-purpose language with statistics modules. R has more statistical analysis features than Python, and specialized syntaxes. However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.

Data representation and interaction

Hypothesis testing: comparing two groups

Students t-test: the simplest statistical test

Paired tests: repeated measurements on the same indivuals

Linear models, multiple factors, and analysis of variance

formulas to specify statistical models in Python

Multiple Regression: including multiple factors

Post-hoc hypothesis testing: analysis of variance (ANOVA)

More visualization: seaborn for statistical exploration

lmplot: plotting a univariate regression

Solutions to this chapters exercises

In this document, the Python inputs are represented with the sign .

Some of the examples of this tutorial are chosen around gender questions. The reason is that on such questions controlling the truth of a claim actually matters to many people.

The setting that we consider for statistical analysis is that of multipleobservationsorsamplesdescribed by a set of differentattributesorfeatures. The data can than be seen as a 2D table, or matrix, with columns giving the different attributes of the data, and rows the observations. For instance, the data contained inexamples/brain_size.csv:

We will store and manipulate this data in apandas.DataFrame, from thepandasmodule. It is the Python equivalent of the spreadsheet table. It is different from a 2Dnumpyarray as it has named columns, can contain a mixture of different data types by column, and has elaborate selection and pivotal mechanisms.

It is a CSV file, but the separator is ;

Reading from a CSV file:Using the above CSV file that gives observations of brain size and weight and IQ (Willerman et al. 1991), the data are a mixture of numerical and categorical values:

Unnamed: 0 Gender FSIQ VIQ PIQ Weight Height MRI_Count

0 1 Female 133 132 124 118.0 64.5 816932

1 2 Male 140 150 124 NaN 72.5 1001121

2 3 Male 139 123 150 143.0 73.3 1038437

3 4 Male 133 129 128 172.0 68.8 965353

4 5 Female 137 132 134 147.0 65.0 951545

The weight of the second individual is missing in the CSV file. If we dont specify the missing value (NA = not available) marker, we will not be able to do statistical analysis.

Creating from arrays: Apandas.DataFramecan also be seen as a dictionary of 1D series, eg arrays or lists. If we have 3numpyarrays:

We can expose them as apandas.DataFrame:

Other inputs:pandascan input data from SQL, excel files, or other formats. See thepandas documentation.

datais apandas.DataFrame, that resembles Rs dataframe:

Index([uUnnamed: 0, uGender, uFSIQ, uVIQ, uPIQ, uWeight, uHeight, uMRI_Count], dtype=object)

For a quick view on a large dataframe, use itsdescribemethod:pandas.DataFrame.describe().

groupby: splitting a dataframe on values of categorical variables:

groupby_genderis a powerful object that exposes many operations on the resulting group of dataframes:

Unnamed: 0 FSIQ VIQ PIQ Weight Height MRI_Count

Female 19.65 111.9 109.45 110.45 137.200000 65.765000 862654.6

Male 21.35 115.0 115.25 111.60 166.444444 71.431579 954855.4

Use tab-completion ongroupby_genderto find more. Other common grouping functions are median, count (useful for checking to see the amount of missing values in different subsets) or sum. Groupby evaluation is lazy, no work is done until an aggregation function is applied.

What is the mean value for VIQ for the full population?

How many males/females were included in this study?

Hintuse tab completion to find out the methods that can be called, instead of mean in the above example.

What is the average value of MRI counts expressed in log units, for males and females?

groupby_gender.boxplotis used for the plots above (seethis example).

Pandas comes with some plotting tools (pandas.tools.plotting, using matplotlib behind the scene) to display statistics of the data in dataframes:

The IQ metrics are bimodal, as if there are 2 sub-populations.

Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations correspond to gender?

For simplestatistical tests, we will use thescipy.statssub-module ofscipy:

Scipy is a vast library. For a quick summary to the whole library, see thescipychapter.

scipy.stats.ttest_1samp()tests if the population mean of data is likely to be equal to a given value (technically if observations are drawn from a Gaussian distributions of given population mean). It returns theT statistic, and thep-value(see the functions help):

With a p-value of 10^-28 we can claim that the population mean for the IQ (VIQ measure) is not 0.

We have seen above that the mean VIQ in the male and female populations were different. To test if this is significant, we do a 2-sample t-test withscipy.stats.ttest_ind():

PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FISQ and PIQ are significantly different. We can use a 2 sample test:

The problem with this approach is that it forgets that there are links between observations: FSIQ and PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is confounding, and can be removed, using a paired test, orrepeated measures test:

This is equivalent to a 1-sample test on the difference:

T-tests assume Gaussian errors. We can use aWilcoxon signed-rank test, that relaxes this assumption:

The corresponding test in the non paired case is theMannWhitney U testscipy.stats.mannwhitneyu().

Test the difference between weights in males and females.

Use non parametric statistics to test the difference between VIQ in males and females.

Conclusion: we find that the data does not support the hypothesis that males and females have different VIQ.

Given two set of observations,xandy, we want to test the hypothesis thatyis a linear function ofx. In other terms:

whereeis observation noise. We will use thestatsmodelsmodule to:

First, we generate simulated data according to the model:

Then we specify an OLS model and fit it:

We can inspect the various statistics derived from the fit:

Method: Least Squares F-statistic: 74.03

Date: … Prob (F-statistic): 8.56e-08

coef std err t Pt [95.0% Conf. Int.]

——————————————…

Intercept -5.5335 1.036 -5.342 0.000 -7.710 -3.357

x 2.9369 0.341 8.604 0.000 2.220 3.654

Omnibus: 0.100 Durbin-Watson: 2.956

Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Statsmodels uses a statistical terminology: theyvariable in statsmodels is called endogenous while thexvariable is called exogenous. This is discussed in more detailhere.

To simplify,y(endogenous) is the value you are trying to predict, whilex(exogenous) represents the features you are using to make the prediction.

Retrieve the estimated parameters from the model above.Hint: use tab-completion to find the relevent attribute.

Categorical variables: comparing groups or multiple categories¶

Let us go back the data on brain size:

< p>We can write a comparison between IQ of male and female using a linear model:

Dep. Variable: VIQ R-squared: 0.015

Method: Least Squares F-statistic: 0.5969

Date: … Prob (F-statistic): 0.445

coef std err t Pt [95.0% Conf. Int.]

———————————————————————–…

Intercept 109.4500 5.308 20.619 0.000 98.704 120.196

Gender[T.Male] 5.8000 7.507 0.773 0.445 -9.397 20.997

Omnibus: 26.188 Durbin-Watson: 1.709

Prob(Omnibus): 0.000 Jarque-Bera (JB): 3.703

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Forcing categorical: the Gender is automatically detected as a categorical variable, and thus each of its different values are treated as different entities.

An integer column can be forced to be treated as categorical using:

Intercept: We can remove the intercept using- 1in the formula, or force the use of an intercept using+ 1.

By default, statsmodels treats a categorical variable with K possible values as K-1 dummy boolean variables (the last level being absorbed into the intercept term). This is almost always a good default choice – however, it is possible to specify different encodings for categorical variables (

Link to t-tests between different FSIQ and PIQ

To compare different types of IQ, we need to create a long-form table, listing IQs, where the type of IQ is indicated by a categorical variable:

coef std err t Pt [95.0% Conf. Int.]

——————————————…

Intercept 113.4500 3.683 30.807 0.000 106.119 120.781

type[T.piq] -2.4250 5.208 -0.466 0.643 -12.793 7.943

We can see that we retrieve the same values for t-test and corresponding p-values for the effect of the type of iq than the previous t-test:

Consider a linear model explaining a variablez(the dependent variable) with 2 variablesxandy:

Such a model can be seen in 3D as fitting a plane to a cloud of (x,y,z) points.

Example: the iris data(examples/iris.csv)

Sepal and petal size tend to be related: bigger flowers are bigger! But is there in addition a systematic effect of species?

sepal_width ~ name + petal_length

Dep. Variable: sepal_width R-squared: 0.478

Method: Least Squares F-statistic: 44.63

Date: … Prob (F-statistic): 1.58e-20

coef std err t Pt [95.0% Conf. Int.]

——————————————…

Intercept 2.9813 0.099 29.989 0.000 2.785 3.178

name[T.versicolor] -1.4821 0.181 -8.190 0.000 -1.840 -1.124

name[T.virginica] -1.6635 0.256 -6.502 0.000 -2.169 -1.158

petal_length 0.2983 0.061 4.920 0.000 0.178 0.418

Omnibus: 2.868 Durbin-Watson: 1.753

Prob(Omnibus): 0.238 Jarque-Bera (JB): 2.885

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In the above iris example, we wish to test if the petal length is different between versicolor and virginica, after removing the effect of sepal width. This can be formulated as testing the difference between the coefficient associated to versicolor and virginica in the linear model estimated above (it is an Analysis of Variance,ANOVA). For this, we write avector of contraston the parameters estimated: we want to testname[T.versicolor]-name[T.virginica], with anF-test:

Going back to the brain size + IQ data, test if the VIQ of male and female are different after removing the effect of brain size, height and weight.

Seaborncombines simple statistical fits with plotting on pandas dataframes.

Let us consider a data giving wages and many other personal information on 500 individuals (Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley).

The full code loading and plotting of the wages data is found incorresponding example.

We can easily have an intuition on the interactions between continuous variables usingseaborn.pairplot()to display a scatter matrix:

Categorical variables can be plotted as the hue:

Look and feel and matplotlib settings

Seaborn changes the default of matplotlib figures to achieve a more modern, excel-like look. It does that upon import. You can reset the default using:

To switch back to seaborn settings, or understand better styling in seaborn, see therelevent section of the seaborn documentation.

A regression capturing the relation between one variable and another, eg wage and eduction, can be plotted usingseaborn.lmplot():

Given that, in the above plot, there seems to be a couple of data points that are outside of the main cloud to the right, they might be outliers, not representative of the population, but driving the regression.

To compute a regression that is less sentive to outliers, one must use arobust model. This is done in seaborn usingrobust=Truein the plotting functions, or in statsmodels by replacing the use of the OLS by a Robust Linear Model,statsmodels.formula.api.rlm().

Do wages increase more with education for males than females?

The plot above is made of two different fits. We need to formulate a single model that tests for a variance of slope across the to population. This is done via aninteraction.

wage ~ education + gender + education * gender

coef std err t Pt [95.0% Conf. Int.]

——————————————————————————

Intercept 0.2998 0.072 4.173 0.000 0.159 0.441

gender[T.male] 0.2750 0.093 2.972 0.003 0.093 0.457

education 0.0415 0.005 7.647 0.000 0.031 0.052

education:gender[T.male] -0.0134 0.007 -1.919 0.056 -0.027 0.000

Can we conclude that education benefits males more than females?

Hypothesis testing and p-value give you the

(with categorical variables) enable you to express rich links in your data

your data and simple model fits matters!

(adding factors that can explain all or part of the variation) is important modeling aspect that changes the interpretation.

Code examples for the statistics chapter.

Plotting simple quantities of a pandas dataframe

Analysis of Iris petal and sepal sizes

Test for an education/gender interaction in wages

Visualizing factors influencing wages

3.1.1. Data representation and interaction

Creating dataframes: reading data files or converting arrays

3.1.2. Hypothesis testing: comparing two groups

3.1.2.1. Students t-test: the simplest statistical test

1-sample t-test: testing the value of a population mean

2-sample t-test: testing for difference across populations

3.1.2.2. Paired tests: repeated measurements on the same indivuals

3.1.3. Linear models, multiple factors, and analysis of variance

3.1.3.1. formulas to specify statistical models in Python

Categorical variables: comparing groups or multiple categories

3.1.3.2. Multiple Regression: including multiple factors

3.1.3.3. Post-hoc hypothesis testing: analysis of variance (ANOVA)

3.1.4. More visualization: seaborn for statistical exploration

3.1.4.1. Pairplot: scatter matrices

3.1.4.2. lmplot: plotting a univariate regression

3.1.7. Solutions to this chapters exercises