Tutorial 26: Statistics

Authors: Francois Tadel, Elizabeth Bock, Dimitrios Pantazis, Richard Leahy, Sylvain Baillet

In this auditory oddball experiment, we would like to test for significant differences between the brain responses to the deviant and the standard beeps, time sample by time sample. Until now, we have been computing measures of brain activity in the time or time-frequency domains. We were able to see clear effects or slight tendencies, but these observations were always dependent on an arbitrary amplitude threshold and the configuration of the colormap. With appropriate statistical tests, we can go beyond these empirical observations and assess the significant effects in a more formal way.

Random variables

In most cases we are interested in comparing the brain signals recorded for two populations or two experimental conditions A and B.

A and B are two random variables for which we have a limited number of repeated measures: multiple trials in the case of a single-subject study, or multiple subject averages in the case of a group analysis. To start with, we will consider each time sample and each signal (source or sensor) to be independent: a random variable represents the possible measures for one sensor/source at one specific time point.

A random variable can be described with its probability distribution: a function that indicates the probability of obtaining each possible measure when running the experiment. By repeating the same experiment many times, we can approximate this function with a discrete histogram of the observed measures.

[Figures: stat_distribution.gif (probability distribution), stat_histogram.gif (histogram of observed measures)]


You can plot histograms like these in Brainstorm; they may help you understand what to expect from the statistics functions described in the rest of this tutorial. For instance, seeing a histogram computed with only 4 values would discourage you forever from running a group analysis with 4 subjects...
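To get a feeling for this outside of Brainstorm, here is a minimal Matlab sketch (illustration only, using simulated data) that approximates a normal distribution with histograms computed from an increasing number of observations:

% Approximate a normal distribution with histograms of simulated measures
mu = 0; sigma = 1;
sizes = [4, 40, 4000];                       % number of repeated measures
for i = 1:numel(sizes)
    x = mu + sigma * randn(sizes(i), 1);     % simulated observations
    subplot(1, 3, i);
    histogram(x, 'Normalization', 'pdf');    % empirical density estimate
    title(sprintf('n = %d', sizes(i)));
end

With n=4, the histogram barely resembles the underlying distribution, which illustrates the warning above.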


Sources (relative)

Sources (absolute)


Group studies and central limit theorem

Statistical inference

Hypothesis testing

To show that there is a difference between A and B, we can use a statistical hypothesis test. We start by assuming that the two sets are identical (the null hypothesis), then try to reject this assumption. For all the tests we will use here, the logic is similar:

Evaluation of a test

The quality of a test can be evaluated based on two criteria:

Different categories of tests

Two families of tests can be helpful in our case: parametric and nonparametric tests.

Parametric Student's t-test


The Student's t-test is a widely used parametric test to evaluate the difference between the means of two random variables (two-sample test), or between the mean of one variable and a known value (one-sample test). If the assumptions hold, the t-statistic follows a Student's t-distribution.

The main assumption for using a t-test is that the random variables involved follow a normal distribution (mean: μ, standard deviation: σ). The figure below shows a few examples of normal distributions.


Depending on the type of data we are testing, we can use different variants of this test: one-sample or two-sample, paired or independent, with equal or unequal variances.
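For reference, the classical two-sample equal-variance statistic is computed as follows (standard textbook definition, with pooled standard deviation $$s_p$$):

$$ t = \frac{\overline{A} - \overline{B}}{s_p\sqrt{\frac{1}{n_A} + \frac{1}{n_B}}}, \qquad s_p = \sqrt{\frac{(n_A-1)s_A^2 + (n_B-1)s_B^2}{n_A + n_B - 2}}, \qquad df = n_A + n_B - 2 $$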


Once the t-value is computed (tobs in the previous section), we can convert it to a p-value based on the known distribution of the t-statistic. This conversion depends on two factors: the number of degrees of freedom and the number of tails of the distribution we want to consider. For a two-tailed t-test, the two following Matlab commands are equivalent and convert t-values into p-values.

p = betainc(df./(df + t.^2), df/2, 0.5);    % Without the Statistics toolbox
p = 2*(1 - tcdf(abs(t),df));                % With the Statistics toolbox
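As an illustration, here is the full computation for a two-sample equal-variance test on simulated data (the values are random, only the procedure matters):

% Two-sample equal-variance t-test on simulated data (illustration only)
A = randn(40, 1);                  % 40 trials for condition A
B = randn(40, 1) + 0.5;            % 40 trials for condition B (shifted mean)
nA = numel(A);  nB = numel(B);
sp = sqrt(((nA-1)*var(A) + (nB-1)*var(B)) / (nA + nB - 2));   % pooled std
t  = (mean(A) - mean(B)) / (sp * sqrt(1/nA + 1/nB));          % t-statistic
df = nA + nB - 2;                                             % degrees of freedom
p  = betainc(df ./ (df + t.^2), df/2, 0.5);                   % two-tailed p-value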

The Student's t-distribution for different numbers of degrees of freedom:

Example 1: Parametric t-test on recordings

Parametric t-tests require the tested values to follow a normal distribution. The recordings evaluated in the histogram sections above (MLP57/160ms) show distributions that match this assumption relatively well: the histograms follow the traces of the corresponding normal functions, and the two conditions have very similar variances. It looks reasonable to use a parametric t-test in this case.

Correction for multiple comparisons

Multiple comparison problem

The approach described in this first example performs many tests simultaneously. We test, independently, each MEG sensor and each time sample across the trials, so we run a total of 274*361 = 98914 t-tests.

If we select a significance level of 0.05 (p<0.05), it means that we want to see what is significantly different between the conditions while accepting the risk of observing a false positive in 5% of the cases. If we run the test around 100,000 times, we can expect to observe around 5,000 false positives. We need better control over the number of false positives (type I errors) when dealing with multiple tests.
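You can convince yourself of this with a quick simulation: under the null hypothesis, p-values are uniformly distributed between 0 and 1, so about 5% of them fall below 0.05 by chance alone.

% Simulated multiple comparisons problem: H0 is true for all tests
Ntest = 98914;
pval = rand(Ntest, 1);               % under H0, p-values are uniform on [0,1]
nFalsePositives = sum(pval < 0.05)   % expected: ~0.05 * Ntest, i.e. ~4950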

Bonferroni correction

The probability of observing at least one false positive, also called the familywise error rate (FWER), is almost 1:
FWER = 1 - prob(no significant results) = 1 - (1 - 0.05)^100000 ~ 1

A classical way to control the familywise error rate is to replace the p-value threshold with a corrected value, to enforce the expected FWER. The Bonferroni correction sets the significance cut-off at $$\alpha$$/Ntest. If we set (p ≤ $$\alpha$$/Ntest), then we have (FWER ≤ $$\alpha$$). Following the previous example:
FWER = 1 - prob(no significant results) = 1 - (1 - 0.05/100000)^100000 ~ 0.0488 < 0.05
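These two numbers are easy to verify in Matlab:

% Familywise error rate, without and with the Bonferroni correction
Ntest = 100000;
fwer_uncorrected = 1 - (1 - 0.05)^Ntest          % ~1
fwer_bonferroni  = 1 - (1 - 0.05/Ntest)^Ntest    % ~0.0488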

This works well in a context where all the tests are strictly independent. However, in the case of MEG/EEG recordings, the tests have an important level of dependence: two adjacent sensors or time samples often have similar values. In the case of highly correlated tests, the Bonferroni correction tends to be overly conservative, leading to a high rate of false negatives.

FDR correction

The false discovery rate (FDR) is another way of representing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. It is designed to control the expected proportion of false positives among the significant results, while the Bonferroni correction controls the probability of having at least one false positive. FDR-controlling procedures have greater power, at the cost of an increased number of type I errors. (Wikipedia)

In Brainstorm, we implement the Benjamini–Hochberg step-up procedure (1995):
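For illustration, here is a minimal Matlab sketch of this step-up procedure (not the exact Brainstorm implementation; variable names are hypothetical):

% Benjamini-Hochberg step-up procedure (sketch)
alpha = 0.05;                % desired false discovery rate
pval = rand(1000, 1);        % replace with your uncorrected p-values
p = sort(pval(:));           % sort the p-values in ascending order
N = numel(p);
k = find(p <= (1:N)' ./ N .* alpha, 1, 'last');  % largest k with p(k) <= (k/N)*alpha
if isempty(k)
    pThreshold = 0;          % no test survives the correction
else
    pThreshold = p(k);       % corrected significance threshold
end
h = (pval <= pThreshold);    % significance mask, in the original order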

Note that there are different implementations of FDR. FieldTrip uses the Benjamini–Yekutieli (2001) algorithm as described in (Genovese, 2002), which usually gives fewer significant results. Don't be surprised if you get empty displays when using the option "FDR" in the FieldTrip processes, while you get significant results with the Brainstorm FDR correction.

In the interface


You can interactively select the type of correction to apply for multiple comparisons while reviewing your statistical results. The checkboxes "Control over dims" let you select which dimensions you consider as multiple comparisons (1=sensor, 2=time, 3=not applicable here). If you select all the dimensions, all the values available in the file are considered as repetitions of the same test, and only one corrected p-threshold is computed for all the time samples and all the sensors.

If you select only the first dimension, only the values recorded at the same time sample are considered as repeated tests; the different time points are corrected independently, with a different corrected p-threshold for each time point.

When changing these options, a message is displayed in the Matlab command window, showing the number of repeated tests considered and the corrected p-value threshold (or the average of the corrected p-thresholds, when not all the dimensions are selected):

BST> Average corrected p-threshold: 5.0549e-07  (Bonferroni, Ntests=98914)
BST> Average corrected p-threshold: 0.00440939  (FDR, Ntests=98914)

"It doesn't work"

If nothing appears significant after correction, don't start by blaming the method ("FDR doesn't work"). More likely, there is no clear difference between your sample sets, or your sample size is simply too small. For instance, with fewer than 10 subjects you cannot expect to observe highly significant effects in your data.

If you have good reasons to think your observations are meaningful but cannot increase the sample size, consider reducing the number of multiple comparisons you perform (test only the average over a short time window, a few sensors or a region of interest) or using a cluster-based approach. When using permutation tests, increasing the number of random permutations also lowers the smallest p-value that can be reached, and therefore the p-values of the most significant effects.

Nonparametric permutation tests


A permutation test (or randomization test) is a type of test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. (Wikipedia)

If the null hypothesis is true, the two sets of tested values A and B follow the same distribution. Therefore the values are exchangeable between the two sets: we can move one value from set A to set B, and one value from B to A, and we expect the test statistic T to follow the same distribution.

By taking all the possible permutations between sets A and B and computing the statistic for each of them, we can build a histogram that approximates the permutation distribution.

Then we compare the observed statistic with the permutation distribution. From the histogram, we calculate the proportion of permutations that resulted in a larger test statistic than the observed one. This proportion is called the p-value. If the p-value is smaller than the significance level (typically 0.05), we conclude that the data in the two experimental conditions are significantly different.

The number of possible permutations between the two sets of data is usually too large to compute an exhaustive permutation test in a reasonable amount of time. Instead, the permutation distribution of the statistic of interest is approximated with a Monte-Carlo approach: a relatively small number of randomly selected permutations can give us a reasonably good approximation of the distribution.

Permutation tests can be used for any test statistic, regardless of whether or not its distribution is known. The hypothesis is about the data itself, not about a specific parameter. In the examples below, we use the t-statistic, but we could use any other function.
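For illustration, a minimal Monte-Carlo permutation test in Matlab could look like this (a sketch of the general principle, on simulated data, not the Brainstorm implementation):

% Monte-Carlo permutation test on the difference of means (sketch)
A = randn(40, 1);                       % trials for condition A
B = randn(40, 1) + 0.5;                 % trials for condition B
nA = numel(A);
allValues = [A; B];
Tobs = mean(A) - mean(B);               % observed test statistic
Nperm = 1000;                           % number of random permutations
Tperm = zeros(Nperm, 1);
for i = 1:Nperm
    iRand = randperm(numel(allValues)); % random relabeling of the trials
    Tperm(i) = mean(allValues(iRand(1:nA))) - mean(allValues(iRand(nA+1:end)));
end
% Two-tailed p-value: proportion of permutations at least as extreme as Tobs
p = (sum(abs(Tperm) >= abs(Tobs)) + 1) / (Nperm + 1);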

Practical limitations

Computation time: If you increase the number of random permutations you use for estimating the distribution, the computation time will increase linearly. You need to find a good balance between the total computation time and the accuracy of the result.

Memory (RAM): The implementations of the permutation tests available in Brainstorm require all the data to be loaded in memory before starting the evaluation of the permutation statistics. Running this function on large datasets or on source data could quickly crash your computer. For example, loading the data for the nonparametric equivalent to the parametric t-test we ran previously would require:
276(sensors) * 461(trials) * 361(time) * 8(bytes) / 1024^3(Gb) = 0.3 Gb of memory.

This is acceptable on most recent computers. But to run the same test at the source level, you would need:
45000*461*361*8/1024^3 = 56 Gb of memory just to load the data. This is impossible on most computers; we have to give up at least one dimension and run the test for only one region of interest or one time sample (or average over a short time window).

Example 2: Permutation t-test

Let's run the nonparametric equivalent to the test we ran in the first example.

FieldTrip implementation

FieldTrip functions in Brainstorm

Some of the FieldTrip toolbox functions can be called from the Brainstorm environment. For this, you first need to install the FieldTrip toolbox on your computer and add it to your Matlab path.

Cluster-based correction

One interesting method that has been promoted by the FieldTrip developers is the cluster-based approach for nonparametric tests. In the type of data we manipulate in MEG/EEG analysis, the neighboring channels, time points or frequency bins are expected to show similar behavior. We can group these neighbors into clusters to "accumulate the evidence". A cluster can have a multi-dimensional extent in space/time/frequency.

In the context of a nonparametric test, the test statistic computed at each permutation is the extent of the largest cluster. To reject or accept the null hypothesis, we compare the largest observed cluster with the randomization distribution of the largest clusters.

This approach solves the multiple comparisons problem, because the test statistic that is used (the maximum cluster size) is computed using all the values at the same time, along all the dimensions (time, sensors, frequency). There is only one test that is performed, so there is no multiple comparisons problem.

The result is simpler to report but also a lot less informative than FDR-corrected nonparametric tests. We have only one null hypothesis, "the two sets of data follow the same probability distribution", and the outcome of the test is to accept or reject this hypothesis. Therefore we cannot report the spatial or temporal extent of the most significant clusters. Make sure you read this recommendation before reporting cluster-based results in your publications.
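To make the logic concrete, here is a simplified sketch of a cluster-based permutation test over time, for a single channel and simulated data (real implementations, such as FieldTrip's, additionally handle spatial adjacency, separate positive and negative clusters, and many more options):

% Simplified cluster-based permutation test over time (one channel, sketch)
nTime = 100;
A = randn(40, nTime);                                  % trials for condition A
B = randn(40, nTime);  B(:,40:60) = B(:,40:60) + 0.8;  % simulated effect in B
nA = size(A,1);  nB = size(B,1);  df = nA + nB - 2;
tval = @(X,Y) (mean(X) - mean(Y)) ./ ...               % two-sample t, all samples at once
    (sqrt(((nA-1)*var(X) + (nB-1)*var(Y)) / df) * sqrt(1/nA + 1/nB));
tThresh = 2;                    % cluster-forming threshold (~p<0.05 two-tailed here)
data = [A; B];
Nperm = 500;
massPerm = zeros(Nperm, 1);
for i = 0:Nperm
    if i == 0
        t = tval(A, B);                                % observed data first
    else
        iRand = randperm(nA + nB);                     % random relabeling of trials
        t = tval(data(iRand(1:nA),:), data(iRand(nA+1:end),:));
    end
    % Cluster mass: largest sum of |t| over contiguous suprathreshold samples
    mass = 0;  best = 0;
    for j = 1:nTime
        if abs(t(j)) > tThresh, mass = mass + abs(t(j)); else, mass = 0; end
        best = max(best, mass);
    end
    if i == 0, massObs = best; else, massPerm(i) = best; end
end
p = (sum(massPerm >= massObs) + 1) / (Nperm + 1)       % one p-value for the whole window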

Reference documentation

For a complete description of nonparametric cluster-based statistics in FieldTrip, read the following:

Process options

There are three separate processes in Brainstorm to call the three FieldTrip functions.

Test statistic options (name of the options in the interface in bold):

Cluster-based correction:

Input options: All the data selection is done before, in the process code and functions out_fieldtrip_*.m.

Example 3: Cluster-based correction

Run the same test again, but this time select the cluster correction.

Example 4: Parametric test on sources

We can reproduce similar results at the source level. If you are using non-normalized and non-rectified current density maps, their distributions across trials should be normal, as illustrated earlier with the histograms. You can use a parametric t-test to compare the two conditions at the source level.

Scouts and statistics

Unconstrained sources

There are some additional constraints to take into consideration when computing statistics for source models with unconstrained orientations. See the corresponding section in the tutorial Workflows.

Directionality: Difference of absolute values

The test we just computed correctly detects the times and brain regions with significant differences between the two conditions, but the sign of the t-statistic is useless: it does not tell us whether the response is stronger or weaker for the deviant stimulation.

After identifying where and when the responses are different, we can go back to the source values and compute another measure that will give us the directionality of this difference:
abs(average(deviant_trials)) - abs(average(standard_trials))

Example 5: Parametric test on scouts

The previous example showed how to test for differences across the full source maps, and then to extract the significant scout activity. Another valid approach is to test directly for significant differences in specific regions of interest, after computing the scouts time series for each trial.

This alternative has many advantages: it can be a lot faster and more memory-efficient, and it reduces the multiple comparisons problem. Indeed, when you perform fewer tests at each time point (the number of scouts instead of the number of sources), the FDR and Bonferroni corrections are less conservative and lead to higher corrected p-thresholds. On the other hand, it requires formulating stronger hypotheses: you need to define the regions in which you expect to observe differences, instead of screening the entire brain.


Convert statistic results to regular files

Apply statistic threshold

You can convert the results of a statistical test to a regular file. This can be useful because many menus and processes are not accessible for files with the "stat" tag displayed on top of their icon.

Simulate recordings from these scouts


Example 6: Nonparametric test on time-frequency maps

To run a test on time-frequency maps, we need all the time-frequency decompositions for each individual trial to be available in the database. In the time-frequency tutorial, we saved only the averaged time-frequency decompositions of all the trials.

Now delete the TF decompositions for the individual trials:


Export to SPM

An alternative to running the statistical tests in Brainstorm is to export all the data and compute the tests with an external program (R, Matlab, SPM, etc.). Multiple menus exist to export files to external file formats (right-click on a file > File > Export to file).

Two tutorials explain how to export data specifically to SPM:


On the hard drive

Right-click on the first test we computed in this tutorial > File > View file contents.

Description of the fields

Useful functions


Additional documentation


Forum discussions
