Scripting a pipeline : reproducibility issue

as_dub · February 2, 2023, 4:59pm

Hi!

I am working on scripting an analysis which was initially performed manually from the Brainstorm GUI. I am working from the database that was created by a user from the GUI.

I am facing a reproducibility issue. The main first steps of the analysis are :

Import
Epoch
Baseline removal
Resample

Below is the history of one of the "initial" files (from the database that I was given to reproduce)

and the history of the "regenerated" data

To quickly compare the two datasets, I computed an average across trials for each. When plotting them on the same graph with distinct colors (red, blue), I see slight differences.
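A numeric comparison can complement the visual overlay and tell rounding-level noise apart from a real processing difference. The following is only a minimal sketch: `avgInitial` and `avgRegen` are hypothetical variable names, filled here with synthetic values so the snippet runs on its own. In practice they would hold the [channels x time] matrices of the two averaged files.

```matlab
% Synthetic stand-ins for the two trial averages, [nChannels x nTime].
% Both variable names and all values are illustrative assumptions.
nChannels = 4;
nTime     = 100;
t = linspace(-0.1, 0.5, nTime);
avgInitial = 1e-12 * repmat(sin(2*pi*10*t), nChannels, 1);
avgRegen   = avgInitial + 1e-15 * repmat(cos(2*pi*3*t), nChannels, 1);

% Largest absolute difference over all channels and time points
diffAbs    = abs(avgInitial - avgRegen);
maxAbsDiff = max(diffAbs(:));

% Same difference expressed relative to the peak amplitude of the reference
maxAmp  = max(abs(avgInitial(:)));
relDiff = maxAbsDiff / maxAmp;

fprintf('Max absolute difference: %g\n', maxAbsDiff);
fprintf('Relative to peak amplitude: %g\n', relDiff);
```

A relative difference close to floating-point precision would point to rounding effects, while a larger one suggests that a processing step actually differs between the two pipelines.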

From the history, I suspect that the version of Brainstorm used for the first analysis is not the same as the one used for the second. Does this explain the differences in the averages?

Thanks!
AnneSo

Francois · February 6, 2023, 9:10am

The differences in history still exist with today's version of Brainstorm.
If you take the "Remove baseline" entry:

The first one is coming from the GUI, and the baseline removal is handled directly by the function in_data.m:
https://github.com/brainstorm-tools/brainstorm3/blob/master/toolbox/io/in_data.m#L407
The second one was performed from a script, i.e. from process_baseline.m:
https://github.com/brainstorm-tools/brainstorm3/blob/master/toolbox/process/functions/process_baseline.m#L64

This however indicates that the processing script was not generated in the most efficient way. When generating a pipeline that successively does import + DC correction + resample, the script generator optimizes the pipeline and adds the option entries resample and baseline to the call to the import process.
The epochs are then fully processed and then saved on the drive, instead of being saved by the import process, then read and saved by process_baseline, then read and saved by process_resample.
And you fall back exactly on case #1.
Case #2 is coming from a script that should be fixed.

Example:

https://github.com/brainstorm-tools/brainstorm3/blob/master/toolbox/script/tutorial_introduction.m#L402

    bst_process('CallProcess', 'process_snapshot', [sFilesRun1, sFilesRun2], [], ...
        'target',  2, ...  % SSP projectors
        'Comment', 'SSP projectors');

    %% ===== TUTORIAL #15: IMPORT EVENTS =================================================
    %  ===================================================================================
    disp([10 'DEMO> Tutorial #15: Import events' 10]);
    % Process: Import MEG/EEG: Events (Run01)
    sFilesEpochs1 = bst_process('CallProcess', 'process_import_data_event', sFilesRun1, [], ...
        'subjectname', SubjectName, ...
        'condition',   '', ...
        'eventname',   'standard, deviant', ...
        'timewindow',  [], ...
        'epochtime',   [-0.100, 0.500], ...
        'createcond',  0, ...
        'ignoreshort', 1, ...
        'usectfcomp',  1, ...
        'usessp',      1, ...
        'freq',        [], ...

The differences you observe in the two graphs could be due to many things. In the first place, the import times for the epoch are not the same. There is a 1 sample shift between the two.
If you can reproduce the same difference between the two options with: a) the same version of Brainstorm and b) the fixed epoch time, there is probably something that should be fixed somewhere.
Let me know how to reproduce and I'll investigate.
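A one-sample shift like the one mentioned above can be checked directly on the time vectors of two epochs before comparing amplitudes. This is a minimal sketch with synthetic time vectors: the 600 Hz sampling rate and the names `timeA` and `timeB` are assumptions for illustration, not values taken from this dataset.

```matlab
% Hypothetical time vectors of one epoch from each pipeline (in seconds).
% The sampling rate and sample indices are illustrative assumptions.
fsAssumed = 600;
timeA = (-60:300) / fsAssumed;   % starts at -0.1 s
timeB = (-59:301) / fsAssumed;   % same length, starts one sample later

% Estimate the sampling rate from the first time vector
fs = 1 / (timeA(2) - timeA(1));

% Offset between the two epochs, expressed in samples
shiftSamples = round((timeB(1) - timeA(1)) * fs);

if numel(timeA) ~= numel(timeB)
    fprintf('Epochs differ in length: %d vs %d samples\n', numel(timeA), numel(timeB));
elseif shiftSamples ~= 0
    fprintf('Epochs are shifted by %d sample(s)\n', shiftSamples);
else
    fprintf('Time vectors are aligned\n');
end
```

If the vectors are shifted, averages computed from the two sets of epochs are not aligned in time, so a point-by-point comparison will show differences even when the processing is otherwise identical.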

as_dub · February 8, 2023, 3:19pm

Hi Francois,

Thank you for your response. I have now fixed the script so that it computes import + DC correction + resample all at once (by adding the option entries resample and baseline to the import call). Thank you for suggesting the improvement.
The history now shows :

[Screenshot 2023-02-08 at 16.17.00: file history]
The one sample shift does not exist anymore.

I still see slight differences.

Unfortunately I do not have access to the version with which the dataset was first computed. So I guess I will have to live with this.

Thank you for your help!

Francois · February 8, 2023, 4:56pm

If you know the date, maybe from the protocol folder creation, you can pull any past version from GitHub.
On GitHub, navigate to an older commit, then download the corresponding source tree (green button on the GitHub website).

From the difference you observe here, I guess there were maybe differences in the filters or in the ICA.
Note that there is a random initialization in runica, and the result is not deterministic, unless you initialize the Matlab random generator to a fixed seed at the beginning of your script.
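Fixing the seed can be sketched as follows. `rng` is the standard MATLAB function for this; the two `rand` calls only stand in for any algorithm that draws from MATLAB's global random stream, on the assumption that the ICA implementation does so. The seed value 0 is arbitrary.

```matlab
% Set the global random generator to a fixed seed at the top of the script.
rng(0, 'twister');
firstRun = rand(1, 3);    % stand-in for a stochastic step such as ICA initialization

% Resetting the same seed before a second run reproduces the same draws.
rng(0, 'twister');
secondRun = rand(1, 3);

if isequal(firstRun, secondRun)
    disp('Both runs produced identical random numbers');
else
    disp('Runs differ');
end
```

This only makes runs of the new script reproducible with each other. It cannot recover the random state that was used when the original database was computed from the GUI.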