Processes slowing way down...?

Hi,

I have rather a lot of conditions and trials and have scripted a couple of functions to do event detection and averaging (relevant portions below). I have autosave commented out because I thought the bottleneck was reloading the protocol file, which is really huge. That seemed to help with some operations, but the time these operations take on a single subject's data has been creeping up from about 24 hours to 5 days. I checked that our server's memory load is not too high and that it's not swapping, so the problem seems to be in Matlab/Brainstorm.

The "Running process: Detect bad channels: peak-to-peak" windows sometimes go pretty fast and sometimes seem frozen. There are no reported errors. Is there some trick I'm missing? Should I be clearing Matlab's memory or something before runs?

Best,

E


% Process: Detect bad trials: Peak-to-peak MEGGRAD(0-1000)
sFiles = bst_process('CallProcess', 'process_detectbad', epochnames, [], ...
    'timewindow', [], ...
    'meggrad',    [0, 1000], ...
    'megmag',     [0, 0], ...
    'eeg',        [0, 0], ...
    'eog',        [0, 0], ...
    'ecg',        [0, 0], ...
    'rejectmode', 2);  % Reject the entire trial

% Process: Average: By trial group (folder average)
sFiles = bst_process('CallProcess', 'process_average', sFiles, [], ...
    'avgtype',    5, ...  % By trial group (folder average)
    'avg_func',   1, ...  % Arithmetic average: mean(x)
    'weighted',   0, ...
    'keepevents', 0);

I should admit that I'm running several Matlab/Brainstorm instances in parallel, which I know you shouldn't really do, but I can't wait 6 months for this processing step :S and it does seem to be making the averages correctly, just very slowly!

Hi Emily,

To identify what's taking time in your scripts, you can use the Matlab profiler.
Before starting the execution, type "profile on" (or add it at the top of your script).
After it finishes, or at the end of your script, enter "profile viewer". You'll then be able to navigate the report by execution time (look for the functions with the highest "self time" and for the red lines in the code).
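
In a script, it would look something like this (the middle comment is just a placeholder for whatever step you want to time):

profile on        % start collecting timing information

% ... run the slow step here, e.g. your process_detectbad call ...

profile viewer    % open the interactive report (sort by "self time")
profile off       % stop collecting when you are done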

Running multiple Matlab/Brainstorm instances in parallel does not always make things faster. Most Matlab functions are designed to use all the cores available on your computer (open a resource monitor showing the activity of each CPU separately to check).

If you never reach a memory usage higher than 90% or 95% of your memory, it's not a memory problem and clearing variables from memory won't help. If your resource monitor can show the volume of disk access, check it for the processes that take a long time. If there is a lot of reading from the hard drive (which is probably the case for a process that just detects maxima), running multiple tasks in parallel may slow things down significantly: hard drives work faster when they have only one task to deal with.

Good luck
Francois

Hi Francois,

Thanks for the advice. Memory doesn't seem to be the problem, and this process seems to run on one core at a time. I ran a threshold-based artifact rejection and averaging step on a single condition of 450 short epochs (350 ms) with the profiler on, and it took about 20 minutes. The offending step seems to be line 3540 in the function findFileInStudies, which is

% Find target in this list
iItem = find(file_compare(filesList(iValidFiles), fieldFile));

I believe what is happening is that each time it looks for a particular file, it has to compare it against all of the files in my database (I have many!). It made 154170 calls to this function (see screen cap) while working on the 450 epochs.

A more knowledgeable friend asks whether this might be a good case for a hash table, which would scale much more readily to larger databases and keep performance at a usable level.
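
Just to illustrate the idea (a toy comparison with made-up file names, not actual Brainstorm code): building a containers.Map once turns each subsequent lookup into a near-constant-time operation instead of scanning the whole list on every call.

% Toy example with made-up file names
filesList = arrayfun(@(k) sprintf('data_trial%06d.mat', k), 1:100000, ...
    'UniformOutput', false);

% Current behaviour: every query scans the whole list
iItem = find(strcmp(filesList, 'data_trial099999.mat'));

% Hash-table idea: build the map once, then each of the ~150000 lookups is cheap
fileMap = containers.Map(filesList, num2cell(1:numel(filesList)));
iItem   = fileMap('data_trial099999.mat');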

Is this a modification you might consider in the near future? If that's not feasible, what would you suggest for processing this dataset - maybe a custom function to read in the files directly, take the abs(max) of all channels, mark those above a threshold, average, and then reattach the finished file to the db? Or should I make lots of mini databases with smaller numbers of files? (That seems like a pain.)

Emily

Ouch… Indeed, this is the price we pay now for not having a real database engine behind this “brainstorm database”… Maybe I can try to reduce a bit the number of times this function is called in this context, or suggest another way for you to work.

But I’d need some input for identifying the bottlenecks efficiently. Can you please send me screen captures showing the full summary of the profiler, sorted by “self time”?

What version of Brainstorm are you running? On line 3540 of bst_get.m, there is a comment now…

I’m currently running Version 3.4 (26-Jan-2018).

It's possible to export the profiler results to html (profsave). Maybe that's an easier way for you to explore than screen captures? I'll zip it and email it to you; if that doesn't work well, let me know and I'll make more screen captures.
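
For reference, the export is just the standard Matlab call, something like:

profsave(profile('info'), 'profiler_results')   % writes an html report to ./profiler_results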

E

I updated Brainstorm and reran a process, sent you the profiler output.

Hello Brainstorm users, here's a temporary solution, in case it's useful for individual users:

There were two speed-related sticking points in my analysis, both related to my high number of individual trials. The first is that when the db_save function is called, if the database structure listing all the trials is over 2GB, it gets saved in a much slower format. As it saves frequently, I often have to wait (~20 minutes in my case) for this to complete after a 10 second processing step. The solution suggested by Francois (documented elsewhere) is to save a backup copy of the db_save function, put the command return; on the first line after the help comments of the regular one, and then call the backup copy from time to time, when you have time to wait for a long save.
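
In practice, the neutered db_save.m is just a stub like this (db_save_full.m is the name I gave the renamed backup copy of the original, which I run by hand when I can afford the long save; the argument list is shown as varargin only for the sketch):

function db_save(varargin)
% DB_SAVE: Temporarily disabled to skip the very slow automatic saves.
% The original function lives on as db_save_full.m and is called manually
% when there is time to wait for a full save of the huge database.
return;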

The second problem was that a simple threshold-based artifact rejection step was taking the age of the universe because of all the repeated indexing of the trials. My solution was to perform these steps outside of Brainstorm, as follows (see the sketch after the list):

  • obtain the list of my trials by listing the contents of each condition folder of interest inside the database folder, and use them to build an sFiles structure in Matlab
  • read in each epoch and perform whatever calculation is of interest (in my case, marking when the absolute value of the MEG channels - only the good ones, of course - crosses a threshold value), and use another vector to mark which trials should be kept
  • use the "keep trials" vector to reduce the sFiles structure so that only the good trials remain
  • call the appropriate Brainstorm function for the next steps on the reduced sFiles array - in my case just an averaging function - this makes sure the average gets put back in its appropriate folder and the Brainstorm DB is not disrupted
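
Roughly, it looks like the sketch below. The protocol path, the Subject01/MyCondition folder, the file-name pattern and the threshold are all placeholders for my own setup, and F / ChannelFlag are the field names I see inside my trial .mat files, so double-check everything against your own database. (Passing the relative file names directly to bst_process worked for me in place of a full sFiles structure.)

% 1) List the trial files of one condition folder inside the protocol's data folder
protocolDir = '/path/to/brainstorm_db/MyProtocol/data';   % placeholder
condPath    = 'Subject01/MyCondition';                     % placeholder
trialList   = dir(fullfile(protocolDir, condPath, 'data_*trial*.mat'));

threshold = 1000;     % rejection threshold, in the units of the data on disk
keepTrial = true(1, numel(trialList));

% 2) Read each epoch and flag the ones whose good channels cross the threshold
for iTrial = 1:numel(trialList)
    DataMat = load(fullfile(protocolDir, condPath, trialList(iTrial).name), ...
        'F', 'ChannelFlag');
    iGood = (DataMat.ChannelFlag == 1);   % good channels only (restrict further to MEG using the channel file)
    if max(max(abs(DataMat.F(iGood, :)))) > threshold
        keepTrial(iTrial) = false;
    end
end

% 3) Keep only the good trials, as file names relative to the data folder
goodFiles = cellfun(@(f) [condPath '/' f], {trialList(keepTrial).name}, ...
    'UniformOutput', false);

% 4) Hand the reduced list back to Brainstorm for the averaging step
sFiles = bst_process('CallProcess', 'process_average', goodFiles, [], ...
    'avgtype',    5, ...   % By trial group (folder average)
    'avg_func',   1, ...   % Arithmetic average: mean(x)
    'weighted',   0, ...
    'keepevents', 0);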

This sounds complicated but is actually really fast, particularly considering that I was previously completely unable to run this step at all.

(My next step will be to detach all of my individual trials and only work on the averages so that my analysis will be manageable!).

Hope that helps someone.

Emily

Hi Emily,
We’ll work on this soon… Sorry it takes so long…
Thanks for the suggestion
Francois