Brainstorm on HPC

Hello,

TL;DR: I need to run MEM on a lot of files in a short amount of time. To do that, I am trying to run it on our local cluster, but I am having a hard time getting it to work.

Long story:

I am trying to launch some computations on our cluster, and I need access to some Brainstorm functions (but not the database).
I exported the data from the database into a .mat file and created a script ('run_MEM') that calls the appropriate MEM functions.

I am able to run my script using:

 matlab -nodisplay -nosplash -nodesktop -r "run_MEM('08-Apr-2024/in/data_sim_240404_1672.mat','08-Apr-2024/in/wMEM_options.json')"

where data_sim_240404_1672.mat contains the data coming from Brainstorm, and wMEM_options.json contains the options of the MEM process.

However, to use our HPC, I need to use something like the following (see https://perform-wiki.concordia.ca/mediawiki/index.php/SGE_/_Batch-queuing_system):

 echo matlab -nojvm -nodisplay -nosplash -nodesktop -r \"run ./m.m\" | qsub -j y -o logs/mat.txt -V -cwd -q matlab.q -N matlabJob

I tried the following:
echo matlab -nodisplay -nosplash -nodesktop -r "/NAS/home/edelaire//Documents/Project/wMEM-fnirs/run_MEM\('08-Apr-2024/in/data_sim_240404_1672.mat','08-Apr-2024/in/wMEM_options.json'\)" | qsub -j y -o logs/mat.txt -V -cwd -q matlab.q -N matlabJob -cwd -S /bin/bash

Log:

/opt/sge/default/spool/perf-hpc07/job_scripts/180873: line 1: 3481723 Aborted                 (core dumped) /util/packages/matlab/R2021b/bin/matlab -nodisplay -nosplash -nodesktop -r /NAS/home/edelaire//Documents/Project/wMEM-fnirs/run_MEM\('08-Apr-2024/in/data_sim_240404_1672.mat','08-Apr-2024/in/wMEM_options.json'\)
Opening log file:  /NAS/home/edelaire/java.log.30554
pure virtual method called
terminate called without an active exception

--------------------------------------------------------------------------------
               abort() detected at Tue Apr 09 15:00:32 2024 -0400
--------------------------------------------------------------------------------

Configuration:
  Crash Decoding           : Disabled - No sandbox or build area path
  Crash Mode               : continue (default)
  Default Encoding         : UTF-8
  GNU C Library            : 2.31 stable
  MATLAB Architecture      : glnxa64
  MATLAB Root              : /util/packages/matlab/R2019b
  MATLAB Version           : 9.7.0.1190202 (R2019b)
  Operating System         : Ubuntu 20.04.6 LTS
  Process ID               : 3045210
  Processor ID             : x86 Family 6 Model 63 Stepping 2, GenuineIntel
  Session Key              : 85124635-f246-4fe8-96a0-4093e7b0a629
  Static TLS mitigation    : Disabled: Unnecessary
  Window System            : No active display

Fault Count: 1


Abnormal termination:
abort()

Register State (from fault):
  RAX = 0000000000000000  RBX = 00001551ebfff700
  RCX = 000015521000c00b  RDX = 0000000000000000
  RSP = 00001551ebff9620  RBP = 00001551ebffa1a0
  RSI = 00001551ebff9620  RDI = 0000000000000002

   R8 = 0000000000000000   R9 = 00001551ebff9620
  R10 = 0000000000000008  R11 = 0000000000000246
  R12 = 00001551ebff98f0  R13 = 00001551ebffa830
  R14 = 00001551ebffa9e0  R15 = 00001551eda34800

  RIP = 000015521000c00b  EFL = 0000000000000246

   CS = 0033   FS = 0000   GS = 0000

Stack Trace (from fault):
[  0] 0x000015521000c00b                    /lib/x86_64-linux-gnu/libc.so.6+00274443 gsignal+00000203
[  1] 0x000015520ffeb859                    /lib/x86_64-linux-gnu/libc.so.6+00141401 abort+00000299
[  2] 0x00001551f2e2c965 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11549029
[  3] 0x00001551f2e2ab86 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11541382
[  4] 0x00001551f2e2abd1 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11541457
[  5] 0x00001551f2e20f1f /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11501343
[  6] 0x00001551f2c48a40 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09566784
[  7] 0x00001551f2dfa728 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11343656
[  8] 0x00001551f2dfc28f /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+11350671
[  9] 0x00001551f2c41995 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09537941 JVM_handle_linux_signal+00000421
[ 10] 0x00001551f2c34858 /util/packages/matlab/R2019b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09484376
[ 11] 0x0000155210767420              /lib/x86_64-linux-gnu/libpthread.so.0+00082976
[ 12] 0x00001551eccd3250                                   <unknown-module>+00000000
[ 13] 0x00001552104dd5ee /util/packages/matlab/R2019b/bin/glnxa64/../../sys/os/glnxa64/libstdc++.so.6+01095150 _ZNSo5flushEv+00000030
[ 14] 0x0000000000000000                                   <unknown-module>+00000000

** This crash report has been saved to disk as /NAS/home/edelaire/matlab_crash_dump.3045210-1 **



MATLAB is exiting because of fatal error
/opt/sge/default/spool/perf-hpc02/job_scripts/180874: line 1: 3045210 Killed                  matlab -nodisplay -nosplash -nodesktop -r /NAS/home/edelaire//Documents/Project/wMEM-fnirs/run_MEM\('08-Apr-2024/in/data_sim_240404_1672.mat','08-Apr-2024/in/wMEM_options.json'\)

Any suggestions on how I should proceed?
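
One thing that may be worth checking in the attempt above, independently of the crash itself, is quoting: `echo ... | qsub` sends the command through two shells, and the quotes around the MATLAB expression do not survive the trip. A minimal sketch, using a hypothetical `show_r_arg` function as a stand-in for `matlab` and shortened, made-up file names:

```shell
#!/bin/bash
# Hypothetical stand-in for matlab: prints the argument that follows -r.
show_r_arg() { printf '%s\n' "$2"; }

# First shell: the double quotes are consumed here, before echo prints the line.
job_line=$(echo show_r_arg -r "run_MEM\('in/data.mat','in/options.json'\)")
echo "Text piped to qsub: $job_line"

# Second shell (the one running the job): backslashes and single quotes are consumed.
received=$(eval "$job_line")
echo "Expression after -r: $received"
```

If the same thing happens on the cluster, MATLAB would see the file names as bare identifiers rather than strings. Putting the command in a script file, as in the solution below, avoids the double parsing.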

Solution:

I created a bash script that contains the following code:

#!/bin/bash 

module load matlab/R2019b

cd ~/Documents/Project/wMEM-fnirs
matlab -nodisplay -nosplash -nodesktop -r "run_MEM('08-Apr-2024/in/data_sim_240404_1672.mat','08-Apr-2024/in/wMEM_options.json')"

and it is started like this:

qsub -j y -o log.txt -pe smp 12 -S /bin/bash  -cwd -q matlab.q -N FS ./start_hpc.sh

I guess I'll now make the data path and the options file parameters of that script, and I should be able to easily launch qsub on all the data I need to localize :slight_smile:
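
A minimal sketch of that idea, assuming `start_hpc.sh` is changed to take the data file and the options file as its two arguments (for instance by ending with `matlab -nodisplay -nosplash -nodesktop -r "run_MEM('$1','$2')"`). The helper names `build_matlab_cmd` and `print_submissions` and the per-file log names are made up for illustration. The loop only prints the `qsub` commands, so they can be checked before anything is submitted:

```shell
#!/bin/bash
# Hypothetical helper: build the expression passed to "matlab -r" for one file.
build_matlab_cmd() {
    printf "run_MEM('%s','%s')" "$1" "$2"
}

# Hypothetical helper: print (not run) one qsub command per .mat file in a folder.
# The qsub flags are copied from the command above; only the log name changes.
print_submissions() {
    local in_dir="$1" options_file="$2" data_file job_name
    for data_file in "$in_dir"/*.mat; do
        [ -e "$data_file" ] || continue   # skip if the folder has no .mat file
        job_name=$(basename "$data_file" .mat)
        echo "qsub -j y -o log_${job_name}.txt -pe smp 12 -S /bin/bash -cwd -q matlab.q -N FS ./start_hpc.sh $data_file $options_file"
    done
}

# Example: list the submissions for every file of one batch.
print_submissions "08-Apr-2024/in" "08-Apr-2024/in/wMEM_options.json"
```

Once the printed commands look right, piping the output to `bash` would submit them (this assumes file names without spaces).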

NOTE: in run_MEM, it seems important to specify the number of workers when opening the parpool; otherwise, it hangs forever. In my case, I use: parpool(12);
More info here: https://forum.bic.mni.mcgill.ca/t/how-to-use-sge-with-matlab/1140
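
Rather than hard-coding 12 in two places, the worker count could be taken from the scheduler. A minimal sketch, assuming the job is submitted with `-pe smp N` (SGE then sets `NSLOTS` inside the job) and that `run_MEM` reads a hypothetical `MEM_NUM_WORKERS` environment variable with `getenv`. Neither the variable name nor that change to `run_MEM` is part of the setup described here:

```shell
#!/bin/bash
# Hypothetical helper: number of workers to request from parpool.
# NSLOTS is set by SGE for parallel-environment jobs; fall back to 12 elsewhere.
resolve_workers() {
    echo "${NSLOTS:-12}"
}

# MEM_NUM_WORKERS is a made-up name; run_MEM would need to read it, e.g.
# parpool(str2double(getenv('MEM_NUM_WORKERS')));
export MEM_NUM_WORKERS="$(resolve_workers)"
echo "Requesting $MEM_NUM_WORKERS workers"
```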

Edit: edited for clarity :slight_smile:
Edouard

Thank you for sharing your experience with Brainstorm and HPC

No problem.
Unfortunately, it doesn't work: I can launch jobs on the HPC, but parpool doesn't start (or at least, it only starts once every 30 runs), which makes the HPC basically useless, as running MEM on my laptop is faster...
Here is the error, in case you have any idea:

Starting parallel pool (parpool) using the 'Processes' profile ...

Error using parpool (line 146)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'Processes' in the Cluster Profile Manager.

Error in run_MEM (line 36)
    parpool('Processes', 12);

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
        The interactive communicating job failed with no message.

But this is an issue with our server, not with Brainstorm. I tried to contact our IT, but unfortunately a dead fish would have been more useful. I'll try to launch on Compute Canada instead :slight_smile:

One lesson from the Compute Canada website is that we should change the JobStorageLocation path to avoid conflicts between different MATLAB instances:


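            local_cluster = parcluster('Processes');

            % Give each MATLAB instance its own JobStorageLocation
            % (on Compute Canada, this folder could live under $SLURM_TMPDIR).
            % tempname is used instead of rand because MATLAB starts every
            % session with the same random seed, so rand could repeat the name.
            slurm_tmp_dir = tempname('~/Documents/Project/wMEM-fnirs/slurm');
            if ~exist(slurm_tmp_dir, 'dir')
                mkdir(slurm_tmp_dir);
            end
            local_cluster.JobStorageLocation = slurm_tmp_dir;

            parpool(local_cluster, 16);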
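
The same idea can also be handled on the shell side, before MATLAB starts. A minimal sketch, assuming a hypothetical `MEM_JOB_STORAGE` environment variable that the MATLAB code would read with `getenv` and assign to `JobStorageLocation`. `SLURM_TMPDIR` is the per-job scratch folder that Compute Canada clusters are expected to set inside a job, and the fallback to `/tmp` is an assumption for machines where it is not set:

```shell
#!/bin/bash
# Hypothetical helper: create a fresh, uniquely named folder for this job.
make_job_storage() {
    local base="${SLURM_TMPDIR:-${TMPDIR:-/tmp}}"
    mktemp -d "$base/matlab_jobs.XXXXXXXXXX"
}

# MEM_JOB_STORAGE is a made-up name; MATLAB would pick it up with
# local_cluster.JobStorageLocation = getenv('MEM_JOB_STORAGE');
export MEM_JOB_STORAGE="$(make_job_storage)"
echo "Job storage: $MEM_JOB_STORAGE"
```

Because `mktemp -d` creates the folder itself and guarantees a unique name, two jobs starting at the same moment should not end up sharing a storage location.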