Managing messy, large DBs with branching analysis

Hi Brainstorm,

We work with high sampling resolutions and hundreds of thousands of small files. This creates problems when saving the database, which takes a very long time, as I have asked about in the past (a partial solution was to modify the save function so that it only saves manually and intermittently), but also when organizing our workflow when we have multiple research questions on the same dataset. We have experimented with making copies of the DB and deleting the parts that are not needed for each analysis, but this means we keep multiple copies of some of the data, and since our DBs are in the 4 TB range that is a problem.
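
For reference, our save workaround looks roughly like the sketch below. The names are my own shorthand, not the exact code we run; only db_save() is assumed to be the Brainstorm call that writes protocol.mat:

```matlab
function db_save_throttled(forceSave)
% Rough sketch of our workaround: only write the protocol metadata when
% explicitly requested, or at most once every few minutes, instead of
% after every file operation. Function and variable names are illustrative.
    persistent lastSaveTime
    if isempty(lastSaveTime)
        lastSaveTime = -Inf;
    end
    minInterval = 600;                          % seconds between automatic saves
    if nargin < 1
        forceSave = 0;
    end
    elapsedSec = (now - lastSaveTime) * 86400;  % datenum days -> seconds
    if forceSave || (elapsedSec > minInterval)
        db_save();                              % assumed: the call that writes protocol.mat
        lastSaveTime = now;
    end
end
```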

Ideally we would have something like versions of a database that link back to or share the common files, and only save/display the files relevant for a given analysis, but I think that is not possible. Do you have any suggestions for how to better manage this problem?

Best regards,

Emily

Hi Emily,

@Raymundo.Cassani is currently working on the future version of the database. The protocol.mat file that contains the DB metadata will be replaced with an SQLite database (protocol.db), which will not be constantly read from and written to the hard drive. Hopefully this will solve this issue.
This won't be ready immediately though: there are still a few months of work to be done on this.
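
To give a rough idea of why the SQLite approach helps (this is only a conceptual sketch, not the actual implementation, and the table/column names are hypothetical): with the metadata in protocol.db, only the entries needed at a given moment are queried, instead of loading and rewriting the whole protocol.mat. For example, using the MATLAB Database Toolbox sqlite interface:

```matlab
% Conceptual sketch only: query the few metadata rows needed for one study,
% rather than reading/writing the entire protocol structure.
conn  = sqlite('protocol.db');                 % open the SQLite metadata file
query = 'SELECT FileName, Comment FROM FunctionalFile WHERE Study = 42';
files = fetch(conn, query);                    % returns only the matching rows
close(conn);
disp(files)
```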

Your gigantic database would actually be a really good crash test for these new developments.
Maybe you could connect directly with @Raymundo.Cassani and @Sylvain to make sure your wish list is fully included in future releases.

Hi Francois,
That's great news. I'm always up for some Brainstorm beta testing. Will coordinate!
Emily