
Parallel peakpicking and/or concatenating XcmsExperiment objects #781

pallevillesen opened this issue Nov 28, 2024 · 10 comments

@pallevillesen

I can see in other, older discussions that @jorainer mentioned the new HDF5 approach.

A while back, Johan Lassen made his own edited version of xcms that can run findChromPeaks() in parallel, save the results and then combine them before continuing (https://github.com/JohanLassen/xcms_workflow?tab=readme-ov-file).

@jorainer - can we read more about the HDF5 approach?
Also the parallel peakpicking?

Finally: can we concatenate XcmsExperiment objects?

We do have access to an HPC cluster where storage is not a problem, so being able to run findChromPeaks() on individual mzML files and then combine the results later would save us loads of time.
So any suggestions for how to do this with the new version of XCMS (without changing your code) would be highly appreciated.

The original issue that ended up with Johan writing his "hacked" xcms is #548.
But that solution is for an earlier version of XCMS, where we could concatenate XCMSnExp objects using a few tricks.

@pallevillesen pallevillesen changed the title Parallel peakpicking and/or concatenating MSExperiment (or building it from peakpicked Spectra) Parallel peakpicking and/or concatenating XcmsExperiment objects Nov 28, 2024
@sneumann
Owner

Hi, the Galaxy W4M people do parallel peak picking and then merge into one xcmsSet. Indeed a similar approach would be cool for XcmsExperiment as well. If needed I could give you pointers to the W4M tools/wrappers. Yours, Steffen

@pallevillesen
Author

That would be nice - please share and I will have a look (but it may be very similar to what we already did for xcms v3, where we used xcmsSet and did parallel peak picking and merging).

@jorainer
Collaborator

The HDF5-based approach is currently being developed in the large_scale branch. What I learned from our large data sets is the following:

  • Parallel peak detection is fine and very efficient.
  • The bottleneck is merging the results, i.e. the chromPeaks matrix. That requires time and a very large amount of memory (for large experiments).

We also once took the approach of running peak detection separately first and then merging - but, for the reasons above, it's not efficient.

What we do now in the large_scale branch is the following:

  • All results are stored in a (large) HDF5 file, instead of being kept in memory.
  • Peak detection thus does not return the chromPeaks matrix, but stores it in the HDF5 file.
  • Any operation/processing requires the data to be read first from the HDF5 file, but, to reduce memory demand, we load only the part of the data we actually need.

So, with that on-disk data representation we essentially just reduce memory demand, but don't necessarily improve the processing speed (unless the slowdown was because of memory issues/swapping).

Also note that with the XcmsExperiment- (and the future XcmsExperimentHdf5-) based approach we switch to chunk-wise processing, i.e. we always load data from only a certain number of data files (a chunk) at a time into memory and then perform operations on that data in parallel. This is slightly different to what we've done with XCMSnExp.
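Roughly, that chunk-wise processing looks like this (just a sketch - the file paths and sampleData columns are placeholders you would adapt to your own data):

```r
library(xcms)
library(MsExperiment)
library(BiocParallel)

## Placeholder list of mzML files and per-sample metadata (one row per
## file; batch and injection index are just example columns).
fls <- list.files("mzml", pattern = "\\.mzML$", full.names = TRUE)
smp <- data.frame(sample_name = basename(fls),
                  batch = "A",
                  injection_index = seq_along(fls))

## One object for the whole experiment; the MS data itself stays on disk
## (MsBackendMzR backend) and is only read when needed.
mse <- readMsExperiment(spectraFiles = fls, sampleData = smp)

## Chunk-wise peak detection: load `chunkSize` files at a time into
## memory and run centWave on them in parallel.
xse <- findChromPeaks(mse, param = CentWaveParam(ppm = 20),
                      chunkSize = 4, BPPARAM = SnowParam(4))
```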

We will also provide a large-scale/parallel processing tutorial at some point - it would be great to then also get your thoughts and feedback on that.

@jorainer
Collaborator

Side note: while I see the point of running the peak detection separately per file, I don't like the idea. IMHO it's key to first have a representation of the whole experiment in one object/variable, which should include all experimental and analytical metadata (batches, injection index, sample information...) to do e.g. some first quality assessment and filtering on that.

This data representation should be efficient and lean. Then I want to be able to run the peak detection (in parallel) on the full object - and there should be no drawback in performance when doing this on the full data compared to running it separately for the files. If there is a performance difference, I think we need to fix the code instead of having users come up with workarounds.

So, I would be extremely happy if I could let you guys do some alpha/beta tests once the code is a bit more stable.

@pallevillesen
Author

@jorainer Sounds awesome. I do agree with both your comments; for our cluster, with 5000+ samples, it may be a little more efficient to parallelize peak picking into 5000 jobs - because our largest compute node only has 192 cores.

But I completely agree with the benefit of keeping stuff together - and if we can get large jobs done in less than 100 hours or so it will be fine.

If you already have a few lines on how you use the large_scale branch, I will be happy to test and help with a tutorial.

@jorainer
Collaborator

May I ask if your nodes have their own file systems or if they all access the same, shared file system? In terms of nodes we have a similar setup, but for us all data is stored on a central, shared, network-accessed data storage. While our genomics people run 1000s of jobs in parallel, I usually don't, because for peak picking we need to load the data from the raw files - having hundreds of processes accessing the same file system at the same time is not efficient with our setup. Also, at least one (the main) process needs a large amount of memory available to be able to collect and join the results from the individual processes into one huge matrix.

@pallevillesen
Author

They have both: an internal scratch disk and access to our shared disk, which is optimized for large data files.
The shared disk is so fast that we generally never use the scratch disks.
I can easily read 1000 different large files in 1000 different jobs without problems (but having 1000+ jobs reading from the same file can cause problems). It can deliver ~40 gigabytes per second (but is shared among jobs).
It's the genomeDK cluster (http://genome.au.dk) and faststorage is BeeGFS.

I'm currently testing and optimizing. Do you have any guidelines for the settings of cores and chunks?

If I e.g. use 30 cores, should chunkSize = 30 be used? (Loads of memory, but the highest degree of parallelization, if I understand correctly.)

And I can see that we will hit a memory problem at some stage and have to switch to on-disk (HDF5) ... or alternatively figure out how to concatenate XcmsExperiment objects (and then peak-pick single files and merge the results before continuing with xcms).

@jorainer
Collaborator

jorainer commented Dec 3, 2024

Ah, lucky you then with your disk setup :)

To use 30 cores, yes, chunkSize = 30 combined with MulticoreParam(30) should be used. The chunkSize defines the number of files/samples that are loaded into memory at a time. This chunk of data is then subjected to parallel processing using the BiocParallel package (i.e. bplapply, bpmapply...), which will use the defined parallel processing setup (either passed through BPPARAM or registered globally with register()). Note also that you can use chunkSize = -1 for findChromPeaks() - then the MS data import will also be done in parallel.
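A quick sketch of that setup (assuming an MsExperiment object `mse` created as above):

```r
library(BiocParallel)

## Register 30 workers globally; all parallel steps will use them.
register(MulticoreParam(30))

## Load 30 files per chunk, then process each chunk in parallel.
xse <- findChromPeaks(mse, param = CentWaveParam(), chunkSize = 30)

## Or, with chunkSize = -1, the MS data import is also done in parallel.
xse <- findChromPeaks(mse, param = CentWaveParam(), chunkSize = -1)
```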

Analysis steps such as peak picking, peak refinement and gap filling will benefit the most from parallel processing (as they are computationally intense). Others, such as extracting EICs (using chromatogram()), benefit less from parallel processing - there it's mostly the reading of MS data from the original data files that slows things down.

Ah, and maybe another tip: by default, using the MsBackendMzR backend with Spectra, we always load the MS data on-the-fly from the mzML or mzXML files. So, please don't zip or tar.gz them. I had reports of users complaining about poor performance, but they had zipped their mzML files, so any operation first needed to unzip the file to extract the data.

An alternative that I'm currently using for our data is to store the MS data from an experiment in an SQL database using our MsBackendSql package. In our case it has a bit higher performance than individual mzML files. At the moment I'm using SQLite as the database backend - MsBackendSql would also be optimized for MariaDB, but I did not see any advantage - maybe a not-well-configured MariaDB server...
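Roughly (a sketch - please check the MsBackendSql vignette for the exact arguments; `fls` is again a placeholder for the mzML file names):

```r
library(MsBackendSql)
library(RSQLite)
library(Spectra)

## Write the MS data of all mzML files of the experiment into one
## SQLite database.
con <- dbConnect(SQLite(), "ms_experiment.sqlite")
createMsBackendSqlDatabase(con, fls)

## A Spectra object backed by that database instead of the mzML files.
sps <- Spectra(con, source = MsBackendSql())
```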

@pallevillesen
Author

Thanks for the reply. Got it down to ~3 hours for 350 samples using 36 cores, with diminishing returns when using more RAM + cores.
I didn't know about chunkSize = -1 for findChromPeaks() - will try it asap. I can also set chunkSize for adjustRtime() and fillChromPeaks() - I guess it should be 30 there then(?)

We're not zipping mzML files - but if someone comes up with a new standard format that is a little more compact, it would be nice :) And the SQL option sounds nice - but mzML files are a lot simpler for most people to handle and transfer to the cluster.

@jorainer
Collaborator

jorainer commented Dec 3, 2024

Yes, for any function other than findChromPeaks(), chunkSize should be 30 (if that is the number of data files you want to process in parallel...).
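For example (a sketch continuing from the objects above; parameter values are placeholders):

```r
## Alignment and gap filling chunk-wise, 30 files at a time.
xse <- adjustRtime(xse, param = ObiwarpParam(binSize = 0.6), chunkSize = 30)

## Correspondence runs in memory (no chunkSize); uses whatever grouping
## column is available in the sample data - here the example "batch".
xse <- groupChromPeaks(
    xse, param = PeakDensityParam(sampleGroups = sampleData(xse)$batch))

xse <- fillChromPeaks(xse, param = ChromPeakAreaParam(), chunkSize = 30)
```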
