Parallel peakpicking and/or concatenating XcmsExperiment objects #781
Hi, the Galaxy W4M people do parallel peak picking and then merge into one xcmsSet. Indeed a similar approach would be cool for XcmsExperiment as well. If needed I could give you pointers to the W4M tools/wrappers. Yours, Steffen
That would be nice - please share and I will have a look (but it may be very similar to what we already did for xcms v3, where we used xcmsSet, did parallel peak picking and merged).
The HDF5-based approach is currently being developed in the large_scale branch. What I learned from our large data sets is the following:
We also once had the approach of running peak detection separately first and then merging - but, for the reasons above, it's not efficient. What we do now in the large_scale branch is the following:
So, with that on-disk data representation we essentially just reduce the memory demand, but don't necessarily improve the processing speed (unless the slowdown was because of memory issues/swapping). We will also provide a large-scale/parallel processing tutorial at some point - it would be great to then also get your thoughts and feedback on that.
Side note: while I see the point of running the peak detection separately per file, I don't like the idea. IMHO it's key to first have a representation of the whole experiment in one object/variable, which should include all experimental and analytical metadata (batches, injection index, sample information...) to do e.g. some first quality assessment and filtering on that. This data representation should be efficient and lean. Then I want to be able to run the peak detection (in parallel) on the full object - and there should be no drawback in performance when doing this on the full data compared to running it separately for the files. If there is a performance difference I think we need to fix the code instead of having users come up with workarounds. So, I would be extremely happy if I could let you guys do some alpha/beta tests once the code is a bit more stable.
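For illustration, a minimal sketch of the workflow described above (the file names, sample metadata and CentWaveParam settings are placeholders, not taken from this thread): one object that represents the full experiment including its metadata, with peak detection then run in parallel on that object.

```r
library(xcms)
library(MsExperiment)
library(BiocParallel)

## Placeholder file names and sample metadata.
files <- c("QC1.mzML", "sampleA.mzML")
pd <- data.frame(sample_name = c("QC1", "sampleA"),
                 injection_index = c(1, 2),
                 batch = c("B1", "B1"))

## One MsExperiment holding the whole experiment, incl. sample metadata.
mse <- readMsExperiment(spectraFiles = files, sampleData = pd)

## Peak detection on the full object, parallelized over worker processes.
xdata <- findChromPeaks(mse, param = CentWaveParam(),
                        BPPARAM = MulticoreParam(workers = 2))
```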
@jorainer Sounds awesome. I do agree with both your comments; for our cluster it may be a little more efficient with 5000+ samples to parallelize peak picking into 5000 jobs, because our largest compute node only has 192 cores. But I completely agree with the benefit of keeping stuff together - and if we can get large jobs done in less than 100 hours or so it will be fine. If you already have a few lines on how you use the large_scale branch I will be happy to test and help on a tutorial.
May I ask if your nodes have their own file systems or if they all access the same, shared file system? In terms of nodes we have a similar setup, but for us all data is stored on a central, shared, network-accessed data storage. While our genomics people run 1000s of jobs in parallel I usually don't, because for peak picking we need to load the data from the raw files - thus, having hundreds of processes accessing the same file system at the same time is not efficient with our setup - and at least one (the main) process needs a large amount of memory available to be able to collect and join the results from the individual processes into one huge result object.
They have both: an internal scratch disk and access to our shared disk optimized for large data files. I'm currently testing and optimizing. Do you have any guidelines for the settings of cores and chunks? If I e.g. use 30 cores, should chunkSize = 30 be used? (Loads of memory, but the highest degree of parallelization, if I understand correctly.) And I can see that we will hit a memory problem at some stage and have to switch to on-disk (HDF5) storage... or alternatively figure out how to concatenate XcmsExperiment objects (and then peak pick single files and merge the results before continuing with xcms).
Ah, lucky you then for your disk setup :) To use 30 cores, yes. Analysis steps such as peak picking, peak refinement and gap filling will benefit the most from parallel processing (as they are computationally intense); others, such as extracting EICs, benefit less. An alternative that I'm currently using for our data is to store the MS data of an experiment in a SQL database using our MsBackendSql package. In our case it has a bit higher performance than individual mzML files. At the moment I'm using SQLite as the database backend.
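Not from the thread, but as a hedged sketch of how the chunk-wise parallel setup discussed above could look in current xcms: `mse` is assumed to be an MsExperiment loaded as in the earlier sketch, and 30 workers mirror the 30-core example.

```r
library(xcms)
library(BiocParallel)

## Use 30 worker processes for all BiocParallel-enabled processing steps.
register(MulticoreParam(workers = 30))

## With chunkSize = 30 the data of 30 files is loaded into memory at once and
## processed in parallel: highest degree of parallelization, but also the
## largest memory footprint. Smaller chunkSize values trade speed for memory.
xdata <- findChromPeaks(mse, param = CentWaveParam(), chunkSize = 30)
```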
Thanks for the reply. Got it down to ~3 hours for 350 samples using 36 cores, with diminishing returns when using more RAM + cores. We're not zipping mzML files - but if someone comes up with a new standard format that is a little more compact, it would be nice :) And the SQL approach sounds nice - but mzML files are a lot simpler for most people to handle and transfer to the cluster.
Yes, for any function except
I can see in other, older discussions that @jorainer mentioned the new HDF5 approach.
Earlier, Johan Lassen made his own edited version of xcms that can run findChromPeaks() in parallel, save the results and then combine them before continuing (https://github.com/JohanLassen/xcms_workflow?tab=readme-ov-file).
@jorainer - can we read more about the HDF5 approach?
Also about the parallel peak picking?
Finally: can we concatenate XcmsExperiment objects?
We do have access to an HPC system where storage is not a problem, so being able to run findChromPeaks() on individual mzML files and then combine the results later would save us loads of time.
So any suggestions for how to do this with the new version of XCMS (without changing your code) would be highly appreciated.
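For illustration only, a rough sketch (paths and peak detection parameters are hypothetical) of the per-file peak detection part of such a setup; how to then concatenate the resulting single-sample XcmsExperiment objects is exactly the open question of this issue.

```r
library(xcms)
library(MsExperiment)

## Assumed folder with the experiment's mzML files.
files <- dir("mzml", pattern = "\\.mzML$", full.names = TRUE)

## Run peak detection separately per file (e.g. one HPC job per file).
res <- lapply(files, function(f) {
    mse <- readMsExperiment(spectraFiles = f)
    findChromPeaks(mse, param = CentWaveParam(peakwidth = c(5, 30)))
})

## `res` is a list of single-sample XcmsExperiment objects; merging them back
## into one object is the missing piece discussed here.
```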
The original issue that ended up with Johan writing his "hacked" xcms: #548
But that solution is for an earlier version of XCMS, where we could concatenate XCMSnExp objects using a few tricks.