
Files required for steps 4-7 #588

Open · carlahurt opened this issue Jan 9, 2025 · 3 comments
@carlahurt

Hello! I am running a very large dataset (~200 individuals) through the denovo pipeline, and steps 3 and 6 are slow (like 40 days slow). We have been syncing data to a temp folder to speed things up; however, we are running out of space in the temp folder after step 3. Can you tell me which folders are needed to run steps 4-6? Are some of the folders unnecessary at that point? Any suggestions would be greatly appreciated.

Thank you,
Carla

@isaacovercast (Collaborator)

Hello Carla,
After step 3 completes you can safely remove the _fastqs and _edits directories; this should recover some space. 40 days is a pretty long runtime for step 3. Is this paired-end data? How long are the reads? Did you check sequence quality with FastQC for a couple of samples? What clustering threshold are you using?
-isaac
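
(For anyone hitting the same disk-space problem: a minimal sketch of that cleanup in Python, assuming ipyrad's default directory naming; the project path and assembly name below are placeholders, not values from this thread.)

```python
import shutil
from pathlib import Path

# Placeholder locations: substitute your own project_dir and assembly name.
project_dir = Path("~/crayfish_assembly").expanduser()
assembly_name = "crayfish"

# After step 3, the demultiplexed fastqs and the edited reads are no longer
# needed; steps 4-7 work from the clustering and later directories.
for suffix in ("_fastqs", "_edits"):
    target = project_dir / f"{assembly_name}{suffix}"
    if target.is_dir():
        print(f"removing {target}")
        shutil.rmtree(target)
```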

@carlahurt (Author)

Thank you! This is single-end data. I did not use FastQC for quality checking; I ran the fastqs through process_radtags (Stacks) for demultiplexing and filtering, so perhaps I should be more stringent. The reads are 130 bp and the clustering threshold is set to 0.95. I tried to paste a screenshot of my params file below. This is a crayfish dataset; I've had a difficult time getting sufficient SNPs, so I spent some time optimizing parameters with a subset of the data. There are ~450 files in the dataset, so that may take some time as well.

[Screenshot of params file]
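
(For reference, the settings described above map onto ipyrad's Python API roughly as follows; a minimal sketch with placeholder names, not a reconstruction of the params file in the screenshot.)

```python
import ipyrad as ip

# Placeholder assembly; names and paths are illustrative only.
data = ip.Assembly("crayfish")
data.params.project_dir = "./crayfish_assembly"
data.params.clust_threshold = 0.95  # the threshold mentioned above
data.params.datatype = "rad"        # assumption: single-end RAD data

# Steps are run by number, e.g. steps 4-7 once step 3 has finished:
# data.run("4567")
```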

@isaacovercast (Collaborator)

I would run a couple of the sample .fq.gz files through FastQC to look at sequence quality. If you have low-quality bases, you can use trim_reads during step 2 to trim off the low-quality parts. That will make things run faster, if quality is indeed the problem.
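
(A minimal sketch of that check, driving FastQC from Python; the sample file names are placeholders. If the per-base quality plots show poor 3' ends, ipyrad's trim_reads parameter can be set before re-running step 2; see the ipyrad docs for the exact tuple semantics.)

```python
import os
import subprocess

# Placeholder names: substitute a couple of your own demultiplexed files.
samples = ["sample_001.fq.gz", "sample_002.fq.gz"]
os.makedirs("fastqc_reports", exist_ok=True)

# FastQC writes one HTML report per input file; the per-base quality plot
# shows whether the read ends need trimming.
subprocess.run(["fastqc", "--outdir", "fastqc_reports", *samples], check=True)
```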
