Fast FASTA/FASTQ Random Subsampling
This short tutorial shows how to subsample a paired FASTQ, single FASTQ, paired FASTA, or single FASTA file to a specific number of reads.
This can be done quickly with seqtk, which can be installed through Bioconda (for example, conda install -c bioconda seqtk).
1. Why Should I Subsample Paired FASTQ or FASTA Files?
There are several reasons why one may want to subsample paired FASTQ or FASTA files:
- Reducing computational time and memory requirements: By subsampling, you can reduce the size of your data, which can make your analysis faster and more efficient, especially for computationally intensive tasks like de novo assembly or variant calling.
- Validating methods and pipelines: Subsampling can be used to test and validate methods and pipelines on smaller datasets, which can be useful for optimizing parameters, testing reproducibility, or developing new methods.
- Sampling representative data: In some cases, you may want to subsample your data to obtain a representative sample that reflects the diversity of your original dataset. This can be useful for exploring the distribution of features of interest, such as GC content, read length, or sequence quality, or for creating a reference dataset for downstream analysis.
- Balancing read depth: In some cases, the original dataset may have uneven read depth across different regions or samples, and subsampling can be used to balance the read depth to achieve a more even representation.
Note that subsampling may introduce biases into your data, so consider the intended use carefully and design a subsampling strategy that minimizes them.
2. Randomly Subsample Paired FASTQ or FASTA
Using seqtk, we can quickly downsample a paired set of FASTQs. It is essential to set the same seed (-s 123) for both files of a pair: with an identical seed, seqtk selects the same record positions in R1 and R2, so the mates stay matched in the output.
In the example below, we subsample 100k reads from each file of a FASTQ pair.
# FASTQ R1
$ seqtk sample -s 123 read1.fq 100000 > sub_read1.fq
# FASTQ R2
$ seqtk sample -s 123 read2.fq 100000 > sub_read2.fq
The same commands work on paired FASTA files. seqtk also reads gzip-compressed input (for example, read1.fq.gz) directly; the output is written uncompressed, so pipe it through gzip if you want a compressed result.
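After subsampling a pair, it can be worth confirming that the two output files still list the same reads in the same order. The following is a minimal sketch, not part of seqtk: the helper names fastq_ids and pairs_in_sync are hypothetical, and the code assumes uncompressed FASTQ with exactly four lines per record and read names that differ between mates only by an optional trailing /1 or /2.

```shell
# Print the read ID of every record in a 4-line-per-record FASTQ file:
# the first whitespace-delimited token of the header, without the leading
# '@' and without a trailing /1 or /2 mate suffix.
fastq_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); sub(/\/[12]$/, "", $1); print $1 }' "$1"
}

# Succeed (exit status 0) only if both files list the same read IDs
# in the same order.
pairs_in_sync() {
    ids1=$(fastq_ids "$1") || return 1
    ids2=$(fastq_ids "$2") || return 1
    [ "$ids1" = "$ids2" ]
}

# Example usage, run only if the subsampled files from above exist.
if [ -f sub_read1.fq ] && [ -f sub_read2.fq ]; then
    if pairs_in_sync sub_read1.fq sub_read2.fq; then
        echo "pairs in sync"
    else
        echo "pairs out of sync"
    fi
fi
```

If the check fails, the usual causes are different seeds or input files that did not contain the same reads in the same order to begin with.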
3. Randomly Subsample a Single FASTQ or FASTA
As in the previous section, we subsample 100k reads, this time from a single FASTQ or FASTA file. The -s option can be left out here because there is no mate file to keep in sync.
# single FASTA
$ seqtk sample sample.fasta 100000 > sub_sample.fasta
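A quick way to confirm the subsample has the expected size is to count its records. The sketch below uses only grep and awk; the function names count_fasta and count_fastq are hypothetical, and the FASTQ counter assumes uncompressed input with exactly four lines per record.

```shell
# Count FASTA records: every record starts with one '>' header line.
count_fasta() {
    grep -c '^>' "$1"
}

# Count FASTQ records, assuming exactly four lines per record.
count_fastq() {
    awk 'END { print NR / 4 }' "$1"
}

# Example usage, run only if the subsampled file from above exists.
if [ -f sub_sample.fasta ]; then
    count_fasta sub_sample.fasta
fi
```

If the input file holds fewer reads than requested, the count will simply equal the size of the input.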
4. More Resources
- Fast Conversion of Lowercase Sequences to Uppercase in FASTA Format
- Easy NCBI Genome Download
- The Fastest Way to Read a FASTA in Python