September 25, 2009 – 7:49 am
Biopython 1.51 onward includes support for Sanger, Solexa and Illumina 1.3+ FASTQ files in Bio.SeqIO, which lets you do a lot of neat tricks very concisely. For example, the tutorial (PDF) has examples of finding and removing primer or adaptor sequences.
However, because the Bio.SeqIO interface revolves around SeqRecord objects, there is often a speed penalty. For example, with FASTQ files the quality string gets turned into a list of integers on parsing, and is then re-encoded back to ASCII on writing.
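To make that overhead concrete, here is a rough pure-Python sketch of the round trip a quality string goes through. It is only an illustration, not Biopython's actual code: the helper names `decode_quality` and `encode_quality` are made up for this example, and it assumes Sanger-style encoding (ASCII offset 33).

```python
# Illustrative only: these helpers are hypothetical, not part of Biopython.
# Sanger FASTQ stores each PHRED score as a single character at ASCII
# offset 33 (Solexa and Illumina 1.3+ files use offset 64 instead).
SANGER_OFFSET = 33


def decode_quality(quality_string, offset=SANGER_OFFSET):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset=SANGER_OFFSET):
    """Turn a list of integer scores back into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


quality_string = "IIII9+!"
scores = decode_quality(quality_string)  # one integer per base
round_trip = encode_quality(scores)      # back to the original ASCII
print(scores)
print(round_trip == quality_string)
```

Doing this for every read just to write the same characters back out is wasted effort when all you want is a simple edit or a format change.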
The new Bio.SeqIO.convert(...) function in Biopython 1.52 onwards makes converting from FASTQ to FASTA, or between the FASTQ variants, about five times faster. It can do this because it doesn't bother creating any SeqRecord objects; it just works with plain Python strings.
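As a small sketch of what a call looks like, the snippet below converts a single made-up Solexa read to FASTA. It uses in-memory handles so it is self-contained; the read name and sequence are invented for the example, and in a real script you would pass filenames or open file handles instead.

```python
from io import StringIO

from Bio import SeqIO

# A single invented Solexa-style read (quality characters at offset 64).
solexa_fastq = "@read1\nACGTACGT\n+\nhhhhhhhh\n"

in_handle = StringIO(solexa_fastq)
out_handle = StringIO()

# convert() takes the input, its format, the output, and the output
# format, and returns the number of records converted.
count = SeqIO.convert(in_handle, "fastq-solexa", out_handle, "fasta")

print("Converted %i record(s)" % count)
print(out_handle.getvalue())
```

Swapping the output format for "fastq-sanger" or "fastq-illumina" should convert between the FASTQ variants in the same way.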
You can use the same approach in your own scripts. For example, suppose you have a Solexa FASTQ file where you want to trim all the reads, taking just the first 21 bases ...
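A minimal sketch of that kind of trimming, assuming you use FastqGeneralIterator from Bio.SeqIO.QualityIO, which yields plain (title, sequence, quality) string tuples rather than SeqRecord objects. The `trim_fastq` function and the example read are made up for illustration. Because the quality string is only sliced and never decoded, the same code works regardless of which FASTQ variant the file uses.

```python
from io import StringIO

from Bio.SeqIO.QualityIO import FastqGeneralIterator

TRIM_LENGTH = 21  # keep just the first 21 bases of each read


def trim_fastq(in_handle, out_handle, length=TRIM_LENGTH):
    """Write each read truncated to `length` bases; return the read count.

    Hypothetical helper for this example. The title, sequence and quality
    are handled purely as strings, so no quality scores are decoded.
    """
    count = 0
    for title, seq, qual in FastqGeneralIterator(in_handle):
        out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq[:length], qual[:length]))
        count += 1
    return count


# An invented 28 base read, used here instead of a real file.
example = "@read1\n" + "ACGT" * 7 + "\n+\n" + "h" * 28 + "\n"
out = StringIO()
trimmed = trim_fastq(StringIO(example), out)
print(out.getvalue())
```

For real data you would open the input and output files (for example `open("untrimmed.fastq")` and `open("trimmed.fastq", "w")`, both placeholder names) and pass those handles in place of the StringIO objects.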