Illumina FASTQ files – Read Segment Quality Control Indicator

In another quirk to the FASTQ story, recent Illumina FASTQ files don’t actually use the full range of PHRED scores – and a score of 2 has a special meaning, The Read Segment Quality Control Indicator (RSQCI, encoded as ‘B’).

Hats off to Dr Torsten Seemann for raising awareness of this issue in his post on the seqanswers.com forum, referring to a presentation by Tobias Mann of Illumina which says:

The Read Segment Quality Control Indicator:

  • At the ends of some reads, quality scores are unreliable. Illumina has an algorithm for identifying these unreliable runs of quality scores, and we use a special indicator to flag these portions of reads
  • A quality score of 2, encoded as a “B”, is used as a special indicator. A quality score of 2 does not imply a specific error rate, but rather implies that the marked region of the read should not be used for downstream analysis.
  • Some reads will end with a run of B (or Q2) basecalls, but there will never  be an isolated Q2 basecall.

So, armed with this knowledge, you might want to apply a simple trimming criteria to any Illumina FASTQ files – remove anything after and including a PHRED quality score of 2 (encoded as ASCII ‘B’).

We could do this with the rich object orientated SeqRecord based API in Biopython, but when dealing with massive FASTQ files this overhead matters. Instead we’ll stick with plain Python strings:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
handle = open("B_trimmed.fastq", "w")
min_length = 10
for title, seq, qual in FastqGeneralIterator(open("untrimmed.fastq")) :
    #Find the location of the first "B" (PHRED quality 2)
    trim = qual.find("B")
    if trim == -1:
        #No need to trim                                                                                                     
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    elif trim >= min_length:
        #Take everything up to the first B                                                                                   
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq[:trim], qual[:trim]))
handle.close()

In practice the above can trim too much – you can still get isolated “B” characters in the middle of a read, where it is just a low quality score. Instead, we can trim any trailing “B” characters – which will do the same thing on RSQCI based FASTQ files where the “B” should only appear at the end:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
handle = open("B_trimmed.fastq", "w")
min_length = 10
for title, seq, qual in FastqGeneralIterator(open("untrimmed.fastq")) :
    qual = qual.rstrip("B") #Remove any trailing B characters
    length = len(qual)
    if length >= min_length:
        seq = seq[:length] #trim to match
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()

You could easily modify this example to read from stdin and write to stdout (see this cookbook example), or take filenames as command line arguments to turn this into a general purpose “FASTQ B-trimming script”.

This entry was posted in Biopython, Blogroll, Code, Community, Development, Documentation, HOWTO, OBF, OBF Projects and tagged , . Bookmark the permalink.