Partial sequence files with Biopython

This is another blog post to highlight one of the neat tricks you’ll be able to do with Biopython 1.54 (which you can help test with the Biopython 1.54 beta release).

It is often useful to be able to extract a few records from a larger sequence file – for example, some sequences of interest from a full UniProt or GenBank dump. One obvious way to try to do this is to parse the file into an object representation (i.e. SeqRecord objects using Bio.SeqIO.parse(...)), filter to pick out the entries you want, and then write them back to disk (using Bio.SeqIO.write(...)). However, for complex file formats like GenBank this can be lossy (Bio.SeqIO does not support a 100% identical round trip), and Biopython don’t currently support writing out the SwissProt plain text format used by UniProt. So, that approach won’t work.

Biopython 1.52 introduced a new indexing function, Bio.SeqIO.index(...), which allows a large multi-sequence file to be treated like a Python dictionary – parsing requested records on request. This has been enhanced for Biopython 1.54 with a method get_raw(...) to extract the raw for a record as a string.

How is this useful? Well, take your large (UniProt) file, index it, then extract the records you want and write them to your output file unmodified:

from Bio import SeqIO
uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
handle = open("selected.dat", "w")
for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:

Another neat use of this functionality would be to sort entries in a sequential file format, and there is an example of that in the Biopython Tutorial (PDF).

As a matter of fact, every writer at is:

  • A remarkable pro in a particular academic field (and even more than one);
  • A native English speaker with peerless grammar and proper stylistics;
  • <li>Obliged to meet all your provided requirements and referential sources;</li>
    <li>Providing only original and RELEVANT content;</li>
    <li>Instructed to follow your provisions dealing with formatting and the Writing Standard</li>
    <li>Obliged to edit and proofread the piece he/she has written;</li>
    <li>Required to keep in touch with you and inform about the writing progress so far.</li></ul>
    This entry was posted in Biopython, Blogroll, Code, Development, HOWTO, OBF, OBF Projects and tagged . Bookmark the permalink.