new tab sequence format

Heikki writes:

Philip Lijnzaad has written a new sequence format module called ‘tab’.
It is in CVS. Here is the blurb he wrote:

The tab format puts each sequence on a single line, which makes it very
useful for large-scale work with the Unix command-line utilities (grep,
sort, awk, sed, split, you name it). Imagine that you have a format
converter ‘seqconvert’ along the following lines:

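use Bio::SeqIO;
use Getopt::Long;
GetOptions('from=s' => \my $from, 'to=s' => \my $to);  # -from FORMAT -to FORMAT
my $in  = Bio::SeqIO->newFh(-fh => *STDIN,  '-format' => $from);
my $out = Bio::SeqIO->newFh(-fh => *STDOUT, '-format' => $to);
print $out $_ while <$in>;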

then you can very easily filter duplicates out of a sequence file with:

$ seqconvert < foo.fa -from fasta -to tab | sort -u |
seqconvert -from tab -to fasta > foo-unique.fa
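
To see why this works, it helps to look at the intermediate stream. The
sketch below assumes the tab format is one record per line, consisting of
the ID, a tab character, and the sequence. The records are made up for
illustration, and the hypothetical helper tab_records stands in for the
first seqconvert call so that the snippet runs without BioPerl.

```shell
# Hypothetical records in the assumed "id<TAB>sequence" layout.
tab_records() {
    printf 'seq1\tACGTACGT\n'
    printf 'seq2\tTTGACCA\n'
    printf 'seq1\tACGTACGT\n'   # exact duplicate of the first record
    printf 'seq3\tTTGACCA\n'    # same sequence as seq2, different ID
}

# sort -u keeps one copy of each identical line (ID and sequence both match).
tab_records | sort -u

# To deduplicate on the sequence alone, compare only field 2.
tab_records | sort -u -t "$(printf '\t')" -k2,2
```

The first command leaves three records, because seq3 differs from seq2 in
its ID. The second leaves two, one per distinct sequence.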

Or grep [-v] for certain sequences with:

$ seqconvert < foo.fa -from fasta -to tab | grep -v '^S[a-z]*control' |
seqconvert -from tab -to fasta > foo-without-controls.fa

Or chop up a huge file with sequences into smaller chunks with:

$ seqconvert < all.fa -from fasta -to tab | split -l 10 - chunk-
$ for i in chunk-*; do seqconvert -from tab -to fasta < $i > $i.fa; done
# (this creates files chunk-aa.fa, chunk-ab.fa, ..., each containing
# 10 sequences)
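
awk is listed among the utilities but not shown, and it fits the same
pattern. Here is a minimal sketch, again assuming one "id<TAB>sequence"
record per line; min_length is a hypothetical helper name and the two
records are invented.

```shell
# Keep only records whose sequence (field 2) has at least $1 characters.
min_length() {
    awk -F '\t' -v min="$1" 'length($2) >= min'
}

# Two made-up records; only the first has a sequence of 5 or more characters.
printf 'seq1\tACGTACGT\nseq2\tTTG\n' | min_length 5
```

In a real pipeline the filter would sit between the two seqconvert calls,
in the same place as sort or grep in the examples above.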

