I could just let this slide, but I really don’t like being baited online. Paulo (see post and comments passim) has released his latest column showing the world how to generate multiple sets of random DNA sequence in Python, for which I am sure the world is profoundly grateful. In another pointed comment he goes on to say that:

“Of course a top notch bioinformatician that uses Linux 100% of the time would be able to create a simple bash script and use the initial script to generate multiple random sets.”

Complete with a broken link to this site.

I did comment to the effect that this is absolutely NOT the way to do it, because there is already a program in EMBOSS that does this, called makenucseq. Now EMBOSS is not a great secret: it’s an extremely versatile, open-source package for many common sequence analysis tasks, with an API that allows you to add functionality into the EMBOSS framework. Anyone who has met Peter Rice will know how justifiably proud he is of EMBOSS, which has supplanted (almost entirely) the previous ‘gold standard’ package GCG (now seemingly owned by Accelrys) in just about every bioinformatics setup I’ve ever worked in.

But if you just want multiple files full of random sequence why would you use EMBOSS rather than write your own code?

First of all, makenucseq gives you excellent control of the output sequences, from the number of sequences generated to the length of those sequences. It lets you write output in multiple formats and to arbitrary locations. Beyond that, it has two especially useful features.

DNA is, I have to say, not particularly random. Apart from the fact it’s full of exons, promoters, antisense transcripts, miRNAs, structural motifs etc., we get variation in things like GC content and codon usage between organisms. Let’s say that rather than generating plain random sequence you wanted random sequence with the codon usage/GC content of a particular organism you’re interested in. Makenucseq will let you do this, and there are many good reasons why you might want to. The other great feature of makenucseq is that it allows you to specify an arbitrary string to be inserted at a point of your choosing.
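To make that a bit more concrete, here is a rough Python sketch that assembles a makenucseq command line covering both features. The qualifier names (-amount, -length, -codonfile, -useinsert, -insert, -start, -outseq, -auto) are as I remember them from the EMBOSS documentation, so treat them as assumptions and check them against `makenucseq -help` on your own install. The function names and file names are made up for illustration.

```python
import shutil
import subprocess


def makenucseq_command(amount, length, outfile, codonfile=None,
                       insert=None, start=1):
    """Build an argument list for EMBOSS makenucseq.

    Qualifier names are assumed from the EMBOSS docs; verify locally.
    """
    cmd = ["makenucseq", "-amount", str(amount), "-length", str(length)]
    if codonfile is not None:
        # Codon usage table, so sequences follow an organism's codon bias.
        cmd += ["-codonfile", codonfile]
    if insert is not None:
        # Place a fixed string at position `start` in each sequence.
        cmd += ["-useinsert", "-insert", insert, "-start", str(start)]
    # -auto suppresses interactive prompts so the call can be scripted.
    cmd += ["-outseq", outfile, "-auto"]
    return cmd


def run_makenucseq(cmd):
    """Run the command, failing clearly if EMBOSS is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("makenucseq not found; is EMBOSS installed?")
    subprocess.run(cmd, check=True)
```

Something like `run_makenucseq(makenucseq_command(1000, 500, "random.fasta", insert="TATAAT", start=100))` would then ask for 1000 sequences of 500 bases, each carrying the insert at position 100. Passing a codon usage file via `codonfile` is how you would pick up an organism’s bias; which table files your install ships is something to check locally.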

Let’s say that, rather than generating 100,000 FASTA files to make sure the input routines in your code don’t break under load, you have written a program that detects a specific motif in DNA and you want to test its efficacy. (Paulo’s example would not be fine even for the load test, as it doesn’t generate FASTA headers; makenucseq does by default.) You work on bacterial genomes and therefore need to specify the same ‘background’ nucleotide frequencies that your program was trained on. Makenucseq lets you wrap as much functionality around it as you like: you can make thousands of sequences appropriate to your organism, with a test insert distributed at random, and analyse thousands of cases to test the efficacy of your program. OK, it’s a simplified example at best, but you get the idea.
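The wrapping around makenucseq can be very thin. As a sketch only, the Python below reads FASTA text (such as makenucseq output) and reports what fraction of the sequences a detector flags. Everything here is hypothetical scaffolding: `naive_detector` is a stand-in for the program you actually want to test, and the motif is an arbitrary example.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def detection_rate(records, detector):
    """Fraction of sequences for which detector(sequence) is true."""
    if not records:
        return 0.0
    hits = sum(1 for _, seq in records if detector(seq))
    return hits / len(records)


def naive_detector(seq, motif="TATAAT"):
    """Placeholder for the program under test: plain substring search."""
    return motif.lower() in seq.lower()
```

Run it once on a set generated with the insert and once on a set generated without it, and you get a crude sensitivity figure and a crude false-positive rate against the same background.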

EMBOSS is full of discrete, pipeline-ready tools for all kinds of things. As a bioinformatician you ignore EMBOSS at your peril. If you’re also looking for a GUI to it, there are dozens: EMBOSS has been around so long that an entire range of software has sprung up around it. If that is the functionality you’re interested in, you should probably take a look at Jemboss, as it is the one written by the core EMBOSS team.

The lesson? Don’t be repeating what the EMBOSS team have been doing for many years!
