zeroth order

A blog about Modern Perl, bioinformatics and anything else that I feel like rambling about.

Monday, June 22, 2009

Do What I Mean: Moose Types and Type Coercions

API's should be simple. I hate it when a module that solves a non-trivial problem requires the user to make non-trivial decisions about every single detail in the domain problem.

In my opinion, a class should be smart enough to make all the reasonable assumptions so as to require the least possible amount of input from the user.

This might seem a little dangerous, and can certainly be if the approach is taken too far (IO::All is for some people an example of too a DWIMmy API), but there's a healthy middle-point in which the user is not only able to rely on the module to solve the problem at hand, but is also spared of most of the cognitive load that solving that problem requires.

It is very probable that someone looking for an already-cooked solution in the CPAN is not only not willing to code it up for himself, but also doesn't know enough about the problem to do so (at least not initially). This person is going to rely on the module's wisdom, and the least that it's asked of him, the better.

One of the modules that I'm working on deals with protein sequence optimization using genetic algorithms. The user has a collection of protein sequences phylogenetically related, and wants to produce an optimized sequence for a custom trait (solubility, hydrophobicity, digestibility, etc) that still belongs to the original protein family.

Under the hood, the algorithm that I implemented requires a multiple protein alignment as input, or profile. Naturally, the methods that do the heavy-lifting expect a Bio::SimpleAlign object. But the complication is that protein alignments can come in lots of different formats, many of which are also shared with plain protein, RNA and DNA file formats. Also, the user actually shouldn't be aware that the module requires a protein alignment. Of course it should be allowed to provide one, but if the only thing he has is a bunch of sequences in a flat file, it shouldn't be bothered with opening (how?), parsing (what format? What is its specification?) and aligning (with what algorithm? Gap penalty who?) them to cater to my particular implementation. All he should need to give is a simple filename as a string.

So to maximize for user convenience, I decided that the module should accept either of:
In the case that there is ambiguity about whether the user supplied an alignment file or a sequence file (eg., fasta format is both an alignment and a sequence format), I'll make an educated guess and assume that it's an unaligned sequence. In the worst case scenario, It'll just realign an alignment. There is also an extra layer of guessing involved in determining what the format actually is in case that the file has an unknown extension or no extension at all (this is done by Bio::Tools::GuessSeqFormat).

All of this adds to the simplicity of the API in detriment of the simplicity of the underlying code. Luckily, Moose has the tools to make this as straightforward and clean as possible, using Types and Type coercions. The coercion map looks like this:



And the code that implements this is simply the following:

What's better is that all these types and type coercions are defined separately in a type library that uses MooseX::Types. They deal with both the input sanity checking and the type coercions. This way, not only this complexity is hidden from the user, it's also hidden from the main application. This is really helpful, since now most of the code in the module's main file describes the class behavior and it's not coupled with nor hidden by the juggling of all the possible user input types and input validation code.

Now future users of this module (most probably only myself) won't have to check the API's documentation that often; whatever representation of a collection of protein sequences they might have will serve as a valid input. I believe this to be a nice design choice.

Monday, May 25, 2009

Superloading Devel::REPL

Today I bumped into Class::Autouse, pointed by App::Cmd's docs. Essentially, it lets you defer the importing of a module until the code actually tries to use it. So, for instance, if you have:


The program will load Deps 1..10 before executing Dep:2's constructor. However, with


Only Dep::2 will be use-d before the instantiation of $thing.

This can be very useful in cases where an application's flow can take several different routes, and might not use all of its functionality in a single invocation. Padre, the Perl IDE, for example, makes extensive use of this module exactly for this reason.

But another cool feature of this library is its ':superloader' option. It allows you to import dependencies on the fly without explicitly saying so. For example:


Here, we don't predeclare any class in particular, we just use it. Notice that I said "class" and not "module"; this only works for object oriented libraries, ie., classes.

What's most useful about this is that it can be very convenient in cases where you don't know beforehand what classes you will be needing. In particular, I think this is a perfect addition to my Devel::REPL configuration file:


and then:


Since I'm mostly doing exploration and debugging when using the REPL, correctness is not an issue, and not having to load things beforehand increases the whipuptitude and overall improves the coding experience.

Monday, May 18, 2009

A sequence translator folder using POE::Component::DirWatch::Object

So, I admit it. I hate coding GUIs. Probably because I don't have much experience doing it, but I hate it nevertheless. So when I bumped into POE::Component::DirWatch::Object and grasped some of the potential it had for kind of replacing a GUI in some simple scripts, I just had to try it.

What POE::Component::DirWatch::Object basically does is keep an eye on a folder, and act upon an event regarding it. For instance, using ::NewFile, it will trigger a user-defined function whenever a file is created in that directory. This might seem old news for the compsci people, but for me it was quite a new concept and I really liked it.

So to test-drive it, I decided that I'd code for a simple script (not too different than the module's synopsis, in retrospect) that would translate any DNA sequence that was created in it. This way, the user would just have to copy or move her desired sequence files to the "translate" (as I very imaginatively called it) folder, and they would automagically translate themselves.

The core of the script is:


Simple, right? Here, $watcher will execute the translate subroutine whenever a new file is created in the /home/brunov/translate folder. The last line is needed to make POE do its magic. The script runs a daemon that never exits, and could be configured to start when you log in to your session.

The translate routine is pretty straightforward. It uses Bio::Seq for the translation part, and Bio::SeqIO for the file reading and writing. What it does is read the input file, translate all the translatable sequences that are in it, and write the results in a temporary file,
which is later used to replace the original one.


That's all that is needed (minus a little error checking that I chose to leave out for the sake of bloggability).

Adding neat notifications

The script runs fine as it is, but it's kind of dull. Since there are no newly created files nor visible output, you are tempted to reopen the translated file to actually check that your DNA has been translated into protein. This is why I thought that adding system notifications would be appropiate.

Actually, bullshit. I just can't get enough of Ubuntu's sexy notification system, so I figured it'd be nice if a bubble popped up saying the outcome of the translation.

So doing some more CPAN digging, I found Gtk2::Notify. With a few extra lines of code, I got a chatty, ribosomy (get it?) folder. Here's the modified translate sub (minus old comments) and the new notify one:


And that's it! Now whenever you drop a file in the "translate" folder, all DNA sequences that are in it get translated (and those who are not are left untouched). This could allow for cute chained actions; one could have a whole hierarchy of live folders with things such as "reverse translate", "blast it", "predict solubility", "digest it" and so on. It is certainly not as flexible as good old CLI scripts, but it reaches a nice compromise between the user friendliness of a GUI and the batch processing capabilities of command-line fu.

Here's how it looks:



You can see the real-life script here.

P.S: And just to spite TIOBE people: Perl Programming rocks! And also Euphoria Programming and SuperCollider Programming, of course.

Monday, April 27, 2009

Chopping proteins with Moose

It comes a time in the life of a wet lab geek when one has to simulate the outcome of a proteolytic degradation of a protein of known sequence (those who said that haven't are lying).

So for a problem this common, I did what I always do first: google for an already made solution. Surprisingly, after ten minutes or so, I came out empty-handed.

Now don't get me wrong: there were solutions out there, like PeptideCutter, PoPS, or EMBOSS' digest, but none of them were easily usable as a library. The first two only offered a web front-end, and that is OK for a quick digestion of one or two sequences, but I needed batch-processing capabilities and much more flexibility (ie, being able to predict partial digests, for instance). The last one was indeed available as a command line utility, but it offered very few enzyme specificities, it forced an interactive mode, and writing a wrapper for it would have involved more logic that the one implemented in the application (not to mention the ugly reading/writing of temporary files that would have been required to deal with the app's inflexible interface, and that would have created a nice and shiny IO bottleneck in every downstream application that used it).

So, there. Homework done, I decided to code for a solution myself. Luckily, PeptideCutter had a pretty comprehensive explanation of how their algorithm works, and it is simple enough to be coded quickly.

Basically, a protease specificity is modeled as a regular expression that returns true or false on a sequence sliding window. This sliding window is six residues long in PeptideCutter, I did it 8 residues long to account for potentially more specific proteases. When the substrate matches the regular expression, the sequence should be cut at the siscile bond. Simple enough.

The module is up at my github repo (I called it Bio::Protease for a lack of better name), you can download it/fork it/follow it to your liking. It has quite a robust test suite, I checked every single specificity against PeptideCutter's results on the same input.

I used Perl and Moose for it. One of the things that I like the most about Moose is how easy it is to make really DWIMmy APIs with it.
For instance, in Bio::Protease, you can do:

my $protease = Bio::Protease->new(specificity => 'trypsin');


This will coerce the string 'trypsin' into an internal representation of the trypsin specificity, which is implemented as a code reference. (Currently, there are more than thirty different specificities to choose from, you can list them by doing say Bio::Protease->Specificities).

Also, you could say:

my $cutting_pattern = Bio::Tools::SeqPattern->new(
    -SEQ => 'XXXW[^P]RSX', -TYPE => 'Amino'
);
my $protease = Bio::Protease->new(specificity => $cutting_pattern);


to get a custom regex-based specificity. The one above would cut sequences between the fourth and fifth residues if they match that pattern.

You could object that the regex-specificity model is too simplistic for your personal use, and you'd be right. Enzymic cleavage of peptide bonds does not share the deterministic character of restriction enzymes, and sometimes protein's structural characteristics get in the way and play an important part in cleavage. So, to account for an arbitrarily complex specificity model, the 'specificity' attribute also accepts a code reference:

my $nobel_prize_winning_specificity_model = sub {
    my $peptide = shift;
    # decide whether to cut $peptide or not
}

my $protease = Bio::Protease->new(
    specificity => $nobel_prize_winning_specificity_model;
);


Internally, the module passes a sliding window of peptides to the code reference; if it returns true, it marks the siscile bond as cut.

So back to the original problem, to actually make the cuts, you can do:

my @products = $protease->digest('MAAEEELLKKVVIKP');

The products of a full digestion of the argument sequence will be returned as a list. On the other hand, if you want to get the siscile bonds, you'd say:

my @cut_sites = $protease->cleavage_sites($seq);

And, for partial digests, you can use the method 'cut', that will return the products of a single cleavage if you provide a siscile bond as an argument:

my @products = $protease->cut($seq, $cut_sites[rand @cut_sites]);

In the above expression, $cut_sites[rand @cut_sites] will choose a random cleavage site from @cut_sites, and cut $seq there. This is the basis for a bigger module (and the reason for coding this one), which could make it to a separate blog post in the future.

The take home lesson is: use Moose. I am too lazy to have done all of this had not been so easy. Without it, I would have probably whipped up a half-assed script (not a shiny, reusable module) that took a sequence and a specificity as input from the command line, and that barfed a table with the cleavage sites as output, and not much more (sounds familiar?). But all of the above functionality was implemented in a hundred-or-so lines of clean, declarative, object-oriented code, and I don't ever have to worry about this problem again (and neither do you).

Saturday, April 25, 2009

The Enlightened Perl Iron Man Competition

Matt Trout just made a blogpost announcing the Iron Man Competition, sponsored by the Englightened Perl Organization.

I personally think this is a very good idea and a great way to tackle one of Perl's problem of today: perception. The language and its community may well be alive (Tim Bunce did a great job proving this), but few people outside the echo chamber know about it, and constantly make claims that Perl is dead. As Schwern very well put it in one of his talks, this perception issue is far from harmless and should not be taken lightly.

Given that today's communication routes among the newer generations are mainly blogs, Matt's proposal to "make noise" by blogging (as opposed to just writing excellent code, which should be enough) is the right way to help increase Perl awareness in the programming community.

I have been playing with the idea of starting a blog myself for a while. This might have been the push I needed to make up my mind and JFDI.

Note: This is a repost of a meditation of mine in PerlMonks. I am hoping that it counts as my first Iron Man blog entry.