zeroth order

A blog about Modern Perl, bioinformatics and anything else that I feel like rambling about.

Wednesday, November 25, 2009

Embedding images in POD

Gabor's post lamenting the lack of image examples in image-related modules (like charting) reminded me that once I bumped into a module that did show images in its documentation. They weren't links to images, but actual images, showing there in the CPAN search page.

I couldn't find that module again to see how it was done, but after a little googling and talking to some freenode #perl folks, we came up with a way to do it.

Behold:

=head Title

This module is awesome because it deals with pretty pictures, like this one:

=begin html

<img
src="data:image/gif;base64,[gibberish here]"
alt="foo" width="42" height="42"
/>

=end html

So the trick involves escaping the image source for non-html readers of pod (surrounding the img tag with =begin html and =end html), and using a base64 version of the image. I found this service to convert binary images into their base64 representation, which even lets you choose the column width of the resulting string.

It's a workaround, but having an example picture or two in some of the many plotting and charting modules out there might be very well worth the effort.

Here's a complete example, with an actual image. Copy it to a pod file, run pod2html, and see the resulting .html document in your browser:


=head Title

This module is awesome because it deals with pretty pictures, like this one:

=begin html

<img src="data:image/gif;base64,
R0lGODlhTwBGAOf/AEsvHV8/KXNML3lKMm1NOXFNNH1PLoVPMXZTOXZTQINSOINTPotRPX5ZP31a
R5tVQIpcOZNaO5VZRJFdQ49eSYVhTYliR7FZRIZlVa5cRYVmW6dfRoFqSpFmX8BcR8NfP6VoR85c
SN5YR6JpUpptRLhlS5JvX75kS5FvZZhvWZlwUIx0U5BzV5BxbbRqTrdrRcxlTYt6U5h4TYx8TtFp
SdlmUuRkUd1nTbN1U515cqh6SaV6W7h3RqV6ZKN+RKF+SpSCSbF3Ybh3Tax6U6R6bM5yV52DRa5+
R5KGS9hwV8R4RZiFRpOFVaSDQJmEV8V4VpWIR8J5YZuIQvNrYJ+GTvFwXpuOTKGNTbaHUaOOSMaD
UvF2VcKDYa6MQ/B2W5+QSLuGXOR6XpiRW6+MTbeIZcCHVqmPTO94ZNSBYL2KTayOWqORV6GRXrSQ
QKqPaaCUV6+TR6yVSKqVT6WXTuODYp6YYKiYScOQTLOWRLqPeciOfPWFXtiMdciVUK+eT6qgT9WS
XfSIZbKdXK2fV7ubWNOSbsmVbr2cUuOPabafUfaJcMebTbafWL+eSaiiac2YYbydZ7OgZs6bTq+j
ZbikT7alVbWlXLGnVbCnXK2naPeUcburYdmkVr+sVb6sXLmuXcOrX82pW/eaddalb9CnbsCsb9Sp
UtekfLaxXsmqbLyubuKhgbawb9+laMiuVrKycOKmY76yWbu1W/GhiMa0VcezXMWzY8C1YvWjfO6o
dc+xjdutl/Oomc26ad20etS6ZO6tmsq+XtC8ZeyyeO2zbsq+a9K9X/Svi8y/ZtS+WeqykMy+edm9
Ydq6ft66c9a9ctO9fc3Bdf24ifu4keTFW/i4ptTJfNzIcNvGh+DHfvy+idvJfuDKZ+fEkdHNfuvF
huTJddXOdPjCidbNhtzObdvOdNvMi9XPjubJofvFmvTGqfbKf9rSl+TUc+TVfOvUdN7VoePXid7W
qObWqu7SvN7YsOXYsvzVmOTZufzZpf/atf3cre/huv/lvv7v0v324ywAAAAATwBGAAAI/gDt4Rto
T6DBeQUT2pOn0B7Chg3rwbNmzZvFixSXWVu2TJWqSKVCelQ1iY0TJo4meWTF6lUmVpliOqpTZ95A
ghBz6iz4sKA8eByjVfRG8WK4aCNDqiylMhKbp0ycsFHF8lXLmK9czsSGTyC+evbA7lTYM6HYemCD
CqXItqK1kR5DhoxE96lUJ5GoZmXpKJPVrI66ehWcsGu9roTFIpT3E548sGKjcVwWjWhRipJVLZML
qbOazyZXiObAsmrLrFZjol19GGfYsPXKOoTHjh3t2o4Z11Mb7Wi0tUSjyS21ShSiMDROuNixggMH
0ayolv4LM5MjeV+/rn4N8WHP2uDP/olnJ55yr9+Vf7u11iukL16i4geiAQPGiQbOnUfXi9qvdUeQ
HcbaTYKhRVhYjN0mnnjjbNQLZb9ZNBRmpaQCDC+ziKJJIGEkUZ8HGVCQH1UrWQXTK305cs5jrG0n
kIs8FVZPbQui14s15WAWTlvWNFjNj7jgsuF8NcDgwZEPWMCBE2qQNElVVjkipSPuOIYWYwv9xBBP
tPU0zzw/nUPjONss88w22TjjTJpqNgNNM80oUw2G8W24xw031BDCkR6cAAJdkUwyiZSnsTITTe64
YxtD8+BG3jlgngOPeF96N4852GQKZzNtqumpM774kgsus2SoyamB7LGFFzXoeeQF/hm4AAZdgqry
Ekw05eoOOKVAY45tC46z4DnmnCPsr+x8iQ463Wwaqi/OcKpmqKOMMoyQpyqCaiCBrGpDq3t6cMED
LgwRaK1TZpJrHY6AQ0hn48Qrr7DCMjiON+Ng2mwz3XTji7/PBlxtK4iIosjBCCuyxx5eqLrFt3te
MO4DJ6hxbkqT/CdlrirIoEaiicYTzrzD4ovvm80E7Eu1LA9MhyYHn6HIGTSf4cXNXlSRsw3fivvA
AxIsMAEk5wo66JQ0iVEHyEyDPG+81oQTTzlEobzyyi1XG0bNXE9Rxddg4+xFz+NKEPQCFgyyBsZG
C6pu0om+I/fc7bSTaI/jJBrO/t7hlFPOM3BinTUaZ2w9hddfIw52zVXwvCfQC0S+AAWRrMGGlEZH
N4nSNNU99+fv1A2y6H6XDg44zpBCStaPJJFEDWdUcfjssiv+Nc96XiABBZIz8AADMlAhFRtPsqRK
HWKwIYbnoIPezuft+F136c+kkorqqnNRRBH11WDDFDRXETvYX99c5AknlJDBBR6EwL7EEm8wQwxi
RDfI5mKIEfrzzcudKP+hK4c7pue3YvQiFYQghPUKgYYioA99rvNCGLZGM9e5rkMdcl0NcraF+hhJ
AgTgQAxiwIRJWOJ+SaubCpm3vwGucIXkCEc7yFGOYkSiF89wxvUQ0cASuKAE/tsrwgTDgAY00AEN
T3iCEOnwMkXcbAsdLAHQKIAAJ5CQCWx4wyTewAbkvfCLdmNaOcixP3KY0Yx7K4YaK3SKIj7BBXAc
hjgQgYgn7GAHKtgBGcCQj370Qx/5EMcwGAiCBBCgAAVowAhPwoTkaRF5ywujDN2xNxyVboxjJCM3
ZohJGpbjGmq0HiiiwIVS4kCO/crFGcpgARXkkQz78Ic/+pGPfejCAmiwACILsIIYzGAGTAimI7Po
xXbwbW9+QwYyLmnGd5ixbjEMxzWmWY5wqLEXhKDHKcgwhCHkQhz36IYgEZGLVr7yHrL0xz74YQE3
DKOXwQzm/ICJxTq8gYv5/lMhDcOBDDUWQ5nXmKHcuMGNM55RmWq05j/poU0cuFIc4shHLb8xjHK6
EgyFQOcs8/EPCLjTijNg5Ah/ScKn5C+foeNGP6+JjGAgQxi/uMY7CkpQg/KzGLe4BUsZCowgpMAC
udBGRCUKziHIYAhk6KM//tEPdKqgGesAQxmwIAMs/EAGIgzmCPPXSDEQlBvGEEYvrikMY7xUGNfg
xjWQcQ1joJGfK7WFLXrRC2EwVA/ASEEDSPDNe+zDj/sQhzTE8del/oMMuvCHMu6xDm0MYxiteAQg
0oBVX86vq2J4gzBgGlax0vWlytwsMuiKVmWaFqHFkKst7KoOPazCAg1I/kADvvnXfuxDH7dN5z/c
wAIW+OIb90iHUAXZCkAA4ggyIOkVxYAEzW52swYUxmjrutlr0BUUtlCmS0/rT9LSo7V6gC0CCoBL
wt7Wj/2Q5T/+gQAWEIAAvpDGNLRB38cWVws8YAISfgnMNwxiEL2oBWhXigzV2uIXv7AFKBgBClD0
4hahjXAv5tqL7+phFztAwHjJiwZcpEMf+vCjev+hAfceEhfSkAZ9iQsI/K4BCPRkwhswgYlbyPWf
ZxWGXD3hCdVughEMBoUnQHHWHIvVrhY+RQN2iQAVzAIXIA6xbivAggBYmQDHmIaKHZuLFvPACoNA
whrWwIEZ11iuEH6u/o49sYlN9JjNQG7wkIcMis0agxnG+IUw8EGPPKTAAQXYcAEogAhcTAPEf9SH
LHVRZSsHgAu4OIY0HtviFwhhEPccMxNojAk0m3bCO+Zxm9ssiDgLmcdD/gUzVs2MX6SCDD8dr6AX
EAVReDgduFa0LFkAAEcHIEjSyEUuEPGCFxBiEFZ4wxoGQeNNfALNBka1qEf9YyAzws1upnMoWn2K
PLAAAw6ogAMWMIByD0ByhZ4GrnOtj2EUwNdo4CEaXLABELgZE3NI9htQ8QlUoKIWt6iFwKU97U1Y
wuBAFkSpPWGJgzcYFKE4RQ9MQPEKWHwC5Ca3AiJH6EhP4+Mfl8YT/t4dgASMYARmWwAVBOGJT1hi
Dsr+b78DLnBb1ILg2W64wS3BCIUzouENBzIhTpECiqMABRjAQA8moIANMGABDGDAxikQ70gHCReI
cEEEJhCAa3/i6wxv+SeQDWZm8/sWsRB4LWwRdlEDfeeDsPbPB9Fwup+C4hhAQQs0gIEUBCECEdjA
z34n9cm54AlFdOMGJKADHtP4Epf4RORjIYtLMBsTMr/FJ2q+doZXohJuroQlQB/3hNO97oMQBCQm
nvSj8x0FQdjAD0uwgQycgPAKQIAFKDCCCUgABEGTM+Qhjwl/y0IWn8DEJWg8B0x8Au015/HoQf95
hm+i9NYG+n8H/uEGiyddA+DvQAeCAMcSKGcDMPhABh4wgQE4+AckCAAExvALIVdC8sqPPCqOj3x+
f6LNX1cLneAJogd001cJBFgJjCAHpSYIdDd3CtcDFZB0FUBxGtABJhB76AMDtAcDN+A+G4AF2OUJ
oVCCJShknEZj/oYKkId8X+d/XzeACsgI//V5lkCAQLeAgkAKNJh9CpcCSYcBFWABKYAC4NcDP+QB
9uECJ4AnHlACQgAKaucKJoiCjzd8qKB8/faC/tZvBOgJ1jYIfuAHCFgJpUcI30AMrdAKsFBqOigI
eCeEGGB0KGACI+ACMBACe3ICHpgcF4BdNlYLriBnXNiFl4AK/n8AeZ8QC7cgCyt4CWVYCYlAg4kg
hnEnB4yQQJBlXI+QBoSgcIKgBnlgAkGoAXVoAhqQgRtQAnkIIh5wAzSgfkJ2Cy2Hap3wfPvXf5CX
iIsYC5S3gpjweYkwjMNICIlACJmYQIRACsWlBEogBGRgCKCYB3mAdOBnikdHihQQBCNQAnuSAa9I
Ax6wAaBwf2zHdjz2db7oiCwIeZWQdrTQUshwCywoicSYCIcACoegjAl0B5KFXziAVGRwCgSZB0Rg
hNeIjaRocSkgARIDjjAgjhngCYnQY+fYYwPoi5TniJAYC7TADckQDCIZDJJnj4fQCI2wCIewj2OA
BS6JBVog/gRCAAJ4pEdkQAYGmQNHt3fXiHTg5gAJEDk/YwN4EgInkAEgMASmNARgMAQ6QAhoJwsb
KQuV0AnMQA3UkAxaKZKXQIyh4AqmEAqmsAiEkAYvWQZCgAM4sAMpoAI98JY9cJB61wI8yXfgB24J
kACIdG4hIAIisIc4UEqCCQZNOQQaSXkaSQtYmZXJAJLBgArDeAhiOZaScAdpYJYuWQZqSZNt+ZZF
ZwIokAN0WZfgJm4OQAB6iUgR4AF+aQMcKJhcUAiGQJjdVAuH6Yu0kAyrBpIh6VKQiY+HsAiS0AeW
eZkuqQVqiQOcmQIpMHEmkAPQOZrX+JN5+V4bBgIX0Jch/gADaFAIXMAHq6AMpzCbtXl8GlkLWmkM
6TmSjjiJcDAGLXmZ8okFSJRET4ADQRAERBCXRACd/ima4AeUQJmXhnRICvACg3cCRcAHfDALwAAM
yhChswkGthkMtICbzJAM6kkLHHqhX+cHiWAG8HkEWDCfQmCfUZCiQcCf/fmf0Il0A4oAh5SXMpoA
GBc5CsAAGxAFfMALP1IN6qAMhQAGOmCbHJqYHeoKtNAJTAp5cQefP/ADJFqiZeACIOACOICl+ZkD
JkAERECKc3h0MFqg1pkAGoaaOsA7DtAAk8OjvOCjP6IMhjBVSnqhvtgJscCkneAKrkAJnTB8YpgF
UPoD/jpAojwAAoiaqCPAnEE4geHmAEk3oEwWaBqWAFEqbgS6jW7qo8BwDIVQBoewanbaCZTwjrHg
p33qB3+wqnMwB1dgBCIapYUKeBGAqBNwqxYnbpiaAA7QqwPaAAggA4ekYRaAACupAr2alw7Ae/nZ
o7wADKvwqaHqCnjKpJRwraRaqpWgqqsqB3KQBbA6Bj4wrjoAARBAqwrAdBVAoDTargjQABYAASvZ
CIfwA2PACZKwCF1gAQMabhVAAQDrpqVSCFhwCJTQp56gp9dKhts6htwqB3FgBllgBkbgA7IaARBw
AICnAApAAcCqYbIWaIFmAT9ACIdAjCspnJJgCl3Q/gC+2qsLsHELMAINOrA6MIx+6o6VcLAOu6o+
+wffegVZ8KpjIKsQoAAHYAAQYAACYK4kQAJmELVjIJmNQAnEKIkgGpyLYApc2wcWMIEVEDkDoADn
NjmI6pSJMIbbegkO27N+MAis+gd2MAdDewV2ewXj6gPmagBK67Q+MIwmOa+UoLYOmwiUsAiIu7KL
oAIWF5Qc+7gcOwEkoAMOewk+C7djeLlzwKpzmwWeK7SfKwdNoLfm6rRQGwdycI/32LZjOIwpibiL
ULTwKjQTwAAHcADlBgEkcARjaAd/ALeY8LN/oKqD0KpzYAfIS7ef67meGwfOGwcWqwM+YAbeWr1+
/hAHxGi1Yzi4hbuS3ksIYwABtxp1SGsAAwACPDAGDuu7/2UHcPu7rGq8yBsHyCu0S7AEzHsFz5sI
zisHc+CtcTAHqGsHqpu2vVu4/AsHChwHY0AC51q+fAsCZUC9g4C8duAHxsutrVq88ju/eEAFV3C/
Qku3EDuGcmAH3nq8reqtKIzCcsC6x+sHyHu9EOutWfADe2sAGssDR8AIrXrBczu3xvsHxlvEx2sH
cYAHjbAEIczEWSAHX4DE9Iu6KrzCLbzC/3vBY/jDFty/NnwFJMC3EfAC6QsHc/AFVtCqX2DEbHzG
R2wHStwIVGAEInwFb+zC34rFRVy9P9y7XxDF/snbqp57xhNrri9wBF1Av3+cxvkGBWv8w/5bxGk8
vyjZCEBwv5h8BV/wrVmAxpp8t3ZsxGicBcqLxn/8yIRsBX/suWYwvZF8xmhsBVAwy468yXEAB86b
BVawy2ccB5W8CEYQzJjMxEuwyFbQxMjMyPm2y6rMzFBQzMrLvJ38BdL8x8bLzLIMBbsstGYABygZ
B8ZMzd5cycEszPcrBec8zEBwyezszLLczNqMyU8csZ5LzaesyqdszNk8y7I8sXCAB3ggscycBQpc
yY3QBAhtzsOMyevc0EgABEiAzZdMy7O8BFDQvJ67y6bMzPl8yrSczVYgBWbQBXgAB/U8y1nQwAVt
YNBd0AUJTcfDDAUPDdEzDdEQLdPPrM1QIAUUTc3SzNPYjM/6TNE6jb9RKwVZQNEETdKVrNIvfb+0
vAQNDWMOjQRQAAQ6TdQfTc1IndS7rM2yjMY8TctIUNYyDdK6/AVEfQVN0AX/3Agt7dJNANM4/cxS
DdEzYNM3bdZaTdG7LAVIrdZfIAWLvNNEXdZ83dezjASY3AQiGtcI/dQ5fb9T/dBmjdh9LdW0DNhZ
QNiKTdagLdNWfdh8fcnl3AQBAQA7"/>

=end html


Update: confound told me that apparently data tags don't work with IE6, so if you are planning to do this in your module, take it into consideration.

Tuesday, November 24, 2009

Baby Haskell and Good Old Perl

I finally drank the functional, immutable and pure koolaid and started reading some Haskell books. And to exercise a bit, I began to tackle some of the problems from project Euler.

Problem 10 states: "Find the sum of all primes less than 2,000,000".

After finding out that it's only necessary to check for divisors less than the square root of the number, I blatantly stole this version from the Haskell Wiki:


If this doesn't give you a nerdgasm, I don't know what will. It creates an infinite list of primes by filtering an infinite list of odd numbers with the isPrime function. This function, in turn, uses the primes list itself to look for potential divisors. So basically, the list is defined recursively.

The solution then boils down to getting all primes smaller than two million from the infinite list, summing them and outputting the result.

Anyway, this beautiful solution got me wondering how the perl version would look. Here is my first attempt:


Ugh. I mean, it's not that bad, it has some interesting things going on like having an infinite stream of primes, but after being embelished by the Haskell version, this seems uninspired at best.

But then I thought that being CPAN (and not syntax) Perl's strength, I should take a look there. After no more than 30 seconds of searching, I found Math::Primality. Look how the Perl 5 version looks now:


This put a smile on my face. Perl may not be the most beautiful language around, but it has a JFDI attitude that I have yet to find elsewhere.

Monday, June 22, 2009

Do What I Mean: Moose Types and Type Coercions

API's should be simple. I hate it when a module that solves a non-trivial problem requires the user to make non-trivial decisions about every single detail in the domain problem.

In my opinion, a class should be smart enough to make all the reasonable assumptions so as to require the least possible amount of input from the user.

This might seem a little dangerous, and can certainly be if the approach is taken too far (IO::All is for some people an example of too a DWIMmy API), but there's a healthy middle-point in which the user is not only able to rely on the module to solve the problem at hand, but is also spared of most of the cognitive load that solving that problem requires.

It is very probable that someone looking for an already-cooked solution in the CPAN is not only not willing to code it up for himself, but also doesn't know enough about the problem to do so (at least not initially). This person is going to rely on the module's wisdom, and the least that it's asked of him, the better.

One of the modules that I'm working on deals with protein sequence optimization using genetic algorithms. The user has a collection of protein sequences phylogenetically related, and wants to produce an optimized sequence for a custom trait (solubility, hydrophobicity, digestibility, etc) that still belongs to the original protein family.

Under the hood, the algorithm that I implemented requires a multiple protein alignment as input, or profile. Naturally, the methods that do the heavy-lifting expect a Bio::SimpleAlign object. But the complication is that protein alignments can come in lots of different formats, many of which are also shared with plain protein, RNA and DNA file formats. Also, the user actually shouldn't be aware that the module requires a protein alignment. Of course it should be allowed to provide one, but if the only thing he has is a bunch of sequences in a flat file, it shouldn't be bothered with opening (how?), parsing (what format? What is its specification?) and aligning (with what algorithm? Gap penalty who?) them to cater to my particular implementation. All he should need to give is a simple filename as a string.

So to maximize for user convenience, I decided that the module should accept either of:
In the case that there is ambiguity about whether the user supplied an alignment file or a sequence file (eg., fasta format is both an alignment and a sequence format), I'll make an educated guess and assume that it's an unaligned sequence. In the worst case scenario, It'll just realign an alignment. There is also an extra layer of guessing involved in determining what the format actually is in case that the file has an unknown extension or no extension at all (this is done by Bio::Tools::GuessSeqFormat).

All of this adds to the simplicity of the API in detriment of the simplicity of the underlying code. Luckily, Moose has the tools to make this as straightforward and clean as possible, using Types and Type coercions. The coercion map looks like this:



And the code that implements this is simply the following:

What's better is that all these types and type coercions are defined separately in a type library that uses MooseX::Types. They deal with both the input sanity checking and the type coercions. This way, not only this complexity is hidden from the user, it's also hidden from the main application. This is really helpful, since now most of the code in the module's main file describes the class behavior and it's not coupled with nor hidden by the juggling of all the possible user input types and input validation code.

Now future users of this module (most probably only myself) won't have to check the API's documentation that often; whatever representation of a collection of protein sequences they might have will serve as a valid input. I believe this to be a nice design choice.

Monday, May 25, 2009

Superloading Devel::REPL

Today I bumped into Class::Autouse, pointed by App::Cmd's docs. Essentially, it lets you defer the importing of a module until the code actually tries to use it. So, for instance, if you have:


The program will load Deps 1..10 before executing Dep:2's constructor. However, with


Only Dep::2 will be use-d before the instantiation of $thing.

This can be very useful in cases where an application's flow can take several different routes, and might not use all of its functionality in a single invocation. Padre, the Perl IDE, for example, makes extensive use of this module exactly for this reason.

But another cool feature of this library is its ':superloader' option. It allows you to import dependencies on the fly without explicitly saying so. For example:


Here, we don't predeclare any class in particular, we just use it. Notice that I said "class" and not "module"; this only works for object oriented libraries, ie., classes.

What's most useful about this is that it can be very convenient in cases where you don't know beforehand what classes you will be needing. In particular, I think this is a perfect addition to my Devel::REPL configuration file:


and then:


Since I'm mostly doing exploration and debugging when using the REPL, correctness is not an issue, and not having to load things beforehand increases the whipuptitude and overall improves the coding experience.

Monday, May 18, 2009

A sequence translator folder using POE::Component::DirWatch::Object

So, I admit it. I hate coding GUIs. Probably because I don't have much experience doing it, but I hate it nevertheless. So when I bumped into POE::Component::DirWatch::Object and grasped some of the potential it had for kind of replacing a GUI in some simple scripts, I just had to try it.

What POE::Component::DirWatch::Object basically does is keep an eye on a folder, and act upon an event regarding it. For instance, using ::NewFile, it will trigger a user-defined function whenever a file is created in that directory. This might seem old news for the compsci people, but for me it was quite a new concept and I really liked it.

So to test-drive it, I decided that I'd code for a simple script (not too different than the module's synopsis, in retrospect) that would translate any DNA sequence that was created in it. This way, the user would just have to copy or move her desired sequence files to the "translate" (as I very imaginatively called it) folder, and they would automagically translate themselves.

The core of the script is:


Simple, right? Here, $watcher will execute the translate subroutine whenever a new file is created in the /home/brunov/translate folder. The last line is needed to make POE do its magic. The script runs a daemon that never exits, and could be configured to start when you log in to your session.

The translate routine is pretty straightforward. It uses Bio::Seq for the translation part, and Bio::SeqIO for the file reading and writing. What it does is read the input file, translate all the translatable sequences that are in it, and write the results in a temporary file,
which is later used to replace the original one.


That's all that is needed (minus a little error checking that I chose to leave out for the sake of bloggability).

Adding neat notifications

The script runs fine as it is, but it's kind of dull. Since there are no newly created files nor visible output, you are tempted to reopen the translated file to actually check that your DNA has been translated into protein. This is why I thought that adding system notifications would be appropiate.

Actually, bullshit. I just can't get enough of Ubuntu's sexy notification system, so I figured it'd be nice if a bubble popped up saying the outcome of the translation.

So doing some more CPAN digging, I found Gtk2::Notify. With a few extra lines of code, I got a chatty, ribosomy (get it?) folder. Here's the modified translate sub (minus old comments) and the new notify one:


And that's it! Now whenever you drop a file in the "translate" folder, all DNA sequences that are in it get translated (and those who are not are left untouched). This could allow for cute chained actions; one could have a whole hierarchy of live folders with things such as "reverse translate", "blast it", "predict solubility", "digest it" and so on. It is certainly not as flexible as good old CLI scripts, but it reaches a nice compromise between the user friendliness of a GUI and the batch processing capabilities of command-line fu.

Here's how it looks:



You can see the real-life script here.

P.S: And just to spite TIOBE people: Perl Programming rocks! And also Euphoria Programming and SuperCollider Programming, of course.

Monday, April 27, 2009

Chopping proteins with Moose

It comes a time in the life of a wet lab geek when one has to simulate the outcome of a proteolytic degradation of a protein of known sequence (those who said that haven't are lying).

So for a problem this common, I did what I always do first: google for an already made solution. Surprisingly, after ten minutes or so, I came out empty-handed.

Now don't get me wrong: there were solutions out there, like PeptideCutter, PoPS, or EMBOSS' digest, but none of them were easily usable as a library. The first two only offered a web front-end, and that is OK for a quick digestion of one or two sequences, but I needed batch-processing capabilities and much more flexibility (ie, being able to predict partial digests, for instance). The last one was indeed available as a command line utility, but it offered very few enzyme specificities, it forced an interactive mode, and writing a wrapper for it would have involved more logic that the one implemented in the application (not to mention the ugly reading/writing of temporary files that would have been required to deal with the app's inflexible interface, and that would have created a nice and shiny IO bottleneck in every downstream application that used it).

So, there. Homework done, I decided to code for a solution myself. Luckily, PeptideCutter had a pretty comprehensive explanation of how their algorithm works, and it is simple enough to be coded quickly.

Basically, a protease specificity is modeled as a regular expression that returns true or false on a sequence sliding window. This sliding window is six residues long in PeptideCutter, I did it 8 residues long to account for potentially more specific proteases. When the substrate matches the regular expression, the sequence should be cut at the siscile bond. Simple enough.

The module is up at my github repo (I called it Bio::Protease for a lack of better name), you can download it/fork it/follow it to your liking. It has quite a robust test suite, I checked every single specificity against PeptideCutter's results on the same input.

I used Perl and Moose for it. One of the things that I like the most about Moose is how easy it is to make really DWIMmy APIs with it.
For instance, in Bio::Protease, you can do:

my $protease = Bio::Protease->new(specificity => 'trypsin');


This will coerce the string 'trypsin' into an internal representation of the trypsin specificity, which is implemented as a code reference. (Currently, there are more than thirty different specificities to choose from, you can list them by doing say Bio::Protease->Specificities).

Also, you could say:

my $cutting_pattern = Bio::Tools::SeqPattern->new(
    -SEQ => 'XXXW[^P]RSX', -TYPE => 'Amino'
);
my $protease = Bio::Protease->new(specificity => $cutting_pattern);


to get a custom regex-based specificity. The one above would cut sequences between the fourth and fifth residues if they match that pattern.

You could object that the regex-specificity model is too simplistic for your personal use, and you'd be right. Enzymic cleavage of peptide bonds does not share the deterministic character of restriction enzymes, and sometimes protein's structural characteristics get in the way and play an important part in cleavage. So, to account for an arbitrarily complex specificity model, the 'specificity' attribute also accepts a code reference:

my $nobel_prize_winning_specificity_model = sub {
    my $peptide = shift;
    # decide whether to cut $peptide or not
}

my $protease = Bio::Protease->new(
    specificity => $nobel_prize_winning_specificity_model;
);


Internally, the module passes a sliding window of peptides to the code reference; if it returns true, it marks the siscile bond as cut.

So back to the original problem, to actually make the cuts, you can do:

my @products = $protease->digest('MAAEEELLKKVVIKP');

The products of a full digestion of the argument sequence will be returned as a list. On the other hand, if you want to get the siscile bonds, you'd say:

my @cut_sites = $protease->cleavage_sites($seq);

And, for partial digests, you can use the method 'cut', that will return the products of a single cleavage if you provide a siscile bond as an argument:

my @products = $protease->cut($seq, $cut_sites[rand @cut_sites]);

In the above expression, $cut_sites[rand @cut_sites] will choose a random cleavage site from @cut_sites, and cut $seq there. This is the basis for a bigger module (and the reason for coding this one), which could make it to a separate blog post in the future.

The take home lesson is: use Moose. I am too lazy to have done all of this had not been so easy. Without it, I would have probably whipped up a half-assed script (not a shiny, reusable module) that took a sequence and a specificity as input from the command line, and that barfed a table with the cleavage sites as output, and not much more (sounds familiar?). But all of the above functionality was implemented in a hundred-or-so lines of clean, declarative, object-oriented code, and I don't ever have to worry about this problem again (and neither do you).

Saturday, April 25, 2009

The Enlightened Perl Iron Man Competition

Matt Trout just made a blogpost announcing the Iron Man Competition, sponsored by the Englightened Perl Organization.

I personally think this is a very good idea and a great way to tackle one of Perl's problem of today: perception. The language and its community may well be alive (Tim Bunce did a great job proving this), but few people outside the echo chamber know about it, and constantly make claims that Perl is dead. As Schwern very well put it in one of his talks, this perception issue is far from harmless and should not be taken lightly.

Given that today's communication routes among the newer generations are mainly blogs, Matt's proposal to "make noise" by blogging (as opposed to just writing excellent code, which should be enough) is the right way to help increase Perl awareness in the programming community.

I have been playing with the idea of starting a blog myself for a while. This might have been the push I needed to make up my mind and JFDI.

Note: This is a repost of a meditation of mine in PerlMonks. I am hoping that it counts as my first Iron Man blog entry.