During the last few months, I have been working on a complete rewrite of our voucher and sequence database system, VoSeq. This is a web application that helps us manage our voucher and DNA sequence data. We can easily organize our data based on identifiers such as voucher code and gene code, which makes data retrieval extremely easy. With VoSeq, we can create ready-to-run datasets in a couple of clicks.

I started this project in 2006 and wrote it in PHP, a language that I learned by programming VoSeq. I learned PHP and web application development from a book titled "Web application for dummies", or something like that. I was new to programming, so my code was not the best-looking. I threw PHP, HTML and CSS together in the same files. I was young and naive. I did not know anything about life, the universe and the rest.

Anyway, our little VoSeq was very useful for us in our research group. We discarded the database system that we were using at the time (in FileMaker) and started using VoSeq in 2006, and we haven't stopped. We used VoSeq in the production of several papers between 2006 and 2012.

It was some time during 2011 that my buddy Tobias Malm told me that he had added many tools to VoSeq, including a login system and more complex dataset creation tools involving codon positions and amino acid translations. So we thought that it was time to publish a paper on VoSeq. We quickly put VoSeq in shape for publication, and in 2012 the paper in PLoS ONE came out:

Peña C, Malm T (2012) VoSeq: A Voucher and DNA Sequence Web Application. PLoS ONE 7(6): e39071. doi: 10.1371/journal.pone.0039071

So far VoSeq has been cited 22 times and is being used by a few research groups that are not related to us.

Some time ago, as the amount of data kept growing, it became obvious that VoSeq needed to be made more efficient. The more data we put in VoSeq, the slower it became. It was getting annoyingly slow when we searched for vouchers and generated datasets. My boss, Niklas Wahlberg, kept telling me that VoSeq would time out when trying to create a dataset of a few hundred taxa and more than 100 genes.

So it was time to put some love into VoSeq. But I did not want to touch the source code. It is scary. We fell for every single malpractice in software development: spaghetti code, lack of unit tests, code blocks that are repeated in many files, lack of style, etc. My biggest fear was that I might break something if I tried to improve the code. Since there are no unit tests, I would need to manually test every single feature to make sure that everything was still working.

It was time for a complete rewrite of the code. In the last couple of years, I learned to program in Python and learned about best practices for developing software. I discovered the Django framework and managed to get my head around the model-view-controller system. 2015 me knows better than 2006 me.

So, with a little help from my friends, I have been rewriting VoSeq from scratch. Now it is faster, better, modular, has many unit tests and so on. I will soon finish rewriting the tools that exist in the old VoSeq (version 1.7.4), and in the meantime I am releasing alpha versions of the software (pre-releases that don't have all the features yet).

The new VoSeq is much faster. Datasets that take 30 seconds to generate in the old VoSeq take only 1.5 seconds in the new one. Taxon searches are now very fast because the data is indexed using backends such as Elasticsearch.

Thanks to Django, the new login and user administration system is much better and more robust. It is possible to implement a fine-grained set of permissions for users. Group leaders can use the administration interface in VoSeq to specify which sets of data users are allowed to look at, and which sets they can upload to and modify. For example, in the new VoSeq, users that have registered accounts are able to see the DNA sequences and retrieve datasets. Users without accounts (the public) will only be able to look at voucher and taxonomic information.
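To make the public versus registered split concrete, here is a minimal sketch in plain Python. It is not the code in VoSeq, which relies on Django's own login and permission machinery. The `Voucher` class, the `visible_fields` function and the sample voucher are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Voucher:
    """Hypothetical, simplified voucher record (not VoSeq's real model)."""
    code: str
    genus: str
    species: str
    # Maps gene code -> DNA sequence.
    sequences: dict = field(default_factory=dict)


def visible_fields(voucher, is_registered):
    """Return the data that a visitor is allowed to see for one voucher.

    Everyone gets the voucher code and the taxonomic information.
    Only visitors with a registered account also get the DNA sequences.
    """
    record = {
        "code": voucher.code,
        "taxon": f"{voucher.genus} {voucher.species}",
    }
    if is_registered:
        # Copy the dict so callers cannot modify the stored sequences.
        record["sequences"] = dict(voucher.sequences)
    return record


# Made-up example data, only to exercise the function.
example = Voucher("XX00-01", "Genus", "species", {"COI": "ATGC"})
public_view = visible_fields(example, is_registered=False)
member_view = visible_fields(example, is_registered=True)
```

In this sketch `public_view` holds only the voucher code and taxon, while `member_view` also carries the sequences. A real deployment would make the decision from the logged-in user and the permissions set in the administration interface rather than from a boolean flag.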

The repository for VoSeq is here: https://github.com/carlosp420/VoSeq. You will find instructions on how to install and configure VoSeq. Getting it to work might be tricky due to the many requirements, so I am planning to release a Vagrantfile so that users can create a VirtualBox machine running Ubuntu with a fresh, working version of VoSeq.

Tags: Python VoSeq Django datasets database-manager