This thesis program is the result of my work from Dec 2014 until around Mar 2015. It is a parallelisation of swarm - an existing single-linkage clustering algorithm for metagenomics data. This program arranges the complete hierarchical clustering workflow into of a set of finely-grained computations (referred as row calculations) that can be run in parallel. The concept of "subseed" was re-used from swarm to skip most of the redundant comparisons (referred as economic search).

The code for global pairwise alignment and k-mer filtering scheme were also re-used from Swarm since they were proven to be very efficient with the SIMD utilisation.

Based on the tests that were run on medium sequence length dataset (averagely 381 nucleotides), the speedup that can be achieved through the parallelisation is averagely 11 times when using 16 threads. This result indicates a successful parallelisation technic which according to Amdahl's law it means 97% of the code were able to be parallelised.

4 level economic search

sequence distance triangle inequality

And the full document:

1 comments: (+add yours?)

Richard S. Maddox said...
This comment has been removed by a blog administrator.