.NET classes for BWA and SAMtools

Sep 11, 2012 at 1:17 PM

Hello,

It would be very useful if you could provide BWA and SAMtools as .NET classes. That way, we could efficiently and easily embed them into more complex tools. Are you planning on doing something like this in the near future?


Regards,

Rocco

Coordinator
Sep 17, 2012 at 2:06 PM

Hi, Rocco,

Thanks very much for the question.

1. If you mean a managed class that implements all the functions of BWA, i.e. rewriting the whole of BWA in .NET: THIS IS NOT EASY, BUT WILL BE FUN. Supposing it can be done, I would expect the performance to be roughly 10 times slower, based on general .NET vs. C metrics. As an end user, do you think this is acceptable?

2. If you mean a "wrapper" over BWA, so that you can call the native application from within your managed pipeline, this should be easy.

For samtools, it should be OK to rewrite the whole thing in .NET; we just need to justify the effort. I'm not sure if you already use .NET Bio, but it has classes for BAM/SAM and I/O functions that overlap with samtools.

Best regards,

dong

Sep 17, 2012 at 2:59 PM
Hi Dong,

Thank you for your quick answer.
I have no direct experience with .NET Bio: if its BAM/SAM and
I/O classes overlap with SAMtools, I agree that there is no
need to replicate them.
About BWA: I was thinking about a managed class. I must
say that I am surprised by the '10 times slower' figure.
All the benchmarks I'm aware of point to (broadly
speaking) a 1.1- to 2.0-fold difference between C/C++ and
C# (with simple looping code, so no risk of memory leaks
on the C/C++ side).

See:
http://reverseblade.blogspot.it/2009/02/c-versus-c-versus-java-performance.html

and

http://www.codeproject.com/Articles/212856/Head-to-head-benchmark-Csharp-vs-NET

Do you have direct experience of this huge performance
loss on the C# side?

Besides this, having a managed class for a high quality
aligner, such as BWA, could still be useful whenever the
alignment is not time-critical (i.e. not genome-sized
alignment of billions of Fastq reads).

Best regards,
Rocco


Coordinator
Sep 17, 2012 at 3:47 PM

BWA and samtools are in pure C (C99 to be precise). C is generally faster than C++, setting aside the inefficiencies of some C++ libraries. It is correct that C++ is about 2 times faster than managed code (Java/C#). But BWA spends most of its runtime doing memory access, and C# memory access is far slower, hence my rough figure of 10. The main point is how one writes high-performance C# code.

I do have a plan for this: 1. come up with a managed version of BWA that has all the abilities of the native version; 2. a reduced-power version that can only deal with smaller genomes, etc., but is fully compatible at the algorithm level.

If there is enough support/request, I might start next month.

Best,

dong

Sep 17, 2012 at 3:58 PM
Great!

Please keep me posted.

Regards,
Rocco


Feb 5, 2013 at 2:35 AM
Hi Dong,

Was using samtools today and wanted to swing by with a quick question but also thought I would ask something here.

As for the quick question: when you ported BWA and SAMtools, did you rewrite them to work with the Microsoft compiler? If so, why did you pick this over MinGW? (I ask because R uses MinGW and I was curious what the differences might be.)

As for BWA ported to C#, I agree there might be a substantial slowdown. I think the main issue might be that BWA does random array access (e.g. Arr[M]), and I would worry that the runtime would check for an out-of-bounds condition on every access (i.e. that M < Arr.Length). However, I suppose that could be avoided by following certain guidelines (http://blogs.msdn.com/b/clrcodegeneration/archive/2009/08/13/array-bounds-check-elimination-in-the-clr.aspx) or by using unsafe code that accesses the memory directly, without the slowdown.
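To make the concern concrete, here is a rough illustration in C (the language BWA is written in). It is only a sketch: the function names are made up, and the checked version merely stands in for what a managed runtime conceptually does on each indexed access when the JIT cannot prove the check redundant.

```c
#include <stddef.h>
#include <stdint.h>

/* What a bounds-checked runtime conceptually does for arr[m]:
 * compare the index against the length before touching memory.
 * Returns 0 on success, -1 where managed code would throw. */
int get_checked(const uint32_t *arr, size_t len, size_t m, uint32_t *out)
{
    if (m >= len)
        return -1;
    *out = arr[m];
    return 0;
}

/* What plain C (or unsafe pointer code) does: no check at all.
 * The caller must guarantee that m is within the array. */
uint32_t get_unchecked(const uint32_t *arr, size_t m)
{
    return arr[m];
}
```

With random access patterns the extra compare-and-branch is paid on every lookup, which is why it matters more for an index-walking program than for a simple sequential loop.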

If it were ported to C# for a smaller genome, though, I would not be surprised if the performance came within 2X of BWA. For starters, I am not sure BWA is parallel, but C# obviously would be. Also, with a small enough genome you could keep the entire FM index and the count array in memory, though I suppose at that stage one would be better off hashing or just using typical suffix array methods. Actually, there are suffix array lookups implemented in .NET Bio; I will go post over there to see if anyone has checked how the NUCmer/MUMmer set compares to the C versions.
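For readers unfamiliar with the terms, the following is a toy sketch of an in-memory FM index with a count array and backward search. It is not BWA's implementation: the names (`fm_index`, `fm_build`, `fm_count`) are hypothetical, the suffix sort is naive, and the occurrence table is stored in full rather than sampled, so it is only suitable for very small inputs.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 256   /* alphabet size: one slot per byte value */

typedef struct {
    size_t n;          /* text length, including the '$' sentinel */
    size_t C[SIGMA];   /* C[c]: number of text characters smaller than c */
    size_t *occ;       /* occ[i*SIGMA + c]: count of c in bwt[0..i) */
} fm_index;

static const char *g_text;   /* text seen by the comparator (not thread-safe) */

static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(g_text + *(const size_t *)a, g_text + *(const size_t *)b);
}

/* Builds the index. The text must end with a single '$' that sorts
 * before every other character in it. Returns NULL on failure. */
fm_index *fm_build(const char *text)
{
    size_t n = strlen(text), i, total = 0;
    unsigned c;
    fm_index *fm = calloc(1, sizeof *fm);
    size_t *sa = malloc(n * sizeof *sa);
    size_t *occ = calloc((n + 1) * SIGMA, sizeof *occ);
    if (!fm || !sa || !occ || n == 0) {
        free(fm); free(sa); free(occ);
        return NULL;
    }
    for (i = 0; i < n; ++i) sa[i] = i;
    g_text = text;
    qsort(sa, n, sizeof *sa, cmp_suffix);   /* naive suffix sort */
    for (i = 0; i < n; ++i) {
        /* BWT character: the one preceding suffix sa[i], cyclically */
        unsigned char b = (unsigned char)text[(sa[i] + n - 1) % n];
        memcpy(occ + (i + 1) * SIGMA, occ + i * SIGMA, SIGMA * sizeof *occ);
        occ[(i + 1) * SIGMA + b]++;
    }
    for (c = 0; c < SIGMA; ++c) {           /* cumulative character counts */
        fm->C[c] = total;
        total += occ[n * SIGMA + c];
    }
    fm->n = n;
    fm->occ = occ;
    free(sa);
    return fm;
}

/* Backward search: number of occurrences of pat in the text. */
size_t fm_count(const fm_index *fm, const char *pat)
{
    size_t lo = 0, hi = fm->n, m = strlen(pat);   /* interval [lo, hi) */
    while (m > 0 && lo < hi) {
        unsigned char c = (unsigned char)pat[--m];
        lo = fm->C[c] + fm->occ[lo * SIGMA + c];
        hi = fm->C[c] + fm->occ[hi * SIGMA + c];
    }
    return hi > lo ? hi - lo : 0;
}

void fm_free(fm_index *fm)
{
    if (fm) { free(fm->occ); free(fm); }
}
```

Each step of the search does two lookups at effectively random positions in `occ`, which is the memory access pattern under discussion here.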

Cheers,
Nigel
Coordinator
Feb 5, 2013 at 12:39 PM
Edited Feb 5, 2013 at 12:42 PM
Hi, Nigel,

Nice to see you come over here. :)

Answer to question 1: I wouldn't say I 'rewrote' BWA and SAMtools; I 'ported' them. In fact, if you compare the source code with the Linux code, the total diff is less than 100. Many of those diffs have to do with C99 features that the MSC compiler does not support (e.g. declaring a variable at an arbitrary position rather than at the start of a block).
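As an illustration of that kind of diff (a made-up example, not taken from the BWA or samtools source), here is the same function written in C99 style and in the C89 style with every declaration moved to the start of its block:

```c
#include <stddef.h>

/* C99 style: variables declared where they are first used. */
long sum_c99(const int *a, size_t n)
{
    if (a == NULL)
        return 0;
    long total = 0;                  /* declaration after a statement */
    for (size_t i = 0; i < n; ++i)   /* declaration inside the for header */
        total += a[i];
    return total;
}

/* C89 style: every declaration sits at the start of its block. */
long sum_c89(const int *a, size_t n)
{
    long total = 0;
    size_t i;
    if (a == NULL)
        return 0;
    for (i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```

The two functions behave identically; only the placement of the declarations differs, which is why such changes are mechanical but touch many lines.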

The other changes have to do with glibc routines we don't have on Windows, so on my GitHub I have a glibc repo to hold those pieces.

Why not MinGW or Cygwin? Because people have already tried many times without success. That is, you expect this middle layer to do all the magic for you without ever needing to look at the source code. The truth is, I believe whoever maintains the R Windows code base has a very good understanding of how MinGW works internally and makes fixes where needed (so he knows the R code as well as the MinGW code). In my case, once the situation forced me to look at the BWA C code, I would rather fix it directly without any middle layer's 'help'. Then I only need to look at the BWA C code, the glibc C code, and MSDN for MSC details.

In the case of Linux pthreads, I have no idea whether MinGW can 'translate' them for you like-for-like.

Regarding high-performance C#, thanks for the link, that's a good read. I feel that until we start the real work, it is all just on paper. I have mixed feelings about the need to read books like "CLR via C#" or the MSIL spec, down to the level of CPU registers or x86 instruction sets, even SIMD, MMX, SSE2, etc. (which I planned to do anyway). My thinking is that I would rather hope MS advances the MSC compiler so that we can adopt C code more easily.

You might follow this on GitHub as well: https://github.com/mythz/ScalingDotNET

Joe Duffy who is leading the Midori team has a blog post on string op: http://www.bluebytesoftware.com/blog/2012-10-31-BewareTheString.aspx
His team should be the best in the world on managed performance, simply because they are writing an OS in .NET. :)

Best regards,

dong
Feb 6, 2013 at 5:55 PM
Hey Dong,

Would definitely agree that it is probably better to just fix the C code than to use MinGW. My experience has been that if you have to use Cygwin, you are better off just having a dual-boot system and switching. MinGW has actually been pretty useful to me, though. I have found that if the project you are compiling is "stand-alone", like an ODE solver or a computational program that doesn't use external libraries, things go pretty smoothly. Dependencies like glibc would probably ruin that. As for pthreads, I am not actually sure how they implement them in MinGW. They can't do it exactly like-for-like, but I do know there is a library that can be called: for example, I recently recompiled bowtie2 using MinGW, and it was a pretty straightforward port even though it uses pthreads. http://evolvedmicrobe.com/blogs/?p=12

I read "CLR via C#" over winter break; it was a reasonably interesting read but definitely not essential. It was interesting to learn more about how the CLR is implemented and how method lookups occur, and things like strings all being Unicode, and value types, are covered as well. I am reading computer architecture V now, which is probably more interesting (though honestly less immediately useful; I haven't found a situation where knowing about CPU pipelining changes how I write my code). The CLR spec actually never goes down to machine instructions, as it only guarantees behaviors, not implementations. Mono, as you may be aware, does some SIMD, but the MS version doesn't.

Going to go check out those links now.

Cheers,
Nigel
Coordinator
Feb 6, 2013 at 10:03 PM
Edited Feb 6, 2013 at 10:05 PM
Hi, Nigel,

Did you notice that the URLs we paste in always swallow the next word? I guess we had better use the INSERT LINK tool.

I had a look at your website, nice one. The fact that bowtie2 can be compiled with either Cygwin or MinGW is apparent from the original makefile, because the author already planned for it. This is not the case for the Sanger series (bwa, samtools, tabix, etc.). I would say the majority doesn't care as long as they can publish. bamtools (C++) is another good citizen: they use CMake, so you can generate Visual Studio project files (with bugs, though).

Another look at the bowtie2 makefile just made me laugh. It is yet another good example of how arrogant those Linux kings are. I'm talking about this line:
# POSIX memory-mapped files not currently supported on Windows
Yes my lord, Windows does not have POSIX MM, like many other operating systems, but Windows has supported memory-mapped files since the stone age. In fact my multithreaded bwa uses them in many places. I guess that without them this bowtie2 won't run as fast, and then they will have proved yet again that Windows is rubbish.
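For reference, a minimal sketch of mapping a file read-only on both platforms might look like the following. This is an illustration under assumptions, not code from the bwa port: the names `mapped_file`, `map_file_ro` and `unmap_file` are hypothetical, error handling is reduced to a single return code, and empty files are simply rejected.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

typedef struct {
    const void *data;   /* start of the mapped bytes */
    size_t size;        /* length of the mapping in bytes */
} mapped_file;

/* Maps the whole file read-only. Returns 0 on success, -1 on failure. */
int map_file_ro(const char *path, mapped_file *mf)
{
#ifdef _WIN32
    HANDLE file, mapping;
    LARGE_INTEGER sz;
    file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return -1;
    if (!GetFileSizeEx(file, &sz) || sz.QuadPart == 0) {
        CloseHandle(file);
        return -1;
    }
    mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(file);      /* the mapping object keeps the file open */
    if (mapping == NULL)
        return -1;
    mf->data = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);   /* the view keeps the mapping alive */
    if (mf->data == NULL)
        return -1;
    mf->size = (size_t)sz.QuadPart;
    return 0;
#else
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);              /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    mf->data = p;
    mf->size = (size_t)st.st_size;
    return 0;
#endif
}

void unmap_file(mapped_file *mf)
{
#ifdef _WIN32
    UnmapViewOfFile(mf->data);
#else
    munmap((void *)mf->data, mf->size);
#endif
    mf->data = NULL;
    mf->size = 0;
}
```

The two branches differ in API names but not in shape, which is the point being made about the makefile comment.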

Here are some similar comments from samtools' knetfile.c:
#ifndef _WIN32
/* This function does not work with Windows due to the lack of
 * getaddrinfo() in winsock. 

/* In Unix/Mac, getaddrinfo() is the most convenient way to get
     * server information. */
    if (getaddrinfo(host, port, &hints, &res) != 0) __err_connect("getaddrinfo");

/* A slightly modfied version of the following function also works on
 * Mac (and presummably Linux). However, this function is not stable on
 * my Mac. It sometimes works fine but sometimes does not. Therefore for
 * non-Windows OS, I do not use this one. */
/* Dong Xie, 2012-06-12, this is almost identicial to above code, merge possible*/
As usual, Windows 'happens' to have exactly this function under exactly this name, so the function body after my fix is almost identical to the Unix one, as noted in my comment. You can see all of this on my GitHub.
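A minimal sketch of what a shared code path can look like is below. It is not the code from the port: `resolve_tcp` is a hypothetical helper, and on Windows it assumes linking against Ws2_32 and that the caller eventually balances each `WSAStartup` with `WSACleanup`.

```c
#include <string.h>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>   /* declares getaddrinfo() on Windows */
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#endif

/* Resolves host:port for a TCP connection. Returns 0 on success and
 * -1 on failure. The caller releases *res with freeaddrinfo(). */
int resolve_tcp(const char *host, const char *port, struct addrinfo **res)
{
    struct addrinfo hints;
#ifdef _WIN32
    /* Winsock must be initialised before any socket call. */
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return -1;
#endif
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;       /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;   /* TCP */
    return getaddrinfo(host, port, &hints, res) != 0 ? -1 : 0;
}
```

Apart from the headers and the Winsock initialisation, the call itself is the same on both platforms.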

BWA never tried to be nice to Windows. samtools tried, but it is full of things like this; they even have a warning message if you dare to compile under Cygwin or MinGW, which I deleted from my port. It reads something like:
Note: The Windows version of SAMtools is mainly designed for read-only operations, such as viewing the alignments and generating the pileup. Binary files generated by the Windows version may be buggy.
Best,

dong
Feb 7, 2013 at 12:35 AM
Ha, never noticed the extra word in the link, but indeed there it is! I think I would like to learn more about porting between Unix and Windows, so I will probably check out how you altered all that code soon. I have to say, most times I find problems with code that is specialized for one operating system, it's almost always because the authors took some shortcut or made some assumption that is generally unsafe (sloppy use of shared file locks, assumed position of the program on the path, etc.). It's pretty annoying that it's 2012 and code still isn't cross-platform compatible due to even simple things like the direction of a slash.
Oct 19, 2013 at 2:37 PM
Hey Dong,

What are you up to these days? Hope all is going well.

Wanted to let you know that I wrote a wrapper around the BWA-MEM API that exposes it as .NET classes.

https://github.com/evolvedmicrobe/BWA-Sharp

Right now it only works on Linux because of the BWA compile issues, but it should have all the basics if BWA-MEM can ever be compiled on Windows as well.

Cheers,
N