You may know that Windows (more precisely, its C runtime library) does file I/O in two modes, text and binary, while on Linux there is no difference. As a result, a text file on Windows uses the two-character newline CR + LF (0x0D 0x0A, or \r\n), whereas on Linux it is a single LF (0x0A, or \n).
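
To make the difference concrete, here is a minimal Python sketch that writes the same lines with each newline convention and returns the raw bytes. The helper name `encode_lines` is made up for illustration; the `newline` argument of `io.TextIOWrapper` stands in for what the Windows C runtime does when a file is opened in text mode, so the sketch behaves the same on any platform.

```python
import io

def encode_lines(lines, newline):
    """Write each line followed by '\n' through a text layer that
    translates '\n' into `newline`, and return the resulting bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="ascii", newline=newline)
    for line in lines:
        text.write(line + "\n")
    text.flush()
    data = raw.getvalue()
    text.detach()  # release the wrapper without closing the byte buffer
    return data

if __name__ == "__main__":
    lines = ["@HD\tVN:1.0", "@SQ\tSN:ref\tLN:45"]
    windows = encode_lines(lines, "\r\n")  # Windows text-mode behaviour
    linux = encode_lines(lines, "\n")      # Linux: bytes written as-is
    print(windows)
    print(linux)
    # One extra byte per line in the Windows-style output.
    print(len(windows) - len(linux))
```

The size difference is exactly one byte per line, which is the "little bigger" cost discussed next.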

BWA reads and writes both text and binary files. When I was working on the port of BWA, I had a hard decision to make: should BWA output text files in Linux format or in Windows format? In Linux format, the output is identical to the output of a Linux run, but many Windows programs will not display it properly, e.g. Notepad :) In Windows format, our files are slightly larger than the Linux output (one extra character per line of text), and it affects how samtools works:

On Linux, if you try

   samtools view -bSu toy.sam -o toy.bam
   samtools view -h toy.bam -o toy.sam.back

toy.sam and toy.sam.back will be identical (if toy.sam was made on Linux). But if you give it a toy.sam made on Windows, the converted-back file is no longer the same: if you look at it in hex mode, you'll notice that all header lines end with \r\n, while the sequence lines that follow end with \n, which means the file is now a mix of the two styles. This is by design: samtools never expected to encounter a SAM file with Windows newlines. It parses the header lines by looking for \n and saves everything else untouched into the BAM, so the \r is spared the axe.
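
If you would rather not read a hex dump, a short Python sketch like the one below can report whether a file mixes the two styles. The function name `count_line_endings` and the sample bytes are my own illustration, not anything from samtools; for a real file you would pass in `open(path, "rb").read()`.

```python
def count_line_endings(data: bytes) -> dict:
    """Count Windows (CR LF) and Linux (bare LF) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # every LF not preceded by CR
    return {"crlf": crlf, "lf": bare_lf}

if __name__ == "__main__":
    # Made-up sample in the mixed state described above:
    # two header lines ending in CR LF, one alignment line ending in LF.
    mixed = (
        b"@HD\tVN:1.0\r\n"
        b"@SQ\tSN:ref\tLN:45\r\n"
        b"r001\t0\tref\t7\t30\t8M\t*\t0\t0\tTTAGATAA\t*\n"
    )
    print(count_line_endings(mixed))  # {'crlf': 2, 'lf': 1}
```

A file where both counts are non-zero is in the mixed state.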

Yet on Windows the correct way to deal with a text file is to recognize \r\n together, i.e. the \r should not be kept in the BAM; this is how the current version of the port works. On my first try, before I realized this issue, I made a very interesting observation: when I threw a Windows-style toy.sam at samtools and converted it back, the header lines ended with \r\r\n (just for fun, if you convert this SAM into BAM and back again, you'll get \r\r\r\n). The cause is that Windows replaces \n with \r\n when you call the C library code to write lines in text mode.
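
The growing run of \r characters can be reproduced with a toy model. The Python sketch below is not samtools code; `round_trip` is a hypothetical stand-in that assumes only the two behaviours described above: a parser that splits on \n and keeps everything else, and a text-mode writer that turns every \n into \r\n.

```python
def round_trip(data: bytes) -> bytes:
    """Model one SAM -> BAM -> SAM pass through the naive first-try port."""
    # Parser: split on LF only, so any CR stays attached to its line.
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final LF
    # Writer: text mode on Windows emits CR LF for every LF written.
    return b"".join(line + b"\r\n" for line in lines)

if __name__ == "__main__":
    header = b"@HD\tVN:1.0\r\n"   # Windows-style input line
    once = round_trip(header)
    twice = round_trip(once)
    print(once)    # b'@HD\tVN:1.0\r\r\n'
    print(twice)   # b'@HD\tVN:1.0\r\r\r\n'
```

Each pass adds one more \r, because the old one is never stripped on the way in and a new one is always added on the way out.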

Summary for users who are not that interested in the programming details:

  1. BWA x64 accepts text files in either style as input, and always writes text files in Windows style.
  2. SAMTOOLS x64 accepts text files in either style as input, and always writes text files in Windows style.
  3. Feeding Windows-style text files to a Linux run may give unexpected results.
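
For point 3, one possible workaround (my suggestion, not something built into BWA or samtools) is to convert a Windows-style text file to Linux style before using it on Linux, much as the dos2unix utility does. A minimal Python sketch, with the hypothetical helper `normalize_to_lf`:

```python
def normalize_to_lf(data: bytes) -> bytes:
    """Replace every CR LF pair with a single LF; bare LFs are left alone."""
    return data.replace(b"\r\n", b"\n")

if __name__ == "__main__":
    windows_sam = b"@HD\tVN:1.0\r\n@SQ\tSN:ref\tLN:45\r\n"
    print(normalize_to_lf(windows_sam))  # b'@HD\tVN:1.0\n@SQ\tSN:ref\tLN:45\n'
```

Only apply this to text files such as SAM. Binary files such as BAM must never be rewritten this way, since a 0x0D 0x0A pair in them is just data.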

Last edited Jun 1, 2012 at 4:34 PM by xied75, version 2

