[Next-Generation Sequencing] Mapping과 관련한 수치 구하기

SAM/BAM 파일에서 mapping과 관련한 수치를 뽑는 방법을 one-liner 철학에 입각하여 정리해 보도록 하겠다. 물론 다른 프로그램을 사용하면 좀 더 고차원적인 수치를 뽑을 수도 있을 것이다. 이 글을 쓰면서 추구하는 것은 되도록 적은 수의 기본 프로그램과 UNIX/LINUX에 설치된 유틸리티를 최대한 활용하여 (옵션이 길어짐은 피할 수 없음) 원하는 결과를 얻는 것이다. 이 문서를 작성하면서 인터넷 상의 여러 정보를 참조하였지만 특히 GitHub의 SAM and BAM filtering oneliner에서 많은 도움을 얻었다.

오늘의 작업을 위해서 쓰일 가장 핵심적인 도구는 samtools임은 부정할 수 없다. samtools는 1.x 버전으로 업그레이드되면서 공식 웹사이트도 변경되었음을 먼저 알아두도록 하자.

SAMtools 0.1.19까지 (링크)
SAMtools 1.x부터 (링크 http://htslib.org/)

1. 가장 일반적인 mapping statistics 출력하기(samtools flagstat)

$ samtools flagstat BL21.bam
869658 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
842935 + 0 mapped (96.93%:-nan%)
869658 + 0 paired in sequencing
434829 + 0 read1
434829 + 0 read2
791630 + 0 properly paired (91.03%:-nan%)
831272 + 0 with itself and mate mapped
11663 + 0 singletons (1.34%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Mapping statistics 수치를 뽑는 가장 간단한 방법은 samtools flagstat를 이용하는 것이다. 위에 든 사례(E. coli BL21)는 duplicate 하나 없이 너무나 깔끔하게 mapping이 된 결과물이라서 별로 공부할만한 것이 없다. IS가 잔뜩 박혀있는 Shigella boydii의 genome 서열에 대한 일루미나 매핑 결과물을 가지고 다시 분석을 해 보자. Bowtie2를 사용하되 시간을 줄이기 위하여 100만개의 read(50만 read pair)로 입력 데이터를 제한하였다.

$ bowtie2 -x SB_ref -1 SBpart_1.fastq -2 SBpart_2.fastq -S SB.sam
500000 reads; of these:
500000 (100.00%) were paired; of these:
55609 (11.12%) aligned concordantly 0 times
402919 (80.58%) aligned concordantly exactly 1 time
41472 (8.29%) aligned concordantly >1 times
----
55609 pairs aligned concordantly 0 times; of these:
11135 (20.02%) aligned discordantly 1 time
----
44474 pairs aligned 0 times concordantly or discordantly; of these:
88948 mates make up the pairs; of these:
72973 (82.04%) aligned 0 times
7334 (8.25%) aligned exactly 1 time
8641 (9.71%) aligned >1 times
92.70% overall alignment rate

bowtie2.2.4 매핑 과정에서 약간의 수치가 출력된다. 이제 결과물을 BAM 파일로 전환하여 samtools로 다시 수치를 뽑아보자.

$ samtools view -b -S -o SB.bam SB.sam
[samopen] SAM header is present: 1 sequences.
[hyjeong@tube 01_mapping_test]$ /usr/local/apps/a5_miseq_linux_20141120/bin/samtools flagstat SB.bam
1000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
927027 + 0 mapped (92.70%:-nan%)
1000000 + 0 paired in sequencing
500000 + 0 read1
500000 + 0 read2
888782 + 0 properly paired (88.88%:-nan%)
919416 + 0 with itself and mate mapped
7611 + 0 singletons (0.76%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

어라? duplicate가 하나도 나타나지 않는다. 같은 데이터를 가지고 bwa 0.6.2-r126을 돌려 보았지만 mapped read의 수치가 아주 조금 줄어들었고 이번 포스팅에서 설명을 하려 했었던 duplicates 수치가 역시 하나도 나오지 않았다. "duplicates" 수치가 의미하는 것은 read가 반복하여 매핑된 횟수들의 합을 의미한다.

2. Unmapped reads

이제부터는 alignment 파일(SAM or BAM) 내에 있는 FLAG를 참조하여 적절히 필터를 하는 기법이 필요하다. samtools view 명령으로 특정 조건에 맞는 read alignment만을 출력하거나(-f INT) 혹은 그 조건에 맞는 것을 걸러서 제거하여 출력하는 것(-F INT)이 가능하다. 예를 들어 0x04(0x가 앞에 붙은 것은 16진수를 의미한다. 따라서 0x04는 십진수의 4와 같다) FLAG은 "segment unmapped"를 의미한다. 다음의 두 명령은 전혀 다른 수치를 출력한다.

samtools view -f 0x04 -c align.bam # unmapped read의 수
samtools view -F 0x04 -c align.bam # mapped read의 수

좀 더 정확히 말하자면 mapping이 일어난 경우 mapped read의 수라기보다는 mapped position이라고 하는 것이 나을 것이다. 하나의 read가 reference genome 내의 여러 위치에 정렬하는 경우 SAM/BAM 파일 내에 각각 별개의 레코드로 존재하게 되기에 그렇다. -c는 matching record의 카운트만을 출력하는 것이다. 따라서 -b -o out.bam이라고 하면 read/alignment를 BAM 파일로 출력함을 의미한다. 또한 -f 0x04는 -f4 -f0x04 등의 형태로 사용함을 허락한다.

3. Uniquely mapped read의 수 세기(이 항목은 잘못되었음 2019-08-06)

samtools view -F 0x40 aligned_reads.bam | cut -f1 | sort | uniq | wc -l
samtools view -f 0x40 -F 0x4 aligned_reads.bam | cut -f1 | sort | uniq | wc -l #left mate
samtools view -f 0x80 -F 0x4 aligned_reads.bam | cut -f1 | sort | uniq | wc -l #right mate

4. Paired end read의 정렬 결과로부터 insert size 추정하기

이것은 one-liner로서는 뽑기가 쉽지 않다. 다행히 getinsertsize.py(GitHub link)라는 훌륭한 공개 파이썬 스크립트가 있다. 사용법은 스크립트 저자의 블로그(2012년 4월 6일 포스팅)에 잘 설명되어 있다. SAM/BAM 파일을 표준 입력으로 받을 수 있어서 매우 깔끔한 처리가 가능하다. getinsertsize.py SAM_file 또는 samtools view BAM_file | getinsertsize.py - 어느 것이든 실행 가능하다.

$ samtools view aln-pe.bam | getinsertsize.py -
1M... # read의 전체 수를 의미하는 것으로 생각된다
Read length: mean 101.0, STD=0.0
Read span: mean 394.062581504, STD=31.8156907927

5. Coverage 관련 수치 구하기

Reference genome 전체에 대한 평균 coverage는 얼마일까? zero coverage region의 위치를 추출할 수는 없을까? 솔직히 말해서 one-liner로 이런 수치를 계산한다는 것은 다소 무리가 아닐까 싶다. 구글에서 bam sam read depth coverage라고 검색을 해 보면 Biostars와 SEQanswers에 이에 대한 몇 가지 질문과 답이 있으니 참고하기 바란다.

정해영의 블로그 - JEONG Haeyoung's blog

2015년 9월 14일 월요일