JEONG Haeyoung's blog: Splitting FASTA files

Monday, July 12, 2021

Splitting FASTA files - Jim Kent's faSplit

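There are so many utilities for manipulating FASTX files that I sometimes feel it would be easier to write a new one than to track down the command that does what I want. This time I found that nucmer has trouble handling a FASTA file larger than 6 G (gigabytes, not gigabasepairs), so I went looking for a utility that someone else had already written to split it into a few dozen reasonably sized pieces. From the following post I learned about faSplit, written by the renowned computational biologist Jim Kent, so I installed it and tried it out.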

[Biostars] How to divide FASTA file?

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each, 
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.

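faSplit can of course split a multi-FASTA file into individual sequences, and it can also cut a large sequence into pieces of an arbitrary length (in bp or in bytes). cut, a general utility included with Linux, can do something similar, but it does not handle sequence boundaries intelligently. The usefulness of cut lies elsewhere: for example, in its ability to split a text file by column.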

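If you reshape a FASTA file so that each sequence sits on a single line, even fairly complex operations become possible with general utilities such as cut. Doing that reshaping will require awk or sed.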
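As a minimal sketch of that reshaping step, the awk one-liner below joins the wrapped sequence lines of each record so that every record becomes a header line followed by one sequence line. The function name linearize_fasta is hypothetical and not part of faSplit or any other package; the sketch assumes plain FASTA input with Unix line endings and headers that start with `>`.

```shell
# Hypothetical helper (not part of faSplit): put each FASTA sequence on one line.
# Reads the files given as arguments, or standard input when none are given.
linearize_fasta() {
    awk '
        # Header line: flush the sequence collected so far, then print the header.
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
        # Any other line: append it to the sequence of the current record.
        { seq = seq $0 }
        # End of input: flush the last sequence.
        END { if (seq != "") print seq }
    ' "$@"
}
```

It could be used, for example, as `linearize_fasta input.fa > input.oneline.fa`, where both file names are placeholders. After that, every even-numbered line holds a complete sequence, which is the form that line-oriented tools such as cut can work with.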

If I did not leave a record like this on the blog, I would not even remember that I had once installed and used a useful tool called faSplit!