Jugando con #magicblast, el mapeador de lectura corta de #NCBI. Mi cuaderno

NCBI MAGIC Blast fue mencionado recientemente por BioMickWatson en Twitter.

Aquí, jugaré con magicblast y compararé su salida con bwa (Makefile a continuación).

Primero, aquí hay un extracto del manual para explosión mágica.

USAGE
DESCRIPTION
   Short read mapper

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
    * Incompatible with:  sra
 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
   Input format for sequences
   Default = `fasta'
    * Incompatible with:  sra
 -paired
   Input query sequences are paired
 -query_mate <File_In>
   FASTA file with mates for query sequences (if given in another file)
    * Requires:  query
 -sra <String>
   Comma-separated SRA accessions
    * Incompatible with:  query, infmt

 *** Formatting options
 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
   alignment view options:
   sam = SAM format,
   tabular = Tabular format,
   text ASN.1
   Default = `sam'

Indexación de la referencia para magicblast

He indexado el chr22 humano usando el viejo NCBI hacerblastdb:

$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22

Mapeo de lecturas de extremos emparejados con magicblast

Primero, eliminé los sufijos de los nombres de lectura ‘/1’ y ‘/2’ porque todavía aparecen en el bam final.

gunzip -c R1.aa.fq.gz | sed 's//1$$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's//2$$//' | gzip > R2.fq.gz

Luego, las lecturas se mapean con explosión mágica usando el siguiente comando:

magicblast -db chr22 
 -infmt fastq -paired 
 -query R1.fq.gz 
 -query_mate R2.fq.gz |
 sed 's/gnl|BL_ORD_ID|0/chr22/g' |
 samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM

Por lo que pude ver, no había ninguna opción para especificar el grupo de lectura (RG).

Aquí, transformé el nombre del contig porque parece un identificador de identificación de explosión:
También he mapeado las lecturas usando bwa:

bwa mem  chr22.fa R1.fq.gz R2.fq.gz |
 samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM

Encabezados BAM

Aquí está el encabezado BAM para magicblast: Magic blast usa la especificación v1.2 y tiene un indicador de Orden de grupo.

@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz

Y el encabezado BAM para bwa:

@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz

samtools samflags

para explosión mágica:

24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)

para bwa:

24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Validación con picard

La validación de child.magic.bam con picard genera muchos errores:

$ java -jar picard.jar  ValidateSamFile  I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)

mientras que ~no hay error para child.bwa.bam:

$ java -jar picard.jar  ValidateSamFile  I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200

Llamada variante

Ambos BAM están asignados con samtools/vcftools (min DP=10)

samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |
 bcftools filter -e 'DP<10'  --output-type v -o child.bwa.vcf -
samtools mpileup  --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |
 bcftools call --ploidy GRCh38  --multiallelic-caller --variants-only --format-fields GQ,GP - |
 bcftools filter -e 'DP<10'  --output-type v -o child.magic.vcf -

Número de variantes:

$ grep -v "#" child.magic.vcf  | wc -l
111
$ grep -v "#" child.bwa.vcf  | wc -l
166

Aquí está la diferencia para CHROM/POS/REF/ALT:

  • Uniq a BWA: 'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 64 variantes
  • Uniq a la magia: 'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 9 variantes
  • Variantes comunes: 'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l' : 102 variantes

El archivo MAKE

SHELL=/bin/bash
data.dir=$/src/DATA
ncbi.bin=$/packages/magicblast/bin
REF=chr22.fa
samtools.exe=$/packages/samtools/samtools 
bwa.exe=$/packages/bwa-0.7.15/bwa 

all: child.magic.bam child.bwa.bam

R1.fq.gz : $/src/DATA/child.R1.aa.fq.gz 
 gunzip -c $< | sed 's//1$$//' | gzip > [email protected]
R2.fq.gz : $/src/DATA/child.R2.aa.fq.gz 
 gunzip -c $< | sed 's//2$$//' | gzip > [email protected]

child.bwa.bam : $.bwt R1.fq.gz R2.fq.gz 
 $ mem  $ $(word 2,$^) $(word 3,$^) |
 $ sort -o [email protected] -T $(basename [email protected]) --reference $ -O BAM 


child.magic.bam : $.nin R1.fq.gz R2.fq.gz 
 $/magicblast -db $ -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |
 sed 's/gnl|BL_ORD_ID|0/$(basename $)/g' |
 $ sort -o [email protected] -T $(basename [email protected]) --reference $ -O BAM 

$.bwt : $
 $ index $<

$.nin : $
 $/makeblastdb -in $< -dbtype nucl -title $(basename $)

$ :
 wget -O "[email protected]" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/[email protected]"
 gunzip -f [email protected]

Eso es todo,
Pedro



Fuente del artículo

Deja un comentario