A pipeline for binning metagenomic sequences based on KDE and a normal mixture model
A typical metagenomics analysis includes sampling, sequencing, data cleaning, assembly and gene calling. Because of the incomplete nature of sampling and internal biases in sequencing, it is difficult to recover the full genome of each organism from metagenomic data. Binning provides a necessary complement in practice.
In metagenomics, binning is the process of grouping raw reads or assembled contigs and assigning them to operational taxonomic units. Here we bin the contigs assembled from the raw reads. There are two basic strategies for binning sequences: one is based on composition features, the other on sequence similarity. We present an effective binning approach that combines the two strategies above. In addition, an operable and detailed step-by-step guide is provided.
- Step 1. Collect important feature values
- 1-1. Trimming
- 1-2. De novo Assembly
- 1-3. Extraction of selected feature data
- Step 2. Filter the metagenomic assembly
- Step 3. Binning the metagenomic assembly using selected features
- 3-1. Compute the KDE using the weighted data and create contour graph
- 3-2. Determine the initial group of data based on different level values of contour
- 3-3. Fit normal mixture model for the assembly
- 3-4. Validate each genome bin
- Step 4. Taxonomy assignment and functional analysis for each genome bin.
- 4-1. Taxonomy analysis using MEGAN
- 4-2. Phylogenetic analysis based on 16S rRNA sequences
- 4-3. Clusters of Orthologous Groups (COG) analysis
- 4-4. Pathway analysis using GhostKOALA service
Step 1. Collect important feature values
After getting the sequencing data, we can utilize many open-source tools to analyze it, including trimming, assembly and annotation. The following steps are for reference only.
1-1. Trimming
Although there are many useful toolkits for quality control, we choose Trimmomatic to clean our sequencing datasets. The detailed parameters are as follows:
java -jar trimmomatic-0.32.jar -threads 8 -phred33 -trimlog PE_trim.log \
PE_1.fastq PE_2.fastq \
PE_1_trimmed_paired.fq PE_1_trimmed_unpaired.fq \
PE_2_trimmed_paired.fq PE_2_trimmed_unpaired.fq \
ILLUMINACLIP:$HOME/Software/Trimmomatic-0.32/adapters/Index/TruSeq2-PE-GCCAAT.fa:2:30:10:1:true \
LEADING:3 TRAILING:3 MAXINFO:40:0.4 MINLEN:20
The dataset generated from the mate-pair library was trimmed with a similar command.
1-2. De novo Assembly
There are two basic types of sequence assembly: the overlap/layout/consensus approach and the de Bruijn graph approach. Because of the characteristics of next-generation sequencing platforms and the incompleteness of reference genomes, de novo assembly is appropriate for metagenomic data. Based on our experience, we use the SPAdes genome assembler for our data. Execute SPAdes with the following command:
spades.py --dataset data.yaml -t 8 -o ./ -k 21,35,51,69 --careful
Before using the command above, you need to create your data.yaml file as described in the SPAdes online manual.
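As a minimal sketch (not from the original protocol), a data.yaml describing a single paired-end library of the trimmed reads from step 1-1 could look like the following; adapt the file names and add further libraries as described in the SPAdes manual.
[
    {
        orientation: "fr",
        type: "paired-end",
        left reads: ["PE_1_trimmed_paired.fq"],
        right reads: ["PE_2_trimmed_paired.fq"]
    }
]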
1-3. Extraction of selected feature data
In order to bin metagenomic sequences, we have to collect some features of the assembled fragments. For each contig, the GC content, length and coverage are extracted from the assembly file using our R script. The script outputs the file contigs.info, which contains four columns: contig name, contig length, GC content and coverage.
The main analysis of this pipeline uses the R language, so the following code is R code unless otherwise noted. Just make sure you have started R and type the code at the R prompt. Another convenient way to run the code is to use an R GUI such as RStudio.
suppressMessages(library(Biostrings))
# read the assembly and record the name and length of each contig
assembly.file <- file.choose()
ctgsSeq <- readDNAStringSet(assembly.file)
seqsName <- names(ctgsSeq)
ctgs <- data.frame(name= seqsName, stringsAsFactors=FALSE)
ctgs$length <- width(ctgsSeq)
# GC content = (C + G) / (A + C + G + T); columns 1:4 of alphabetFrequency() are A, C, G, T
baseStat <- alphabetFrequency(ctgsSeq)
ctgsGc <- rowSums(baseStat[, 2:3])/rowSums(baseStat[, 1:4])
ctgs$gc <- as.numeric(format(ctgsGc, digits= 6))
The output of some assemblers, such as SPAdes, Velvet and ABySS, provides the coverage value of each contig in the header. If there is no coverage information in the assembly data, we can instead map the reads back to the contigs using BWA or Bowtie to obtain coverage. Here the alignment is done with Bowtie in the Linux environment.
Note: the k-mer coverage is different from the read coverage (or nucleotide coverage), but the two are closely related.
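For intuition, the read coverage can be approximated from the k-mer coverage as C_read ≈ C_kmer * L / (L - k + 1), where L is the read length and k the k-mer size. A small R sketch (the read length and k-mer size below are placeholders and should match your data):
# approximate conversion from k-mer coverage to read (nucleotide) coverage
kmerToReadCov <- function(kmer_cov, read_len = 100, k = 51) {
  kmer_cov * read_len / (read_len - k + 1)
}
kmerToReadCov(30)  # a k-mer coverage of 30 corresponds to roughly 60x read coverage here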
bowtie-build --offrate 1 assembly.fa assembly.index
bowtie -a -S -v 3 -p 8 -X 400 assembly.index -1 read1.fq -2 read2.fq | samtools view -Sb - > alignment.bam
express -o expr_out assembly.fa alignment.bam
In the R console, we read the eXpress results table to get the coverage information for the assembly.
align.file <- file.choose()
xprs <- read.table(align.file, header= TRUE, as.is= TRUE)
# order both tables by contig name so that the rows correspond one-to-one
xprs <- xprs[order(xprs$target_id), ]
ctgs <- ctgs[order(ctgs$name), ]
# use the square root of FPKM as the coverage feature and drop unmapped contigs
ctgs$coverage <- sqrt(xprs$fpkm)
ctgs <- subset(ctgs, coverage!=0)
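The contigs.info table described above can then be exported with a single call (a convenience line, not part of the original script):
# write the per-contig feature table (name, length, GC content, coverage)
write.table(ctgs, file= "contigs.info", quote= FALSE, sep= "\t", row.names= FALSE)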
Step 2. Filter the metagenomic assembly
During the assembly stage, most assemblers filter the result with some default parameters, while others export everything. Contigs that are very short or have an extremely high coverage are uninformative, because they might be repeats or regions shared by multiple organisms. To reduce the interference of these sequences, we filter the assembly by length and coverage. In addition, the GC content and coverage are transformed to be approximately normal so that they fit the model.
# filter by length
temp <- ctgs
ctgs <- subset(ctgs, length > 2000)
# weight function: repeat each contig's (coverage, gc) pair once per base,
# so that longer contigs contribute more to the kernel density estimate
specialWeightFun <- function(df){
  num <- as.numeric(as.matrix(df[, c("coverage", "gc")]))
  len <- df[, "length"]
  num2 <- rep(num, times= rep(len, 2L))
  dim(num2) <- c(sum(len), 2L)
  return(num2)
}
# weight the data
ctgsWeighted <- specialWeightFun(ctgs)
coverageUpper <- quantile(ctgsWeighted[, 1], 0.99, names= FALSE)
# filter by coverage using the weighted data
ctgsWeighted <- ctgsWeighted[ctgsWeighted[, 1] <= coverageUpper, ]
ctgs <- ctgs[ctgs$coverage <= coverageUpper, ]
# transformation (applied to both the contig table and the weighted matrix
# so that they stay in the same coordinate space)
normalTrans <- function(x) 0.5*log(x/(1-x))
ctgs$gc <- normalTrans(ctgs$gc)
ctgsWeighted[, 2] <- normalTrans(ctgsWeighted[, 2])
Step 3. Binning the metagenomic assembly using selected features
3-1. Compute the KDE using the weighted data and create contour graph
Our method exploits the advantages of a contour plot, which can represent the relationship of three variables in two dimensions. Before drawing the contour plot, we have to compute a 2D kernel density estimate.
library(KernSmooth)
bandwidth <- c(dpik(ctgs$coverage), dpik(ctgs$gc))
dens <- bkde2D(ctgsWeighted[, 1:2], bandwidth= bandwidth, gridsize= c(401L, 401L), truncate= TRUE)
When drawing the contour, you need to adjust the arguments controlling the contour levels in order to obtain an overview of all contigs. A filled contour looks fancier, but it may hide some patterns in the lower-density areas.
contour(dens$x1, dens$x2, dens$fhat, nlevels= 30)
contour(dens$x1, dens$x2, dens$fhat, levels= seq(min(dens$fhat), max(dens$fhat), by= 0.5))
myPalette <- colorRampPalette(c('white', 'red', 'orange'), bias= 2)
filled.contour(dens$x1, dens$x2, dens$fhat, nlevels= 20, color.palette= myPalette)
3-2. Determine the initial group of data based on different level values of contour
With the aid of the contour, you will usually find some patterns in the assembly. Each higher-density area of the contour contains mostly contigs that might derive from an individual organism. Based on this assumption, we select the contigs in every higher-density area.
The process of selecting the contigs in each area may be time-consuming; it depends entirely on the complexity of the metagenomics project. Sometimes you need to combine more than one level value to get the optimal initial group (see the sketch after the code below).
# try some level values manually, and determine the optimal one.
library(sp)
levelValue <- 0.5
contour(dens$x1, dens$x2, dens$fhat, levels= levelValue)
cl <- contourLines(dens$x1, dens$x2, dens$fhat, levels= levelValue)
ctgs$group <- 0
# assign each contig to the high-density polygon that contains it (0 = unassigned)
for( i in 1:length(cl) ){
  gi <- as.logical(point.in.polygon(ctgs$coverage, ctgs$gc, cl[[i]]$x, cl[[i]]$y))
  ctgs[gi, 'group'] <- i
}
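If a single level value is not enough, the following minimal sketch (not part of the original script) combines two level values; the numbers in levelValues are hypothetical and should be tuned by inspecting the contour, with the higher level processed last so that the densest areas take precedence:
levelValues <- c(0.3, 0.8)
ctgs$group <- 0
groupId <- 0
for (lv in levelValues) {
  cl <- contourLines(dens$x1, dens$x2, dens$fhat, levels= lv)
  for (j in seq_along(cl)) {
    gi <- as.logical(point.in.polygon(ctgs$coverage, ctgs$gc, cl[[j]]$x, cl[[j]]$y))
    groupId <- groupId + 1
    ctgs[gi, 'group'] <- groupId
  }
}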
# display the initial groups
library(ggplot2)
(p <- ggplot(data= ctgs, aes(x= coverage, y= gc)))
p + geom_point(aes(color= factor(group)))
3-3. Fit normal mixture model for the assembly
The Gaussian mixture model (GMM) is a powerful model for data clustering. Here we model the metagenomic assembly as a mixture of multiple Gaussian distributions, where each component corresponds to one genome bin. We use the mclust package, which implements the expectation-maximization (EM) algorithm, to estimate the parameters of the GMM.
After finishing the clustering with the GMM, we utilize the essential gene sets to draw some assembly fragments back from the unassigned sequences. The information about the essential genes contained in the assembly is obtained with the shell script cal_assemlby_ess-gene.sh.
# perform the second binning
library(mclust)
# initial assignment from the contour-based groups
z <- unmap(ctgs$group)
Vinv <- hypvol(ctgs[, c("coverage", "gc")], reciprocal = TRUE)
# weight each contig by its length relative to the longest contig
wt <- ctgs$length / max(ctgs$length)
fit.weighted <- me.weighted("VVV", ctgs[, c("coverage", "gc")], z = z, weights = wt, Vinv = Vinv)
# hard classification; subtract 1 so the labels match the original group numbering
ctgs$class <- map(fit.weighted$z)-1
# number of contigs and total length (Mbp) per bin
table(ctgs$class)
tapply(ctgs$length, ctgs$class, sum)/10^6
# read the alignment info against 109 essential genes and get
# some fragments in group 0 back.
#####
# ./calc_contig_ess_gene_info.sh
#####
ctgs.ess <- read.table("essential_gene.info", header=FALSE, as.is=TRUE)
colnames(ctgs.ess) <- c('count', 'name')
idx <- intersect(ctgs[ctgs$class==0,'name'], ctgs.ess$name)
ctgs0.ess <- ctgs[match(idx, ctgs$name), ]
ctgs0.ess$count <- ctgs.ess[match(idx, ctgs.ess$name), 'count']
# distance from each essential-gene contig in group 0 to the centre of every bin
ctgs0.dist <- matrix(0, nrow=nrow(ctgs0.ess), ncol= length(unique(ctgs$class)))
# rescale gc so that it is comparable with coverage when computing distances
coor.scale <- (max(ctgs$coverage)-min(ctgs$coverage))/(max(ctgs$gc)-min(ctgs$gc))
ctgs0.ess <- transform(ctgs0.ess, gc.s= gc*coor.scale)
ctgs <- transform(ctgs, gc.s= gc*coor.scale)
each.group <- split(ctgs, f= ctgs$class)
for(i in seq_along(ctgs0.ess$name)){
  tmp.num <- lapply(each.group, function(grp){
    # the centre of a bin is its centroid (kmeans with a single centre)
    tmp.cen <- kmeans(grp[, c("coverage", "gc.s")], centers=1, iter.max=100)$centers
    tmp.dist <- dist(matrix(c(tmp.cen, ctgs0.ess$coverage[i], ctgs0.ess$gc.s[i]), nrow=2, byrow=TRUE))
    as.matrix(tmp.dist)[2]
  })
  ctgs0.dist[i, ] <- unlist(tmp.num)
}
# drop the column of class 0 and assign each contig to its nearest bin
ctgs0.dist <- ctgs0.dist[, -1]
ctgs0.ess$class.dis <- apply(ctgs0.dist, 1, which.min)
# display the final binning
cols <- c("C1"="red", "C2"="green", "C3"="blue", "C4"="yellow", "C5"="blueviolet",
"C6"="gold", "C7"="aquamarine", "C8"="darkolivegreen", "C9"="deepskyblue",
"C0"="gray")
p <- ggplot(ctgs, aes(coverage, gc, color = paste0("C", class), size = length)) + geom_point()
(p + scale_color_manual(values= cols) + labs(color= "bin"))
# split the assembly by group and export each bin in FASTA format for the downstream analysis
suppressMessages(library(Biostrings))
ctgsSeq <- readDNAStringSet("assembly.fa", format= 'fasta')
eachGroup<- split(ctgs$name, ctgs$class)
sapply(names(eachGroup), function(x){
  seqs <- ctgsSeq[eachGroup[[x]]]
  writeXStringSet(seqs, filepath= paste0("super_", x, ".fa"))
})
3-4. Validate each genome bin
After binning the assembly, validation needs to be done for each genome bin, including checking its completeness and integrity.
prodigal -a temp_orfs.faa -i assembly.fa -m -o temp_prodigal.stdout -p meta -q
perl -wlan -e 'print $F[0];' temp_orfs.faa > orfs.faa
hmmsearch --tblout temp_hmm_orfs.txt --cut_tc --notextw --cpu 8 essential.hmm orfs.faa > temp_hmm.stdout
perl -wlan -e '/#/ or print "@F[0,3]";' temp_hmm_orfs.txt > assembly_2_ess-gene.txt
The above code is copied from the Supplementary Information of Albertsen et al. Based on the file assembly_2_ess-gene.txt, we can easily infer the completeness of each genome bin. In addition, we provide a small shell script, ess-gene_counts.sh, to compute statistics about the essential genes in each genome bin.
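If you prefer to stay in R, the sketch below is an illustrative alternative to ess-gene_counts.sh. It assumes that assembly_2_ess-gene.txt has two columns (ORF name and HMM accession) and that the Prodigal ORF names are the contig name followed by an underscore and an ORF index:
# read the essential-gene hits and map each ORF back to its contig and bin
ess <- read.table("assembly_2_ess-gene.txt", header= FALSE, as.is= TRUE, col.names= c("orf", "hmm"))
ess$contig <- sub("_[0-9]+$", "", ess$orf)
ess$bin <- ctgs$class[match(ess$contig, ctgs$name)]
# completeness: unique essential genes per bin out of the 109 in the HMM set;
# a duplication value well above 1 hints at contamination or a mixed bin
uniq <- tapply(ess$hmm, ess$bin, function(x) length(unique(x)))
total <- tapply(ess$hmm, ess$bin, length)
data.frame(bin= names(uniq), completeness= round(uniq/109, 2), duplication= round(total/uniq, 2))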
Step 4. Taxonomy assignment and functional analysis for each genome bin
Taxonomic annotation and functional annotation of the genes are performed as follows. First, the predicted proteins of each group were aligned to the NCBI nr database using BLASTP (best hit, E-value < 1e-5). The BLASTP outputs were then filtered (at least 40% amino acid sequence identity and an alignment covering at least 50% of the subject length) and imported into MEGAN5 for taxonomic classification. The detailed parameters of these programs are listed below.
4-1. Taxonomy analysis using MEGAN
# run blastp
blastp -query orfs.faa -db /path/to/local/nr_database/ -out blastp.out -evalue 1e-5 \
-max_target_seqs 1 -num_threads 8 -outfmt \
"6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore slen"
# filter results: keep hits with >= 40% identity and an alignment covering
# >= 50% of the subject length (the output file name is arbitrary)
perl -wlan -e 'BEGIN{$,="\t";} print @F[0..11] if(($F[3]/$F[12]>=0.5) && ($F[2]>=40));' blastp.out > blastp_filtered.out
After preparing the input files, run MEGAN at the command line with proper memory settings. Please refer to the MEGAN5 manual for details.
4-2. Phylogenetic analysis based on 16S rRNA sequences
To check the MEGAN results, the 16S ribosomal RNA genes of each group were predicted with RNAmmer, and the 16S rRNA genes of their closest neighbors were downloaded from NCBI. The alignments and neighbor-joining trees were built in MEGA5 with the default settings and 1,000 bootstrap replicates.
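As an illustrative example (not part of the original protocol), predicting the 16S (SSU) rRNA genes of one bacterial genome bin with RNAmmer could look like this, using the super_<group>.fa files exported in step 3-3:
rnammer -S bac -m ssu -gff super_1_rrna.gff -f super_1_16S.fa super_1.fa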
4-3. Clusters of Orthologous Groups (COG) analysis
The COG analysis is performed with the COG software, following the steps described in the README file distributed with the software.
4-4. Pathway analysis using GhostKOALA service
To further compare the functional potential of each group, the predicted ORFs were analyzed using the GhostKOALA service on the KEGG website. When the results are returned by email, open the links and compare the pathways of interest.