4. Scaffold

Con los contigs generados en el ensamble de lecturas, se pude armar un esqueleto del genoma (scaffold) usando o no un genoma de referencia que este terminado o cerrado.

Scaffold SIN un genoma de referencia

BESST

Podemos hacer un scaffold con BESST sin necesidad de un genoma de referencia simplemente mapeando las secuencias a contigs previamente ensamblados. En el servidor Biobacter, este programa esta instalado en el ambiente python conda assemblies por lo que hay que activarlo:

$ conda activate assemblies

Creación de índice con BWA

(assemblies)$ bwa index contigs.fasta -p index

Mapeo de las secuencias limpias al índice previamente creado y conversión de archivos .sam a .bam con samtools

(assemblies)$ bwa mem -t 8 index R1.fastq.gz R2.fastq.gz | samtools view -bS --threads 4 | samtools sort -o sorted.bam --threads 4

bwa mem -t 8 index R1.fastq.gz R2.fastq.gz mapea las secuencias al índice

samtools view -bS --threads 4 convierte los archivos resultantes (.sam) a formato .bam

samtools sort -o sorted.bam --threads4 ordena por tamaño los archivos .bam y los salva como sorted.bam

Indexación de los archivos .bam

(assemblies)$ samtools index sorted.bam

Una vez creados los archivos .bam, ya podemos crear el scaffold con BESST:

(assemblies)$ runBESST -c contigs.fasta -f sorted.bam -orientation fr

Podemos ver el resultado en el directorio BESST_output/pass1, ahí habrá un archivo llamado Scaffolds_pass1.fa. Podemos ver las características de este scaffold con:

$ basic_statistics BESST_output/pass1/Scaffolds_pass1.fa

Scaffold con un genoma de referencia

Scaffold builder

Existe también el programa Scaffold Builder para ensambles en los cuales los gaps se rellenan con Ns. Puede usarse en una página web o bien por comandos desde la terminal. Se necesitan los contigs ensamblados en formato fasta y la cepa de referencia también en formato fasta. Si la cepa de referencia esta en formato GenBank es necesario convertirla con GB2fasta:

$ GB2fasta.pl reference.gbk reference.fasta

Scaffold builder es un programa de python

$ python /usr/bin/scaffold_builder.py -q query_contigs.fna -r reference.fna -p sb

Medusa

Medusa tiene una página web donde se pueden subir los archivos, o bien correrla en la terminal (ver abajo requisitos). Medusa realiza el scaffold basándose en uno mas genomas completos que deben estar en un folder en formato fasta.

$ java -jar /opt/medusa/medusa.jar -f /reference_genomes/ -i contigs.fna -v -o scaffold.fna

The following inputs are required:

The targetGenome file: a draft genome in fasta format. This is the genome you are interested in scaffolding.
An arbitrary long list of auxiliaryDraft files: other draft genomes in fasta format. The closest these organisms are related to the target, the better the results will be. This files are expected to be collected in a specific directory. It is possible to specify the path to the directory, see the command "-f" in the next section.

The following output files will be produced:

targetGenome_SUMMARY: a textual file containing information about your data. Number of scaffolds, N50 value etc..
targetGenomeScaffold.fasta: a fasta file with the sequences grouped in scaffolds. Contigs in the same scaffolds are separated by 100 Ns by default, or a variable number of Ns (estimate of the distance between the contigs), if the option "-d" is used.

The project folder must contain:

- the *targetGenome* in fasta format.

- the medusa.jar file

- the scripts sub-folder “medusa_scripts”.

- the comparison genomes sub-folder “drafts”. (In alternative you can

specify another path for this folder usinf the "-f" option)

Medusa can be run with the following parameters:

1. The option *-i* is required and indicates the name of the target

genome file.

2. The option *-o* is optional and indicates the name of output fasta

file.

3. The option *-v* (recommended) print on console the information given

by the package MUMmer. This option is strongly suggested to

understand if MUMmer is not running properly.

4. The option *-f* is optional and indicates the path to the comparison

drafts folder.

5. The option *-random* is available (not required). This option allows

the user to run a given number of cleaning rounds and keep the best

solution. Since the variability is small, 5 rounds are

usually sufficient to find the best score.

6. The option *-w2* is optional and allows for a sequence similarity

based weighting scheme. Using a different weighting scheme may lead

to better results.

7. The option *-d* allows for the estimation of the distance between pairs of contigs based on the reference genome(s):

in this case the scaffolded contigs will be separated by a number of N characters equal to this estimate.

The estimated distances are also saved in the "*_distanceTable" file.

By default the scaffolded contigs are separated by 100 Ns.

8. The *-gexf* is optional. With this option the gexf format of the contig network and

the path cover are porvided.

9. The option *-n50* allows the calculation of the N50 statistic on a FASTA file.

In this case the usage is the following: java -jar medusa.jar -n50 <name_of_the_fasta>

All the other options will be ignored.

10. Finally the *-h* option provides a small recap of the previous ones.

Page updated

Report abuse