4. Scaffold

Con los contigs generados en el ensamble de lecturas, se pude armar un esqueleto del genoma (scaffold) usando un genoma de referencia que este terminado o cerrado.

Esto se puede hacer con dos aplicaciones.

Scaffold builder

Existe también el programa Scaffold Builder para ensambles en los cuales los gaps se rellenan con Ns. Puede usarse en una página web o bien por comandos desde la terminal. Se necesitan los contigs ensamblados en formato fasta y la cepa de referencia también en formato fasta. Si la cepa de referencia esta en formato GenBank es necesario convertirla con GB2fasta:

$ GB2fasta.pl reference.gbk reference.fasta

Scaffold builder es un programa de python

$ python /usr/bin/scaffold_builder.py -q query_contigs.fna -r reference.fna -p sb

Medusa

Medusa tiene una página web donde se pueden subir los archivos, o bien correrla en la terminal (ver abajo requisitos). Medusa realiza el scaffold basándose en uno mas genomas completos que deben estar en un folder en formato fasta.

$ java -jar /opt/medusa/medusa.jar -f /reference_genomes/ -i contigs.fna -v -o scaffold.fna

The following inputs are required:

  • The targetGenome file: a draft genome in fasta format. This is the genome you are interested in scaffolding.
  • An arbitrary long list of auxiliaryDraft files: other draft genomes in fasta format. The closest these organisms are related to the target, the better the results will be. This files are expected to be collected in a specific directory. It is possible to specify the path to the directory, see the command "-f" in the next section.

The following output files will be produced:

  • targetGenome_SUMMARY: a textual file containing information about your data. Number of scaffolds, N50 value etc..
  • targetGenomeScaffold.fasta: a fasta file with the sequences grouped in scaffolds. Contigs in the same scaffolds are separated by 100 Ns by default, or a variable number of Ns (estimate of the distance between the contigs), if the option "-d" is used.

The project folder must contain:

- the *targetGenome* in fasta format.

- the medusa.jar file

- the scripts sub-folder “medusa_scripts”.

- the comparison genomes sub-folder “drafts”. (In alternative you can

specify another path for this folder usinf the "-f" option)

Medusa can be run with the following parameters:

1. The option *-i* is required and indicates the name of the target

genome file.

2. The option *-o* is optional and indicates the name of output fasta

file.

3. The option *-v* (recommended) print on console the information given

by the package MUMmer. This option is strongly suggested to

understand if MUMmer is not running properly.

4. The option *-f* is optional and indicates the path to the comparison

drafts folder.

5. The option *-random* is available (not required). This option allows

the user to run a given number of cleaning rounds and keep the best

solution. Since the variability is small, 5 rounds are

usually sufficient to find the best score.

6. The option *-w2* is optional and allows for a sequence similarity

based weighting scheme. Using a different weighting scheme may lead

to better results.

7. The option *-d* allows for the estimation of the distance between pairs of contigs based on the reference genome(s):

in this case the scaffolded contigs will be separated by a number of N characters equal to this estimate.

The estimated distances are also saved in the "*_distanceTable" file.

By default the scaffolded contigs are separated by 100 Ns.

8. The *-gexf* is optional. With this option the gexf format of the contig network and

the path cover are porvided.

9. The option *-n50* allows the calculation of the N50 statistic on a FASTA file.

In this case the usage is the following: java -jar medusa.jar -n50 <name_of_the_fasta>

All the other options will be ignored.

10. Finally the *-h* option provides a small recap of the previous ones.