Pancore subsets

Subset genes from pan- and core-genome plot

The analysis of "specific" genes is performed on the clustering results from the pan- and core-genome calculations. The analysis will output the actual sequences, but for the purpose of this workshop you will only look at the sizes of the subsets. The procedure is based on mathematical set theory and works with intersections, unions and complementary data sets.

Each genome is treated as a set, and the intersection (-i) is the set of gene families that two or more genomes have "in common". The intersection of genomes A and B is the set of all gene families found in both A and B. The union of two or more sets is the set of gene families found in at least one of them, for example in A, in B, or in both.
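As a minimal sketch of these two operations, the Python snippet below models each genome as a set of gene-family identifiers. The family IDs (fam1, fam2, ...) are made up for illustration and are not pancoreplot output.

```python
# Toy data: each genome is a set of gene-family IDs (hypothetical values).
genome_a = {"fam1", "fam2", "fam3", "fam4"}
genome_b = {"fam2", "fam3", "fam5"}

# Intersection: families found in both A and B.
core_ab = genome_a & genome_b

# Union: families found in A, in B, or in both.
pan_ab = genome_a | genome_b

print(sorted(core_ab))  # ['fam2', 'fam3']
print(sorted(pan_ab))   # ['fam1', 'fam2', 'fam3', 'fam4', 'fam5']
```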

Calculating the complementary families of a genome refers to the set of all families which are members of A but not members of B. In comparative genomic analysis, the sets usually consist of more than one genome, such as the intersection of genomes A, B and C while not found (complementary, -c) in genomes D, E and F. This gives the families that are found in all of A, B and C but not in any of D, E or F.
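The A, B, C versus D, E, F case can be sketched in Python as an intersection followed by a set difference. The genome contents and the helper names `intersection` and `union` are hypothetical and only serve to illustrate the set logic, not how the script is implemented.

```python
# Hypothetical gene-family content of six genomes.
genomes = {
    "A": {"fam1", "fam2", "fam3", "fam6"},
    "B": {"fam1", "fam2", "fam3", "fam4"},
    "C": {"fam1", "fam2", "fam3", "fam5"},
    "D": {"fam1", "fam4"},
    "E": {"fam5", "fam7"},
    "F": {"fam6", "fam8"},
}

def intersection(names):
    """Families present in every listed genome."""
    return set.intersection(*(genomes[n] for n in names))

def union(names):
    """Families present in at least one listed genome."""
    return set.union(*(genomes[n] for n in names))

# Families shared by A, B and C, minus anything seen in D, E or F.
specific_abc = intersection(["A", "B", "C"]) - union(["D", "E", "F"])
print(sorted(specific_abc))  # ['fam2', 'fam3']
```

Here fam1 is shared by A, B and C but is dropped because it also occurs in D.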

It is also possible to select the families that are found in A, B and C but not in the intersection of D, E and F. This is referred to as the "compinter" (-ci). A family is then excluded only if it is present in all of D, E and F.
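The difference between the complement and the compinter is easiest to see side by side. The sketch below uses made-up family IDs and plain Python sets; it illustrates the set logic only.

```python
# Hypothetical families shared by genomes A, B and C.
core_abc = {"fam1", "fam2", "fam3"}

# Hypothetical gene-family content of genomes D, E and F.
genome_d = {"fam1", "fam2", "fam4"}
genome_e = {"fam1", "fam5"}
genome_f = {"fam1", "fam3", "fam6"}

# Complement (-c): drop families found in ANY of D, E or F.
minus_any = core_abc - (genome_d | genome_e | genome_f)

# Compinter (-ci): drop only families found in ALL of D, E and F.
minus_all = core_abc - (genome_d & genome_e & genome_f)

print(sorted(minus_any))  # []
print(sorted(minus_all))  # ['fam2', 'fam3']
```

The compinter is the less strict filter, so its result is always at least as large as that of the complement.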

The last option is to extract the union (-u) of gene families, that is, all families found in at least one genome of a set. This procedure outputs the gene families that are in common or complementary between the genomes of a pan- and core-genome plot.

The <directory> argument must be a temporary directory as created by the pancoreplot script (-keep blastOutPut).

# Syntax:

$ pancoreplot_subsets <-option> <value> <directory>

Examples

# Gives you the core gene families of genomes 1, 2, 3, 5, 6, and 7.

$ pancoreplot_subsets -i 1:3,5:7 blastOutPut

# Example output:

Calculating the intersection between genomes

Extracting the gene sequences from the data

Outputting 1582 gene sequences

# Gives the core gene families of all genomes in the set. This is the default if no options are given.

$ pancoreplot_subsets -i 1: blastOutPut

# Gives the core gene families of genomes 1, 3, 4 and 5 that are not present in any of the genomes from 6 through the last.

# In this command, genome 2 is not considered at all.

$ pancoreplot_subsets -i 1,3:5 -c 6: blastOutPut
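The genome selections used in these commands can be read as comma-separated numbers and ranges. The Python sketch below expands them the way the example comments describe. It assumes 1-based numbering, inclusive ranges, and that an open range such as 6: runs to the last genome; `expand_selection` is a hypothetical helper, not part of pancoreplot_subsets.

```python
def expand_selection(spec, n_genomes):
    """Expand a selection such as '1,3:5' or '6:' into a list of genome numbers.

    Hypothetical helper: numbering is 1-based, ranges are inclusive, and a
    range with no end (e.g. '6:') runs to the last genome.
    """
    selected = []
    for part in spec.split(","):
        if ":" in part:
            start, end = part.split(":")
            last = int(end) if end else n_genomes
            selected.extend(range(int(start), last + 1))
        else:
            selected.append(int(part))
    return selected

# Assuming a data set of 10 genomes:
print(expand_selection("1:3,5:7", 10))  # [1, 2, 3, 5, 6, 7]
print(expand_selection("1,3:5", 10))    # [1, 3, 4, 5]
print(expand_selection("6:", 10))       # [6, 7, 8, 9, 10]
```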

# Gives the core-genome of organisms 1, 2 and 3 as well as the core-genome of 5, 6 and 7.

# This is a larger set, different from '-i 1:3,5:7' above.

$ pancoreplot_subsets -i 1:3 -i 5:7 blastOutPut
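Why two -i options give a larger set than a single one can be sketched with Python sets. This assumes that repeated -i options are reported together as the union of the separate core-genomes, which is how the comment above reads. The genome contents and the `core` helper are hypothetical.

```python
# Hypothetical gene-family content of genomes 1-3 and 5-7.
genomes = {
    1: {"fam1", "fam2", "fam3"},
    2: {"fam1", "fam2", "fam4"},
    3: {"fam1", "fam2", "fam5"},
    5: {"fam1", "fam6", "fam7"},
    6: {"fam1", "fam6", "fam8"},
    7: {"fam1", "fam6", "fam9"},
}

def core(numbers):
    """Families present in every listed genome."""
    return set.intersection(*(genomes[n] for n in numbers))

# '-i 1:3,5:7': one core-genome over all six genomes.
single_core = core([1, 2, 3, 5, 6, 7])

# '-i 1:3 -i 5:7': two separate core-genomes, taken together (assumed union).
two_cores = core([1, 2, 3]) | core([5, 6, 7])

print(sorted(single_core))  # ['fam1']
print(sorted(two_cores))    # ['fam1', 'fam2', 'fam6']
```

Any family in the six-genome core is also in both smaller cores, so the single-core result is always contained in the two-core result.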

# Gives the part of the pan-genome of organisms 1 through 5 which is neither in the core-genome of 7, 8 and 9 nor in the core-genome of 8, 9 and 10.

$ pancoreplot_subsets -u 1:5 -ci 7:9 -ci 8:10 blastOutPut

# To output actual sequences, use -p 1 and redirect the output into a file:

$ pancoreplot_subsets -p 1 -i 1:3,5:7 blastOutPut > subset.fsa
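Since the workshop only asks for subset sizes, the number of sequences in such a file can be checked by counting header lines. The sketch below assumes the output is in FASTA format, as the .fsa extension suggests, with one header line starting with ">" per sequence. The function name and the toy records are hypothetical.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))

# Toy FASTA content; with real output, pass an open file handle instead,
# e.g. open("subset.fsa").
example = [
    ">gene_1 (hypothetical header)",
    "ATGGCTAGCTAG",
    ">gene_2 (hypothetical header)",
    "ATGCCCGGGTTT",
    "AACCGGTT",
]
print(count_fasta_records(example))  # 2
```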