Generality

We present here a genomic distance calculation that is especially sensitive for bacterial intra-species comparisons. The distance is based on the total length of the maximal unique exact matches (MUMs) of at least 19 nt shared by the two genomes being compared. It is called MUMi, for MUM index, and ranges from 0 for very similar genomes to 1 for very distant genomes.

Description of the formula

MUMs are maximal unique exact matches shared by two sequences. Fast algorithms such as the one implemented in Mummer can compute, in a few seconds, the list of all such matches shared by two genomes, taking into account both the direct and the reverse strand of the target genome. A naïve distance, hereafter called MUMi (for MUM index), can be derived from this MUM list using the formula

MUMi = 1 - Lmum/Lav

where Lmum is the sum of the lengths of all non-overlapping MUMs, and Lav is the average length of the two genomes being compared. Before Lmum is computed, the MUM list is post-processed to remove all overlaps between MUMs.
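As a minimal illustration of the formula, the sketch below computes MUMi from a list of MUM lengths that are assumed to be already free of overlaps. The function name mumi_from_lengths and the toy values are hypothetical and are not taken from any real genome pair.

```python
def mumi_from_lengths(mum_lengths, len_g1, len_g2):
    """Return MUMi = 1 - Lmum / Lav.

    mum_lengths: lengths of the MUMs, assumed to be non-overlapping already.
    len_g1, len_g2: lengths of the two genomes, in nucleotides.
    """
    l_mum = sum(mum_lengths)        # Lmum: total length covered by MUMs
    l_av = (len_g1 + len_g2) / 2    # Lav: average genome length
    return 1 - l_mum / l_av


# Toy example: 800 nt of MUMs shared by two 1000-nt sequences gives a MUMi of about 0.2.
print(mumi_from_lengths([400, 250, 150], 1000, 1000))
```

With no shared MUMs the value is 1, and it approaches 0 as the MUMs cover nearly all of both genomes.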

Generation of the MUM list

For each genome pair, the list of MUMs is generated using the Mummer3 software (http://mummer.sourceforge.net/manual/), with the following options: -mum -b -c -l 19. Option -b recovers MUMs present on both strands of the target sequence, and hence takes DNA inversions into account. Parameter -l sets the minimal length of the MUMs to be detected; for the choice of this value, see Deloger et al. (2009). J. Bacteriol. 191, 91-99.
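The call described above could be scripted roughly as follows. This is only a sketch: it assumes that the Mummer3 executable is installed under the name mummer and is found on the PATH, and that it takes the reference file followed by the query file after the options. The function names and the FASTA file names are hypothetical.

```python
import subprocess


def mummer_command(reference_fasta, query_fasta, min_len=19, executable="mummer"):
    """Build the argument list for the options quoted above (-mum -b -c -l 19).

    The executable name and the reference-then-query argument order are
    assumptions about the local Mummer3 installation.
    """
    return [executable, "-mum", "-b", "-c", "-l", str(min_len),
            reference_fasta, query_fasta]


def run_mummer(reference_fasta, query_fasta, min_len=19):
    """Run Mummer and return its raw text output (the MUM list)."""
    cmd = mummer_command(reference_fasta, query_fasta, min_len)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical file names, shown only to illustrate the resulting command line.
print(" ".join(mummer_command("g1.fasta", "g2.fasta")))
```

Building the command as a list rather than a single shell string avoids quoting problems with file names that contain spaces.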

Removal of overlapping MUMs and MUMi calculation

Description:

Mummer detects maximal exact matches that may not be unique, because the uniqueness criterion is applied independently to the forward and reverse strands of the target genome being compared to the query sequence. This explains the presence of spurious matches that need to be removed or trimmed. The two genomes being compared are hereafter called g1 and g2. The MUMi calculation proceeds in five steps:

1. Remove MUMs whose coordinates on g1 (respectively on g2) are completely included in a larger MUM (this can happen because, in Mummer3, the uniqueness of each MUM is defined with respect to one strand only).

2. Remove MUMs whose coordinates on g1 (respectively on g2) are completely covered by two neighboring MUMs taken together.

3. Treat the remaining MUMs that partially overlap on g1 (respectively on g2). To do this, MUMs are ordered by their start position on g1 (respectively on g2) and, starting from the last element of the list, each MUM is compared to its neighbor. In case of overlap, the end of the leftward MUM is trimmed, i.e. its end coordinates on both g1 and g2 are shifted so that no overlap remains on g1 (respectively on g2). Then calculate MUMi(g1,g2) with the above formula.

4. Repeat steps 1 to 3 treating g2 before g1, and calculate MUMi(g2,g1).

5. The MUMi value is the average of MUMi(g1,g2) and MUMi(g2,g1).
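The five steps can be sketched in Python as shown below. This is an illustration under stated assumptions, not the original implementation. Each MUM is represented as a hypothetical tuple (start on g1, start on g2, length) and is treated as a forward-strand match, so the handling of reverse-strand coordinates is left out. Each step is applied to the first genome and then to the second, which is one possible reading of "respectively". All function names are hypothetical, and step 1 is written as a simple quadratic scan for clarity.

```python
def span(mum, axis):
    """Half-open interval covered by a MUM on genome 0 (g1) or 1 (g2)."""
    return mum[axis], mum[axis] + mum[2]


def drop_contained(mums, axis):
    """Step 1: drop MUMs lying entirely inside a strictly larger MUM."""
    kept = []
    for m in mums:
        s, e = span(m, axis)
        inside = any(o[2] > m[2] and span(o, axis)[0] <= s and e <= span(o, axis)[1]
                     for o in mums)
        if not inside:
            kept.append(m)
    return kept


def drop_covered(mums, axis):
    """Step 2: drop MUMs fully covered by their two neighbours together."""
    ordered = sorted(mums, key=lambda m: span(m, axis))
    kept = []
    for i, m in enumerate(ordered):
        if kept and i + 1 < len(ordered):
            prev_end = span(kept[-1], axis)[1]
            next_start, next_end = span(ordered[i + 1], axis)
            if prev_end >= next_start and next_end >= span(m, axis)[1]:
                continue  # previous and next MUM leave no part of m uncovered
        kept.append(m)
    return kept


def trim_overlaps(mums, axis):
    """Step 3: from the last MUM backwards, trim the end of the leftward MUM."""
    ordered = sorted(mums, key=lambda m: span(m, axis))
    for i in range(len(ordered) - 1, 0, -1):
        right_start = span(ordered[i], axis)[0]
        s1, s2, length = ordered[i - 1]
        overlap = span(ordered[i - 1], axis)[1] - right_start
        if overlap > 0:
            # Shortening the length shifts the end on both genomes at once.
            ordered[i - 1] = (s1, s2, length - overlap)
    return [m for m in ordered if m[2] > 0]


STEPS = (drop_contained, drop_covered, trim_overlaps)


def mumi_one_way(mums, len_g1, len_g2, order):
    """Steps 1 to 3 in the given genome order, then MUMi = 1 - Lmum/Lav."""
    for step in STEPS:
        for axis in order:
            mums = step(mums, axis)
    l_mum = sum(m[2] for m in mums)
    return 1 - l_mum / ((len_g1 + len_g2) / 2)


def mumi_symmetric(mums, len_g1, len_g2):
    """Steps 4 and 5: average of the g1-first and g2-first values."""
    g1_first = mumi_one_way(mums, len_g1, len_g2, (0, 1))
    g2_first = mumi_one_way(mums, len_g1, len_g2, (1, 0))
    return (g1_first + g2_first) / 2


# Toy MUM list: the second MUM is nested in the first, the third overlaps it by 10 nt.
toy = [(0, 0, 100), (10, 10, 20), (90, 200, 50)]
print(mumi_symmetric(toy, 300, 300))
```

In this toy case the nested MUM is removed, the first MUM is trimmed from 100 to 90 nt, and Lmum is 140 nt for both processing orders.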