Statistically Consistent K-Mer Methods for Phylogenetic Tree Reconstruction

    Seth Sullivant
    Phylogenetic reconstruction algorithms based on k-mers of DNA or protein sequences are nonparametric distance methods that build phylogenetic trees from sequence data without first constructing alignments. These methods are often used to construct the guide tree for multiple sequence alignment. We show that when applied to data generated from a statistical model of sequence evolution, the standard k-mer methods are statistically inconsistent: even with arbitrarily large amounts of data, they will reconstruct the wrong tree. We also show how to derive model-based corrections that make the methods statistically consistent, and report on simulation studies comparing the methods. This is joint work with Elizabeth Allman and John Rhodes.
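
As a rough illustration of what an alignment-free k-mer distance looks like, the sketch below counts the overlapping k-mers in each sequence and takes the squared Euclidean distance between the count vectors. The abstract does not say which k-mer distance is meant by "standard", so this choice of distance, the function names (kmer_counts, kmer_distance, distance_matrix), and the toy sequences are all assumptions made for illustration. This is an uncorrected distance and does not include the model-based corrections described in the talk.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(seq_a, seq_b, k):
    """Squared Euclidean distance between two k-mer count vectors.

    Illustrative, uncorrected distance (an assumption, not the
    corrected distance from the talk).
    """
    a, b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen.
    return sum((a[w] - b[w]) ** 2 for w in a.keys() | b.keys())


def distance_matrix(seqs, k):
    """Pairwise distances for a dict mapping taxon name -> sequence."""
    names = sorted(seqs)
    d = {(x, x): 0 for x in names}
    for x, y in combinations(names, 2):
        d[(x, y)] = d[(y, x)] = kmer_distance(seqs[x], seqs[y], k)
    return d


# Hypothetical toy sequences; no alignment is needed, and the
# sequences may have different lengths.
toy = {"t1": "ACGTACGT", "t2": "ACGTTCGT", "t3": "GGGTACA"}
D = distance_matrix(toy, k=2)
```

A matrix like D could then be handed to any distance-based tree-building method, such as neighbor joining, which is how such distances would typically yield a guide tree.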

