Historical linguistics has no generally accepted methodology for statistically estimating whether the connections it documents between languages are coincidental or statistically significant (likely to reflect historical realities). Even the best current proposals are highly susceptible to errors that lead the researcher to judge languages falsely to be historically connected. I propose several improvements to the statistics of such testing. The new techniques are illustrated with a set of five languages having varying degrees of interrelatedness (English, German, French, Latin, Albanian) and three not believed to be related to that set or to each other (Hawai'ian, Navajo, and Turkish). Statistically, the technique of Ringe (1992) suffers from an invalid use of multiple tests. I develop a single test that uses Monte Carlo techniques for estimating significance. The test takes less than a minute on a personal computer and is conceptually much simpler than traditional parametric statistics. My technique is compatible with a wide range of metrics, and I develop several variants in an attempt to interpret algorithmically the traditional techniques of historical linguistics, which seek to discover recurrent pairings of sounds between semantically matched words in a set of languages. I begin with an implementation of the familiar chi-squared statistic. That approach is satisfactory, but it permits the researcher to consider only one sound in each word. The Monte Carlo technique also permits a simpler, more traditional counting of the recurrent pairings, and with proper scaling that count can be made to work for multiple sounds in a word. Although it is possible to consider all conceivable pairings of sounds, I show that a simple linear alignment is preferable, because universal properties of word length interfere with the goal of finding particular, nonuniversal connections between languages. I also explore the possibilities of comparing words at subsegmental levels.
The greatest problem with the testing is the quality of the data. The tests are easily distorted by loans, recurring etyma, and nonarbitrary vocabulary. I show how prevalent such problems are among the items in the standard Swadesh list of 200 concepts, and introduce some mathematical techniques to help the linguist identify problem areas.
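The general idea of a Monte Carlo significance test over a chi-squared statistic, as summarized above, can be illustrated with a minimal sketch. It is not the implementation developed in this work. It assumes that each language is reduced to one sound per concept (for example, the initial consonant), that the concepts are matched in order across the two lists, and that shuffling one list breaks any historical connection while preserving each language's sound frequencies. The function names `chi_squared` and `monte_carlo_p`, the number of trials, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared over the contingency table of paired sounds."""
    n = len(pairs)
    row = Counter(a for a, _ in pairs)   # sound frequencies in language A
    col = Counter(b for _, b in pairs)   # sound frequencies in language B
    obs = Counter(pairs)                 # observed co-occurrence counts
    total = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            total += (obs[(a, b)] - expected) ** 2 / expected
    return total


def monte_carlo_p(sounds_a, sounds_b, trials=10000, seed=0):
    """Estimate how often a chance pairing scores at least as high as the
    observed pairing, by repeatedly shuffling one wordlist."""
    rng = random.Random(seed)
    observed = chi_squared(list(zip(sounds_a, sounds_b)))
    shuffled = list(sounds_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # destroys the semantic matching
        score = chi_squared(list(zip(sounds_a, shuffled)))
        if score >= observed - 1e-9:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


# Invented toy data, not drawn from any real wordlist: twelve concepts
# whose sounds pair off perfectly between two hypothetical languages.
toy_a = ["p"] * 4 + ["t"] * 4 + ["k"] * 4
toy_b = ["f"] * 4 + ["th"] * 4 + ["h"] * 4
p_value = monte_carlo_p(toy_a, toy_b, trials=2000)
```

Because a single statistic is computed for the whole table and compared against its own shuffled distribution, only one test is performed per language pair. Other metrics could be substituted for `chi_squared` without changing `monte_carlo_p`.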