18 February 2010

The Case For The Altaic Language Family


A new analysis makes a solid empirical case that the Turkic, Mongolian, Tungusic, Korean, and Japanese languages are all members of a "genetically related" Altaic language family. It uses a tool called "Consonant Class Matching," which was validated with an analysis of Indo-European and Semitic languages.

There are 30 Turkic languages spoken by about 180 million people, most notably in Turkey and Central Asia. There are at least nine Mongolian languages spoken by 5-6 million people mostly in Mongolia and neighboring parts of China. There are about 75,000 speakers of thirteen Tungusic languages, many of which have only a few hundred or fewer speakers. About 78 million people speak Korean. More than 130 million people speak Japanese or closely related languages.

About Genetic Relationships in Linguistics and Language Families

Languages change over time. Over many centuries, a spoken language typically changes enough to cease to be clearly the same language. For example, most speakers of modern English find it takes effort to read Shakespeare (flourished 1589-1613), and that it is very difficult to read the Middle English original works of writers like Chaucer (flourished late 1300s), and that Old English (ca. 400s to 1100s) is another vaguely familiar looking language.

When groups of people who speak the same language are mostly isolated from each other for an extended period of time, the source language changes in ways that differ for each group. For instance, the common Roman version of Latin spoken in the Roman Empire, which had avoided breaking into mutually unintelligible dialects through political unity, began to diverge when the Western Roman Empire fell (476 AD is a date commonly used as the defining moment). The result was the Romance languages: French, Portuguese, Spanish, Italian, Romanian and many less well known languages that are spoken today.

Some languages have well established "genetic" links. We can show through a combination of historical and linguistic evidence, and sometimes surviving evidence of intermediate versions of the languages, that they are all primarily derived from a common source language (although there may be borrowed words and influences from other languages). Two of the best established language families are Indo-European (which includes Germanic languages, Romance languages and the Sanskrit derived languages of India) and Semitic (which includes both Hebrew and Arabic).

The Evidence For Altaic

Not all languages fit into a consensus language family. There are classification controversies. One such controversy concerns the validity of the hypothesis that the Turkic, Mongolian, Tungusic, Korean, and Japanese languages make up a single Altaic language family. Wikipedia sums up this debate as it stood in 1999:

These language families share numerous characteristics. The debate is over the origin of their similarities. One camp, often called the "Altaicists", views these similarities as arising from common descent from a Proto-Altaic language spoken several thousand years ago. The other camp, often called the "anti-Altaicists", views these similarities as arising from areal interaction between the language groups concerned. Some linguists believe the case for either interpretation is about equally strong; they have been called the "skeptics" (Georg et al. 1999:81).

Another view accepts Altaic as a valid family but includes in it only Turkic, Mongolic, and Tungusic. This view was widespread prior to the 1960s, but has almost no supporters among specialists today (Georg et al. 1999:73–74). The expanded grouping, including Korean and Japanese, came to be known as "Macro-Altaic", leading to the designation by back-formation of the smaller grouping as "Micro-Altaic". Most proponents of Altaic continue to support the inclusion of Korean and Japanese.

Micro-Altaic would include about 66 living languages, to which Macro-Altaic would add Korean, Japanese, and the Ryukyuan languages for a total of about 74. (These are estimates, depending on what is considered a language and what is considered a dialect. They do not include earlier states of language, such as Old Japanese.) Micro-Altaic would have a total of about 348 million speakers today, Macro-Altaic about 558 million.


The new study compared languages using a tool called Consonant Class Matching (CCM). It rates the similarity of two languages with a similarity index (SI): the percentage of words, in a list of one hundred sample words, whose first two consonants fall in the same consonant classes in both languages.
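To make the idea concrete, here is a minimal Python sketch of this kind of comparison. It is an illustration, not the study's actual procedure. The consonant classes in CONSONANT_CLASSES, the use of ordinary spelling instead of a phonetic transcription, the function names, and the rule that a word with fewer than two consonants never counts as a match are all assumptions made for the example.

```python
# Hypothetical consonant classes, keyed by a class label. This inventory is
# an assumption for illustration; the post does not list the study's classes.
CONSONANT_CLASSES = {
    "P": "pbfv",
    "T": "td",
    "S": "szc",
    "K": "kgqx",
    "M": "m",
    "N": "n",
    "R": "rl",
    "W": "w",
    "J": "jy",
    "H": "h",
}
CHAR_TO_CLASS = {
    ch: label for label, chars in CONSONANT_CLASSES.items() for ch in chars
}


def consonant_skeleton(word, length=2):
    """Return the class labels of the first `length` consonants in `word`."""
    labels = [CHAR_TO_CLASS[ch] for ch in word.lower() if ch in CHAR_TO_CLASS]
    return tuple(labels[:length])


def similarity_index(list_a, list_b):
    """Percentage of aligned word pairs whose first two consonant classes match.

    The two lists must hold words for the same meanings in the same order.
    Assumption: a pair only counts when both words have two consonants.
    """
    if len(list_a) != len(list_b):
        raise ValueError("word lists must be aligned and of equal length")
    matches = 0
    for word_a, word_b in zip(list_a, list_b):
        skel_a = consonant_skeleton(word_a)
        skel_b = consonant_skeleton(word_b)
        if len(skel_a) == 2 and skel_a == skel_b:
            matches += 1
    return 100.0 * matches / len(list_a)


# Toy five-word lists (the study uses one hundred words per language).
english = ["father", "water", "night", "hand", "stone"]
german = ["vater", "wasser", "nacht", "hand", "stein"]
print(similarity_index(english, german))
```

In this toy case "father"/"vater", "hand"/"hand" and "stone"/"stein" match on both consonant classes, while "water"/"wasser" and "night"/"nacht" do not under these spelling-based classes.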

The method was validated with Indo-European and Semitic languages.

Applying the procedure to 21 modern Indo-European (IE) languages . . . we find that it reliably identifies such branches as Indic, Slavic, Germanic, and Romance (SIs varying between 45 and 77%, all statistically significant at P < 10^-6). By contrast, similarity between languages belonging to different branches is much lower (between 1 and 21%). A particularly interesting comparison is between Germanic and Indic languages. The SIs are very low, between 1 and 7%. Half of the comparisons are not significant at the 0.05 level, while all but one of the rest are weakly significant at 0.01 < P < 0.05. [Ed. A 1-2% similarity is expected from random chance.]

Both the Indic and the Germanic groups reveal themselves beyond any doubt, while the genetic relation between these two groups is not convincingly demonstrated [with a one on one comparison of the languages]. We recall that the validity of the IE family was originally established not on the basis of modern languages but rather by comparing ancient ones, which are much closer to each other. The results of the CCM method reflect the greater degree of similarity (all comparisons are significant at least at P < 0.02 level, and most at much higher significance levels).

The SI between Old High German and Old Indian, in particular, is 14%. The probability of this overlap happening by chance is vanishingly small (<10^-6). When we apply the CCM approach to several ancient Semitic languages we find that SIs for all comparisons are highly significant (P << 10^-6). . . .

When we apply the CCM method to the proto-languages of four IE branches, we obtain the same pattern as for attested ancient languages. For example, the SI between the Proto-Iranian and the Proto-Germanic languages is 13%. By contrast, in pairwise comparisons between five modern Germanic languages (German, English, Dutch, Icelandic, and Swedish) and two modern Iranian languages (Kurdish and Ossetian) it ranges between 5 and 10% (average = 7%).

Using reconstructed proto-languages can sometimes yield even better results than using attested old languages, as is shown in the Iranian–Germanic comparison. The SIs between Old High German and Avestan or Classical Persian are only 9–10%, whereas the overlap between Proto-Germanic and Proto-Iranian is 13% (and the statistical significance of the result increases by several orders of magnitude). This improvement is at least partially due to the greater age of Proto-Germanic and Proto-Iranian compared with Old High German and Classical Persian respectively.
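The quoted passages report P-values without explaining how they are calculated. One common way to ask whether a given number of matches could arise by chance, swapped in here purely as an illustration and not taken from the study, is a permutation test: shuffle one word list so that meanings no longer line up, count the matches, and repeat many times. The sketch below assumes each word has already been reduced to a consonant class code such as "PT"; the function names and the codes are hypothetical.

```python
import random


def match_count(codes_a, codes_b):
    """Count positions where the two aligned code lists agree."""
    return sum(1 for a, b in zip(codes_a, codes_b) if a == b)


def permutation_p_value(codes_a, codes_b, trials=10_000, seed=0):
    """Estimate how often shuffled lists match at least as well as the real ones.

    Shuffling `codes_b` breaks the link between meaning and form, so the
    shuffled match counts approximate what chance alone would produce.
    """
    rng = random.Random(seed)
    observed = match_count(codes_a, codes_b)
    shuffled = list(codes_b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if match_count(codes_a, shuffled) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_good + 1) / (trials + 1)


# Hypothetical class codes for eight meanings in two languages.
language_a = ["PT", "KR", "MN", "ST", "HN", "WT", "RK", "NS"]
language_b = ["PT", "KR", "MN", "SK", "HN", "WR", "TK", "NS"]
print(match_count(language_a, language_b))
print(permutation_p_value(language_a, language_b))
```

With only a finite number of shuffles, this approach cannot resolve probabilities much smaller than one divided by the number of trials, so very small P-values like those quoted above would need a different calculation.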


So, how do the proposed languages of the Altaic language family fare?

Next, we use the CCM approach to test the reality of the Altaic family. We have four independent reconstructions: Proto-Turkic, Proto-Mongolian, Proto-Tungus, and Proto-Japanese (Korean dialects are too similar to one another to justify a reconstruction of Proto-Korean). We also calculated the degree of similarity between these four languages and Proto-Eskimo, because Mudrak proposed that Eskimo languages are closely related to the Altaic family. The SIs for the four Altaic proto-languages range between 6 and 11% (average = 8.7%). This range of values is lower than that for the IE family. Nevertheless, the significance levels range between 0.01 and less than 10^-5, and this is strong evidence for historical connections among the four linguistic groups.

Note that when we run the test on modern languages, the degree of similarity between them is greatly attenuated. For example, comparing five modern Turkic languages (Turkish, Tatar, Chuvash, Yakut, and Tuvinian) with two modern Japanese ones (Tokyo and Nasa) we detect a statistically significant relationship only in two out of ten cases (P-values are 0.03 and 0.01). The SI between the proto-languages, however, is significant at P < 0.001 level. This is the same pattern that we have already noted in the context of the IE family. Interestingly, we find support for the hypothesis of Mudrak that there is a relationship between Altaic and Eskimo.


So, there is strongly statistically significant evidence that the relationship between the languages is not random. But what if loan words, rather than a genetic relationship, link the languages?

What remains, however, is the second objection: that the proto-languages of these families could have acquired similar lexicons “due to a prolonged history of areal convergence.” One possible response to this alternative explanation is that borrowings into the basic lexicon (100-word lists) are rare. Thus, we expect that languages belonging to different linguistic families will have low SIs, even when they have coexisted in the same region for a long period of time. We test this proposition empirically.

First, we looked at comparisons of languages belonging to different families that were located in spatial proximity: (a) Old Chinese vs. the proto-languages within Altaic; (b) Turkish vs. modern languages of people that inhabited the Ottoman Empire (1378–1914); and (c) Turkish vs. Classical Persian and Arabic. The last comparison is particularly interesting because these three languages have coexisted in close cultural interaction at least since the Seljuk Sultanate (eleventh century), and many educated persons in the Middle East were trilingual.

The SIs . . . are somewhat higher than expected under the null hypothesis: three out of eleven comparisons are significant at 0.05 level, and the maximum SI is 6%. What is important for our purposes, however, is that prolonged contact yields much lower SIs than those observed between proto-languages within Altaic (such as the SIs of 11% observed in comparisons of Proto-Mongolic with Proto-Turkic or Proto-Tungus). This observation is contrary to the hypothesis that the observed similarities between Altaic languages are entirely due to borrowings.

More generally, in the 66 comparisons between Altaic and Semitic languages the SIs ranged between 0 and 5% and there were only two significant P-values (whereas we expect 3.3; . . . ). This pattern is precisely what should happen when languages are so distantly related that most “signal” has been lost and there were no cross-borrowings into the basic lexicon.

In the 363 comparisons between Altaic and IE languages, however, there were 45 significant values (versus the expected 18). There is, thus, evidence for either some limited degree of cross-family borrowing or else deeper genetic connections between the Altaic and Indo-European families, as was proposed by Illich-Svitych in the context of his Nostratic superfamily, or both.

The main point, however, is that the evidence for internal connections between the Altaic languages is orders of magnitude stronger. (To test the superfamily idea properly using CCM it will be necessary to compare the reconstructed proto-languages of Indo-European, Altaic, and so forth.) The maximum observed SI in comparisons of modern languages or proto-languages within Altaic to those within IE was 8% (between Albanian and Nasa, no doubt caused by chance: the bootstrap-estimated probability of getting at least one SI=8% or better in the 363 comparisons is P > 0.7). By contrast, in the comparisons between the proto-languages within Altaic we observe SIs up to 11%. The . . . probability of getting two SIs of 11% in six comparisons . . . by chance is much less than 10^-6. . . .

The evidence for the common origin of the Altaic languages, at least with respect to word-list comparisons, is thus nearly as strong as that for the Indo-European languages. If the Indo-European family is accepted as real, the same conclusion should also apply to the Altaic family.
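The multiple-comparison arithmetic in the quoted passages (3.3 significant results expected by chance in 66 comparisons, 18 expected in 363, and a better than 0.7 chance of at least one SI of 8% somewhere in 363 comparisons) can be checked with a short Python sketch. It assumes a 0.05 significance threshold and, as a simplification that the study does not claim, treats the comparisons as independent even though many of them share a language. The function names are hypothetical.

```python
from math import comb

ALPHA = 0.05  # significance threshold used in the quoted passages


def expected_by_chance(n_comparisons, alpha=ALPHA):
    """Expected number of 'significant' results if nothing real is going on."""
    return n_comparisons * alpha


def binomial_tail(n, k, p=ALPHA):
    """P(X >= k) for X ~ Binomial(n, p), assuming independent comparisons."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def per_comparison_rate(familywise, n):
    """Per-comparison probability p for which 1 - (1 - p)**n equals `familywise`."""
    return 1 - (1 - familywise) ** (1 / n)


# Altaic vs. Semitic: 66 comparisons, 2 significant.
print(expected_by_chance(66), binomial_tail(66, 2))

# Altaic vs. Indo-European: 363 comparisons, 45 significant.
print(expected_by_chance(363), binomial_tail(363, 45))

# How rare must a single SI of 8% be for "at least one in 363" to reach 0.7?
print(per_comparison_rate(0.7, 363))
```

Under these assumptions, two or more significant results out of 66 is unremarkable, while 45 or more out of 363 is very unlikely to be chance alone, which is the contrast the authors draw between the Semitic and Indo-European comparisons. The last figure shows that an event with roughly a one in three hundred chance per comparison is still more likely than not to turn up somewhere among 363 comparisons.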


Among other things, this conclusion would find a language family home for the world's largest language isolate (Korean).

With this conclusion, what does the Big Picture look like?

If this evidence is correct in showing that there is a Macro-Altaic language family, then the largest language families in the world (by the percentage of the world population that speaks them) would be as follows:

*Indo-European languages 46% (Europe, Southwest to South Asia, America, Oceania)
*Sino-Tibetan languages 21% (East Asia)
*Niger-Congo languages 6.4% (Sub-Saharan Africa)
*Afro-Asiatic languages 6.0% (North Africa to Horn of Africa, Southwest Asia)
*Austronesian languages 5.9% (Oceania, Madagascar, maritime Southeast Asia)
*Altaic languages 5.8% (Central Asia, Northern Asia, Anatolia, Siberia, Japan, Korea)
*Dravidian languages 3.7% (South Asia)
*Austro-Asiatic languages 1.7% (mainland Southeast Asia)
*Tai-Kadai languages 1.3% (Southeast Asia)

These nine language families include the languages of 97.7% of the world's population. (The Semitic language family discussed above is a subgroup of the Afro-Asiatic language family.)

Old World Languages Overview

A complete list of living language families of the world, excluding Australia, New Guinea (and a few neighboring islands) and the Americas would comprise the following 28 language families, 9 language isolates, and 8 unclassified small African languages.

At least six proposals to merge some of these language families are being discussed and some of the languages and language families are at risk of becoming extinct. If all six proposals were adopted and proposals to classify four of the language isolates and two unclassified languages were adopted, there would be just 22 language families, 5 language isolates and 6 unclassified languages outside the Americas, Australia, New Guinea and the nearby islands.

The languages are arranged by geography and suggestions of relatedness.

African/Near Eastern Language families and language isolates (11+2):
* Afro-Asiatic languages (formerly Hamito-Semitic)
* Nilo-Saharan languages
* Kadu languages (probably Nilo-Saharan) nine languages
* Koman languages (perhaps Nilo-Saharan) 47,000 speakers of five or six languages in Ethiopia and Sudan
* Songhay languages 2.6 million speakers of many languages around the Niger River
* Niger-Congo languages (sometimes Niger-Kordofanian)
* Mande languages (perhaps Niger-Congo)
* Ubangian languages 2-3 million speakers of 70 languages in and around the Central African Republic
* Khoe languages (part of the Khoisan proposal) 300,000 speakers of eight languages.
* Tuu languages (part of Khoisan) 4,200 speakers of two languages.
* Juu-ǂHoan languages (part of Khoisan) 45,000 speakers of two languages.
* Hadza language isolate (Tanzania) (perhaps Khoisan) Fewer than 1,000 speakers.
* Sandawe language isolate (Tanzania) (may be related to Khoe) 40,000 speakers

The last five languages/language families are all languages of peoples with click languages and hunter-gatherer economies in the recent past or the present.

European, Central Asian, South Asian and North Asian language families (12+7):
* Basque language isolate (Spain, France) (related to extinct Aquitanian) 1,063,000 speakers
* Northwest Caucasian languages (often included in North Caucasian) 1.7 million speakers of four living languages
* Northeast Caucasian languages (often included in North Caucasian) 4 million speakers of about thirty-two living languages
* South Caucasian languages 5.2 million speakers of four living languages
* Altaic languages (see map and description at top of post)
* Uralic languages (Finland, Hungary and Northern Russia) 25 million speakers of 39 languages
* Yukaghir languages (Eastern Siberia in Russia) 200 speakers of two languages (possibly related to Uralic)
* Indo-European languages (Europe, Iran and India)
* Nihali (aka Kalto) language isolate (Maharashtra, India) (sometimes linked to Munda) 2000 speakers (a "tribal" people of India indigenous to local tropical jungles, probably as hunter-gatherers)
* Dravidian languages (Southeast India)
* Great Andamanese languages (part of the Andamanese proposal) (islands near India) 50 speakers of 1-2 languages.
* Ongan languages (part of the Andamanese proposal) (islands near India) 296 speakers of two languages.
* Kusunda language isolate (Nepal) 8 speakers (sometimes linked to Andaman and West New Guinea) (a moribund hunter-gatherer tribe physically dissimilar to neighboring groups)
* Shompen language isolate (Nicobar Island) (little known; appears to be two languages) 400 speakers (currently interdicted hunter-gatherers with minimal outside contact)
* Chukotko-Kamchatkan languages (Northeast Siberia in Russia) 11,000 speakers of four living languages
* Nivkh or Gilyak language isolate (Russia) (sometimes linked to Chukchi-Kamchatkan) 1,000 speakers
* Ainu language isolate (Japan, Russia) (like Arabic or Japanese, the diversity within Ainu is large enough that some consider it to be perhaps up to a dozen languages while others consider it a single language with high dialectal diversity) 100 speakers (indigenous people of Northern Japan who had substantial trade with the Nivkh people)
* Dené-Yeniseian languages (Dené is a discontiguous Native American language group, Yeniseian is an Eastern Siberian language family whose last living language Ket has about 500 speakers)
* Burushaski language isolate (Pakistan, India) (sometimes linked to Yeniseian) 87,000 speakers

Tibetan, Chinese, Southeast Asian and Pacific language families (5+0):
* Austronesian languages (part of the Austro-Tai proposal) (Indonesia, Madagascar, the Pacific)
* Sino-Tibetan languages (including China and Tibet)
* Tai-Kadai languages (part of Austro-Tai proposal) (Southeast Asia and South China)
* Hmong-Mien languages (Southeast Asia)
* Austro-Asiatic languages (Southeast Asia)

Collectively, language isolates mostly have few speakers. The eight language isolates other than Basque have fewer than 131,000 speakers combined, and most of those languages have hypothetical links to existing language families.

There are about eight living languages in Africa that are not currently classified:
* Ongota (perhaps Afro-Asiatic)
* Gumuz (perhaps Nilo-Saharan)
* Bangi-me (ethnically Dogon)
* Dompo
* Mpre
* Jalaa
* Laal
* Shabo

There are also a number of notable language isolates and unclassified languages in this region that are extinct and there is at least one extinct language family.

There are some notable extinct language isolates in this region:
* Elamite (Iran) (sometimes linked to Dravidian)
* Sumerian (Iraq)
* Hattic (Turkey) (sometimes linked to Northwest Caucasian)

The Hurro-Urartian and Tyrsenian language families are also extinct.

There are some notable extinct unclassified languages in this region:
* Iberian (Spain)
* Tartessian (Spain, Portugal)
* North Picene (Italy)
* Kwadi (perhaps Khoe)
* Meroitic (variously thought to be Nilo-Saharan or Afro-Asiatic)
* Quti
* Kaskian
* Cimmerian

Murray Gell-Mann at the Santa Fe Institute is one of the authors of the new study, and a post related to him discusses proposals for linking many of the world's remaining language families and language isolates with deeper genetic relationships.

New World Languages Overview

There is strong circumstantial evidence to suggest that all the language families and language isolates in Australia, New Guinea and the vicinity are really descended from the language of the founding population of Australia, the language of the founding population of Taiwan, or are creoles of languages in those two families. These languages are currently classified into 36 language families and about 20 language isolates.

Linguists currently break Native American languages (North, Central and South American) into about 85 language families and 57 language isolates, despite the fact that it is clear that all pre-Columbian Native American peoples who leave any linguistic trace (e.g. excluding Norse settlers who arrived around 1000 AD and whose colonies ultimately failed without leaving a linguistic trace in neighboring areas) share a common descent from a small number of waves of immigration from Siberia. Such common origins are strongly suggestive of a genetic relationship between all of the pre-Columbian languages of the Americas into a small number of macro-language families.

But common genetic linguistic roots for Native American languages are hard to discern. Like the peoples of Australia and New Guinea, these peoples spent well over ten thousand years in isolation, and only a small number of geographically large political units emerged in the pre-Columbian era. The New World regions did not experience, as the Old World did, hosts of language extinctions in the pre-historic era. As a result, there is much more linguistic diversity in the Americas than there is in the Old World.

UPDATE (2/24/2010): The study did not analyze Korean linkages despite discussing it as a possible member of the language family.

There is a hierarchy of strength of relationships among the languages of the proposed Altaic language family. Among the non-Japanese proto-languages, the relationships are strongest between the languages that are closest geographically. Thus, Proto-Turkic is most closely related to Proto-Mongolian, which in turn is most closely related to Proto-Tungus (i.e. Manchurian), which in turn is most closely related to Proto-Eskimo, on an anticipated west to east axis. Proto-Turkic is barely distinguishable as related to Proto-Eskimo at all when compared directly; only the intermediate language relationships show the extended Altaic language relationships.

Japanese is the exception. It is as closely related to Proto-Turkic as is Proto-Mongolian, and the strength of the relationship of Japanese languages to all other language families in the Altaic language superfamily appears to be derivative of its relationship to Turkic languages, despite the great distance between Anatolia and Japan compared to the geographical distance associated with the prior languages.

Proto-Japanese probably reflects the language of the mounted military invaders of Japan from the North (frequently Korea is described as the source) ca. 2400 years ago (perhaps up to 500 years earlier). For modern Japanese, the relationships to the proposed Altaic languages follow the same hierarchy, but are about half as strong. Modern Japanese is likewise more closely related to the Anatolian Turkic languages than it is to the Eastern Turkic languages. This is a linguistic mystery.

The speakers of proto-Japanese were a bronze age civilization. The last bronze age civilization in Anatolia was that of Troy, 3000 BC-700 BC, whose end period coincides roughly with the beginning of the Yayoi period of colonization. The Trojans, of course, were a seagoing civilization, although they are not known to have traveled as far as the Far East. They, however, probably spoke an extinct Indo-European language. Phrygia was also an Anatolian civilization at the time, but the Phrygians also spoke an Indo-European language, as did the Neo-Hittites.

The earliest written evidence of Turkic language comes from Mongolia in the 700s AD. It did not originate in modern Turkey, and instead expanded to this region during the Middle Ages. Proto-Turkic is assumed to date from about the 300s AD. The history of Turkic expansion can be summed up as follows:

The Turkic migration as defined in this article was the expansion of the Turkic peoples across most of Central Asia into Europe and the Middle East between the 6th and 11th centuries AD (the Early Middle Ages). Tribes less certainly identified as Turkic began their expansion centuries earlier as the predominant element of the Huns. Their prehistoric point of origin was the hypothetical Proto-Turkic region of the Far East including North China, especially Xinjiang Province and Inner Mongolia with parts of Mongolia and Siberia possibly as far west as Lake Baikal and the Altai Mountains. They may have been among the peoples of the multi-ethnic historical Saka known as early as the Greek writer Herodotus.

Certainly identified Turkic tribes were known by the 6th century and by the 10th century most of Central Asia, formerly dominated by Iranian peoples, was settled by Turkic tribes. The Seljuk Turks from the 11th century invaded Anatolia, ultimately resulting in permanent Turkic settlement there and the establishment of the nation of Turkey. Meanwhile the other Turkic tribes either ultimately formed independent nations, such as Kyrgyzstan, Turkmenistan, Uzbekistan and Kazakhstan or formed enclaves within other nations, such as Chuvashia. Turkics also survived on the original range as the Uyghur people in China and the Sakha Republic of Siberia, as well as in other scattered places of the Far East and Central Asia.


The Proto-Turkic homeland is geographically close to Tibet. That proximity would explain how a Y chromosome haplotype that is now common in Tibet could also be frequent in Japan, if it arrived there with a conquering class speaking a language related to Proto-Turkic.

Inner Mongolia's history at the relevant time was as follows:

During the Zhou Dynasty, central and western Inner Mongolia (the Hetao region and surrounding areas) were inhabited by nomadic peoples such as the Loufan, Linhu, and Dí, while eastern Inner Mongolia was inhabited by the Donghu. During the Warring States Period, King Wuling (340–295 BC) of the state of Zhao based in what is now Hebei and Shanxi provinces pursued an expansionist policy towards the region. After destroying the Dí state of Zhongshan in what is now Hebei province, he defeated the Linhu and Loufan and created the commandery of Yunzhong near modern Hohhot. King Wuling of Zhao also built a long wall stretching through the Hetao region. After Qin Shihuang created the first unified Chinese empire in 221 BC, he sent the general Meng Tian to drive the Xiongnu from the region, and incorporated the old Zhao wall into the Qin Dynasty Great Wall of China. He also maintained two commanderies in the region: Jiuyuan and Yunzhong, and moved 30,000 households there to solidify the region. After the Qin Dynasty collapsed in 206 BC, these efforts were abandoned.


The Donghu people in particular look plausible as proto-Japanese settlers, as do the neighboring Dongyi people. Also plausible are the Xiongnu to the west of the Donghu, whom the Great Wall was erected to keep out. This would have been near the boundaries of the Silk Road and preceded any civilization politically conceived of as Tibetan.

An overview of theories of the origins and classifications of the Japanese language can be found here and generally favors a relationship between Korean and an extinct language spoken in Korea before Korean became dominant there.

Genetically, Japan (also here) and Korea (and here) are very similar in mtDNA and are each distinct from surrounding regions based on Y chromosome analysis, although not quite as similar in Y chromosomes as in mtDNA.

For example, mtDNA type C is common in Siberia and found in smaller numbers in Mongolia, but rare elsewhere in East Asia, including Japan and Korea. mtDNA types F1 and M are rare in Japan and Korea but common in Southeast Asia and the adjacent Indonesian islands.

Y haplotypes that are found in two-thirds of Japanese men (D-M174 and O-P31) are found in on the order of a third of Korean men and are rare in Siberia, Mongolia and China. The D-M174 haplotype, found in about a third of Japanese men and perhaps one in twenty Korean and Thai men, is the dominant haplotype in Tibet, where it is found in perhaps 85%-90% of Tibetan men. It is virtually absent everywhere else in East Asia and Siberia. "D-M174 is derived from African haplogroup DE-M1 (Yap+), is found at highest frequency in Andamanese, Tibetans and Japanese, and only sporadically elsewhere, and has been dated to about 60 kya."

The O-P31 haplotype that is found in about a third of Japanese men is found in about a quarter of Korean men and is common in Southeast Asia and neighboring Indonesian islands, but is rare in Siberia, Mongolia, China, Tibet, Taiwan, the Philippines and the Moluccas.
