12 March 2013

The Woman Who Speaks A Mystery Language

Language Log is currently crowdsourcing a difficult question from an International Organization for Migration office in Nepal.  A destitute refugee woman in Kathmandu, Nepal speaks and writes in a manner that officials there have been unable to match to any known language. 

A writing sample and two sound clips, along with some additional information on the woman and her circumstances, have been provided. Click on the link to listen to the sound clips.  A copy of the writing sample is below the break (in this circumstance, reproduction is fair use for copyright purposes).  Many likely candidate languages have been ruled out, but no one has made a definitive identification.  The office's goal is to identify her place of origin or family ties in the hope that she may find a home and a support network.  So long as she is stateless, she does not even benefit from consular assistance from the country of which she may be a national. 

More information about her background and analysis from experts concerning what her linguistic background could be appear below. 

In addition to the writing sample and the two sound clips, our correspondent offers some additional background information, reproduced below. 
[S]he is living under the government's care along with other foreigners who overstayed their visas. This woman is not recognized as a Bhutanese refugee and doesn't have any identification document with her. I think right now she is noted as a stateless migrant or maybe illegal migrant. She entered Nepal from somewhere about 6 years ago. . . .    
She can speak some English indicating perhaps some formal education. The strange thing is she can't tell us her homeland or anything about her family except a husband who travelled with her and abandoned her. We think this "husband" could be a genuine husband or maybe just a human trafficker.   
Mental Health, Speaking In Tongues, Or Fraud?  Mongolian? Copied Graffiti? 
She is taking medication for mental illness. I don't know the diagnosis but my colleague thinks it was caused by the trauma of leaving her homeland (voluntarily with this "husband" or perhaps through coercion/force). Because of the mental illness, my colleague thinks she is highly susceptible to suggestions from other foreigners living with her in this government center. This could also make it harder for us to pinpoint her origin and explain the different writing systems that came up in the initial sample. For sure my colleague told me she was agitated when mentioning India, you can also hear it in the recording.   
She said in English she is Mongolian. My colleague thinks she was just given that notion by another foreigner living at the center because other people in that center said the foreigner had been telling her she is Mongolian.
Distortions due to mental illness, the religiously motivated glossolalia of someone who has a Pentecostal Christian or animistic religious background, or intentional deception have been suggested as possibilities, but there is good reason to discount each of them.

As one commenter aptly put the matter:
Before concluding that the speech samples are simply "a hotch-potch of phonemes" (actually suggested by the hypothesis of "speaking in tongues"), and declaring the woman to be either crazy or a faker or both, it is only fair to rule out actual languages, including those spoken in very small, remote areas and practically unknown to people outside those areas.
This is particularly true in this part of the world, in which two significant new language discoveries, one related to other regional languages and the other an isolate, have been made in the last few decades and there are scores of languages spoken by small numbers of people in isolated places in the region.

Another commentator followed up by noting that:
[A] person with mental illness do[es] not speak nonsense, contrary to some common beliefs. Most people with mental illness speak comprehensible words. Quite a few of them may express with distorted meaning and corrupted syntax, but as lexemes, understandable. However, if she has a history of brain injury or stroke, her language ability might have been impaired.
It also isn't clear that the mental health issues in question would have any impact on written or verbal expression, even if there are mental health issues that could have that effect.

In something of the same vein, it was suggested that some of the writing may reproduce graffiti tags that she has seen in the six years since her arrival in Nepal.

A Few Observations About The Writing Sample

According to our correspondent: "The writing sample came about when my colleague asked her to write something for her, so random words come up."

This woman's ability to write clearly demonstrates that she has had enough formal education to have at least rudimentary literacy, as does her rudimentary command of English.  Since she has at least an elementary school level education, she cannot be from, for example, some unknown uncontacted population (such as the Great Andamanese).  Likewise, her command of roman script and rudimentary English implies contact with a community that was either influenced by Christian missionaries or was once part of the British colonial empire.

All of South Asia and Myanmar, and quite a bit of Central Asia, were once under British rule, and Christian missionaries have gone a great many places, so this doesn't narrow the possibilities very much. It does, however, tend to disfavor places where French or Russian, rather than English, was the residual colonial language.

Her roman alphabet writing appears to be arranged in words. 

The nature of the non-roman alphabet characters with which she code switches is less clear.  One commentator suggested that she may speak a language in the same linguistic family as the Mon-Khmer Khasi language of the religiously Christian tribal peoples of the hills of Meghalaya in India.  The Khasi language (and perhaps others in the region) was committed to writing using the Assamese script in the 1800s before the roman script was adopted.

I observed that this suggests that some of the symbols could be a simplified or localized version of the Assamese alphabet.  A sample of an early version of this script from 1207 CE (which might be less stylized and hence more like what someone who was less educated might produce, as opposed to the more calligraphic style that came into use later) is here:

Many of her symbols show some similarity to Assamese symbols.  It is possible, however, that some or even many of the symbols in non-Roman text are doodles or non-alphabetic symbols.

Efforts To Classify The Language(s)

The most hopeful hypothesis for identifying this woman's language, in my opinion, is in the following comment:
Listening to the recordings, it doesn't sound like there's lexical tone, and the intonation pattern, syllable structure and phonetics are similar to the mon-khmer languages I'm aware of in Meghalaya. Keep in mind that though it's not Khasi or Pnar (and I don't think it's War), there are a lot of smaller local varieties in the hills of Meghalaya which are prime candidates. Given that this lady is now in SE Nepal, and that she seems to vociferously indicate that she is from India, (the first recording seems to be of her telling how she arrived in Nepal) she may be from NW Meghalaya near Garo country, where there have been insurgents in recent years.
While I had known that the Austroasiatic Munda languages, which are specific to South Asia, were spoken in India, I had not known that there were any Mon-Khmer linguistic communities in India (Mon-Khmer languages are mostly spoken in pockets of Southeast Asia and Southern China).

Our correspondent is a native speaker of Vietnamese and can rule out this Austroasiatic language.  Other common languages of the region that would be known to UN staff in the agency can also be ruled out.  For example, "the UNHCR camps in eastern Nepal have mainly Bhutanese refugees, who are mainly Lhotshampas, speakers of Dzongkha and Nepali, and other Tibeto-Burman Nepalese languages."  He also offers these insights:
My colleagues from the various IOM missions around the world couldn't identify the language. I have heard Mizo and Tedim in my work and this doesn't sound the same. Nothing heard from the Philippines and Indonesian missions so pretty safe to rule out Bahasa or Tagalog. Because of what you all wrote, I will re-check with my colleagues in Myanmar if they have the access to other tribal languages outside of Chin. . . One more language to remove from the checklist: Kashmiri. A native speaker listened to the sound clips and ruled out that possibility[.]
So far about 114 comments over two weeks, many from professional linguists either directly or via native speakers whom they have contacted, have contributed to the analysis.  Some of the more definitive comments have concluded, for example:
a. The language appears to have a fairly limited phonetic inventory; b. there are long vowels; c. there appear to be mono-, bi-, and trisyllables, but not longer; d. syllable structure is not too complex; and e. reduplication appears to be prevalent.
The recording sounds non-tonal, which would argue against Shan state.
[T]hose pharyngeal h's sound very un-Tibeto-Burman

I am a Tagalog/Filipino native speaker and linguist, and this certainly is not Tagalog/Filipino. Tagalog doesn't allow word-final -h.  
From Zothani Khiangte at Manipur U. (via Mark Bender): “not even remotely Mizo”!  
Those are completely malformed Chinese/Japanese characters, if that is what they are meant to be (which seems highly unlikely). Someone who doesn't know Chinese/Japanese writing and is trying to reproduce the easy-looking characters MIGHT come up with those, but anyone who actually knows the language will produce something very different. The orientation and stroke count are the giveaways 
An ethnic Chinese and once a student of Chinese paleography and epigraphy, I am 100% sure that the non-Latin form of script is NOT any kind of Chinese writing (excluding the writings of the minorities such as Yi and Nü Shu which I have no idea) and not the Japanese Kanji.

I want to clarify why the non-Latin signs she wrote is not related to Chinese writing. Chinese characters are either single form or compound form written signs classified as the chart in this page:
95% of Chinese characters are compounds composed of at least two parts spatially separable within the sign unit. However, I don't see any of the lady's non-Latin signs separable, let alone the absence of any similar sign to Chinese except the one @Daniel Tse mentioned. Most of her signs are less than three strokes, but the average strokes of Chinese characters are around 9 (http://technology.chtsai.org/charfreq/). Only the sign I mentioned in a previous reply achieved this complexity. I don't think she has any training in Chinese, even informal.
[C]ertainly not the Hazaragi of Afghanistan, which is a dialect of Persian and readily understandable if you know other Persian dialects. I don't see it as Khowar (Chitrali) either - but considering where she's from, it seems like it makes Iranian languages in general, or east Iranian in particular, unlikely.
[M]ost of the languages of the Indosphere, though not all, have three or more manners of articulation and more places of articulation for the stop series, which are not attested in the writing sample. It’s also notable that the voiced stop series has a clear /b/, but no /d/ and no /g/ (except for the loan ‘mig’ and in ‘Ang’, which may be part of a loan ‘Ang Sui Ki’ or is a part of an engma). . . .  
[I]t’s possible that similarities between the writing sample and Bahasa Malay/Bahasa Indonesia may be due to the greater statistical chance of similarity with a simple syllable structure and a limited phonemic inventory. Nevertheless, it’s certainly noteworthy that: several linguists have noticed possible Malay or Indonesian words in the writing sample AND no one has identified possible words from other languages in the romanized portion of the writing sample (with the exceptions of ‘hasta pe’ and the Hindi copula, as noted in two different comments above). . . .  
There are relatively few Tibeto-Burman languages which are not tonal; however, there are some in the greater northeast India region. A quick search of the STEDT database, http://stedt.berkeley.edu/~stedt-cgi/rootcanal.pl/gnis, shows three or four matches with Northern Naga languages (tou, ula, aba, ami), Angami languages (ole, ami, zope), Zeme languages (mala, tama, peu), Bodo-Garo (mala, ape, tama), and Chin (tou, reh, ami). The phonetic inventories of these language families, however, are more complex than that of the sample (http://stedt.berkeley.edu/~stedt-cgi/phon_inv.html?page=49, may be a bit glitchy to get this to run), so these do not seem to be likely candidates. Additionally, there is the same issue that these are mono- and bisyllabic words with cross-linguistically common sounds. 
(a.) a native speaker of a Kuki-Chin language closely related to Mizo says it is not Mizo. (b.) a fluent but non-native speaker of Indonesian says, “a few of the words look like Bahasa Indonesia or Malay (like salah means wrong or error) but I don't recognize most…” and has seen ‘peu’ in Acehnese but is not sure of its meaning. (c.) a fluent but non-native speaker of Urdu says the accent sounds Bengali-influenced. (d.) a colleague from NW India does not recognize it. (e.) a Bengali speaker thought it might be Nepali, Mizo, or Khasi. (f.) two colleagues who work on languages of Bhutan say, “It's definitely not anything in Bhutan. It's not Nepali or Assamese, either. In fact, it doesn't sound like anything either of us have heard in NE India (including Khasi). The closest thing we could come up with is Pashto (based on watching the movie Afghan Star). So, maybe it's worth checking out some other Indo-Iranian language from Pakistan or Afghanistan? (btw: sala is also a word in Hindi for wife's brother, but often used as a curse)”.

Ladino is a variety of Spanish, and there is nothing Spanish about the language(s) of the documents. 
The absence of retroflex consonants also indicates her unfamiliarity with Sanskrit and Dravidian.  
[T]he written text looks a bit like Indonesian. But having listened to the sound files, I would have to say that it doesn't sound the least bit like Bahasa Indonesia or any of the well-known languages of Indonesia.  . . . it also reminded me of ritualistic or shamanic language. Another possibility that came to mind is that it is a ludling. Note the high frequency of certain sounds, especially sibilants. I would recommend that experts in languages of the region select a couple of the most likely languages, and try to derive these texts from such languages by the application of ludling-style rules, eg. by adding syllables with sibilants all over the place.

[T]his sounds nothing like an indigenous TB language of the North East; the phonology, especially the prosody, is quite different, and sounds much more (north-)westerly. Also, if I remember correctly that the woman is reported as knowing some English and (thereby?) showing signs of being educated, it would be astonishing if she were from the North East and yet knew no Hindi, Assamese or Nepali whatsoever. 
I've just listened to the two recordings. I'm fairly certain that neither of them is Marathi. The first one does sound like an Indian language, and there are some recognizable words — especially "itihosa indiyaa," which would presumably mean "history of India." The other one sounds even less like Marathi to me. I thought I recognized the word "Sufi," but then she corrects herself and changes it to "Suze," or something like that.

It isn’t like any Tibeto-Burman language of NE India that I know of, and it doesn’t sound like it has lexically contrastive tone either, in contrast to the vast majority of T-B languages of that region. My partner is a native speaker of Nepali and Nagamese, grew up in Nagaland and Meghalaya, and has a very good knowledge of Hindi and related I-A languages, and she couldn’t recognize anything in the recording either. 
From Pushkar Sohoni, a native speaker of Marathi: The language is most certainly not Marathi or any other South Asian Indo-European language from the past 1500 years. 
It has been mentioned already that the language isn't any South Asian Indo-European language, but as a native speaker I'll confirm that it doesn't sound Bengali at all. /v/ and /z/ are not present in Bengali, and /s/ is very rare. Phonetically, the recording sounded like it might come from Assam or parts northward, although the words didn't sound Assamese.