The Internet, where languages go to die?

Forget the triumphant universalism of the Web; 95 percent of languages have almost no presence online

March 18, 2014 8:00AM ET
The digital realm was supposed to be a horizontal platform, a great equalizer that lets everyone communicate seamlessly with one another. But according to a recent study, only a small fraction of languages are digitally ascending.
Dennis Lane

We’re used to the triumphalist universalism of the digital utopians: Google organizes the world’s information. Facebook connects everyone. Twitter tells you what’s happening. Wikipedia is the encyclopedia that anyone can edit. It’s all true — for a mere 5 percent of the world’s languages.

What few acknowledge is that the online world — when compared with offline, analog diversity — is very nearly a monoculture, an echo chamber where the planet’s few dominant cultures talk among themselves. English, Chinese, Spanish, Arabic and just a handful of other languages dominate digital communication. Thanks to their sheer size and to the powerful official and commercial forces behind them, the populations that speak and write these languages can plug in, develop the necessary tools and assume that their languages will follow them into an ever-expanding range of virtual realms.

Meanwhile, despite heroic and ongoing efforts, 95 percent of all languages languish almost entirely offline. Call it the largest digital divide of them all and by far the hardest to bridge. The better-known divides, such as the lack of dependable Internet access in many communities, are real but not intractable. As a decade of modest progress has shown, there are often straightforward solutions if we’re willing to invest in them, and at least these digital have-nots can make rapid progress once plugged in. The tools may be unfamiliar, but the terrain — the language and culture framing it all — is not.

For speakers of less common languages, the issue of access is usually serious enough to begin with, but the near total absence of their languages and cultures online is demoralizing and fundamentally limiting. Knowledge of a more widely spoken language is no panacea here: The real problem is a digital architecture that forces people to operate on the terms of another culture, unable to continue the development of their own. Of course, most of the languages that are missing online have also never found their way onto television screens or radio broadcasts and have never been taught in classrooms or used in offices. But the digital realm was supposed to be different — a horizontal platform, a great equalizer that would allow everyone to communicate seamlessly with one another. What went wrong?

Decoding language loss

The arguments for linguistic diversity are compelling, mirroring the better-known case for biodiversity. Children learn better in their mother tongue, multilingualism is an immense cognitive and cultural asset, and indigenous peoples with resilient languages and cultures may be better able to defend themselves against the ongoing tyranny of the larger monoculture. These are joined by considerations of justice and human rights; small languages are going extinct largely because of suppression and shaming by larger groups in the service of imperialist, nationalist and capitalist ends. And then there is the question of a global cultural heritage — the unfathomable loss of knowledge and art that comes with the disappearance of any language, but especially those that are unrecorded and unwritten. The loss of a language, wrote the linguist Ken Hale, is “like dropping a bomb on the Louvre,” if we only knew how to feel it.

And now the Internet, a supposed cure-all, is only making matters worse. This is the striking claim from András Kornai, a mathematical linguist at the Budapest Institute of Technology, in a recent paper titled “Digital Language Death.” It’s a conclusion that will make language activists shudder, so strong is the current faith that digital capabilities can help sustain and revive endangered languages.

At first blush, digital communication seems to offer uniquely promising solutions to the endangered-language crisis. Basque blogs, a Faroese Wikipedia, iPad apps in Cherokee and Navajo, an operating system in Hawaiian, texting in Tlingit — these tremendous initiatives and many others like them have an iconic significance, especially for younger speakers, proclaiming to the world, “Our language has a future.” Disparate speakers, learners and resources can be linked across space and time, enhancing language teaching and bolstering efforts at language maintenance. Interactivity and crowdsourcing, made easier online, are just what small languages seem to need, given paltry or nonexistent governmental, institutional or corporate support. The production, dissemination and consumption of minority-language media have all become cheaper and easier — certainly an improvement over print production and television, which demanded serious startup resources. And if subcultures are famously flourishing online — everyone letting their inner freak flag fly — shouldn’t cultures? After all, the Klingon and Elvish languages seem to be doing pretty well.

Kornai gives short shrift to such optimism. For the overwhelming majority of languages, he suggests, the glorious digital tomorrow will never arrive. He persuasively demonstrates that the Internet is hardly more universal, at least when it comes to human language, than the printing press or the television before it. Only 5 percent of all living languages are digitally “ascending,” writes Kornai, and thus truly “enter[ing] the space of digitally mediated communication.” In other words, about 250 languages can be called well-established online, and another 140 are borderline. What do the speakers of the other 6,700 or so languages have to say, and will we ever hear their voices?

The textiness problem

Visualizations showing Twitter conversations in Basque, Irish Gaelic and Yiddish.
Kevin Scannell/Indigenous Tweets Project

One basic challenge is the Internet’s textiness. The language database Ethnologue estimates that 3,535 of the world’s 7,105 living languages have no writing system whatsoever. It’s precisely this category, the unwritten half of all of today’s spoken languages, about which we know next to nothing. For hundreds of languages, linguists lack even the barest documentation — a word list, a brief recording, basic grammatical information. Outsiders, even those from the same region, may be entirely unaware of the language’s existence or merely consider it a broken, backward dialect. The number of speakers of each such language is usually under 10,000 and declining fast. Many won’t survive the century. Ninety-six percent of the world’s languages are spoken by just 4 percent of the world’s people.

Acknowledging our inability to know all the languages used in emails, texts, Skype calls and so on (maybe the NSA could help), Kornai nonetheless tries to survey all publicly available textual material online, with a particular focus on the hyperglot Wikipedia, which has versions in 287 languages (with another 533 in “incubator stage,” according to him). He rightly homes in on the invisible underpinnings that enable us to use a language online, such as input methods, OS support (on a range of devices, in countless applications), transliteration and translation and spell-checking tools. Just developing a Yiddish spell-checker, for instance, has required a stable input method for the modified Hebrew alphabet that Yiddish uses, the prior standardization of that alphabet (still contested), standardized spellings of most words (sometimes contested), technical ease in handling the Yiddish alphabet and a loaded dictionary.

Needless to say, support can be extremely patchy even for very widespread languages, and most of what exists has depended on open-source solutions and dedicated volunteers. Even translated versions of the most popular tools and sites so far have only a strictly limited reach. According to Kevin Scannell of the Indigenous Tweets project, as of late last year, you could search Google in 150 languages, use the Firefox browser in 105 languages, navigate Facebook in approximately 100 languages and find tweets in 139.

In all, there may be online primary materials of some sort in up to 1,500 languages, he estimates. Even this more generous number leaves 80 percent of the world’s languages invisible in the digital realm.
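For readers who want to check the proportions, the figures quoted above can be run against Ethnologue's count of 7,105 living languages. What follows is a rough sketch in Python, not part of Kornai's or Scannell's analysis: the constant names and the `share` helper are made up for illustration, and the inputs are only the rounded numbers cited in this article.

```python
# Rounded figures as quoted in this article; the names are illustrative only.
TOTAL_LIVING = 7105           # Ethnologue's count of living languages
UNWRITTEN = 3535              # languages with no writing system
WELL_ESTABLISHED = 250        # "well-established online" (Kornai)
BORDERLINE = 140              # "borderline" (Kornai)
WITH_ONLINE_MATERIALS = 1500  # upper estimate of languages with any online primary materials


def share(part, whole=TOTAL_LIVING):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Languages that are neither well-established nor borderline online.
remaining = TOTAL_LIVING - WELL_ESTABLISHED - BORDERLINE

shares = {
    "unwritten": share(UNWRITTEN),
    "well-established or borderline online": share(WELL_ESTABLISHED + BORDERLINE),
    "some online materials": share(WITH_ONLINE_MATERIALS),
    "no online materials": 100 - share(WITH_ONLINE_MATERIALS),
}

print(f"languages outside both of Kornai's groups: {remaining}")
for label, value in shares.items():
    print(f"{label}: {value:.1f}%")
```

On these inputs, the well-established and borderline groups together come to a little over 5 percent of living languages, and the 1,500-language estimate leaves just under 80 percent with no online materials, which matches the rounded percentages in the text.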

A read-only heritage?

What Scannell’s estimate underscores is the remarkable effort made by linguists, language activists and archivists over the last 10 to 20 years to collect and curate endangered language materials across the globe — with major research initiatives led by the Hans Rausing Endangered Languages Project in London, the DOBES project in the Netherlands and the National Science Foundation in the U.S. On a local level, there are nonprofits such as the Endangered Language Alliance, where I serve as the assistant director. Just in and around New York City we have recorded, edited and uploaded stories in little-known languages, including Shughni (Tajikistan), Bribri (Costa Rica) and Juhuri (Azerbaijan). The amount of linguistic material online is growing exponentially — folk songs, oral histories, recordings of everyday life — but as Kornai points out, this is no guarantee of digital vitality.

Read-only heritage materials are different from the living, breathing stuff that users generate online in daily use, but the distinction is not hard and fast. In some cases, lesser-used languages are actually ahead of the big ones; Maori speakers, for instance, have created a set of free, sophisticated language-learning tools for children. For speakers of less common languages, just hearing and seeing one’s language online can be a powerful affirmative experience, as I’ve learned through my work on Trung, a Tibeto-Burman language of southwestern China with fewer than 7,000 speakers. A well-established writing system is still some ways off, but I’ve been working with members of the community on a first-ever dictionary for the language. We are also beginning to upload video and audio recordings. As more Trung speakers use the Internet in the coming decades, their language may at least have a presence.

Any medium has its limitations, and the Internet is still a creature of the late 20th century monocultures, primarily English-speaking, that gave it life. Compared with cell phones or pirate radio, the Internet still presents significant barriers to entry, given the need for existing literacy, terminology and a range of digital tools to make it all useful. The great work of Unicode — ensuring that any character from any script can be represented — has been hampered by a persistent English-centrism among most of the digital gatekeepers, making things needlessly difficult for users of other alphabets, scripts and character sets.

Witness the changing role of the Internet Corporation for Assigned Names and Numbers, an international organization with vast jurisdiction over the Internet that only last Friday announced it will soon start operating independently of the U.S. government. In October, ICANN enabled the first generic top-level domains to use non-Latin characters — a very early and tentative step toward what the organization admits should be “a more inclusive Internet.” This means that in addition to .com, .org and the like, there are now two such domains in Cyrillic, one in Arabic, one in Chinese and many more on the way, which will undoubtedly provide more complete online experiences for speakers of those major languages. But much work has yet to be done before this precedent trickles down to speakers of smaller languages, currently without even a way to type their names.

A world of smaller internets

A much more intractable problem is the Internet’s tendency to push all of us out into the footlights of a single global stage. To reach more friends — or more customers, viewers, likers or retweeters — we pump out content in the dominant languages instead of cultivating more intimate and varied communities and legacies. There is no reason we cannot have both a single Internet and a fantastic diversity of smaller internets (or even intranets), communities of users with their own digital turf in the form of domains, closely linked websites and specialized tools and services. These need not be walled gardens, but they should be vital, organized nodes and safe spaces for language use. Improvements in machine translation can then play a role, as Google Translate already does, building bridges between these internets without erasing difference. Less Internet textiness would also help: The more we use audio and video, seamlessly approximating face-to-face conversation, the easier it will be for unwritten languages.

Kornai sounds the alarm that only the tiniest portion of our languages are making it into the digital realm. In doing so, he points to the need for online efforts that are much larger and more radical in conception. Language activists working online need massive, concerted support from programmers, companies, governments and speakers, but the real and most effective base for endangered languages, if they still have a base at all, remains the home, the family, the village, the world of the analog. There is no substitute.

Ross Perlin is a writer and linguist based in Brooklyn. His writing on language and labor has appeared in The New York Times, The Guardian and Harper’s magazine. His first book is “Intern Nation: How to Earn Nothing and Learn Little in the Brave New Economy.”

The views expressed in this article are the author's own and do not necessarily reflect Al Jazeera America's editorial policy.
