2011-12-23

1000 Languages on the Web

Click to see the full size image

About the Image

Since 2003 I've been gathering texts from the web written in indigenous and minority languages.  The image above is a "family tree" of the 1000 languages I've found to date, where proximity in the tree is measured by a straightforward statistical comparison of writing systems (details below).
  • When you load the full image it will be too big to fit in a browser window, and you may not see anything at first; you'll need to use the horizontal and vertical scrollbars to explore different parts of the tree (most browsers will also let you zoom in and out).  And because it's an SVG image, you can use your browser's search functionality (probably Ctrl+F or ⌘-F) to find different language codes, although the search behavior can be a bit weird/unpredictable.
  • Each language is colored according to its linguistic family (details here).  For example, all Indo-European languages are greenish colors, with different subfamilies (Celtic, Germanic, etc.) being slightly different shades of green.  I also tried to use similar colors for languages from the same geographical region even when there is no known genetic relationship among them, and so Arawakan, Quechuan, Tucanoan languages (all from South America) are shades of purple, while Central and North American languages are shades of blue.
  • Clicking on a language opens a new tab or window with the documentation page for the ISO 639-3 language identifier where you'll find a name for the language in English and a link to its Ethnologue page for additional information.
  • What I'm calling "languages" are really "writing systems"; you'll see, for example, separate nodes for bo (Tibetan) and bo-Latn (Tibetan written in Latin script).  In a small number of cases I track macrolanguages, regional variants (e.g. en, en-IE, en-ZA), and some dialects.  In total, there are 919 distinct ISO 639-3 codes among the 1000 writing systems represented.
I'm using these data in collaboration with language groups all around the world to develop basic resources that help people use their language online: keyboard input methods, spell checkers, online dictionaries, and so on.  This work also underlies the Indigenous Tweets and Indigenous Blogs projects, which aim to strengthen languages through social media.  You can learn more about how indigenous and minority language communities are using the web, social media, and technology to help revitalize their languages by following us on Twitter.

The Gory Details

Everything is based on an analysis of three-character sequences ("3-grams") in the different languages. It turns out that computing the statistics of 3-grams in a given language provides a "fingerprint" that can be used for language identification and a number of other applications.  Specifically, imagine the huge-dimensional vector space V whose axes are labelled with all possible 3-grams of Unicode characters (dim V > 10^15).  Given a collection of texts in a language, you can compute the frequencies of all 3-grams that appear in the collection, defining a (sparse) vector in V "representing" the language.  We then define the distance between two languages to be the angle between their representative vectors in V.  This can be computed by scaling the vectors to unit length and computing their dot product (which is the cosine of the angle we want).
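To make the construction concrete, here is a minimal Python sketch of the 3-gram fingerprint and the angle between two fingerprints. It is an illustration only, not the code from the project's repository: the function names (`trigram_profile`, `cosine`, `angle`) are made up for this example, and it counts raw overlapping 3-grams with no text normalization, which a real pipeline would probably want.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count every overlapping 3-character sequence in `text`.

    The Counter is a sparse vector: only 3-grams that actually occur
    are stored, and every other axis of V is implicitly zero.
    """
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(p, q):
    """Dot product of the two profiles after scaling each to unit length."""
    dot = sum(count * q[gram] for gram, count in p.items() if gram in q)
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        # An empty profile has no direction; treat it as orthogonal.
        return 0.0
    return dot / (norm_p * norm_q)


def angle(p, q):
    """Angle in radians between two profiles (0 = same direction)."""
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cosine(p, q))))


if __name__ == "__main__":
    # Toy inputs; real profiles would be built from large text collections.
    a = trigram_profile("the quick brown fox jumps over the lazy dog")
    b = trigram_profile("the lazy dog sleeps while the quick fox runs")
    print(angle(a, b))
```

Because all counts are non-negative, the angle always lies between 0 and π/2, with π/2 meaning the two texts share no 3-grams at all (as is typical for languages written in different scripts).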

Once we know the distance between each pair of languages, we can reconstruct a phylogenetic tree using any of a number of well-known algorithms.  The image above was created using the so-called "neighbor-joining" algorithm (which basically builds the tree in a greedy, bottom-up way). A side-effect of the algorithm is that each edge in the tree is assigned a length, but note that the edge lengths in the rendered image have nothing to do with the computed edge lengths (indeed, it's unlikely that the tree can be rendered in a distance-preserving way in two dimensions).  Another side-effect of the algorithm is that the tree is connected by definition: all languages are within a bounded distance of each other, and so near the root of the tree you'll see various languages which use completely different scripts joined in a more-or-less random fashion (Khmer, Georgian, Tamil, Cherokee, etc.).  It would be easy enough to tweak the distance function or the algorithm to render languages with different scripts as separate connected components.
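For readers who haven't seen neighbor-joining before, the following is a small, unoptimized Python sketch of the textbook algorithm. It is not the implementation used to produce the image; the function name, the `node0`, `node1`, ... labels for internal nodes, and the choice to key distances by `frozenset` pairs are all assumptions made for this example. At each step it joins the pair of nodes minimizing the usual criterion Q(a, b) = (n - 2)·d(a, b) - r(a) - r(b), where r is the sum of a node's distances to all other remaining nodes.

```python
import itertools


def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair.
    Returns a list of edges (node, node, length).
    """
    d = dict(dist)          # working copy; gains entries for new nodes
    nodes = list(names)
    edges = []
    fresh = itertools.count()

    while len(nodes) > 2:
        n = len(nodes)
        # r[a] = total distance from a to every other remaining node.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Greedy step: pick the pair with the smallest Q value.
        a, b = min(
            itertools.combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        # Edge lengths from a and b to their new parent node u.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        u = f"node{next(fresh)}"
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to everything that remains.
        for k in nodes:
            if k != a and k != b:
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                )
        nodes = [k for k in nodes if k != a and k != b] + [u]

    # Two nodes are left: connect them directly.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

Each pass through the loop rescans all pairs, so this sketch is roughly cubic in the number of languages; that is fine for experimenting with a few hundred nodes, but a tuned implementation would be preferable at larger scale.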

How many languages are out there?

Ethnologue lists 6909 living languages in the world, but how many have some presence on the web?  The answer depends greatly on what kinds of documents you include.  If one takes linguistic studies into account, the number might be as high as 4000 – the Open Language Archives Community (OLAC) brings together data from linguistic archives all over the world into a single, searchable interface.  The OLAC coverage page shows, at present, the existence of online resources for 3930 of the 6909 Ethnologue languages, with more material coming online every day.  The amazing ODIN project harvests examples of interlinear glossed text from linguistic papers, and has over 1250 languages in its database.

The 1000 languages found by my web crawler are, for the most part, what you might call "primary texts": newspapers, blog posts, Wikipedia articles, Bible translations, etc.  My best guess at present is that around 1500 languages have primary texts of this kind on the web.  If you know of online resources written in a language that's not listed on our status page, please let me know in the comments.

Here are a couple of closely-related (but ill-defined) questions: first, "How many of the 6909 languages have a writing system?" and second, since a great number of the texts we've found are Bible translations or other evangelical works, one might ask "How many languages have a writing system that's used regularly by members of the speaker community?"  I've looked around a bit for answers to these questions but I haven't found any careful studies in the literature.


Mash it up!

I put all of the data and scripts needed to generate the image in a github repository.  I'm not an expert on data visualization, so I'm hoping others will grab the data and experiment.  One idea would be to use a more sophisticated algorithm for reconstructing the tree, such as Fitch-Margoliash. In terms of the visualization itself, it would be cool to do something that connects the tree to locations on a world map where the languages are spoken. There are also some Javascript/HTML5 graph viewers that might provide a better browsing experience.  Or you might simply select the colors in different ways (perhaps colors for different typological features: for example, SVO, VSO, etc.).  Feel free to post additional ideas in the comments!

Thanks

First, I'd like to thank the hundreds of people who have contributed to the project over the years by providing training texts in many of the languages, correcting errors in the language identification, editing word lists, and helping separate different dialects/orthographies.  You'll find many of their names on the project status page. Thanks also to Michael Cysouw who first suggested generating an image of this kind (you can find his image, created in 2005, on the main project page). Finally, thanks to my colleagues at Twitter for several helpful conversations and for their interest in the Indigenous Tweets project.

2011-12-06

Language revitalization through free software: the case of Aragonese

Aragonese is one of the minority languages of Spain, spoken in the autonomous community of Aragon in the northeastern part of the country.  With an estimated 10,000 native speakers, it is in a much more precarious position than its neighbors Catalan and Basque.  Nevertheless, there is a vibrant online Aragonese community that is working hard to develop free and open source resources to support and help revitalize the language.  One notable example is the tremendous volunteer effort that has gone into developing the Aragonese Wikipedia; weighing in at 25,000+ articles and 2.5 million words, it is believed to be the largest Wikipedia of any language, per number of native speakers.  For this interview, I spoke with two leading figures in the Aragonese online community about their work on behalf of the language: Santiago Paricio, a high school teacher of Spanish in Navarra, and Juan Pablo Martínez, a university professor in the Engineering School at the University of Zaragoza.


Santi Paricio (L) and Juan Pablo Martínez (R)
KPS: Please tell us a little bit about the Aragonese language, how many speakers there are currently, whether it's taught in schools, etc.

SP/JPM: Although there are no official data, it is estimated that Aragonese is spoken by some 10,000 native speakers in the north of Aragon (less than 1% of the Aragonese population), plus an indeterminate number of second-language speakers. The number of native speakers is decreasing dramatically, mainly due to the decline of intergenerational transmission. In most areas, only older people use the language. In contrast, there is a certain interest among young and middle-aged people in learning the language in areas where it is no longer spoken natively. Some of them are even raising their children in Aragonese.

But it has not always been this way. Aragonese was once spoken in almost all of Aragon and was one of the administrative languages of the Kingdom of Aragon. However, it has suffered a constant decline and progressive replacement by Spanish since the 15th century.

The language is taught only as a voluntary subject at five primary schools in the north of Aragon. Since 2010, with the passage of the “Law on Languages of Aragon”, the language has had minimal legal recognition from the local government. However, the Act, which established a language regulatory body (Academy) and voluntary classes at all educational levels in the regions where the language is still spoken, has hardly been implemented, and the new local Administration elected in May 2011 has announced that it will reform the Act, which it opposed, rather than implement it. According to the UNESCO Atlas of Endangered Languages, Aragonese is categorized as “definitely endangered”.

You can hear the sound of Aragonese at the Archivo Audiovisual del Aragonés.

KPS: What opportunities are there to use the language online?

SP/JPM: In Aragon, access to technology is not itself an issue. However, native speakers of Aragonese are a mainly aging and rural-based population, so their access to the Internet, computers, and ICT in general is on average lower than the rest of the population. Speakers of Aragonese as a second language are, in contrast, much more active on the Internet and, being more conscious of the language, they tend to use the language more often.

There are not many sites or software translated into Aragonese.  Some examples are Mediawiki (the software to build wiki webpages like Wikipedia), some parts of Ubuntu and Firefox, and several other small programs.  There is a nonprofit association, Softaragones, in which we are also involved, promoting software localization for Aragonese.

Aragonese Wikipedia
As for resources, Wikipedia in Aragonese is probably the main one nowadays. It is a very active project (the most active Wikipedia in terms of size per number of speakers), and is now the largest Aragonese corpus that can be found on the Internet (with the advantage of being free content). It has also attracted the attention of the Aragonese mass media, with several interviews on the public radio station and a full-page story in the main newspaper. We are currently involved in developing open-source tools for the language: spell checkers, machine translation systems, online dictionaries… We can also highlight the efforts in the field of distance language learning; for example, the non-profit cultural association Nogará-Religada has launched distance courses in Aragonese in recent years, based on the Moodle platform and assisted by other technologies, such as VoIP.

However, lack of resources and translated software does not preclude the use of the language on the Internet: we can find a number of websites and blogs written in Aragonese, and even a recently-created digital newspaper. Although modest in absolute numbers, their relative prevalence is high, given the size of the Aragonese-speaking community. Social networks represent a good opportunity to use the language online, by creating online speaker communities (very important for a community that is so sparse in the “real world”), or just using the language for general communication purposes (taking advantage of the fact that intercomprehension with the majority language, Spanish, is not difficult).

KPS: Many speakers of indigenous and minority languages are reluctant to use their languages online.  What is the general attitude toward using the language online?  Are there any special obstacles that arise for Aragonese speakers? 

SP/JPM: Most native speakers wouldn’t even think about using the language online, because the language still has a stigma of being “bad speaking”, “useless language”, “only valid to speak about the rural world”.  Some don’t even feel comfortable using the language outside their family circle. This does not fully apply to the youngest generations who have received the language from their parents: they often have a better linguistic awareness, as a part of their identity, and are less reluctant to use the language online, at least when communicating with known people. However, as most of them have not received any education in Aragonese, nor have they ever written the language, they often feel insecure about it. On the contrary, speakers of Aragonese as a second language are more likely to use Aragonese online, not only as a communication tool with other Aragonese-speaking Internet users, but also as an activist decision to promote the language. We think that the main driving forces for using the language online are activism and identity.


The proposed official orthography
KPS: How is/was computing terminology developed?  Is there a "language board" or are terms developed naturally by the community?  If there are official terms, how are they communicated to the community?

SP/JPM: That also holds in the case of Aragonese. The community usually adapts most commonly used terms from Spanish or Catalan to Aragonese, but there is not always a unique solution.  For lesser-used, more specific terms, we can mention the community working on the Aragonese Wikipedia as a source for terminology.  Softaragones has also developed a “collection of computing terms” and a style guide for software localization and translation, but this is mainly useful for advanced users and translators, rather than for regular users. Due to the lack of response from the administration, the II Congress of Aragonese created in 2006 a nonofficial regulatory board, the “Academia de l’Aragonés”. Together with their proposal of an interdialectal spelling system (PDF), they published some guidelines on the adaptation of technical words, which has somewhat reduced the multiplicity of possible solutions.  In brief, development of computing terminology is needed in Aragonese, but does not preclude online use of the language.

KPS: Are there other special challenges your community faces in terms of developing technology for the language and/or communicating online?

SP/JPM: We believe the adoption of a single spelling system would be crucial to boost the generation of new resources. The 2010 proposal of the Academia de l’Aragonés linked above has not reached full consensus, but it is the spelling system most widely used in the generation of new online content (e.g., in the Aragonese Wikipedia and in the online newspaper Arredol), as well as among the most active online users (for example, it is used by 25 of the 26 top tweeters listed on the Indigenous Tweets Aragonese page). As a consequence, the open source linguistic tools now under development are using this spelling system. Another issue is that of dialectal variation. While dialectal differences cause no communication problems, it is necessary to provide the dialects with tools such as spellcheckers and/or translators (or at least take them into account, as there is not a strong standard dialect). In general, dialects are not represented enough online.
Bilingual signs on a hiking trail (CC-BY)

Of course being such a small minority, software vendors and service providers do not show interest in including localizations for Aragonese, to say nothing of developing linguistic resources. We must find the way forward for our language in open source/free software projects, which allow the reuse or adaptation of technologies and resources developed for other languages. An example of this is Apertium, a free/open source machine translation project which has just released a first version of an Aragonese-Spanish bidirectional translator (the latest version can be tested here or here). These projects also promote cooperation between developers interested in different lesser-used languages or language lovers in general. Another example is the release of an Aragonese spell checker, which already has extensions for Mozilla products and LibreOffice.

KPS: Are young people using the language online?  Do you think social media sites like Facebook and Twitter are helping encourage language use by younger speakers?

SP/JPM: Yes, it is mostly young people who use the language online. Until a couple of years ago, the use of the language online was mostly limited to some second-language speakers and activists.  Recently, social networks like Facebook and Twitter have opened up new opportunities to use the language and to connect with other speakers, and are seen as a window to show the language and the community. This has indeed encouraged the use of Aragonese by younger speakers, now including native speakers, who have shifted their oral communication habits to these new modalities.  This is very good, as it puts people speaking different dialects in contact with each other, and also native speakers with second-language speakers, strengthening the feeling of being a community.

KPS: What is your vision for your language in ten years, both in general terms and in terms of software/online use?

Aragonese-speaking village of Ansó (CC-BY-SA)
SP/JPM: It is difficult to say.  The dream scenario would be that children in the speaking areas would be able to learn the language at school, and children in the rest of Aragon would have the opportunity to learn it. Aragonese society should also be more aware of the cultural value of their own language. With support from the Administration and Civil Society, the objective of preserving intergenerational transmission and increasing language vitality could be achieved.  In terms of online use, the aim would be that Aragonese speakers find the tools and resources to use their language online (translators, spellcheckers, speech synthesis and recognition, localized applications…), to get and create content in their language, and to use it correctly. 

In more realistic terms, we believe that the use of the language online and the availability of online/computer language resources will indeed increase in the coming years, and this will open opportunities for the language, but this by itself does not guarantee the survival of Aragonese.  The language must be transmitted to the children, and they need to learn to read and write the language at school.  Otherwise, the efforts we are undertaking in the “digital world” might be useless.  On the positive side, while decades ago it was already thought to be very close to extinction, Aragonese is still a living language in the 21st century, and we are working to keep it alive.