2011-12-23

1000 Languages on the Web

Click to see the full size image

About the Image

Since 2003 I've been gathering texts from the web written in indigenous and minority languages.  The image above is a "family tree" of the 1000 languages I've found to date, where proximity in the tree is measured by a straightforward statistical comparison of writing systems (details below).
  • When you load the full image it will be too big to fit in a browser window, and you may not see anything at first; you'll need to use the horizontal and vertical scrollbars to explore different parts of the tree (most browsers will also let you zoom in and out).  And because it's an SVG image, you can use your browser's search functionality (probably Ctrl+F or ⌘-F) to find different language codes, although the search behavior can be a bit weird/unpredictable.
  • Each language is colored according to its linguistic family (details here).  For example, all Indo-European languages are greenish colors, with different subfamilies (Celtic, Germanic, etc.) being slightly different shades of green.  I also tried to use similar colors for languages from the same geographical region even when there is no known genetic relationship among them, and so Arawakan, Quechuan, Tucanoan languages (all from South America) are shades of purple, while Central and North American languages are shades of blue.
  • Clicking on a language opens a new tab or window with the documentation page for the ISO 639-3 language identifier where you'll find a name for the language in English and a link to its Ethnologue page for additional information.
  • What I'm calling "languages" are really "writing systems"; you'll see, for example, separate nodes for bo (Tibetan) and bo-Latn (Tibetan written in Latin script).  In a small number of cases I track macrolanguages, regional variants (e.g. en, en-IE, en-ZA), and some dialects.  In total, there are 919 distinct ISO 639-3 codes among the 1000 writing systems represented.
I'm using these data in collaboration with language groups all around the world to develop basic resources that help people use their language online: keyboard input methods, spell checkers, online dictionaries, and so on.  This work also underlies the Indigenous Tweets and Indigenous Blogs projects, which aim to strengthen languages through social media.  You can learn more about how indigenous and minority language communities are using the web, social media, and technology to help revitalize their languages by following us on Twitter.

The Gory Details

Everything is based on an analysis of three-character sequences ("3-grams") in the different languages. It turns out that computing the statistics of 3-grams in a given language provides a "fingerprint" that can be used for language identification and a number of other applications.  Specifically, imagine the huge-dimensional vector space V whose axes are labelled with all possible 3-grams of Unicode characters (dim V > 10^15).  Given a collection of texts in a language, you can compute the frequencies of all 3-grams that appear in the collection, defining a (sparse) vector in V "representing" the language.  We then define the distance between two languages to be the angle between their representative vectors in V.  This can be computed by scaling the vectors to unit length and computing their dot product (which is the cosine of the angle we want).
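As a rough illustration of the distance just described, here is a minimal Python sketch. It is not the code from the project's repository; the function names `trigram_vector` and `angle_between` are made up for this example, and it assumes each language's corpus has already been read into a single string.

```python
from collections import Counter
import math


def trigram_vector(text):
    """Count every overlapping 3-character sequence in the text.

    The Counter is a sparse vector: 3-grams that never occur are simply absent.
    """
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def angle_between(u, v):
    """Angle in radians between two sparse 3-gram count vectors."""
    # Only 3-grams present in both vectors contribute to the dot product.
    dot = sum(count * v[gram] for gram, count in u.items() if gram in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("each text needs at least three characters")
    # Dividing by the norms is the same as scaling both vectors to unit length.
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Two texts with no 3-grams in common come out at a right angle (pi/2), which is the largest possible distance since counts are never negative. That is why languages in unrelated scripts all end up roughly equally far from each other.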

Once we know the distance between each pair of languages, we can reconstruct a phylogenetic tree using any of a number of well-known algorithms.  The image above was created using the so-called "neighbor-joining" algorithm (which basically builds the tree in a greedy, bottom-up way). A side-effect of the algorithm is that each edge in the tree is assigned a length, but note that the edge lengths in the rendered image have nothing to do with the computed edge lengths (indeed, it's unlikely that the tree can be rendered in a distance-preserving way in two dimensions).  Another side-effect of the algorithm is that the tree is connected by definition: all languages are within a bounded distance of each other, and so near the root of the tree you'll see various languages which use completely different scripts joined in a more-or-less random fashion (Khmer, Georgian, Tamil, Cherokee, etc.).  It would be easy enough to tweak the distance function or the algorithm to render languages with different scripts as separate connected components.
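For readers who haven't seen neighbor-joining before, the sketch below shows the textbook form of the algorithm in Python. It is only an illustration, not the script used to build the image. The function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the internal node labels `#0`, `#1`, ... are all choices made for this example (the labels are assumed not to clash with any language code).

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of edges (node, node, length).
    """
    d = dict(dist)          # working copy; grows as internal nodes are added
    active = list(names)    # nodes that have not been joined yet
    edges = []
    next_id = 0

    def get(a, b):
        return d[frozenset((a, b))]

    while len(active) > 2:
        n = len(active)
        # Total distance from each active node to all the others.
        r = {a: sum(get(a, b) for b in active if b != a) for a in active}
        # Greedy step: pick the pair minimizing Q(a, b) = (n-2) d(a,b) - r_a - r_b.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = active[i], active[j]
                q = (n - 2) * get(a, b) - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"#{next_id}"   # hypothetical label for the new internal node
        next_id += 1
        d_ab = get(a, b)
        # Lengths of the two new edges joining a and b to u.
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distance from the new node to every remaining node.
        for c in active:
            if c != a and c != b:
                d[frozenset((u, c))] = 0.5 * (get(a, c) + get(b, c) - d_ab)
        active = [c for c in active if c != a and c != b] + [u]

    # Two nodes remain: connect them with a final edge.
    a, b = active
    edges.append((a, b, get(a, b)))
    return edges
```

Every input, however unrelated the languages, comes back as a single connected tree with 2n - 3 edges, which is the "connected by definition" behavior mentioned above. With real data the computed edge lengths can occasionally be negative; this sketch makes no attempt to correct for that.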

How many languages are out there?

Ethnologue lists 6909 living languages in the world, but how many have some presence on the web?  The answer depends greatly on what kinds of documents you include.  If one takes linguistic studies into account, the number might be as high as 4000 – the Open Language Archives Community (OLAC) brings together data from linguistic archives all over the world into a single, searchable interface.  The OLAC coverage page shows, at present, the existence of online resources for 3930 of the 6909 Ethnologue languages, with more material coming online every day.  The amazing ODIN project harvests examples of interlinear glossed text from linguistic papers, and has over 1250 languages in its database.

The 1000 languages found by my web crawler are, for the most part, what you might call "primary texts": newspapers, blog posts, Wikipedia articles, Bible translations, etc.  My best guess at present is that around 1500 languages have primary texts of this kind on the web.  If you know of online resources written in a language that's not listed on our status page, please let me know in the comments.

Here are a couple of closely-related (but ill-defined) questions: first, "How many of the 6909 languages have a writing system?" and second, since a great number of the texts we've found are Bible translations or other evangelical works, one might ask "How many languages have a writing system that's used regularly by members of the speaker community?"  I've looked around a bit for answers to these questions but I haven't found any careful studies in the literature.


Mash it up!

I put all of the data and scripts needed to generate the image in a github repository.  I'm not an expert on data visualization, so I'm hoping others will grab the data and experiment.  One idea would be to use a more sophisticated algorithm for reconstructing the tree, such as Fitch-Margoliash. In terms of the visualization itself, it would be cool to do something that connects the tree to locations on a world map where the languages are spoken. There are also some Javascript/HTML5 graph viewers that might provide a better browsing experience.  Or you might simply select the colors in different ways (perhaps colors for different typological features: for example, SVO, VSO, etc.).  Feel free to post additional ideas in the comments!

Thanks

First, I'd like to thank the hundreds of people who have contributed to the project over the years by providing training texts in many of the languages, correcting errors in the language identification, editing word lists, and helping separate different dialects/orthographies.  You'll find many of their names on the project status page. Thanks also to Michael Cysouw who first suggested generating an image of this kind (you can find his image, created in 2005, on the main project page). Finally, thanks to my colleagues at Twitter for several helpful conversations and for their interest in the Indigenous Tweets project.

2011-12-06

Language revitalization through free software: the case of Aragonese

Aragonese is one of the minority languages of Spain, spoken in the autonomous community of Aragon in the northeastern part of the country.  With an estimated 10,000 native speakers, it is in a much more precarious position than its neighbors Catalan and Basque.  Nevertheless, there is a vibrant online Aragonese community that is working hard to develop free and open source resources to support and help revitalize the language.  One notable example is the tremendous volunteer effort that has gone into developing the Aragonese Wikipedia; weighing in at 25,000+ articles and 2.5 million words, it is believed to be the largest Wikipedia of any language, per number of native speakers.  For this interview, I spoke with two leading figures in the Aragonese online community about their work on behalf of the language: Santiago Paricio, a high school teacher of Spanish in Navarra, and Juan Pablo Martínez, a university professor in the Engineering School at the University of Zaragoza.


Santi Paricio (L) and Juan Pablo Martínez (R)
KPS: Please tell us a little bit about the Aragonese language, how many speakers there are currently, whether it's taught in schools, etc.

SP/JPM: Although there are no official data, it is estimated that some 10,000 native speakers in the north of Aragon (less than 1% of the Aragonese population) plus an indeterminate number of second-language speakers speak Aragonese. The number of native speakers is dramatically decreasing, mainly due to the fall of intergenerational transmission. In most areas, only older people use the language. In contrast, there is a certain interest among young and middle-aged people in learning the language in areas where it is no longer spoken as a native language. Some of them are even raising their children in Aragonese.

But it has not always been like this. Aragonese was once spoken in almost all of Aragon and was one of the administrative languages of the Kingdom of Aragon. However, it has suffered a constant decline and progressive substitution by Spanish since the 15th century.

The language is only being taught as a voluntary subject at five primary schools in the north of Aragon. Since 2010, with the passage of the “Law on Languages of Aragon”, the language has had minimal legal recognition from the local government. However, the Act, which established a Language Regulator Body (Academy) and voluntary classes at all educational levels in the regions where the language is still spoken, has hardly been developed, and the new local Administration elected in May 2011 has announced that they will reform the Act, which they opposed, rather than develop it. According to the UNESCO Atlas of Endangered Languages, Aragonese is categorized as “definitely endangered”.

You can hear the sound of Aragonese at the Archivo Audiovisual del Aragonés.

KPS: What opportunities are there to use the language online?

SP/JPM: In Aragon, access to technology is not itself an issue. However, native speakers of Aragonese are a mainly aging and rural-based population, so their access to the Internet, computers, and ICT in general is on average lower than the rest of the population. Speakers of Aragonese as a second language are, in contrast, much more active on the Internet and, being more conscious of the language, they tend to use the language more often.

There are not many sites or software translated into Aragonese.  Some examples are Mediawiki (the software to build wiki webpages like Wikipedia), some parts of Ubuntu and Firefox, and several other small programs.  There is a nonprofit association, Softaragones, in which we are also involved, promoting software localization for Aragonese.

Aragonese Wikipedia
As for resources, Wikipedia in Aragonese is probably the main one nowadays. It is a very active project (the most active Wikipedia in terms of size per number of speakers), and now represents the largest corpus in Aragonese that can be found on the Internet (with the advantage of being free content). It has also attracted the attention of Aragonese mass media, with several interviews on the public radio station and a full-page story in the main newspaper. We are currently involved in developing open-source tools for the language: spell checkers, machine translation systems, online dictionaries… We can also highlight the efforts in the field of distance language learning; for example, the non-profit cultural association Nogará-Religada has launched distance courses in Aragonese in recent years, based on the Moodle platform and assisted by other technologies, such as VoIP.

However, lack of resources and translated software does not preclude the use of the language on the Internet: we can find a number of websites and blogs written in Aragonese, and even a recently-created digital newspaper. Although modest in absolute numbers, their relative prevalence is high, given the size of the Aragonese-speaking community. Social networks represent a good opportunity to use the language online, by creating online speaker communities (very important for a community that is so sparse in the “real world”), or just using the language for general communication purposes (taking advantage of the fact that intercomprehension with the majority language, Spanish, is not difficult).

KPS: Many speakers of indigenous and minority languages are reluctant to use their languages online.  What is the general attitude toward using the language online?  Are there any special obstacles that arise for Aragonese speakers? 

SP/JPM: Most native speakers wouldn’t even think about using the language online, because the language still has a stigma of being “bad speaking”, “useless language”, “only valid to speak about the rural world”.  Some don’t even feel comfortable using the language outside their family circle. This does not fully apply to the youngest generations who have received the language from their parents: they often have a better linguistic awareness, as a part of their identity, and are less reluctant to use the language online, at least when communicating with known people. However, as most of them have not received any education in Aragonese, nor have they ever written the language, they often feel insecure about it. On the contrary, speakers of Aragonese as a second language are more likely to use Aragonese online, not only as a communication tool with other Aragonese-speaking Internet users, but also as an activist decision to promote the language. We think that the main driving forces for using the language online are activism and identity.


The proposed official orthography
KPS: How is/was computing terminology developed?  Is there a "language board" or are terms developed naturally by the community?  If there are official terms, how are they communicated to the community?

SP/JPM: That also holds in the case of Aragonese. The community usually adapts most commonly used terms from Spanish or Catalan to Aragonese, but there is not always a unique solution.  For lesser-used, more specific terms, we can mention the community working on the Aragonese Wikipedia as a source for terminology.  Softaragones has also developed a “collection of computing terms” and a style guide for software localization and translation, but this is mainly useful for advanced users and translators, rather than for regular users. Due to the lack of response from the administration, the II Congress of Aragonese created in 2006 a nonofficial regulatory board, the “Academia de l’Aragonés”. Together with their proposal of an interdialectal spelling system (PDF), they published some guidelines on the adaptation of technical words, which has somewhat reduced the multiplicity of possible solutions.  In brief, development of computing terminology is needed in Aragonese, but does not preclude online use of the language.

KPS: Are there other special challenges your community faces in terms of developing technology for the language and/or communicating online?

SP/JPM: We believe the adoption of a unique spelling system would be crucial to boost the generation of new resources. The 2010 proposal of the Academia de l’Aragonés linked above has not reached full consensus, but it is the spelling system most widely used in the generation of new online content (e.g., in the Aragonese Wikipedia and in the online newspaper Arredol), as well as among the most active online users (as an example of this, it is used by 25 of the 26 top tweeters listed on the Indigenous Tweets Aragonese page). As a consequence of this, the open source linguistic tools now under development are using this spelling system. Another issue is that of dialectal variation. While there is no communication problem caused by dialectal differences, it is necessary to provide the dialects with tools such as spellcheckers and/or translators (or at least take them into account, as there is not a strong standard dialect). In general, dialects are not represented enough online.
Bilingual signs on a hiking trail (CC-BY)

Of course being such a small minority, software vendors and service providers do not show interest in including localizations for Aragonese, to say nothing of developing linguistic resources. We must find the way forward for our language in open source/free software projects, which allow the reuse or adaptation of technologies and resources developed for other languages. An example of this is Apertium, a free/open source machine translation project which has just released a first version of an Aragonese-Spanish bidirectional translator (the latest version can be tested here or here). These projects also promote cooperation between developers interested in different lesser-used languages or language lovers in general. Another example is the release of an Aragonese spell checker, which already has extensions for Mozilla products and LibreOffice.

KPS: Are young people using the language online?  Do you think social media sites like Facebook and Twitter are helping encourage language use by younger speakers?

SP/JPM: Yes, mostly young people use the language online. Until a couple of years ago, the use of the language online was mostly limited to some second-language speakers and activists.  Recently, social networks like Facebook and Twitter have opened new chances to use the language, to connect with other speakers, and are seen as a window to show the language and the community. This has indeed encouraged the use of Aragonese by younger speakers, now including native speakers, who have shifted their oral communication habits to these new modalities.  This is very good, as it puts people speaking different dialects in contact with each other, and also native speakers with second-language speakers, improving the feeling of being a community.

KPS: What is your vision for your language in ten years, both in general terms and in terms of software/online use?

Aragonese-speaking village of Ansó (CC-BY-SA)
SP/JPM: It is difficult to say.  The dream scenario would be that children in the speaking areas would be able to learn the language at school, and children in the rest of Aragon would have the opportunity to learn it. Aragonese society should also be more aware of the cultural value of their own language. With support from the Administration and Civil Society, the objective of preserving intergenerational transmission and increasing language vitality could be achieved.  In terms of online use, the aim would be that Aragonese speakers find the tools and resources to use their language online (translators, spellcheckers, speech synthesis and recognition, localized applications…), to get and create content in their language, and to use it correctly. 

In more realistic terms, we believe that the use of the language online and the availability of online/computer language resources will indeed increase in the coming years, and this will open opportunities for the language, but this by itself does not guarantee the survival of Aragonese.  The language must be transmitted to the children, and they need to learn to read and write the language at school.  Otherwise, the efforts we are undertaking in the “digital world” might be useless.  On the positive side, while decades ago it was already thought to be very close to extinction, Aragonese is still a living language in the 21st century, and we are working to keep it alive.

2011-11-11

"Murdered on its native territory": Jordan Kutzik on Yiddish

Yiddish is a Germanic language traditionally spoken by Ashkenazi Jews in Central and Eastern Europe and in diaspora communities around the world.  Prior to World War II, it was the mother tongue of more than 10 million people, and had a thriving written tradition, with newspapers, scholarly works, and a modern literature being produced in the language.  This came to an abrupt halt with the Holocaust, which left the vast majority of Yiddish speakers dead, and saw the survivors scattered to all corners of the globe.  Although the language remains relatively strong among certain Hasidic and Orthodox Jewish communities, outside of those communities it faces many of the same obstacles as other minority languages in terms of encouraging its use among the younger generation, and guaranteeing intergenerational transmission.

Jordan Kutzik just finished his BA at Rutgers University in Jewish Studies and Spanish, focusing in particular on the Yiddish language and Spanish translation.  He is currently working at the National Yiddish Book Center in Amherst, Massachusetts as a fellow.

KPS: For readers not familiar with your language, tell us a bit about the history of Yiddish and its current status.
Jordan Kutzik

JK: The history of Yiddish and its current status are much more complicated than those of any other indigenous or minority language, except perhaps Romani, because the language was murdered on its native territory. It exists today in different pockets of speaker communities, descended from immigrants from Eastern Europe, on four different continents, and the language’s “strength” or “health” varies by community, country, and of course how one decides to measure it.

Yiddish, a Germanic language written in the Hebrew alphabet, was the mother-tongue of around 11 million people, 8 million of them in Eastern Europe prior to the Holocaust (Ukraine, Poland, Belarus, parts of Russia, Lithuania, Latvia, Romania, Moldova, etc,) with immigrant communities around the world.  In its Eastern European heartland it was the language of Jews of all levels of religious affiliations and the language of various schooling systems from secular schools to traditional religious academies.  Yiddish had an important literature of religious materials and original secular literature as well as translations from other languages and more than 100 daily newspapers, some of which were of a very high quality, on par with the national newspapers in other languages of the time period. The common language throughout Eastern Europe promoted a common ethnic identity among Ashkenazi Jews (those who traced their ancestry to Germany) and Yiddish was the strongest non-territorial language in the world, especially in terms of written material.  Right as the language was coming into its own in a modern sense, the Holocaust left around 6 million Jews dead in Europe, including 5.5 million Yiddish speakers.  The genocide not only killed its speakers, but more devastatingly for Yiddish it all but destroyed the civilization in which it had been the natural language.  Although by my own estimates around 1.25 million Yiddish speakers survived the war (most fleeing deep into the USSR, some surviving concentration camps, in Partisan Units, blending in with the surrounding population, joining the Russian army, etc.), the communities and institutions in which the language lived did not, and the vast majority of survivors left Eastern Europe for the Americas or British Mandate Palestine and later Israel.

In America the language died out in immigrant Jewish communities just as most immigrant languages eventually die out and in Israel the language was strongly discouraged and in some spheres actually outlawed in favor of Hebrew so it was not passed on for more than one generation for the most part there either.  After World War II, the USSR gained the Baltics and Poland and the strength of Yiddish among those few Jews who remained declined even further as the USSR enacted strong anti-Jewish national programs in Poland and the Ukraine and to a lesser extent Lithuania.  Yiddish did survive, however, among Hungarian Hasidim (who despite the name came not just from Hungary but also parts of Romania and Poland) for whom it largely remains the lingua franca whether these communities are in New York, Israel, Belgium, England, Canada or Australia.  In these communities Yiddish is the language of schools and religious academies, some media (newspapers, magazines, radio shows done through telephone hotlines, etc.) and the home.  In New York there are around 100,000 Hasidic Yiddish speakers and the population is extremely young and growing rapidly as the average family has 7 or 8 children.  There are about the same number of Yiddish speakers among Orthodox Jews in Israel as well, although the number there is tougher to gauge as language of the home is not asked as part of the census.  There are perhaps 20,000 Yiddish speaking Orthodox Jews in Antwerp, and perhaps a similar figure in both Montréal and London.  So a figure of 250,000 Hasidic Yiddish-speaking Jews is a fair guesstimate and the language is healthiest among these communities, being spoken by people of all ages.

Outside of the Hasidic world, Yiddish survived as the lingua-franca of many Holocaust survivors and many of their children speak it too.  There are still probably around 200,000 Yiddish speaking Holocaust survivors, with the majority in the USA and Israel.  But this population is very elderly and unfortunately will be gone in the coming decades.  Additionally, the language never died out entirely as a language of culture in Jewish communities in America, Latin America, Australia, France and Israel, and there are still non-Hasidic Yiddish language publications around the world.  There are, however, very few families who have kept the language alive as the language of the home and of raising children outside of the Hasidic world.  My generation has seen a bit of a revival as I know several hundred young people (age 16-30) like myself who have learned the language to fluency and I know a few dozen families who are raising their children as Yiddish speakers even though it was the mother-tongue of neither parent.  This is something I particularly hope to see more of in the coming years.  There are Yiddish courses in several dozen universities around the world, and some non-Hasidic Jewish day-schools teach Yiddish, although only a few do it so that the children leave with any real fluency. Among non-Hasidic Jewish schools Yiddish is strongest today in Australia.

In Lithuania with Fania Brantsovsky
As for official status, Yiddish has official status in the Jewish autonomous region of Russia known as Birobidzhan (near Korea!), but there are very few Jews there and few of them speak Yiddish.  Many non-Jews there learn Yiddish in the schools, however, some extremely well, and there are even government signs on courthouses and such in Yiddish, the only place in the world with actual Yiddish signage on public buildings.  Yiddish has token recognition in Israel, along with Ladino, the language of Jews who left Spain after the expulsion of 1492, but for all intents and purposes the Israeli government doesn’t do much to support Yiddish.  Yiddish is also an official minority language of Sweden, Holland, Poland, Romania, and the Ukraine under the European Charter for Regional or Minority Languages but not much is done on its behalf by these governments.

KPS: How have you been personally involved with language revitalization and activism on behalf of Yiddish?

JK: I have been involved with Yiddish language revitalization/activism for the past four years in various capacities.  I am a board member of Yugntruf Youth for Yiddish, an organization which promotes Yiddish among young people around the world and most especially in the NYC area.  Almost all of our events are run exclusively in Yiddish, most prominently our “Yiddish Week” which attracts around 150 people from around the world.  I am particularly active with Yugntruf’s facebook and twitter presence, as well as finding young Yiddish speakers in unexpected places around the world through the internet.  I also run a Yiddish-themed Youtube channel with lots of films of tours in Yiddish with English subtitles with Fania Brantsovsky, the librarian of the Vilnius Yiddish Institute and a Holocaust survivor and former partisan.  I didn’t know how to make/edit films when I made the channel so most of the films aren’t of the highest quality but there is a lot of interesting and important stuff there about Yiddish, the Holocaust, Jewish culture, etc.  Now that I’ve learned how to shoot/edit film properly I will have higher quality films in the future.  I also work as both a freelance (paid) translator as well as a volunteer translator for people using Yiddish language source materials for research involving the Holocaust for creative writing projects, historical research etc.  I copyedit an online web-journal connected with the Yiddish Farm project and have a blog in Yiddish that desperately needs to be updated.  I also tweet in Yiddish on my personal Twitter feed and run a Twitter feed dedicated to publicizing Yiddish classes and immersion opportunities (@yiddishclasses). 

KPS: What opportunities are there to use the language online?  Are there websites translated into your language?  What about software and other resources like web browsers, office software, spell checkers?

JK: Most Yiddish online now is computer generated as Google translate is available in Yiddish.  It is quite poor, actually, because if you translate a text with a word in plural form it won’t actually translate it but rather transliterate it into the Hebrew alphabet.  But when you search for a Yiddish word now most of the websites that come up are Google translations of other sites that are computer generated, which makes it more difficult to find websites that were actually written in Yiddish.  Among non-Google translated websites in Yiddish there are some Yiddish language publications, some Yiddish organizations, some Hasidic message-boards, a few Yiddish bands and so forth with Yiddish websites.  Almost all of these sites are also in a national language like English, Hebrew, French or Polish and usually the Yiddish site itself is far less extensive than the versions in other languages. It is particularly strange and frustrating to me that none of the websites for Holocaust survivors run in Yiddish.  There is also a Yiddish Wikipedia with some 7,000 articles (largely written by two very dedicated men), a Yiddish version of Google search, and some Jewish communal organizations, especially in Eastern Europe, have summary pages in Yiddish.  There is also an excellent online dictionary created by Refoyl Finkel.

KPS: Many speakers of indigenous and minority languages are reluctant to use their languages online, for various reasons.  How do speakers of your language feel about using the language online?

JK: With the internet and Yiddish there are three distinct communities: Hasidic, Yiddishist and heritage.  Hasidic Jews are, generally speaking, not supposed to be on the internet according to the rules of their own communities, or are only supposed to use the internet for business, in which case they will probably be doing so in a national language.  Many are, however, and there is a lot of informal Yiddish language internet use among them on message boards, Twitter, Facebook etc.  Most Yiddish-speaking Orthodox Jews on the internet, however, use English, Hebrew, French or Dutch as these languages are more widely understood, so Yiddish usage is usually restricted to intra-community affairs, especially when they want to keep non-Hasidic Jews out.

A few Yiddishists like myself have set up Yiddish blogs, twitters, facebook pages and so forth in an effort to make the language more visible.  We also have Yiddish language Google groups and so forth.  Often times we use Yiddish as a matter of principle online even though we could be communicating in another language.

Trilingual sign (English/Spanish/Yiddish) in Brooklyn, NY
Some heritage Yiddish speakers, often the children of Holocaust survivors, will use Yiddish if they find that they don’t have another language in common with another person.  This sometimes overlaps with the Yiddishist community as well.  For instance I’ve written people at Jewish communal organizations in France and Brazil about things that had nothing to do with Yiddish just to get a response that they didn’t speak English and asking if I spoke Hebrew or Yiddish!  Far more people, and probably far more French Jews for that matter, speak English than Yiddish, but in some cases my knowledge of Yiddish proved to make communication possible where it wouldn’t have been otherwise.  So there is some non-ideologically based internet Yiddish use going on too.  I never run into that type of thing when I email a Jew in say, England or Mexico because I speak/write English and Spanish but with Brazil and France it happens occasionally. So in that sense the internet has actually gotten people to use the language more often than they would have otherwise because people are meeting online who would not meet otherwise and would otherwise have no practical use for the language.

Actually using Yiddish, however, poses some technical challenges.  Yiddish uses a modified form of the Hebrew alphabet and makes use of some vowel markings and diacritical markings that are not used in Hebrew.  Many people don’t know how to use the Hebrew keyboard or the Yiddish keyboard programs that have been developed, and most people who can write Hebrew can’t write the special characters used for Yiddish with their Hebrew word-processing programs.  Furthermore, many online programs have problems displaying right-to-left languages like Yiddish and have particular difficulties displaying Yiddish, so things like periods, commas, and exclamation points will end up on the wrong side of a line.  On Twitter the vowel markings get counted as an extra character and, to make matters worse, they often do not display correctly!  A friend of mine who is very good with computers tried to make a “Twitter friendly” Yiddish program with pre-combined characters but Twitter still split the characters up.  This makes it much easier to leave out the vowel markings and diacritical marks on Twitter, but some sticklers would rather tweet shorter messages or not tweet in Yiddish at all than tweet without using the proper Yiddish spelling.  Most Hasidic Jews, as well as myself sometimes, forgo the vowel markings and diacritical markings on the internet and especially on Twitter because it really can be a headache.  I use a transliteration machine to type Yiddish, so I can’t write Yiddish in a chat program like Facebook message; instead I’ll transliterate the language into the Latin alphabet.  I do the same thing with text messages in Yiddish.
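The splitting of pre-combined characters described here is consistent with how Unicode normalization treats the Hebrew presentation forms, although the interview does not say what Twitter actually did internally. The following is a minimal sketch, assuming a service that normalizes incoming text to NFC; it uses only Python's standard unicodedata module, and the helper name nfc_length is made up for illustration. The precomposed Yiddish letters are excluded from recomposition in Unicode, so NFC turns each one back into a base letter plus a separate vowel mark, which then counts as two characters.

```python
import unicodedata

# Precomposed Yiddish letters from the Hebrew presentation-forms block.
PRECOMPOSED = {
    "pasekh alef": "\uFB2E",  # HEBREW LETTER ALEF WITH PATAH
    "komets alef": "\uFB2F",  # HEBREW LETTER ALEF WITH QAMATS
    "khirik yud": "\uFB1D",   # HEBREW LETTER YOD WITH HIRIQ
}


def nfc_length(text: str) -> int:
    """Hypothetical helper: code-point count after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


for name, ch in PRECOMPOSED.items():
    normalized = unicodedata.normalize("NFC", ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in normalized)
    # One code point goes in; a base letter plus a combining mark comes out.
    print(f"{name}: {len(ch)} code point -> {len(normalized)} ({code_points})")
```

Under that assumption, typing precomposed letters would not save any characters, which matches the experience described above.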

A bunch of us tried to organize a massive effort to translate Facebook into Yiddish since they were using crowd-sourced translations, but it just didn’t take off.  There is a Yiddish translation for BlackBerry and a few smartphones have been made for Hasidic Jews in Israel in Yiddish.

KPS: I mentioned above that many indigenous languages lack computing terminology.  Is this an issue for your language?  How is/was terminology developed?

JK: As far as vocabulary, most Yiddish speakers learned to use a computer in another language, but since Yiddish is sometimes the only common language among people using it online there has been a slight tendency toward the creation of neologisms.  Most of these are unknown among Hasidic Yiddish speakers and are only used by Yiddishists, but a dozen or so, including some of the most essential like blitspost (“email” as a category), blitsbriv (an individual email), vebzaytl (website), and shleptop (laptop), have caught on in both the Hasidic and Yiddishist world.  Blits means lightning in Yiddish, so the words for email mean “lightning mail” or “lightning letter.”  Veb means “web” and zaytl means “page” so that renders “webpage” but it also echoes the English “website” as the pronunciation is similar.  Older Yiddish words like the words for screen, document, keyboard, erase, save, etc. have been naturally given newer meanings, but you’ll also see English or Hebrew equivalents being used and transliterated to Yiddish spellings too.  For basic everyday computer usage it’s never a problem, and there are basic computer classes in Yiddish for Yiddish-speaking Hasidic Jews taught over the internet, but I doubt anyone is doing complicated programming in Yiddish on a regular basis, with the exception of some database work cataloging literature which was done at an Israeli university.

KPS: Are there other special challenges your community faces in terms of developing technology for the language and/or communicating online?

JK: There is an academic standard written Yiddish spelling but most speakers don’t use it.  This really doesn’t cause any problems in computer usage or reading the language because everyone except students just beginning to read/write is familiar with variations in spelling.  This does cause problems, however, when someone wants to make searchable databases.
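One common way to make a database searchable despite spelling variation is to index each word under a normalized search key. The following is only a rough sketch of that idea in Python, not something the interview describes; the function name search_key and the choice of rules are assumptions for illustration. It strips vowel points and other combining marks and expands the three Yiddish digraph characters into plain letter pairs, so pointed and unpointed spellings of a word map to the same key. Variation in which letters are chosen, as opposed to pointing, would need additional rules.

```python
import unicodedata

# Yiddish digraph characters and their two-letter equivalents.
LIGATURES = {
    "\u05F0": "\u05D5\u05D5",  # double vav  -> vav + vav
    "\u05F1": "\u05D5\u05D9",  # vav yod     -> vav + yod
    "\u05F2": "\u05D9\u05D9",  # double yod  -> yod + yod
}


def search_key(word: str) -> str:
    """Hypothetical helper: reduce a Yiddish word to an unpointed key."""
    # NFD separates precomposed letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop nonspacing marks (vowel points, dagesh, rafe, and so on).
    unpointed = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    for ligature, pair in LIGATURES.items():
        unpointed = unpointed.replace(ligature, pair)
    return unpointed


# "Yidish" with and without the hiriq point on the second yod.
pointed = "\u05D9\u05D9\u05B4\u05D3\u05D9\u05E9"
plain = "\u05D9\u05D9\u05D3\u05D9\u05E9"
print(search_key(pointed) == search_key(plain))
```

A database could store the original spelling for display and use the key only for matching, so no one's preferred orthography is altered.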

KPS: What is your vision for your language in ten years, both in general terms and in terms of software/online use?

JK: Yiddish speakers need to organize to use resources and funding available from governments, especially in Europe, to teach Yiddish to more people, especially children.  I am particularly interested in the language-nest model and want to assemble a team of people down the road who could start an international non-profit with a steering committee to run language nests in Jewish communities where Yiddish was spoken before World War II and where it enjoys protection under the European Charter for Regional or Minority Languages.  There is also enormous potential for broadcast media in Yiddish done through the internet.  We have radio shows which double as podcasts and YouTube channels, but we could really use something like a weekly TV show done as a podcast.  There is no local market that would justify the expense of a Yiddish TV show on TV, as the Orthodox don’t use TVs, but now with the internet and archiving it could be done. And I think that any use of media, whether websites like Twitter, radio broadcasts, podcasts, or more traditional media like newspapers and magazines, helps to promote the language.

As far as online use, I’d like to see more Jewish organizations and governments, especially those that serve Yiddish speakers such as Holocaust survivors or Hasidic communities, have websites in Yiddish.  It’s absurd that the government of Sweden and the New York Health Department publish information online in Yiddish but the government of Israel does not.  German, French and American websites written for Holocaust survivors and their children should also have information available in Yiddish.  I’d also like to see a usable Facebook interface in Yiddish.  Obviously Facebook in Yiddish wouldn’t be practically useful like say a Health Department bulletin written for Hasidic Jews but it would be a really cool thing to be able to show to young people and say “hey, you can even use Facebook in Yiddish!”

2011-09-18

New feature: Indigenous Blogs!

The Indigenous Tweets project turned six months old on Saturday, coincidentally the same day we reached 1000 followers on Twitter.  To celebrate these milestones, I've added an exciting new feature to the site that tracks blogs written in 50 indigenous and minority languages.  You can find this new feature at http://indigenoustweets.com/blogs/ (I also registered http://indigenousblogs.com/ but it should just redirect you to the other address).

Indigenous Blogs: Main Page
For now, I'm only tracking blogs hosted at Blogspot, which hosts more than 90% of the blogs written in the languages I'm interested in.  That said, I hope to add other popular services like WordPress, Tumblr, Movable Type, etc. going forward.

The site is laid out just like Indigenous Tweets: there is a main page with a table of the supported languages, and then if you click on a language in the table you'll be taken to a new page that shows all of the blogs in the language along with some statistics for each: number of posts, percentage of posts in the language, total number of words, date and title of last post.

Indigenous Blogs language page: Irish/Gaeilge
What I hope will be most useful are the feeds that I've provided on each language page; these will contain every post in every blog written in the language.  You'll see a link to the feed on the right-hand side of the page, with the text "Subscribe to all posts in this language: ".  With most browsers, you can subscribe to the feed just by clicking on this icon (if you've never used a news feed before, here is a useful introduction).  I subscribe to feeds using Google Reader, but there are many other popular readers like NetVibes, NetNewsWire, My Yahoo!, and RSSOwl.

If you'd like to be more selective about what you read, you can pick any blog that looks good to you, click on it in the table to visit the blog itself, and subscribe from there.  Most Blogspot blogs have a link that says something like "Subscribe to: Posts (Atom)", usually at the bottom of the page.

If you know of a Blogspot blog that is missing from one of the tables, simply enter it into the form on the right-hand side of the page.  It should appear in the table within 24 hours.

Finally, like the Indigenous Tweets site, I've designed things to make it easy to translate the individual language pages.   The Indigenous Blogs pages for Aragonese, Aymara, Welsh, Frisian, Irish, Scottish Gaelic, Haitian Creole, Māori, Chicheŵa, and Yiddish are already translated; great thanks to Juan Pablo Martínez Cortés, Ruben Hilaire, Carl Morris, Rhys Wynne, Wim Benes, Michael Bauer, Jean Came Poulard, Karaitiana Taiuru, Edmond Kachale, and Jordan Kutzik for providing these translations. There are just seven short messages to translate (in addition to the 13 needed for the Indigenous Tweets translation):


  • Title
  • Author
  • Posts
  • Last Post
  • Words
  • Any blogs missing?
  • Subscribe to all posts in this language 
I hope you all enjoy this new feature, and I hope it inspires some of you to start a blog in your own language!

2011-09-07

In the shadow of Pinatubo: José Navarro on Kapampangan

Kapampangan is spoken in Central Luzon, on the main Philippine island of Luzon, north of Manila (see map below).  It is the seventh largest language of the Philippines, with about 2.5 million native speakers.  According to the Philippine Constitution, regional languages have "auxiliary official" status in the regions, but, despite being the main language of Pampanga Province and one of the two main languages of Tarlac Province, Kapampangan does not have official status, and is not taught in schools.

According to the 1987 Constitution, the official languages of the Philippines are English and Filipino.  Filipino was originally conceived of as a national language that would be "developed and enriched on the basis of existing Philippine and other languages", but in practice this has not happened; nowadays it is usually described as a "standardized form" or a "prestige register" of the Tagalog language (Tagalog is the most widely spoken indigenous language on the islands and the traditional language of the capital city, Manila).  Many speakers of regional languages in the Philippines view Filipino and Tagalog as one and the same.  It appears that Google does as well; the Google search interface was available as far back as 2000 in "Tagalog", but if you browse the Internet Archive, you'll find that sometime in 2004 it was renamed "Filipino".

In any case, the promotion of Filipino has taken its toll on the use of Kapampangan and the other indigenous languages of the Philippines.  If you look at the Indigenous Tweets pages for Philippine languages like Ilocano, Waray-Waray, or Kapampangan, you'll notice that the percentage of tweets "in language" is on average quite low, reflecting the fact that many speakers of these languages are more accustomed to using Tagalog or English online.

This linguistic landscape is in some ways similar to the ones found in multilingual African countries like Malawi, Tanzania, or Ghana, where one indigenous language is promoted as a national language and is taught in schools alongside English, while smaller indigenous languages are used primarily at home and in local communities.  Comparisons can also be made with other multilingual states, such as Spain, Switzerland, Canada, etc., each offering a different model of regional and linguistic autonomy.


José Navarro is a writer, editor, and researcher who has written a number of articles on language revival, focusing particularly on Kapampangan, for online discussion groups, local publications, and Wikipedia.  He agreed to talk with us about the current state of the language, both online and offline.
Map by Christopher Sundita, CC-BY-SA

KPS: What opportunities are there to use Kapampangan online?  Is internet connectivity or access to computers an issue for your community?

JN: Online, Kapampangan is used on several Kapampangan-language discussion groups, whether connected with Pampanga, Tarlac or towns or cities in these provinces, or with websites or discussion groups catering in general to Kapampangan speakers.  The language is also often used on general Philippine sites where there are many Kapampangan speakers. With respect to Internet connectivity, there is connectivity in major towns and cities, although availability is still limited in small towns and rural areas. Unfortunately, software and other resources such as web browsers, office software, and online dictionaries are still generally nonexistent.

Google, the search engine, is unavailable for Kapampangan, but the Tagalog version has been made the default engine for the Philippines, reinforcing among Kapampangans contempt or a low regard for their native language, and at the same time magnifying their admiration for Tagalog, an attitude which has been encouraged by government and the schools ever since the late American regime and the Japanese Occupation, when it was first enforced in the schools.

KPS: Many speakers of indigenous and minority languages are reluctant to use their languages online, for a number of reasons. What is the general attitude toward using Kapampangan online?

JN: Sometimes, the lack of a standard, generally accepted orthography is a problem, since the older generation used a Spanish-based spelling, while young people today, who were educated in Tagalog, are more comfortable with a Tagalog-based script.  However, this is often not a problem, and if they decide to use Kapampangan, they use the orthography with which they are most comfortable.  In discussion groups, however, where the audience is general Philippine rather than specifically Kapampangan, they would opt to use a more widely used language, such as English or Tagalog, since they would be ashamed to use their language when non-Kapampangans are present, something which also happens in real life (that is, not online). Sometimes, these non-Kapampangans are unusually assertive in forcing their language even in the Kapampangan sites. For instance, in a Kapampangan-language discussion group (in which Kapampangan was the usual medium), there was a Tagalog who, because he was unable to speak Kapampangan, asked a question in Tagalog.  When he was requested to speak to the group in English instead, he said that in mixed company, he expected Kapampangans to use Tagalog.  Unfortunately, this is applied more forcefully in some discussion groups.  I've encountered groups where Tagalog and/or English are enforced, in effect discriminating against non-speakers of English or Tagalog or against non-Tagalogs.

KPS: You've been actively involved with the Kapampangan Wikipedia. Can you comment on the importance of that work, both in terms of its usefulness as a source of native language information for Kapampangans and in terms of raising the profile of the language online?


JN: As you know, Wikipedia has become the most popular encyclopedia in the age of the Internet, one whose reach has truly become prodigious. It has become everybody's encyclopedia, one of the ten most visited sites on the Net.  There's no question that it has gone a long way in raising the profile of the language online, and especially among young Kapampangans in general. This is very important, because the advent of the Internet has intensified overt or covert suppression of Kapampangan and the domination of Tagalog, the basis of the national language, and the only language, besides English, taught in Philippine schools. Google, as well as the major social networking or blogging media and online translators have made Tagalog their default language, making Kapampangans look down on their own language more strongly than before. This neo-colonial setup has worsened over the years, with Tagalog tending to monopolize mass media and the schools, and this now extends to the Internet.

The existence of a Kapampangan Wikipedia has a valuable role in countering this. In addition, the proposed use of the mother tongue, at least for beginning schooling, in the early grades (after decades of using exclusively Tagalog and English) will also provide a readily available reference in the language, which students can easily consult. It will also help standardize the spelling and equivalents of English terms, things which were previously available only in Tagalog.
Church in Pampanga; photo by Shubert Ciencia (CC-BY)

KPS: I mentioned above that many indigenous languages lack computing terminology.  Is this an issue for your language?  How is/was terminology developed?  Is there a "language board" or are terms developed naturally by the community?  If there are official terms, how are they communicated to the community?

JN: This is usually not regarded as a problem.  English terms are usually borrowed where a Kapampangan term is unavailable.  Unfortunately, there is no official language board.  In effect, the closest thing to this would be online discussion groups, or the Kapampangan Wikipedia.  For the most part, few Kapampangan terms have been developed.  Examples include "Aptas" for "Internet" and "ikuldas" for "download."

KPS: Are there other special challenges your community faces in terms of developing technology for the language and/or communicating online?

JN: One problem is the lack of a standard, generally-accepted orthography or spelling system, which I have mentioned.  However, I do not see this as a problem in the long term, since young people, who form the bulk of the online Kapampangan community, are converging on an orthography similar to Tagalog, which is also the one supported by the government, which may use it if the plan to use Kapampangan in schools is implemented.  A more serious setback is the lack of interest from the dominant software vendors or players.  Google, in particular (and now Facebook and Blogspot) has shown a bias for Tagalog in the Philippines, making Tagalog the default medium in the country, even if Tagalog computer terminology is not uniform or standard, and hence would be more difficult to understand than English, above all by non-Tagalogs.  This Google bias is even more galling if one considers the fact that two non-Tagalog languages, the second and third biggest in the Philippines, Cebuano and Ilocano, have submitted interfaces for their languages to Google, which has, as far as I know, ignored them completely, not even bothering to send an acknowledgement.  If bigger languages have met this kind of response, I can easily imagine the response Kapampangan would get.

KPS: Are young people using the language online?  Do you think social media sites like Facebook and Twitter are helping encourage language use by younger speakers?

JN: Young people do use the language online, including on social media sites like Facebook or Twitter.  I would hope these sites encourage the language, but the existence of the medium alone does not assure that.  I've been constantly monitoring Kapampangan tweets, and have come across encouraging ones, like the following (with English translations):

"Hannggang eni byasa ku pa din mag capampangan. Kung capampangan ya ing casabi mu mag capampangan bang masanting. Mag praktis ku para e mawala" [Even now, I can still speak Kapampangan. Whenever I speak with people who are Kapampangans, I use the language, so it's better/more advantageous. I keep practicing the language so that I do not forget it.]

"kasanting byasa ka pa mu rin kapampangan :)" [It's good to know that you can still speak Kapampangan.]

I would like to think that Indigenous Tweets, and particularly Kapampangan Indigenous Tweets, has something to do with this increased pride in the language, which can help bring about a renewed revival. This is of course a continuing process... but the fact that the members of the Kapampangan Indigenous Tweets are growing is also something to celebrate.

KPS: How is the government's support of Tagalog impacting the use of Kapampangan and other Philippine languages, online and offline?

JN: There is, unfortunately, a language shift to Tagalog among younger people, which is accelerating due to the exclusive use of Tagalog by government (the only Philippine language taught and encouraged by Philippine authorities), and the domination of this tongue in the media. Even worse, the use of Kapampangan is, in many cases, prohibited by schools, which instead force them to use Tagalog (or English), on pain of a fine. Placed on top of the dominance of Tagalog in the media and the increasing social pressure to move towards this more prestigious language, the effect on the native tongue is truly devastating, indeed, and is proving to be fatal in more and more places.  Something has to be done to arrest this destructive attitude, and it will have to involve changing government policy and offering the language in the schools. There has to be an energetic effort to promote the language, but unfortunately, Google, Facebook, Blogspot, and the rest are helping the government push Kapampangan and other oppressed languages further toward the brink, and at a much more rapid rate. Something should be done by concerned technical (and, one hopes, influential) people in a position to do something to oppose [this].

Mt. Pinatubo in 1991

KPS: The Kapampangan people were particularly hard hit by the eruption of Mt. Pinatubo in 1991.  Was there a direct impact in terms of language shift because of the eruption?

JN: Yes, in many ways, and on several levels. For one, people did move away from the area affected by the eruption. For example, I know of people who moved from Bacolor (a town nearly buried by lahar, or "mudflows", which are actually more similar to sand; the word originated in Javanese and has entered scientific usage) who ended up in Cavite (south of Manila). This was repeated for thousands of people.  Many of them ended up in Mindanao, the southern island of the Philippines. Needless to say, their children ended up speaking not Kapampangan but Tagalog, or the languages of the areas to which they transferred.

For many who remained, the weakened state of the language aggravated the on-going shift to Tagalog, which began with the increasing use of Tagalog in schools, the media and society both as an inter-ethnic medium and as a language of instruction and formal discourse.  On the other hand, among many Kapampangans, there was something of a backlash, with a good number reasserting their culture and identity in response to the eruption. The aftermath of the catastrophe led to the revival of Kapampangan festivals and culture in general, of Kapampangan publications, and of government support. One of the important things which happened was the founding of the Center for Kapampangan Studies by a leading university in Angeles City, which has become a center for revival of the language.

KPS: What is your vision for your language in ten years, both in general terms and in terms of software/online use?

JN: I want my mother tongue to be fully available online and in software, and to be able to utilize it, or for it to be utilized, in the modern digital media available to bigger or more powerful languages.  (As it is, the new media have, in many cases, become additional factors for oppression and denigration, instead of fulfilling their potential as equalizers).  Search engines, translators, computer/online games, and social media should be available in Kapampangan versions, so the language would then be able to hold its own and compete on equal terms with other languages, including those which have become the agents of its persecution and elimination in its native country.

2011-08-22

“We're here, we're using this language”: Michael Bauer on Scottish Gaelic

Scottish Gaelic (a.k.a. "Scots Gaelic", "Gàidhlig", or just "Gaelic") is the Celtic language traditionally spoken in Scotland.  It is closely related to my own language of Irish, and also to Manx Gaelic which is spoken on the Isle of Man.  While it has a relatively healthy population of around 60,000 speakers (2001 census), there has been a steady shift to English over the last hundred years, even in the places where the language is the strongest and where it remains the primary community language, on Scotland's Western Isles.  UNESCO's Atlas of the World's Languages in Danger lists Gaelic as "definitely endangered".

Gaelic has been used online for many years.  Indeed, my friend Caoimhín Ó Donnaíle, who teaches Computing at Sabhal Mòr Ostaig, the Gaelic-medium university on the Isle of Skye, co-founded the email list GAELIC-L as far back as 1989!  You'll find more than twenty years of messages, millions of words of Irish, Manx and Scottish Gaelic, in the archives of that list.

Over the last couple of years, a flurry of open source software packages has been made available in the language, mostly due to the tireless work of Michael Bauer, who was kind enough to take time out of his busy schedule to talk with us about the state of the language and some of his recent projects.  Michael is self-employed as a full-time language consultant, providing what he calls "Gaelic Language Services": translation, proofreading, adult teaching, linguistic research, and, latterly, micro-publishing. He has produced some truly remarkable online resources for speakers of the language: a large bilingual dictionary (Am Faclair Beag), a high-quality digitized version of Dwelly's famous 1911 dictionary (a massive undertaking, produced over a ten-year period), an open source spell checker (An Dearbhair Beag), and translations of several important software packages, including Firefox, Thunderbird, Opera, and Freeciv, the open source version of the classic game "Civilisation".  All of these projects were done on a purely volunteer basis, with no external funding: a good lesson for any small language groups that might be waiting for financial support before beginning terminology development or software translation projects!  Since the launch of Indigenous Tweets in March, Michael has provided a huge amount of help to me personally, using his broad linguistic expertise to find tweets in several new languages.  You'll find him on Twitter as @LowRisingTone and @akerbeltzalba (Gaelic tweets only).

KPS: For readers unfamiliar with Scottish Gaelic, tell us a bit about the language, how many speakers, whether it's taught in schools, etc.

MB: Scots Gaelic is in a peculiar situation today. It enjoys official support not seen for centuries while at the same time suffering severe attrition of speakers and usage. The 2011 census figures aren't out yet but in 2001, the census reported just under 60,000 speakers. That's just over 1% of the population and to put that into perspective, that's down from just over 200,000 (about 4.5% of the population) in 1901. The largest challenge posed to the language today is a mixture of rural depopulation, an ageing speaker demographic and a collapse of everyday usage in the remaining majority Gaelic-speaking communities up and down the West Coast. Yet at the same time, Gaelic-medium education (GME) is on the increase (though still pitifully low, with about 0.4% of all schoolchildren receiving GME), as are adult learner numbers; the range of books published is improving; and there is a Gaelic TV channel and a government broadly supportive of the language.

Michael Bauer
Legally the language has a similarly ambiguous status – for example for immigration purposes, a knowledge of Scots Gaelic fulfils the legal requirements of speaking a UK language and you can sit the Citizenship Test in Gaelic but on the other hand, it's not an official language which you are entitled to use at an official level unless it happens to be on offer.

The other challenge it still faces is widespread ignorance of the language and its history in the general population. The curriculum makes little to no reference to the position of dominance the language enjoyed for centuries or the reasons for its decline and though a majority of people feels broadly supportive of the language, there is still much animosity towards the language and a vocal minority who feels it is irrelevant to Scottish identity in the 21st century. The emergence of the concept of the Gaelic-speaking Highlander (them) and the Scots or English-speaking Lowlander (us, for most) goes back so far that most simply aren't aware that there was a time when a Scot de facto spoke Gaelic. 

KPS: What opportunities are there to use the language online? 

MB: It's a mixed picture. Google has had a Gaelic interface since about 2001 (which I started working on while at university) but I don't know what the uptake is. The main problem is that Google has a very strange approach to selecting which parts of their software suite are up for localisation and which aren't. For example the simple search interface is available, Google Docs isn't.

Facebook isn't available - their selection process is even stranger, but on the bright side, it doesn't seem to deter a lot of people from using the language on Facebook. And I know of a fair number of people who use the Irish interface.


There's an old release of OpenOffice (and I'm working on the update); Microsoft has been working on a (C)LIP [Language Interface Pack] for … oh, a long time but hasn't released anything yet. There are no localised operating systems but I personally feel that's a low priority. With limited resources, I always try to focus on projects which maximise impact. Few everyday users tinker with their OS on a daily basis and even fewer would be confident doing that in Gaelic – there is no Gaelic support team, and you have the problem that many computers are shared by speakers and non-speakers. So it's sort of on my to-do list but way down.

There are a few spell-checkers, only two of which are used widely (again thanks to you for helping us create one of them!). The Firefox app version of one of them has about 400 daily users, which is encouraging. As is the increase in the number of Open Source software packages in general. Firefox was launched in Gaelic in 2010 (and I'd like to thank you for bullying me into that!), followed by Thunderbird (Mozilla's email program), an app for Firefox that lets you switch between interface languages (the Quick Locale Switcher), and hopefully the upcoming release of Lightning, Mozilla's calendar program, and a localised version of Accentuate.us which automatically inserts grave accents. And then there's the re-release of the Opera browser at the end of 2010 (the project had fallen dormant and way behind until I took it over in 2010). The phpBB forum interface has also been translated by a friend and me, and is used in several places now.

There are three main dictionaries online now, plus a few smaller ones. One is essentially a big wordlist (the Stòr-dàta), the other a digitised version of the nearest equivalent Gaelic has to the OED (Dwelly's dictionary) and the third is a merger of Dwelly's and more modern material (called Am Faclair Beag 'the small dictionary'). Dwelly-d and Am Faclair Beag were developed between me and another friend who's a software developer in our free time (at least that's what other people call it).

There's a Gaelic Wikipedia (the Uicipeid) which isn't doing too badly considering the number of speakers but we could do with more active editors, especially fluent ones. I hear the Welsh are thinking of giving retired Welsh teachers some training in how to edit Wikipedia to add more content, which might be a way forward for us too.

Beyond that, there's not much else but overall, the Open Source movement has been a great opportunity so far for Gaelic and will continue to benefit the language. I just wish I could clone myself!

KPS: Many speakers of indigenous and minority languages are reluctant to use their languages online, for many different reasons (orthography, terminology, etc.)  How do speakers of your language feel about using the language online?

MB: Depends on whether we're talking just casually using the language or using localised software. In terms of casual use, terminology on the whole is not a massive issue, both in speaking and writing the language people code-switch a lot and I've rarely come across complications when using an English term in a Gaelic phrase. The only way that usually happens is when you get a learner who hasn't yet developed enough sensitivity to adjust the number of newly coined words they use depending on their audience. That can be a bit of an issue.

Gaelic-medium school in Glasgow, Scotland
Literacy is an issue for many older speakers that still needs addressing, sadly, which unfortunately reduces the number of potential users of the language online overall.

Translation of software on the other hand is an interesting challenge. There's not much that you cannot translate into Gaelic but the challenge is translating it in such a way that a non-technical user of the language can find their way around without having to resort to the dictionary all the time which tends to turn people off. But it can be done with a bit of forethought and a healthy approach to using loanwords. For example, when I was translating Firefox, we had to tackle the term 'export', quite a good example of subtle language engineering. There are several terms in dictionaries for the verb 'export' but they all try to carry the meaning by using native roots, for example 'às-mhalairt' – literally 'out-trade'. That sort of word sometimes works but in this instance it leaves most native speakers confused. So after some debate we settled on a new term, 'às-phortaich' or 'out-port' because it gives non-technical users more clues as to the meaning and that seems to have worked very well. The other aspect of this involves a bit of best practice in translation – when you get volunteers who translate software they often stick too close to the original language which results in really bad translations which put off end-users but if you get it right, it makes the localised versions much more readily acceptable to everyone, including non-technical native speakers.

The writing system is not too much of an issue – there is a grave accent (and an acute if you follow the traditional spelling) but casually, you can understand written Gaelic even without the accents. The one thing that causes minor headaches is the Gaelic ampersand – Gaelic doesn't use '&' but instead the so-called Tironian Ampersand ⁊. And most of you are probably seeing a square box now. QED. On the bright side, the mathematical operator ┐ looks just like it and bizarrely, displays widely so I tend to use that. I'm quite pleased that Mozilla and Opera have it. It may sound like a so-what issue but even Gaels are generally so ignorant about the period where Ireland and Scotland were at the forefront of scholarship in Europe, writing in their own languages and with their own scribal tradition, that it's an important little landmark for the language to have the ┐.
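
The substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name `et_fallback` is hypothetical, and the sketch assumes the two characters in question are U+204A (the Tironian sign) and U+2510 (the corner-shaped look-alike).

```python
# Assumed code points; adjust if your text uses different characters.
TIRONIAN_ET = "\u204a"  # the Tironian sign used for "and"
LOOKALIKE = "\u2510"    # corner-shaped character that displays more widely


def et_fallback(text: str) -> str:
    """Swap the Tironian sign for the look-alike, for fonts that lack it."""
    return text.replace(TIRONIAN_ET, LOOKALIKE)


# Text without the Tironian sign passes through unchanged.
print(et_fallback("A \u204a B"))
```

A swap like this only changes how the text looks. Searching or spell-checking the result would need the reverse mapping.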

KPS: How is computing terminology for Gaelic developed?  Is there a "language board" or are terms developed naturally by the community?  If there are official terms, how are they communicated to the community?

MB: Yeees... good question. It's very haphazard. There's no official body that oversees terminology development so there are the usual gaps and the problem of too many terms for the same thing or indeed some terms getting overused.

Interestingly, being locale leader on Mozilla, Opera, phpBB, Google and other projects has allowed me to standardize at least web-terminology across most of the software on offer. For example, there are about 4 words each for copy, browser and import but virtually all now use the same terms. A very small number of people is upset about some of the choices but on the whole, people are glad that software is beginning to speak 'the same language'.

KPS: Are there other special challenges your community faces in terms of developing technology for the language and/or communicating online? For example, differences in dialects, different spelling systems, problem with fonts, lack of computing expertise in the community, lack of interest from software vendors like Microsoft/Apple/Google?

MB: All of the above? No, it's not that bad. Funnily enough, access to fast internet is what I'd put at the top of the list. The web is an increasing source of Gaelic stuff, from TV to radio and news, software, the web, access to services and so on. But access is not always straightforward, especially bearing in mind the geography of the West Coast with its many inhabited islands. So funnily enough, in this regard I'd ask Santa for superfast broadband in those remote Gaelic-speaking communities.

There are dialect issues but they're not insurmountable, fortunately the writing system is native and very old so usually a single spelling can accommodate a vast variety of pronunciations. For example the word 'bainne' (milk) has about a dozen pronunciations but fortunately they can all be derived from the same spelling.

With less than 60,000 speakers, interest from Apple & Co is as you'd expect. Low. Elsewhere, it's not quite that bad. The expertise is there but getting people to commit time is much more of a challenge. That and, which is not so much the fault of the community, the fact that even for the Open Source movement, localisation seems to be a bit of an afterthought. You could argue that the option was always there, which is true, but if you look at the processes within each project and across projects, they are nothing but arcane to your average educated user of any language.

It's a "you'd-think" thing – you'd think that there would be a central pool of translations for all Open Source projects (they're Open Source after all...), pooling all of OpenOffice, Mozilla, Linux, WikiTranslate and so on in one place, with each project able to draw upon this pool. Real-time would be nice but even a manual or nightly update would be great. Instead, I don't know how many times I have translated the word “Edit” or “Close”, “Save as” and so on. On their own, it doesn't seem like much but it adds up. And you have to remember that the ratio of speakers to localisers is big, scarily big. Gaelic has some 60,000 speakers and at the most, 2 1/2 people whom I would regard as being “regularly active” in unpaid localisation of Open Source software. Irish has somewhere between 80,000 and 100,000 highly fluent speakers and at a guesstimate, I'd say maybe half a dozen active people. If you look, for example, at the Mozilla localisation dashboard, you'll find that even large languages like Bengali, Hebrew or Indonesian are struggling to stay up to date. So there's something in the localisation process that's not working as well as it could. Or should.
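
To make the "central pool" idea concrete, here is a minimal sketch of a shared translation memory in Python. It is not a description of any existing tool: the class `SharedTranslationMemory` and its methods are hypothetical, and the only real data in it is the Gaelic term for "export" mentioned earlier in the interview.

```python
from collections import Counter, defaultdict


class SharedTranslationMemory:
    """Hypothetical pool of translations that several projects feed and query."""

    def __init__(self):
        # (locale, source string) -> how often each translation was contributed
        self._pool = defaultdict(Counter)

    def add(self, locale, source, translation):
        """Record one contribution of `translation` for `source` in `locale`."""
        self._pool[(locale, source)][translation] += 1

    def suggest(self, locale, source):
        """Return known translations, most frequently contributed first."""
        counts = self._pool.get((locale, source))
        if not counts:
            return []
        return [translation for translation, _ in counts.most_common()]


tm = SharedTranslationMemory()
tm.add("gd", "Export", "às-phortaich")  # contributed by one project
tm.add("gd", "Export", "às-phortaich")  # the same choice from a second project
print(tm.suggest("gd", "Export"))
print(tm.suggest("gd", "Close"))  # nothing pooled yet, so no suggestions
```

With a pool like this, a localiser starting a new project would see existing choices for strings such as "Edit" or "Save as" instead of translating them again from scratch.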

From a personal angle too, I would have translated Firefox a long time ago but being a good translator is quite obviously not good enough – don't get me wrong, the Mozilla team are great people and very supportive but it's still a big challenge to understand a lot of what you have to do. I wouldn't recommend it without the help of someone who can speak code. That's really something that the Open Source community needs to improve.

Related to that are the more general problems of translation and localisation – programmers in general are very keen to rush off and program some neat bit of code that will calculate your tax, make a roach dance rumba across your screen and remind you to eat and drink once in a while but they rarely seem to consider cross-linguistic issues. They write their code and then downstream, some poor translator is going insane because they chopped up sentences in a way that's ok in English but not any other language, or they go placeholder happy. It's getting better but there's still a lot that needs improving. Plurals for example are getting quite good these days. English has 1 file and 2+ files. So you'd often get things like “You are about to delete %s file” and “You are about to delete %s files” to translate. Let's just say that this is a pattern few languages follow... Today on most localisation projects you can specify which numbers go with which plurals, which is good. But there's still a lot of weird language appearing on screens because such issues are rarely thought about. For example, in English you can use a sentence like “Give me results in” and then just have a dropdown of language names. But in a lot of languages the preposition “in” plus a language name results in a variety of different outcomes. For example “in English” is “sa Bheurla” in Gaelic, but “in Japanese” is “san t-Seapanais”. Or worse, there are languages which don't do prepositions. In Basque for example you have to use an instrumental suffix, resulting in “Ingelesez” and “Japonieraz”. And usually, you don't have the option of having two lists of languages. So you end up with strange syntax and strange idiom, which isn't that great for the user.
Language revitalization from below!
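
The plural and "results in" problems described above can be illustrated with a short Python sketch. All function and variable names here are hypothetical. The Gaelic number groupings are assumed to match the CLDR plural rules for Scottish Gaelic (1 and 11; 2 and 12; 3 to 10 and 13 to 19; everything else), so check them against the current CLDR data before relying on them. The Gaelic and Basque phrases are the ones quoted in the interview.

```python
def plural_category_en(n: int) -> str:
    """English: one form for exactly 1, another for everything else."""
    return "one" if n == 1 else "other"


def plural_category_gd(n: int) -> str:
    """Scottish Gaelic: four groups (assumed to follow CLDR, see lead-in)."""
    if n in (1, 11):
        return "one"
    if n in (2, 12):
        return "two"
    if 3 <= n <= 10 or 13 <= n <= 19:
        return "few"
    return "other"


# One complete sentence per plural category, never a sentence glued together.
EN_DELETE = {
    "one": "You are about to delete {n} file",
    "other": "You are about to delete {n} files",
}


def format_plural(forms: dict, category_fn, n: int) -> str:
    """Pick the sentence for n's plural category and fill in the number."""
    return forms[category_fn(n)].format(n=n)


# Store the whole phrase per interface locale and target language,
# instead of concatenating a fixed "in" with a bare language name.
RESULTS_IN = {
    "en": {"en": "in English", "ja": "in Japanese"},
    "gd": {"en": "sa Bheurla", "ja": "san t-Seapanais"},
    "eu": {"en": "Ingelesez", "ja": "Japonieraz"},
}


def results_in_label(ui_locale: str, target: str) -> str:
    """Return the full 'in <language>' phrase for a dropdown entry."""
    return RESULTS_IN[ui_locale][target]


print(format_plural(EN_DELETE, plural_category_en, 2))
print(plural_category_gd(12), plural_category_gd(15))
print(results_in_label("gd", "ja"))
```

The design point in both halves is the same: the translator supplies whole phrases keyed by category or by option, so the code never has to assume English word order or English grammar.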

KPS: Are young people using the language online?  Do you think social media sites like Facebook and Twitter are helping encourage language use by younger speakers?

MB: Good question. If people between 20-35 are young, then yes. As for those under that age bracket, I think they are but I'm not sure, I'm not really connected to any really young people or following any.

But speaking of younger people, there's another project which isn't particularly technical but nonetheless exciting.  It's called the Sgoil-Choimhearsnachd or 'Community School'. The underlying issue we're trying to address is the fact that in a place like Glasgow, even though there are more than 10,000 speakers, they're hard to find. Also, perhaps only some 200 or so regularly show up at Gaelic cultural events – which means we're losing a lot of opportunities for interaction in Gaelic. Beyond that, there's not much on offer for adults as everything is centred on kids. Which is jolly good for the kids but what will they do after leaving school? Not to mention the age makeup of those 10,000 speakers...

Working on the bold assumption that as important as traditional stuff is, it's not everyone's cup of tea. Not every American shoots moose and not every Welshman owns a harp. So it's unreasonable to assume that every Gaelic-speaker likes waulking songs or indeed should like them.

So we ran a pilot where we got members of the community who were willing to pass on skills they have to run a 6 week pilot teaching 2 hours a week, offering an art course, Esperanto, creative writing, Tae Kwon Do, Jazz Dancing and Chinese arts and crafts – all taught through the medium of Gaelic. We had some problems with advertising and attendance but we hope to improve that next time round because the feedback was great, people really enjoyed doing something totally different where Gaelic wasn't the target but just a means of interacting. People have to pay a contribution which pays the tutor and the rooms and so on but split between several people, that's not a lot. We also don't require the tutors to have teaching qualifications or suchlike or indeed offer certificates – for the most part, people are just interested in the subjects. It's a very simple model but we have great hopes for it and I think it could be easily applied to other communities.

KPS: Tell me about the third picture above, of the "No Fouling" sign!

MB: It's a picture I took on Skye. It's the other side of the coin, in a way the more precious one and the one that is much harder to achieve. It's just a cheap sign, a piece of wood on a stake with a laminated page someone ran off the printer. But it stands for someone locally who decided to put their own language on the sign as well. No application for funding, no big fuss but a small bit of linguistic landscape that says “we're here, we're using this language”.


KPS: What is your vision for your language in ten years, both in general terms and in terms of software/online use?

MB: In terms of technology, I think we can look forward to a few more programs and applications in the language, with Open Source playing an increasing role. In particular, I'd like to see smaller languages exploit the games niche more, perhaps even on a cross-national collaborative basis. If games can teach German speaking children English without a teacher, then that's something we cannot afford NOT to use. In a way I'm quite proud that 2011 is the year that saw the release of the first Scots Gaelic computer game (Freeciv – the open source development of what many of you over 30 will remember as Civilisation II). It was fun doing the translation, actually, so much better scratching your head over how to say “The Basque catapult has been destroyed by the Babylonian horsemen” than some policy document. But I also can't believe it's 2011... we're really missing a trick here, given how much grief my mum used to give me over playing this game.
Screenshot of Freeciv in Scottish Gaelic

I'd also like to see speech technology advance, in particular to support speakers with shaky literacy. One item on my personal wishlist which probably won't happen is one I mentioned earlier – a shared online repository for all these localisation projects, linked into a better online translation memory. Sort of a mega-Pootle with live suggestions bringing together Mozilla, OpenOffice, Linux and the rest. I waste a lot of time re-translating the same strings.

Overall... I'd like to see GME become compulsory in those areas that still have a strong Gaelic-speaking element, which also entails more teachers being trained. Less money on white, flashy elephants and more for someone to grapple with the thorny issue of seriously increasing language use in the Gaelic-speaking communities before we lose them. And nationally, better education about the historical role of Gaelic in Scottish history. Some of the views people have about Gaelic are about as down-to-earth as the birther debate in the United States.