Creating multilingual wikis and wiki engines

Introduction

In this text, the term wiki refers to any wiki web, not to the software used to implement such a web; the software is referred to as a wiki engine.

What is a wiki, and why would we want it to be multilingual? A given mailing list normally works in a given language. Why would a wiki be different?

The answer seems to be that as a community grows, it spawns sub-communities that may work in different languages. In such cases, new mailing lists are created, for example. But for a wiki, things are more complicated, because unlike mailing list messages, which are transient, wikis contain permanent articles, and there may be a strong correspondence between articles in different languages. Thus, when a wiki community spawns a sub-community, it is normally insufficient to merely create a new, independent wiki. Instead, a certain amount of interlinking must happen, and it must be supported by the wiki engine.

However, if multilinguality is the result of the creation of sub-communities, it is important to recognize that these sub-communities will generally work independently. Attempting to impose rules that blur language barriers, force translations to exist, or ensure a one-to-one correspondence between articles in different languages may be counter-productive. Multilingual users are the glue that joins the sub-communities together in one large loose community, and if it weren't for them, we wouldn't have anything to discuss now. Thus, we must care about the needs of multilingual users, and make sure they can navigate and easily jump from an article to its alternative language version, but also recognize that a large part of the work occurs independently in the language sub-communities.

Principles

Alternative versions don't contain exactly the same content

If page B is the French version of the English page A, do both pages contain different language versions of exactly the same content? This is only true if one of them is a translation of the current version of the other. But B may be a translation of an old version of A, or A and B may have been written independently: for example, the author of B may have decided to rewrite from scratch and only consulted A, or may not have known that A existed before writing B.

It seems that pages will have the same content only if there exists a formal policy that dictates it to be so. For example, a company can have a multilingual web site, and a policy that a page be translated into all designated languages before being made public. However, such policies don't scale, and they cause publishing delays which are considered more harmful than a lack of parity between languages. As a result, most companies seem to abandon such strict policies and only try, but not too hard, to make content available in all languages of their web site.

Thus, identical content is the norm, at least in theory, only in such cases as multilingual law, such as Swiss law or EU law. Laws are decisions with a timestamp that are finally approved and enforced only after they have been translated into all designated languages. This need for accurate translation is part of the formality that makes law making extremely slow and hardly comparable to a wiki. Incidentally, despite this slowness, quality of translation is a major problem in EU law making.

Bijection, commutativity and transitivity

If page F is the French version of the English page E, is E the English version of F? If F is the French of E, and G is the German of F, is G the German of E? Is it possible for two different French pages to map to the same English page? Does one page have only one alternative version in a given language? In other words, are language mappings bijective, commutative, and transitive?

At first sight these properties seem to hold; but there are exceptions. A hypothetical example is that in Eskimo there may be many different words for the English word snow, each one having a slight difference in meaning, which might be important for Eskimo culture. It can be argued that if there were many different Eskimo snow articles, there could be a separate English version for each one, whose title or WikiName could be a periphrasis. However, what is important here is that it is no longer obvious that bijection holds. Another example is when English users decide that page A (whose French version is B) is too long, and split it into two different articles, A1 and A2.

A real demonstration of the problem is the list of Star Trek races in Wikipedia. The list contains short comments on each race, but for a number of races it only refers the reader to a separate article about the race, such as Borg or Vulcan. The German Wikipedians, however, decided that a single article is better; they have no separate article for Borgs or Vulcans, every race being treated in the single Völker im Star-Trek-Universum article. The English list of races and the German Völker link to each other, but what about the English articles Borg and Vulcan? The community is divided as to whether they should link to Völker or not link at all, and the winning opinion is to link. Thus, the English Borg links to the Borg section in Völker im Star-Trek-Universum; no reciprocal link exists, of course.

Another example is the English article on Degree (angle), which links to the French Degré, which links to the English Degree (disambiguation). The French Degré (homonymie) also links to Degree (disambiguation), which, surprisingly, does not link to homonymie but to Degré. Although it looks like an error, a discussion among the editors shows that there are some disagreements.
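
The Degré example can be checked mechanically. The following is a minimal sketch, assuming a hypothetical model in which each page is a (language, title) pair and `links` maps a page to the pages it names as its other-language versions; the function names are ours, and the link data merely restates the paragraph above.

```python
# Hypothetical model: a page is a (language, title) pair, and `links`
# maps each page to the pages it names as its versions in other languages.
links = {
    ("en", "Degree (angle)"): {("fr", "Degré")},
    ("fr", "Degré"): {("en", "Degree (disambiguation)")},
    ("fr", "Degré (homonymie)"): {("en", "Degree (disambiguation)")},
    ("en", "Degree (disambiguation)"): {("fr", "Degré")},
}


def one_way_links(links):
    """Return sorted (source, target) pairs whose target does not link back."""
    return sorted(
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src not in links.get(dst, set())
    )


def shared_targets(links):
    """Return targets that several pages of the same language point to.

    Keys are (target page, source language); values are the competing sources.
    """
    incoming = {}
    for src, targets in links.items():
        for dst in targets:
            incoming.setdefault((dst, src[0]), set()).add(src)
    return {key: srcs for key, srcs in incoming.items() if len(srcs) > 1}


for src, dst in one_way_links(links):
    print("no link back:", src, "->", dst)
for (dst, language), srcs in shared_targets(links).items():
    print(len(srcs), language, "pages claim", dst)
```

On this data the first check reports the links from Degree (angle) and from Degré (homonymie) as unreciprocated, and the second shows that two English pages claim Degré while two French pages claim Degree (disambiguation), so the mapping is neither commutative nor bijective.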

It is possible that if the knowledge collected in a wiki were finite and perfectly organised, language mappings would be commutative, transitive, and bijective; let us call such mappings perfect. This means that, as a wiki evolves, language mappings tend to become perfect (and pages in different languages tend towards the same content). During the process, however, they are not. It is impossible for English page A and French page B to be split into A1 and A2, and B1 and B2, at the same moment; one of them will necessarily be processed some time before the other, and articles will occasionally be messed up, as seems to be the case with Degré above. Since a wiki has no terminal state and is continuously evolving, language mappings will never be perfect. Not that it matters much, but cases similar to the Star Trek example above could exist even in the ideal terminal state, because of cultural differences.

Nevertheless, it is important to note that the vast majority of cases, and by vast majority we mean something like 99% or more, are commutative, transitive, and bijective. It might, therefore, be preferable to impose these properties, on the grounds that doing so is conceptually simpler. In fact, it appears that both Völker im Star-Trek-Universum and Degré have caused discussion and disagreement, and it might be that everyone would be better off if users had had no option but the simplest and most intuitive one.

There is an additional problem. With nonperfect interlanguage links like Wikipedia's, each language community is free to do what it wants with the interlanguage links from its language to other languages. Perfect links, instead, force a certain degree of intercommunity co-operation: if, in an ambiguous case, a French user decides that page E1 and not E2 is the English equivalent of the French F, he imposes this opinion on English users as well. However, whenever such disagreements exist, all disagreeing users will be multilingual, which means that they should be able to work out a solution together.

Different name spaces

Whereas it is clear that "Battle of Normandy" is English and "Bataille de Normandie" is French, many articles have the same title in many languages; examples include "Ella Fitzgerald", "Smalltalk", and generally titles that are names. As a result, different languages must be assigned different name spaces. Name space here is meant in a general sense, not in the sense of a MediaWiki namespace.

One way of implementing language name spaces is with MoinMoin categories, where all English pages would begin with "En/". An alternative that has been proposed and used is to put the language in the name of the page, such as BattleOfNormandyEn; this results in ugly names littered with meta-information. Another alternative, used by MediaWiki, is to use one wiki per language.
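
The prefix approach can be illustrated with a minimal sketch. It assumes a hypothetical set of language prefixes and a hypothetical in-memory page store; it is not MoinMoin code, only a demonstration that titles that are names can coexist once the language is part of the key.

```python
LANGUAGES = {"En", "Fr", "De"}  # assumed language prefixes, for illustration only

pages = {}  # hypothetical page store: (language, title) -> text


def split_page_name(full_name):
    """Split 'En/Ella Fitzgerald' into ('En', 'Ella Fitzgerald')."""
    prefix, sep, title = full_name.partition("/")
    if sep and prefix in LANGUAGES and title:
        return prefix, title
    raise ValueError(f"no known language prefix in {full_name!r}")


def store(full_name, text):
    """Save a page under its language name space."""
    pages[split_page_name(full_name)] = text


# The same title lives in two name spaces without clashing.
store("En/Ella Fitzgerald", "American jazz singer ...")
store("Fr/Ella Fitzgerald", "Chanteuse de jazz américaine ...")
print(sorted(pages))
```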

CamelCase

Historically, wikis have been using CamelCase for hyperlinks. This practice, however, causes several problems:

  • CamelCased terms are recognized by search engines as single words, thus ranking pages incorrectly.
  • CamelCase reduces link readability.
  • In several languages, such as Japanese, Chinese, Hebrew, and Arabic, CamelCase is not possible.
  • Valid CamelCase words have to be escaped in the wiki source.
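
The last two problems can be seen in a minimal sketch. The pattern below is a simple WikiWord detector of our own, assumed for illustration; real wiki engines use their own, often more elaborate, rules.

```python
import re

# Assumed WikiWord pattern: two or more capitalised ASCII runs in a row.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


def find_wiki_words(text):
    """Return the words that would be turned into links automatically."""
    return WIKI_WORD.findall(text)


# An ordinary word such as JavaScript is linked unless the author escapes it.
print(find_wiki_words("See BattleOfNormandy and JavaScript"))

# A Japanese title has no letter case, so it can never form a WikiWord.
print(find_wiki_words("ノルマンディー上陸作戦"))
```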

People with a background in programming languages may find the use of CamelCase natural, and they might even prefer it for the same reason for which CamelCase is often used in naming conventions when programming: it indicates that the entity belongs to a different class. In wikis, however, this is much less important than in programming, as web browsers render hyperlinks in a different color. As wikis become more available to nontechnical users, CamelCase becomes less appealing. Sometimes organisations use wikis as their formal web sites, where unregistered users only have permission to view, whereas logged-in users can modify; in such wikis, CamelCase is clearly undesirable.

If CamelCase is harder for readers, it has some advantages in writing:

  • Wikisource can be more readable and closer to the processed result with CamelCase than with markup.
  • CamelCase can be faster to type than markup.

The second advantage is, however, disputed. It generally seems that the disadvantages of CamelCase outweigh the advantages, and there is a tendency to stop using it. MediaWiki has dropped CamelCase support altogether.

In multilingual wikis, there isn't much to dispute about CamelCase. If the wiki is really expected to be used only in languages that distinguish between lower and upper case letters, CamelCase can be used. In other cases, it is better discouraged, for the benefit of uniform hyperlinking habits across languages.

See also: http://c2.com/cgi/wiki?WikiWordsConsideredHarmful

Manual language selection

It is occasionally proposed that the wiki server look at the Accept-Language HTTP header, or at user preferences stored in cookies, and automatically serve the preferred language version of the requested page. Such automation is unwelcome. First, users expect a given URL to point to a page with given content; cookies may affect the skin, or other details, but not the main content of the page. For this reason, the Accept-Language header should only ever be used to redirect the top-level page of a site, such as http://www.foobar.com/, to the top-level page in a specific language, such as http://www.foobar.com/en/.
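
The one acceptable use of Accept-Language, redirecting the top-level page only, can be sketched as follows. This is a simplified illustration: the supported languages, the default, and the function names are assumptions, and the header parsing covers only the common "tag;q=value" form.

```python
SUPPORTED = ("en", "fr", "de")  # assumed languages of the hypothetical site
DEFAULT = "en"                  # assumed fallback language


def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, best first."""
    weighted = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        weighted.append((quality, tag))
    # The sort is stable, so tags of equal quality keep their header order.
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for quality, tag in weighted if quality > 0]


def redirect_target(path, accept_language):
    """Return a redirect path for the top-level page only, else None."""
    if path != "/":
        return None  # every other URL always serves the same content
    for tag in parse_accept_language(accept_language):
        primary = tag.split("-")[0]
        if primary in SUPPORTED:
            return f"/{primary}/"
    return f"/{DEFAULT}/"


print(redirect_target("/", "fr-CH, fr;q=0.9, en;q=0.8"))
print(redirect_target("/en/Battle_of_Normandy", "fr"))
```

The second call returns None: a URL that already names a page is never rewritten, whatever the browser prefers.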

Second, users will be confused by such automation. Multilingual users cannot be expected to change their user preferences each time they want to view a page in another language. Even if a means of manually selecting a language is provided but wiki links remain generic, users will want either to view RecentChanges in a single given language each time, or to view the combined changes of a given set of languages. Besides RecentChanges, there may be other indexes, such as a site map, or a text search. If it is not possible to provide an option, or until such functionality is developed, providing single-language indexes seems preferable to providing indexes of all languages combined, given the general independence of language sub-communities.

Existing implementations

MediaWiki

The most prominent multilingual wiki on the web is Wikipedia, whose wiki engine is MediaWiki. Multilinguality is achieved by assigning one wiki per language. The English Wikipedia is almost independent of the French Wikipedia, the two being connected only by manually specified links to alternative languages. The article on the Battle of Normandy contains the markup [[fr:Bataille de Normandie]] in the wikisource. This is not rendered in the article body; it is only processed by the skin, which adds a link to the French version of that page, Bataille de Normandie. The French version, accordingly, contains [[en:Battle of Normandy]].
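
The handling of this markup can be approximated with a minimal sketch. It is not MediaWiki's parser, which is considerably more involved; the set of language codes and the function name are assumptions made for illustration.

```python
import re

# Assumed set of language prefixes; a real wiki would take it from configuration.
LANGUAGE_CODES = {"en", "fr", "de"}
LINK = re.compile(r"\[\[([a-z-]+):([^\]|]+)\]\]")


def split_interlanguage_links(wikisource):
    """Return (body without interlanguage links, {language: title})."""
    found = {}

    def collect(match):
        code, title = match.group(1), match.group(2).strip()
        if code not in LANGUAGE_CODES:
            return match.group(0)  # not a language prefix: leave the link alone
        found[code] = title
        return ""  # interlanguage links are not rendered in the body

    body = LINK.sub(collect, wikisource)
    return body.strip(), found


source = "Text about the battle.\n\n[[fr:Bataille de Normandie]]\n"
body, languages = split_interlanguage_links(source)
print(body)       # what would be rendered as the article
print(languages)  # what the skin would turn into language links
```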

This system has many advantages. It is simple to develop, easy for users to understand, free of the questionable assumptions of bijection, transitivity and commutativity, and provides manual selection of language through clearly defined namespaces.

MediaWiki's main disadvantage is that managing Wikipedia's multilingual links manually can be very tedious and error-prone. When a new language version is added, not only must links to all existing language versions be included in the new version, but links to the new version must also be added to all existing versions. In articles that exist in more than 50 languages, such as Water or the article on Wikipedia itself, this can be extremely hard. As a workaround, Wikipedia has bots, like the German ZwoBot, that periodically visit all articles and fix the links. In fact, there is a separate bot for each language; ZwoBot, for example, only fixes German pages: it takes a German page, follows (recursively) all interlanguage links from it, and then fixes only the links of the initial German page (it normally makes the unambiguous corrections and notifies the operator about the ambiguous ones). Obviously this causes much more traffic than if one bot fixed the pages of all languages at the same time, but it appears that a more conservative approach has been taken, of letting each language community decide independently how it wants its bot to operate on the links of the articles of that language.
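
The behaviour described above can be sketched roughly as follows. This is not ZwoBot's code; it is a minimal illustration under the assumption that links are available as a dictionary from (language, title) pairs to lists of such pairs, and the sample data is made up.

```python
from collections import deque


def collect_versions(start, links):
    """Follow interlanguage links recursively from `start`.

    Returns {language: set of titles} for every page reachable from `start`.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    versions = {}
    for language, title in seen:
        versions.setdefault(language, set()).add(title)
    return versions


def propose_links(start, links):
    """Split the reachable languages into unambiguous and ambiguous ones."""
    fixes, ambiguous = {}, {}
    for language, titles in collect_versions(start, links).items():
        if language == start[0]:
            continue  # a page does not link to its own language
        if len(titles) == 1:
            fixes[language] = next(iter(titles))  # safe to apply automatically
        else:
            ambiguous[language] = sorted(titles)  # left for the operator
    return fixes, ambiguous


# Made-up link data, for illustration only.
sample = {
    ("de", "Wasser"): [("en", "Water")],
    ("en", "Water"): [("fr", "Eau"), ("de", "Wasser")],
    ("fr", "Eau"): [("en", "Water"), ("en", "Water (molecule)")],
}
fixes, ambiguous = propose_links(("de", "Wasser"), sample)
print("apply:", fixes)
print("ask operator:", ambiguous)
```

Here the German page gains a French link it did not have, while the two English candidates are reported rather than resolved automatically.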

Some further disadvantages of MediaWiki follow from the independence of the wikis for different languages. The most inconvenient for multilingual users is that different user accounts are required for different languages; another problem is that the validity of the multilingual links is not checked, as they are actually external links, created from the information found in the configuration.

Other implementations

I haven't seen any other successful implementation, so this section is yours to expand.

Recommendations

Either a single wiki for all languages can be used, or a wiki farm (a set of wikis operated by the same wiki engine installation). With a farm, it is more difficult to provide one account and combined language indexes. With a single wiki, it is more difficult to provide separate language indexes. Since separate indexes take priority over combined indexes, starting the implementation with a farm seems preferable and easier. Having to configure many wikis is more an advantage than a disadvantage, as the sub-communities may want different logos, different default skins, and generally different settings. In addition, having many wikis served by the same account server, and creating meta-indexes to show combined recent changes or search all wikis, seem to be cleaner development solutions than trying to hack functionality into one wiki.
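
A meta-index over a farm can be quite small. The sketch below assumes, hypothetically, that each wiki in the farm can supply its recent changes as a newest-first list of (timestamp, title) pairs; the data and names are invented for illustration.

```python
import heapq
from itertools import islice


def combined_recent_changes(changes_by_language, languages, limit=50):
    """Merge the newest-first change lists of the chosen languages."""
    streams = [
        [(timestamp, language, title)
         for timestamp, title in changes_by_language[language]]
        for language in languages
    ]
    # Each stream is already sorted newest first, so a lazy merge suffices.
    merged = heapq.merge(*streams, key=lambda change: change[0], reverse=True)
    return list(islice(merged, limit))


# Invented sample data; ISO timestamps sort correctly as plain strings.
changes = {
    "en": [("2005-03-02T10:00", "Water"), ("2005-03-01T09:00", "Borg")],
    "fr": [("2005-03-02T08:00", "Eau")],
    "de": [("2005-03-03T00:00", "Wasser")],
}
print(combined_recent_changes(changes, ["en", "fr"]))
print(combined_recent_changes(changes, ["en"]))  # the single-language index
```

The single-language index is simply the special case of one selected language, which is why building separate indexes first and the combined view on top of them is the easier order.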

The main problem, then, is to decide whether to force bijection, commutativity and transitivity. The problem has been analyzed in detail above, and it is hard to make a choice. Not forcing these properties allows an easier implementation like MediaWiki's, but leaves inconsistencies which may need bots to fix. Forcing them is harder, as it seems to require a subsystem independent of all languages to keep the language correspondence of the articles, but may be cleaner in the end.
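
One possible shape for such a language-independent subsystem is sketched below. It is an assumption of ours, not an existing implementation: every page belongs to exactly one concept, and a concept holds at most one page per language, so the mapping is perfect by construction.

```python
class LanguageRegistry:
    """Hypothetical store in which a concept groups at most one page per language."""

    def __init__(self):
        self._concept_of = {}  # (language, title) -> concept id
        self._pages = {}       # concept id -> {language: title}
        self._next_id = 1

    def add_page(self, language, title, version_of=None):
        """Register a page, optionally as another version of `version_of`."""
        page = (language, title)
        if page in self._concept_of:
            raise ValueError(f"{page} is already registered")
        if version_of is None:
            concept = self._next_id
            self._next_id += 1
            self._pages[concept] = {}
        else:
            concept = self._concept_of[version_of]
            if language in self._pages[concept]:
                raise ValueError(f"this concept already has a {language} page")
        self._pages[concept][language] = title
        self._concept_of[page] = concept

    def versions(self, language, title):
        """Return {language: title} for the other versions of a page."""
        concept = self._concept_of[(language, title)]
        return {lang: t for lang, t in self._pages[concept].items()
                if lang != language}


registry = LanguageRegistry()
registry.add_page("en", "Battle of Normandy")
registry.add_page("fr", "Bataille de Normandie",
                  version_of=("en", "Battle of Normandy"))
print(registry.versions("fr", "Bataille de Normandie"))
print(registry.versions("en", "Battle of Normandy"))
```

Because links are derived from concept membership rather than stored per page, the reverse link exists as soon as the forward one does, and a second French page for the same concept is rejected rather than silently accepted.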

I think I'd go for bijection, commutativity and transitivity.

See also

MultilingualWiki in Meatball wiki is a collection of ideas from which this document has wildly and shamelessly stolen.

The discussion on Wikipedia's interlanguage links presents a number of problems which developers should read before attempting to implement multilinguality in a wiki engine.

Multilingual communication, an article in Wikimedia Metawiki, is an idea with which we obviously disagree, because it violates all assumptions presented above, but is linked to from here for completeness.

The discussion on ZwoBot, which took place during the revision of this article, contains some more links.

Meta

Thanks to MoinMoin developers Thomas Waldmann, Alexander Schremmer and Heather Stern for discussing the subject with me, and to German Wikipedian "Head" for answering my questions on ZwoBot.

Copyright (C) 2005 Antonios Christofides

Permission is granted to copy, distribute and/or modify this document under the terms of either: