How to create a language in one day

Purpose: In this article I present an easy, fast and fun method to create the illusion of a real language and produce material that can be used for a variety of purposes.

About a year ago I worked on a very interesting project which involved creating a unique world with all its history, people, physics, metaphysics and so forth. I like fictional worlds that are thoroughly created and I have always marveled at people like Tolkien or Richard Garriott who go to such great lengths and even create languages for their worlds. Ever since I was young I have thought that it would be cool to one day create my own language.

When I started studying linguistics and computational linguistics many years ago I learned a lot about the behavior of language. I became better acquainted with the world of languages and learned what I needed to cover to construct a language of my own, and roughly at which end I should start. I also realized the daunting scope of such a project.

However, a year ago I was thinking about the game world we were creating and I briefly returned to the idea of creating a language. I thought about it and wondered if I couldn’t be much more efficient. I mean, I wouldn’t want to spend a couple of months on a language that would just be a minor background element in this fictional world. It would add some depth to the world, but few would probably fully appreciate a properly constructed language.

One evening I began to do some basic research, looking for ways to cheat and sidestep what would ordinarily be required in the process of creating a language. I figured that for my specific purpose I could fake quite a lot. This led to some quick tests, and after spending another evening I was done with my language. I had created a fictional language in (less than) one day.

Linear B

First, I wanted a language that felt real. It should reek of history. In the end I turned to Linear B and figured I could use it. (Of course I could have drawn my own set of symbols and worked out their pronunciation, but this time I decided to go with Linear B as it is.)

[Figure: chart of the Linear B syllabic symbols]

This is not the whole Linear B writing system. There is a set of logograms and special characters in the system as well, but I decided to ignore them and just go with the symbols you see above.

One interesting aspect of this part of the Linear B system is that each symbol corresponds to a syllable. This is quite different from our Latin alphabet. Whereas Linear B uses one symbol to denote the syllable “wo”, we would in English write it with two symbols: ‘w’ and ‘o’.
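
To make the syllabary idea concrete, here is a small sketch that reads the Linear B syllables straight out of Python's Unicode database. It assumes Python 3 and that the interpreter's Unicode data covers the Linear B Syllabary block (U+10000 to U+1007F); the helper name linear_b_syllables is made up for this illustration.

```python
import unicodedata

def linear_b_syllables():
    """Map each Latin transliteration (e.g. 'WO') to its single Linear B symbol."""
    table = {}
    # The Linear B Syllabary block spans U+10000..U+1007F; some code points are unassigned.
    for cp in range(0x10000, 0x10080):
        name = unicodedata.name(chr(cp), '')
        if name.startswith('LINEAR B SYLLABLE'):
            # The last token of the character name is the transliterated sound.
            table[name.split()[-1]] = chr(cp)
    return table

syllables = linear_b_syllables()
# English spells "wo" with two letters; Linear B uses one symbol for the whole syllable.
print(len('wo'), len(syllables['WO']))
```

The dictionary only covers the syllabic signs; the logograms and special characters mentioned above live in other Unicode blocks and are ignored here as well.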

Translating syllables

Now, what would happen if I could just somehow translate English syllables into Linear B ones? After some more digging I found a list of the few hundred most common digraphic (two-character) syllables in English. The 10 most common are:

Syllable  Frequency
TH        3.99%
HE        3.65%
AN        2.17%
ER        2.11%
IN        2.10%
RE        1.64%
ND        1.62%
OU        1.41%
EN        1.37%
ON        1.36%
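
A frequency list like this one can also be approximated from any sample text. The following is a minimal sketch, not the source of the figures above: it counts overlapping two-letter sequences inside words. The function name digraph_frequencies and the sample sentence are assumptions made for illustration.

```python
from collections import Counter
import re

def digraph_frequencies(text, top=10):
    """Return the `top` most common two-letter sequences with their share in percent."""
    counts = Counter()
    # Only look inside words, so digraphs never span spaces or punctuation.
    for word in re.findall(r'[a-z]+', text.lower()):
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    total = sum(counts.values())
    return [(dg.upper(), 100.0 * n / total) for dg, n in counts.most_common(top)]

# Made-up sample text; a real list would need a much larger corpus.
sample = "the other thing then went there and the rest heard another theme"
for digraph, share in digraph_frequencies(sample, top=5):
    print('%s %.2f%%' % (digraph, share))
```

With a short sample the percentages will be far from the published ones; the ranking only settles down on a large body of text.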

That’s well and good. Now, if I could set up a table matching the 60 most common digraphs in English against the 60 Linear B symbols I might get somewhere. Piece of cake! Python (or Ruby or Perl for that matter) to the rescue! These are excellent languages for these kinds of tasks. Here comes the translation table:

translation_table = [
    ('en','a'),  # Digraphs
    # ... more pairs like these ...
]

I can pretty much pair these as I want since Linear B syllables always have a vowel in them. So I won’t end up with long strings of consonants ("jfdksjfdf") however hard I try.
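
Because any pairing works, the table can even be generated rather than typed. The sketch below is one way to do that, assuming a list of digraphs already sorted by frequency. The short lists stand in for the full 60 entries, and the names ranked_digraphs, syllables and build_table are hypothetical. Apart from th, er and en, which follow the pairs shown in this article, the syllable choices are arbitrary placeholders.

```python
# Stand-in data: the real lists would hold 60 entries each.
ranked_digraphs = ['th', 'he', 'an', 'er', 'in', 're', 'nd', 'ou', 'en', 'on']
# Placeholder syllables (only 'o', 'e' and 'a' match pairs shown in the article).
syllables = ['o', 'ka', 'ti', 'e', 'mu', 'so', 'ne', 'pi', 'a', 'ze']

def build_table(digraphs, syllables):
    """Pair each English digraph with one target syllable, in order."""
    if len(digraphs) != len(syllables):
        raise ValueError('need exactly one syllable per digraph')
    return list(zip(digraphs, syllables))

translation_table = build_table(ranked_digraphs, syllables)

# Every target syllable contains a vowel, so any concatenation stays pronounceable.
assert all(any(v in syl for v in 'aeiou') for _, syl in translation_table)
print(translation_table[:3])
```

Shuffling the syllable list before pairing would give a different "language" from the same input, which is a cheap way to produce several variants.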

Ok, we also need translation functions. translateWord() translates single words syllable for syllable and translate() iterates over a whole string (sentence) and translates it word by word:

punctuation = (',', '.', ':', ';', '!', '?')

def translateWord(word):
    def trans(text):
        # Try the table entries in order; the first match wins
        for (ep, lp) in translation_table:
            if text.startswith(ep):
                return (lp, text[len(ep):])
        # didn't find a syllable. chip off one character and move on,
        # keeping the character only if it is punctuation
        if text[0] in punctuation:
            return (text[0], text[1:])
        return ('', text[1:])

    tword = ''
    word = word.lower()
    while word != '':
        (syl, word) = trans(word)
        tword = tword + syl
    return tword

def translate(text):
    return " ".join([translateWord(w) for w in text.split(' ')])

Now we can try to translate sentences:

This is my new language

translates into

oqe qe je teze

This looks promising, but we need to fix one thing. Since there is no corresponding syllable to “my”, the whole word “my” gets consumed. Adding the single vowels (‘a’, ‘o’, ‘u’ etc) to translation_table and having them correspond to Linear B syllables does the trick. They should go after the digraphs, since the first matching entry in the table wins.

Why is this your new language?

now becomes

o qe oqe opi je tezeanesi?
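
The effect of those extra entries is easy to check in isolation. Below is a self-contained sketch with a deliberately tiny, made-up table (not the one used for the examples above): digraphs come first so they win over single letters, and the single vowels act as a fallback. The names base_table, vowel_fallback and translate_word are illustrative only, and punctuation is simply dropped to keep the sketch short.

```python
def translate_word(word, table):
    """Greedy left-to-right replacement; unmatched letters are dropped."""
    out = ''
    word = word.lower()
    while word:
        for english, syllable in table:
            if word.startswith(english):
                out += syllable
                word = word[len(english):]
                break
        else:
            # No entry matched: drop one character and keep going.
            word = word[1:]
    return out

# Hypothetical entries, chosen only for this demonstration.
base_table = [('th', 'o'), ('is', 'qe')]
vowel_fallback = [('a', 'da'), ('e', 'ri'), ('i', 'ku'),
                  ('o', 'to'), ('u', 'wa'), ('y', 'me')]

print(repr(translate_word('my', base_table)))                   # whole word consumed
print(repr(translate_word('my', base_table + vowel_fallback)))  # 'y' now survives
```

Putting the fallback entries in front of the digraphs would change the result for words like "this", because the single ‘i’ would match before ‘is’ gets a chance.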

Giving the language more flavor

It’s a good start, but we can get a bit further. First of all, the translation table could be expanded a bit with entries for semi-vowels (‘w’, ‘j’, ‘l’) and some consonants. But there are also things we can do with the language structurally. There is a linguistic term called “agglutination”, which means that an element with some syntactic meaning, instead of standing alone as a separate word, is tacked onto another word as a prefix or a suffix. English does this with the plural marker ‘-s’, for instance, while pronouns like “your” and “us” are separate words.

Some languages are heavily agglutinating, like Finnish, where “talossanikin” means “in my house, too”, whereas a language like Mandarin isolates everything (such languages are also called analytic languages).

For the sake of making my language more exotic than English I decided to have it use suffixes where English uses separate words in a number of cases. Another table does the trick:

switch_table = [
    'a', 'an', 'the', 'my', 'your', 'his', 'her', 'its', 'their', 'our',
    'i', 'we', 'you', 'he', 'she', 'it',
    'one', 'two', 'three', 'many', 'some',
]

(My final table is a little bigger than this but this illustrates the point)

If any of the words in the table is encountered, it switches places with the next word and joins it as a suffix. The function intermediate() handles that and creates the “intermediate” English form:

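def intermediate(text):
    i = 0
    s = text.lower().split(' ')
    s2 = []
    while i < len(s) - 1:
        if s[i] in switch_table:
            # Make suffix: move any trailing punctuation of the next
            # word to the end of the combined word
            n = s[i+1]
            nsuffix = ''
            if n.endswith(punctuation):
                nsuffix = n[-1]
                n = n[0:-1]
            s2.append(n + s[i] + nsuffix)
            # Skip the word we just consumed
            i = i + 1
        else:
            s2.append(s[i])
        i = i + 1
    # The last word has nothing to switch with; copy it over
    if i < len(s):
        s2.append(s[i])
    return ' '.join(s2)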

So if I run the string Why is this your new language? through intermediate() I get:

why is this newyour language?

And feeding that through translate() yields:

o qe oqe jeopi tezeanesi?

Writing it out

Now we only have to get it written in the nice Linear B symbols. Fortunately, Unicode covers Linear B, so as long as we have a font that includes its symbols (one such font is called “Aegean”), any web browser will be able to display the text. First, we just add the Unicode codes for each entry in the translation table:

translation_table = [
    ('en','a', '&#x00010000;'),  # Digraphs
    ('er','e', '&#x00010001;'),
    ('nt','i', '&#x00010002;'),
    ('th','o', '&#x00010003;'),
    # ...and so on...
]

We also need to modify the translateWord() function to return tuples of ASCII and Unicode (exercise left to the reader). Then we can easily dig out either the written or “spoken” version of the text and put it all in an HTML page (another exercise for the reader) for your favorite web browser to render.
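
For readers who would rather not do both exercises from scratch, here is one possible shape for them. It is a sketch under assumptions, not the code used for the examples in this article: the four-entry table is just the excerpt shown above, translate_word_pair, translate_pair and render_html are hypothetical names, and the font-family value assumes a font named "Aegean" is installed.

```python
import html

# Excerpt of the three-column table: (English, Latin transcription, HTML character reference).
translation_table = [
    ('en', 'a', '&#x00010000;'),
    ('er', 'e', '&#x00010001;'),
    ('nt', 'i', '&#x00010002;'),
    ('th', 'o', '&#x00010003;'),
]
punctuation = (',', '.', ':', ';', '!', '?')

def translate_word_pair(word):
    """Return (ascii_form, html_form) for one word; unmatched letters are dropped."""
    ascii_out, html_out = '', ''
    word = word.lower()
    while word:
        for english, latin, ref in translation_table:
            if word.startswith(english):
                ascii_out += latin
                html_out += ref
                word = word[len(english):]
                break
        else:
            if word[0] in punctuation:
                # Keep punctuation in both forms (escaped for HTML).
                ascii_out += word[0]
                html_out += html.escape(word[0])
            word = word[1:]
    return ascii_out, html_out

def translate_pair(sentence):
    """Translate word by word and return (spoken, written) strings."""
    pairs = [translate_word_pair(w) for w in sentence.split(' ')]
    return ' '.join(p[0] for p in pairs), ' '.join(p[1] for p in pairs)

def render_html(sentence):
    """Build a minimal HTML page showing the symbols and their transcription."""
    spoken, written = translate_pair(sentence)
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head><body>\n'
        '<p style="font-family: Aegean, sans-serif">%s</p>\n'
        '<p>%s</p>\n</body></html>' % (written, html.escape(spoken))
    )

print(translate_pair('then enter'))
```

Keeping the character references as plain strings means the page stays pure ASCII; html.unescape() turns them into the actual Linear B characters if they are needed elsewhere.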

Let’s try it…

It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy.

(Intermediate form) isit darka time thefor rebellion. although deaththe star has been destroyed, imperial troops have driven rebelthe forces theirfrom hidden base and pursued them theacross galaxy.

qewo rotawine kuzo osinita tasikisuina. reopise qokotasi duma mo kime qodutioja, ruzeerure titazeso nejo rosojo tasikisuosi nitariso owetati raroqo kimosi diro zesesoaja osi oqatisoso sereneo.

[Figure: the text above rendered in Linear B symbols, beginning “It is a dark time”]

Now we are done!

44 Responses to “How to create a language in one day”

  1. Informatica Says:

    Awesome work and nice code.

  2. Tony Ledford Says:

    This is a great article! I like the systematic approach to breaking down the language into component parts. I see it does require some understanding of linguistics (digraphs, agglutination) but without requiring a linguistics degree. This overview gives me a good jumping-off point to creating my own languages and learning how to give them some flair. Thanks so much for the info!!

  3. Arun Says:

    smashing job, boss

  4. Angel Ortega Says:

    This is not a language, it’s an encoding. It’s fundamentally different.

  5. Thiago Says:

    Thanks for sharing your approach, Tolkien must be honored by that.

  6. Oladipo MD Says:

    This is awesome. Now, I am inspired to do something like this into an app and offer a way to secretly send information

  7. sicher Says:

    @Angel Ortega:

    Well, what is a language? This “thing” has a unique vocabulary that follows (almost English) rules but with different morphology. The syntax is also unique (based on English though) – but it’s obviously not a naturally evolved language.

    The point of this is to create, with very little work, something that is very hard to distinguish from a real language and that can be used in, for instance, a videogame. Would you be able to figure out that this is not a “real” language?

  8. someguy Says:

    This is a great approach to the problem.

    Now I want to study linguistics too. Would you suggest any introductory book in particular?


  10. Strass Says:

    Wow, so cool! I love languages (in particular, fake ones), especially when combined with clever tricks like this.

    Any other posts you think I’d enjoy? ;)

  11. name Says:

    Nobody care your language.

  12. sicher Says:


    Nice syntax! Thanks for caring.

  13. Ryan Connell Says:

    I really like what you’ve done here. While I agree with Angel Ortega that what you’ve created isn’t actually a language, I think it was quite the worthwhile project and I’m very pleased that you’ve shared it with us. I think that a cypher like this is a good stepping stone to creating something more involved which not only has grammar differing from English but also different semantics. The fact that you coded in Python is a pleasure as well, because I’m teaching myself to code, and this is a lovely script for me to pick apart.

    Thank you, very much.

  14. Ryan Connell Says:


    If you’re interested in Conlanging might I suggest Mark Rosenfelder’s The Language Construction Kit. It’s inexpensive, quite accessible, and comes with a large section for suggested further reading. You can find it on Amazon.

    Sorry for being a shill on your blag.

    Is the entire body of your code found in this entry? If not, would it be crass of me to request that you put the entire thing up someplace for people to see it? I’d love to learn from it but I want to be sure I have it all.

  15. Ben Says:

    Ignoring whether or not this is pretty neat, and a great way to quickly knock together something serviceable that looks/feels neat (it is neat, by the way), Angel is right to say that this process is encoding rather than creating a language.

    As far as I’m aware (i.e. I’m not an expert on the subject, just an enthusiast), changing the sounds in words so they seem different does not a new language make (and adding suffixes or prefixes that remove the need for some words does not constitute a unique syntax, just a slightly altered one.) By that argument, pig latin (among others) should be considered a ‘language’.

    Encoding/encryption is a process where you obfuscate language (as in speech, or writing) so the meaning isn’t readily apparent. Most large groups (governments, businesses, etc) are very interested in this practice—it’s a handy trick to be able to keep others (enemies, competitors, etc) from knowing what you’re communicating about, while deciphering (i.e. decoding) what they’re talking about. However it’s done (it’s a complicated subject) the idea is that you convert the communication into code at one end, and convert it back at the other. The communication at one end has to end up the same at the other—translation errors must be impossible. In the end, if you’re encoding English, it can be decoded by reversing the process that went into encoding it, and you end up with the exact message that went into it.

    True language, constructed or otherwise, is different. With the exception (usually) of very simple words (things like ‘blue’, and ‘tree’), it’s a complicated process to translate ideas and concepts from one language to another, because words that seem similar or identical at first glance include nuances that can change the meaning greatly in context. (One of the reasons translators get credit on the books they translate, I suppose.) Google Translate can make it seem like translating is an easy thing…until you start feeding it poetry, or conversations. Then it begins failing all over the place. (Speaking from experience.)

    There’s a story from ye olde Roman days about a man picking up a book from his local library. It had just arrived via inter-library loan. (No, I’m not making this up.) He saw another man there, who asked what he was reading. The first man showed him the book, which was written in Greek. When the second man asked what it meant, the first translated the title. The words in Latin suggested one meaning, and the second man started making conversation about it until the first one stopped him to try to explain that the words, in that context, meant an entirely different thing in Greek. That the Greek author was, in fact, encouraging the exact opposite point of view of what it seemed when the title was translated into Latin.

    All of that sums up to: No, it’s not a language, it’s an encryption. …which doesn’t take away from the fact that it’s very clever and would be difficult to tell apart from an actual language. Very cool, and well done. Nice job!

    ***This was not an attempt to belittle or demean or attack the original post in any way. I just thought the question “What is a language?” deserved some kind of answer, even a (very) crude attempt like mine. :) ***

  16. Michael Everson Says:

    Nice to see someone using the Unicode encoding for Linear B. It makes it all worthwhile. :-)

  17. KrazyDad » Blog Archive » How to create a fake language in one day Says:

    […] interesting blog post from game designer Mikael Säker, in which he explains how to create a fictitious language (for a computer game) in one […]

  18. Scott Says:


    Japanese has a couple of syllabaries too – hiragana and katakana – for a more Asian flavor, if you like that sort of thing.

  19. pozorvlak Says:

    Ben: actually, you’d be surprised how many languages lack a word for “blue”. See Guy Deutscher’s book “Through the Language Glass”.

  20. In praise of the relex and how to make a better one | Fake languages by a fake linguist Says:

    […] I read this article on how to create a language in one day and really it is about machine generated relexes. And that isn’t a bad thing, it has a […]

  21. Tamfang Says:

    There are more syllabaries than alphabets; it’s only because of a quirk of Semitic morphology that alphabets now dominate the world.

  22. Magnus Kochman Says:

    Your translation process is heavily reminiscent of Japanese romanization technique, especially as it produces words with similar cadences and consonant/vowel balance. The Linear B table, especially, looks like a hiragana/katakana chart with the kana deleted and replaced by Linear B glyphs.

  23. David J. Peterson Says:

    Just to throw in my two cents: It’s not that this project isn’t a language, it’s that it’s not a separate language from English. Like Pig Latin or the language in Skyrim, it’s just a different version of English—a different way of speaking English.

    I think it’s pretty clear that it serves its purpose, though. If all that’s wanted is something that sounds like it might be a real language but which is more consistent than gibberish, this serves rather well.

  24. Roberto Says:

    Very nice experiment. The result is very convincing. I wonder how difficult it would be to add new “features” to English using this kind of process.

  25. atsiko Says:

    It’s not a language. It’s not even an encoding, as some previous posters have proposed. It’s just nonsense text using English as its base for generation. You seem to have learned a lot about the superficial aspects of computational linguistics and how to apply a scripting language to them, but your understanding of actual linguistics is severely lacking.

    It’s fine if you think using this system gets the result you want. But you should be honest with the people reading your blog. You have not created a language; there is no translation or encoding going on here. You’re just using string manipulation to generate some nonsense to put into your game.

  26. sicher Says:


    The whole point of the method is not to do what a proper language builder would do but to sidestep that work and arrive at a crude approximation of language that still does the job pretty well. I added a clarification about that in the text.

  27. Crew of DampfCraft VII Says:

    Man… I’ve traveled for more than 30 of your Earth years through our (local) universe and beyond, know nearly every major Einstein tunnel..
    Scanned all the languages and dialects I’ve encountered (using the Bablefish VII Extended System).. Found a striking similarity of your “language” with that of the planet of Ooopx. (methane type) All I can say is, don’t go there. Too dangerous. ..well err.. you might anyway. but please.. don’t use your linguistic attempts there. Even the smallest deviation or misspelling could be interpreted by the Ooopxers as a severe insult and/or obscenity, resulting in immediate termination. So don’t. In general one should be very careful with inventing languages from scratch. As there are so many planets the chance of “getting one of those right” is almost certain.. Havoc when the cultures don’t match..
    Btw. on the artificial planet Allmart, near Ooopx, is a great supermarket,
    they do speak English there.
    Thank you for reading attentively,
    I wish you a good weekend
    with kind regards

  28. Bill Woodruff Says:

    Really enjoyed your article, and feel no compulsion to speculate as to whether this is a “real language” or not :)

    Did you consider modifying subject-verb-object order? Masculine / feminine / transgender modifiers? Declensions, conjugations? Positive and negative ‘intensifier’ suffixes or prefixes?

    Reading aloud the generated text it seemed guttural, more Nordic, or Germanic.

    Your article reminded me of this news article today on Live Science:

  29. Ekitaja Says:

    Ebu neksa, el-bag jatsu veko ni ja. Elag ekta nago. Rabo tim.

    Sech-la sich ta nacho.

    Mech-da la im.

  30. Stephan Says:

    What would be a language then? It’s all about shifting from a base language to what evolves into a language that a group of people use to communicate and understand each other. A language IMHO only requires a dictionary, grammar and users. That’s it. This would be a language, because every English word could possibly be translated to this language while risking some clashes, but ambivalence is common in a live language. Chinese in speech only floats on context, because a syllable can be represented as much as 20+ meanings.
    If we would use this to act like a German Consonant Shift happened to English would that result into a new language? (Using this because iirc they both share a Saxon base)

    I said them not to leave the ship!
    Ich sagte ihnen nicht das Schiff zu verlassen!
    Shifted (English word order):
    Ich (Ik) sag dem nit (niet) laven (lassen) de schip!
    Shifted (German word order):
    Ich (Ik) sag dem nit (niet) de schip laven (lassen)!
    See also how Dutch fills an intermediate form:
    Ik zei ze niet het schip te verlaten!

    Of course this leaves out the fact that over the course of a 1000 years the 2 languages shifted further away from each other and English having a butt-load of French stuff in their language not found in German.
    But is the shifted form a new language? It could be and this shift can also be done by a computer. What’s the difference?

  31. Csaba Szigetvári Says:

    Very good article, I like your approach!

  32. bob Says:

    This is a really fantastic concept :-)

  33. dgreen Says:

    This is really cool! It is really Art! After all, what is art other than an imitation of what is real from the artist’s perspective! And if I were playing your video game where you used this “language”, I would FEEL it was a language. No one with a straight face could look Picasso in the face and say “Those eyes are not real eyes! They’re even in the wrong place!”.

    Keep up the good work, artist! And maybe you shouldn’t share your brush techniques with everyone :)

  34. Navin Says:

    Have you looked at the whole group of syllable languages which started with Sanskrit?
    About 12 major languages, and over 60 dialects from these already exist and are in use on a daily basis. There are over 900 million people already conversant with syllabic language based on the original Sanskrit alphabet.
    There exists a world East of Suez.

  35. atsiko Says:

    @Stephan: There’s no hard and fast definition of what makes one language separate from a related speech form (language/dialect/idiolect/etc). There is a strong and well-elaborated definition of what constitutes a language in general, and the final product of this algorithm as presented fails that test in certain ways.

    Chinese syllables don’t actually mean 20 different things. A specific syllable in Chinese can certainly have more than one possible interpretation. I wouldn’t argue against the suggestion that the majority of syllables have multiple interpretations. But keep in mind that unlike in English, tones in Chinese are phonemic. That means that even if four different syllables are all written without tone marking as “ma”, they are still four completely independent syllables based on tone.

    As for floating entirely on context, every language does that. If I say “it was”, that is a legitimate independent sentence in English. It could mean an almost infinite number of things depending on what antecedent the pronoun “it” refers to. And yet there are still regular patterns to English and Chinese which the output above fails to follow.

    A shift such as the one you exemplify using English, German, and Dutch could be executed by a computer, however that is not what the code above does. The shift you refer to follows several well-known principles of language change across various areas of the sound system and grammar.

    The above code follows none of these principles. It’s just a randomly designed cypher algorithm. Which meets the needs of the blogger, but doesn’t qualify as creating a language or even a dialect or creole. That’s not a value judgement. The above code has plenty of value to game designers on a low budget. It’s just not a language.

    @Bill W: Creating a program capable of recognizing the syntactic categories of English would require an enormous database of hard-coded values, and even with a strong computational linguistic framework behind it, it could never be 100% accurate. The same goes for the other areas you mentioned, and possibly most of them would be even more difficult. A human brain could certainly do it, but a human wouldn’t be capable of batch processing a source text or lexicon.

    The output text certainly has more velar phonemes than the English source text, but it’s actually closer to Japanese or another CV syllable language than to any of the Nordic languages or German.

  36. atsiko Says:

    @The Original Human Language:

    The article is interesting and from a pop-sci perspective may sound convincing, but there are several holes in the argument. This is not to say the conclusion is false, but I would be careful about how much you credit it.

    First, there are estimated to be about 6,000 extant languages currently spoken in the world. Many of these have not been well-studied and are much more distantly related to Indo-European than the ones the study probably examined. This is relevant because Indo-European is one of the largest commonly spoken languages in the modern world, and thus one of the most well-studied. However, it only makes up a small proportion of the total number of world languages. The researchers are also IE speakers, and that means they have a bias towards IE traits.

    The second issue is that it is not true that all languages progressed forward from SOV to other orders. Research on Oceanic languages suggests that word order does not move in only one direction, and that many languages evolved into SOV languages from other word orders. The assertion that many languages have not changed word order since proto-World is demonstrably false, and many languages have in fact changed word order several times, including shifting back and forth between their parent language’s word order and various other patterns. Further, there is zero real agreement in the linguistic community about any feature of the so-called “proto-World” language, although the probable geographic location is generally agreed to be somewhere in Africa.

    Third, word order is actually much more complicated than just “SOV” or “OVS”. Many languages do not fit well into this typing framework, and even English has some fairly obvious deviations from its “basic word order”. Finally, many languages exhibit something called “free word order”, which exists because those languages use morphology to mark subjects, objects, etc, rather than relying on syntactic position as English does.

    The note about Talmy Givon’s earlier research (the article mis-cites him as “Tom Givon”) is also misleading, because the link that follows has no relation to the topic whatsoever. Also, several eminent linguists, including Lyle Campbell, have criticised the research. The idea that enough reflexes (words derived from a common root) have survived from proto-World, when just looking at relatively well-known languages provides extremely diverse lexicons, is very suspect, and several words cited in the study have been debunked as happenstance.

    Articles on such sites commonly print suggestive studies as fact to create pageviews, and while interesting conceptually, they aren’t very trustworthy in terms of actual science.

  37. manic Says:

    To make record we can use Linear B writing system.
    Any language will do.

  38. Najja Nasiif Says:

    wow, that’s a brilliant idea and article!

  39. Nate Allan Says:

    Cool as something for fun, but not passable as a real language I’m afraid. Even with my limited understanding of language evolution I think it is safe to say that a language expert would easily see through this due to things like the close spacing of voiced and unvoiced consonants.

  40. Tamfang Says:

    Navin: By ‘syllable languages’ do you mean those languages written in alphasyllabaries descended from Brahmî, or something else?

  41. Hire .Net Developer Says:

    Wow, that’s a brilliant idea, I will try. Thanks.

  42. direduck Says:

    @name: here’s your sentence in arabic!
    لا أحد يهتم لغتك.
    Kukaan välitä kieli.
    and now, as a graph:

    Language: | People’s care:
    Your language | no one
    klingon | >=all

  43. Blancbard Says:

    Sorry, but this is not a language. It is English encrypted by some VERY simple functions.

  44. sicher Says:

    You are right. The idea here is to create the illusion of language with simple and fast means.
