On the frequencies of individual phonemes in the phonetic inventories of constructed languages

O. Introduction

Constructed languages, or “conlangs”, are languages that are constructed. Language constructors, or “conlangers”, construct languages for a variety of reasons. Some conlangs are meant to resemble natural languages, or “natlangs” (these are called “naturalistic artistic languages”, or “artlangs”), while others are simplified and regularised for the purposes of international communication (these are called “auxiliary languages” or “auxlangs”), and still others are created without any goal but to entertain the creator (these are called “personal languages”; there isn’t a portmanteau for this one). Now I can already hear what you’re thinking: “How do the frequencies of individual phonemes in conlangs’ phonetic inventories compare to those in natlangs’?” Well, dear reader, I too once yearned to know the answer to this age-old question, and I determined to find out – with science! As it turns out, they are quite similar, with the notable exceptions of /w/ and /ɲ/.

I. Methods

This essay is based on a survey I conducted on Reddit last week in which I asked conlangers for their phonetic inventories. I got 117, which I supplemented with 24 famous conlangs (mostly auxlangs, which were woefully absent from my crowdsourced data) from elsewhere on the internet. I divided these inventories into four classes, leaving me with 18 auxlangs, 47 naturalistic artlangs, 44 personal languages, and 32 other conlangs (alien, philosophical, and engineered languages). I compared these inventories to those of the 1673 natlangs documented in the Phoible data-set. To prevent my data-sets from being absurdly large and incomparable, I lumped together many phonemes into broad groups, such as “all prenasalised affricates”, or “all voiced stops besides /b/, /d/, /ɖ/, and /ɡ/”. Further, because many phonetic pairs, such as /ʃ~ɕ/, have blurry boundaries (Phoible classifies as /ʃ/ a lot of fricatives that I see elsewhere transcribed /ɕ/), I counted them not as individuals, but as groups (written with a tilde) and distinctions (written with a slash). Thus, you will not find “ʃ” in my tables below – only “ʃ~ɕ” and “ʃ/ɕ”.
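
To make the group-versus-distinction idea concrete, here is a minimal sketch of how such a tally could work. It is not the script used for this post; the function names and the three toy inventories are made up, and it assumes that a "group" means an inventory has at least one phoneme of the pair while a "distinction" means it has both.

```python
# Hypothetical helpers for illustration only; the real tallying method
# is not shown in the post.

def tally_pair(inventory, a, b):
    """Return (group, distinction) flags for a blurry pair such as ʃ and ɕ.

    group       -> the inventory has at least one of the two phonemes ("a~b")
    distinction -> the inventory has both, i.e. it contrasts them ("a/b")
    """
    has_a = a in inventory
    has_b = b in inventory
    return (has_a or has_b, has_a and has_b)


def pair_frequencies(inventories, a, b):
    """Percentage of inventories showing the group and the distinction."""
    flags = [tally_pair(inv, a, b) for inv in inventories]
    n = len(flags)
    group_pct = 100 * sum(g for g, _ in flags) / n
    distinction_pct = 100 * sum(d for _, d in flags) / n
    return group_pct, distinction_pct


# Three made-up toy inventories, not taken from the survey data.
toy = [
    {"p", "t", "k", "s", "ʃ"},
    {"p", "t", "k", "s", "ʃ", "ɕ"},
    {"p", "t", "k", "s"},
]
group_pct, distinction_pct = pair_frequencies(toy, "ʃ", "ɕ")
print(f"ʃ~ɕ: {group_pct:.2f}%  ʃ/ɕ: {distinction_pct:.2f}%")
```

Counting this way means an inventory transcribed with /ɕ/ and one transcribed with /ʃ/ both land in the same “ʃ~ɕ” row, which is the point of lumping them.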

II. General results

The first thing I noticed from these data was that auxlangs are not nearly as popular as I had thought. Of the 117 conlangs I got from Reddit, only three of them were auxlangs; the rest of the auxlangs in my data were the famous ones I recorded myself. The largest class of conlangs on Reddit was, not too surprisingly, personal languages (I got 44 of those), followed by naturalistic artlangs (43) and all other conlangs (27).

Figure 0. The phoneme frequency results for all conlangs against all natlangs, with outliers labelled.

So how well do conlangs’ phonetic inventories align with those of natlangs? Fairly well. Plotting the percentage of conlangs that contain each phoneme against the percentage of natlangs that do in Figure 0, I found that the root-mean-square deviation (RMSD) of conlang phoneme frequencies from their natlang counterparts was a mere 7.78%. Interestingly, if I break the conlangs into their subgroups, the average residuals increase, implying that the deviations from natlang distributions in the various types of conlangs cancel each other out somewhat when added together. Auxlangs had the greatest residuals, with an RMSD of 11.54%, followed by personal languages at 9.31%, other conlangs at 8.76%, and naturalistic artlangs at 7.88%. This bodes well for the conlanging community; the conlangs whose inventories are least supposed to look naturalistic, the auxlangs, have the least naturalistic inventories, whereas the opposite is true for naturalistic artlangs. We’ll look at each of these groups individually in a bit.
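
For readers who want to see the RMSD idea spelled out, the sketch below computes it for two lists of percentages and shows how opposite deviations in two subgroups can cancel when the subgroups are pooled. The numbers are invented for illustration and are not from the survey; the function name `rmsd` and the equal-sized-subgroups assumption are mine.

```python
import math


def rmsd(conlang_pct, natlang_pct):
    """Root-mean-square deviation between two equally long lists of percentages."""
    if len(conlang_pct) != len(natlang_pct):
        raise ValueError("both lists must cover the same phonemes")
    squared = [(c - n) ** 2 for c, n in zip(conlang_pct, natlang_pct)]
    return math.sqrt(sum(squared) / len(squared))


# Invented frequencies for two phonemes, NOT taken from the survey.
natlangs = [50.0, 50.0]
subgroup_a = [70.0, 40.0]  # overshoots the first phoneme, undershoots the second
subgroup_b = [30.0, 60.0]  # deviates in the opposite direction

# Pooling two equally sized subgroups averages their percentages.
pooled = [(a + b) / 2 for a, b in zip(subgroup_a, subgroup_b)]

print(f"subgroup A: {rmsd(subgroup_a, natlangs):.2f}")
print(f"subgroup B: {rmsd(subgroup_b, natlangs):.2f}")
print(f"pooled:     {rmsd(pooled, natlangs):.2f}")
```

In this toy case each subgroup deviates noticeably on its own while the pooled set matches the natlang figures exactly, which is an exaggerated version of the cancelling effect described above.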

Phoneme Natlangs (%) Conlangs (%) Diff. (%)
x 18.05 48.23 +30.18
y 3.65 31.21 +27.56
θ 3.89 31.21 +27.32
ʒ~ʑ 14.76 41.13 +26.37
v 30.66 54.61 +23.95
ʃ~ɕ 40.05 63.83 +23.78
ø~œ 4.00 24.11 +20.11
◌̃ 26.30 4.96 -21.34
w 84.64 60.99 -23.65
ɲ 50.93 21.28 -29.65

Table 0. The values for all phonemes whose frequencies differed by at least 20 percentage points between conlangs and natlangs.
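
The Diff. column is simply the conlang percentage minus the natlang percentage, and a row counts as an outlier when that difference reaches 20 in either direction. A short sketch of that filter, using the natlang and conlang columns of Table 0 as input (the function and variable names are illustrative assumptions, not the original analysis code):

```python
THRESHOLD = 20.0  # cut-off used for Table 0

# (phoneme, natlangs %, conlangs %) copied from Table 0
table_0 = [
    ("x", 18.05, 48.23),
    ("y", 3.65, 31.21),
    ("θ", 3.89, 31.21),
    ("ʒ~ʑ", 14.76, 41.13),
    ("v", 30.66, 54.61),
    ("ʃ~ɕ", 40.05, 63.83),
    ("ø~œ", 4.00, 24.11),
    ("◌̃", 26.30, 4.96),
    ("w", 84.64, 60.99),
    ("ɲ", 50.93, 21.28),
]


def outliers(rows, threshold=THRESHOLD):
    """Return (phoneme, diff) pairs whose absolute difference meets the threshold,
    sorted from most overrepresented to most underrepresented."""
    result = []
    for phoneme, nat, con in rows:
        diff = round(con - nat, 2)  # positive means overrepresented in conlangs
        if abs(diff) >= threshold:
            result.append((phoneme, diff))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


for phoneme, diff in outliers(table_0):
    print(f"{phoneme}\t{diff:+.2f}")
```

The same function would work on the per-class tables further down, given their natlang and conlang columns.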

If we look specifically at the outliers, shown in Table 0, we see the most over- and underrepresented phonemes of conlangery. /ɲ/ and /w/, interestingly enough, are underrepresented not only in the full data-set, but in each individual group of conlangs, as well. The overrepresented phonemes arise from different factors in different conlang types. To understand these better, let’s see what happens when we look at one class of conlang at a time.

III. Auxlangs

As stated before, auxlang inventories are the least representative of natlang inventories. This makes sense. When making an auxlang, you generally want to apply as little creativity to your phonetic inventory as possible, instead taking common, easy, and useful phonemes and discarding the rest. This results in the two blobs we see in Figure 1. If a phoneme appears in, say, 80% of natlangs, then it will likely turn up in almost all auxlangs. On the other hand, if it appears in only 20%, then it will likely be in almost no auxlangs. This explains many of the outliers; /f/, /ɡ/, and /b/ all appear in a modest majority of natlangs and a vast majority of auxlangs, while labialisation, the e/ɛ and o/ɔ distinctions, tones, nasalisation, vowel length, and glottal stops fare the opposite. Some of the other outliers can be explained this way, but only when natlang size is taken into account. It may seem odd that /ʃ~ɕ/, for instance, should show up in the majority of auxlangs despite being in the minority of natlangs, but this is because both /ʃ/ and /ɕ/ are fairly well-represented in the most commonly spoken natlangs. /ɲ/, on the other hand, appears in the majority of natlangs, but appears in very few major natlangs besides Spanish, so it appears in few auxlangs.

Figure 1. The phoneme frequency results for auxlangs against natlangs, with outliers labelled.
Phoneme Natlangs (%) Auxlangs (%) Diff. (%)
v 30.66 83.33 +52.67
f 50.21 94.44 +44.23
ʒ~ʑ 14.76 55.56 +40.80
ʃ~ɕ 40.05 77.78 +37.73
z 35.51 72.22 +36.71
r 42.32 66.67 +24.35
ɡ 64.91 88.89 +23.98
b 71.85 94.44 +22.59
◌ʷ 20.02 0.00 -20.02
e/ɛ 31.68 11.11 -20.57
w 84.64 61.11 -23.53
2+ tones 29.11 5.56 -23.55
◌̃ 26.30 0.00 -26.30
o/ɔ 32.99 0.00 -32.99
Long vowel 33.65 0.00 -33.65
ŋ~ɴ 58.10 22.22 -35.88
ʔ 43.04 5.56 -37.48
ɲ 50.93 5.56 -45.37

Table 1. The outlier values for auxlangs alone.

That leaves a few cases that are still tricky to explain. /r/ looks over-represented for no apparent reason at first, but this seems to be because it is the most common stand-in for the “whatever-rhotic” in auxlangs; its inflated frequency is pretty similar to the frequency of all rhotics combined. I’m not sure what the deal is with /w/. It’s quite common in natlangs, but substantially less common in auxlangs. This is a pattern we’ll see repeated in other conlang types. Perhaps it has to do with people using it as an allophone for /u/ or /v/? The same thing does not happen with /w/’s sister /j/, who is of practically the same frequency in natlangs and auxlangs.

The voiced fricatives are another interesting case. /v/, /ʒ~ʑ/, and /z/ all appear in the majority of auxlangs despite all being in a minority of natlangs. /v/’s overrepresentation is the most striking by far, rising from 31% of natlangs all the way to 83% of auxlangs. Why do auxlangers love voiced fricatives so much? I’m not entirely sure. Perhaps they like the voiced fricatives to complement their voiceless ones, so that they have a complete set (which doesn’t explain why we never see /ɣ/ or /ɦ/ in auxlangs). Perhaps they simply don’t realise how uncommon voiced fricatives are compared to their plosive counterparts (which aren’t that common to begin with). Perhaps it’s just that Esperanto has /v/, /z/, and /ʒ~ʑ/ and continues to strongly influence auxlangers today? The polar opposite happens with /ŋ~ɴ/, which auxlangers seem to hate for some reason. Again, Esperanto lacked it, so it could just be Zamenhof’s ghost pulling strings. Alternatively, it could be that auxlangers take issue with its rarity outside of codas: do you really want to add a whole new phoneme to your inventory if you’re never going to see it in the onset?

IV. Naturalistic artlangs

Naturalistic artlangs are, for the most part, far more phonologically similar to natlangs than auxlangs are. Good job, artlangers! There are, however, still some notable outliers. For one thing, there seems to be anti-/ɲ/ and anti-/ʔ/ bias. Both are severely underrepresented despite both appearing in about half of natlangs. As stated above, it could be that the fact that both are uncommon in major natlangs leads artlangers to believe that they are uncommon in all natlangs. The anti-/ɲ/ bias is even stranger when one notes that its cousin, /ʎ/, is overrepresented by 19%, almost as much as /ɲ/ is underrepresented. Other overrepresented phonemes include the voiced fricatives /ʒ~ʑ/ and /v/, which you likely remember from the auxlang section, /y/, for which I can’t blame people because it’s such a fun vowel, /ø~œ/, which may have some French or Scandinavian appeal, and /x/. I have no idea why /x/ is there, but it’s in nearly half the artlangs I saw. Maybe, since it’s in so many famous languages besides English (Spanish, German, Mandarin, etc.), artlangers think it sounds exotic or something.

Figure 2. The phoneme frequency results for naturalistic artlangs against natlangs, with outliers labelled.
Phoneme Natlangs (%) Artlangs (%) Diff. (%)
x 18.05 48.94 +30.89
y 3.65 34.04 +30.39
v 30.66 53.19 +22.53
ø~œ 4.00 25.53 +21.53
ʒ~ʑ 14.76 36.17 +21.41
ʔ 43.04 21.28 -21.76
ɲ 50.93 27.66 -23.27

Table 2. The outlier values for naturalistic artlangs alone.

V. Personal languages

I consider the outliers of personal languages to be indicators of what sounds are the easiest and most fun to say. Apparently, /θ/ and /æ/ are super easy (I mean, most conlangers speak English to some degree from what I can tell, so that makes sense). /θ/’s frequency in personal languages is more than ten times its frequency in natlangs, and /æ/’s is up by a factor of more than five. /x/ is up there, too, presumably because of how nicely it fills out the labial-alveolar-velar nasal-plosive-fricative matrix, or perhaps because it’s so easy to say despite its obscurity (I think [x] was the first phone I learned to say without having ever heard it in a natlang). Then, we see the overrepresentation of /y/, /ʃ~ɕ/, /ʒ~ʑ/, /ø~œ/, /ç/, and syllabic consonants. You’ll notice that we saw most of these in the naturalistic artlang section. My best hypothesis as to why these are all so disproportionately common in personal languages is that they are all great fun to say. I find them fun, anyway.
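
The "times as frequent" figures are just the personal-language percentage divided by the natlang percentage. As a quick worked check, here is a small sketch using four rows from Table 3 (the dictionary layout and the `ratio` helper are illustrative assumptions):

```python
# phoneme -> (natlangs %, personal languages %), as listed in Table 3
table_3 = {
    "θ": (3.89, 40.91),
    "æ": (5.86, 31.82),
    "x": (18.05, 52.27),
    "y": (3.65, 36.36),
}


def ratio(nat_pct, con_pct):
    """How many times more frequent a phoneme is in conlangs than in natlangs."""
    return con_pct / nat_pct


for phoneme, (nat, con) in table_3.items():
    print(f"/{phoneme}/ is {ratio(nat, con):.1f} times as frequent in personal languages")
```

Ratios tell a different story from the Diff. column: a phoneme that is rare in natlangs can have a huge ratio with a fairly modest difference, so it is worth looking at both.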

Figure 3. The phoneme frequency results for personal languages against natlangs, with outliers labelled.
Phoneme Natlangs (%) Personal langs (%) Diff. (%)
θ 3.89 40.91 +37.02
x 18.05 52.27 +34.22
y 3.65 36.36 +32.71
ʒ~ʑ 14.76 43.18 +28.42
ø~œ 4.00 31.82 +27.82
æ 5.86 31.82 +25.96
ç 4.66 29.55 +24.89
◌̥ 1.26 22.73 +21.47
ʃ~ɕ 40.05 61.36 +21.31
3+ tones 21.04 0.00 -21.04
◌̃ 26.30 2.27 -24.03
2+ tones 29.11 2.27 -26.84
ɲ 50.93 20.45 -30.48
w 84.64 52.27 -32.37

Table 3. The outlier values for personal languages alone.

On the underrepresented side, we see tones, nasal vowels, and our old unloved friends /ɲ/ and /w/. The tones and nasals are a shame, as I think those are both pretty fun. Perhaps people just don’t think of them when creating conlangs; they’re both fairly niche relative to their prevalence in natlangs. I really can’t say why /ɲ/ and /w/ are down here again, though. Chalk it up to targeted anti-phonemic bias.

VI. Other conlangs

Finally, we hit the engelangs, alien artlangs, and philosophical languages. Since these are so varied, I’m not sure what I expect to glean from lumping them all together like this, but let’s have a go anyway, shall we?

Figure 4. The phoneme frequency results for other conlangs against natlangs, with outliers labelled.

At first glance, it looks very similar to the personal languages, which makes sense since neither of these categories has a coherent purpose. /x/, /ʃ~ɕ/, /ʒ~ʑ/, and /y/ all make encore appearances in the overrepresented section. /r/ appears as well, possibly standing in as the whatever-rhotic again. The two notable newcomers are /l/ and /ɬ/. To be honest, I don’t know why the laterals didn’t show up in the personal languages. These are two of my favourite sounds; they’re just so fun and versatile. As far as underrepresentation goes, we see nasal vowels again (I can see those being used to great appeal in some engelangs, but again, I suppose most people just don’t think of them), the o/ɔ distinction (but not e/ɛ, interestingly), and (drumroll please) /w/ and /ɲ/. Like I said at the beginning, these two guys are just despised across the board. Especially /ɲ/. What’s with that? It’s not like it’s hard to say. Is it that it sounds too similar to /nj/? Because it has a really distinctive sound in the coda if you’re into codal nasals (and who isn’t into codal nasals?).

Phoneme Natlangs (%) Other conlangs (%) Diff. (%)
x 18.05 50.00 +31.95
ʃ~ɕ 40.05 65.62 +25.57
ɬ 6.69 31.25 +24.56
ʒ~ʑ 14.76 37.50 +22.74
y 3.65 25.00 +21.35
l 76.69 96.88 +20.19
r 42.32 62.50 +20.18
◌̃ 26.30 6.25 -20.05
w 84.64 62.50 -22.14
o/ɔ 32.99 9.38 -23.61
ɲ 50.93 21.88 -29.05

Table 4. The outlier values for all other conlangs.

VII. Conclusion

In summary, conlangers are, on the whole, good at what they do. On average, inventories designed to be naturalistic look naturalistic, and those that aren’t don’t. To offer some advice to those designing inventories in the future, I derive a few pointers from these data. If you’re making an auxlang, reconsider whether you really need voiced fricatives. Just because Esperanto has them doesn’t mean you necessarily should, too. If you’re making something naturalistic and want to stand apart, steer clear of /x/, front rounded vowels, and voiced fricatives. Consider a glottal stop instead. For those of you making a personal language or something else, don’t underestimate the power of tones and nasals; they may not be as niche as you think. And everyone, for the love of Sudre, stop hating on /w/ and /ɲ/! They’re perfectly good phonemes, and they deserve better than this!

