Last edited on 1997-11-24 16:53:11 by stolfi

Languages A and B and the prefix-midfix-suffix decomposition

Currier languages A and B

It is well known that the VMs can be partitioned into two subsets, conventionally called "language A" and "language B", with very different statistics. This note compares the two languages under a particular prefix-midfix-suffix decomposition of VMs words.

If we look at listings or plots of frequencies of each component, we see that the two languages actually have the same general structure. In particular, the midfix contains only "hard" letters, and consists of one or two main letters {ch,sh,t,k,p,f,cth,ckh,cph,cfh} each followed by zero or more "e"s.
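
The midfix structure just described can be sketched in code. The Python fragment below is only an illustration under stated assumptions, not the procedure used to produce the listings: it assumes that words are spelled with the letter groups exactly as written in this note, that the midfix is the first run of one or two main letters (each with its trailing "e"s), and that whatever comes before and after that run is the prefix and the suffix. The names HARD, MIDFIX and decompose are hypothetical.

```python
import re

# Main midfix letters as listed in the text. Longer groups come first so
# that "cth" is matched as one letter rather than stopping at a shorter one.
HARD = ["cth", "ckh", "cph", "cfh", "ch", "sh", "t", "k", "p", "f"]

# One unit = a main letter followed by zero or more "e"s.
UNIT = "(?:" + "|".join(HARD) + ")e*"

# Assumption: the midfix is the first run of one or two such units.
MIDFIX = re.compile("(?:" + UNIT + "){1,2}")


def decompose(word):
    """Split a word into (prefix, midfix, suffix).

    Returns None when the word contains none of the main letters.
    """
    match = MIDFIX.search(word)
    if match is None:
        return None
    return (word[:match.start()], match.group(0), word[match.end():])


# Sample inputs, chosen only to exercise the function.
print(decompose("qokeedy"))  # ('qo', 'kee', 'dy')
print(decompose("chol"))     # ('', 'ch', 'ol')
print(decompose("daiin"))    # None
```

This sketch does not check that the prefix and suffix are free of main letters, so a word with a third main letter further along would simply carry it into the suffix.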

Also, the prefixes are practically the same, with the same distribution; and the suffixes seem to be composed of the same letter groups {y, or, ol, ar, al, am, aiin, etc.}. The all-soft words ("unifixes") also share this same "fine structure".

However, there are substantial differences in the distribution of midfixes, suffixes, and unifixes. In the case of suffixes, the main differences still seem relatively simple to describe:

The -dy suffix is little used in language A, but ends more than 25% of the words of language B; and
Language A uses many suffixes that start with -o.. (especially -ol, -or, and -o) which are much less common in language B. In particular, the -om suffix occurs 66 times in the A sample, and -ory occurs 27 times, and both are entirely absent from the B sample.

If we ignore these suffixes, the ones that remain have roughly the same distribution in both languages.
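
One way to check a claim of this kind is to drop the language-specific suffixes from each sample and renormalize what is left. The following Python sketch assumes the suffixes have already been extracted into plain lists; the function names and the two toy lists are hypothetical placeholders, not data from the A and B samples.

```python
from collections import Counter


def is_language_specific(suffix):
    """Suffixes singled out in the text: -dy, and those starting with -o."""
    return suffix == "dy" or suffix.startswith("o")


def remaining_distribution(suffixes):
    """Relative frequency of each suffix after dropping the ones above."""
    kept = [s for s in suffixes if not is_language_specific(s)]
    total = len(kept)
    if total == 0:
        return {}
    return {s: n / total for s, n in Counter(kept).items()}


# Toy inputs (made up for illustration only).
toy_a = ["ol", "or", "y", "aiin", "o", "y"]
toy_b = ["dy", "dy", "y", "aiin", "dy", "y"]

dist_a = remaining_distribution(toy_a)
dist_b = remaining_distribution(toy_b)
for suffix in sorted(set(dist_a) | set(dist_b)):
    print(suffix, dist_a.get(suffix, 0.0), dist_b.get(suffix, 0.0))
```

The toy lists are built so that the two remaining distributions coincide; with real samples one would expect only rough agreement.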

As for the midfixes, the differences are larger and harder to describe. The "platform gallows" -cth- and -ckh-, which account for 12% of the A midfixes, account for only 1.5% of the B midfixes. The -ch- and -sh- midfixes drop from 17% to 10%, and the "compounds" -tch- and -kch- drop from 11% to 5%.

On the other hand, the -k- and -t- midfixes jump from 15% to 25%, and -ke- and -te- go from 3% to 11%.