Last edited on 1997-11-24 16:53:11 by stolfi
It is well known that the VMs can be partitioned into two subsets, conventionally called "language A" and "language B", with very different statistics. This note compares the two languages under a particular prefix-midfix-suffix decomposition of VMs words.
If we look at listings or plots of frequencies of each component, we see that the two languages actually have the same general structure. In particular, the midfix contains only "hard" letters, and consists of one or two main letters {ch,sh,t,k,p,f,cth,ckh,cph,cfh} each followed by zero or more "e"s.
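The midfix pattern just described can be written as a regular expression. The following is only a minimal sketch, assuming words are transcribed with the letter groups used in this note; the names MAIN_LETTERS, MIDFIX_RE and is_midfix are illustrative and are not part of the original analysis.

```python
import re

# Main "hard" letters as listed in the text. Longer groups come first
# only for readability; fullmatch() backtracks over all alternatives.
MAIN_LETTERS = ["cth", "ckh", "cph", "cfh", "ch", "sh", "t", "k", "p", "f"]

# One or two main letters, each followed by zero or more "e"s.
MIDFIX_RE = re.compile(r"(?:(?:%s)e*){1,2}" % "|".join(MAIN_LETTERS))


def is_midfix(s):
    """Return True if the whole string s fits the midfix pattern."""
    return MIDFIX_RE.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["ch", "kch", "tee", "cthe", "o", "tkp"]:
        print(candidate, is_midfix(candidate))
```

The check covers only the shape of the midfix itself; it does not attempt to split a full word into prefix, midfix, and suffix.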
Also, the prefixes are practically the same, with the same distribution; and the suffixes seem to be composed of the same letter groups {y, or, ol, ar, al, am, aiin, etc.}. The all-soft words ("unifixes") also share this same "fine structure".
However, there are substantial differences in the distribution of midfixes, suffixes, and unifixes. In the case of suffixes, the main differences still seem relatively simple to describe:
If we ignore these suffixes, the ones that remain have roughly the same distribution in both languages.
As for the midfixes, the differences are larger, and harder to describe. The "platform gallows" -cth- and -ckh-, which account for 12% of the A midfixes, account for only 1.5% of the B midfixes. The -ch- and -sh- midfixes drop from 17% to 10%, and the "compounds" -tch- and -kch- drop from 11% to 5%.
On the other hand, the -k- and -t- midfixes jump from 15% to 25%, and -ke-, -te- go from 3% to 11%.
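To make the size of these shifts easier to compare, the quoted percentages can be turned into B/A ratios. This is a small sketch that uses only the figures given above; the names MIDFIX_SHARES and b_over_a are illustrative, and the class labels are shorthand for the midfix groups named in the text.

```python
# Midfix shares in percent, as (language A, language B), quoted from the text.
MIDFIX_SHARES = {
    "platform gallows": (12.0, 1.5),
    "ch, sh": (17.0, 10.0),
    "tch, kch": (11.0, 5.0),
    "k, t": (15.0, 25.0),
    "ke, te": (3.0, 11.0),
}


def b_over_a(shares):
    """Return the ratio of the B share to the A share for each class."""
    return {name: b / a for name, (a, b) in shares.items()}


if __name__ == "__main__":
    ratios = b_over_a(MIDFIX_SHARES)
    # List classes from the largest relative drop to the largest rise.
    for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:18s} B/A = {r:.2f}")
```

On these figures the platform gallows shrink by a factor of eight from A to B, while -ke-, -te- grow more than threefold.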