Hacking at the Voynich manuscript - Side notes
111 Analyzing the statistics of Vietnamese word elements

Last edited on 2025-05-04 17:14:55 by stolfi

INTRODUCTION

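  This note computes the length distribution of each element
  (slot) of Vietnamese words: initial consonants, vowels, tone,
  and final consonants. Each distribution is computed twice,
  once over tokens and once over lexemes.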

SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../tr-stats/dat
    ln -s ../tr-stats/fig
    ln -s ../tr-stats/tex
    ln -s ../../../work 
    
COMPUTING THE LENGTH DISTRIBUTION OF VIETNAMESE SLOTS

  Let's compute the length distribution of each slot of
  Vietnamese words, in the VIQR encoding:
  
    make LANG=viet BOOK=ptt ELEM=viqr -f slot-length-stats.make all
  
  Over the tokens of the text ("counts = t"):
  
      # viet/ptt/tot.1/gud slot = inic counts = t
      # len  count   freq strings
      # --- ------ ------ ------------------
          0   1955 0.0558 _
          1  21739 0.6206 b|c|d|g|h|k|l|m|n|q|r|s|t|v|x
          2  11233 0.3207 ch|dd|gh|kh|ng|nh|ph|th|tr
          3    100 0.0029 ngh
          4      0 0.0000 

      # viet/ptt/tot.1/gud slot = vows counts = t
      # len  count   freq strings
      # --- ------ ------ ------------------
          0      0 0.0000 _
          1  14460 0.4128 a|e|i|o|u|y
          2  13386 0.3822 a(|a^|ai|ao|au|ay|e^|eo|ia|ie|io|iu|o+|o^|oa|oe|oi|u+|ua|ui|uy
          3   4881 0.1393 a^u|a^y|e^u|ia(|ia^|iai|iao|iau|iay|ie^|ieo|io+|io^|ioi|iu+|o+i|o^i|oa(|oai|u+a|u+i|u+u|ua(|ua^|uay|ue^|uo+|uo^|ye^
          4   1127 0.0322 a(e^|e^a^|ia^u|ie^u|iea^|io+i|iu+a|u+o+|u+oi|ua^i|ua^y|uo^i|uye^|ye^u
          5   1173 0.0335 u+o+i|u+o+u
          6      0 0.0000 

      # viet/ptt/tot.1/gud slot = tone counts = t
      # len  count   freq strings
      # --- ------ ------ ------------------
          0  12727 0.3633 _
          1  22300 0.6367 '|.|?|`|~
          2      0 0.0000 

      # viet/ptt/tot.1/gud slot = finc counts = t
      # len  count   freq strings
      # --- ------ ------ ------------------
          0  20723 0.5916 _
          1   9728 0.2777 c|m|n|p|t
          2   4576 0.1306 ch|mg|ng|nh
          3      0 0.0000 
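
  As an illustration only, here is a minimal Python sketch of how
  tables like these could be produced. The regular expression, the
  names split_slots and length_table, and the sample words are all
  assumptions made for this example; the actual scripts invoked by
  slot-length-stats.make are not shown in this note and may factor
  the words differently. Lengths are counted in characters of the
  VIQR encoding, so that "u+o+i" has length 5.

```python
import re
from collections import Counter

# Assumed VIQR conventions: tone marks are ' ` ? ~ . and the vowel
# modifiers ^ ( + stay attached to the vowel slot.
TONES = "'`?~."
SYLLABLE = re.compile(r"^([bcdghklmnpqrstvx]*)([aeiouy^(+]+)([cghmnpt]*)$")

def split_slots(word):
    """Split one lowercase VIQR word into inic, vows, tone, finc slots.

    Returns None when the word does not fit the assumed pattern.
    """
    tone = "".join(ch for ch in word if ch in TONES)
    bare = "".join(ch for ch in word if ch not in TONES)
    m = SYLLABLE.match(bare)
    if m is None:
        return None
    inic, vows, finc = m.groups()
    return {"inic": inic, "vows": vows, "tone": tone, "finc": finc}

def length_table(words, slot, by_token=True):
    """Map each slot length to (count, freq) over tokens or distinct words."""
    items = words if by_token else sorted(set(words))
    hist = Counter()
    for w in items:
        slots = split_slots(w)
        if slots is not None:
            hist[len(slots[slot])] += 1
    total = sum(hist.values())
    return {n: (c, c / total) for n, c in sorted(hist.items())}

if __name__ == "__main__":
    # Hypothetical sample, not taken from the ptt text.
    sample = ["tie^'ng", "vie^.t", "tie^'ng", "a", "nghie^ng"]
    for slot in ("inic", "vows", "tone", "finc"):
        print(slot, "t", length_table(sample, slot, by_token=True))
        print(slot, "w", length_table(sample, slot, by_token=False))
```

  In this sketch, by_token=True plays the role of "counts = t"
  (every occurrence counts) and by_token=False that of "counts = w"
  (each distinct word counts once).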

  In the lexicon, with each distinct word counted once ("counts = w"):
  
      # viet/ptt/tot.1/gud slot = inic counts = w
      # len  count   freq strings
      # --- ------ ------ ------------------
          0     64 0.0378 _
          1   1089 0.6436 b|c|d|g|h|k|l|m|n|q|r|s|t|v|x
          2    528 0.3121 ch|dd|gh|kh|ng|nh|ph|th|tr
          3     11 0.0065 ngh
          4      0 0.0000 

      # viet/ptt/tot.1/gud slot = vows counts = w
      # len  count   freq strings
      # --- ------ ------ ------------------
          0      0 0.0000 _
          1    610 0.3605 a|e|i|o|u|y
          2    694 0.4102 a(|a^|ai|ao|au|ay|e^|eo|ia|ie|io|iu|o+|o^|oa|oe|oi|u+|ua|ui|uy
          3    286 0.1690 a^u|a^y|e^u|ia(|ia^|iai|iao|iau|iay|ie^|ieo|io+|io^|ioi|iu+|o+i|o^i|oa(|oai|u+a|u+i|u+u|ua(|ua^|uay|ue^|uo+|uo^|ye^
          4     89 0.0526 a(e^|e^a^|ia^u|ie^u|iea^|io+i|iu+a|u+o+|u+oi|ua^i|ua^y|uo^i|uye^|ye^u
          5     13 0.0077 u+o+i|u+o+u
          6      0 0.0000 

      # viet/ptt/tot.1/gud slot = tone counts = w
      # len  count   freq strings
      # --- ------ ------ ------------------
          0    475 0.2807 _
          1   1217 0.7193 '|.|?|`|~
          2      0 0.0000 

      # viet/ptt/tot.1/gud slot = finc counts = w
      # len  count   freq strings
      # --- ------ ------ ------------------
          0    765 0.4521 _
          1    637 0.3765 c|m|n|p|t
          2    290 0.1714 ch|mg|ng|nh
          3      0 0.0000
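
  A quick sanity check on the tables above: every word has exactly
  one value (possibly empty) in each slot, so the counts of the four
  slots should add up to the same total. They do: 35027 tokens and
  1692 lexemes. The Python sketch below repeats that check and also
  derives the mean length of each slot. The counts are copied from
  the tables; the names mean_length and report are made up for this
  example and are not part of the original scripts.

```python
# Counts copied from the tables above; list index = slot length.
TOKENS = {
    "inic": [1955, 21739, 11233, 100],
    "vows": [0, 14460, 13386, 4881, 1127, 1173],
    "tone": [12727, 22300],
    "finc": [20723, 9728, 4576],
}
LEXEMES = {
    "inic": [64, 1089, 528, 11],
    "vows": [0, 610, 694, 286, 89, 13],
    "tone": [475, 1217],
    "finc": [765, 637, 290],
}

def mean_length(counts):
    """Mean slot length, weighting each length by its count."""
    total = sum(counts)
    return sum(n * c for n, c in enumerate(counts)) / total

def report(name, table):
    """Print per-slot totals and mean lengths; return the totals."""
    totals = {slot: sum(counts) for slot, counts in table.items()}
    print(name, "totals per slot:", totals)
    for slot, counts in table.items():
        print(f"  {slot}: mean length = {mean_length(counts):.4f}")
    return totals

if __name__ == "__main__":
    report("tokens", TOKENS)
    report("lexemes", LEXEMES)
```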