From this preliminary hacking, I drew the following conclusions:

  The manuscript does not appear to use any hyphenation mark.  Either
  words are not broken across lines, which would be unusual, or they
  are broken without any extra marks.  Such word breaks may 
  result in statistical anomalies at the beginning and end of lines.
  Could this explain Currier's claim that lines are "functional units"?

  Comparing the two versions (Currier and Friedman), and looking at
  the word statistics, it seems that both are highly contaminated with
  errors (5-10% of the words).  This large amount of noise will
  mess up any statistical analysis based on either text alone.

  Therefore, before spending more time in the analysis, I must first
  prepare a "corrected" interlinear where discrepancies between FSG
  and Currier are resolved, taking into account the probabilities
  above.

  Looking at the actual shape of the characters, I realized that the
  FSG encoding was not very good for my purposes, since it assigns
  completely different codes to glyphs which may be just calligraphic
  variations of the same grapheme.  Thus I decided to do most
  processing using a more analytical encoding, which can be lumped later.

  I considered using Jacques Guy's "Neo-Frogguy" or "Gui2" encoding,
  but even that is a bit too synthetic --- for example, his <2> should
  be "i'", and his <9> should be `c)', for consistency. (The
  statistics on the occurrence of repeated <i>s apparently confirm
  this choice).

  Thus I decided to define my own "super-analytic" or "JSA" encoding.

  My super-analytic encoding
  --------------------------

    The idea is to break all characters down to individual "logical"
    strokes, and use one (computer) character to encode each stroke.

    There is some question as to what is a logical stroke, and when two
    strokes are different.  Obviously, the definition of a stroke must
    include not only its shape but also the way it connects to the
    neighboring strokes; and, given the irregularity of handwritten
    glyphs, that may be hard to decide.

    For instance, FSG's [A] character can be broken down into two
    strokes, shaped like the [C] and [I] glyphs.  Supposedly, the
    difference between an [A] and a [CI] is that in the former the
    strokes are connected into a closed shape.  Is this difference
    significant?

    I checked the occurrences of [CI], [CM], and [CN] in the interlinear
    file.  Two things are curious. First, these combinations are
    extremely rare.  Second, a good many of them are transcribed
    differently by Currier and the FSG: where one has [CIIR] the other
    often has [AIR], and vice-versa.  Same for [CM] versus [AN], etc.

    In light of these observations, I have decided to treat all
    occurrences of [A] as [CI]. If the two are indeed different, that
    will be just one more ambiguity added to the inherent ambiguity of
    natural language; so it cannot make the decipherment task more
    difficult.  Confusing the two will change the letter frequencies, it
    is true; but, since the language does not appear to be a
    standardized one, there is not much information we can extract from
    absolute letter frequencies.  The methods we hope to use --- such as
    automaton analysis --- are not significantly disturbed by collapsing
    letters.

    On the other hand, if [A] and [CI] are the same grapheme, using
    different encodings will seriously confuse statistics --- especially
    if the spacing depends on the immediate context.

    For similar reasons, it is best to ignore the distinction between
    [T] and [CC], or between [S] and [2C].  The ligature is often
    lost, and we don't know whether it is significant.

    Also, the characters that Currier transcribes as [6] are usually
    transcribed [K] by Friedman, and the two are very similar.  
    Strangely, [K] seems to occur mostly at the end of *lines*.

    The characters [7] [V] [Y] do not occur in this corpus.

    Summarizing, the JSA encoding breaks down every character
    into strokes, which are cast into one of these types:

      1. "Body" strokes:

          q  same as FSG [4], Guy <4>; also part of [H], [P], [HZ], ...

          o  same as [O], <o>

          c  same as [C], <c>; also part of [A], [8], etc.

          i  same as [I], <i>; also part of [A], [M], [N], [R], etc.

          l  long vertical bar of [D], [F], [DZ], [FZ]

      2. "Limb" strokes ("flourishes", "plumes", ...):

          g  an 8-shaped loop with both ends attached to the previous letter,
             as the right three-fourths of [8] and [7]; and also
             the right-hand swirls of [P], [F], [PZ], [FZ].

          y  a curving descender shaped like a right-parenthesis,
             attached to the top of the preceding stroke; 
             the right-hand stroke of [G] = <9>

          s  a plume attached to the top of the preceding char,
             pointing NE and curving up, as in [2] = <s>, [R] = <2>,
             and [S]

          x  a hook attached to the top of an \i/ stroke,
             curving sharply down and crossing the \i/;
             half of [E] = <x>.

          j  a P-shaped loop with one end attached to the
             top of the previous stroke, and the other extending straight down;
             as in the right half of [H], [D], [HZ], [DZ],
             and [K].

          u  a plume similar to \s/, but attached to the *bottom* of the preceding 
             stroke; as in [L], [N], [M].


    The ligature in [T] is ignored, i.e. Guy's <t> and <e> are
    identified with his <c>, and denoted uniformly by \c/.  This
    identification is consistent with the digraph statistics.

    The character <a> is rendered \ci/. In fact, <a> is probably not a
    letter --- it appears to be a \c/ stroke (possibly half of the
    preceding letter) accidentally connected to an \i/ stroke
    (probably the beginning of the next letter).

    The weirdo symbols [Y], [V], etc. will be translated as \?/.

    The FSG -> JSA correspondence is, therefore:

        IIIK -> iiiij   IE -> iix   A -> ci   N -> iiu
        IIIL -> iiiiu   IR -> iis   C -> c    O -> o
        IIIR -> iiiis   IK -> iij   D -> lj   P -> qg
        IIIE -> iiiix   2 -> cs     E -> ix   R -> is
        IIE -> iiix     4 -> q      F -> lg   S -> csc
        IIR -> iiis     6 -> cj     G -> cy   T -> cc
        IIK -> iiij     7 -> ig     H -> qj   V -> ?
        HZ -> cqjc      8 -> cg     I -> i    Y -> ?
        PZ -> cqgc                  K -> ij
        DZ -> cljc                  L -> iu
        FZ -> clgc                  M -> iiiu
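
    The correspondence above can be applied mechanically.  Here is a
    minimal sketch in Python, not the script actually used for these
    notes.  It assumes that multi-character FSG codes should be matched
    longest-first, that the [4] stroke is written \q/ as in the stroke
    list, and that any unlisted symbol becomes \?/.  The names
    FSG_TO_JSA and fsg_to_jsa are made up for this illustration.

```python
# FSG -> JSA table, copied from the correspondence above.
# Assumption: the [4] stroke is written "q", as in the stroke list.
FSG_TO_JSA = {
    "IIIK": "iiiij", "IIIL": "iiiiu", "IIIR": "iiiis", "IIIE": "iiiix",
    "IIE": "iiix", "IIR": "iiis", "IIK": "iiij",
    "IE": "iix", "IR": "iis", "IK": "iij",
    "HZ": "cqjc", "PZ": "cqgc", "DZ": "cljc", "FZ": "clgc",
    "2": "cs", "4": "q", "6": "cj", "7": "ig", "8": "cg",
    "A": "ci", "C": "c", "D": "lj", "E": "ix", "F": "lg", "G": "cy",
    "H": "qj", "I": "i", "K": "ij", "L": "iu", "M": "iiiu", "N": "iiu",
    "O": "o", "P": "qg", "R": "is", "S": "csc", "T": "cc",
    "V": "?", "Y": "?",
}
MAX_CODE_LEN = max(len(code) for code in FSG_TO_JSA)


def fsg_to_jsa(word):
    """Transliterate one FSG word, trying the longest code first."""
    out = []
    pos = 0
    while pos < len(word):
        for size in range(min(MAX_CODE_LEN, len(word) - pos), 0, -1):
            strokes = FSG_TO_JSA.get(word[pos:pos + size])
            if strokes is not None:
                out.append(strokes)
                pos += size
                break
        else:
            # No code matched at this position: unknown symbol.
            out.append("?")
            pos += 1
    return "".join(out)


# The conflicting readings mentioned earlier collapse to the same strokes.
print(fsg_to_jsa("CIIR"), fsg_to_jsa("AIR"))
print(fsg_to_jsa("CM"), fsg_to_jsa("AN"))
```

    With this table, [CIIR] and [AIR] come out as the same stroke
    string, and so do [CM] and [AN], which is the point of treating
    [A] as [CI].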

    Note that the \i/ groups have one more \i/ in JSA than they have
    in Guy's encoding.  This is redundant but makes it more evident
    that <v>, <x>, <2> are homologous members of their respective series.
    Also, this encoding fixes a minor discrepancy of Guy2, which
    uses one extra \i/ in the series <ig>, <iig>, ... <iiiig>.

  Ad-hoc encodings
  ----------------

    After mapping everything to the JSA encoding, and looking at the
    digraph frequency tables, I observed that:

    The stroke `l' is always followed by either `j' or `g', hence `lj'
    and `lg' should be single letters.
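
    A claim of this kind is easy to check by tallying what follows a
    given stroke.  The sketch below is only an illustration: the
    function name followers is made up, the sample words are the JSA
    forms of [D], [F], [DZ], [FZ] rather than real corpus data, and
    "W" is assumed to stand for a word break.

```python
from collections import Counter


def followers(words, stroke):
    """Count what comes right after `stroke`; "W" marks a word break."""
    counts = Counter()
    for word in words:
        for pos, ch in enumerate(word):
            if ch == stroke:
                nxt = word[pos + 1] if pos + 1 < len(word) else "W"
                counts[nxt] += 1
    return counts


# Toy input: JSA forms of [D], [F], [DZ], [FZ], not real corpus data.
sample = ["lj", "lg", "cljc", "clgc"]
print(followers(sample, "l"))
```

    On the real JSA text, the observation above amounts to this tally
    for `l' having no keys other than `j' and `g'.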

    Note also that there are two clearly different kinds of strokes, "body" B =
    {`c',`o',`t',`i',`q',`l'} and "limb" L = {`u',`x',`y',`j',`g',`s'}.  If we
    reduce the digraph count matrix to these two classes, plus word break W, we
    get

              W     B     L
          ----- ----- -----
        W     .  6420     .
        B    59 19849 15616
        L  6361  9255     .
          ----- ----- -----

      Next-symbol probabilities (× 99):

              W     B     L
          ----- ----- -----
        W     .    99     .
        B     .    55    44
        L    40    59     .
          ----- ----- -----

      Previous-symbol probabilities (× 99):

              W     B     L
          ----- ----- -----
        W     .    18     .
        B     1    55    99
        L    98    26     .
          ----- ----- -----
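
    For reference, here is a sketch of how such class tables can be
    computed and rescaled.  It is an illustration under assumptions,
    not the original script: the function names are made up, each word
    is assumed to contribute one break before and one after it, and
    the tables are assumed to print zero (or a value that rounds to
    zero) as ".".  The counts in TABLE are copied from the first table.

```python
BODY = set("cotiql")   # class B, as listed above
LIMB = set("uxyjgs")   # class L, as listed above


def stroke_class(ch):
    """Map a JSA stroke to its class; anything else is "?"."""
    if ch in BODY:
        return "B"
    if ch in LIMB:
        return "L"
    return "?"


def class_digraphs(words):
    """Count (previous class, next class) pairs, with W at word edges."""
    counts = {}
    for word in words:
        if not word:
            continue
        classes = ["W"] + [stroke_class(ch) for ch in word] + ["W"]
        for pair in zip(classes, classes[1:]):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def scaled(counts, by_row=True, scale=99):
    """Rescale counts so that each row (or each column) sums to `scale`."""
    side = 0 if by_row else 1
    totals = {}
    for pair, n in counts.items():
        totals[pair[side]] = totals.get(pair[side], 0) + n
    return {pair: round(scale * n / totals[pair[side]])
            for pair, n in counts.items()}


# Counts copied from the first table (rows: previous, columns: next).
TABLE = {("W", "B"): 6420,
         ("B", "W"): 59, ("B", "B"): 19849, ("B", "L"): 15616,
         ("L", "W"): 6361, ("L", "B"): 9255}

next_probs = scaled(TABLE, by_row=True)    # next-symbol table
prev_probs = scaled(TABLE, by_row=False)   # previous-symbol table
```

    Normalizing by rows gives the next-symbol table and normalizing by
    columns gives the previous-symbol table.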

    Note that every word begins with a body stroke; this was expected from the
    definition of the limb strokes (they can be recognized only by their
    relationship to a previous stroke).  Note also that a limb stroke cannot be
    followed by another limb stroke; this too is not wholly unexpected.

    The surprise is that almost no words *end* in a body stroke.  The least rare
    body stroke in word-final position is `o'.  The words that end
    in body strokes appear to be errors or the result of breaking a line in
    the middle of a word.

    An interesting observation from the body/limb frequency tables above
    is that the transition probabilities from body stroke to body and
    limb are respectively about 56% and 44%.  Thus, if the limb strokes
    mark the end of a syllable (or letter?), then the average number of
    body strokes in a syllable is slightly over 2.  (Considering that we
    are counting each \i/ as a body stroke, the correct number may well
    be precisely 2.)
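
    The arithmetic behind this estimate can be spelled out.  The sketch
    below assumes that every limb stroke closes exactly one unit, so
    that the mean number of body strokes per unit is simply the ratio
    of body strokes to limb strokes (equivalently, one over the
    body-to-limb probability).  The counts are row B of the count
    table; the variable names are made up.

```python
# Row B of the class count table: what follows a body stroke.
body_to_word, body_to_body, body_to_limb = 59, 19849, 15616

body_total = body_to_word + body_to_body + body_to_limb
p_limb = body_to_limb / body_total               # P(next is limb | body)
mean_body_per_limb = body_total / body_to_limb   # same as 1 / p_limb

print(f"P(limb | body) = {p_limb:.3f}")                          # 0.440
print(f"body strokes per limb stroke = {mean_body_per_limb:.2f}")  # 2.27
```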