/* Last edited on 2013-03-24 23:54:05 by stolfilocal */
/* Probability theory for the character grouper. */

/*
For any candidate {W} with {n} elements {L[0..n-1]} we need to compute two numbers: the probability {PrT(W)} of a true word beginning with precisely those elements, and the probability {PrU(W)} of those {n} elements being considered by the algorithm, whether or not they are part of a true word. Then {PrF(W)} is {PrU(W)-PrT(W)}. The probability of {W} being the prefix of a true word is then {PrTrue(W) = PrT(W)/PrU(W)}.

Note that {PrU(W)} is less than 1, and in fact all three quantities {PrT(W),PrF(W),PrU(W)} become vanishingly small as {n} increases, and could possibly underflow for longer text lines. In practice we work with the logarithms of those quantities, whose magnitudes grow only linearly with {n}.

Computing {PrT(W)}

In our model, the probability {PrT(W)} is the easier of the two to compute, although it has a complicated formula, because it is simply the product of several independent terms. Namely

  { PrT(L[0..n-1]) = Pr0(n) \times Pr1(L[0]) \times Pr2(L[0],L[1])
      \times \prod_{k=2}^{n-1} Pr3(L[k-2],L[k-1],L[k]) }

The length factor {Pr0(n)}

In the formula of {PrT(W)}, the factor {Pr0(n)} is the probability that {n} unspecified random elements are the first {n} elements of a word. That value is the expected number of words with {n} or more letters, divided by the expected number of sets of {n} distinct elements in the input set.

Let {NW} be the expected number of true words in the input, and {avgLen} the average number of letters per true word; the expected number of true elements (and of true boxes) is therefore {NT=NW*avgLen}. For simplicity, we assume an exponential distribution of word lengths, so that there is a constant probability {pcont=1-1/avgLen} of a word continuing after any letter. Therefore the expected number of words with {n} or more letters is {NW*pcont^{n-1}}.

Let {NF} be the number of false input boxes. A false element is either one built from a false box, or one built from a true box with the wrong asc/des bits. The number of false elements is therefore {4*NF+3*NT}, and the total number of elements is {NE=4*(NF+NT)}. The number of sets of {n} distinct elements is {choose(NE,n)}. Since {n} is much less than {NE}, we can approximate it by {NE^n/n!}.

Thus we get {Pr0(n) = (NW/NE)*(pcont/NE)^{n-1}/n!}. This term can be computed incrementally by starting with {Pr0 = NW/NE} after the first letter, and multiplying by {pcont/NE/k} after appending the {k}th letter.
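For illustration, a rough C sketch of this incremental update, carried out in log space to avoid the underflow mentioned above, might look as follows. The names {log_Pr0_start}, {log_Pr0_append}, {NW}, {NE}, {pcont} are placeholders, not necessarily those used in the actual code.

  #include <math.h>

  // Log of the length factor {Pr0(1)} for a candidate with a single
  // letter, namely {log(NW/NE)}.
  double log_Pr0_start(double NW, double NE)
    { return log(NW/NE); }

  // Updates {log(Pr0)} after appending the {k}th letter ({k >= 2}),
  // i.e. multiplies {Pr0} by {pcont/NE/k}, in log space.
  double log_Pr0_append(double logPr0, double pcont, double NE, int k)
    { return logPr0 + log(pcont) - log(NE) - log((double)k); }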
The single-element factor {Pr1}

The factor {Pr1(L[0])} is the probability of the first element of a true word having the bounding box {L[0].box} and ascender/descender bits {L[0].asc,L[0].des}. In general, we assume that the value of {Pr1(L)} is independent of the position of the box {L.box} within the image. Therefore {Pr1(L)} depends only on the dimensions of {L.box} and on the asc/des bits.

We assume that the height {h} of the /body/ (between {L.ytop} and {L.ybas}) is limited to the interval {[hmin _ hmax]}, where {hmin,hmax} are parameters specified by the client. This constraint and the asc/des bits define an integer range {hBmin..hBmax} for the box height {hB = L.iymax-L.iymin+1}. We also assume that the width {wB} of the box (which is the same as the body width {w}) is restricted to the integer range {wBmin..wBmax}, where {wBmin = ceil(h*rmin)}, {wBmax = floor(h*rmax)}, and {rmin,rmax} are bounds on the /body/ aspect ratio set by the client. Note that {wBmin} and {wBmax} depend on {hB}.

The probability {Pr1(L)} is then the product of two discrete probability distributions: a uniform one for {hB}, namely {1/(hBmax-hBmin+1)}, and a humped distribution {J(wB,wBmin,wBmax)} for {wB} once {hB} is fixed.

The two-element factor {Pr2}

The factor {Pr2(L[0],L[1])} is the probability of the second element of a true word having the bounding box and ascender/descender bits of element {L[1]}, given that the previous element was {L[0]}. That condition places constraints on the position and dimensions of {L[1]} that are different from those on the first element {L[0]}.

In general, when computing {Pr2(L0,L1)}, the body height {h1} of {L1} is expected to be similar to the body height {h0} of {L0}. Specifically, {h1} is assumed to be in the range {[h1min _ h1max]}, where {h1min = max(hmin,h0-TBsmax*dx)}, {h1max = min(hmax,h0+TBsmax*dx)}, {dx} is the difference {L1.xctr-L0.xctr}, and {TBsmax} is a client parameter (the max absolute slope of the difference line TL-BL). Note that {h1min \leq h1max} since {hmin \leq h0 \leq hmax}. Note also that this range for {h1} is the range determined by {Pr1(L1)} intersected with the new range {h0 ± TBsmax*dx}. As before, from {h1min,h1max} and the asc/des bits we derive an integer range {hB1min..hB1max} for the box height {hB1 = L1.iymax-L1.iymin+1}. The height {hB1} is assumed to have a humped probability distribution {J(hB1,hB1min,hB1max)} in that range.

Likewise, for {Pr2(L0,L1)} we also assume that the midline ordinate {m1 = L1.ymid} of {L1} is similar to {m0 = L0.ymid}; that is, it is constrained to the range {[m1min _ m1max]}, where {m1min = m0-MLsmax*dx}, {m1max = m0+MLsmax*dx}, and {MLsmax} is the max absolute slope of the midline, specified by the client. From these bounds and the asc/des bits of {L1} we derive integer bounds {mB1min..mB1max} for the rounded midline {mB1 = floor(m1)} (the fractional part being 0.0 or 0.5 depending on the parity of {hB1}). Note that this range is different for each {hB1}. This range is further constrained by the algorithm's requirement that {L0} and {L1} have overlapping {y}-ranges (although this condition is almost always weaker than the bound on the slope of the midline). The rounded midline {mB1} is assumed to have distribution {J(mB1,mB1min,mB1max)} for each fixed {hB1}.

Also for {Pr2(L0,L1)}, the {x}-width {gB = L1.ixmin-L0.ixmax-1} of the gap between {L0} and {L1} is assumed to be in the range {[gmin _ gmax]}, where {gmin = max(-w0+1,hm*dxmin)}, {gmax = hm*dxmax}, and {hm = (h0+h1)/2}. Here {dxmin,dxmax} are bounds specified by the client; note that {dxmin}, and hence {gmin}, is usually negative, to allow for overlap between successive boxes. Note that the bounds {gmin,gmax} depend on the height {hB1} of {L1.box}. Specifically, {gB} is assumed to have distribution {J(gB,gBmin,gBmax)}, where {gBmin = ceil(gmin)} and {gBmax = floor(gmax)}, for each {hB1}.

Finally, for {Pr2(L0,L1)}, the width {wB1} of {L1} is assumed to be in the integer range {wB1min..wB1max}, where {wB1min = max(ceil(h1*rmin),xBove+1)}, {wB1max = floor(h1*rmax)}, and {xBove} is the amount of {x}-overlap {max(0,L0.ixmax-L1.ixmin+1)}, which is the same as {max(0,-gB)}. The clipping of {wB1min} by {xBove+1} accounts for the restriction that both {L[i].ixmin} and {L[i].ixmax} must be strictly increasing with {i}. Again, note that the range on {wB1} is the same range defined by {Pr1(L1)}, intersected with the new range {xBove+1..oo}. Note also that these bounds vary with the height {hB1} and the gap width {gB}. The parameter {wB1} is assumed to have the distribution {J(wB1,wB1min,wB1max)} for each value of {hB1} and {gB}.

??? Intersect with the {mB,hB} range limits of {Pr1(L1)} ???
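The exact shape of the humped distribution {J(v,vmin,vmax)} used in {Pr1} and {Pr2} is not pinned down in this note. The C sketch below uses one plausible choice, a discrete triangular bump over the integer range {vmin..vmax}, normalized to sum to 1, and shows how {Pr1(L)} could then be assembled from the uniform factor in {hB} and the {J} factor in {wB}. The names {J_hump} and {Pr1_sketch} are illustrative only, not taken from the actual code.

  #include <math.h>

  // One plausible humped distribution {J(v,vmin,vmax)}: a discrete
  // triangular bump on the integers {vmin..vmax}, highest near the
  // middle of the range, summing to 1 over it; zero outside the range.
  double J_hump(int v, int vmin, int vmax)
    { if ((vmax < vmin) || (v < vmin) || (v > vmax)) { return 0.0; }
      double ctr = 0.5*((double)vmin + (double)vmax);  // center of range
      double hwd = 0.5*((double)vmax - (double)vmin);  // half-width of range
      double tot = 0.0, wgt = 0.0;
      for (int u = vmin; u <= vmax; u++)
        { double w = 1.0 + hwd - fabs((double)u - ctr);  // tent weight, >= 1
          if (u == v) { wgt = w; }
          tot += w;
        }
      return wgt/tot;
    }

  // Sketch of {Pr1(L)} given the ranges derived above: uniform in {hB},
  // humped in {wB} once {hB} is fixed.
  double Pr1_sketch(int hB, int hBmin, int hBmax, int wB, int wBmin, int wBmax)
    { if ((hB < hBmin) || (hB > hBmax)) { return 0.0; }
      return (1.0/(hBmax - hBmin + 1)) * J_hump(wB, wBmin, wBmax);
    }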
The three-element factor {Pr3}

The three-box factor {Pr3(L[i-2],L[i-1],L[i])} is used on every element {L[i]} starting with the third one. It computes the probability of that element having the box {L[i].box} and asc/des bits {L[i].asc,L[i].des}, given that the two previous boxes are {L[i-2]} and {L[i-1]}. That information places even more constraints on the box {L[i]} than {Pr2(L[i-1],L[i])} or {Pr1(L[i])}. For one thing, {Pr3(L0,L1,L2)} expects the midline points of the three boxes to be approximately aligned. Namely, ??? ML and TB alignment, or TL and BL ???

Computing {PrF(W)}

The other quantity that we need in order to compute the score of a word candidate is the probability {PrF(W)} that {W} is /not/ a prefix of a true word and that {n} randomly chosen elements among the input ones are precisely the {n} elements {L[0..n-1]} of {W}. This probability does not have a factorable formula like {PrT}, because the elements of {W} may contain pieces of true words, possibly mixed with false boxes. We will therefore define {PrF(W)} incrementally.

Namely, if {W} has a single element {L}, {PrF(L)} will be the probability that {L} is a false element. That is simply the number of false elements divided by the total number of elements, namely {(4*NF+3*NT)/(4*NF+4*NT)}, or {1 - (NT/(NF+NT))/4}.

For two or more elements, suppose we have computed {PrT(W)} and {PrF(W)}, and we need {PrF(W1)} where {W1} is {W} with an extra element {L} appended. We have three exclusive cases, whose probabilities should be added to get {PrF(W1)}:

  (0) {W} is the beginning of a true word, but {L} is a false element, or a true element that is not part of that word;

  (1) {W} is not the beginning of a true word, and {L} is a false element, or a true element that is not part of the same word as {L[n-2]};

  (2) {W} is not the beginning of a true word, but {L} is a true element that is part of the same word as the last element {L0} of {W}.

In cases (0) and (1) we get {PrF(W1)} by multiplying {PrT(W)} or {PrF(W)}, respectively, by a suitable factor {A}. Case (2) is much more complicated. We will simply ignore it. By doing so we will underestimate {PrF(W1)}; but this decision will cause no harm when implemented in the algorithm. It may cause us to keep a false candidate {W1} that consists of some garbage elements followed by the beginning of a true word {W2}; but the algorithm will also consider {W2} by itself as a candidate, and therefore it will discard {W1} by the DP criterion.

The factor {A} used to get the probabilities {PrF} and {PrW}

The candidate score

The score of {W} is defined as {log(PrT(W)/PrF(W)) = log(PrT(W)) - log(PrF(W))}, which is a monotone function of {PrTrue(W)}. Moreover, since we need only the difference {log(PrT(W)) - log(PrF(W))}, we can normalize both probabilities by a common factor without changing the score.
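Since the definition of the factor {A} is still pending above, the following C sketch simply takes {log(A)} as an input; it only illustrates how the {PrF} recurrence restricted to cases (0) and (1), and the candidate score, could be evaluated in log space. The names {log_add}, {log_PrF_append}, and {candidate_score} are placeholders, not taken from the actual code.

  #include <math.h>

  // Numerically safe computation of {log(exp(la) + exp(lb))}.
  double log_add(double la, double lb)
    { double hi = (la > lb ? la : lb);
      double lo = (la > lb ? lb : la);
      return hi + log1p(exp(lo - hi));
    }

  // Log of {PrF(W1)}, where {W1} is {W} with one more element appended,
  // using only cases (0) and (1): {PrF(W1) = A*PrT(W) + A*PrF(W)}.
  // Case (2) is ignored, as discussed above.  The factor {A}, whose
  // definition is still pending, enters only through {logA}.
  double log_PrF_append(double logPrT_W, double logPrF_W, double logA)
    { return logA + log_add(logPrT_W, logPrF_W); }

  // Candidate score {log(PrT(W)/PrF(W))}, a monotone function of {PrTrue(W)}.
  double candidate_score(double logPrT_W, double logPrF_W)
    { return logPrT_W - logPrF_W; }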
For a candidate word {W} with a single element {L}, {W.PrT} must be the probability that {L} is the initial letter of a true word. A state that corresponds to a single element {L} has {Pr1}. The probability that a randomly chosen input element is the initial element of a true word, without taking into account the box geometry or the ascender/descender bits, is {NW/NE = NW/(4*NB)}, where {NB = NF+NT} is the total number of input boxes. At present we assume that all four combinations of {L.asc} and {L.des} are equally likely at the beginning of a word.

If {W} has a single element {L}, then {W.PrT} is the probability that {L} is the first letter of a true word. That is basically {NT/(NT+NF)} times the probability
*/