/* Last edited on 2013-03-24 23:54:05 by stolfilocal */
/* Probability theory for the character grouper. */

/*
For any candidate {W} with {n} elements {L[0..n-1]} we need to compute two numbers: the probability {PrT(W)} of a true word beginning with precisely those elements, and the probability {PrU(W)} of those {n} elements being considered by the algorithm, whether or not they are part of a true word. Then {PrF(W)} is {PrU(W)-PrT(W)}. The probability of {W} being the prefix of a true word is then {PrTrue(W) = PrT(W)/PrU(W)}.

Note that {PrU(W)} is less than 1, and in fact all three quantities {PrT(W),PrF(W),PrU(W)} become vanishingly small as {n} increases, and could possibly underflow for longer text lines. In practice we work with the logarithms of those quantities, whose magnitudes grow only linearly with {n}.

Computing {PrT(W)}

In our model, the probability {PrT(W)} is the easier of the two to compute, although it has a complicated formula, because it is simply the product of several independent terms. Namely

  { PrT(L[0..n-1]) = Pr0(n) \times Pr1(L[0]) \times Pr2(L[0],L[1])
      \times \prod_{k=2}^{n-1} Pr3(L[k-2],L[k-1],L[k]) }

The length factor {Pr0(n)}

In the formula of {PrT(W)}, the factor {Pr0(n)} is the probability that {n} unspecified random elements are the first {n} elements of a word. That value is the expected number of words with {n} or more letters, divided by the expected number of sets of {n} distinct elements in the input set.

Let {NW} be the expected number of true words in the input, and {avgLen} the average number of letters per true word; the expected number of true elements (and of true boxes) is therefore {NT=NW*avgLen}. For simplicity, we assume an exponential distribution of word lengths, so that there is a constant probability {pcont=1-1/avgLen} of a word continuing after any letter. Therefore the expected number of words with {n} or more letters is {NW*pcont^{n-1}}.

Let {NF} be the number of false input boxes. A false element is either one built from a false box, or one built from a true box with the wrong asc/des bits. The number of false elements is therefore {4*NF+3*NT}, and the total number of elements is {NE=4*(NF+NT)}. The number of sets of {n} distinct elements is {choose(NE,n)}. Since {n} is much less than {NE}, we can approximate it by {NE^n/n!}.

Thus we get {Pr0(n) = (NW/NE)*(pcont/NE)^{n-1}/n!}. This term can be computed incrementally by starting with {Pr0 = NW/NE} after the first letter, and multiplying by {pcont/NE/k} after appending the {k}th letter.
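For illustration, a rough C sketch of this incremental update, carried out in log space to avoid the underflow mentioned above, might look as follows. The names {log_Pr0_start}, {log_Pr0_append}, {NW}, {NE}, {pcont} are placeholders, not necessarily those used in the actual code.

  #include <math.h>

  // Log of the length factor {Pr0(1)} for a candidate with a single
  // letter, namely {log(NW/NE)}.
  double log_Pr0_start(double NW, double NE)
    { return log(NW/NE); }

  // Updates {log(Pr0)} after appending the {k}th letter ({k >= 2}),
  // i.e. multiplies {Pr0} by {pcont/NE/k}, in log space.
  double log_Pr0_append(double logPr0, double pcont, double NE, int k)
    { return logPr0 + log(pcont) - log(NE) - log((double)k); }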
The single-element factor {Pr1}

The factor {Pr1(L[0])} is the probability of the first element of a true word having the bounding box {L[0].box} and ascender/descender bits {L[0].asc,L[0].des}. In general, we assume that the value of {Pr1(L)} is independent of the position of the box {L.box} within the image. Therefore {Pr1(L)} depends only on the dimensions of {L.box} and on the asc/des bits.

We assume that the height {h} of the /body/ (between {L.ytop} and {L.ybas}) is limited to the interval {[hmin _ hmax]}, where {hmin,hmax} are parameters specified by the client. This constraint and the asc/des bits define an integer range {hBmin..hBmax} for the box height {hB = L.iymax-L.iymin+1}. We also assume that the width {wB} of the box (which is the same as the body width {w}) is restricted to the integer range {wBmin..wBmax}, where {wBmin = ceil(h*rmin)}, {wBmax = floor(h*rmax)}, and {rmin,rmax} are bounds on the /body/ aspect ratio set by the client. Note that {wBmin} and {wBmax} depend on {hB}.

The probability {Pr1(L)} is then the product of two discrete probability distributions: a uniform one for {hB}, namely {1/(hBmax-hBmin+1)}, and a humped distribution {J(wB,wBmin,wBmax)} for {wB} once {hB} is fixed.

The two-element factor {Pr2}

The factor {Pr2(L[0],L[1])} is the probability of the second element of a true word having the bounding box and ascender/descender bits of element {L[1]}, given that the previous element was {L[0]}. That condition places constraints on the position and dimensions of {L[1]} that are different from those on the first element {L[0]}.

In general, when computing {Pr2(L0,L1)}, the body height {h1} of {L1} is expected to be similar to the body height {h0} of {L0}. Specifically, {h1} is assumed to be in the range {[h1min _ h1max]}, where {h1min = max(hmin,h0-TBsmax*dx)}, {h1max = min(hmax,h0+TBsmax*dx)}, {dx} is the difference {L1.xctr-L0.xctr}, and {TBsmax} is a client parameter (the max absolute slope of the difference line TL-BL). Note that {h1min \leq h1max} since {hmin \leq h0 \leq hmax}. Note also that this range for {h1} is the range determined by {Pr1(L1)} intersected with the new range {h0 ± TBsmax*dx}. As before, from {h1min,h1max} and the asc/des bits we derive an integer range {hB1min..hB1max} for the box height {hB1 = L1.iymax-L1.iymin+1}. The height {hB1} is assumed to have a humped probability distribution {J(hB1,hB1min,hB1max)} in that range.

Likewise, for {Pr2(L0,L1)} we also assume that the midline ordinate {m1 = L1.ymid} of {L1} is similar to {m0 = L0.ymid}; that is, it is constrained to the range {[m1min _ m1max]}, where {m1min = m0-MLsmax*dx}, {m1max = m0+MLsmax*dx}, and {MLsmax} is the max absolute slope of the midline, specified by the client. From these bounds and the asc/des bits of {L1} we derive integer bounds {mB1min..mB1max} for the rounded midline {mB1 = floor(m1)} (the fractional part being 0.0 or 0.5 depending on the parity of {hB1}). Note that this range is different for each {hB1}. This range is further constrained by the algorithm's requirement that {L0} and {L1} have overlapping {y}-ranges (although this condition is almost always weaker than the bound on the slope of the midline). The rounded midline {mB1} is assumed to have distribution {J(mB1,mB1min,mB1max)} for each fixed {hB1}.

Also for {Pr2(L0,L1)}, the {x}-width {gB = L1.ixmin-L0.ixmax-1} of the gap between {L0} and {L1} is assumed to be in the range {[gmin _ gmax]}, where {gmin = max(-w0+1,hm*dxmin)}, {gmax = hm*dxmax}, and {hm = (h0+h1)/2}. Here {dxmin,dxmax} are bounds specified by the client; note that {dxmin}, and hence {gmin}, is usually negative, to allow for overlap between successive boxes. Note that the bounds {gmin,gmax} depend on the height {hB1} of {L1.box}. Specifically, {gB} is assumed to have distribution {J(gB,gBmin,gBmax)}, where {gBmin = ceil(gmin)} and {gBmax = floor(gmax)}, for each {hB1}.

Finally, for {Pr2(L0,L1)}, the width {wB1} of {L1} is assumed to be in the integer range {wB1min..wB1max}, where {wB1min = max(ceil(h1*rmin),xBove+1)}, {wB1max = floor(h1*rmax)}, and {xBove} is the amount of {x}-overlap {max(0,L0.ixmax-L1.ixmin+1)}, which is the same as {max(0,-gB)}. The clipping of {wB1min} by {xBove+1} accounts for the restriction that both {L[i].ixmin} and {L[i].ixmax} must be strictly increasing with {i}. Again, note that the range on {wB1} is the same range defined by {Pr1(L1)}, intersected with the new range {xBove+1..oo}. Note also that these bounds vary with the height {hB1} and the gap width {gB}. The parameter {wB1} is assumed to have the distribution {J(wB1,wB1min,wB1max)} for each value of {hB1} and {gB}.

??? Intersect with the {mB,hB} range limits of {Pr1(L1)} ???
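The exact shape of the humped distribution {J(v,vmin,vmax)} used in {Pr1} and {Pr2} is not pinned down in this note. The C sketch below uses one plausible choice, a discrete triangular bump over the integer range {vmin..vmax}, normalized to sum to 1, and shows how {Pr1(L)} could then be assembled from the uniform factor in {hB} and the {J} factor in {wB}. The names {J_hump} and {Pr1_sketch} are illustrative only, not taken from the actual code.

  #include <math.h>

  // One plausible humped distribution {J(v,vmin,vmax)}: a discrete
  // triangular bump on the integers {vmin..vmax}, highest near the
  // middle of the range, summing to 1 over it; zero outside the range.
  double J_hump(int v, int vmin, int vmax)
    { if ((vmax < vmin) || (v < vmin) || (v > vmax)) { return 0.0; }
      double ctr = 0.5*((double)vmin + (double)vmax);  // center of range
      double hwd = 0.5*((double)vmax - (double)vmin);  // half-width of range
      double tot = 0.0, wgt = 0.0;
      for (int u = vmin; u <= vmax; u++)
        { double w = 1.0 + hwd - fabs((double)u - ctr);  // tent weight, >= 1
          if (u == v) { wgt = w; }
          tot += w;
        }
      return wgt/tot;
    }

  // Sketch of {Pr1(L)} given the ranges derived above: uniform in {hB},
  // humped in {wB} once {hB} is fixed.
  double Pr1_sketch(int hB, int hBmin, int hBmax, int wB, int wBmin, int wBmax)
    { if ((hB < hBmin) || (hB > hBmax)) { return 0.0; }
      return (1.0/(hBmax - hBmin + 1)) * J_hump(wB, wBmin, wBmax);
    }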
The three-element factor {Pr3}

The three-box factor {Pr3(L[i-2],L[i-1],L[i])} is used on every element {L[i]} starting with the third one. It computes the probability of that element having the box {L[i].box} and asc/des bits {L[i].asc,L[i].des}, given that the two previous boxes are {L[i-2]} and {L[i-1]}. That information places even more constraints on the box {L[i]} than {Pr2(L[i-1],L[i])} or {Pr1(L[i])}. For one thing, {Pr3(L0,L1,L2)} expects the midline points of the three boxes to be approximately aligned. Namely, ??? ML and TB alignment, or TL and BL ???

Computing {PrF(W)}

The other quantity that we need in order to compute the score of a word candidate is the probability {PrF(W)} that {W} is /not/ a prefix of a true word and that {n} randomly chosen elements among the input ones are precisely the {n} elements {L[0..n-1]} of {W}. This probability does not have a factorable formula like {PrT}, because the elements of {W} may contain pieces of true words, possibly mixed with false boxes. We will therefore define {PrF(W)} incrementally.

Namely, if {W} has a single element {L}, {PrF(L)} will be the probability that {L} is a false element. That is simply the number of false elements divided by the total number of elements, namely {(4*NF+3*NT)/(4*NF+4*NT)}, or {1 - (NT/(NF+NT))/4}.

For two or more elements, suppose we have computed {PrT(W)} and {PrF(W)}, and we need {PrF(W1)} where {W1} is {W} with an extra element {L} appended. We have three exclusive cases, whose probabilities should be added to get {PrF(W1)}:

  (0) {W} is the beginning of a true word, but {L} is a false element, or a true element that is not part of that word;

  (1) {W} is not the beginning of a true word, and {L} is a false element, or a true element that is not part of the same word as {L[n-2]};

  (2) {W} is not the beginning of a true word, but {L} is a true element that is part of the same word as the last element {L0} of {W}.

In cases (0) and (1) we get {PrF(W1)} by multiplying {PrT(W)} or {PrF(W)}, respectively, by a suitable factor {A}. Case (2) is much more complicated. We will simply ignore it. By doing so we will underestimate {PrF(W1)}; but this decision will cause no harm when implemented in the algorithm. It may cause us to keep a false candidate {W1} that consists of some garbage elements followed by the beginning of a true word {W2}; but the algorithm will also consider {W2} by itself as a candidate, and therefore it will discard {W1} by the DP criterion.

The factor {A} used to get the probabilities {PrF} and {PrW}

The candidate score

The score of {W} is defined as {log(PrT(W)/PrF(W)) = log(PrT(W)) - log(PrF(W))}, which is a monotone function of {PrTrue(W)}. Moreover, since we need only the difference {log(PrT(W)) - log(PrF(W))}, we can normalize both probabilities by a common factor without changing the score.
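Since the definition of the factor {A} is still pending above, the following C sketch simply takes {log(A)} as an input; it only illustrates how the {PrF} recurrence restricted to cases (0) and (1), and the candidate score, could be evaluated in log space. The names {log_add}, {log_PrF_append}, and {candidate_score} are placeholders, not taken from the actual code.

  #include <math.h>

  // Numerically safe computation of {log(exp(la) + exp(lb))}.
  double log_add(double la, double lb)
    { double hi = (la > lb ? la : lb);
      double lo = (la > lb ? lb : la);
      return hi + log1p(exp(lo - hi));
    }

  // Log of {PrF(W1)}, where {W1} is {W} with one more element appended,
  // using only cases (0) and (1): {PrF(W1) = A*PrT(W) + A*PrF(W)}.
  // Case (2) is ignored, as discussed above.  The factor {A}, whose
  // definition is still pending, enters only through {logA}.
  double log_PrF_append(double logPrT_W, double logPrF_W, double logA)
    { return logA + log_add(logPrT_W, logPrF_W); }

  // Candidate score {log(PrT(W)/PrF(W))}, a monotone function of {PrTrue(W)}.
  double candidate_score(double logPrT_W, double logPrF_W)
    { return logPrT_W - logPrF_W; }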
For a candidate word {W} with a single element {L}, {W.PrT} must be the probability that {L} is the initial letter of a true word. A state that corresponds to a single element {L} has {Pr1}. The probability that a randomly chosen input element is the initial element of a true word, without taking into account the box geometry or the ascender/descender bits, is {NW/NE = NW/(4*NB)}, where {NB = NF+NT} is the total number of input boxes. At present we assume that all four combinations of {L.asc} and {L.des} are equally likely at the beginning of a word.

If {W} has a single element {L}, then {W.PrT} is the probability that {L} is the first letter of a true word. That is basically {NT/(NT+NF)} times the probability
*/