Natural Period of Text
Stroked Text
In most writing systems, the /glyphs/ (characters, symbols, and
punctuation marks) consist of strokes with moderately fixed relative
lengths, shapes, directions, and positions. On the other hand, the same
character can be rendered in a variety of fonts, sizes, weights, colors,
and textures, and can be rotated or distorted by stretching or shearing;
yet, surprisingly, the glyphs remain easily recognizable by human readers
at first sight, without the need for any previous "training". While
knowledge of the script or language helps to resolve ambiguity in
borderline cases, the glyph recognition itself seems to be
language-independent: readers can often tell whether two
never-before-seen glyphs (or two characters in a foreign script) are
the same, even when they have been rendered in different ways.
This is not as surprising when one considers that major forces in the
evolution of established scripts were the need to (a) maintain
legibility under a variety of reproduction methods and viewing angles
and distances, and (b) pack as many glyphs as possible in the smallest
amount of space. As a result of these forces, /the essential information
content of a glyph --- that which allows it to be recognized as such and
distinguished from other glyphs --- is all packed in the lowest frequency
band of its Fourier spectrum/. That band is preserved by most
distortions, including weight, font, and decoration changes. Most
importantly, as the size is reduced, it is the last band to be lost.
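To make the claim concrete, the essential band can be isolated
numerically. Here is a rough Python/NumPy sketch that keeps only the
Fourier components below a chosen cutoff frequency; the image array and
the cutoff value are illustrative assumptions, not part of any
particular system:

    import numpy as np

    def low_band(img, cutoff):
        """Keep only the Fourier components of a grayscale glyph image
        whose spatial frequency is at most `cutoff` cycles per pixel."""
        F = np.fft.fft2(img)
        fy = np.fft.fftfreq(img.shape[0])[:, None]   # vertical frequencies
        fx = np.fft.fftfreq(img.shape[1])[None, :]   # horizontal frequencies
        keep = np.hypot(fx, fy) <= cutoff            # circular low-pass mask
        return np.real(np.fft.ifft2(F * keep))

    # Example (hypothetical): a glyph rendered at 64 x 64 pixels should
    # remain recognizable when only the band below ~8 cycles per glyph
    # is kept:
    #   essential = low_band(glyph, 8 / 64)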
For each script, this /essential frequency band/ can be determined from
the minimum image scale at which the script remains readable. By the
Nyquist criterion, a horizontal or vertical frequency component (wave)
is retained only if its wavelength {\lambda} is strictly larger than
twice the pixel spacing {\delta}. Thus each script has a /fundamental
scale/, the minimum glyph width and height (in pixels) that allows the
glyphs to be recognized and read. Theory predicts, and experiments
confirm, that this is a fairly sharp threshold. Namely, if the text is
reduced slightly below its natural scale, essential information is lost,
and the text is mostly unreadable. If the imaging scale is slightly
larger than its natural scale, all the essential information is present
in the image, and so the text is perfectly readable; and the readability
does not improve significantly at even larger image resolutions.
Actually, good imaging devices always use a Gaussian-like anti-aliasing
kernel, which strongly attenuates those Fourier components that are
close to the Nyquist limit. Therefore, at the fundamental scale, the
minimum wavelength {\lambda_\min} for any essential vertical and
horizontal Fourier component of the script seems to be around {2.5 \delta}
rather than {2\delta}. (For diagonal waves the Nyquist wavelength
limit is {\sqrt{2}\,\delta} rather than {2\delta}, so the minimum
essential wavelength is around {1.8\delta} instead of {2.5\delta}.)
Natural scale and stroke spacing
Two parallel strokes can be recognized as such as long as the distance
between their midlines is above the effective Nyquist limit; in
practice, larger than {\lambda_\min}. Therefore, the fundamental scale
is the one at which the distance {\lambda^\ast} between the two closest
strokes in the script (whether in the same glyph or in adjacent glyphs)
is equal to the minimum visible wavelength {\lambda_\min}.
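In other words, at the fundamental scale the pixel spacing is
{\delta = \lambda^\ast/2.5}. A minimal Python sketch of that relation
(the units are whatever the font metrics are expressed in; the example
figures are made up for illustration):

    def fundamental_scale(lambda_star, glyph_w, glyph_h, factor=2.5):
        """Minimum glyph width and height, in pixels, for legibility,
        given the closest stroke-axis spacing lambda_star and the glyph
        box dimensions, all in the same (arbitrary) font units."""
        delta_max = lambda_star / factor   # largest tolerable pixel spacing
        return glyph_w / delta_max, glyph_h / delta_max

    # Example (made-up numbers): closest stroke axes 0.2 em apart, glyph
    # box 0.6 em by 0.7 em -> about 7.5 by 8.75 pixels at the fundamental
    # scale:
    #   print(fundamental_scale(0.2, 0.6, 0.7))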
A consequence of this phenomenon is that in mature scripts and fonts the
Fourier components with this /critical wavelength/ {\lambda^\ast} define
a /natural grid/ whose cells are spaced {\lambda^\ast/2} apart, such
that all glyph strokes and dots are ideally synchronized to this grid
--- at least within the scope of a single glyph, and often over several
adjacent glyphs. See figure [BIKINI]. When this "synchronism" constraint
is violated --- that is, when the axes of parallel strokes are not
separated by an integer multiple of {\lambda_\min/2} --- both the
readability and the aesthetics seem to suffer. If the violation occurs
inside a glyph, the glyph looks ugly; if it occurs between adjacent
glyphs, their spacing seems wrong. For example, the "M" in figure [MEN]
is visibly compressed horizontally in comparison with the other letters,
while the "N" looks a bit too fat; and in figure [WOMEN], the "W" looks
okay but both the "M" and the "N" are compressed. More importantly, when
the text is reproduced at scales approaching the fundamental one,
the out-of-sync strokes become blobs at unexpected places, and
legibility is impaired.
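One way to quantify this "synchronism" is to measure how far the stroke
axes of a glyph fall from the best-fitting natural grid. A rough
Python/NumPy sketch (the stroke positions and units in the example are
hypothetical inputs):

    import numpy as np

    def grid_sync_error(stroke_axes, lambda_star):
        """Worst offset of any stroke axis from the best-fitting grid
        with cell size lambda_star/2, as a fraction of that cell size."""
        cell = lambda_star / 2.0
        phases = np.asarray(stroke_axes, float) / cell * 2 * np.pi
        best = np.angle(np.mean(np.exp(1j * phases)))   # best grid phase
        err = np.angle(np.exp(1j * (phases - best)))    # wrapped residuals
        return np.max(np.abs(err)) / (2 * np.pi)

    # Example: vertical stroke axes at 0, 3.1, 6.0 and 9.2 units with
    # lambda_star = 6 are nearly in sync (error is a few percent of a cell):
    #   print(grid_sync_error([0, 3.1, 6.0, 9.2], 6.0))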
Thus, at the fundamental scale, each glyph can be modeled as a
characteristic /fundamental template/, a small regular grid where each cell --- a /glyphel/ ---
is {\lambda_\min/2} wide, and is filled with a gray value. If a stroke
or dot in an actual glyph is displaced by less than {\lambda_\min/2}
from its ideal position, as implied by the fundamental bitmap, the
distortion usually results in a small change in the relative values of
the corresponding glyphels, which seems to be perceived as an unimportant defect or stylistic
variation. This is true not only of low-resolution digital fonts, like
[ZOMBIES_AHEAD], but also of high-resolution or pre-computer fonts, like
[BIKINI] and [alau]. Indeed, it is this near-universal feature of
established scripts that made low-resolution fonts viable in the first place.
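The fundamental template of a glyph can be approximated directly from a
high-resolution rendering by averaging the gray values inside each
glyphel. A minimal Python/NumPy sketch, assuming the glyphel grid size
is known (e.g. 3 wide by 5 tall for Western digits, as discussed below):

    import numpy as np

    def fundamental_template(glyph, n_cols, n_rows):
        """Average a 2D grayscale glyph image over a grid of glyphels,
        returning the template as an n_rows x n_cols array."""
        h, w = glyph.shape
        rows = np.linspace(0, h, n_rows + 1).astype(int)   # glyphel row edges
        cols = np.linspace(0, w, n_cols + 1).astype(int)   # glyphel column edges
        out = np.empty((n_rows, n_cols))
        for i in range(n_rows):
            for j in range(n_cols):
                out[i, j] = glyph[rows[i]:rows[i+1], cols[j]:cols[j+1]].mean()
        return out

    # Example (hypothetical): reduce a high-resolution digit image to its
    # 3-wide by 5-tall fundamental template:
    #   template = fundamental_template(digit_image, 3, 5)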
Note that the orientation of the strokes and the spacing between their
axes are almost the only significant information that is discernible
when the script is rendered at the fundamental scale. Most other font
variations --- including small-radius curves, thicker or thinner
strokes, outlines and shadows, textures, etc. --- affect only the
Fourier components with shorter wavelengths, which are outside the
essential band of the spectrum, or the Fourier component with frequency
zero, which determines the overall lightness or darkness of the glyph.
The latter attribute is generally perceived as non-essential for glyph
recognition (although it is often used to convey emphasis, e.g. when
text is rendered in boldface or in a different color).
On the other hand, some variations in glyph shape affect the components
within the essential band, and must therefore be viewed as glyph
variants rather than mere variations in font style. Some well-known
examples are the two very distinct lowercase forms of the letter "a",
and of the letter "g". Another, more subtle, example is the distinction
between the compressed forms of certain letters like "U" used in some
uppercase-only proportional fonts (where the two arms may be separated
by only one blank glyphel), and the more open forms used in fixed-width
or upper/lower fonts (where the upper arms may be separated by three
blank glyphels). In an uppercase-only, letter-only font, the fundamental
bitmap can be as small as three by three glyphels. As long as the glyph's
image respects the effective Nyquist limit --- that is, each glyphel is
at least {\lambda_\min/2 \approx 1.25} image pixels wide --- the text
remains legible, as
the [UNLIMITED] example shows. For Western digits, the fundamental
bitmap is 3 by 5 glyphels, as in the [1_800_438_2422] example.
In a specific image of a text, the critical wavelength {\lambda_\min}
(and hence the glyphel size {\lambda_\min/2}) in a particular direction can
usually be determined by looking for the closest parallel strokes that
need to be distinguished for reading, such as between "I" and "N" in the
sequence "IN". See the figure 004-000 below. For curved strokes, such as
between "N" and "O" in "NO" in some fonts, one must consider a mean
axis.
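A rough way to automate that measurement: locate the axes of the
(near-)vertical strokes as local minima of the column-wise mean
intensity of the text-line image, and take the smallest gap between
adjacent axes. A Python/NumPy sketch, assuming dark ink on a light
background and roughly upright strokes:

    import numpy as np

    def min_stroke_spacing(line_img):
        """Smallest horizontal distance, in pixels, between the axes of
        adjacent vertical strokes in a grayscale text-line image."""
        profile = line_img.mean(axis=0)           # mean intensity per column
        interior = profile[1:-1]
        is_min = (interior < profile[:-2]) & (interior <= profile[2:])
        dark = interior < profile.mean()          # skip minima in white space
        axes = np.flatnonzero(is_min & dark) + 1  # columns of stroke axes
        return np.diff(axes).min() if len(axes) > 1 else None

    # Example: applied to an image of the word "IN", this should return
    # (roughly) the critical wavelength in pixels:
    #   print(min_stroke_spacing(image_of_IN))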
However, some texts have anomalous extra space between characters,
as in figures 029-000 and 041-001. When the intercharacter space is
reduced below 1
The font's fundamental grid must not be confused with the image's pixel
grid. The fundamental grid may be orthogonal, or (as in italic scripts)
its axes may be slanted. (It is interesting that italic fonts were
introduced by printers in the Renaissance because they used less space
for the same readability. This change made the fundamental font grid
more similar to a hexagonal grid, which seems to provide slightly
higher encoding efficiency for typical Roman character shapes.) (It
seems likely that this synchronization makes the glyphs easier to
recognize at scales just above the fundamental one.)
Glyphels are in principle square, a consequence of the fact that the
spatial resolution of the eye is about the same in both main directions.
However, even upright text that is imaged face-on may be vertically
stretched or compressed, which means that the vertical critical
wavelength differs from the horizontal one and the glyphels are not
square in the image.
If the viewing direction is not perpendicular to the text plane, then
the glyphels are trapezoids, and the mapping from the glyphels of a
character to the image pixels is a two-dimensional projective
transformation. If the text surface is curved, the correspondence may be
even more complicated. However, for each individual character, this
mapping can usually be well approximated by a single affine transform
(a 2 by 2 linear transformation plus a translation), so that the
projected glyphels are approximately parallelograms. In that case, the
Nyquist criterion for legibility applies to the smallest singular value
of the 2 by 2 matrix, that is, to the minimum width of those
parallelograms.
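A minimal sketch of that legibility test in Python/NumPy. Here A is the
2 by 2 linear part of the glyphel-to-pixel map; the 1.5-pixel threshold
follows the figure used below in the "Implications" section and is an
assumption, not a fixed constant:

    import numpy as np

    def narrowest_glyphel(A):
        """Smallest singular value of the 2x2 glyphel-to-pixel matrix A,
        i.e. (approximately) the narrowest projected glyphel width."""
        return np.linalg.svd(np.asarray(A, float), compute_uv=False)[-1]

    def legible(A, min_width=1.5):
        return narrowest_glyphel(A) >= min_width

    # Example: a view that renders glyphels 3 pixels wide in one
    # direction but only 1.5 pixels wide in the other is right at the
    # threshold:
    #   print(legible([[3.0, 0.0], [0.0, 1.5]]))   # True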
Glyph segmentation
Another force was the need for scripts to be /self-synchronizing/;
that is, the accidental loss or obliteration of part of the text should
not impede the correct segmentation into glyphs of the part that remains
visible. In many printed scripts (such as the Roman, Cyrillic, Greek,
Ethiopic, and Chinese), this is usually achieved by making each glyph
a connected (or at least compact) set of strokes, with a thin gap
between adjacent glyphs. Some other scripts (such as Arabic or
Devanagari) use a similar principle, except for the presence of a
"baseline" stroke connecting the glyphs of a word. In handwritten or
calligraphic scripts, the segmentation is more complicated. In any case,
however, the information needed for self-synchronization too is present
in the essential band.
Implications for text detection and character recognition
It follows from the above that /character recognition by software should
be easier when the image resolution is just above the effective Nyquist
limit/; that is, when each glyphel is as small as possible, but at
least 1.5 pixels wide in its narrowest direction. (Slightly different
values may apply for rotated or sheared text, but let's ignore that
for the moment.)
In particular, in a multi-scale approach, with a pyramid whose levels
are related by a scale factor {\alpha}, this criterion means that at
each scale one should search only for characters whose glyphels are
between 1.5 and {1.5\alpha} pixels wide in their smallest dimension. For
example, suppose that {\alpha} is 2, and one wants to recognize upright
digits which may be stretched vertically by at most {\sigma} or
horizontally by at most {\gamma} relative to the aspect ratio of their
fundamental template, which is only 3 by 5 glyphels. At each level, one
should consider only candidate regions which snugly fit in an array of
3 by 5 rectangles whose smallest side is between 1.5 pixels and
{1.5\alpha = 3} pixels, and whose longest side is bounded by the above
constraints.
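A minimal Python sketch of that size test, for the 3 by 5 digit template
(the stretching factors {\sigma} and {\gamma} are left out here; they
would simply widen or narrow the bounds on the longer side):

    def digit_candidate_sizes(alpha=2.0, cols=3, rows=5, g_min=1.5):
        """Admissible candidate widths and heights, in pixels, for
        upright digits at one pyramid level, ignoring extra stretching."""
        g_max = g_min * alpha            # largest glyphel width at this level
        return (cols * g_min, cols * g_max), (rows * g_min, rows * g_max)

    # Example: with alpha = 2, candidates at each level should be between
    # 4.5 and 9 pixels wide and between 7.5 and 15 pixels tall (measured
    # at that level's resolution):
    #   print(digit_candidate_sizes())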
These considerations also imply that /character recognition should be
done on continuous tone images, rather than on bilevel images/. When a
continuous digital image is converted to a bilevel one, by any
thresholding or segmentation method, one loses lots of valuable
information (roughly four to seven bits from each pixel). More
significantly, the quantization inevitably causes aliasing between
Fourier components of different wavelengths, so that the information
contained in the essential components may be lost or distorted.
This is evident when one tries to segment small characters that
are just above the legibility threshold, such as [UNLIMITED] or
[ATM_INSIDE].
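The effect can be checked numerically by comparing the fundamental
template computed from the continuous-tone glyph with the one computed
from a thresholded copy. A rough sketch, reusing the
fundamental_template helper sketched earlier (the 0.5 threshold is an
arbitrary choice):

    import numpy as np

    def template_loss(glyph, n_cols, n_rows, thr=0.5):
        """RMS difference between the gray-level and the bilevel templates."""
        gray = fundamental_template(glyph, n_cols, n_rows)
        bilevel = fundamental_template((glyph > thr).astype(float),
                                       n_cols, n_rows)
        return np.sqrt(np.mean((gray - bilevel) ** 2))

    # For glyphs near the legibility threshold this difference can be
    # large enough to shift apparent strokes into the wrong glyphels.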
While one may alleviate these problems by magnifying the image with
smooth interpolation before thresholding, one would need at least 4 by 4
oversampling to preserve the leading 4 bits of the pixel values that are
discarded by the quantization. Then the windows used by the recognition
algorithms would have to be 4 times wider, leading to a factor of at
least 16 in processing time.
Moreover, /character recognition must use proper filtering/ and other
continuous-signal processing tools. This is particularly important
for digital photographs and videos that have been stored
in JPEG or MPEG format, and/or processed with a "sharpening" filter.
Indeed, in our experience, in those images a bit of Gaussian blurring
makes small text *easier* to read.
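A minimal sketch of that pre-filtering step in Python, using SciPy's
Gaussian filter; the sigma value is a guess and would in practice be
tuned to a fraction of the glyphel size:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def presmooth(img, sigma=0.7):
        """Mild Gaussian blur to attenuate JPEG ringing and sharpening
        halos near the Nyquist limit, before detection/recognition."""
        return gaussian_filter(np.asarray(img, float), sigma)

    # smoothed = presmooth(frame)   # then run text detection on `smoothed`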