Natural Period of Text
Stroked Text
In most writing systems, the /glyphs/ (characters, symbols, and
punctuation marks) consist of strokes with moderately fixed relative
lengths, shapes, directions, and positions. On the other hand, the same
character can be rendered in a variety of fonts, sizes, weights, colors,
and textures, and can be rotated or distorted by stretching or shearing;
yet, surprisingly, the glyphs remain easily recognizable by human readers
at first sight, without the need for any previous "training". While
knowledge of the script or language helps to resolve ambiguity in
borderline cases, the glyph recognition itself seems to be
language-independent: readers can often tell whether two
never-before-seen glyphs (or two characters in a foreign script) are
the same, even when they have been rendered in different ways.
This is not as surprising when one considers that major forces in the
evolution of established scripts were the need to (a) maintain
legibility under a variety of reproduction methods and viewing angles
and distances, and (b) pack as many glyphs as possible in the smallest
amount of space. As a result of these forces, /the essential information
content of a glyph --- that which allows it to be recognized as such and
distinguished from other glyphs --- is all packed in the lowest frequency
band of its Fourier spectrum/. That band is preserved by most
distortions, including weight, font, and decoration changes. Most
importantly, as the size is reduced, it is the last band to be lost.
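To make the claim concrete, the essential band can be isolated
numerically. Here is a rough Python/NumPy sketch that keeps only the
Fourier components below a chosen cutoff frequency; the image array and
the cutoff value are illustrative assumptions, not part of any
particular system:

    import numpy as np

    def low_band(img, cutoff):
        """Keep only the Fourier components of a grayscale glyph image
        whose spatial frequency is at most `cutoff` cycles per pixel."""
        F = np.fft.fft2(img)
        fy = np.fft.fftfreq(img.shape[0])[:, None]   # vertical frequencies
        fx = np.fft.fftfreq(img.shape[1])[None, :]   # horizontal frequencies
        keep = np.hypot(fx, fy) <= cutoff            # circular low-pass mask
        return np.real(np.fft.ifft2(F * keep))

    # Example (hypothetical): a glyph rendered at 64 x 64 pixels should
    # remain recognizable when only the band below ~8 cycles per glyph
    # is kept:
    #   essential = low_band(glyph, 8 / 64)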
For each script, this /essential frequency band/ can be determined from
the minimum image scale at which the script remains readable. By the
Nyquist criterion, a horizontal or vertical frequency component (wave)
is retained only if its wavelength {\lambda} is strictly larger than
twice the pixel spacing {\delta}. Thus each script has a /fundamental
scale/, the minimum glyph width and height (in pixels) that allows the
glyphs to be recognized and read. Theory predicts, and experiments
confirm, that this is a fairly sharp threshold. Namely, if the text is
reduced slightly below its natural scale, essential information is lost,
and the text is mostly unreadable. If the imaging scale is slightly
larger than its natural scale, all the essential information is present
in the image, and so the text is perfectly readable; and the readability
does not improve significantly at even larger image resolutions.
Actually, good imaging devices always use a Gaussian-like anti-aliasing
kernel, which strongly attenuates those Fourier components that are
close to the Nyquist limit. Therefore, at the fundamental scale, the
minimum wavelength {\lambda_\min} for any essential vertical and
horizontal Fourier component of the script seems to be around {2.5 \delta}
rather than {2\delta}. (For diagonal waves the Nyquist wavelength
limit is {\sqrt{2}\,\delta} rather than {2\delta}, so the minimum
essential wavelength is around {1.8\delta} instead of {2.5\delta}.)
Natural scale and stroke spacing
Two parallel strokes can be recognized as such as long as the distance
between their midlines is above the effective Nyquist limit; in
practice, larger than {\lambda_\min}. Therefore, the fundamental scale
is the one at which the distance {\lambda^\ast} between the two closest
strokes in the script (whether in the same glyph or in adjacent glyphs)
is equal to the minimum visible wavelength {\lambda_\min}.
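In other words, at the fundamental scale the pixel spacing is
{\delta = \lambda^\ast/2.5}. A minimal Python sketch of that relation
(the units are whatever the font metrics are expressed in; the example
figures are made up for illustration):

    def fundamental_scale(lambda_star, glyph_w, glyph_h, factor=2.5):
        """Minimum glyph width and height, in pixels, for legibility,
        given the closest stroke-axis spacing lambda_star and the glyph
        box dimensions, all in the same (arbitrary) font units."""
        delta_max = lambda_star / factor   # largest tolerable pixel spacing
        return glyph_w / delta_max, glyph_h / delta_max

    # Example (made-up numbers): closest stroke axes 0.2 em apart, glyph
    # box 0.6 em by 0.7 em -> about 7.5 by 8.75 pixels at the fundamental
    # scale:
    #   print(fundamental_scale(0.2, 0.6, 0.7))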
A consequence of this phenomenon is that in mature scripts and fonts the
Fourier components with this /critical wavelength/ {\lambda^\ast} define
a /natural grid/ whose cells are spaced {\lambda^\ast/2} apart, such
that all glyph strokes and dots are ideally synchronized to this grid
--- at least within the scope of a single glyph, and often over several
adjacent glyphs. See figure [BIKINI]. When this "synchronism" constraint
is violated --- that is, when the axes of parallel strokes are not
separated by an integer multiple of {\lambda_\min/2} --- both the
readability and the aesthetics seem to suffer. If the violation occurs
inside a glyph, the glyph looks ugly; if it occurs between adjacent
glyphs, their spacing seems wrong. For example, the "M" in figure [MEN]
is visibly compressed horizontally in comparison with the other letters,
while the "N" looks a bit too fat; and in figure [WOMEN], the "W" looks
okay but both the "M" and the "N" are compressed. More importantly, when
the text is reproduced at scales approaching the fundamental one,
the out-of-sync strokes become blobs at unexpected places, and
legibility is impaired.
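One way to quantify this "synchronism" is to measure how far the stroke
axes of a glyph fall from the best-fitting natural grid. A rough
Python/NumPy sketch (the stroke positions and units in the example are
hypothetical inputs):

    import numpy as np

    def grid_sync_error(stroke_axes, lambda_star):
        """Worst offset of any stroke axis from the best-fitting grid
        with cell size lambda_star/2, as a fraction of that cell size."""
        cell = lambda_star / 2.0
        phases = np.asarray(stroke_axes, float) / cell * 2 * np.pi
        best = np.angle(np.mean(np.exp(1j * phases)))   # best grid phase
        err = np.angle(np.exp(1j * (phases - best)))    # wrapped residuals
        return np.max(np.abs(err)) / (2 * np.pi)

    # Example: vertical stroke axes at 0, 3.1, 6.0 and 9.2 units with
    # lambda_star = 6 are nearly in sync (error is a few percent of a cell):
    #   print(grid_sync_error([0, 3.1, 6.0, 9.2], 6.0))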
Thus, at the fundamental scale, each glyph can be modeled as a
characteristic /fundamental template/, a small regular grid where each cell --- a /glyphel/ ---
is {\lambda_\min/2} wide, and is filled with a gray value. If a stroke
or dot in an actual glyph is displaced by less than {\lambda_\min/2}
from its ideal position, as implied by the fundamental bitmap, the
distortion usually results in a small change in the relative values of
the corresponding glyphels, which seems to be perceived as an unimportant defect or stylistic
variation. This is true not only of low-resolution digital fonts, like
[ZOMBIES_AHEAD], but also of high-resolution or pre-computer fonts, like
[BIKINI] and [alau]. Indeed, it is this near-universal feature of
established scripts that made low-resolution fonts viable in the first place.
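The fundamental template of a glyph can be approximated directly from a
high-resolution rendering by averaging the gray values inside each
glyphel. A minimal Python/NumPy sketch, assuming the glyphel grid size
is known (e.g. 3 wide by 5 tall for Western digits, as discussed below):

    import numpy as np

    def fundamental_template(glyph, n_cols, n_rows):
        """Average a 2D grayscale glyph image over a grid of glyphels,
        returning the template as an n_rows x n_cols array."""
        h, w = glyph.shape
        rows = np.linspace(0, h, n_rows + 1).astype(int)   # glyphel row edges
        cols = np.linspace(0, w, n_cols + 1).astype(int)   # glyphel column edges
        out = np.empty((n_rows, n_cols))
        for i in range(n_rows):
            for j in range(n_cols):
                out[i, j] = glyph[rows[i]:rows[i+1], cols[j]:cols[j+1]].mean()
        return out

    # Example (hypothetical): reduce a high-resolution digit image to its
    # 3-wide by 5-tall fundamental template:
    #   template = fundamental_template(digit_image, 3, 5)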
Note that the orientation of the strokes and the spacing between their
axes are almost the only significant information that is discernible
when the script is rendered at the fundamental scale. Most other font
variations --- including small-radius curves, thicker or thinner
strokes, outlines and shadows, textures, etc. --- affect only the
Fourier components with shorter wavelengths, which are outside the
essential band of the spectrum, or the Fourier component with frequency
zero, which determines the overall lightness or darkness of the glyph.
The latter attribute is generally perceived as non-essential for glyph
recognition (although it is often used to convey emphasis, e.g. when
text is rendered in boldface or in a different color).
On the other hand, some variations in glyph shape affect the components
within the essential band, and must therefore be viewed as glyph
variants rather than mere variations in font style. Some well-known
examples are the two very distinct lowercase forms of the letter "a",
and of the letter "g". Another, more subtle, example is the distinction
between the compressed forms of certain letters like "U" used in some
uppercase-only proportional fonts (where the two arms may be separated
by only one blank glyphel), and the more open forms used in fixed-width
or upper/lower fonts (where the upper arms may be separated by three
blank glyphels). In an uppercase-only, letter-only font, the fundamental
bitmap can be as small as three by three glyphels. As long as the glyph's
image respects the effective Nyquist limit --- that is, each glyphel is
at least {\lambda_\min/2 \approx 1.25} image pixels wide --- the text
remains legible, as
the [UNLIMITED] example shows. For Western digits, the fundamental
bitmap is 3 by 5 glyphels, as in the [1_800_438_2422] example.
In a specific image of a text, the critical wavelength {\lambda_\min}
(and hence the glyphel size {\lambda_\min/2}) in a particular direction can
usually be determined by looking for the closest parallel strokes that
need to be distinguished for reading, such as between "I" and "N" in the
sequence "IN". See the figure 004-000 below. For curved strokes, such as
between "N" and "O" in "NO" in some fonts, one must consider a mean
axis.
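A rough way to automate that measurement: locate the axes of the
(near-)vertical strokes as local minima of the column-wise mean
intensity of the text-line image, and take the smallest gap between
adjacent axes. A Python/NumPy sketch, assuming dark ink on a light
background and roughly upright strokes:

    import numpy as np

    def min_stroke_spacing(line_img):
        """Smallest horizontal distance, in pixels, between the axes of
        adjacent vertical strokes in a grayscale text-line image."""
        profile = line_img.mean(axis=0)           # mean intensity per column
        interior = profile[1:-1]
        is_min = (interior < profile[:-2]) & (interior <= profile[2:])
        dark = interior < profile.mean()          # skip minima in white space
        axes = np.flatnonzero(is_min & dark) + 1  # columns of stroke axes
        return np.diff(axes).min() if len(axes) > 1 else None

    # Example: applied to an image of the word "IN", this should return
    # (roughly) the critical wavelength in pixels:
    #   print(min_stroke_spacing(image_of_IN))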
However, some texts have anomalous extra space between characters,
as in figures 029-000 and 041-001. When the intercharacter space is
reduced below 1
The font's fundamental grid must not be confused with the image's pixel
grid. The fundamental grid may be orthogonal, or (as in italic scripts)
its axes may be slanted. (It is interesting that italic fonts were
introduced by printers in the Renaissance because they used less space
for the same readability. This change made the fundamental font grid
more similar to a hexagonal grid, which seems to provide slightly
higher encoding efficiency for typical Roman character shapes.) (It
seems likely that this synchronization makes the glyphs easier to
recognize at scales just above the fundamental one.)
Glyphels are in principle square, a consequence of the fact that the
spatial resolution of the eye is about the same in both main directions.
However, even upright text that is imaged face-on may be vertically
stretched or compressed, which means that the vertical critical
wavelength differs from the horizontal one and the glyphels are not
square in the image.
If the viewing direction is not perpendicular to the text plane, then
the glyphels are trapezoids, and the mapping from the glyphels of a
character to the image pixels is a two-dimensional projective
transformation. If the text surface is curved, the correspondence may be
even more complicated. However, for each individual character, this
mapping can usually be well approximated by a single affine transform
(a 2 by 2 linear transformation plus a translation), so that the
projected glyphels are approximately parallelograms. In that case, the
Nyquist criterion for legibility applies to the smallest singular value
of the 2 by 2 matrix, that is, to the minimum width of those
parallelograms.
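A minimal sketch of that legibility test in Python/NumPy. Here A is the
2 by 2 linear part of the glyphel-to-pixel map; the 1.5-pixel threshold
follows the figure used below in the "Implications" section and is an
assumption, not a fixed constant:

    import numpy as np

    def narrowest_glyphel(A):
        """Smallest singular value of the 2x2 glyphel-to-pixel matrix A,
        i.e. (approximately) the narrowest projected glyphel width."""
        return np.linalg.svd(np.asarray(A, float), compute_uv=False)[-1]

    def legible(A, min_width=1.5):
        return narrowest_glyphel(A) >= min_width

    # Example: a view that renders glyphels 3 pixels wide in one
    # direction but only 1.5 pixels wide in the other is right at the
    # threshold:
    #   print(legible([[3.0, 0.0], [0.0, 1.5]]))   # True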
Glyph segmentation
Another force was the need for scripts to be /self-synchronizing/;
that is, the accidental loss or obliteration of part of the text should
not impede the correct segmentation into glyphs of the part that remains
visible. In many printed scripts (such as the Roman, Cyrillic, Greek,
Ethiopic, and Chinese), this is usually achieved by making each glyph
a connected (or at least compact) set of strokes, with a thin gap
between adjacent glyphs. Some other scripts (such as Arabic or
Devanagari) use a similar principle, except for the presence of a
"baseline" stroke connecting the glyphs of a word. In handwritten or
calligraphic scripts, the segmentation is more complicated. In any case,
however, the information needed for self-synchronization too is present
in the essential band.
Implications for text detection and character recognition
It follows from the above that /character recognition by software should
be easier when the image resolution is just above the effective Nyquist
limit/; that is, when each glyphel is as small as possible, but at
least 1.5 pixels wide in its narrowest direction. (Slightly different
values may apply for rotated or sheared text, but let's ignore that
for the moment.)
In particular, in a multi-scale approach, with a pyramid whose levels
are related by a scale factor {\alpha}, this criterion means that at
each scale one should search only for characters whose glyphels are
between 1.5 and {1.5\alpha} pixels wide in their smallest dimension. For
example, suppose that {\alpha} is 2, and one wants to recognize upright
digits which may be stretched vertically by at most {\sigma} or
horizontally by at most {\gamma} relative to the aspect ratio of their
fundamental template, which is only 3 by 5 glyphels. At each level, one
should consider only candidate regions which snugly fit in an array of
3 by 5 rectangles whose smallest side is between 1.5 pixels and
{1.5\alpha = 3} pixels, and whose longest side is bounded by the above
constraints.
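A minimal Python sketch of that size test, for the 3 by 5 digit template
(the stretching factors {\sigma} and {\gamma} are left out here; they
would simply widen or narrow the bounds on the longer side):

    def digit_candidate_sizes(alpha=2.0, cols=3, rows=5, g_min=1.5):
        """Admissible candidate widths and heights, in pixels, for
        upright digits at one pyramid level, ignoring extra stretching."""
        g_max = g_min * alpha            # largest glyphel width at this level
        return (cols * g_min, cols * g_max), (rows * g_min, rows * g_max)

    # Example: with alpha = 2, candidates at each level should be between
    # 4.5 and 9 pixels wide and between 7.5 and 15 pixels tall (measured
    # at that level's resolution):
    #   print(digit_candidate_sizes())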
These considerations also imply that /character recognition should be
done on continuous tone images, rather than on bilevel images/. When a
continuous digital image is converted to a bilevel one, by any
thresholding or segmentation method, one loses lots of valuable
information (roughly four to seven bits from each pixel). More
significantly, the quantization inevitably causes aliasing between
Fourier components of different wavelengths, so that the information
contained in the essential components may be lost or distorted.
This is evident when one tries to segment small characters that
are just above the legibility threshold, such as [UNLIMITED] or
[ATM_INSIDE].
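The effect can be checked numerically by comparing the fundamental
template computed from the continuous-tone glyph with the one computed
from a thresholded copy. A rough sketch, reusing the
fundamental_template helper sketched earlier (the 0.5 threshold is an
arbitrary choice):

    import numpy as np

    def template_loss(glyph, n_cols, n_rows, thr=0.5):
        """RMS difference between the gray-level and the bilevel templates."""
        gray = fundamental_template(glyph, n_cols, n_rows)
        bilevel = fundamental_template((glyph > thr).astype(float),
                                       n_cols, n_rows)
        return np.sqrt(np.mean((gray - bilevel) ** 2))

    # For glyphs near the legibility threshold this difference can be
    # large enough to shift apparent strokes into the wrong glyphels.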
While one may alleviate these problems by magnifying the image with
smooth interpolation before thresholding, one would need at least 4 by 4
oversampling to preserve the leading 4 bits of the pixel values that are
discarded by the quantization. Then the windows used by the recognition
algorithms would have to be 4 times wider, leading to a factor of at
least 16 in processing time.
Moreover, /character recognition must use proper filtering/ and other
continuous-signal processing tools. This is particularly important
for digital photographs and videos that have been stored
in JPEG or MPEG format, and/or processed with a "sharpening" filter.
Indeed, in our experience, in those images a bit of Gaussian blurring
makes small text *easier* to read.
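A minimal sketch of that pre-filtering step in Python, using SciPy's
Gaussian filter; the sigma value is a guess and would in practice be
tuned to a fraction of the glyphel size:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def presmooth(img, sigma=0.7):
        """Mild Gaussian blur to attenuate JPEG ringing and sharpening
        halos near the Nyquist limit, before detection/recognition."""
        return gaussian_filter(np.asarray(img, float), sigma)

    # smoothed = presmooth(frame)   # then run text detection on `smoothed`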