# Last edited on 2013-01-02 17:16:23 by stolfilocal

INTRO

This notebook documents the fetching and preprocessing of the "gg100806"
image dataset. See ~/projects/text-tracking/Notebook for general
background.

OBTAINING THE IMAGES

On 2010-08-06 I queried Google Images for "sign" and "image" and got a
bunch of images. On 2012-12-29 I obtained a second set by googling for
"storefront".

These were saved to "asis/{NNN}.{EXT}", where {NNN} is a 3-digit number
and {EXT} varies ("jpeg", "gif", etc.). The URL of each image was saved
as "asis/{NNN}.txt".

CROPPING AND RESCALING

The images were variously cropped and sometimes reduced in size to
produce "orig/{NNN}.png".

Note that the source images are probably copyrighted.

CREATING THE PRELIMINARY GROUNDTRUTH MASKS

Produced a preliminary 0-1 mask "ptru/{NNN}.png" for each image, using
GIMP. In these masks, black is the text and white is the background.
The boundary tries to follow the actual character boundary. The masks
are not very good yet, but should be good enough to create the
text+surround masks.

For the images in the range 100-199, I used a program for multiscale
adaptive thresholding, "~/programs/java/FloatImage/DoThresholdImage.java".
The results were white-on-black, so I had to complement them:

  for img in `cd pfoo && ls *.png` ; do
    convert pfoo/${img} \
      -alpha Off \
      -colorspace Gray \
      -threshold '50%' \
      -negate \
      ptru/${img}
    display -title '%f' ptru/${img} orig/${img}
  done

CHECKING FOR COMPLETENESS

Checking whether all files are present for all images:

  inums=( `count 001 199 001 "%03.0f"` )

  for n in ${inums[@]} ; do
    if [[ ! ( ( -s asis/${n}.jpg ) || ( -s asis/${n}.jpeg ) || \
              ( -s asis/${n}.JPG ) || ( -s asis/${n}.JPEG ) || \
              ( -s asis/${n}.gif ) ) ]] ; then
      echo "( asis/${n} image missing )"
    else
      if [[ ! -s asis/${n}.txt ]] ; then echo "asis/${n}.txt missing" ; fi
      if [[ ! -s orig/${n}.png ]] ; then echo "orig/${n}.png missing" ; fi
      if [[ ! -s ptru/${n}.png ]] ; then echo "ptru/${n}.png missing" ; fi
    fi
  done

The next step is to create a rough mask "rmsk/{NNN}-{KKK}.png" for each
image "orig/{NNN}.png" and "true/{NNN}.png" and each homogeneous text
area (with the same font size, font style, and fore/back colors). The
purpose of these masks is to split the groundtruth image precisely into
parts belonging to different regions. Namely, the intersection of the
full groundtruth image with each mask must be exactly the groundtruth
for the desired text region. Thus it may contain arbitrary amounts of
background, but it must not contain any pixel of the groundtruth of any
other text region. This is done by the script

  make-region-masks.sh

OBTAINING STATISTICS FROM THE TEXTS

The next step was to analyze the statistics of the homogeneous texts.
For each pair {NNN,KKK}, we AND together "ptru/{NNN}.png" and
"rmsk/{NNN}-{KKK}.png", then apply dilation and a bit of Gaussian
blurring to generate a preliminary extraction mask "pmsk/{NNN}-{KKK}.png"
for the text. That mask should cover the characters and a bit of
background. The amount of dilation is proportional to the critical
wavelength {lamb}, which is estimated visually and specified by hand
for each case.

For visual checking, we also used the extraction mask as the alpha
channel over the image "orig/{NNN}.png", after some Gaussian blurring
to remove the JPEG artifacts and other noise. The amount of blurring,
too, is specified separately for each text region. The result is stored
in "pext/{NNN}-{KKK}.png". We also make a copy of this text reduced to
its natural scale (where the fundamental wavelength is equal to the
critical one).
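The following is only a minimal sketch of the pmsk/pext step, using
ImageMagick's "convert" as in the thresholding step above; the actual
processing is done by analyze-region-colors.sh (see below). The mask
polarities (text white after negating "ptru", region white in "rmsk"),
the dilation radius {lamb}/2, the fixed 0x1 blur on the mask, and the
example parameter values are assumptions for illustration only:

  nnn=121 ; kkk=001     # example region id (values are illustrative)
  lamb=12               # critical wavelength in pixels, estimated by eye
  blur=1.5              # blur amount for the orig image, set by hand

  rad=$(( lamb / 2 ))   # assumed dilation radius, proportional to {lamb}

  # AND the complemented groundtruth (text = white) with the region mask,
  # then dilate and blur a bit to get the extraction mask:
  convert ptru/${nnn}.png -negate \
      rmsk/${nnn}-${kkk}.png \
      -compose Multiply -composite \
      -morphology Dilate Disk:${rad} \
      -blur 0x1 \
      pmsk/${nnn}-${kkk}.png

  # Use that mask as the alpha channel over the (slightly blurred) original,
  # for visual checking:
  convert orig/${nnn}.png -blur 0x${blur} \
      pmsk/${nnn}-${kkk}.png \
      -alpha Off -compose CopyOpacity -composite \
      pext/${nnn}-${kkk}.png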
Then we used that image as a weight mask for my "pnmxhist" tool, to
obtain the distribution of colors over the text and its immediate
background.

All this is done by the script

  analyze-region-colors.sh

Most regions have a bimodal distribution, consisting of two relatively
narrow peaks (foreground and background) bridged by a population of
border pixels with intermediate values. However, text that is too small
lacks the foreground peak, and text with stroked outlines or a
non-uniform background may have more complex histograms.

>>> STOPPED HERE >>>

TO FIX

  * 121-001 : one can also read "11AM - 6AM \n LATE NIGHT"; add it to
    make-region-masks.sh.

  * 123-003 : one can also read "Mr." before "Dooley's"; add it to
    make-region-masks.sh.

  * Check why analyze-region-colors.sh creates a large mask instead of
    a tight one.

  * 125-036 : filtering problem?

CREATING BETTER GROUNDTRUTH MASKS

Will try to create a better set of groundtruth images by extracting the
text alone and running normalization on it. Some ideas:

  * Use the distance from the boundary of the mask to figure out
    whether the text is light-on-dark or dark-on-light.

  * Converting to grayscale first should lose little information, but
    in principle color may help to separate the effects of variable
    lighting etc.

  * Augment pnmwfilter to accept an opacity mask for the image, and/or
    to use PNG input with an alpha channel.

MOVING TO THE COMMON REPOSITORY

Moving the images to my own data directory, with set id "gg100806".
Renaming the sub-directories:

  "asis" --> (not copied)
  "orig" --> "orig"
  "true" --> "true"

Also copying the URL files to "orig":

  projdir=~/projects/text-tracking
  dset="gg100806"

  forig=${projdir}/data/full/orig
  ftrue=${projdir}/data/full/true

  mkdir -p {${forig},${ftrue}}/${dset}

  for f in ${inums[@]} ; do
    mv orig/$f.png ${forig}/${dset}/
    cp -p asis/$f.txt ${forig}/${dset}/
    mv true/$f.png ${ftrue}/${dset}/
  done
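A quick sanity check could confirm that every file landed in the common
repository. This is only a rough sketch in the style of the completeness
check above, not one of the notebook's scripts; it reuses the {inums},
{forig}, {ftrue}, and {dset} variables defined earlier, and assumes that
every number in {inums} actually has files:

  for f in ${inums[@]} ; do
    if [[ ! -s ${forig}/${dset}/$f.png ]] ; then echo "${forig}/${dset}/$f.png missing" ; fi
    if [[ ! -s ${forig}/${dset}/$f.txt ]] ; then echo "${forig}/${dset}/$f.txt missing" ; fi
    if [[ ! -s ${ftrue}/${dset}/$f.png ]] ; then echo "${ftrue}/${dset}/$f.png missing" ; fi
  done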