# Last edited on 2013-01-02 17:16:23 by stolfilocal

INTRO

This notebook documents the fetching and preprocessing of the "gg100806"
image dataset. See ~/projects/text-tracking/Notebook for general
background.

OBTAINING THE IMAGES

On 2010-08-06 I queried Google Images for "sign" and "image" and got a
bunch of images. On 2012-12-29 I obtained a second set by googling for
"storefront".

These were saved to "asis/{NNN}.{EXT}", where {NNN} is a 3-digit number
and {EXT} varies ("jpeg", "gif", etc.). The URL of each image was saved
as "asis/{NNN}.txt".

CROPPING AND RESCALING

The images were variously cropped and sometimes reduced in size to
produce "orig/{NNN}.png".

Note that the source images are probably copyrighted.

CREATING THE PRELIMINARY GROUNDTRUTH MASKS

Produced a preliminary 0-1 mask "ptru/{NNN}.png" for each image, using
GIMP. In these masks, black is the text and white is the background.
The boundary tries to follow the actual character boundary. The masks
are not very good yet, but should be good enough to create the
text+surround masks.

For the images in the range 100-199, I used a program for multiscale
adaptive thresholding, "~/programs/java/FloatImage/DoThresholdImage.java".
The results were white-on-black, so I had to complement them:

  for img in `cd pfoo && ls *.png` ; do
    convert pfoo/${img} \
      -alpha Off \
      -colorspace Gray \
      -threshold '50%' \
      -negate \
      ptru/${img}
    display -title '%f' ptru/${img} orig/${img}
  done

CHECKING FOR COMPLETENESS

Checking whether all files are present for all images:

  inums=( `count 001 199 001 "%03.0f"` )

  for n in ${inums[@]} ; do
    if [[ ! ( ( -s asis/${n}.jpg ) || ( -s asis/${n}.jpeg ) || \
              ( -s asis/${n}.JPG ) || ( -s asis/${n}.JPEG ) || \
              ( -s asis/${n}.gif ) ) ]] ; then
      echo "( asis/${n} image missing )"
    else
      if [[ ! -s asis/${n}.txt ]] ; then echo "asis/${n}.txt missing" ; fi
      if [[ ! -s orig/${n}.png ]] ; then echo "orig/${n}.png missing" ; fi
      if [[ ! -s ptru/${n}.png ]] ; then echo "ptru/${n}.png missing" ; fi
    fi
  done

The next step is to create a rough mask "rmsk/{NNN}-{KKK}.png" for each
image "orig/{NNN}.png" and "true/{NNN}.png" and each homogeneous text
area (with the same font size, font style, and fore/back colors). The
purpose of these masks is to split the groundtruth image precisely into
parts belonging to different regions. Namely, the intersection of the
full groundtruth image with each mask must be exactly the groundtruth
for the desired text region. Thus it may contain arbitrary amounts of
background, but it must not contain any pixel of the groundtruth of any
other text region. This is done by the script

  make-region-masks.sh

OBTAINING STATISTICS FROM THE TEXTS

The next step was to analyze the statistics of the homogeneous texts.
For each pair {NNN,KKK}, we AND together "ptru/{NNN}.png" and
"rmsk/{NNN}-{KKK}.png", then apply dilation and a bit of Gaussian
blurring to generate a preliminary extraction mask "pmsk/{NNN}-{KKK}.png"
for the text. That mask should cover the characters and a bit of
background. The amount of dilation is proportional to the critical
wavelength {lamb}, which is estimated visually and specified by hand
for each case.

For visual checking, we also used the extraction mask as the alpha
channel over the image "orig/{NNN}.png", after some Gaussian blurring
to remove the JPEG artifacts and other noise. The amount of blurring,
too, is specified separately for each text region. The result is stored
in "pext/{NNN}-{KKK}.png". We also make a copy of this text reduced to
its natural scale (where the fundamental wavelength is equal to the
critical one).
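The following is only a minimal sketch of the pmsk/pext step, using
ImageMagick's "convert" as in the thresholding step above; the actual
processing is done by analyze-region-colors.sh (see below). The mask
polarities (text white after negating "ptru", region white in "rmsk"),
the dilation radius {lamb}/2, the fixed 0x1 blur on the mask, and the
example parameter values are assumptions for illustration only:

  nnn=121 ; kkk=001     # example region id (values are illustrative)
  lamb=12               # critical wavelength in pixels, estimated by eye
  blur=1.5              # blur amount for the orig image, set by hand

  rad=$(( lamb / 2 ))   # assumed dilation radius, proportional to {lamb}

  # AND the complemented groundtruth (text = white) with the region mask,
  # then dilate and blur a bit to get the extraction mask:
  convert ptru/${nnn}.png -negate \
      rmsk/${nnn}-${kkk}.png \
      -compose Multiply -composite \
      -morphology Dilate Disk:${rad} \
      -blur 0x1 \
      pmsk/${nnn}-${kkk}.png

  # Use that mask as the alpha channel over the (slightly blurred) original,
  # for visual checking:
  convert orig/${nnn}.png -blur 0x${blur} \
      pmsk/${nnn}-${kkk}.png \
      -alpha Off -compose CopyOpacity -composite \
      pext/${nnn}-${kkk}.png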
Then we used that image as a weight mask for my "pnmxhist" tool, to
obtain the distribution of colors over the text and its immediate
background.

All this is done by the script

  analyze-region-colors.sh

Most regions have a bimodal distribution, consisting of two relatively
narrow peaks (foreground and background) bridged by a population of
border pixels with intermediate values. However, text that is too small
lacks the foreground peak, and text with stroked outlines or a
non-uniform background may have more complex histograms.

>>> STOPPED HERE >>>

TO FIX

  * 121-001 : one can also read "11AM - 6AM \n LATE NIGHT"; add it to
    make-region-masks.sh.

  * 123-003 : one can also read "Mr." before "Dooley's"; add it to
    make-region-masks.sh.

  * Check why analyze-region-colors.sh creates a large mask instead of
    a tight one.

  * 125-036 : filtering problem?

CREATING BETTER GROUNDTRUTH MASKS

Will try to create a better set of groundtruth images by extracting the
text alone and running normalization on it. Some ideas:

  * Use the distance from the boundary of the mask to figure out
    whether the text is light-on-dark or dark-on-light.

  * Converting to grayscale first should lose little information, but
    in principle color may help to separate the effects of variable
    lighting etc.

  * Augment pnmwfilter to accept an opacity mask for the image, and/or
    to use PNG input with an alpha channel.

MOVING TO THE COMMON REPOSITORY

Moving the images to my own data directory, with set id "gg100806".
Renaming the sub-directories:

  "asis" --> (not copied)
  "orig" --> "orig"
  "true" --> "true"

Also copying the URL files to "orig":

  projdir=~/projects/text-tracking
  dset="gg100806"

  forig=${projdir}/data/full/orig
  ftrue=${projdir}/data/full/true

  mkdir -p {${forig},${ftrue}}/${dset}

  for f in ${inums[@]} ; do
    mv orig/$f.png ${forig}/${dset}/
    cp -p asis/$f.txt ${forig}/${dset}/
    mv true/$f.png ${ftrue}/${dset}/
  done
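A quick sanity check could confirm that every file landed in the common
repository. This is only a rough sketch in the style of the completeness
check above, not one of the notebook's scripts; it reuses the {inums},
{forig}, {ftrue}, and {dset} variables defined earlier, and assumes that
every number in {inums} actually has files:

  for f in ${inums[@]} ; do
    if [[ ! -s ${forig}/${dset}/$f.png ]] ; then echo "${forig}/${dset}/$f.png missing" ; fi
    if [[ ! -s ${forig}/${dset}/$f.txt ]] ; then echo "${forig}/${dset}/$f.txt missing" ; fi
    if [[ ! -s ${ftrue}/${dset}/$f.png ]] ; then echo "${ftrue}/${dset}/$f.png missing" ; fi
  done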