# Last edited on 2005-01-17 02:42:00 by stolfi

DATA SETS FOR MUTUAL INFORMATION CONTENT ANALYSIS
--------------------------------------------------

OVERVIEW
--------

  This directory contains two complete data sets for the analysis
  program ("frb_analyze"): 
  
    Data set "s2" is used for the analysis of mutual information
    content between "true" candidates --- closely fitting
    pairs of fragment outlines --- selected by hand from
    those used in Helena's thesis.
    
    Data set "f2" is a "control experiment" consisting of
    "false" candidates, namely fragment outline segments that were
    picked and paired at random.
  
  The program actually requires two kinds of input data:

    * the geometry of the fragment outlines (the ".flc" files in 
    "curves-001/", shared by both data sets); and

    * a representative sample of candidates taken from the
    target set (file "sample.can" in the directory
    "cands-001/SET/raw/", where SET is "s2" or "f2").

  The program writes many output files, mostly for debugging and
  documentation purposes.  The actual results of the analysis
  are written as two tables: the information content of each Fourier
  coefficient ("sample.ifc") and the same data condensed by frequency
  band ("sample.ibd"), both in the directories "cands-001/SET/ref/".

FRAGMENT-RELATED FILES
----------------------

  The input fragment outlines are in directory "curves-001/",
  specifically in the files 
  
    "curves-001/NNNN/f001.flc"

  where NNNN is the fragment number, ranging from 0000 to 0111.
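
  As an illustration, the outline file names can be generated as in
  the following Python sketch. It assumes that NNNN is a decimal
  number padded with zeros to four digits and that every number from
  0000 to 0111 is present; the function name is hypothetical and is
  not part of "frb_analyze".

```python
def fragment_outline_path(frag_num, root="curves-001"):
    """Return the outline file name for a fragment number (0..111)."""
    if not 0 <= frag_num <= 111:
        raise ValueError("fragment number out of range: %d" % frag_num)
    # Assumption: NNNN is the decimal fragment number, zero-padded to 4 digits.
    return "%s/%04d/f001.flc" % (root, frag_num)

# One path per fragment, 0000 through 0111 inclusive.
all_paths = [fragment_outline_path(n) for n in range(112)]
```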

  These outlines have been smoothed with a "geometric" Gaussian filter
  (as described in Helena's thesis) with characteristic width sigma
  = 1 pixel = 1/300 inch = 0.085 mm, then sampled with uniform step
  as close as possible to sigma/4, namely delta = 0.25 pixel = 0.021 mm.
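
  The unit conversions above can be checked with a few lines of
  Python. The variable names are illustrative only; the one external
  fact used is 1 inch = 25.4 mm.

```python
MM_PER_INCH = 25.4
PIXELS_PER_INCH = 300          # one pixel = 1/300 inch, as stated above

sigma_px = 1.0                 # filter width, in pixels
delta_px = sigma_px / 4        # sampling step, in pixels

sigma_mm = sigma_px * MM_PER_INCH / PIXELS_PER_INCH
delta_mm = delta_px * MM_PER_INCH / PIXELS_PER_INCH

print("sigma = %.3f mm, delta = %.3f mm" % (sigma_mm, delta_mm))
```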

CANDIDATE-RELATED FILES
-----------------------

  Candidate-related files, input and output, live in the 
  directories "cands-001/SET" where SET is either "s2" or "f2". 

  The input sample candidates, to be analyzed, are described in the file
  
    "cands-001/SET/raw/sample.can"

  For each candidate listed in this file, the program will extract its
  two segments from the respective fragment outlines, and write them
  to files
    
    "cands-001/SET/raw/NNNNNN/SIDE.flc"
    "cands-001/SET/raw/NNNNNN/SIDE.fsh"

  where NNNNNN is the candidate number, and SIDE is either "a" or "b".
  These ".flc" files describe plane curves, like those that describe
  the whole fragment outlines; each data point has three coordinates X
  Y Z (with Z = 0 for our fragments). The ".fsh" files are the
  corresponding "shape functions"; each data point is a single number.
  
  The program then adjusts the alignment of each candidate and trims
  its segments to the proper number of data points (2^9+1 = 513). The
  vital parameters of these "refined" candidates are written to the
  file
  
    "cands-001/SET/ref/sample.can"
    
  The geometry of these segments and their "shape functions"
  are written to the files
  
    "cands-001/SET/ref/NNNNNN/SIDE.flc"
    "cands-001/SET/ref/NNNNNN/SIDE.fsh"
  
  A plot of the two segments, rotated and translated to their
  "about-to-fit" positions, can be found in
  
    "cands-001/SET/ref/NNNNNN/ab-flc.eps"
  
  The program also writes the Fourier transforms (actually, sine
  transforms) and the power spectra of the shape function, to files
  
    "cands-001/SET/ref/NNNNNN/SIDE.fft"
    "cands-001/SET/ref/NNNNNN/SIDE.fpw"
  
  respectively. Finally, the program computes the mean "m" and the 
  difference "d" of the two shape functions, and writes them and their
  derived data to the files

    "cands-001/SET/ref/NNNNNN/m.fsh"
    "cands-001/SET/ref/NNNNNN/m.fft"
    "cands-001/SET/ref/NNNNNN/m.fpw"

    "cands-001/SET/ref/NNNNNN/d.fsh"
    "cands-001/SET/ref/NNNNNN/d.fft"
    "cands-001/SET/ref/NNNNNN/d.fpw"

  Note that the candidate number NNNNNN is *not* necessarily
  the same in the "raw" and "refined" sets, because the
  program may discard "raw" candidates that are too short for
  realignment and trimming. In the two data sets provided here,
  however, no candidates were eliminated, so the numbers happen
  to match.
  
  The program also writes the files 

    "cands-001/SET/ref/m.fvt"
    "cands-001/SET/ref/d.fvt"
    "cands-001/SET/ref/ab.fvt"

  These files contain one line for each frequency k from 1 to 511, and
  show the average and variance of the corresponding Fourier
  coefficient --- respectively for the "m", "d", and "a"/"b" shape
  functions.
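
  The derived data described above (mean and difference of two shape
  functions, sine transform, power spectrum, and per-frequency
  statistics) can be sketched in Python as follows. This is only an
  illustration and not the code of "frb_analyze": the function names
  are hypothetical, and the sign and scale of "d", the normalization
  of the transform, the definition of the power spectrum, and the kind
  of variance are all assumptions. For a segment with 513 samples the
  transform yields one coefficient for each k from 1 to 511.

```python
import math

def mean_and_difference(a, b):
    """Pointwise mean and difference of two equal-length shape functions."""
    if len(a) != len(b):
        raise ValueError("shape functions must have the same length")
    m = [(x + y) / 2 for x, y in zip(a, b)]
    d = [x - y for x, y in zip(a, b)]      # assumption: d = a - b, unscaled
    return m, d

def sine_transform(f):
    """Unnormalized sine transform of samples f[0..N], for k = 1..N-1."""
    N = len(f) - 1
    return [sum(f[n] * math.sin(math.pi * k * n / N) for n in range(1, N))
            for k in range(1, N)]

def power_spectrum(coefs):
    """Assumption: power at each frequency is the squared coefficient."""
    return [c * c for c in coefs]

def coefficient_stats(transforms):
    """Average and variance of each coefficient over several transforms."""
    count = len(transforms)
    stats = []
    for values in zip(*transforms):
        avg = sum(values) / count
        # Assumption: population variance (divide by count, not count - 1).
        var = sum((v - avg) ** 2 for v in values) / count
        stats.append((avg, var))
    return stats
```

  The direct summation costs on the order of N^2 operations per
  transform, which is acceptable for N = 512; a fast transform would
  give the same coefficients.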
    
  Finally, the program writes

    "cands-001/SET/ref/sample.ifc"
    
  which gives the mutual information content for each frequency; and
    
    "cands-001/SET/ref/sample.ibd"
    
  which gives the same information, condensed for each frequency band.

FILE FORMATS
------------

  The format of those files should be mostly self-explanatory. Lines
  beginning with "|", when present, are merely comments. The named
  "unit" parameter defined in some files is a scale factor to be
  multiplied into the following data values.
  
  Fragment outlines and segments (".flc" extension):
    
    After the named parameter definition "samples = M" there are M
    data points equally spaced along the fragment's contour, one per
    line. Each point is a triplet X Y Z of integer coordinates,
    implicitly multiplied by the "unit" parameter to yield actual
    coordinates in mm. The Z coordinate is always zero in these
    datasets.
    
    The full outlines are closed curves, so the last sample is implicitly
    followed by the first one.  This assumption obviously does not apply
    to the ".flc" files of extracted segments.
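
    A minimal Python reader for this format might look like the
    sketch below. It assumes that each named parameter sits on its own
    line in the form "name = value" and that any other header line can
    be skipped; neither detail is spelled out in this document, so
    check it against the actual files. The function name is
    hypothetical.

```python
def parse_flc(lines):
    """Parse the lines of a ".flc" file; return (params, points_in_mm)."""
    params = {}
    points = []
    remaining = None      # data points still expected, once "samples" is seen
    for line in lines:
        line = line.strip()
        if not line or line.startswith("|"):
            continue                      # blank line or comment
        if remaining is None:
            if "=" in line:               # assumption: "name = value" syntax
                name, value = line.split("=", 1)
                params[name.strip()] = value.strip()
                if name.strip() == "samples":
                    remaining = int(value)
            continue                      # other header lines are ignored
        if remaining > 0:
            x, y, z = (int(t) for t in line.split())
            points.append((x, y, z))
            remaining -= 1
    # The "unit" parameter scales the integer coordinates to mm.
    unit = float(params.get("unit", 1.0))
    return params, [(x * unit, y * unit, z * unit) for x, y, z in points]
```

    For a file on disk, the function can be applied to the open file
    object, since iterating over it yields its lines.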

  Shape functions (".fsh" extension):
  
    The format is similar to that of ".flc" files, except that each
    data point (each line) is a single number.

  Fourier transforms and spectra (".fft" and ".fpw" extensions):
  
    These have the same format as the ".fsh" (shape function) files.
  
  Candidate sets (".can" extension):
  
    After the file header and some named parameters, there is one line
    for each candidate (segment pair) in the format
    
      ACRV ATOT AINI ALEN ADIR  BCRV BTOT BINI BLEN BDIR  R S T
      
    The first five fields describe one of the paired segments (segment
    "a"), and the next five fields describe the other segment ("b").
    The last three fields are not used by frb_analyze.
    
    For segment "a":
    
      ACRV is the index of its fragment outline in directory curves-001;
      ATOT is the total number of samples in the whole outline; 
      AINI is the index of the first sample of segment "a" within that outline;
      ALEN is the number of samples (not steps) in segment "a";
      ADIR is the direction ("+" or "-") in which the segment is to be read.
      
    Thus, segment "a" consists of samples  curve[ACRV][(AINI+ADIR*k)%ATOT]
    for k varying from 0 through ALEN-1, where ADIR is taken as +1 or -1
    and "%" is the mathematical MOD (remainder) operator, whose result
    always lies between 0 and ATOT-1.  The description of segment "b" is
    entirely similar.
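
    The indexing rule can be sketched in Python as follows. The
    function names are hypothetical, and the sketch assumes that the
    fields of a candidate line are separated by blanks and that the
    direction field is the single character "+" or "-". Python's "%"
    already behaves as the mathematical MOD for a positive modulus;
    in languages such as C the remainder of a negative number may be
    negative and needs an explicit correction.

```python
def segment_indices(ini, length, direction, total):
    """Outline sample indices of a segment given INI, LEN, DIR and TOT."""
    step = {"+": 1, "-": -1}[direction]
    # For total > 0, Python's % always returns a value in 0..total-1.
    return [(ini + step * k) % total for k in range(length)]

def parse_candidate_line(line):
    """Split a candidate line into the descriptions of segments a and b."""
    fields = line.split()      # assumption: blank-separated fields

    def segment(start):
        crv, tot, ini, length = (int(t) for t in fields[start:start + 4])
        return (crv, tot, ini, length, fields[start + 4])

    # The trailing fields R S T are not used and are ignored here.
    return segment(0), segment(5)

def extract_segment(outline, ini, length, direction):
    """Return the samples of a segment from a closed outline (a list)."""
    total = len(outline)
    return [outline[i] for i in segment_indices(ini, length, direction, total)]
```

    For example, on an outline with 10 samples, a segment with
    AINI = 8, ALEN = 4 and ADIR = "+" wraps around the end of the
    outline and uses samples 8, 9, 0, 1.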