# Last edited on 2008-07-04 13:06:14 by stolfi

PROJECT

  Buiding a coding region database "from scratch".
  
OVERVIEW
  
  The Reese and Saxonov databases are not quite appropriate for what we 
  need: 
  
    * they are more concerned with intron/exon than with 
    coding/noncoding studies.
    
    * they show only primal-view sequences. We need to be able to
    detect complemented coding regions too.
    
    * they may be rather misleading on the lengths of introns.
    
    * they give several separate partial views for genes that
    have multiple splicings, instead of a single unified view.
    
  Thus we decided to create another database, with the following
  criteria: 
  
    * Each item is a maximal-length contig (preferably a whole
    chromosome or plasmid, an entire viral genome, etc.), possibly
    containing many genes.
    
    * The label file marks with [DEF] all regions present in the
    file that are or were originally coding regions.
    
    * Coding regions in the complementary strand are identified too.
    
    * Coding regions of pseudogenes are coded with [DEF] when their
    phase is precisely known.
    
    * Regions where two or more coding regions overlap
    with different phases or opposite orientations
    are marked disnctlively from normal coding regions.
    
    * Transcribed but non-coding regions may be marked, but
    that is not a high priority.
    
FETCHING

  We chose to use the genomic RefSeq files available from NCBI because
  they are good-quality DNA sequences for whole genomic biomolecules
  (chromosomes, mtDNA, plastids, etc.), with authoritative
  annotations.
  
  The file names for complete genomic molecules are
  "{HH}_{KKKKKK}.{V}" where {HH} is "NT" or "NC", {KKKKKK} is a
  six-digit number and {V} a numeric version code.
  
  To identify the definition and newest version of a file, use
  
    ../tools/ncbi_head {HH}_{KKKKKK}

  To fetch the newest version of a file, use 
    
    ../tools/ncbi_fetch {OUTDIR} {HH}_{KKKKKK}
  
  To identify or fetch a specific version, add a ".{V}" after the file name.
  
  Directory where to put the fetched files:

    mkdir -p orig/raws
    
  We group the files into datasets, each dataset being a specific version
  of the genome files of a single species.
  
  For each dataset we prepare a file called "orig/raws/{SS}-{YYYY}-{MM}-{DD}.fetch"
  which contains a list of the versioned names of the NCBI RefSeq files that comprise it.
  Here {SS} identifies the species (e.g. "at" for /Arabidopsis thaliana/)
  and "{YYYY}-{MM}-{DD}" is the date of the most recent file in the set
  (as given in its LOCUS line).

    set species = ( \
      at-2008-05-22 \
      ce-2008-06-11 \
      dm-2008-05-14 \
      dr-2008-06-10 \
    )
    
    set logfile = "orig/raws/fetch-`yyyy-mm-dd-hhmmss`.log"
    /bin/rm -f ${logfile}
    foreach sp ( ${species} )
      mkdir -p orig/raws/${sp}
      foreach f ( `cat orig/raws/${sp}.fetch | sed -e 's:[\#].*$::'` )
        echo "${sp}/${f} ..."
        ../tools/ncbi_fetch orig/raws/${sp} ${f} >> ${logfile}
        echo " " >> ${logfile}
        sleep 3
      end
      ls -l orig/raws/${sp}
    end
    ls -l orig/raws/*/*
    
      -rw-r--r-- 1 stolfi staff   304697 Jun 14 20:39 orig/raws/at-2008-05-22/NC_000932.1
      -rw-r--r-- 1 stolfi staff   634486 Jun 14 20:39 orig/raws/at-2008-05-22/NC_001284.2
      -rw-r--r-- 1 stolfi staff 69761140 Jun 15 07:42 orig/raws/at-2008-05-22/NC_003070.6
      -rw-r--r-- 1 stolfi staff 42596737 Jun 14 20:44 orig/raws/at-2008-05-22/NC_003071.4
      -rw-r--r-- 1 stolfi staff 50981491 Jun 14 20:44 orig/raws/at-2008-05-22/NC_003074.5
      -rw-r--r-- 1 stolfi staff 40550269 Jun 14 20:42 orig/raws/at-2008-05-22/NC_003075.4
      -rw-r--r-- 1 stolfi staff 59856528 Jun 14 20:46 orig/raws/at-2008-05-22/NC_003076.5
      
      -rw-r--r-- 1 stolfi staff    33461 Jun 14 20:46 orig/raws/ce-2008-06-11/NC_001328.1
      -rw-r--r-- 1 stolfi staff 29541252 Jun 14 20:46 orig/raws/ce-2008-06-11/NC_003279.5
      -rw-r--r-- 1 stolfi staff 29581575 Jun 14 20:47 orig/raws/ce-2008-06-11/NC_003280.6
      -rw-r--r-- 1 stolfi staff 27667988 Jun 14 20:48 orig/raws/ce-2008-06-11/NC_003281.7
      -rw-r--r-- 1 stolfi staff 32275175 Jun 14 20:49 orig/raws/ce-2008-06-11/NC_003282.4
      -rw-r--r-- 1 stolfi staff 38638065 Jun 14 20:49 orig/raws/ce-2008-06-11/NC_003283.7
      -rw-r--r-- 1 stolfi staff 31469892 Jun 14 20:50 orig/raws/ce-2008-06-11/NC_003284.6
      
      -rw-r--r-- 1 stolfi staff    60705 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_001709.1
      -rw-r--r-- 1 stolfi staff  2097573 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_004353.3
      -rw-r--r-- 1 stolfi staff 32501734 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_004354.3
      -rw-r--r-- 1 stolfi staff 41489858 Jun 14 20:55 orig/raws/dm-2008-05-14/NT_033777.2
      -rw-r--r-- 1 stolfi staff 32254136 Jun 14 20:53 orig/raws/dm-2008-05-14/NT_033778.3
      -rw-r--r-- 1 stolfi staff 33803458 Jun 14 20:52 orig/raws/dm-2008-05-14/NT_033779.4
      -rw-r--r-- 1 stolfi staff 35933789 Jun 14 20:54 orig/raws/dm-2008-05-14/NT_037436.3
      
      -rw-r--r-- 1 stolfi staff    57977 Jun 14 20:50 orig/raws/dr-2008-06-10/NC_002333.2
      -rw-r--r-- 1 stolfi staff 73123740 Jun 15 07:55 orig/raws/dr-2008-06-10/NC_007112.3
      -rw-r--r-- 1 stolfi staff 70829200 Jun 15 07:57 orig/raws/dr-2008-06-10/NC_007113.3
      -rw-r--r-- 1 stolfi staff 81911587 Jun 15 08:00 orig/raws/dr-2008-06-10/NC_007114.3
      -rw-r--r-- 1 stolfi staff 55397910 Jun 15 08:02 orig/raws/dr-2008-06-10/NC_007115.3
      -rw-r--r-- 1 stolfi staff 91506853 Jun 15 08:05 orig/raws/dr-2008-06-10/NC_007116.3
      -rw-r--r-- 1 stolfi staff 76817414 Jun 15 08:08 orig/raws/dr-2008-06-10/NC_007117.3
      -rw-r--r-- 1 stolfi staff 91169946 Jun 15 08:11 orig/raws/dr-2008-06-10/NC_007118.3
      -rw-r--r-- 1 stolfi staff 73309774 Jun 15 08:14 orig/raws/dr-2008-06-10/NC_007119.3
      -rw-r--r-- 1 stolfi staff 66741793 Jun 15 08:17 orig/raws/dr-2008-06-10/NC_007120.3
      -rw-r--r-- 1 stolfi staff 55167710 Jun 15 08:18 orig/raws/dr-2008-06-10/NC_007121.3
      -rw-r--r-- 1 stolfi staff 57910696 Jun 15 08:20 orig/raws/dr-2008-06-10/NC_007122.3
      -rw-r--r-- 1 stolfi staff 61705851 Jun 15 08:22 orig/raws/dr-2008-06-10/NC_007123.3
      -rw-r--r-- 1 stolfi staff 69421045 Jun 15 08:24 orig/raws/dr-2008-06-10/NC_007124.3
      -rw-r--r-- 1 stolfi staff 73126441 Jun 15 08:26 orig/raws/dr-2008-06-10/NC_007125.3
      -rw-r--r-- 1 stolfi staff 60635691 Jun 15 08:28 orig/raws/dr-2008-06-10/NC_007126.3
      -rw-r--r-- 1 stolfi staff 69006744 Jun 15 08:30 orig/raws/dr-2008-06-10/NC_007127.3
      -rw-r--r-- 1 stolfi staff 67843837 Jun 15 15:24 orig/raws/dr-2008-06-10/NC_007128.3
      -rw-r--r-- 1 stolfi staff 63835842 Jun 15 08:33 orig/raws/dr-2008-06-10/NC_007129.3
      -rw-r--r-- 1 stolfi staff 60007581 Jun 15 08:34 orig/raws/dr-2008-06-10/NC_007130.3
      -rw-r--r-- 1 stolfi staff 73498760 Jun 15 08:37 orig/raws/dr-2008-06-10/NC_007131.3
      -rw-r--r-- 1 stolfi staff 59887856 Jun 15 15:27 orig/raws/dr-2008-06-10/NC_007132.3
      -rw-r--r-- 1 stolfi staff 51100294 Jun 15 08:38 orig/raws/dr-2008-06-10/NC_007133.3
      -rw-r--r-- 1 stolfi staff 60260763 Jun 15 08:40 orig/raws/dr-2008-06-10/NC_007134.3
      -rw-r--r-- 1 stolfi staff 52182564 Jun 15 08:42 orig/raws/dr-2008-06-10/NC_007135.3
      -rw-r--r-- 1 stolfi staff 42890688 Jun 15 15:30 orig/raws/dr-2008-06-10/NC_007136.3
      
  Counting the total base pairs in each genome:
  
    foreach sp ( ${species} )
      printf "%s " "${sp}"
      grep -m1 -e 'LOCUS' orig/raws/${sp}/N*[0-9] \
        | gawk \
            ' /LOCUS/ { n += $3; } \
              END { printf "%d bp\n", n; } \
            '
    end
    
      at-2008-05-22    119707899 bp
      ce-2008-06-11    100281244 bp
      dr-2008-06-10   1277091829 bp
      dm-2008-05-14    120401063 bp    

DISCOVERING THE FEATURE KEYS

  Listing all feature keys used in the files:
  
    foreach sp ( ${species} )
      echo " "; echo "=== ${sp} ==="
      set ofile = "orig/raws/${sp}/features.txt"
      rm -f ${ofile}
      foreach f ( `cd orig/raws/${sp} && ls N[CT]_*[0-9]` )
        echo "# ${f}" >> ${ofile}
        sed \
            -e '1,/^FEATURES/d' \
            -e '/^ORIGIN/dq' \
            orig/raws/${sp}/${f} \
          | expand \
          | gawk '//{ s = substr($0,1,20); if (s\!~/^[ ]*$/) { print s; }}' \
          | tr -d ' ' \
          | sort | uniq -c \
          >> ${ofile}
      end
      combine-counts ${ofile} \
        > orig/raws/${sp}/.tmp
    end
    set tmps = ( `echo ${species} | tr ' ' '\012' | sed -e 's:'\$':/.tmp:'` )
    echo "${tmps}"
    set tits = ( `echo ${species} | tr ' ' '\012' | sed -e 's:[-].*'\$'::'` )
    echo "${tits}"
    ( cd orig/raws && join-counts ${tmps} ) \
      | format-multi-counts \
         -v titles="${tits} KEY" \
      > .keys
    cat .keys

           at       ce       dr       dm  KEY
      -------  -------  -------  -------  ---
        20234    20234    20234    20234  CDS
          975      975      975      975  STS
            8        8        8        8  gap
        14690    14690    14690    14690  gene
        20221    20221    20221    20221  mRNA
          339      339      339      339  misc_RNA
            7        7        7        7  misc_feature
          296      296      296      296  ncRNA
            2        2        2        2  rRNA
            1        1        1        1  rep_origin
         5395     5395     5395     5395  repeat_region
            7        7        7        7  source
          314      314      314      314  tRNA

  !!! We should compress the files with {bzip2}. 

EXTRACTING THE LABELING RANGES

  We wrote a tool {extract_refSeq_ranges} that reads an annotated
  GenBank flatfile and outputs a list of all RNA-transcribed
  and protein-translated ranges listed in the "FEATURES" section.
  
  Testing:
  
    ( ../tools/extract_RefSeq_ranges \
          orig/raws/ce-2008-06-11/NC_003283.7 \
        > .ranges ) >& .bugs
      
  Extracting for real:
  
    foreach sp ( ${species} )
      echo " "; echo "=== ${sp} ==="
      foreach f ( `cd orig/raws/${sp} && ls N[CT]_*[0-9]` )
        set rfile = "${f}.ranges"
        set bfile = "${f}.bugs"
        ( cd orig/raws/${sp} && \
          ../../../../tools/extract_RefSeq_ranges \
              $f \
            | sort -k2,3n -k1,1n \
            | uniq \
            > ${rfile} \
        ) >& orig/raws/${sp}/${bfile}
      end
      dicio-wc orig/raws/${sp}/*.ranges
      dicio-wc orig/raws/${sp}/*.bugs
    end

  Let's compare these ranges with those in Saxonov's EID database.
  We unpacked the file "hEID" from Saxonov's "at2004" dataset (Arabidopsis)
  into the directory "../saxon/junk/at2004.hEID":
  
    set jj = "../saxon/junk/at2004"
    bzcat ../saxon/orig/eids/at2004/hEID.bz2 > ${jj}.hEID
    
  Recall that the ".hEID" file is divided into many items, one for
  each CDS. Each item looks like a GenBank file, whose FEATURES
  section contains a single CDS feature and whose 
  ORIGIN section is lacking, with some extra headers and separators.
  
  We seleted the Chromosome 1 items from this "${jj}.hEID" file 
  and packed them into a single file "${jj}.zEID"
  in almost-correct GenBank flatfile format, but with an
  empty ORIGIN section. Namely:
    
    cat ${jj}.hEID \
      | expand \
      | sed -e '/^[>]/d' -e '/^intron/d' -e '/_HEADER_END_/Q' \
      > ${jj}.zEID
    cat ${jj}.hEID \
      | expand \
      | sed -e '/chromosome 2/Q' \
      | sed -e '/^[ ]*CDS[ ]/,/^[ ]*[\/][\/][ ]*$/\!d' \
      | sed -e '/^[ ]*[\/][\/][ ]*$/d' -e '/BASE COUNT/d' \
      >> ${jj}.zEID
    echo "ORIGIN " >> ${jj}.zEID
    echo "//" >> ${jj}.zEID
    
  We extracted the labeling ranges from this file, and compares it
  with the CDS ranges previously extracted from the newer NCBI 
  Arabidopsis RefSeq file:
    
    ( ../tools/extract_RefSeq_ranges \
         ${jj}.zEID \
        | sort -k2,3n -k1,1n \
        | uniq \
        > ${jj}.zEID.ranges ) \
      >& ${jj}.zEID.bugs
      
    cat ${jj}.zEID.ranges \
      | sed -e '/^[>]/d' \
      | gawk '//{ print $1,$2,$3,$5; }' \
      | sort | uniq \
      > .saxon
      
    cat orig/raws/at-2008-05-22/NC_003070.6.ranges \
      | sed -e '/^[>]/d' -e '/[N]/d' \
      | gawk '//{ print $1,$2,$3,$5; }' \
      | sort | uniq \
      > .ncbir
      
    bool 1-2 .saxon .ncbir > .s-n
    bool 2-1 .saxon .ncbir > .n-s
    dicio-wc .saxon .ncbir .s-n .n-s
    
        lines   words     bytes file        
      ------- ------- --------- ------------
        35860  143440    864120 .saxon
        38216  152864    920510 .ncbir
         1390    5560     33656 .s-n
         3746   14984     90046 .n-s

  Thus there are about 12% differences at the CDS range level.
  Let's compare the CDS regions themselves:
  
    cat ${jj}.zEID \
      | egrep -e '^[ ]*CDS[ ][ ]' \
      | sort | uniq \
      > .saxon-cds
      
    cat orig/raws/at-2008-05-22/NC_003070.6 \
      | egrep -e '^[ ]*CDS[ ][ ]' \
      | sort | uniq \
      > .ncbir-cds

    bool 1-2 .saxon-cds .ncbir-cds > .s-n-cds
    bool 2-1 .saxon-cds .ncbir-cds > .n-s-cds
    dicio-wc .saxon-cds .ncbir-cds .s-n-cds .n-s-cds
    
        lines   words     bytes file        
      ------- ------- --------- ------------
         5594   11188    400858 .saxon-cds
         7553   15106    497419 .ncbir-cds
          363     726     25830 .s-n-cds
         2322    4644    122391 .n-s-cds

  There are many changes, especially new CDSs in the NCBI file. Most
  (if not all) of the new sequences appear to have been inferred by
  homology. Some (if not all) of the losses are CDSs whose span was
  corrected.
  
  Conclusion: it appears that we 

CHECKING WHETHER THERE ARE ANY NONALIGNED CDS FEATURES

  The "/codon_start" qualifier of a CDS is supposed to tell
  where the first codon of the CDS resides relative to the
  beginning of the CDS.  
  
  I canot understand the purpose of this qualifier,
  except perhaps for those cases where part of the CDS
  lies outside the "source" range.  If that is the 
  only reason, then one should always have "/codon_start=1"
  in whole-molecule files. 
  
  Let's check:
  
    foreach sp ( ${species} )
      echo " "; echo "=== ${sp} ==="
      foreach f ( `ls orig/raws/${sp}/N[CT]_*[0-9]` )
        echo "# ${f}"
        egrep \
            -n -e '[\/]codon_start *[=] *[^1 ]' \
            ${f} \
          | sed -e 's:^:    :'
      end
    end
        
  There are no such things in the at-2008-05-22 and ce-2008-06-11
  datasets. In the dr-2008-06-10 and dm-2008-05-14 datsets there are
  about 7 such things per file. Some (all?) belong to RNA regions
  which were found by automated programs (GNOMON or BestRefSeq).
  
  We should print the "/note=" qualifier of RNA and CDS regions and
  see whether we can disqualify them automatically. Perhaps label them
  in some way, e.g. [defn] instead of [DEFN]?

COLLECTING THE NCBI CODE TABLES

  For the RefSeq project, the NCBI produced a set of codon-aminoacid
  translation tables.  The correct table to use is 
  identified by the "/transl_table" attribute of the 
  "CDS" feature.  The default is "/transl_table=1" (the Stanard Code).
  
  To use this attribute, we fetched the code table page 
  from NCBI
  
    http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG1
  
  and semi-automatically converted it into a set of files 
  "codes/ncbi-{NNN}.tbl" where {NNN} is the table number (3 digits).
  
  Apart from '#'-comments and blank lines, the table contains
  64 lines in the format '{CODON} {AMINO}{START}' where
  
    {AMINO} is the single-letter aminoacid code, or '*' for STOP;
    
    {START} is a protein letter for codons that may be initiators, '-' otherwise.
    
  Listing all tables side by side: 
  
    set tables = ( `ls codes/ncbi-[0-9][0-9][0-9].tbl` )
    echo ${tables}

    set tbltit = "`echo ${tables} | sed -e 's:[^ ]*[-]::g' -e 's:[.][^ ]*::g'`"
    echo ${tbltit}
    
    multicol \
        -v title="${tbltit}" \
        -v colsep=' ' \
        -v clean=1 \
        ${tables} \
      | gawk '//{ gsub(/^[ ]+/,"");for(i=NF-1;i>=2;i-=2) { if($(i)==$1){$(i)="";}} print}'
      
      AAA K-  K-  K-  K-  K-  K-  N-  K-  K-  K-  K-  N-  K-  K-  N-  K-  K-
      AAT N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-
      AAC N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-  N-
      AAG K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-  K-
      ATA I-  MM  MM  IM  MM  I-  I-  I-  IM  I-  MM  I-  I-  I-  M-  I-  I-
      ATT I-  IM  I-  IM  IM  I-  I-  I-  IM  I-  I-  I-  I-  I-  I-  I-  IM
      ATC I-  IM  I-  IM  IM  I-  I-  I-  IM  I-  I-  I-  I-  I-  I-  I-  I-
      ATG MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM  MM
      ACA T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-
      ACT T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-
      ACC T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-
      ACG T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-  T-
      AGA R-  *-  R-  R-  S-  R-  S-  R-  R-  R-  G-  S-  R-  R-  S-  R-  R-
      AGT S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-
      AGC S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-
      AGG R-  *-  R-  R-  S-  R-  S-  R-  R-  R-  G-  S-  R-  R-  S-  R-  R-
      TAA *-  *-  *-  *-  *-  Q-  *-  *-  *-  *-  *-  Y-  *-  *-  *-  *-  *-
      TAT Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-
      TAC Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-  Y-
      TAG *-  *-  *-  *-  *-  Q-  *-  *-  *-  *-  *-  *-  Q-  L-  *-  L-  *-
      TTA L-  L-  L-  LM  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  *-
      TTT F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-
      TTC F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-  F-
      TTG LM  L-  L-  LM  LM  L-  L-  L-  LM  L-  LM  L-  L-  L-  L-  L-  L-
      TCA S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  *-  S-
      TCT S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-
      TCC S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-
      TCG S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-  S-
      TGA *-  W-  W-  W-  W-  *-  W-  C-  *-  *-  W-  W-  *-  *-  W-  *-  *-
      TGT C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-
      TGC C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-  C-
      TGG W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-  W-
      CAA Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-
      CAT H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-
      CAC H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-  H-
      CAG Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-  Q-
      CTA L-  L-  T-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-
      CTT L-  L-  T-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-
      CTC L-  L-  T-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-  L-
      CTG LM  L-  T-  LM  L-  L-  L-  L-  LM  SM  L-  L-  L-  L-  L-  L-  L-
      CCA P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-
      CCT P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-
      CCC P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-
      CCG P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-  P-
      CGA R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-
      CGT R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-
      CGC R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-
      CGG R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-  R-
      GAA E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-
      GAT D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-
      GAC D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-  D-
      GAG E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-  E-
      GTA V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-
      GTT V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-
      GTC V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-  V-
      GTG V-  VM  V-  VM  VM  V-  VM  V-  VM  V-  VM  V-  V-  V-  VM  V-  VM
      GCA A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-
      GCT A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-
      GCC A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-
      GCG A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-  A-
      GGA G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-
      GGT G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-
      GGC G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-
      GGG G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-  G-
      
>>TO DO>>

CONSOLIDATING THE LABELING RANGES

  !!! REVIEW !!! We wrote a tool {extract_refSeq_segment} that takes an annotated
  GenBank flatfile, a range of positions, and outputs the ".bas" and
  ".lab" files of that segment of the file.
  
    ../tools/extract_RefSeq_segment \
        -v ini=450000 -v fin=499999 \
        -v outName="test" \
      orig/raws/at-2008-05-22/NC_003071.4 >& .log
      
   !! Shoudl write {outName}-1.bas,lab in order of increasing strand 0 ??