# Last edited on 2008-06-14 11:21:45 by stolfi

Processing of Reese's human genome dataset.

OBTAINING THE DATA

  Fetching:
  
    mkdir -p orig orig/raws orig/raws-junk
    
    pushd orig
    wget -nv http://www.fruitfly.org/seq_tools/datasets/Human/multi_exon_GB.dat 
    wget -nv http://www.fruitfly.org/seq_tools/datasets/Human/multi_exon_GB.sets
    wget -nv http://www.fruitfly.org/seq_tools/datasets/scripts/unpack.pl
    popd

  Unpacking:
  
    pushd orig/raws
    unpack.pl ../multi_exon_GB.dat
    popd

  Listing the data sets:
  
  ( cd orig/raws && ls [A-Z]*[A-Z0-9] ) > raws.dir
    
EXTRACTION AND FORMAT CONVERSION 

  Converting the datasets to the bas/lab/ama format of 
  the {dnabayes} tools:
  
    mkdir -p pick/item
    
    cp -p orig/raws.dir pick/item.cands
    
    extract-all pick/item orig/raws pick/item
    ( cd pick/item && ls *.bas ) | sed -e 's:[.]bas$::' > pick/item.dir
    dicio-wc pick/item.{cands,bugs,dir}
  
REMOVING BAD DATASETS

>>>=>>> STOPPED HERE !!!

    AF036329   | non-experimental                           
    HSAPC3A    | complementar                               
    AF032455   | non-exeperimental                          
    HUMTBGA    | multi-source                               
    HUMGCB1    | multi-source                               
    AF017257   | non-experimental,multi-source,complementar 
    HSCLN3     | bad letter 'r'                             
    HSU66711   | bad letter 'b'                             
    AF010238   | bad letter 'k'                             
    AF000573   | bad letter 'n'                             
    AF006501   | bad letter 'n'                             
    D88010     | bad letter 'n'                             
    HSAPOAIA   | bad letter 'n'                             
    HSDNAMIA   | bad letter 'n'                             
    HSGEBCMA   | bad letter 'n'                             
    HSHCF1     | bad letter 'n'                             
    HSMYBPC3   | bad letter 'n'                             
    HSN10C3    | bad letter 'n'                             
    HSPPTII    | bad letter 'n'                             
    HSRSGCG    | bad letter 'n'                             
    HSSAA1A    | bad letter 'n'                             
    HSU26425   | bad letter 'n'                             
    HSU29895   | bad letter 'n'                             
    HSU29953   | bad letter 'n'                             
    HSU32323   | bad letter 'n'                             
    HSU37022   | bad letter 'n'                             
    HSU37106   | bad letter 'n'                             
    HSU46165   | bad letter 'n'                             
    HSU50871   | bad letter 'n'                             
    HSU73002   | bad letter 'n'                             
    HSU89387   | bad letter 'n'                             
    HSUBR      | bad letter 'n'                             
    HSV698D2   | bad letter 'n'                             
    HUMATP1A2  | bad letter 'n'                             
    HUMCD79B   | bad letter 'n'                             
    HUMCHYMASE | bad letter 'n'                             
    HUMG0S24B  | bad letter 'n'                             
    HUMGCAPB   | bad letter 'n'                             
    HUMHIS102  | bad letter 'n'                             
    HUMHOX13G  | bad letter 'n'                             
    HUMIL1B    | bad letter 'n'                             
    HUMMCHEMP  | bad letter 'n'                             
    HUMMHCD8A  | bad letter 'n'                             
    HUMPREELAS | bad letter 'n'                             
    HUMRIGBCHA | bad letter 'n'                             
    HUMRPS6B   | bad letter 'n'                             
    HUMTDGF1A  | bad letter 'n'                             
    HUMTKRA    | bad letter 'n'                             
    L29472     | bad letter 'n'                             
    HSEOTAX    | bad letter 's'                             
    HSUSF2     | bad letter 's'                             
    HSE48ATGN  | bad letter 'w'                             
    HSU52427   | bad letter 'w'                             
    HSU96846   | bad letter 'w'                             
    HSIGK12    | bad letter 'y'                             
    HSIL1RECA  | bad letter 'y'                             
    HUMEDN1B   | bad letter 'y'                             
    HUMAZCDI   | bad range format "(237.241)..310"          
    HSGROW2    | bad range format "(274.275)..344"          
    HUMIL8A    | bad range format "3370..(4216.4236)"       
    HUMADPRF02 | bad range format "M74492:1319..1536"       
    HSCSRP2S2  | bad range format "U95017:874..955"         
    HSALADG    | more than one mRNA entry                   
    HSCKIIBE   | more than one mRNA entry                   
    HSCYP450   | more than one mRNA entry                   
    HSGLTH1    | more than one mRNA entry                   
    HSIFNAR    | more than one mRNA entry                   
    HSINSU     | more than one mRNA entry                   
    HSPNMTB    | more than one mRNA entry                   
    HSPR264SC  | more than one mRNA entry                   
    HSSLIPG    | more than one mRNA entry                   
    HSTCRT3D   | more than one mRNA entry                   
    HSU43901   | more than one mRNA entry                   
    HSUBA52G   | more than one mRNA entry                   
    HUMHMGIY   | more than one mRNA entry                   
    HUMHSP27X  | more than one mRNA entry                   

  The "for-balance" ones were removed in order to keep
  Reese's test sets more or less balanced.

  Those bad files were moved from "orig/raws/" to "orig/raws-junk/".
  
  ( cd orig/raws && ls [A-Z]*[A-Z0-9] ) > raws.dir
  ( cd orig/raws-junk && ls [A-Z]*[A-Z0-9] ) > raws-junk.dir

COUNTING BASES IN EACH ITEM

  Counting bases in each item

    rm -f item-sizes-all.txt
    foreach f ( `cd pick/item && ls *.bas` ) 
      cat pick/item/$f \
        | tr -d -c 'acgtux' \
        | wc -c \
        | gawk '//{ printf "%10d", $1 }' \
        >> item-sizes-all.txt
      echo " ${f:r}" >> item-sizes-all.txt
    end

DATASETS

  We split the list of "raws" item names into seven files "all-00.gset" a
  "all-06.gset" with 64 item names each; these are in the "sets"
  sub-directory.  The file "sets/all.gset" is their union.
  
    split-sets -v nSets=7 -v outName=sets/all item-sizes-all.txt
    cat sets/all-0[0-6].gset | sort | uniq > sets/all.gset
    
      set sizes after 0 passes:
            850498 [00]
            512160 [01]
           1042687 [02]
            857306 [03]
            638615 [04]
            794380 [05]
            562535 [06]
      average size     751168.7  deviation   174584.3

      set sizes after 2 passes:
            751169 [00]
            751168 [01]
            751169 [02]
            751169 [03]
            751168 [04]
            751169 [05]
            751169 [06]
      average size     751168.7  deviation        0.5

    dicio-wc sets/all{,-0[0-6]}.gset

        lines   words     bytes file        
      ------- ------- --------- ------------
          448     448      3875 sets/all.gset
           64      64       558 sets/all-00.gset
           64      64       555 sets/all-01.gset
           64      64       552 sets/all-02.gset
           64      64       549 sets/all-03.gset
           64      64       552 sets/all-04.gset
           64      64       548 sets/all-05.gset
           64      64       561 sets/all-06.gset

  Checking that no items were lost:
  
    cat sets/all-0[0-6].gset \
      | gawk '//{ printf "%s.bas\n", $1;}' \
      | sort \
      > .basfiles-1
    ( cd pick/item && ls *.bas ) \
      | sort \
      > .basfiles-2
    dicio-wc .basfiles-1 .basfiles-2
    diff .basfiles-1 .basfiles-2
    
        lines   words     bytes file        
      ------- ------- --------- ------------
          448     448      5667 .basfiles-1
          448     448      5667 .basfiles-2
  
  For debugging purposes, we also picked 14 items with about 5000 bases
  each (seven A* and seven H*), and saved their size/name pairs in
  "item-sizes-few.txt".  We then split this set into seven sets of
  two items each )one of each kind), "sets/few-00.gset"
  to "sets/few-06.gset"
  
    split-sets -v nSets=7 -v outName=sets/few item-sizes-few.txt
    cat sets/few-0[0-6].gset | sort | uniq > sets/few.gset
  
      set sizes after 0 passes:
             10987 [00]
             10751 [01]
             10410 [02]
             10040 [03]
              9576 [04]
              9399 [05]
              9163 [06]
      average size      10046.6  deviation      647.4

      set sizes after 1 passes:
             10098 [00]
             10074 [01]
              9991 [02]
             10040 [03]
              9995 [04]
             10076 [05]
             10052 [06]
      average size      10046.6  deviation       38.0

    dicio-wc sets/few{,-0[0-6]}.gset

        lines   words     bytes file        
      ------- ------- --------- ------------
           14      14       119 sets/few.gset
            2       2        18 sets/few-00.gset
            2       2        18 sets/few-01.gset
            2       2        18 sets/few-02.gset
            2       2        15 sets/few-03.gset
            2       2        16 sets/few-04.gset
            2       2        16 sets/few-05.gset
            2       2        18 sets/few-06.gset

  Re-counting total bases in each set:
  
    foreach f ( sets/all-0[0-9].gset sets/few-0[0-9].gset ) 
      set basfiles = ( `cat $f | sed -e 's:$:.bas:g'` )
      ( cd pick/item && cat ${basfiles} ) \
        | tr -d -c 'acgtux' \
        | wc -c \
        | gawk '//{ printf "%10d", $1 }'
      echo " $f"
    end
  
        751169 sets/all-00.gset
        751168 sets/all-01.gset
        751169 sets/all-02.gset
        751169 sets/all-03.gset
        751168 sets/all-04.gset
        751169 sets/all-05.gset
        751169 sets/all-06.gset
        
         10098 sets/few-00.gset
         10074 sets/few-01.gset
          9991 sets/few-02.gset
         10040 sets/few-03.gset
          9995 sets/few-04.gset
         10076 sets/few-05.gset
         10052 sets/few-06.gset

PROBABILITIES

  Computing the probabilities from all datasets:
  
    mkdir prob
    foreach ws ( 1 2 3 )
      echo "** REVISE -- tabulate-probs has changed"
      echo "** tabulate-probs all ${ws} sets pick/item prob
      echo "** dicio-wc prob/all-${ws}.prb
    end

      448 biosequences processed.

        lines   words     bytes file        
      ------- ------- --------- ------------
           18      79       551 prob/all-1.prb
          258    1039      7303 prob/all-2.prb
         4098   16399    123015 prob/all-3.prb


REGION LENGTHS

  Computing the mean lengths of coding and non-coding regions:
  
    get-region-lengths all sets pick/item prob

    plot-region-lengths prob/all
    
  Analyzing the outputs of this script (both the ".lens" files
  and the warnings about short sequences) we conclude that the 
  length distributions of the "K" and "N" regions are quite different.
  Specifically: 
  
    "K" regions:
    
      * The "K" regions may have any length {m}, from 3 bases up.

      * There are 10 "K" regions (out of 2742) which have the minimum length 
        (3 bases).  They are all complete, in-phase codons ("DEF" labels).

      * The "K" regions with {m = 4} (3 of them) all include one complete
        codon; i.e., none has labels "EFDE"

      * The probability of a "K" region having length {m} looks
        like a Gaussian with mean ~120.0 and deviation ~66.0.
        The distribution is truncated at the low end
        and a bit inflated for lengths between 250 and 350.

      * However there is a conspicuous peak at {m = 54}
        (there are 58 exons with that length).

    "N" regions:
    
      * The minimum length {m} for an "N" region that is /internal/ to the
        item (between two "K" regions) appears to be 66. In 3180 "N"
        regions, there is one internal "N" region with {m = 48}, one
        with {m = 57}, one with {m = 66}, three with length {m = 67},
        and then the counts go up..
        
      * The "N" regions at the beginning or at the end of a item
        file may be much shorter, down to 6 bases.  Those are
        presumably the UTRs (untranslated regions) of geneticists.
        
      * The length distribution for "N" bases between {m = ~66} bases
        and {m = ~132} bases is very irregular; the total probability
        of this range is ~0.15. The distribution can be described as a
        Gaussian with mean ~88 and deviation ~11, plus a somewhat
        uniform distribution between ~99 and ~132. However there are
        conspicuous fat peaks at ~111, ~116, and ~130, each with
        deviation ~2.
        
      * Above ~132 bases, the probability distribution appears to
        decay exponentially with constant {tau} around -0.00150.
    

JUNK

  For Renatha's thesis we removed these data sets only:

    AB000381   | for balance 
    AB002059   | for balance 
    AB005803   | for balance 
    AF001689   | for balance 
    AF017257   | non-experimental,multi-source,complementar 
    AF032455   | non-exeperimental                          
    AF036329   | non-experimental                           
    HS2OXOC    | for balance 
    HSALDOA    | for balance 
    HSAPC3A    | complementar                               
    HSCFOS     | for-balance                                
    HSCLN3     | bad letter 'r'                             
    HUMGCB1    | multi-source                               
    HUMTBGA    | multi-source