# Last edited on 2008-06-14 11:21:45 by stolfi Processing of Reese's human genome dataset. OBTAINING THE DATA Fetching: mkdir -p orig orig/raws orig/raws-junk pushd orig wget -nv http://www.fruitfly.org/seq_tools/datasets/Human/multi_exon_GB.dat wget -nv http://www.fruitfly.org/seq_tools/datasets/Human/multi_exon_GB.sets wget -nv http://www.fruitfly.org/seq_tools/datasets/scripts/unpack.pl popd Unpacking: pushd orig/raws unpack.pl ../multi_exon_GB.dat popd Listing the data sets: ( cd orig/raws && ls [A-Z]*[A-Z0-9] ) > raws.dir EXTRACTION AND FORMAT CONVERSION Converting the datasets to the bas/lab/ama format of the {dnabayes} tools: mkdir -p pick/item cp -p orig/raws.dir pick/item.cands extract-all pick/item orig/raws pick/item ( cd pick/item && ls *.bas ) | sed -e 's:[.]bas$::' > pick/item.dir dicio-wc pick/item.{cands,bugs,dir} REMOVING BAD DATASETS >>>=>>> STOPPED HERE !!! AF036329 | non-experimental HSAPC3A | complementar AF032455 | non-exeperimental HUMTBGA | multi-source HUMGCB1 | multi-source AF017257 | non-experimental,multi-source,complementar HSCLN3 | bad letter 'r' HSU66711 | bad letter 'b' AF010238 | bad letter 'k' AF000573 | bad letter 'n' AF006501 | bad letter 'n' D88010 | bad letter 'n' HSAPOAIA | bad letter 'n' HSDNAMIA | bad letter 'n' HSGEBCMA | bad letter 'n' HSHCF1 | bad letter 'n' HSMYBPC3 | bad letter 'n' HSN10C3 | bad letter 'n' HSPPTII | bad letter 'n' HSRSGCG | bad letter 'n' HSSAA1A | bad letter 'n' HSU26425 | bad letter 'n' HSU29895 | bad letter 'n' HSU29953 | bad letter 'n' HSU32323 | bad letter 'n' HSU37022 | bad letter 'n' HSU37106 | bad letter 'n' HSU46165 | bad letter 'n' HSU50871 | bad letter 'n' HSU73002 | bad letter 'n' HSU89387 | bad letter 'n' HSUBR | bad letter 'n' HSV698D2 | bad letter 'n' HUMATP1A2 | bad letter 'n' HUMCD79B | bad letter 'n' HUMCHYMASE | bad letter 'n' HUMG0S24B | bad letter 'n' HUMGCAPB | bad letter 'n' HUMHIS102 | bad letter 'n' HUMHOX13G | bad letter 'n' HUMIL1B | bad letter 'n' HUMMCHEMP | bad letter 'n' HUMMHCD8A | bad letter 'n' HUMPREELAS | bad letter 'n' HUMRIGBCHA | bad letter 'n' HUMRPS6B | bad letter 'n' HUMTDGF1A | bad letter 'n' HUMTKRA | bad letter 'n' L29472 | bad letter 'n' HSEOTAX | bad letter 's' HSUSF2 | bad letter 's' HSE48ATGN | bad letter 'w' HSU52427 | bad letter 'w' HSU96846 | bad letter 'w' HSIGK12 | bad letter 'y' HSIL1RECA | bad letter 'y' HUMEDN1B | bad letter 'y' HUMAZCDI | bad range format "(237.241)..310" HSGROW2 | bad range format "(274.275)..344" HUMIL8A | bad range format "3370..(4216.4236)" HUMADPRF02 | bad range format "M74492:1319..1536" HSCSRP2S2 | bad range format "U95017:874..955" HSALADG | more than one mRNA entry HSCKIIBE | more than one mRNA entry HSCYP450 | more than one mRNA entry HSGLTH1 | more than one mRNA entry HSIFNAR | more than one mRNA entry HSINSU | more than one mRNA entry HSPNMTB | more than one mRNA entry HSPR264SC | more than one mRNA entry HSSLIPG | more than one mRNA entry HSTCRT3D | more than one mRNA entry HSU43901 | more than one mRNA entry HSUBA52G | more than one mRNA entry HUMHMGIY | more than one mRNA entry HUMHSP27X | more than one mRNA entry The "for-balance" ones were removed in order to keep Reese's test sets more or less balanced. Those bad files were moved from "orig/raws/" to "orig/raws-junk/". ( cd orig/raws && ls [A-Z]*[A-Z0-9] ) > raws.dir ( cd orig/raws-junk && ls [A-Z]*[A-Z0-9] ) > raws-junk.dir COUNTING BASES IN EACH ITEM Counting bases in each item rm -f item-sizes-all.txt foreach f ( `cd pick/item && ls *.bas` ) cat pick/item/$f \ | tr -d -c 'acgtux' \ | wc -c \ | gawk '//{ printf "%10d", $1 }' \ >> item-sizes-all.txt echo " ${f:r}" >> item-sizes-all.txt end DATASETS We split the list of "raws" item names into seven files "all-00.gset" a "all-06.gset" with 64 item names each; these are in the "sets" sub-directory. The file "sets/all.gset" is their union. split-sets -v nSets=7 -v outName=sets/all item-sizes-all.txt cat sets/all-0[0-6].gset | sort | uniq > sets/all.gset set sizes after 0 passes: 850498 [00] 512160 [01] 1042687 [02] 857306 [03] 638615 [04] 794380 [05] 562535 [06] average size 751168.7 deviation 174584.3 set sizes after 2 passes: 751169 [00] 751168 [01] 751169 [02] 751169 [03] 751168 [04] 751169 [05] 751169 [06] average size 751168.7 deviation 0.5 dicio-wc sets/all{,-0[0-6]}.gset lines words bytes file ------- ------- --------- ------------ 448 448 3875 sets/all.gset 64 64 558 sets/all-00.gset 64 64 555 sets/all-01.gset 64 64 552 sets/all-02.gset 64 64 549 sets/all-03.gset 64 64 552 sets/all-04.gset 64 64 548 sets/all-05.gset 64 64 561 sets/all-06.gset Checking that no items were lost: cat sets/all-0[0-6].gset \ | gawk '//{ printf "%s.bas\n", $1;}' \ | sort \ > .basfiles-1 ( cd pick/item && ls *.bas ) \ | sort \ > .basfiles-2 dicio-wc .basfiles-1 .basfiles-2 diff .basfiles-1 .basfiles-2 lines words bytes file ------- ------- --------- ------------ 448 448 5667 .basfiles-1 448 448 5667 .basfiles-2 For debugging purposes, we also picked 14 items with about 5000 bases each (seven A* and seven H*), and saved their size/name pairs in "item-sizes-few.txt". We then split this set into seven sets of two items each )one of each kind), "sets/few-00.gset" to "sets/few-06.gset" split-sets -v nSets=7 -v outName=sets/few item-sizes-few.txt cat sets/few-0[0-6].gset | sort | uniq > sets/few.gset set sizes after 0 passes: 10987 [00] 10751 [01] 10410 [02] 10040 [03] 9576 [04] 9399 [05] 9163 [06] average size 10046.6 deviation 647.4 set sizes after 1 passes: 10098 [00] 10074 [01] 9991 [02] 10040 [03] 9995 [04] 10076 [05] 10052 [06] average size 10046.6 deviation 38.0 dicio-wc sets/few{,-0[0-6]}.gset lines words bytes file ------- ------- --------- ------------ 14 14 119 sets/few.gset 2 2 18 sets/few-00.gset 2 2 18 sets/few-01.gset 2 2 18 sets/few-02.gset 2 2 15 sets/few-03.gset 2 2 16 sets/few-04.gset 2 2 16 sets/few-05.gset 2 2 18 sets/few-06.gset Re-counting total bases in each set: foreach f ( sets/all-0[0-9].gset sets/few-0[0-9].gset ) set basfiles = ( `cat $f | sed -e 's:$:.bas:g'` ) ( cd pick/item && cat ${basfiles} ) \ | tr -d -c 'acgtux' \ | wc -c \ | gawk '//{ printf "%10d", $1 }' echo " $f" end 751169 sets/all-00.gset 751168 sets/all-01.gset 751169 sets/all-02.gset 751169 sets/all-03.gset 751168 sets/all-04.gset 751169 sets/all-05.gset 751169 sets/all-06.gset 10098 sets/few-00.gset 10074 sets/few-01.gset 9991 sets/few-02.gset 10040 sets/few-03.gset 9995 sets/few-04.gset 10076 sets/few-05.gset 10052 sets/few-06.gset PROBABILITIES Computing the probabilities from all datasets: mkdir prob foreach ws ( 1 2 3 ) echo "** REVISE -- tabulate-probs has changed" echo "** tabulate-probs all ${ws} sets pick/item prob echo "** dicio-wc prob/all-${ws}.prb end 448 biosequences processed. lines words bytes file ------- ------- --------- ------------ 18 79 551 prob/all-1.prb 258 1039 7303 prob/all-2.prb 4098 16399 123015 prob/all-3.prb REGION LENGTHS Computing the mean lengths of coding and non-coding regions: get-region-lengths all sets pick/item prob plot-region-lengths prob/all Analyzing the outputs of this script (both the ".lens" files and the warnings about short sequences) we conclude that the length distributions of the "K" and "N" regions are quite different. Specifically: "K" regions: * The "K" regions may have any length {m}, from 3 bases up. * There are 10 "K" regions (out of 2742) which have the minimum length (3 bases). They are all complete, in-phase codons ("DEF" labels). * The "K" regions with {m = 4} (3 of them) all include one complete codon; i.e., none has labels "EFDE" * The probability of a "K" region having length {m} looks like a Gaussian with mean ~120.0 and deviation ~66.0. The distribution is truncated at the low end and a bit inflated for lengths between 250 and 350. * However there is a conspicuous peak at {m = 54} (there are 58 exons with that length). "N" regions: * The minimum length {m} for an "N" region that is /internal/ to the item (between two "K" regions) appears to be 66. In 3180 "N" regions, there is one internal "N" region with {m = 48}, one with {m = 57}, one with {m = 66}, three with length {m = 67}, and then the counts go up.. * The "N" regions at the beginning or at the end of a item file may be much shorter, down to 6 bases. Those are presumably the UTRs (untranslated regions) of geneticists. * The length distribution for "N" bases between {m = ~66} bases and {m = ~132} bases is very irregular; the total probability of this range is ~0.15. The distribution can be described as a Gaussian with mean ~88 and deviation ~11, plus a somewhat uniform distribution between ~99 and ~132. However there are conspicuous fat peaks at ~111, ~116, and ~130, each with deviation ~2. * Above ~132 bases, the probability distribution appears to decay exponentially with constant {tau} around -0.00150. JUNK For Renatha's thesis we removed these data sets only: AB000381 | for balance AB002059 | for balance AB005803 | for balance AF001689 | for balance AF017257 | non-experimental,multi-source,complementar AF032455 | non-exeperimental AF036329 | non-experimental HS2OXOC | for balance HSALDOA | for balance HSAPC3A | complementar HSCFOS | for-balance HSCLN3 | bad letter 'r' HUMGCB1 | multi-source HUMTBGA | multi-source