# Last edited on 2008-07-04 13:06:14 by stolfi PROJECT Buiding a coding region database "from scratch". OVERVIEW The Reese and Saxonov databases are not quite appropriate for what we need: * they are more concerned with intron/exon than with coding/noncoding studies. * they show only primal-view sequences. We need to be able to detect complemented coding regions too. * they may be rather misleading on the lengths of introns. * they give several separate partial views for genes that have multiple splicings, instead of a single unified view. Thus we decided to create another database, with the following criteria: * Each item is a maximal-length contig (preferably a whole chromosome or plasmid, an entire viral genome, etc.), possibly containing many genes. * The label file marks with [DEF] all regions present in the file that are or were originally coding regions. * Coding regions in the complementary strand are identified too. * Coding regions of pseudogenes are coded with [DEF] when their phase is precisely known. * Regions where two or more coding regions overlap with different phases or opposite orientations are marked disnctlively from normal coding regions. * Transcribed but non-coding regions may be marked, but that is not a high priority. FETCHING We chose to use the genomic RefSeq files available from NCBI because they are good-quality DNA sequences for whole genomic biomolecules (chromosomes, mtDNA, plastids, etc.), with authoritative annotations. The file names for complete genomic molecules are "{HH}_{KKKKKK}.{V}" where {HH} is "NT" or "NC", {KKKKKK} is a six-digit number and {V} a numeric version code. To identify the definition and newest version of a file, use ../tools/ncbi_head {HH}_{KKKKKK} To fetch the newest version of a file, use ../tools/ncbi_fetch {OUTDIR} {HH}_{KKKKKK} To identify or fetch a specific version, add a ".{V}" after the file name. Directory where to put the fetched files: mkdir -p orig/raws We group the files into datasets, each dataset being a specific version of the genome files of a single species. For each dataset we prepare a file called "orig/raws/{SS}-{YYYY}-{MM}-{DD}.fetch" which contains a list of the versioned names of the NCBI RefSeq files that comprise it. Here {SS} identifies the species (e.g. "at" for /Arabidopsis thaliana/) and "{YYYY}-{MM}-{DD}" is the date of the most recent file in the set (as given in its LOCUS line). set species = ( \ at-2008-05-22 \ ce-2008-06-11 \ dm-2008-05-14 \ dr-2008-06-10 \ ) set logfile = "orig/raws/fetch-`yyyy-mm-dd-hhmmss`.log" /bin/rm -f ${logfile} foreach sp ( ${species} ) mkdir -p orig/raws/${sp} foreach f ( `cat orig/raws/${sp}.fetch | sed -e 's:[\#].*$::'` ) echo "${sp}/${f} ..." ../tools/ncbi_fetch orig/raws/${sp} ${f} >> ${logfile} echo " " >> ${logfile} sleep 3 end ls -l orig/raws/${sp} end ls -l orig/raws/*/* -rw-r--r-- 1 stolfi staff 304697 Jun 14 20:39 orig/raws/at-2008-05-22/NC_000932.1 -rw-r--r-- 1 stolfi staff 634486 Jun 14 20:39 orig/raws/at-2008-05-22/NC_001284.2 -rw-r--r-- 1 stolfi staff 69761140 Jun 15 07:42 orig/raws/at-2008-05-22/NC_003070.6 -rw-r--r-- 1 stolfi staff 42596737 Jun 14 20:44 orig/raws/at-2008-05-22/NC_003071.4 -rw-r--r-- 1 stolfi staff 50981491 Jun 14 20:44 orig/raws/at-2008-05-22/NC_003074.5 -rw-r--r-- 1 stolfi staff 40550269 Jun 14 20:42 orig/raws/at-2008-05-22/NC_003075.4 -rw-r--r-- 1 stolfi staff 59856528 Jun 14 20:46 orig/raws/at-2008-05-22/NC_003076.5 -rw-r--r-- 1 stolfi staff 33461 Jun 14 20:46 orig/raws/ce-2008-06-11/NC_001328.1 -rw-r--r-- 1 stolfi staff 29541252 Jun 14 20:46 orig/raws/ce-2008-06-11/NC_003279.5 -rw-r--r-- 1 stolfi staff 29581575 Jun 14 20:47 orig/raws/ce-2008-06-11/NC_003280.6 -rw-r--r-- 1 stolfi staff 27667988 Jun 14 20:48 orig/raws/ce-2008-06-11/NC_003281.7 -rw-r--r-- 1 stolfi staff 32275175 Jun 14 20:49 orig/raws/ce-2008-06-11/NC_003282.4 -rw-r--r-- 1 stolfi staff 38638065 Jun 14 20:49 orig/raws/ce-2008-06-11/NC_003283.7 -rw-r--r-- 1 stolfi staff 31469892 Jun 14 20:50 orig/raws/ce-2008-06-11/NC_003284.6 -rw-r--r-- 1 stolfi staff 60705 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_001709.1 -rw-r--r-- 1 stolfi staff 2097573 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_004353.3 -rw-r--r-- 1 stolfi staff 32501734 Jun 14 20:56 orig/raws/dm-2008-05-14/NC_004354.3 -rw-r--r-- 1 stolfi staff 41489858 Jun 14 20:55 orig/raws/dm-2008-05-14/NT_033777.2 -rw-r--r-- 1 stolfi staff 32254136 Jun 14 20:53 orig/raws/dm-2008-05-14/NT_033778.3 -rw-r--r-- 1 stolfi staff 33803458 Jun 14 20:52 orig/raws/dm-2008-05-14/NT_033779.4 -rw-r--r-- 1 stolfi staff 35933789 Jun 14 20:54 orig/raws/dm-2008-05-14/NT_037436.3 -rw-r--r-- 1 stolfi staff 57977 Jun 14 20:50 orig/raws/dr-2008-06-10/NC_002333.2 -rw-r--r-- 1 stolfi staff 73123740 Jun 15 07:55 orig/raws/dr-2008-06-10/NC_007112.3 -rw-r--r-- 1 stolfi staff 70829200 Jun 15 07:57 orig/raws/dr-2008-06-10/NC_007113.3 -rw-r--r-- 1 stolfi staff 81911587 Jun 15 08:00 orig/raws/dr-2008-06-10/NC_007114.3 -rw-r--r-- 1 stolfi staff 55397910 Jun 15 08:02 orig/raws/dr-2008-06-10/NC_007115.3 -rw-r--r-- 1 stolfi staff 91506853 Jun 15 08:05 orig/raws/dr-2008-06-10/NC_007116.3 -rw-r--r-- 1 stolfi staff 76817414 Jun 15 08:08 orig/raws/dr-2008-06-10/NC_007117.3 -rw-r--r-- 1 stolfi staff 91169946 Jun 15 08:11 orig/raws/dr-2008-06-10/NC_007118.3 -rw-r--r-- 1 stolfi staff 73309774 Jun 15 08:14 orig/raws/dr-2008-06-10/NC_007119.3 -rw-r--r-- 1 stolfi staff 66741793 Jun 15 08:17 orig/raws/dr-2008-06-10/NC_007120.3 -rw-r--r-- 1 stolfi staff 55167710 Jun 15 08:18 orig/raws/dr-2008-06-10/NC_007121.3 -rw-r--r-- 1 stolfi staff 57910696 Jun 15 08:20 orig/raws/dr-2008-06-10/NC_007122.3 -rw-r--r-- 1 stolfi staff 61705851 Jun 15 08:22 orig/raws/dr-2008-06-10/NC_007123.3 -rw-r--r-- 1 stolfi staff 69421045 Jun 15 08:24 orig/raws/dr-2008-06-10/NC_007124.3 -rw-r--r-- 1 stolfi staff 73126441 Jun 15 08:26 orig/raws/dr-2008-06-10/NC_007125.3 -rw-r--r-- 1 stolfi staff 60635691 Jun 15 08:28 orig/raws/dr-2008-06-10/NC_007126.3 -rw-r--r-- 1 stolfi staff 69006744 Jun 15 08:30 orig/raws/dr-2008-06-10/NC_007127.3 -rw-r--r-- 1 stolfi staff 67843837 Jun 15 15:24 orig/raws/dr-2008-06-10/NC_007128.3 -rw-r--r-- 1 stolfi staff 63835842 Jun 15 08:33 orig/raws/dr-2008-06-10/NC_007129.3 -rw-r--r-- 1 stolfi staff 60007581 Jun 15 08:34 orig/raws/dr-2008-06-10/NC_007130.3 -rw-r--r-- 1 stolfi staff 73498760 Jun 15 08:37 orig/raws/dr-2008-06-10/NC_007131.3 -rw-r--r-- 1 stolfi staff 59887856 Jun 15 15:27 orig/raws/dr-2008-06-10/NC_007132.3 -rw-r--r-- 1 stolfi staff 51100294 Jun 15 08:38 orig/raws/dr-2008-06-10/NC_007133.3 -rw-r--r-- 1 stolfi staff 60260763 Jun 15 08:40 orig/raws/dr-2008-06-10/NC_007134.3 -rw-r--r-- 1 stolfi staff 52182564 Jun 15 08:42 orig/raws/dr-2008-06-10/NC_007135.3 -rw-r--r-- 1 stolfi staff 42890688 Jun 15 15:30 orig/raws/dr-2008-06-10/NC_007136.3 Counting the total base pairs in each genome: foreach sp ( ${species} ) printf "%s " "${sp}" grep -m1 -e 'LOCUS' orig/raws/${sp}/N*[0-9] \ | gawk \ ' /LOCUS/ { n += $3; } \ END { printf "%d bp\n", n; } \ ' end at-2008-05-22 119707899 bp ce-2008-06-11 100281244 bp dr-2008-06-10 1277091829 bp dm-2008-05-14 120401063 bp DISCOVERING THE FEATURE KEYS Listing all feature keys used in the files: foreach sp ( ${species} ) echo " "; echo "=== ${sp} ===" set ofile = "orig/raws/${sp}/features.txt" rm -f ${ofile} foreach f ( `cd orig/raws/${sp} && ls N[CT]_*[0-9]` ) echo "# ${f}" >> ${ofile} sed \ -e '1,/^FEATURES/d' \ -e '/^ORIGIN/dq' \ orig/raws/${sp}/${f} \ | expand \ | gawk '//{ s = substr($0,1,20); if (s\!~/^[ ]*$/) { print s; }}' \ | tr -d ' ' \ | sort | uniq -c \ >> ${ofile} end combine-counts ${ofile} \ > orig/raws/${sp}/.tmp end set tmps = ( `echo ${species} | tr ' ' '\012' | sed -e 's:'\$':/.tmp:'` ) echo "${tmps}" set tits = ( `echo ${species} | tr ' ' '\012' | sed -e 's:[-].*'\$'::'` ) echo "${tits}" ( cd orig/raws && join-counts ${tmps} ) \ | format-multi-counts \ -v titles="${tits} KEY" \ > .keys cat .keys at ce dr dm KEY ------- ------- ------- ------- --- 20234 20234 20234 20234 CDS 975 975 975 975 STS 8 8 8 8 gap 14690 14690 14690 14690 gene 20221 20221 20221 20221 mRNA 339 339 339 339 misc_RNA 7 7 7 7 misc_feature 296 296 296 296 ncRNA 2 2 2 2 rRNA 1 1 1 1 rep_origin 5395 5395 5395 5395 repeat_region 7 7 7 7 source 314 314 314 314 tRNA !!! We should compress the files with {bzip2}. EXTRACTING THE LABELING RANGES We wrote a tool {extract_refSeq_ranges} that reads an annotated GenBank flatfile and outputs a list of all RNA-transcribed and protein-translated ranges listed in the "FEATURES" section. Testing: ( ../tools/extract_RefSeq_ranges \ orig/raws/ce-2008-06-11/NC_003283.7 \ > .ranges ) >& .bugs Extracting for real: foreach sp ( ${species} ) echo " "; echo "=== ${sp} ===" foreach f ( `cd orig/raws/${sp} && ls N[CT]_*[0-9]` ) set rfile = "${f}.ranges" set bfile = "${f}.bugs" ( cd orig/raws/${sp} && \ ../../../../tools/extract_RefSeq_ranges \ $f \ | sort -k2,3n -k1,1n \ | uniq \ > ${rfile} \ ) >& orig/raws/${sp}/${bfile} end dicio-wc orig/raws/${sp}/*.ranges dicio-wc orig/raws/${sp}/*.bugs end Let's compare these ranges with those in Saxonov's EID database. We unpacked the file "hEID" from Saxonov's "at2004" dataset (Arabidopsis) into the directory "../saxon/junk/at2004.hEID": set jj = "../saxon/junk/at2004" bzcat ../saxon/orig/eids/at2004/hEID.bz2 > ${jj}.hEID Recall that the ".hEID" file is divided into many items, one for each CDS. Each item looks like a GenBank file, whose FEATURES section contains a single CDS feature and whose ORIGIN section is lacking, with some extra headers and separators. We seleted the Chromosome 1 items from this "${jj}.hEID" file and packed them into a single file "${jj}.zEID" in almost-correct GenBank flatfile format, but with an empty ORIGIN section. Namely: cat ${jj}.hEID \ | expand \ | sed -e '/^[>]/d' -e '/^intron/d' -e '/_HEADER_END_/Q' \ > ${jj}.zEID cat ${jj}.hEID \ | expand \ | sed -e '/chromosome 2/Q' \ | sed -e '/^[ ]*CDS[ ]/,/^[ ]*[\/][\/][ ]*$/\!d' \ | sed -e '/^[ ]*[\/][\/][ ]*$/d' -e '/BASE COUNT/d' \ >> ${jj}.zEID echo "ORIGIN " >> ${jj}.zEID echo "//" >> ${jj}.zEID We extracted the labeling ranges from this file, and compares it with the CDS ranges previously extracted from the newer NCBI Arabidopsis RefSeq file: ( ../tools/extract_RefSeq_ranges \ ${jj}.zEID \ | sort -k2,3n -k1,1n \ | uniq \ > ${jj}.zEID.ranges ) \ >& ${jj}.zEID.bugs cat ${jj}.zEID.ranges \ | sed -e '/^[>]/d' \ | gawk '//{ print $1,$2,$3,$5; }' \ | sort | uniq \ > .saxon cat orig/raws/at-2008-05-22/NC_003070.6.ranges \ | sed -e '/^[>]/d' -e '/[N]/d' \ | gawk '//{ print $1,$2,$3,$5; }' \ | sort | uniq \ > .ncbir bool 1-2 .saxon .ncbir > .s-n bool 2-1 .saxon .ncbir > .n-s dicio-wc .saxon .ncbir .s-n .n-s lines words bytes file ------- ------- --------- ------------ 35860 143440 864120 .saxon 38216 152864 920510 .ncbir 1390 5560 33656 .s-n 3746 14984 90046 .n-s Thus there are about 12% differences at the CDS range level. Let's compare the CDS regions themselves: cat ${jj}.zEID \ | egrep -e '^[ ]*CDS[ ][ ]' \ | sort | uniq \ > .saxon-cds cat orig/raws/at-2008-05-22/NC_003070.6 \ | egrep -e '^[ ]*CDS[ ][ ]' \ | sort | uniq \ > .ncbir-cds bool 1-2 .saxon-cds .ncbir-cds > .s-n-cds bool 2-1 .saxon-cds .ncbir-cds > .n-s-cds dicio-wc .saxon-cds .ncbir-cds .s-n-cds .n-s-cds lines words bytes file ------- ------- --------- ------------ 5594 11188 400858 .saxon-cds 7553 15106 497419 .ncbir-cds 363 726 25830 .s-n-cds 2322 4644 122391 .n-s-cds There are many changes, especially new CDSs in the NCBI file. Most (if not all) of the new sequences appear to have been inferred by homology. Some (if not all) of the losses are CDSs whose span was corrected. Conclusion: it appears that we CHECKING WHETHER THERE ARE ANY NONALIGNED CDS FEATURES The "/codon_start" qualifier of a CDS is supposed to tell where the first codon of the CDS resides relative to the beginning of the CDS. I canot understand the purpose of this qualifier, except perhaps for those cases where part of the CDS lies outside the "source" range. If that is the only reason, then one should always have "/codon_start=1" in whole-molecule files. Let's check: foreach sp ( ${species} ) echo " "; echo "=== ${sp} ===" foreach f ( `ls orig/raws/${sp}/N[CT]_*[0-9]` ) echo "# ${f}" egrep \ -n -e '[\/]codon_start *[=] *[^1 ]' \ ${f} \ | sed -e 's:^: :' end end There are no such things in the at-2008-05-22 and ce-2008-06-11 datasets. In the dr-2008-06-10 and dm-2008-05-14 datsets there are about 7 such things per file. Some (all?) belong to RNA regions which were found by automated programs (GNOMON or BestRefSeq). We should print the "/note=" qualifier of RNA and CDS regions and see whether we can disqualify them automatically. Perhaps label them in some way, e.g. [defn] instead of [DEFN]? COLLECTING THE NCBI CODE TABLES For the RefSeq project, the NCBI produced a set of codon-aminoacid translation tables. The correct table to use is identified by the "/transl_table" attribute of the "CDS" feature. The default is "/transl_table=1" (the Stanard Code). To use this attribute, we fetched the code table page from NCBI http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG1 and semi-automatically converted it into a set of files "codes/ncbi-{NNN}.tbl" where {NNN} is the table number (3 digits). Apart from '#'-comments and blank lines, the table contains 64 lines in the format '{CODON} {AMINO}{START}' where {AMINO} is the single-letter aminoacid code, or '*' for STOP; {START} is a protein letter for codons that may be initiators, '-' otherwise. Listing all tables side by side: set tables = ( `ls codes/ncbi-[0-9][0-9][0-9].tbl` ) echo ${tables} set tbltit = "`echo ${tables} | sed -e 's:[^ ]*[-]::g' -e 's:[.][^ ]*::g'`" echo ${tbltit} multicol \ -v title="${tbltit}" \ -v colsep=' ' \ -v clean=1 \ ${tables} \ | gawk '//{ gsub(/^[ ]+/,"");for(i=NF-1;i>=2;i-=2) { if($(i)==$1){$(i)="";}} print}' AAA K- K- K- K- K- K- N- K- K- K- K- N- K- K- N- K- K- AAT N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- AAC N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- N- AAG K- K- K- K- K- K- K- K- K- K- K- K- K- K- K- K- K- ATA I- MM MM IM MM I- I- I- IM I- MM I- I- I- M- I- I- ATT I- IM I- IM IM I- I- I- IM I- I- I- I- I- I- I- IM ATC I- IM I- IM IM I- I- I- IM I- I- I- I- I- I- I- I- ATG MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM MM ACA T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- ACT T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- ACC T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- ACG T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- T- AGA R- *- R- R- S- R- S- R- R- R- G- S- R- R- S- R- R- AGT S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- AGC S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- AGG R- *- R- R- S- R- S- R- R- R- G- S- R- R- S- R- R- TAA *- *- *- *- *- Q- *- *- *- *- *- Y- *- *- *- *- *- TAT Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- TAC Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- Y- TAG *- *- *- *- *- Q- *- *- *- *- *- *- Q- L- *- L- *- TTA L- L- L- LM L- L- L- L- L- L- L- L- L- L- L- L- *- TTT F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- TTC F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- F- TTG LM L- L- LM LM L- L- L- LM L- LM L- L- L- L- L- L- TCA S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- *- S- TCT S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- TCC S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- TCG S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- S- TGA *- W- W- W- W- *- W- C- *- *- W- W- *- *- W- *- *- TGT C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- TGC C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- C- TGG W- W- W- W- W- W- W- W- W- W- W- W- W- W- W- W- W- CAA Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- CAT H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- CAC H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- H- CAG Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- Q- CTA L- L- T- L- L- L- L- L- L- L- L- L- L- L- L- L- L- CTT L- L- T- L- L- L- L- L- L- L- L- L- L- L- L- L- L- CTC L- L- T- L- L- L- L- L- L- L- L- L- L- L- L- L- L- CTG LM L- T- LM L- L- L- L- LM SM L- L- L- L- L- L- L- CCA P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- CCT P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- CCC P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- CCG P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- P- CGA R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- CGT R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- CGC R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- CGG R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- R- GAA E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- GAT D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- GAC D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- D- GAG E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- E- GTA V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- GTT V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- GTC V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- V- GTG V- VM V- VM VM V- VM V- VM V- VM V- V- V- VM V- VM GCA A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- GCT A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- GCC A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- GCG A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- A- GGA G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- GGT G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- GGC G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- GGG G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- G- >>TO DO>> CONSOLIDATING THE LABELING RANGES !!! REVIEW !!! We wrote a tool {extract_refSeq_segment} that takes an annotated GenBank flatfile, a range of positions, and outputs the ".bas" and ".lab" files of that segment of the file. ../tools/extract_RefSeq_segment \ -v ini=450000 -v fin=499999 \ -v outName="test" \ orig/raws/at-2008-05-22/NC_003071.4 >& .log !! Shoudl write {outName}-1.bas,lab in order of increasing strand 0 ??