Of particular interest are the labels with intermediate frequency
(between 1 and 99 occurrences). I separated those into a file
.labels-rare.dic:

  8cc8o 8ccoe 8o8orm 8oHccoe 8oHoeo 8oHor
  8oeoro 8oero 8orcco8o Hoccorom Hoecc8 Horomo
  Poro cPccor ccPoeo cccHoe cccocHco cccor8o
  cccoro8 cccoroe ccoHoro ccoccro ccoer ccoerom
  o8r oHcc8occcHoe oHcc8oe oHcccc8o oHcccco oHccccor
  oHccco8o oHccco8or oHcccor oHcco8oer oHcco8or oHccocc8o
  oHccocco oHccooro oHccoror oHccr oHco8occcHco oHco8oer
  oHcoeo oHcoeor oHcoeroe oHcooe8o oHcooeo oHcoor
  oHcoroe oHcororor oHo8oe oHo8or oHo8oro oHoHo
  oHoHoe oHocHco oHocco oHoccor oHoe8oe oHoe8or
  oHoecc8 oHoecc8o oHoeccoe8o oHoeo8 oHoeo8o oHoeoe
  oHoeoeo oHoeok oHoeom oHoeor oHoeoroe oHoeoror
  oHoer oHoero oHom8om oHomoHom oHorcccHo oHorcco8
  oHorcco8o oHorccok oHorccor oHorco oHoroe8o oHorok
  oHorom oHoror oHororo oHroe oPccco8o oPcccor
  oPcco8oo oPccor oPccorom oPoeror oPorom occc8oe8om
  occo8o8o occoro oecPco oecccccco oeorom oeoror
  ooPcco oorcccor oorom oroe8 ororcco8om ororoeo
  qHoe qoHcro rccoor ro8or ro8ororo roPoe
  roe8ok roeP roeccror roeoe roeoer roer
  roero rorccor roroeo roromr rororo rr

I thought of removing those labels that were superstrings of others
in the same set, e.g. removing "oHoecc8o" since there is already
"oHoecc8".
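The filtering idea itself can be illustrated with a minimal sketch,
separate from the recipe that follows. It is written for a POSIX shell
with awk rather than csh, and the function name drop_superstrings is
made up for this illustration; it is not one of the tools used in these
notes. It assumes one label per line and keeps only the labels that do
not contain another label of the same list as a proper substring.

```shell
# drop_superstrings [FILE...]
# Hypothetical helper: print the labels (one per line on input) that do
# not contain any other label of the same list as a proper substring.
drop_superstrings() {
  awk '
    NF { lab[++n] = $1 }                      # collect non-blank labels
    END {
      for (i = 1; i <= n; i++) {
        keep = 1
        for (j = 1; j <= n; j++) {
          # index() > 0 means lab[j] occurs somewhere inside lab[i]
          if (lab[j] != lab[i] && index(lab[i], lab[j]) > 0) { keep = 0; break }
        }
        if (keep) print lab[i]
      }
    }
  ' "$@"
}
```

The comparison is quadratic in the number of labels, which should be
harmless for a list of this size. Under the one-label-per-line
assumption, it could be called as "drop_superstrings .labels-rare.dic".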
Here is a (rather convoluted) recipe to do that:

  cp -p .labels-rare.dic .labs
  @ i = 0
  # Iterate (at most 4 times) until no label is a proper substring of another.
  while ( $i < 4 )
    @ i = $i + 1
    # All proper substrings of the current labels.
    cat .labs \
      | enum-proper-substrings \
      | sort | uniq \
      > .subs.$i
    # Substrings that are themselves labels (presumably "bool 1.2" is set intersection).
    bool 1.2 .labs .subs.$i \
      > .subs-oc.$i
    if ( -z .subs-oc.$i ) break
    # Drop every label containing one of them, then put the substrings back.
    cat .labs \
      | fgrep -v -f .subs-oc.$i \
      > .prop.$i
    cat .subs-oc.$i >> .prop.$i
    cat .prop.$i \
      | sort | uniq \
      > .labs
  end

Here is what it would remove:

  bool 1-2 .labels-rare.dic .labs

  Hoccorom ccoerom oHcc8occcHoe oHccccor oHccco8or oHcoeor
  oHo8oro oHoHoe oHoccor oHoecc8 oHoecc8o oHoeo8o
  oHoeoeo oHoeoroe oHoeoror oHoero oHorcco8o oHoroe8o
  oHororo oPccorom oPorom ororoeo ro8ororo roeoer
  roero

But it seems that those labels are interesting on their own, and in
some cases much longer than the relevant substrings. So I decided to
leave them in.

We should also look at the "rarest" labels (with fewer than 25
occurrences), kept in .labels-rarest.dic:

  8o8orm 8oHccoe 8oHoeo 8oero 8orcco8o Hoccorom
  cPccor ccPoeo cccor8o cccoro8 ccoccro ccoerom
  o8r oHcc8occcHoe oHcccc8o oHccccor oHccco8or oHcco8oer
  oHccocc8o oHccooro oHccoror oHco8occcHco oHco8oer oHcoeor
  oHcoeroe oHcooe8o oHcooeo oHcoor oHcoroe oHcororor
  oHo8oe oHo8or oHo8oro oHoHoe oHocHco oHoccor
  oHoeccoe8o oHoeo8 oHoeo8o oHoeoeo oHoeok oHoeom
  oHoeoroe oHoeoror oHoero oHom8om oHorcccHo oHorcco8
  oHorcco8o oHorccok oHorccor oHorco oHoroe8o oHorok
  oHroe oPccco8o oPcccor oPcco8oo oPccorom oPoeror
  oPorom occc8oe8om occo8o8o oecPco oecccccco oorcccor
  ororcco8om qHoe qoHcro rccoor ro8or ro8ororo
  roPoe roe8ok roeP roeccror roeoer rorccor
  roromr

Now let's find all occurrences of these rare labels in the parags text:

  cat .parags-j-ecc.evt \
    | enum-word-locations .labels-rare.dic \
    | sort -b +2 -3n \
    > .label-rare-occurrences.idx

  cat .parags-j-ecc.evt \
    | enum-word-locations .labels-rarest.dic \
    | sort -b +2 -3n \
    > .label-rarest-occurrences.idx

Let's tabulate their frequencies per page:

  foreach f ( '' -rare -rarest )
    cat .label${f}-occurrences.idx \
      | gawk '/./ { print $1 }' \
      | sed -e 's/\..*>/>/g' \
      | sort | uniq -c | expand \
      | compute-freqs \
      | sort +0 -1nr \
      > .label${f}-refs-by-panel.frq
  end

  --- .label-rare-refs-by-panel.frq ------------------------
     93 0.030 <f58r>
     78 0.025 <f113v>
     63 0.020 <f104v>
     60 0.019 <f115r>
     60 0.019 <f86v6>
     57 0.018 <f86v5>
     55 0.017 <f106v>
     53 0.017 <f113r>
     51 0.016 <f116r>
     50 0.016 <f108r>
     49 0.016 <f104r>
     49 0.016 <f105v>
     48 0.015 <f107r>
     44 0.014 <f108v>
     42 0.013 <f106r>
     41 0.013 <f107v>
     40 0.013 <f112r>
     39 0.012 <f103v>
     39 0.012 <f85r1>
     37 0.012 <f115v>
     36 0.011 <f111v>
     36 0.011 <f80v>
     35 0.011 <f114r>
     35 0.011 <f76r>
     35 0.011 <f99v>
     34 0.011 <f103r>
     34 0.011 <f111r>
     34 0.011 <f80r>
     33 0.010 <f79r>
     30 0.010 <f105r>
     28 0.009 <f78v>
     27 0.009 <f66r>
     26 0.008 <f76v>
     25 0.008 <f112v>
     25 0.008 <f79v>
     24 0.008 <f114v>
     24 0.008 <f39v>
     24 0.008 <f40r>
     24 0.008 <f84r>
     23 0.007 <f101r1>
     23 0.007 <f78r>
     22 0.007 <f39r>
     22 0.007 <f75v>
     21 0.007 <f102v2>
     21 0.007 <f77r>
     20 0.006 <f50r>
     20 0.006 <f70r2>
     20 0.006 <f75r>
     19 0.006 <f93r>
     18 0.006 <f82r>
     18 0.006 <f94v>
     17 0.005 <f46v>
     17 0.005 <f48v>
     17 0.005 <f55r>
     16 0.005 <f66v>
     16 0.005 <f94r>
     15 0.005 <f100v>
     15 0.005 <f43r>
     15 0.005 <f47v>
     15 0.005 <f77v>
     15 0.005 <f82v>
     15 0.005 <f89r2>
     15 0.005 <f9v>
     14 0.004 <f102r2>
     14 0.004 <f14v>
     14 0.004 <f1v>
     14 0.004 <f23v>
     14 0.004 <f33r>
     14 0.004 <f55v>
     14 0.004 <f81v>
     14 0.004 <f83r>
     13 0.004 <f34v>
     13 0.004 <f81r>
     13 0.004 <f87v>
     13 0.004 <f88v>
     13 0.004 <f8r>
     12 0.004 <f100r>
     12 0.004 <f101v1>
     12 0.004 <f16v>
     12 0.004 <f1r>
     12 0.004 <f22r>
     12 0.004 <f23r>
     12 0.004 <f2r>
     12 0.004 <f3r>
     12 0.004 <f45r>
     12 0.004 <f54r>
     12 0.004 <f56r>
     12 0.004 <f83v>
     12 0.004 <f84v>
     12 0.004 <f86v3>
     12 0.004 <f8v>
     12 0.004 <f90v1>
     12 0.004 <f95v1>
     12 0.004 <f95v2>
     12 0.004 <f99r>
     11 0.003 <f31v>
     11 0.003 <f36r>
     11 0.003 <f40v>
     11 0.003 <f44r>
     11 0.003 <f49r>
     11 0.003 <f87r>
     11 0.003 <f88r>
     11 0.003 <f9r>
     10 0.003 <f19v>
     10 0.003 <f46r>
     10 0.003 <f48r>
     10 0.003 <f52r>
     10 0.003 <f54v>
     10 0.003 <f86v4>
     10 0.003 <f93v>
     10 0.003 <f96r>
      9 0.003 <f13r>
      9 0.003 <f15v>
      9 0.003 <f16r>
      9 0.003 <f17v>
      9 0.003 <f24r>
      9 0.003 <f27r>
      9 0.003 <f29r>
      9 0.003 <f34r>
      9 0.003 <f51v>
      9 0.003 <f95r2>
      8 0.003 <f101v2>
      8 0.003 <f102v1>
      8 0.003 <f21v>
      8 0.003 <f22v>
      8 0.003 <f24v>
      8 0.003 <f35r>
      8 0.003 <f35v>
      8 0.003 <f37r>
      8 0.003 <f42v>
      8 0.003 <f43v>
      8 0.003 <f56v>
      8 0.003 <f57r>
      7 0.002 <f19r>
      7 0.002 <f30v>
      7 0.002 <f41v>
      7 0.002 <f42r>
      7 0.002 <f45v>
      7 0.002 <f6v>
      7 0.002 <f7v>
      7 0.002 <f89v1>
      7 0.002 <f89v2>
      6 0.002 <f10r>
      6 0.002 <f11v>
      6 0.002 <f13v>
      6 0.002 <f15r>
      6 0.002 <f17r>
      6 0.002 <f20r>
      6 0.002 <f20v>
      6 0.002 <f26v>
      6 0.002 <f32v>
      6 0.002 <f33v>
      6 0.002 <f3v>
      6 0.002 <f52v>
      6 0.002 <f67r2>
      6 0.002 <f6r>
      6 0.002 <f90r1>
      6 0.002 <f90r2>
      6 0.002 <f90v2>
      6 0.002 <f95r1>
      6 0.002 <f96v>
      5 0.002 <f102r1>
      5 0.002 <f11r>
      5 0.002 <f18v>
      5 0.002 <f21r>
      5 0.002 <f25r>
      5 0.002 <f25v>
      5 0.002 <f26r>
      5 0.002 <f2v>
      5 0.002 <f30r>
      5 0.002 <f31r>
      5 0.002 <f32r>
      5 0.002 <f36v>
      5 0.002 <f53r>
      5 0.002 <f53v>
      5 0.002 <f5r>
      5 0.002 <f5v>
      5 0.002 <f69r>
      5 0.002 <f7r>
      4 0.001 <f14r>
      4 0.001 <f27v>
      4 0.001 <f28r>
      4 0.001 <f28v>
      4 0.001 <f29v>
      4 0.001 <f37v>
      4 0.001 <f38r>
      4 0.001 <f44v>
      4 0.001 <f47r>
      4 0.001 <f4r>
      4 0.001 <f50v>
      4 0.001 <f51r>
      4 0.001 <f68v2>
      4 0.001 <f89r1>
      3 0.001 <f10v>
      3 0.001 <f68r2>
      3 0.001 <f68v3>
      2 0.001 <f38v>
      1 0.000 <f18r>
      1 0.000 <f41r>
      1 0.000 <f4v>
      1 0.000 <f65v>
      1 0.000 <f67r1>
  ----------------------------------------------------------

  --- .label-rarest-refs-by-panel.frq ------------------------
     28 0.041 <f58r>
     18 0.026 <f113v>
     15 0.022 <f115r>
     13 0.019 <f113r>
     13 0.019 <f66r>
     12 0.017 <f86v5>
     12 0.017 <f99v>
     11 0.016 <f105v>
     10 0.014 <f106v>
     10 0.014 <f116r>
     10 0.014 <f78r>
      9 0.013 <f108r>
      9 0.013 <f85r1>
      9 0.013 <f86v6>
      8 0.012 <f102v2>
      8 0.012 <f103r>
      8 0.012 <f105r>
      8 0.012 <f106r>
      8 0.012 <f107r>
      8 0.012 <f46v>
      8 0.012 <f76r>
      7 0.010 <f104v>
      7 0.010 <f111r>
      7 0.010 <f112r>
      7 0.010 <f84r>
      7 0.010 <f9v>
      6 0.009 <f101r1>
      6 0.009 <f103v>
      6 0.009 <f107v>
      6 0.009 <f108v>
      6 0.009 <f114r>
      6 0.009 <f78v>
      6 0.009 <f80r>
      6 0.009 <f93r>
      5 0.007 <f102r2>
      5 0.007 <f104r>
      5 0.007 <f111v>
      5 0.007 <f114v>
      5 0.007 <f14v>
      5 0.007 <f1v>
      5 0.007 <f23r>
      5 0.007 <f24v>
      5 0.007 <f3r>
      5 0.007 <f45r>
      5 0.007 <f52r>
      5 0.007 <f87r>
      5 0.007 <f8v>
      5 0.007 <f95v2>
      4 0.006 <f100r>
      4 0.006 <f100v>
      4 0.006 <f112v>
      4 0.006 <f13v>
      4 0.006 <f22r>
      4 0.006 <f32v>
      4 0.006 <f33r>
      4 0.006 <f34v>
      4 0.006 <f39r>
      4 0.006 <f51v>
      4 0.006 <f54v>
      4 0.006 <f55v>
      4 0.006 <f6v>
      4 0.006 <f75v>
      4 0.006 <f77r>
      4 0.006 <f77v>
      4 0.006 <f80v>
      4 0.006 <f81v>
      4 0.006 <f86v3>
      4 0.006 <f87v>
      4 0.006 <f88v>
      4 0.006 <f89r2>
      4 0.006 <f90r1>
      4 0.006 <f95v1>
      4 0.006 <f99r>
      3 0.004 <f101v2>
      3 0.004 <f102v1>
      3 0.004 <f11v>
      3 0.004 <f16v>
      3 0.004 <f17v>
      3 0.004 <f19r>
      3 0.004 <f23v>
      3 0.004 <f2r>
      3 0.004 <f30v>
      3 0.004 <f34r>
      3 0.004 <f35r>
      3 0.004 <f37r>
      3 0.004 <f39v>
      3 0.004 <f42r>
      3 0.004 <f47v>
      3 0.004 <f50r>
      3 0.004 <f51r>
      3 0.004 <f57r>
      3 0.004 <f66v>
      3 0.004 <f69r>
      3 0.004 <f79r>
      3 0.004 <f79v>
      3 0.004 <f84v>
      3 0.004 <f86v4>
      3 0.004 <f89v1>
      3 0.004 <f89v2>
      3 0.004 <f90r2>
      2 0.003 <f101v1>
      2 0.003 <f10v>
      2 0.003 <f13r>
      2 0.003 <f14r>
      2 0.003 <f16r>
      2 0.003 <f17r>
      2 0.003 <f18v>
      2 0.003 <f1r>
      2 0.003 <f20r>
      2 0.003 <f22v>
      2 0.003 <f26v>
      2 0.003 <f28r>
      2 0.003 <f30r>
      2 0.003 <f31r>
      2 0.003 <f31v>
      2 0.003 <f35v>
      2 0.003 <f36r>
      2 0.003 <f3v>
      2 0.003 <f40r>
      2 0.003 <f42v>
      2 0.003 <f43r>
      2 0.003 <f45v>
      2 0.003 <f47r>
      2 0.003 <f49r>
      2 0.003 <f4r>
      2 0.003 <f52v>
      2 0.003 <f53r>
      2 0.003 <f54r>
      2 0.003 <f55r>
      2 0.003 <f56r>
      2 0.003 <f5v>
      2 0.003 <f70r2>
      2 0.003 <f75r>
      2 0.003 <f7v>
      2 0.003 <f81r>
      2 0.003 <f83v>
      2 0.003 <f89r1>
      2 0.003 <f8r>
      2 0.003 <f90v1>
      2 0.003 <f90v2>
      2 0.003 <f93v>
      2 0.003 <f94r>
      2 0.003 <f95r2>
      2 0.003 <f96r>
      2 0.003 <f96v>
      2 0.003 <f9r>
      1 0.001 <f115v>
      1 0.001 <f11r>
      1 0.001 <f15r>
      1 0.001 <f15v>
      1 0.001 <f19v>
      1 0.001 <f21r>
      1 0.001 <f21v>
      1 0.001 <f25r>
      1 0.001 <f26r>
      1 0.001 <f27r>
      1 0.001 <f27v>
      1 0.001 <f28v>
      1 0.001 <f29v>
      1 0.001 <f2v>
      1 0.001 <f33v>
      1 0.001 <f37v>
      1 0.001 <f43v>
      1 0.001 <f44r>
      1 0.001 <f46r>
      1 0.001 <f48r>
      1 0.001 <f48v>
      1 0.001 <f53v>
      1 0.001 <f67r2>
      1 0.001 <f68v2>
      1 0.001 <f76v>
      1 0.001 <f82r>
      1 0.001 <f82v>
      1 0.001 <f83r>
      1 0.001 <f88r>
      1 0.001 <f94v>
  ------------------------------------------------------------

Let's compute the density of label references per page (number of
references per 1000 text characters in the page, rounded to the
nearest integer):

  foreach f ( '' -rare -rarest )
    cat .label${f}-refs-by-panel.frq \
      | sort +2 -3 \
      > .foo
    join \
        -a 1 -e 00 \
        -j1 3 -j2 1 \
        -o0,1.1,1.2,2.2 \
        .foo .panels.nchars \
      > .bar${f}
    cat .bar${f} \
      | gawk '/./ {printf "%-8s %5d %5.3f %5d\n", $1, $2, $3, int(1000*$2/$4 + 0.5)}' \
      | sort -b +3 -4nr \
      > .label${f}-refs-by-panel.rfrq
  end

  --- .label-refs-by-panel.rfrq ------------------------
  <f40r>     146 0.005   348
  <f58r>     546 0.020   294
  <f99v>     178 0.006   293
  <f95v2>     69 0.003   273
  <f113v>    636 0.023   269
  <f94r>      99 0.004   268
  <f86v5>    436 0.016   267
  <f55r>     136 0.005   261
  <f95r2>     97 0.004   261
  <f39v>     152 0.006   256
  <f86v6>    563 0.020   252
  <f70r2>    106 0.004   249
  <f23v>      95 0.003   242
  <f23r>     106 0.004   240
  <f86v3>    131 0.005   234
  <f33r>      76 0.003   229
  <f55v>     100 0.004   224
  <f69r>      37 0.001   224
  <f94v>     105 0.004   224
  <f107v>    491 0.018   223
  <f86v4>     45 0.002   223
  <f107r>    511 0.019   219
  <f50r>     104 0.004   218
  <f100v>     87 0.003   215
  <f105v>    395 0.014   215
  <f36r>      58 0.002   215
  <f9v>       79 0.003   215
  <f106r>    482 0.017   213
  <f19v>      75 0.003   213
  <f106v>    475 0.017   211
  ...        ... .....   ...
  <f7r>       30 0.001    89
  <f32v>      31 0.001    87
  <f4v>       35 0.001    87
  <f5r>       22 0.001    85
  <f89r1>     53 0.002    84
  <f68v2>     19 0.001    82
  <f30r>      38 0.001    80
  <f26r>      33 0.001    76
  <f25v>      19 0.001    73
  <f41r>      30 0.001    54
  <f65v>      11 0.000    49
  <f27v>      12 0.000    39
  ------------------------------------------------------

  --- .label-rare-refs-by-panel.rfrq ------------------------
  <f99v>      35 0.011    58
  <f40r>      24 0.008    57
  <f58r>      93 0.030    50
  <f86v4>     10 0.003    50
  <f70r2>     20 0.006    47
  <f95v2>     12 0.004    47
  <f14v>      14 0.004    44
  <f94r>      16 0.005    43
  <f33r>      14 0.004    42
  <f50r>      20 0.006    42
  <f36r>      11 0.003    41
  <f9v>       15 0.005    41
  <f39v>      24 0.008    40
  <f47v>      15 0.005    38
  <f94v>      18 0.006    38
  <f100v>     15 0.005    37
  <f23v>      14 0.004    36
  <f67r2>      6 0.002    36
  <f1v>       14 0.004    35
  <f86v5>     57 0.018    35
  <f16v>      12 0.004    34
  <f113v>     78 0.025    33
  <f55r>      17 0.005    33
  <f46v>      17 0.005    32
  <f15v>       9 0.003    31
  <f45r>      12 0.004    31
  <f52r>      10 0.003    31
  <f55v>      14 0.004    31
  <f21v>       8 0.003    30
  <f39r>      22 0.007    30
  ...        ... .....   ...
  <f38v>       2 0.001     8
  <f50v>       4 0.001     8
  <f51r>       4 0.001     8
  <f83r>      14 0.004     8
  <f89v2>      7 0.002     8
  <f84v>      12 0.004     7
  <f67r1>      1 0.000     6
  <f89r1>      4 0.001     6
  <f65v>       1 0.000     4
  <f18r>       1 0.000     2
  <f41r>       1 0.000     2
  <f4v>        1 0.000     2
  -----------------------------------------------------------

  --- .label-rarest-refs-by-panel.rfrq ------------------------
  <f95v2>      5 0.007    20
  <f99v>      12 0.017    20
  <f9v>        7 0.010    19
  <f69r>       3 0.004    18
  <f14v>       5 0.007    16
  <f46v>       8 0.012    15
  <f52r>       5 0.007    15
  <f58r>      28 0.041    15
  <f86v4>      3 0.004    15
  <f90r2>      3 0.004    15
  <f11v>       3 0.004    14
  <f13v>       4 0.006    14
  <f1v>        5 0.007    13
  <f45r>       5 0.007    13
  <f24v>       5 0.007    12
  <f33r>       4 0.006    12
  <f102v2>     8 0.012    11
  <f23r>       5 0.007    11
  <f32v>       4 0.006    11
  <f90r1>      4 0.006    11
  <f100v>      4 0.006    10
  <f51v>       4 0.006    10
  <f87r>       5 0.007    10
  <f19r>       3 0.004     9
  <f30v>       3 0.004     9
  <f3r>        5 0.007     9
  <f54v>       4 0.006     9
  <f55v>       4 0.006     9
  <f5v>        2 0.003     9
  <f87v>       4 0.006     9
  ...        ... .....   ...
  <f80v>       4 0.006     2
  <f81r>       2 0.003     2
  <f84v>       3 0.004     2
  <f88r>       1 0.001     2
  <f94v>       1 0.001     2
  <f43v>       1 0.001     1
  <f46r>       1 0.001     1
  <f75r>       2 0.003     1
  <f82r>       1 0.001     1
  <f82v>       1 0.001     1
  <f83r>       1 0.001     1
  <f83v>       2 0.003     1
  <f115v>      1 0.001     0
  <f76v>       1 0.001     0
  -------------------------------------------------------------
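
The last column of the .rfrq tables is the number of references per
1000 text characters, rounded to the nearest integer, as computed by
the gawk expression int(1000*$2/$4 + 0.5). The same arithmetic can be
sketched in a single awk pass. This is only an illustration under
stated assumptions: the function name panel_density is hypothetical,
and .panels.nchars is assumed to hold one "<panel> nchars" pair per
line (its layout is not shown in these notes). Unlike the join above,
panels without a character count are skipped rather than padded with
00, and sort is given the -k form of the key instead of "+3 -4nr".

```shell
# panel_density FRQ NCHARS
#   FRQ:    lines "count fraction <panel>"  (layout of the .frq tables)
#   NCHARS: lines "<panel> nchars"          (assumed layout)
# Prints "<panel> count fraction density", densest panels first.
panel_density() {
  awk '
    # First file: remember the character count of each panel.
    NR == FNR { if (NF >= 2) nchars[$1] = $2; next }
    # Second file: references per 1000 characters, rounded to nearest.
    NF >= 3 && ($3 in nchars) && nchars[$3] > 0 {
      printf "%-8s %5d %5.3f %5d\n", $3, $1, $2, int(1000 * $1 / nchars[$3] + 0.5)
    }
  ' "$2" "$1" | sort -k4,4nr
}
```

Under those assumptions it could be called as
"panel_density .label-rare-refs-by-panel.frq .panels.nchars".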