# Last edited on 2009-11-28 13:23:40 by stolfi

OBTAINING THE RAW DATA FILES

  Obtained one raw data file "data/wp-size-emo-2009-06-raw.txt" from
  the Wikipedia page [[Wikipedia:Modelling Wikipedia's growth]].
  It has one entry per month, with the size of Wikipedia at the end
  of each month:

    31/01/2004, 200981
    29/02/2004, 217064
    31/03/2004, 239255
    30/04/2004, 258781

  Obtained another raw data file "data/wp-size-irr-2009-11-raw.txt"
  from the Wikipedia page [[Wikipedia:Size of Wikipedia]], with the
  size of Wikipedia at irregular dates. This version has a slightly
  different format, and comments beside some entries. I had to sort
  it myself by INCREASING date:

    2002-11-07, 90003, mpacIII
    2002-11-09, 90266, mpacIII
    2002-11-18, 90905, mpacIII, article counter is back, after being switched off
    2002-11-22, 91580, mpacIII, some recent performance problems

REFORMATTING THE RAW DATA FILE

  The files were reformatted to rslt/wp-size-{FF}-rar.txt where {FF}
  is either "emo-2009-06" or "irr-2009-11", in the common format

    "{TIME} {YEAR} {MONTH} {DAY} {SZ} {SU}"

  where {YEAR} is the 4-digit year, {MONTH} is 01 to 12, {DAY} is
  day of month 01 to 31, {TIME} is elapsed days since Jan 1, 2001,
  {SZ} is the article count at {TIME}, and {SU} is a status
  indicator, presently 1 (OK) or 0 (suspicious).

  The plots are in files rslt/wp-size-{FF}-rar-e{EARLY}.eps where
  {EARLY} is 0 for the whole plot, 1 for the early period only.

COMPUTING THE GROWTH RATES

  The reformatted raw data files were converted to
  rslt/wp-size-{FF}-int.txt in the format

    "{TIME} {YEAR} {MONTH} {DAY} {SZ} {SU} {DZ} {DU}"

  where {TIME} is spaced 30 days apart starting with 30, {SZ} is the
  size interpolated at {TIME}, {DZ} is the increment in {SZ} during
  the last 30 days, {SU} is as before, and {DU} is the status of
  {DZ}: 1 (OK), 0 (suspicious) or 9 (ignore).
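  A minimal sketch of how {SZ} and {DZ} could be obtained on the
  30-day grid. These notes do not say which interpolation method was
  used, so linear interpolation is an assumption here. The function
  name interpolate_sizes is hypothetical, and the status fields {SU}
  and {DU} are left out because their rules are not given above.

```python
from bisect import bisect_left


def interpolate_sizes(samples, step=30):
    """Interpolate (time, size) samples at times step, 2*step, 3*step, ...

    `samples` must be sorted by strictly increasing time.
    Returns a list of (time, sz, dz) tuples, where dz is the increment
    in sz since the previous grid time, or None for the first grid
    point covered by the data.
    ASSUMPTION: linear interpolation between neighbouring samples.
    """
    times = [t for t, _ in samples]
    rows = []
    prev_sz = None
    t = step
    while t <= times[-1]:
        if t >= times[0]:
            i = bisect_left(times, t)  # index of first sample with time >= t
            if times[i] == t:
                sz = float(samples[i][1])
            else:
                (t0, z0), (t1, z1) = samples[i - 1], samples[i]
                sz = z0 + (z1 - z0) * (t - t0) / (t1 - t0)
            dz = None if prev_sz is None else sz - prev_sz
            rows.append((t, sz, dz))
            prev_sz = sz
        t += step
    return rows


# Synthetic samples (made-up numbers, not taken from the data files).
for row in interpolate_sizes([(20, 100), (50, 400), (110, 1000)]):
    print(row)
# (30, 200.0, None)
# (60, 500.0, 300.0)
# (90, 800.0, 300.0)
```

  Grid times before the first sample or after the last one are
  skipped rather than extrapolated. That is a choice made for this
  sketch only.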
  The plots of the interpolated data are in files
  rslt/wp-size-{FF}-int-e{EARLY}.eps

MODELING

  The files with interpolated sizes and growth rates were piped to a
  data prediction script, yielding
  rslt/wp-size-{FF}-prd-p{PKS}-s{SEA}.txt where {PKS} is 0 (model
  without peaks) or 1 (model with peaks), and {SEA} is 0 (no seasonal
  factor) or 1 (with seasonal factor).

  The plots are in files rslt/wp-size-{FF}-prd-p{PKS}-s{SEA}-e{EARLY}.eps
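
  For reference, a sketch of the reformatting step, which turns a raw
  entry in either input format into a line in the common format
  "{TIME} {YEAR} {MONTH} {DAY} {SZ} {SU}". It assumes that Jan 1, 2001
  itself counts as day 0 and that {TIME} is a whole number of days.
  The names parse_raw_line and reformat_line are hypothetical. How
  suspicious entries were detected is not recorded in these notes, so
  the status is just passed in by the caller.

```python
from datetime import date

EPOCH = date(2001, 1, 1)  # ASSUMPTION: Jan 1, 2001 is day 0


def parse_raw_line(line):
    """Parse one raw entry into (date, size).

    Accepts "DD/MM/YYYY, SIZE" (monthly file) and
    "YYYY-MM-DD, SIZE, COMMENT..." (irregular file).
    Any comment fields after the size are dropped.
    """
    fields = [f.strip() for f in line.split(",")]
    stamp, size = fields[0], int(fields[1])
    if "/" in stamp:
        day, month, year = (int(x) for x in stamp.split("/"))
    else:
        year, month, day = (int(x) for x in stamp.split("-"))
    return date(year, month, day), size


def reformat_line(line, status=1):
    """Return the entry as "{TIME} {YEAR} {MONTH} {DAY} {SZ} {SU}"."""
    d, size = parse_raw_line(line)
    elapsed = (d - EPOCH).days
    return f"{elapsed} {d.year:04d} {d.month:02d} {d.day:02d} {size} {status}"


print(reformat_line("31/01/2004, 200981"))
# 1125 2004 01 31 200981 1
print(reformat_line("2002-11-07, 90003, mpacIII"))
# 675 2002 11 07 90003 1
```

  Sorting the irregular file by increasing date can reuse the same
  parser: sorted(lines, key=lambda s: parse_raw_line(s)[0]).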