# Last edited on 2025-07-08 12:07:28 by stolfi

engl/lac - Language sample: The Lewis and Clark expedition diaries

PRINTED SOURCES

The first printed edition of the Journals (1814) was heavily abridged (to ~25% of the original length) and edited by the publisher Nicholas Biddle.

The first publication of the full original manuscripts by Lewis, Clark, and two other expedition members came out only in 1904 ("Original Journals of the Lewis and Clark Expedition 1804-1806", in seven volumes, edited by Reuben G. Thwaites). It is available as scanned PDF images at "https://www.americanjourneys.org/aj-100/".

The PDF files of the first 5 volumes of Thwaites' edition are saved in orig-pdf aj-100{a,b,c,d,e}.pdf.

DIGITAL SOURCE

The source for this file was an electronic version prepared by Bob Webster and David Widger, fetched on 2025-07-05 from the Project Gutenberg site "https://www.gutenberg.org/files/8419/8419-h/8419-h.htm".

This HTML file omits the sections "courses and distances" and "celestial observations" from every entry of the diaries. Those sections are preserved in the PDF files of Thwaites' edition. The omission is acceptable given the purpose of this sample file.

INITIAL CLEANUP

Started the conversion of the Project Gutenberg HTML file to the standard "main.src" format as described in "../../src-format.txt". The changes are described in the file "intro.src".

However, halfway through that process I noticed that the Project Gutenberg HTML file contains at least two versions of the Diaries, including the full 1904 edition and presumably the 1814 edition, with the entries interleaved in chronological order. Thus it has two or more entries for each day.

SPLITTING AND FILTERING

It was decided to keep the original version as published in 1904.
For that purpose, the half-finished "main.src" file was split into one file per entry, named "files-per-day/{NNNN}-{YYYY}-{MM}-{DD}", where {NNNN} is a sequential number starting from "0001", indicating the entry's position in the HTML file, and {YYYY}-{MM}-{DD} is the entry's date in ISO format. There were 1620 such entries, covering 863 distinct dates. See "files.txt" in the "files-per-day" folder.

Note that the 1904 edition will occasionally have two or more entries for the same day, e.g. written by different members (Clark, Lewis, Ordway, Whitehouse, ...).

Started to visually compare the multiple entries for each date with the PDF files of the 1904 edition, moving those entries that did not match the latter to the sub-folder "bad" in that same folder. Got only as far as July 1, 1804: ~100 entries processed out of 1620.

TO DO

??? Finish separating the "bad" entries.

??? Join the "good" ones again into a single "main.src", including "intro.src".

??? Remove the duplication between the date in section {dt} and its repetition at the start of the day's entry.

??? Check and clean up discrepancies between "main.src" and the PDF files. Pay special attention to:

  ??? Missing or spurious punctuation.

  ??? Insert a paragraph break after the date at the start of the entry's text.

  ??? Check all syllabic spellings of Indian names for missing "~".

  ??? Use "##" and "###" to indicate that the last 2-3 chars in an abbreviation should be raised.

  ??? Fix occurrences of "&.", "&c", "&c.", "&c#", etc.

  ??? Replace "¬" wrongly used in place of paragraph breaks or punctuation.
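
SKETCH: GROUPING ENTRIES BY DATE

The per-entry naming scheme described under SPLITTING AND FILTERING lends itself to a small helper for finding the dates that have more than one entry, which are the ones that still need to be compared with the PDF files. The following is only a sketch, not the script actually used for this sample. It assumes that the file names are exactly {NNNN}-{YYYY}-{MM}-{DD} with no extension, and the function names "parse_entry_name" and "group_by_date" are hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of a per-entry file name: {NNNN}-{YYYY}-{MM}-{DD}, no extension.
ENTRY_NAME = re.compile(r"(\d{4})-(\d{4})-(\d{2})-(\d{2})")


def parse_entry_name(name):
    """Return (seq, iso_date) for a name like '0001-1804-05-14', else None."""
    m = ENTRY_NAME.fullmatch(name)
    if m is None:
        return None
    seq, yyyy, mm, dd = m.groups()
    return int(seq), f"{yyyy}-{mm}-{dd}"


def group_by_date(names):
    """Map each ISO date to the sorted sequence numbers of its entries.

    Names that do not follow the assumed pattern (e.g. 'files.txt') are skipped.
    """
    groups = defaultdict(list)
    for name in names:
        parsed = parse_entry_name(name)
        if parsed is None:
            continue
        seq, date = parsed
        groups[date].append(seq)
    for seqs in groups.values():
        seqs.sort()
    return dict(groups)


def dates_with_duplicates(names):
    """Return the sorted dates that have two or more entries."""
    return sorted(d for d, seqs in group_by_date(names).items() if len(seqs) > 1)
```

In practice the names would come from a directory listing of "files-per-day" (for example via os.listdir), and len(group_by_date(names)) could be checked against the count of distinct dates reported above.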