Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/instantiator/anne-frank-diary-parser
Scripts to clean up the Internet Archive OCR of Anne Frank's "Diary of a Young Girl". Generates a structured dataset of entries by date.
https://github.com/instantiator/anne-frank-diary-parser
Last synced: about 16 hours ago
JSON representation
Scripts to clean up the Internet Archive OCR of Anne Frank's "Diary of a Young Girl". Generates a structured dataset of entries by date.
- Host: GitHub
- URL: https://github.com/instantiator/anne-frank-diary-parser
- Owner: instantiator
- License: mit
- Created: 2023-11-17T22:59:42.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-17T23:24:40.000Z (about 1 year ago)
- Last Synced: 2023-11-18T00:53:07.202Z (about 1 year ago)
- Language: C#
- Size: 154 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Diary of Anne Frank
A set of scripts to clean up the Internet Archive OCR of Anne Frank's "Diary of a Young Girl", and generate a structured dataset of formatted entries by date.
## Source material
The source material for these scripts comes from the Internet Archive:
* [Anne Frank The Diary Of A Young Girl](https://archive.org/details/AnneFrankTheDiaryOfAYoungGirl_201606)
If you're intending to use or publish anything derived from this work, please keep an eye on the latest information regarding [copyright](https://en.wikipedia.org/wiki/The_Diary_of_a_Young_Girl#Copyright_and_ownership_of_the_originals) of the book.
The scripts in this repository do not contain material from the book, and are available under a very permissive MIT license.
## Prerequisites
These scripts assumes access to a `bash` shell, and that you have installed:
* [DotNet runtime 8.0 or above](https://dotnet.microsoft.com/en-us/download)
## Usage
Download the source material from the Internet Archive, and then parse it:
```bash
./download.sh
./parse.sh
```* The source material will be downloaded to: `input/diary.txt`
* Structured output will be written to: `output/diary-entries.json`## How it works
The diary is an OCR text file, uploaded in 2016. The OCR is good, but not perfect. Parsing helps to tidy up the text, and generate structured entries from it.
### OCR issues
The OCR formatting is imperfect:
* There are spelling mistakes (eg. the number `29` was OCR'd in at least 1 place as `2g`)
* Lines are split (at around `92` characters long)
* There are multiple blank lines between paragraphs
* Sometimes there are multiple blank lines between lines that shouldn't be splitI believe some of the spurious line breaks are caused by OCR handling lines that run over page breaks.
### Spelling
In at least a couple of cases, the spelling errors resulted in faulty dates that could not be parsed. These are corrected with a short list of special case corrections in the `SpellingFixer` class.
### Dates
Entries need to be split by date. This is done in the `SourceSplitter` class, which reads a text file, and generates `IEnumerable`.
Diary entry dates in the source text are found alone on a line.
* They are roughly in the format: `DAYOFWEEK, MONTH DAY, YEAR`
* The day of the week of mostly present
* The month is mostly uppercase
* The only years are `1942`, `1943`, `1944`Examples from the source:
```text
SUNDAY, JUNE 14, 1942
June 12, 1942
SATURDAY, JUNE 20,1942
SUNDAY, JUNE 21, 1942
MONDAY, NOVEMBER 9,1942
```Entry dates are recognised as a line that ends with `1942`, `1943`, or `1944`. The day, month and year is then parsed from these lines by splitting on `,` and then splitting on space.
### Reformatting
The line breaks are cleaned up with a set of rules that tries to identify lines that shouldn't have been split. If there are 1 or more blank lines between a pair of lines, it will attempt to rejoin them as if a broken line, if the previous line did not end a sentence. (ie. with `.`, `!`, `?`, or `Kitty,`.) Otherwise it recognises the blanks as an indicator of a new paragraph.
_I'm sure these rules are imperfect, and no doubt I'll return to them at some point._