{"id":16857642,"url":"https://github.com/brawer/cadaref-zurich","last_synced_at":"2025-03-18T12:14:54.490Z","repository":{"id":247971092,"uuid":"826343633","full_name":"brawer/cadaref-zurich","owner":"brawer","description":"georeferencing scanned cadastral maps of the City of Zürich","archived":false,"fork":false,"pushed_at":"2024-11-04T13:50:00.000Z","size":4363,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-24T18:12:08.977Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brawer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-09T14:23:02.000Z","updated_at":"2024-11-04T13:50:04.000Z","dependencies_parsed_at":"2024-09-13T20:36:42.026Z","dependency_job_id":"98a90f86-bb41-4380-b158-acf3ecc98d63","html_url":"https://github.com/brawer/cadaref-zurich","commit_stats":null,"previous_names":["brawer/cadaref-zurich"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fcadaref-zurich","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fcadaref-zurich/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fcadaref-zurich/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brawer%2Fcadaref-zurich/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/brawer","download_url":"https://codeload.github.com/brawer/cadaref-zurich/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244217948,"owners_count":20417677,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T14:08:53.088Z","updated_at":"2025-03-18T12:14:54.468Z","avatar_url":"https://github.com/brawer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cadaref Zürich\n\n## Usage\n\nTo run the georeferencing pipeline on a workstation, install a container\nruntime such as [Docker](https://www.docker.com/products/docker-desktop/)\nor [Podman](https://podman.io/docs/installation). Then, execute the following\ncommands in a shell.\n\nAs `path/to/scans`, pass the file path to a directory on your\nworkstation. This directory (or any of its sub-directories) should\ncontain scanned cadastral plans in PDF format.\n\nThe pipeline will write its intermediate files and the final output\nto `workdir`. 
If the process is interrupted, for example because\nthe workstation is turned off, you can restart the pipeline with\nthe same command; it will just continue to work.\n\nFor security reasons, we highly recommend disabling network access\nby passing `--network none` to the container runtime.\n\n```sh\nmkdir workdir\ndocker run \\\n    --network none   \\\n    --mount type=bind,src=path/to/scans,dst=/home/cadaref/scans,readonly   \\\n    --mount type=bind,src=./workdir,dst=/home/cadaref/workdir  \\\n    ghcr.io/brawer/cadaref-zurich:v0.2.0\n```\n\n## Pipeline\n\nThe pipeline works in stages. Each stage creates a sub-directory in\n`workdir` that contains data (typically an image, text, or a CSV file)\nfor every cadastral mutation. The file names are the same short\nmutation identifiers that also appear in the present-day cadastral\ndatabase, for example `21989` or `HG3099`.  Sometimes, the scanning\nprocess has split a mutation file into multiple PDFs, possibly when\nthe historical documents happened to get archived in separate physical\nfolders. In this case, the pipeline assembles the various parts,\nso we always have all data for a mutation in a single file.\n\nThe pipeline consists of the following stages:\n\n1. **Finding work:** The pipeline starts by listing the contents\nof the input directory, looking for PDF files that match the\nnaming scheme used by the cadastral plan archive of the City of Zürich.\nFor each mutation, the pipeline checks if there’s a log file from\na previous run. If no log file can be found, the mutation is put on\na work queue for processing.\n\n2. **Text extraction:** In `workdir/text`, the pipeline stores the\nplaintext for every mutation as found by means of Optical Character\nRecognition (OCR). 
To produce its archival PDF/A files, the document\nscanning center of the City of Zürich uses\n[Kodak Capture Pro](https://support.alarisworld.com/en-us/capture-pro-software).\nWhile developing this pipeline for georeferencing historical cadastral plans,\nwe evaluated various alternative OCR systems:\n[Tesseract](https://tesseract-ocr.github.io/tessdoc/),\n[Jaided EasyOCR](https://www.jaided.ai/easyocr_enterprise/),\n[Microsoft Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/read),\n[Apple Vision API](https://developer.apple.com/documentation/vision),\n[Google Document AI](https://cloud.google.com/document-ai),\nand [Amazon Textract](https://aws.amazon.com/textract/).\nHowever, the OCR engine of Kodak Capture Pro\ngave the best quality for the input dataset.\nTherefore, the current version of the pipeline simply extracts\nthe embedded plaintext from the PDF/A input. For PDF parsing\nand layout analysis, it uses the [Poppler](https://poppler.freedesktop.org/)\nlibrary.\n\n3. **Rendering:** The pipeline converts every page in the input PDF\nto a tiled, 24-bit color, single-page TIFF image with gzip compression.\nFor PDF rendering, we use [Poppler](https://poppler.freedesktop.org/)\nand [Cairo](https://www.cairographics.org/).\n\n4. **Page splitting:** In `workdir/rendered`, the pipeline stores a\ntiled, gzip-compressed, 24-bit color, multi-page TIFF image file\nfor every mutation dossier. In the output of the previous step,\nthe pipeline detects glued-together pages and splits them in two halves\nalong the middle fold. Possibly to save time, the human scanning operators\noccasionally happened to merge two separate DIN A4 pages into a single A3\npage with landscape orientation. However, sometimes a historical cadastral\nplan really was in landscape DIN A3 format, so we cannot blindly split\nall A3 pages. 
Likewise, the scan contractors did not bother to separate\nthe left and right halves when scanning bound tomes of the early 20th century;\nand again, we cannot blindly split everything because sometimes, a single\nhistorical map really does span two pages. Initially, we detected this\nsituation algorithmically, by means of image analysis.\nUltimately, however, we settled on looking for certain keywords\nin the OCRed text. Looking for certain keywords is much simpler and turned\nout to be more reliable.\n\n5. **Thresholding:** In `workdir/thresholded`, the pipeline stores\na thresholded (binarized) version of the rendered image as a tiled,\nmulti-page, black-and-white TIFF image in [Telefax CCITT Group 4 compression](https://en.wikipedia.org/wiki/Group_4_compression). The pipeline chooses a suitable\nthreshold for each page by means of the classic [Ōtsu method](https://en.wikipedia.org/wiki/Otsu%27s_method). However, the Zürich mutation plan archive\ncontains a handful of very dark scans where the Ōtsu method did not\nperform well. The pipeline detects this, and applies a custom workaround\nto handle it. Also at this stage, the pipeline runs some basic image\npre-processing algorithms to clean up scanning artifacts. For example,\na morphological operation is used to remove small dust speckles.\n\n6. **Detecting screenshots:** Some mutation dossiers of the late 1990s\nand early 2000s contain printed-out screenshots of a Microsoft Windows\ndatabase. At the time, this Windows tool was used to manage the\ncadastral register, and Windows screenshots were regularly printed out\nand archived.  Because these screenshot print-outs look like maps\n(they have long thin lines like a cadastral plan), our pipeline needs\nto detect them. We experimented with computer vision, but by far the\neasiest and most reliable way to detect screenshots was to look at the\nOCRed text. 
The pipeline does not generate any special files for detected\nscreenshots, but it notes a list of screenshot pages in the logs.\n\n7. **Detecting map scale:** The pipeline tries to find the map scale,\nsuch as `1:500`, which is often (but not always) printed on the historical\nmap. If no scale designation can be found on the page, the pipeline falls\nback to the other pages in the same mutation dossier because sometimes\nthe scale was given on the page next to the actual map. If this still\ndoes not lead to any map scales, the pipeline supplies a fallback list\nwith map scales that commonly appear in the Zürich dataset.\n\n8. **Measuring distance limit:** For every scanned page that hasn’t\nbeen classified as a screenshot, the pipeline measures the maximum\ndistance between any two points assuming it’s a map.  The inputs to\nthis computation are the map scale, the width and height of the\nrendered image in raster pixels, and the resolution of the rendered\nimage in dots per inch (dpi). For example, if the detected map scale\nis 1:1000, and the image is 2480×3508 pixels at 300 dpi, the scanned\npage is 21.0×29.7 centimeters (DIN A4). At scale 1:1000, this\ncorresponds to 210×297 meters on the ground. Thus, the distance\nbetween any two points depicted on this map can’t be more than\n√(210² + 297²) = 363.7 meters.  We’ll need this value in the next step.\n\n9. **Estimating mutation bounds:** In `workdir/bounds`, the pipeline\nstores a GeoJSON file with the approximate bounds of the mutation.\nThe bounds are approximated by looking up the parcel numbers, found by\nmeans of Optical Character Recognition, in the survey data of December\n2007.  This will capture any parcels whose numbers are mentioned in\nthe text documentation for the mutation, and any parcels whose numbers\nwere printed on the map (provided OCR managed to read the text).\nAlso, today’s land survey database stores for every parcel by what\nmutation it got created. 
In case our historic mutation has created\nparcels that still happen to exist today, we incorporate their\nbounds into our estimation. If the estimated bounds are smaller than\nthe distance limit (the maximal distance covered by the map) from the\nprevious step, we grow the bounding box accordingly. If no bounds\ncan be found, the pipeline stops processing the mutation with status\n`BoundsNotFound`.\n\n10. **Symbol recognition:** In `workdir/symbols`, the pipeline stores\na CSV file that lists which symbols have been recognized on the historical\nmap images by means of computer vision. The CSV file contains the\nfollowing columns: `page` for the document page, `x` and `y` for\nthe pixel coordinates on that page (which can be fractional because\nsymbol recognition works on an enhanced-resolution image), and\n`symbol` with the detected symbol type. If there’s not a single page\nin the dossier with at least four cartographic symbols, the pipeline\nstops processing this mutation with status `NotEnoughSymbols`.\n\n11. **Survey data extraction:** In `workdir/points`, the pipeline\nstores a CSV file with the geographic points (survey markers, fixed\npoints) that are likely to have been drawn on the historical cadastral\nmap.  The CSV file contains the following columns: `id`, `x`, `y` and\n`symbol`.  The latter is the cartographic symbol type likely to be\nused on the map, inferred from known properties of the feature\n(e.g. whether or not a marker has been secured with a metal\nbolt). Essentially, this is an excerpt of the cadastral survey data,\nlimited to the geographical area found earlier in the **Estimating\nmutation bounds** stage.  To the extent possible, the pipeline further\nrestricts this set of points to those that actually existed at the\ntime the map was drawn. For example, a survey marker that existed\nbetween 1969 and 1992 would be included when georeferencing a historical\nmap from 1984, but not when georeferencing a map from 1930 or 1999. 
We\nallow for some slack (up to a year) in date comparisons, in case the\nrecorded dates were not fully accurate. The set of points is taken\nfrom two sources: the land survey database as of 2007, and a list of\n[deleted points](src/deleted_points.csv) that we recovered (and\nmanually checked) from scanned and OCRed point deletion logs that\nhappened to get archived by the City of Zürich.\n\n12. **Georeferencing:** In `workdir/georeferenced`, the pipeline stores\ngeo-referenced imagery in Cloud-Optimized GeoTIFF format. The georeferencing\nis done by calling the [Cadaref tool](https://github.com/brawer/cadaref)\nwith the rendered image, map scale, symbols and points that were found\nby the previous steps. If an image could not be georeferenced, the pipeline\nstores it in TIFF format in `workdir/not_georeferenced`.\n\n\nIn `workdir/logs`, the pipeline stores a log file for every\nmutation.\n\nIn `workdir/tmp`, the pipeline stores temporary files. We do not use `/tmp`\nbecause some of our temporary files can be very large, and we do not\nwant to exhaust physical memory in case `/tmp` happens to be implemented\nby a [tmpfs file system](https://en.wikipedia.org/wiki/Tmpfs) on the\nworker machine.\n\nTo maximize throughput, the pipeline will concurrently process several\nmutation files on a multi-processor machine.\n\n\n## Contributing\n\nIf you’d like to work on this pipeline, please have a look at\nthe [developer guidelines](docs/CONTRIBUTING.md). Your contributions\nwould be very welcome.\n\n\n## License\n\nCopyright 2024 by Sascha Brawer, released under the [MIT license](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrawer%2Fcadaref-zurich","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrawer%2Fcadaref-zurich","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrawer%2Fcadaref-zurich/lists"}