{"id":21007043,"url":"https://github.com/jjfiv/poetry-identification","last_synced_at":"2025-08-26T03:09:02.482Z","repository":{"id":142094515,"uuid":"216228698","full_name":"jjfiv/poetry-identification","owner":"jjfiv","description":"Poetry Identification Code from my dissertation runs on zip files containing DJVUXML from the Internet Archive.","archived":false,"fork":false,"pushed_at":"2020-01-07T20:30:23.000Z","size":7099,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-13T15:15:59.877Z","etag":null,"topics":["digital-humanities","djvuxml","internet-archive","machine-learning","poetry","random-forests"],"latest_commit_sha":null,"homepage":"https://ciir.cs.umass.edu/downloads/poetry","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jjfiv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-19T15:34:39.000Z","updated_at":"2022-12-13T22:06:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"96f8ed4b-3d3d-414d-b3e7-75bba971b039","html_url":"https://github.com/jjfiv/poetry-identification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jjfiv/poetry-identification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2Fpoetry-identification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2Fpoetry-identification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
jjfiv%2Fpoetry-identification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2Fpoetry-identification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jjfiv","download_url":"https://codeload.github.com/jjfiv/poetry-identification/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jjfiv%2Fpoetry-identification/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272164962,"owners_count":24884626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-26T02:00:07.904Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["digital-humanities","djvuxml","internet-archive","machine-learning","poetry","random-forests"],"created_at":"2024-11-19T08:54:35.475Z","updated_at":"2025-08-26T03:09:02.447Z","avatar_url":"https://github.com/jjfiv.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Poetry-Identification\nPoetry Identification Code from my dissertation runs on zip files containing DJVUXML from the Internet Archive.\n\n## Where did this model come from?\n\nFor details about where this model came from, or what it does, refer to [my dissertation](https://scholarworks.umass.edu/dissertations_2/1573/) for now.\n\n```bibtex\n@phdthesis{foley2019thesis,\n  author = {John Foley},\n  title 
= {{Poetry: Identification, Entity Recognition, and Retrieval}},\n  year = {2019},\n  school = {University of Massachusetts},\n}\n```\n\n## Can I get some data for this?\n\nData from my dissertation is available at [CIIR/downloads/poetry](http://ciir.cs.umass.edu/downloads/poetry). The training data used to build the model is there, as well as the output of this model on the 50,000 books from the INEX 2007 challenge (basically a random sample of Internet Archive books).\n\n## How do I run the code?\n\nYou'll need a bunch of DJVUXML books available in a zip file. I have so many of these -- email me and we can work something out :)\n\n### Prepare\n1. Get [Rust](https://rustup.rs/).\n2. From the ``classification`` directory, run ``gunzip ../models/forest-05-2019.json.gz`` to extract the model; it's too big for GitHub otherwise. You only need to do this once.\n\nBuild and run the code:\n```bash\ncd classification\ncargo build --release\n./target/release/classification --model ../models/forest-05-2019.json --books input_books.zip \u003e input_books.poetry.jsonl\n```\n\nOnce built, the ``classification`` binary is very portable because Rust statically links Rust dependencies by default -- you can build it once and copy it to a cluster of Linux machines fairly easily.\n\n## About this Code\n\nThis code is written in Rust. There are two packages: ``djvuxml-rs``, which is a fairly generic way to interact with Internet Archive scanned book files, and ``classification``, which runs through the books using a JSONified Random Forest model and makes predictions at the page level. The files in the Poetry50K collection on [CIIR/downloads/poetry](http://ciir.cs.umass.edu/downloads/poetry) were generated by de-duplicating the output of this code.\n\n## Help? Where's the code for XXX?\n\nI'm slowly cleaning up and open-sourcing all the code. If you're looking for a piece that hasn't been made public yet, please don't hesitate to contact me! 
File an issue here or check out my [personal website](https://jjfoley.me) to find my latest academic email.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjfiv%2Fpoetry-identification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjjfiv%2Fpoetry-identification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjjfiv%2Fpoetry-identification/lists"}