{"id":13827589,"url":"https://github.com/YaleDHLab/intertext","last_synced_at":"2025-07-09T04:32:53.537Z","repository":{"id":28655666,"uuid":"115829599","full_name":"YaleDHLab/intertext","owner":"YaleDHLab","description":"Detect and visualize text reuse","archived":true,"fork":false,"pushed_at":"2024-09-04T19:50:31.000Z","size":3259,"stargazers_count":115,"open_issues_count":6,"forks_count":10,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-09-06T04:41:36.330Z","etag":null,"topics":["data-visualization","minhash","text-mining","web-app"],"latest_commit_sha":null,"homepage":"https://duhaime.s3.amazonaws.com/yale-dh-lab/intertext/demo/index.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YaleDHLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-30T22:54:04.000Z","updated_at":"2024-09-04T19:50:47.000Z","dependencies_parsed_at":"2022-08-29T10:31:08.613Z","dependency_job_id":"d669978f-3f5e-419e-ad42-309fa9dd178e","html_url":"https://github.com/YaleDHLab/intertext","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaleDHLab%2Fintertext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaleDHLab%2Fintertext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaleDHLab%2Fintertext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YaleDHLab%2Fintertext/manifests","owner_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YaleDHLab","download_url":"https://codeload.github.com/YaleDHLab/intertext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225486362,"owners_count":17481883,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-visualization","minhash","text-mining","web-app"],"created_at":"2024-08-04T09:02:02.550Z","updated_at":"2024-11-20T07:30:49.060Z","avatar_url":"https://github.com/YaleDHLab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Note: This repository has been archived\nThis project was developed under a previous phase of the Yale Digital Humanities Lab. Now a part of Yale Library’s Computational Methods and Data department, the Lab no longer includes this project in its scope of work. As such, it will receive no further updates.\n\n\n# Intertext\n\n\u003e Detect and visualize text reuse within collections of plain text or XML documents.\n\nIntertext uses machine learning and interactive visualizations to identify and display intertextual patterns in text collections. The text processing is based on minhashing vectorized strings and the web viewer is based on interactive React components. 
[[Demo](https://duhaime.s3.amazonaws.com/yale-dh-lab/intertext/output/index.html)]\n\n![App preview](/docs/preview.png?raw=true)\n\n# Installation\n\nTo install Intertext, run the steps below:\n\n```bash\n# optional: install Anaconda and set up conda virtual environment\nconda create --name intertext python=3.7\nconda activate intertext\n\n# install the package\npip uninstall intertext -y\npip install https://github.com/yaledhlab/intertext/archive/master.zip\n```\n\n# Usage\n\n```bash\n# search for intertextuality in some documents\npython intertext/intertext.py --infiles \"sample_data/texts/*.txt\" --metadata \"sample_data/metadata.json\" --verbose --update_client\n\n# serve output\npython -m http.server 8000\n```\n\nThen open a web browser to `http://localhost:8000/output` and you'll see any intertextualities the engine discovered!\n\n## CUDA Acceleration\n\nTo enable CUDA acceleration, we recommend using the following steps when installing the module:\n\n```bash\n# set up conda virtual environment\nconda create --name intertext python=3.7\nconda activate intertext\n\n# set up cuda and cupy\nconda install cudatoolkit\nconda install -c conda-forge cupy\n\n# install the package\npip uninstall intertext -y\npip install https://github.com/yaledhlab/intertext/archive/master.zip\n```\n\n## Providing Metadata\n\nTo indicate the author and title of matching texts, pass the path to a metadata file to the `intertext` command with the `--metadata` flag, e.g.\n\n```bash\nintertext --infiles \"sample_data/texts/*.txt\" --metadata \"sample_data/metadata.json\"\n```\n\nMetadata files should be JSON files with the following format:\n\n```json\n{\n  \"a.xml\": {\n    \"author\": \"Author A\",\n    \"title\": \"Title A\",\n    \"year\": 1751,\n    \"url\": \"https://google.com?text=a.xml\"\n  },\n  \"b.xml\": {\n    \"author\": \"Author B\",\n    \"title\": \"Title B\",\n    \"year\": 1753,\n    \"url\": \"https://google.com?text=b.xml\"\n  }\n}\n```\n\n## Deeplinking\n\nIf your text documents 
can be read on another website, you can add a `url` attribute to each of your files within your metadata JSON file (see example above).\n\nIf your documents are XML files and you would like to deeplink to specific pages within a reading environment, you can use the `--xml_page_tag` flag to designate the tag within which page breaks are identified. Additionally, you should include `$PAGE_ID` in the `url` attribute for the given file within your metadata file, e.g.\n\n```json\n{\n  \"a.xml\": {\n    \"author\": \"Author A\",\n    \"title\": \"Title A\",\n    \"year\": 1751,\n    \"url\": \"https://google.com?text=a.xml\u0026page=$PAGE_ID\"\n  },\n  \"b.xml\": {\n    \"author\": \"Author B\",\n    \"title\": \"Title B\",\n    \"year\": 1753,\n    \"url\": \"https://google.com?text=b.xml\u0026page=$PAGE_ID\"\n  }\n}\n```\n\nIf your page ids are specified within an attribute in the `--xml_page_tag` tag, you can specify the relevant attribute using the `--xml_page_attr` flag.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYaleDHLab%2Fintertext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYaleDHLab%2Fintertext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYaleDHLab%2Fintertext/lists"}