https://github.com/YaleDHLab/intertext
Detect and visualize text reuse
data-visualization minhash text-mining web-app
- Host: GitHub
- URL: https://github.com/YaleDHLab/intertext
- Owner: YaleDHLab
- Archived: true
- Created: 2017-12-30T22:54:04.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2024-09-04T19:50:31.000Z (about 1 year ago)
- Last Synced: 2024-09-06T04:41:36.330Z (about 1 year ago)
- Topics: data-visualization, minhash, text-mining, web-app
- Language: Python
- Homepage: https://duhaime.s3.amazonaws.com/yale-dh-lab/intertext/demo/index.html
- Size: 3.11 MB
- Stars: 115
- Watchers: 10
- Forks: 10
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# Note: This repository has been archived
This project was developed under a previous phase of the Yale Digital Humanities Lab. Now a part of Yale Library’s Computational Methods and Data department, the Lab no longer includes this project in its scope of work. As such, it will receive no further updates.
# Intertext
> Detect and visualize text reuse within collections of plain text or XML documents.
Intertext uses machine learning and interactive visualizations to identify and display intertextual patterns in text collections. The text processing is based on minhashing vectorized strings and the web viewer is based on interactive React components. [[Demo](https://duhaime.s3.amazonaws.com/yale-dh-lab/intertext/output/index.html)]
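To make the minhashing idea concrete, here is a minimal sketch of the general technique, not Intertext's actual implementation. It assumes character 5-gram shingles and 128 seeded hash functions; the function names and parameters are hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-character windows in a normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def seeded_hash(seed, shingle):
    """Hash a shingle to a 64-bit integer; each seed acts as a separate hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=5):
    """Keep the minimum hash value per hash function over all shingles of the text."""
    windows = shingles(text, k)
    return [min(seeded_hash(seed, s) for s in windows) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = "It is a truth universally acknowledged, that a single man in possession"
reused = "it is a truth universally acknowledged that a single man in possession"
unrelated = "Call me Ishmael. Some years ago, never mind how long precisely"

sig = minhash_signature(original)
print(estimated_similarity(sig, minhash_signature(reused)))
print(estimated_similarity(sig, minhash_signature(unrelated)))
```

Passages that share most of their shingles produce signatures that agree in most slots, so near-duplicates can be found by comparing short signatures instead of full texts.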

# Installation
To install Intertext, run the steps below:
```bash
# optional: install Anaconda and set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext
# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip
```
# Usage
```bash
# search for intertextuality in some documents
python intertext/intertext.py --infiles "sample_data/texts/*.txt" --metadata "sample_data/metadata.json" --verbose --update_client
# serve output
python -m http.server 8000
```
Then open a web browser to `http://localhost:8000/output` and you'll see any intertextualities the engine discovered!
## CUDA Acceleration
To enable CUDA acceleration, we recommend the following steps when installing the module:
```bash
# set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext
# set up cuda and cupy
conda install cudatoolkit
conda install -c conda-forge cupy
# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip
```
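After installing, it can help to confirm that CuPy is importable and can see a GPU. The check below is a sketch that uses only CuPy's own runtime API; how Intertext itself detects CUDA support is not covered here, and the `gpu_status` helper is a hypothetical name.

```python
def gpu_status():
    """Describe whether cupy is importable and whether it can see a CUDA device."""
    try:
        import cupy
    except ImportError:
        return "cupy is not installed"
    try:
        # Raises if the CUDA driver or runtime is missing or unusable.
        count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        return f"cupy is installed but no CUDA device is usable: {exc}"
    return f"cupy is installed and sees {count} CUDA device(s)"


print(gpu_status())
```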
## Providing Metadata
To display the author and title of matching texts, pass the path to a metadata file to the `intertext` command with the `--metadata` flag, e.g.
```bash
intertext --infiles "sample_data/texts/*.txt" --metadata "sample_data/metadata.json"
```
Metadata files should be JSON files with the following format:
```json
{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml"
  }
}
```
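For larger collections it may be easier to generate this file from a script. The sketch below builds a dictionary in the format above and lists input files that lack an entry. It assumes metadata keys are file base names, as the example suggests; the `records` structure, its `path` field, and both helper names are hypothetical.

```python
import json
from pathlib import Path

FIELDS = ("author", "title", "year", "url")


def build_metadata(records):
    """Map each file's base name to the author/title/year/url fields it provides."""
    metadata = {}
    for record in records:
        entry = {key: record[key] for key in FIELDS if key in record}
        metadata[Path(record["path"]).name] = entry
    return metadata


def missing_entries(metadata, infiles):
    """Return the base names of input files that have no metadata entry."""
    return sorted(Path(p).name for p in infiles if Path(p).name not in metadata)


# Hypothetical input record; replace with your own catalogue data.
records = [
    {
        "path": "sample_data/texts/a.xml",
        "author": "Author A",
        "title": "Title A",
        "year": 1751,
        "url": "https://google.com?text=a.xml",
    },
]

metadata = build_metadata(records)
print(json.dumps(metadata, indent=2))
print(missing_entries(metadata, ["sample_data/texts/a.xml", "sample_data/texts/b.xml"]))
```

Writing the result with `json.dump(metadata, open_file, indent=2)` would produce a file that can be passed to `--metadata`.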
## Deeplinking
If your text documents can be read on another website, you can add a `url` attribute to each file's entry in your metadata JSON file (see example above).
If your documents are XML files and you would like to deeplink to specific pages within a reading environment, use the `--xml_page_tag` flag to name the tag that identifies page breaks. You should also include `$PAGE_ID` in the `url` attribute for each file in your metadata file, e.g.
```json
{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml&page=$PAGE_ID"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml&page=$PAGE_ID"
  }
}
```
If your page IDs are specified in an attribute of the `--xml_page_tag` tag, you can name that attribute with the `--xml_page_attr` flag.
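The sketch below illustrates how a page tag, a page attribute, and a `$PAGE_ID` placeholder could fit together. It is an illustration under stated assumptions, not Intertext's own code: the `pb` tag, the `n` attribute, both helper names, and the fallback to a positional index when no attribute is given are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def page_ids(xml_text, page_tag, page_attr=None):
    """Collect one ID per page-break element, in document order."""
    root = ET.fromstring(xml_text)
    ids = []
    for index, element in enumerate(root.iter(page_tag)):
        if page_attr is not None and element.get(page_attr) is not None:
            ids.append(element.get(page_attr))
        else:
            # Assumption: fall back to the element's position when no ID attribute exists.
            ids.append(str(index))
    return ids


def deeplink(url_template, page_id):
    """Substitute a URL-encoded page ID for the $PAGE_ID placeholder."""
    return url_template.replace("$PAGE_ID", quote(page_id, safe=""))


# Hypothetical document: <pb> marks page breaks and its "n" attribute holds the page ID.
xml_text = '<text><pb n="12"/>First page.<pb n="13"/>Second page.</text>'
template = "https://google.com?text=a.xml&page=$PAGE_ID"

for page_id in page_ids(xml_text, page_tag="pb", page_attr="n"):
    print(deeplink(template, page_id))
```

With these inputs, the tag name would correspond to the value given to `--xml_page_tag` and the attribute name to the value given to `--xml_page_attr`.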