https://github.com/internetarchive/iari
Import workflows for the Wikipedia Citations Database
- Host: GitHub
- URL: https://github.com/internetarchive/iari
- Owner: internetarchive
- License: gpl-3.0
- Created: 2021-12-03T18:21:13.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2025-02-25T22:37:24.000Z (11 months ago)
- Last Synced: 2025-05-03T08:06:35.314Z (8 months ago)
- Language: Python
- Size: 6.75 MB
- Stars: 12
- Watchers: 15
- Forks: 9
- Open Issues: 56
Metadata Files:
- Readme: README.development.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Internet Archive Reference Inventory [IARI](https://www.wikidata.org/wiki/Q117023013)
# Authors
IARI has been developed by [Dennis Priskorn](https://www.wikidata.org/wiki/Q111016131) as part of the
[Turn All References Blue project](https://www.wikidata.org/wiki/Q115136754) which is led by
Mark Graham, head of The
[Wayback Machine](https://www.wikidata.org/wiki/Q648266) department of the
[Internet Archive](https://www.wikidata.org/wiki/Q461).
# Endpoints
IARI features a number of endpoints that help patrons
get structured data about references in a Wikipedia article:
[___describe endpoints here briefly with links to further descriptions below___]
New endpoints:
* /article
# Setup & Deployment
Docker
* IARI utilizes Docker
* src/Dockerfile
* src/docker-compose.yml
* uses the PORT environment variable to determine the port the app is served on
Git
* Clone the repo and pull the latest:
* `$ git clone https://github.com/internetarchive/iari.git && cd iari`
Environment variables
* .env file
* located in the top-level src directory
* defines PORT
* defines TESTDEADLINK_KEY
Python Requirements
* python pip
* python gunicorn
* python poetry
JSON directories
* directories to hold cache files must be in the /json directory
* `$ ./setup_json_directories.sh` can be run to set up the directories for the json cache files
Version control
* `pyproject.toml` holds the current version
AWS deployment
* Currently deployed instances are on the AWS machines
* (_TODO insert more information here; get from owenl?_)
* We have three instances defined, each of which can run a separate version of the repo:
  * iari-prod `/home/mojomonger/iari_prod`
  * iari-stage `/home/mojomonger/iari_stage`
  * iari-test `/home/mojomonger/iari_test`
## Docker
### Prerequisites
Ensure the following directories are created at the root of the iari install (they are not in the git repo):
```
json/articles/
json/references/
json/dois/
json/urls/
json/urls/archives
json/xhtmls/
json/pdfs/
```
A `.env` file at the same level as the docker-compose.yml file must contain:
```
PORT=
# some suggested values:
# 5042: for iari-test
# 5088: for iari-stage
# 5000: for iari-prod
TESTDEADLINK_KEY=
```
### Running
`docker compose up -d`

The `-d` flag runs in detached mode, so the container keeps running after the process that invoked it exits.
Use `docker container ls` to see the running containers.
Use `docker compose down` to stop the docker instance. Make sure you are
in the appropriate directory when running this command so the desired
docker instance is taken down.
Note: When on the AWS machine, you must use `docker-compose` rather
than `docker compose`, because the AWS machine does not have the latest
version of the docker command line tools.
# Development
IARI is a Flask app. The main entry point is in `src/__init__.py`.
* Note: this may change to a different API framework.
## Definitions of endpoints
`src/__init__.py` defines the top-level endpoints.
# Caveats
* upgrade docker version on AWS server
* fix poetry install (versions, etc.)
## Setup
_talk about having venv setup_
* `$ python3 -m venv venv`
* for Mac: `$ source venv/bin/activate`
_especially useful when compiling for integrity checking tools_
pip and poetry are used to set things up in python land.
`$ pip install poetry gunicorn && poetry install`
[OLD ENDPOINTS]
* an _article_ endpoint which analyzes a given article and returns basic statistics about it
* a _references_ endpoint which gives back all ids of references found
* a _reference_ endpoint which gives back all details about a reference including templates and wikitext
* a _check-url_ endpoint which looks up the URL and gives back
standardized information about its status
* a _check-doi_ endpoint which looks up the DOI and gives back
standardized information about it from [FatCat](https://fatcat.wiki/), OpenAlex and Wikidata
including abstract, retracted status, and more.
* a _pdf_ endpoint which extracts links from both annotations and free text in PDFs.
* an _xhtml_ endpoint which extracts links from any XHTML page.
## Reference types detected by the ArticleAnalyzer
We support detecting the following types. A reference cannot have multiple types.
We distinguish between two main types of references:
1) **named reference** - type of reference with only a name and no content, e.g. a self-closing `<ref>` tag with only a name attribute (unsupported
beyond counting because we have not decided if they contain any value)
2) **content reference** - reference with any type of content
1) **general reference** - subtype of content reference which is outside a `<ref>` tag and usually found in
sections called "Further reading" or "Bibliography"
(unsupported but we want to support it, see
https://github.com/internetarchive/wcdimportbot/labels/general%20reference)

2) **footnote reference** - subtype of content reference which is inside a `<ref>` tag (supported partially, see below)
Example of a URL-template reference:
`Mueller Report, p12 {{url|http://example.com}} {{bare url inline}}`
This is a content reference -> a footnote reference -> a mixed reference with a URL template.
Example of a plain text reference:
`Mueller Report, p12`
This is a content reference -> a footnote reference -> a short citation reference, aka naked named footnote.
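As a minimal illustrative sketch (a hypothetical helper, not the actual ArticleAnalyzer code), the taxonomy above could be expressed with the same type/subtype labels that appear in the JSON responses later in this document:

```python
def classify_reference(content: str, inside_ref_tag: bool) -> dict:
    """Hypothetical classifier mirroring the taxonomy above; the real
    ArticleAnalyzer logic lives in the IARI source, not here."""
    if inside_ref_tag and not content.strip():
        # a ref tag with only a name and no content -> named reference
        return {"type": "footnote", "footnote_subtype": "named"}
    if inside_ref_tag:
        # any content inside a ref tag -> footnote (content) reference
        return {"type": "footnote", "footnote_subtype": "content"}
    # content outside ref tags, e.g. in "Further reading" -> general reference
    return {"type": "general", "footnote_subtype": None}

print(classify_reference("Mueller Report, p12", inside_ref_tag=True))
# -> {'type': 'footnote', 'footnote_subtype': 'content'}
```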
# Endpoints
## Checking endpoints
### Check URL
This endpoint requires an auth key for the testdeadlink API from Internet Archive to work.
It looks for the environment variable `TESTDEADLINK_KEY`.
The check-url endpoint accepts the following parameters:
* url (string, mandatory)
* refresh (boolean, optional)
* testing (boolean, optional)
* timeout (int, optional)
* debug (boolean, optional)
On error it returns 400.
It returns json similar to
```
{
"first_level_domain": "github.com",
"fld_is_ip": false,
"url": "https://github.com/internetarchive/iari/issues",
"scheme": "https",
"netloc": "github.com",
"tld": "com",
"malformed_url": false,
"malformed_url_details": null,
"archived_url": "",
"wayback_machine_timestamp": "",
"is_valid": true,
"request_error": false,
"request_error_details": "",
"dns_record_found": true,
"dns_no_answer": false,
"dns_error": false,
"status_code": 200,
"testdeadlink_status_code": 200,
"timeout": 2,
"dns_error_details": "",
"response_headers": {
"Server": "GitHub.com",
"Date": "Mon, 12 Jun 2023 12:00:28 GMT",
"Content-Type": "text/html; charset=utf-8",
"Vary": "X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Encoding, Accept, X-Requested-With",
"ETag": "W/\"03d14b06291209737bb700da5adf5968\"",
"Cache-Control": "max-age=0, private, must-revalidate",
"Strict-Transport-Security": "max-age=31536000; includeSubdomains; preload",
"X-Frame-Options": "deny",
"X-Content-Type-Options": "nosniff",
"X-XSS-Protection": "0",
"Referrer-Policy": "no-referrer-when-downgrade",
"Content-Security-Policy": "default-src 'none'; base-uri 'self'; block-all-mixed-content; child-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com objects-origin.githubusercontent.com www.githubstatus.com collector.github.com raw.githubusercontent.com api.github.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com logx.optimizely.com/v1/events *.actions.githubusercontent.com productionresultssa0.blob.core.windows.net/ productionresultssa1.blob.core.windows.net/ productionresultssa2.blob.core.windows.net/ productionresultssa3.blob.core.windows.net/ productionresultssa4.blob.core.windows.net/ wss://*.actions.githubusercontent.com github-production-repository-image-32fea6.s3.amazonaws.com github-production-release-asset-2e65be.s3.amazonaws.com insights.github.com wss://alive.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com objects-origin.githubusercontent.com; frame-ancestors 'none'; frame-src viewscreen.githubusercontent.com notebooks.githubusercontent.com; img-src 'self' data: github.githubassets.com media.githubusercontent.com camo.githubusercontent.com identicons.github.com avatars.githubusercontent.com github-cloud.s3.amazonaws.com objects.githubusercontent.com objects-origin.githubusercontent.com secured-user-images.githubusercontent.com/ user-images.githubusercontent.com/ private-user-images.githubusercontent.com opengraph.githubassets.com github-production-user-asset-6210df.s3.amazonaws.com customer-stories-feed.github.com spotlights-feed.github.com *.githubusercontent.com; manifest-src 'self'; media-src github.com user-images.githubusercontent.com/ secured-user-images.githubusercontent.com/ private-user-images.githubusercontent.com; script-src github.githubassets.com; style-src 
'unsafe-inline' github.githubassets.com; worker-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/",
"Content-Encoding": "gzip",
"Set-Cookie": "_gh_sess=Kt%2FbTAsPUlbQMLzfnhA6DaZ%2FrXJio0ITQdf7cs83MUTpiSwa0ALcco5FkiN3Gne6MBf%2Fzf001SNsFAAt1IXJVU9kFP5EOV1P1UZsps9JI2dyn9lAG0Zt%2FKocfikDwnYU03uuJAmDpI82mOt6L0ULDzxEm1hc0vCMRy72T8saF%2Ba8qoSFPJR296%2FjZVoqrg9cHzAnYr5uo0cj9dQL6pGfoDreWMMV41kG7S%2FbwvU5%2FTWkJnmYTK8XoioOtrVjnvQ%2Fw%2FNuMVh4dkEEeprnmfe8%2Bw%3D%3D--CpOcS3YATJ%2FQZh%2Fk--zuEd5QQlr0A8Pif7t41ywQ%3D%3D; Path=/; HttpOnly; Secure; SameSite=Lax, _octo=GH1.1.1345489649.1686571228; Path=/; Domain=github.com; Expires=Wed, 12 Jun 2024 12:00:28 GMT; Secure; SameSite=Lax, logged_in=no; Path=/; Domain=github.com; Expires=Wed, 12 Jun 2024 12:00:28 GMT; HttpOnly; Secure; SameSite=Lax",
"Accept-Ranges": "bytes",
"Transfer-Encoding": "chunked",
"X-GitHub-Request-Id": "4C01:6326:116FFBB:1199F4E:648708DB"
},
"detected_language": "en",
"detected_language_error": false,
"detected_language_error_details": "",
"timestamp": 1686564033,
"isodate": "2023-06-12T12:00:33.941494",
"id": "7e8230ce"
}
```
When the debug parameter is true the text from the resource is returned.
This text is the basis for the language detection.
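As an illustrative sketch of calling this endpoint, a request could be assembled as below. The base URL is taken from the PDF debug examples elsewhere in this README; the exact `check-url` path segment is an assumption, so verify it against your deployment:

```python
from urllib.parse import urlencode

# Base URL borrowed from the example calls in this README; the exact
# check-url path is an assumption and may differ per deployment.
BASE = "https://archive.org/services/context/iari/v2"

def build_check_url_request(url: str, refresh: bool = False, timeout: int = 2) -> str:
    """Assemble a check-url query from the documented parameters."""
    params = {"url": url, "refresh": str(refresh).lower(), "timeout": timeout}
    return f"{BASE}/check-url?{urlencode(params)}"

request_url = build_check_url_request("https://github.com/internetarchive/iari/issues")
print(request_url)
```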
#### Known limitations
You are very welcome to suggest improvements by opening an issue or sending a pull request. :)
#### Status codes
The status_code attribute is not as reliable as the testdeadlink_status_code.
Sometimes we get back a 403 because an intermediary like Cloudflare detected that we are not a person behind a browser
making the request. We don't have any way to detect these soft200s.
Also, sometimes a server returns status code 200 but the content is an error page or other content than what the
person asking for the information wants. These are called soft404s and we currently make no effort to detect
them.
#### Language detection
We need at least 200 characters to be able to reliably detect the language.
## Statistics
### article
The statistics/article endpoint accepts the following parameters:
* url (mandatory)
* refresh (bool, optional)
* testing (bool, optional)
* revision (int, optional, defaults to the most recent)
* dehydrate (bool, optional, defaults to True)
* regex (string, mandatory)
On error it returns 400. On timeout it returns 504 or 502
(this is a bug and should be reported).
Regex is a string of section headers separated by "|" to analyze for general references.
For enwiki a string like
"bibliography|further%20reading|works%20cited|sources|external%20links"
catches most references outside `<ref>` tags.
Dehydrate means it returns a limited reference object.
Currently the keys "templates" and "url_objects" are
being removed to make sure we don't return a huge blob of json by default.
If you set dehydrate to False the whole reference will be returned.
For an article like https://en.wikipedia.org/wiki/United_States with >600 references,
1.3 MB of JSON is returned.
It will return json [like this](https://gist.github.com/dpriskorn/aac368490f46016f8fb3bd3cf378e98b) (by default with dehydrated references)
and [like this](https://gist.github.com/dpriskorn/b0e4bf7b9b098c5d6f59664450d70cc2) with dehydrate set to false.
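For example, a statistics/article request with the enwiki regex above might be assembled like this (the base URL is the one used in the PDF debug examples elsewhere in this README and may differ per deployment):

```python
from urllib.parse import urlencode

BASE = "https://archive.org/services/context/iari/v2"  # may differ per deployment

# urlencode percent-encodes the spaces and "|" separators for us.
params = {
    "url": "https://en.wikipedia.org/wiki/Test",
    "regex": "bibliography|further reading|works cited|sources|external links",
    "dehydrate": "true",  # default; set to "false" for full reference objects
}
request_url = f"{BASE}/statistics/article?{urlencode(params)}"
print(request_url)
```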
#### Known limitations
* the general references parsing relies on 2 things:
* a manually supplied list of sections to search using the 'regex' to the article and all endpoints. The list is
case insensitive and should be delimited by the '|' character.
* that every line with a general reference begins with a star character (*)
* URLs in comments are not currently supported. Comments begin with .
Any line that doesn't begin with a star is ignored when parsing lines in the sections specified by the regex.
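A minimal sketch of that parsing rule (match the section title against the case-insensitive regex, then keep only lines beginning with a star) could look like:

```python
import re

SECTION_REGEX = "bibliography|further reading|works cited|sources|external links"

def general_reference_lines(section_title: str, section_wikitext: str) -> list[str]:
    """Keep only star-prefixed lines from sections whose title matches
    the regex; everything else is ignored, per the rules above."""
    if not re.fullmatch(SECTION_REGEX, section_title, flags=re.IGNORECASE):
        return []
    return [line for line in section_wikitext.splitlines() if line.startswith("*")]

lines = general_reference_lines(
    "Further reading",
    "* Smith, J. (2020). Example Book.\nThis line is ignored.\n* Doe, A. (1999). Another.",
)
print(len(lines))  # -> 2
```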
### references
The statistics/references endpoint accepts the following parameters:
* wari_id (mandatory, str) (this is obtained from the article endpoint)
* all (optional, boolean) (default: false)
* offset (optional, int) (default: 0)
* chunk_size (optional, int) (default: 10)
On error it returns 400. If data is not found it returns 404.
It will return json similar to:
```
{
"total": 31,
"references": [
{
"id": "cfa8b438",
"template_names": [],
"wikitext": "",
"type": "footnote",
"footnote_subtype": "named",
"flds": [],
"urls": [],
"templates": [],
"titles": [],
"section": "History",
"served_from_cache": true
},...
]
}
```
#### Known limitations
None
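Given the offset and chunk_size parameters, paging through all references can be sketched like this (the wari_id value is hypothetical; the base URL is the one from the examples elsewhere in this README):

```python
from urllib.parse import urlencode

BASE = "https://archive.org/services/context/iari/v2"  # may differ per deployment

def reference_page_urls(wari_id: str, total: int, chunk_size: int = 10) -> list[str]:
    """One statistics/references request URL per page of results."""
    return [
        f"{BASE}/statistics/references?"
        + urlencode({"wari_id": wari_id, "offset": offset, "chunk_size": chunk_size})
        for offset in range(0, total, chunk_size)
    ]

# "total" comes from a first response, like the sample above with total=31.
pages = reference_page_urls("en.wikipedia.org.12345", total=31)
print(len(pages))  # -> 4
```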
### reference
The statistics/reference/id endpoint accepts the following parameter:
* id (mandatory) (this is unique for each reference and is obtained from the article or references endpoint)
On error it returns 400. If data is not found it returns 404.
It will return json similar to:
```
{
"id": "cfa8b438",
"template_names": [],
"wikitext": "",
"type": "footnote",
"footnote_subtype": "named",
"flds": [],
"urls": [],
"templates": [],
"titles": [],
"section": "History",
"served_from_cache": true
}
```
#### Known limitations
None
### PDF
The statistics/pdf endpoint accepts the following parameters:
* url (mandatory)
* refresh (bool, optional, default false)
* testing (bool, optional, default false)
* timeout (int, optional, default 2 (seconds))
* debug (bool, optional) Note: you have to set refresh=true to be sure to get debug output.
* html (bool, optional, default false)
* xml (bool, optional, default false)
* json_ (bool, optional, default false)
* blocks (bool, optional, default false)
On error it returns 404 or 415. The first is when we could not find/fetch the url
and the second is when it is not a valid PDF.
If debug=true is not given, it will return json similar to this:
```
{
"words_mean": 660,
"words_max": 836,
"words_min": 474,
"annotation_links": [
{
"url": "http://arxiv.org/abs/2210.02667v1",
"page": 0
},...
],
"links_from_original_text": [
{
"url": "https://www.un.org/en/about-us/universal-declaration-of-human-rights",
"page": 1
},...
],
"links_from_text_without_linebreaks": [
{
"url": "https://www.un.org/en/about-us/universal-declaration-of-human-rights3https://www.ohchr.org/en/issues/pages/whatarehumanrights.aspx4Legal",
"page": 1
},...
],
"links_from_text_without_spaces": [
{
"url": "https://www.un.org/en/about-us/universal-declaration-of-human-rights",
"page": 1
},...
],
"url": "https://arxiv.org/pdf/2210.02667.pdf",
"timeout": 2,
"pages_total": 17,
"detected_language": "en",
"detected_language_error": false,
"detected_language_error_details": "",
"characters": 74196,
"timestamp": 1686055925,
"isodate": "2023-06-06T14:52:05.228754",
"id": "639dc1df"
}
```
Using the debug and refresh parameters, all the text before and after cleaning is exposed, as well as the link annotations
before url extraction. Using the four additional parameters (html, xml, json_, blocks) you can get the corresponding output from PyMuPDF for further debugging.
Example debug expressions:
To show block debug output:
```
https://archive.org/services/context/iari/v2/statistics/pdf?url=https://ia601600.us.archive.org/31/items/Book_URLs/DeSantis.pdf&refresh=true&debug=true&blocks=true
```
To show html debug output:
```
https://archive.org/services/context/iari/v2/statistics/pdf?url=https://ia601600.us.archive.org/31/items/Book_URLs/DeSantis.pdf&refresh=true&debug=true&html=true
```
*Note: Setting the debug=true parameter without refresh=true will often not yield any debug output, since we don't have it
stored in the cache.*
*Warning: The debug outputs can be very large, up to hundreds of MB,
so use them with care or from the command line to avoid crashing your browser.*
This output permits the data consumer to count the number of links per page, see which links or domains appear most often, and count the number of
characters in the pdf. Using the debug outputs, you can investigate why the API could not find the links correctly.
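For example, counting the links per page from a statistics/pdf response (using an abridged fragment shaped like the sample above) could look like:

```python
from collections import Counter

# Abridged fragment shaped like the statistics/pdf response above.
response = {
    "annotation_links": [
        {"url": "http://arxiv.org/abs/2210.02667v1", "page": 0},
        {"url": "https://www.un.org/en/about-us/universal-declaration-of-human-rights", "page": 1},
        {"url": "https://www.ohchr.org/en/issues/pages/whatarehumanrights.aspx", "page": 1},
    ],
    "pages_total": 17,
}

links_per_page = Counter(link["page"] for link in response["annotation_links"])
print(dict(links_per_page))  # -> {0: 1, 1: 2}
```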
#### Known limitations
The extraction of URLs from unstructured text in any random PDF is
difficult to do reliably. This is because
PDF is not a structured data format with clear boundaries between different types of information.
We currently use PyMuPDF to extract all the text as it appears on the page.
We clean the text in various ways to get the links out via a regex.
This approach results in some links containing information that was part of the
text after the link ended but we have yet to find a way to reliably determine
the end boundary of links.
We are currently investigating whether using the html output of PyMuPDF or
using advanced machine learning could improve the output.
You are very welcome to suggest improvements by opening an issue or sending a pull request. :)
### XHTML
The statistics/xhtml endpoint accepts the following parameters:
* url (mandatory)
* refresh (optional)
* testing (optional)
On error it returns 400.
It will return json similar to:
```
{
"links": [
{
"context": "This is a link to the XHTML test suite table of contents.",
"parent": "
This is a link to the XHTML test suite table of contents. The link contain inline markup (<acronym>), has a title and an accesskey \u2018t\u2019.
",
"title": "The main page of this test suite",
"href": "http://www.hut.fi/u/hsivonen/test/xhtml-suite/"
},
{
"context": "This is a relative link.",
"parent": "This is a relative link. If the link points to http://www.hut.fi/u/hsivonen/test/base-test/base-target, the user agent supports <base/>. if it points to http://www.hut.fi/u/hsivonen/test/xhtml-suite/base-target, the user agent has failed the test.
",
"title": "",
"href": "base-target"
}
],
"links_total": 2,
"timestamp": 1682497512,
"isodate": "2023-04-26T10:25:12.798840",
"id": "fc5aa88d",
"refreshed_now": false
}
```
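Note that hrefs in the response can be relative (`base-target` in the sample above); a consumer can resolve them against the URL that was passed to the endpoint, e.g.:

```python
from urllib.parse import urljoin

page_url = "http://www.hut.fi/u/hsivonen/test/xhtml-suite/"  # the url parameter

# hrefs as returned in the sample response above
hrefs = ["http://www.hut.fi/u/hsivonen/test/xhtml-suite/", "base-target"]
absolute = [urljoin(page_url, href) for href in hrefs]
print(absolute[1])  # -> http://www.hut.fi/u/hsivonen/test/xhtml-suite/base-target
```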
#### Known limitations
None
# Deployment
IARI utilizes Docker, using Dockerfile and docker-compose.yml files to define the build and deploy process.
Clone the git repo:
`$ git clone https://github.com/internetarchive/iari.git && cd iari`
It is recommended to check out the latest release before proceeding.
## Requirements
* python pip
* python gunicorn
* python poetry
## Setup
pip and poetry are used to set things up in python land.
`$ pip install poetry gunicorn && poetry install`
### Virtual environment
Setup:
`$ python3 -m venv venv`
Enter in (in a Linux/Mac terminal):
`$ source venv/bin/activate`
### JSON directories
Lastly, set up the directories for the json cache files:
`$ ./setup_json_directories.sh`
## Run
Run these commands in different shells or in GNU screen.
Before running them please make sure you have activated the virtual environment first, see above.
### GNU screen
Start GNU screen (if you want to have a persisting session)
`$ screen -D -RR`
### Debug mode
Run it with
`$ export TESTDEADLINK_KEY="your key here" && ./run-debug-api.sh`
Test it in another Screen window or local terminal with
`$ curl -i "localhost:5000/v2/statistics/article?regex=external%20links&url=https://en.wikipedia.org/wiki/Test"`
### Production mode
Run it with
`$ export TESTDEADLINK_KEY="your key here" && ./run-api.sh`
Test it in another Screen window or local terminal with
`$ curl -i "localhost:8000/v2/statistics/article?regex=external%20links&url=https://en.wikipedia.org/wiki/Test"`
# PyCharm specific recommendations
## Venv activation
Make sure this setting is checked.
https://www.google.com/search?client=firefox-b-d&q=make+pycharm+automatically+enter+the+venv
Note for mac users: This might not work, so you have to manually enter the venv in all terminals.
# Deployed instances
See [KNOWN_DEPLOYED_INSTANCES.md](KNOWN_DEPLOYED_INSTANCES.md)
# Diagrams
## IARI
### Components

# History of this repo
* version 1.0.0 a proof of concept import tool based on WikidataIntegrator (by James Hare)
* version 2.0.0+ a scalable ETL-framework with an API and capability of reading EventStreams (by Dennis Priskorn)
* version 3.0.0+ WARI, a host of API endpoints that returns statistics
about a Wikipedia article and its references. (by Dennis Priskorn)
* version 4.0.0+ IARI, a host of API endpoints that returns statistics
about both Wikipedia articles and their references and can extract links from any PDF or XHTML page. (by Dennis Priskorn)
# License
This project is licensed under GPLv3+. Copyright Dennis Priskorn 2022
The diagram PNG files are CC0.
# Further reading and installation/setup
See the [development notes](DEVELOPMENT_NOTES.md)