Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/iliaschalkidis/LegalCrawler
LegalCrawler: A tool for automated scraping of English legal corpora
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/iliaschalkidis/LegalCrawler
- Owner: iliaschalkidis
- Created: 2020-11-26T08:34:00.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-08-18T09:40:20.000Z (over 2 years ago)
- Last Synced: 2024-08-04T04:01:42.164Z (5 months ago)
- Language: Python
- Size: 539 KB
- Stars: 41
- Watchers: 3
- Forks: 7
- Open Issues: 2
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
- awesome-legal-data - Scripts to crawl English legal corpora
README
## Legal Crawler :octopus:
A collection of scripts to crawl English legal corpora :closed_book: from open public domains.
* The current version supports the following domains:
| Corpus | Domain | Corpus alias |
| ------------------- | ------------------------------------ | ------------------- |
| :eu: EU legislation | https://eur-lex.europa.eu/ | `eu` |
| :uk: UK legislation | https://legislation.gov.uk/ | `uk` |
| :canada: Canadian legislation | http://laws.justice.gc.ca/eng/ | `ca` |
| :jp: Japanese legislation | http://www.japaneselawtranslation.go.jp/law/ | `jp` |
| :finland: Finnish legislation | https://www.finlex.fi/en | `fi` |
| :us: US case law* | https://case.law/bulk/download/ | `us` |

\* In order to use the script for US case law, you first need to apply for a researcher account at https://case.law.
* For US public filings, e.g., contracts, please use the library OpenEDGAR (https://github.com/LexPredict/openedgar) by LexPredict.
* Documents are saved in raw text format; amend the code if you wish to better handle metadata, document structure, etc.
## :bangbang: Disclaimer :bangbang:
* If you aim to use the code, please carefully read the individual license agreements with respect to re-use, re-publication, terms of use, etc. :memo:
* The text cleansing from the original PDF/HTML files is minimal. Consider amending the scripts and/or writing your own post-processing data cleansing process that better fits each corpus. :construction:
* These scripts aim to give researchers a kick start for scraping legal corpora from public domains. They should not be considered a stand-alone qualified solution. :construction:
## Project Requirements:
### Python packages
* json-lines
* tqdm
* beautifulsoup4
### Linux packages (command line tools)
The following Linux command line tools are used to process PDF documents:
* pdftocairo
* pdftotext
* mutool
* gs
## Quick start:
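Before running the crawler, it can help to confirm that the PDF command line tools listed in the requirements are available on the `PATH`. The following is a minimal sketch and not part of this repository: the helper name `missing_tools` is hypothetical, and it assumes the tools are invoked under exactly the executable names given in the list.

```python
import shutil

# Executable names taken from the list of command line tools in this README.
PDF_TOOLS = ["pdftocairo", "pdftotext", "mutool", "gs"]


def missing_tools(tools=PDF_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command line tools:", ", ".join(missing))
    else:
        print("All PDF command line tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.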
### Install python requirements:
```
pip install -r requirements.txt
sudo apt-get install libcairo2-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install -y xpdf
sudo apt-get install mupdf mupdf-tools
```
### Download Canadian legislation
```
python download_legal_corpora.py --corpus ca
```
### Download EU legislation
```
python download_legal_corpora.py --corpus eu
```
### Download all (EU, UK, CA, FI, JP, US)
```
python download_legal_corpora.py --corpus all
```
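Since the scripts save documents as raw text with minimal cleansing, a separate post-processing step is usually worthwhile once a corpus has been downloaded. The following is a minimal sketch of such a step, not code from this repository: the function name `clean_text` and the specific rules (rejoining words hyphenated across line breaks, collapsing whitespace, dropping empty lines) are assumptions that should be adapted to each corpus.

```python
import re


def clean_text(raw: str) -> str:
    """Apply a few generic clean-up rules to raw extracted text."""
    # Normalise line endings and turn form feeds (page breaks) into newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n")
    # Rejoin words hyphenated across a line break, e.g. "regu-\nlation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then strip each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    # Drop empty lines so that each remaining line carries content.
    return "\n".join(line for line in lines if line)
```

The hyphenation rule is deliberately naive: it also merges genuine compounds that happen to break at a hyphen, so it may need to be disabled or refined for some corpora.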
## Citation
In case you use this repo or any derivative in your work, please cite using the following:
```
@Misc{chalkidis-legalcrawler,
author = {Ilias Chalkidis},
title = {{Legal Crawler}: A collection of scripts to crawl English legal corpora from open public domains.},
howpublished = {\url{https://github.com/iliaschalkidis/LegalCrawler/}},
year = {2020--2022}
}
```