https://github.com/iliaschalkidis/LegalCrawler

LegalCrawler: A tool for automated scraping of English legal corpora

Host: GitHub
URL: https://github.com/iliaschalkidis/LegalCrawler
Owner: iliaschalkidis
Created: 2020-11-26T08:34:00.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-08-18T09:40:20.000Z (over 2 years ago)
Last Synced: 2024-08-04T04:01:42.164Z (5 months ago)
Language: Python
Size: 539 KB
Stars: 41
Watchers: 3
Forks: 7
Open Issues: 2
Metadata Files:
- Readme: readme.md

Awesome Lists containing this project

awesome-legal-data - Scripts to crawl English legal corpora

README

## Legal Crawler :octopus:

A collection of scripts to crawl English legal corpora :closed_book: from open public domains.

* The current version supports the following domains:

| Corpus | Domain | Corpus alias |
| ------ | ------ | ------------ |
| :eu: EU legislation | https://eur-lex.europa.eu/ | `eu` |
| :uk: UK legislation | https://legislation.gov.uk/ | `uk` |
| :canada: Canadian legislation | http://laws.justice.gc.ca/eng/ | `ca` |
| :jp: Japanese legislation | http://www.japaneselawtranslation.go.jp/law/ | `jp` |
| :finland: Finnish legislation | https://www.finlex.fi/en | `fi` |
| :us: US case law* | https://case.law/bulk/download/ | `us` |

\* In order to use the script for US case law, you need to first apply for a researcher account at https://case.law.

* For US public filings, e.g., contracts, please use the library OpenEDGAR (https://github.com/LexPredict/openedgar) by LexPredict.

* Documents are saved in raw text format. Amend the code if you wish to better handle metadata, document structure, etc.

## :bangbang: Disclaimer :bangbang:

* If you aim to use the code, please carefully read the individual license agreements with respect to re-use, re-publication, terms of use, etc. :memo:

* The text cleansing applied to the original PDF/HTML files is minimal. Consider amending the scripts and/or writing your own post-processing data cleansing step that better fits each corpus. :construction:

* These scripts aim to give researchers a kick start for scraping legal corpora from public domains. They should not be considered a stand-alone qualified solution. :construction:
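As an illustration of the kind of post-processing mentioned above, here is a minimal sketch of a cleanup pass over the raw text output. It is not part of the repository: the function name `clean_legal_text` and the specific rules (normalising line endings, dropping form feeds, re-joining words hyphenated across line breaks, collapsing whitespace) are assumptions chosen for illustration, and each corpus will likely need its own rules.

```python
import re


def clean_legal_text(raw: str) -> str:
    """Apply light, corpus-agnostic cleanup to text extracted from PDF/HTML.

    Hypothetical helper for illustration; not part of LegalCrawler.
    """
    # Normalise Windows/old-Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Form feeds usually mark PDF page breaks; treat them as line breaks.
    text = text.replace("\f", "\n")
    # Re-join words split across lines: "legis-\nlation" -> "legislation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end (e.g. "cross-" / "border"), so it may need to be relaxed or replaced with a dictionary check for some corpora.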

## Project Requirements:

### Python packages

* json-lines

* tqdm

* beautifulsoup4

### Linux packages (command line tools)

The following Linux command-line tools are used to process PDF documents:

* pdftocairo

* pdftotext

* mutool

* gs
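Before starting a long crawl, it can help to confirm that these tools are actually available. The following is a small sketch, not part of the repository, that looks up each tool name from the list above on the current `PATH`; the helper name `missing_tools` is an assumption for illustration.

```python
import shutil

# Tool names taken from the list above.
REQUIRED_TOOLS = ("pdftocairo", "pdftotext", "mutool", "gs")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command-line tools: " + ", ".join(missing))
    else:
        print("All PDF command-line tools were found.")
```

This only checks that an executable with the given name exists; it does not verify versions or that the tool works on a given PDF.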

## Quick start:

### Install Python and system requirements:

```
pip install -r requirements.txt
sudo apt-get install libcairo2-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install -y xpdf
sudo apt-get install mupdf mupdf-tools
```

### Download Canadian legislation

```
python download_legal_corpora.py --corpus ca
```

### Download EU legislation

```
python download_legal_corpora.py --corpus eu
```

### Download all (EU, UK, CA, FI, JP, US)

```
python download_legal_corpora.py --corpus all
```
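To download a subset of corpora rather than a single one or all of them, one option is to invoke the script once per alias. The sketch below is not part of the repository. It assumes it is run from the repository root, uses only the `--corpus` flag and the aliases listed in the table above, and makes no assumption about whether the script accepts several aliases in one call. The helper names `build_command` and `download` are hypothetical.

```python
import subprocess
import sys

# Corpus aliases as listed in the table of supported domains.
CORPUS_ALIASES = ("eu", "uk", "ca", "jp", "fi", "us")


def build_command(alias):
    """Build the command line for downloading one corpus."""
    if alias not in CORPUS_ALIASES:
        raise ValueError(f"Unknown corpus alias: {alias!r}")
    # sys.executable reuses the interpreter that runs this helper.
    return [sys.executable, "download_legal_corpora.py", "--corpus", alias]


def download(aliases):
    """Run the downloader once per alias, stopping at the first failure."""
    for alias in aliases:
        subprocess.run(build_command(alias), check=True)
```

For example, `download(["ca", "fi"])` would fetch the Canadian and Finnish corpora in turn. With `check=True`, a non-zero exit status from the script raises `subprocess.CalledProcessError`.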

## Citation

In case you use this repo or any derivative in your work, please cite using the following:

```
@Misc{chalkidis-legalcrawler,
  author = {Ilias Chalkidis},
  title = {{Legal Crawler}: A collection of scripts to crawl English legal corpora from open public domains.},
  howpublished = {\url{https://github.com/iliaschalkidis/LegalCrawler/}},
  year = {2020--2022}
}
```