https://github.com/isaacus-dev/open-australian-legal-corpus-creator
The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.
australia corpus dataset datasets law legal open-data scraping web-scraping
- Host: GitHub
- URL: https://github.com/isaacus-dev/open-australian-legal-corpus-creator
- Owner: isaacus-dev
- License: mit
- Created: 2023-06-26T08:15:49.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-08T11:34:35.000Z (6 months ago)
- Last Synced: 2025-01-21T09:35:08.750Z (13 days ago)
- Topics: australia, corpus, dataset, datasets, law, legal, open-data, scraping, web-scraping
- Language: Python
- Homepage: https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus
- Size: 277 MB
- Stars: 76
- Watchers: 7
- Forks: 11
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# Open Australian Legal Corpus Creator
The [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. This repository contains the code used to create and update the Corpus.
To learn more about the Corpus and how it was built, please see Umar Butler's article, [*How I built the largest open database of Australian law*](https://umarbutler.com/how-i-built-the-largest-open-database-of-australian-law/). If you're looking to download the Corpus, you may do so on [Hugging Face](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus).
## Requirements
The Open Australian Legal Corpus Creator requires Python 3.10 or higher, along with [`tesserocr`](https://github.com/sirfz/tesserocr).

Before running the Creator, it is **essential** that you are authorised to scrape and use the sources' data.
## Installation
Install [`tesserocr`](https://github.com/sirfz/tesserocr) by following the instructions [here](https://github.com/sirfz/tesserocr?tab=readme-ov-file#installation) and then run the following command:
```bash
pip install git+https://github.com/umarbutler/open-australian-legal-corpus-creator
```
## Usage
To create or update the Corpus, simply call `mkoalc` from the command line. By default, this will output the Corpus to a file named `corpus.jsonl` in the current working directory. Checkpoints and other Corpus data will be stored in your user data directory.

The Creator's default behaviour may be modified by passing the following optional arguments to `mkoalc`:
* `-s`/`--sources`: The names of the sources to be scraped, delimited by commas. Possible sources are `federal_court_of_australia`, `federal_register_of_legislation`, `high_court_of_australia`, `nsw_legislation`, `nsw_caselaw`, `queensland_legislation`, `south_australian_legislation`, `western_australian_legislation` and `tasmanian_legislation`. Defaults to all supported sources.
* `-o`/`--output`: The path to the Corpus. Defaults to a file named `corpus.jsonl` in the current working directory.
* `-d`/`--data_dir`: The path to the directory in which Corpus data should be stored. Defaults to the user's data directory as determined by [`platformdirs.user_data_dir`](https://github.com/platformdirs/platformdirs#the-problem) (on Windows, this will be `C:/Users//AppData/Local/Umar Butler/Open Australian Legal Corpus`).
* `-n`/`--num_threads`: The number of threads to use for OCRing PDFs with `tesseract`. Defaults to the number of logical CPUs on the system minus one, or one if there is only one logical CPU.
* `-m`/`--max-concurrent-ocr`: The maximum number of PDFs that may be OCR'd concurrently. Defaults to 1.

As an example, if you wanted to output the Corpus to `~/corpus/oalc.jsonl`, save Corpus data to `~/app_data/oalc/` and scrape only the Federal Court of Australia and Federal Register of Legislation, you would run:
```bash
mkoalc -s federal_court_of_australia,federal_register_of_legislation -o ~/corpus/oalc.jsonl -d ~/app_data/oalc/
```
For even greater control over the Creator's behaviour, you may also access it from the `oalc_creator` Python package:
```python
from asyncio import run as async_run # or, if on Linux, `from uvloop import run as async_run`.
from oalc_creator import Creator

# Create a Creator instance.
creator = Creator(
    sources=['federal_court_of_australia', 'federal_register_of_legislation'],
    corpus_path='~/corpus/oalc.jsonl',
    data_dir='~/app_data/oalc/',
)

# Create or update the Corpus.
async_run(creator.create()) # `await creator.create()` if you are already in an event loop (eg, in a Jupyter notebook).
```
By creating your own subclasses of `oalc_creator.Scraper` and then passing them to `oalc_creator.Creator` as the `sources` argument, you can add support for custom sources. Examples of scrapers are available in [`src/oalc_creator/scrapers`](src/oalc_creator/scrapers). You are encouraged to contribute scrapers for new sources via pull requests.
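
Once the Corpus has been written, it can be inspected with the standard library alone. The following is a minimal sketch that assumes only what the `.jsonl` extension implies, namely one JSON object per line. The helper `summarise_corpus` is hypothetical and is not part of `oalc_creator`, and no particular field names are assumed.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_corpus(path):
    """Count the documents in a JSON Lines file and tally the keys they use.

    Hypothetical helper, not part of `oalc_creator`. It assumes that every
    non-empty line holds a single JSON object.
    """
    num_docs = 0
    key_counts = Counter()

    # `expanduser` resolves a leading `~`, as in `~/corpus/oalc.jsonl`.
    with Path(path).expanduser().open(encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip blank lines.

            doc = json.loads(line)
            num_docs += 1
            key_counts.update(doc.keys())

    return num_docs, key_counts
```

Calling `summarise_corpus('~/corpus/oalc.jsonl')` would return the number of documents together with a count of how many documents contain each key. Because the file is read one line at a time, the whole Corpus never needs to fit in memory.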
## Licence
The Creator is licensed under the [MIT License](LICENCE).