https://github.com/bcdh/cadet
Updated Cadet to work with spaCy v3
- Host: GitHub
- URL: https://github.com/bcdh/cadet
- Owner: BCDH
- License: mit
- Created: 2020-12-03T15:04:02.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-05-14T10:22:33.000Z (about 2 years ago)
- Last Synced: 2024-05-21T07:27:24.015Z (about 2 years ago)
- Language: JavaScript
- Homepage: https://cadet-nightly.herokuapp.com
- Size: 14.7 MB
- Stars: 2
- Watchers: 5
- Forks: 2
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
# Cadet: Asset Management for spaCy Language Models

## What is Cadet?
Cadet is a web app for creating custom language objects for spaCy.
- **Goal**: To provide an easy-to-use tool that enables non-technical users to start leveraging the power of natural language processing (NLP) in their research projects.
- **Context**: CLS Infra + DARIAH-Princeton Workshop Series "NLP 4 New Languages" (funded by the NEH)
## New Languages for spaCy?
- before you can train your model on annotated data, you need some data to begin with
- a spaCy language object contains multiple linguistic assets, not just an annotated corpus
- spaCy offers models for many languages, but starting from scratch is not easy

## Why Cadet?
- **Accessibility:** Makes the collection and processing of language assets accessible to humanists without a background in programming or data science.
- **Customization:** Allows users to tailor language data to their specific needs and research domains.
- **Efficiency:** Streamlines the process of creating and processing language assets for new spaCy language models.
## Two flavors of Cadet
- **Stand-alone web app**: User-friendly GUI with an intuitive design that simplifies model creation and customization.
- **Jupyter Notebook**: More flexible than the stand-alone web app, but requires some knowledge of Python.
## How does it work?
- it takes the user through the following steps, one at a time

### 1. Create a New Language Object
Starting from spaCy's defaults, this step creates a new language object for your language.
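
As a rough illustration of what a language object built on spaCy's defaults can look like in spaCy v3, here is a minimal sketch. It is not Cadet's own code: the language code `demo`, the class names, and the stop words are placeholders chosen for the example.

```python
import spacy
from spacy.language import Language


class DemoDefaults(Language.Defaults):
    # Everything not overridden here (punctuation rules, tokenizer
    # exceptions, and so on) is inherited from spaCy's base defaults.
    stop_words = {"je", "i"}  # placeholder stop words, not real data


# Registering the class lets spacy.blank() find it by its code.
@spacy.registry.languages("demo")
class DemoLanguage(Language):
    lang = "demo"  # hypothetical language code
    Defaults = DemoDefaults


nlp = spacy.blank("demo")
doc = nlp("Ovo je test")
tokens = [token.text for token in doc]
stop_flags = [token.is_stop for token in doc]
```

The blank pipeline has no trained components yet; it only tokenizes and applies the lexical attributes defined in the defaults, which is the starting point for the later steps.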
### 2. Provide example sentences

### 3. Tokenization Check

### 4. Lookup Tables
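
How Cadet stores its lookup tables is not described here. As a hedged sketch, spaCy itself provides a `Lookups` container for tables such as form-to-lemma mappings. The table name `lemma_lookup` follows spaCy's usual convention, while the entries and the `lemmatize` helper are invented for the example.

```python
from spacy.lookups import Lookups

lookups = Lookups()

# A lookup table maps a surface form to a value, here its lemma.
# The entries are illustrative only.
lemma_table = lookups.add_table("lemma_lookup", {"mice": "mouse", "went": "go"})


def lemmatize(form: str) -> str:
    """Return the lemma from the table, or the form itself if unknown."""
    return lemma_table.get(form, form)
```

Falling back to the original form for unknown words is one possible design choice; a real workflow might instead flag missing entries for manual review.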

### 5. Load texts for annotation

### 6. Frequent Tokens
#### Overview

#### Bulk Editing

### 7. Generate CoNLL-U Files for Export to INCEpTION
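
For orientation, a CoNLL-U file holds one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), uses `_` for unspecified values, and ends each sentence with a blank line. The sketch below is not Cadet's exporter; the function `to_conllu` and its inputs are hypothetical and only show the shape of the output.

```python
from typing import Dict, List


def to_conllu(sentences: List[List[str]], lemmas: Dict[str, str]) -> str:
    """Serialize pre-tokenized sentences as CoNLL-U text.

    Only ID, FORM and (when known) LEMMA are filled in; the other
    seven columns are left as "_" for annotators to complete.
    """
    blocks = []
    for sent_id, tokens in enumerate(sentences, start=1):
        lines = [f"# sent_id = {sent_id}", f"# text = {' '.join(tokens)}"]
        for idx, form in enumerate(tokens, start=1):
            lemma = lemmas.get(form.lower(), "_")
            lines.append("\t".join([str(idx), form, lemma] + ["_"] * 7))
        blocks.append("\n".join(lines))
    # Every sentence, including the last, is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n"


sample = to_conllu([["Mice", "ran"]], {"mice": "mouse"})
```

Leaving the annotation columns empty matches the idea of exporting unannotated text for manual annotation, though the columns Cadet actually pre-fills may differ.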

### 8. Export model for training

## Install and run with Docker
1. Make sure you have Docker installed on your machine (including the `docker` command).
2. Clone this repository and navigate to its root.
For example:
```
git clone git@github.com:BCDH/cadet.git
cd cadet
```
3. Build the Docker image
```
docker build -t cadet .
```
4. Run the Docker container
```
docker run -p 8000:8000 cadet
```
## Repo template

## How to use this template
1. [Click on the green button "Use this template"](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/creating-a-repository-from-a-template)

2. Create a new repository for your app. The name is entirely up to you.
3. When your application is working and ready to deploy, enter the following in your browser:
`https://heroku.com/deploy?template=https://github.com///tree/master`
Please note that you will be prompted to create a Heroku user account if you do not have one.
## Acknowledgements

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 101004984: [CLS INFRA](https://clsinfra.io), as well as from the National Endowment for the Humanities via [New Languages for NLP](https://newnlp.princeton.edu).