https://github.com/bcdh/cadet

Updated Cadet to work with spaCy v3

Host: GitHub
URL: https://github.com/bcdh/cadet
Owner: BCDH
License: mit
Created: 2020-12-03T15:04:02.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-05-14T10:22:33.000Z (about 2 years ago)
Last Synced: 2024-05-21T07:27:24.015Z (about 2 years ago)
Language: JavaScript
Homepage: https://cadet-nightly.herokuapp.com
Size: 14.7 MB
Stars: 2
Watchers: 5
Forks: 2
Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE

# Cadet: Asset Management for spaCy Language Models

![](https://i.imgur.com/yhTiX7G.jpeg)

## What is Cadet?

Cadet is a web app for creating custom language objects for spaCy.

- **Goal**: To provide an easy-to-use tool that enables non-technical users to start leveraging the power of natural language processing (NLP) in their research projects.
- **Context**: CLS Infra + DARIAH-Princeton Workshop Series "NLP 4 New Languages" (funded by the NEH)

## New Languages for spaCy?

- Before you can train a model on annotated data, you need some data to begin with.
- A spaCy language object contains multiple linguistic assets, not just an annotated corpus.
- spaCy offers models for many languages, but starting from scratch is not easy.

![Screenshot 2019-11-20 19.48.27](https://i.imgur.com/7e7B8Pc.png)

## Why Cadet?

- **Accessibility:** Makes the collection and processing of language assets accessible to humanists without a background in programming or data science.
- **Customization:** Allows users to tailor language data to their specific needs and research domains.
- **Efficiency:** Streamlines the process of creating and processing language assets for new spaCy language models.

## Two flavors of Cadet

- **Stand-alone web app**: User-friendly GUI with an intuitive design that simplifies model creation and customization.
- **Jupyter Notebook**: More flexible than the stand-alone web app, but requires knowledge of Python.

## How does it work?

- It takes the user through eight individual steps:

![](https://i.imgur.com/QeBW6GO.png)

### 1. Create a New Language Object

Building on spaCy's defaults, this step creates a new language object for your language.

### 2. Provide example sentences

![](https://i.imgur.com/ak948Ha.png)

### 3. Tokenization Check

![](https://i.imgur.com/GRmRT1X.png)

### 4. Lookup Tables

![](https://i.imgur.com/qu7X9k6.png)
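
For orientation, a lookup table in spaCy is a plain key-to-value mapping, for example from an inflected form to its lemma, and such tables are commonly stored as JSON. The sketch below uses only the Python standard library. The sample entries and the helper names `save_table`, `load_table` and `lemmatize` are hypothetical illustrations, not Cadet's own code or file layout.

```python
import json
from pathlib import Path

# Hypothetical sample entries: inflected form -> lemma.
LEMMA_LOOKUP = {"cats": "cat", "mice": "mouse", "ran": "run"}


def save_table(table: dict, path: Path) -> None:
    """Write the table as UTF-8 JSON, keeping non-ASCII forms readable."""
    text = json.dumps(table, ensure_ascii=False, indent=2, sort_keys=True)
    path.write_text(text, encoding="utf-8")


def load_table(path: Path) -> dict:
    """Read a table previously written by save_table."""
    return json.loads(path.read_text(encoding="utf-8"))


def lemmatize(token: str, table: dict) -> str:
    """Look up the lowercased token; fall back to the token itself."""
    return table.get(token.lower(), token)
```

Falling back to the unchanged token keeps the lookup usable while the table is still incomplete, which is the usual situation for a new language.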

### 5. Load texts for annotation

![](https://i.imgur.com/SgeCIfI.png)

### 6. Frequent Tokens

#### Overview

![](https://i.imgur.com/im3FwqF.png)
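
As a rough illustration of what a frequent-token overview computes, the sketch below counts tokens across a few texts using the Python standard library. Whitespace splitting stands in for the language's real tokenizer, and the sample texts and the function name `frequent_tokens` are hypothetical, not part of Cadet.

```python
from collections import Counter


def frequent_tokens(texts, top_n=3):
    """Return the top_n most common lowercased tokens with their counts."""
    counts = Counter()
    for text in texts:
        # Assumption: whitespace splitting replaces the real tokenizer.
        counts.update(text.lower().split())
    return counts.most_common(top_n)


sample_texts = ["the cat sat", "the dog sat", "the bird flew"]
print(frequent_tokens(sample_texts, top_n=2))  # [('the', 3), ('sat', 2)]
```

Working from the most frequent tokens first means that each manual correction covers as much of the corpus as possible.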

#### Bulk Editing

![](https://i.imgur.com/nUBSOwS.png)

### 7. Generate CoNLL-U Files for Export to INCEpTION

![](https://i.imgur.com/dEye0Io.jpeg)
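
For readers unfamiliar with the format: a CoNLL-U file stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), uses `_` for unspecified values, and ends each sentence with a blank line. The sketch below serializes one sentence in that shape. The function name `to_conllu` and the sample tokens are hypothetical, and this is not Cadet's export code.

```python
def to_conllu(sent_id, tokens):
    """Serialize (form, lemma, upos) triples as one CoNLL-U sentence."""
    # Assumption: tokens are separated by single spaces in the raw text.
    text = " ".join(form for form, _, _ in tokens)
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for i, (form, lemma, upos) in enumerate(tokens, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(i), form, lemma or "_", upos or "_"] + ["_"] * 6
        lines.append("\t".join(columns))
    # A blank line terminates the sentence.
    return "\n".join(lines) + "\n\n"


print(to_conllu("s1", [("Cats", "cat", "NOUN"), ("sleep", "sleep", "VERB")]), end="")
```

Columns that have not been annotated yet stay as `_`, so they can be filled in later in an annotation tool.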

### 8. Export model for training

![](https://i.imgur.com/kQxhPtZ.png)

## Install and run with Docker

1. Make sure Docker is installed on your machine and the `docker` command is available.
2. Clone this repository and navigate to its root, for example:
```
git clone git@github.com:BCDH/cadet.git
cd cadet
```
3. Build the Docker image:
```
docker build -t cadet .
```

4. Run the Docker container, publishing container port 8000 on host port 8000:
```
docker run -p 8000:8000 cadet
```

## Repo template

![](https://i.imgur.com/ttcnsAr.png)

## How to use this template

1. [Click on the green button "Use this template"](https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/creating-a-repository-from-a-template)
![](https://i.imgur.com/Rh2y7ZK.png)

2. Create a new repository for your app. The name is entirely up to you.

3. When your application is working and ready to deploy, enter the following in your browser:

`https://heroku.com/deploy?template=https://github.com///tree/master`

Please note that you will be prompted to create a Heroku user account if you do not have one.

## Acknowledgements

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No 101004984: [CLS INFRA](https://clsinfra.io), as well as from the National Endowment for the Humanities via [New Languages for NLP](https://newnlp.princeton.edu).
