https://github.com/ayushk4/character-probing-pytorch
NAACL 2022, What do tokens know about their characters and how do they know it?
- Host: GitHub
- URL: https://github.com/ayushk4/character-probing-pytorch
- Owner: Ayushk4
- Created: 2022-05-03T21:38:19.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-05-03T21:55:42.000Z (about 3 years ago)
- Last Synced: 2025-02-21T10:39:06.357Z (3 months ago)
- Language: Python
- Size: 1.8 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Code
Codebase accompanying the NAACL 2022 paper `What do tokens know about their characters and how do they know it?`.
## Instructions:
The codebase is organized by experiment:
### Section 3 and Appendix B
Follow the instructions in `experiment1/README.md` to replicate all our character probing experiments on English.
Follow the instructions in `multilingual/README.md` to replicate all our character probing experiments on non-English languages.
Follow the instructions in `expt1_substring/README.md` to replicate our substring experiments.
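For orientation, here is a minimal sketch of what a character-presence probe on subword embeddings can look like: a linear classifier trained on GPT-2's input embeddings to predict whether a given character occurs in a token. This is an illustration only, not the exact probe, data splits, or hyperparameters used in the paper; it assumes `transformers` and `scikit-learn` are installed.

```python
# Minimal sketch of a character-presence probe on GPT-2 input embeddings.
# Illustration only; the paper's actual probe setup lives in experiment1/README.md.
import numpy as np
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
embeddings = model.get_input_embeddings().weight.detach().numpy()

target_char = "a"  # probe for the presence of a single character
# Restrict to purely alphabetic vocabulary entries for a cleaner signal.
ids, labels = [], []
for token, idx in tok.get_vocab().items():
    word = token.lstrip("Ġ")  # strip GPT-2's leading-space marker
    if word.isalpha():
        ids.append(idx)
        labels.append(int(target_char in word.lower()))

X, y = embeddings[ids], np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy for '{target_char}': {probe.score(X_te, y_te):.3f}")
```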
### Section 4 and Appendix C
Follow the instructions in `sec_4.1_train_custom_models/README.md` to train our proposed syntax baselines for character information. You may also directly use our already-trained syntax model linked in that README.
Follow the instructions in `sec_4.1_using_spacy/README.md` to probe our SpaCy-syntax baseline for character information.
Follow the instructions in `sec_4.1_using_spacy/README.md` to probe our subword-syntax baselines for character information.
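As a rough illustration of probing static (non-contextual) vectors for character information, the sketch below extracts spaCy word vectors and fits the same kind of linear probe. The word list here is a tiny placeholder; the actual baseline setup is the one documented in `sec_4.1_using_spacy/README.md`, and the sketch assumes the `en_core_web_md` model is installed.

```python
# Minimal sketch: probe static spaCy word vectors for character information.
# The word list is a placeholder; real experiments use a full vocabulary.
import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

nlp = spacy.load("en_core_web_md")
words = ["apple", "banana", "cherry", "grape", "melon", "orange",
         "pear", "plum", "kiwi", "mango"]  # placeholder word list

target_char = "a"
X = np.stack([nlp.vocab[w].vector for w in words])
y = np.array([int(target_char in w) for w in words])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"spaCy-vector probe accuracy for '{target_char}': {probe.score(X_te, y_te):.2f}")
```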
### Section 5 and Appendix D
Follow the instructions in `quantify_tokenization/README.md` to replicate our experiments to quantify the variability in subword tokenizers. Our code is also compatible with other sub-word tokenizers.
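One simple way to picture tokenizer variability (not necessarily the exact metric used in the paper, which is defined in `quantify_tokenization/README.md`) is to count how many distinct subword segmentations a word receives across surface variants such as leading whitespace and capitalization. A short sketch using a Hugging Face tokenizer:

```python
# Minimal sketch of one notion of subword-tokenization variability:
# count distinct segmentations of a word across simple surface variants.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works here

def segmentations(word):
    """Return the set of subword segmentations over a few surface variants."""
    variants = [word, " " + word, word.capitalize(), " " + word.capitalize()]
    return {tuple(tok.tokenize(v)) for v in variants}

for w in ["cat", "tokenization", "unbelievable"]:
    segs = segmentations(w)
    print(f"{w}: {len(segs)} distinct segmentations")
    for s in segs:
        print("   ", s)
```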
Follow the instructions in `custom_embeds/README.md` to prepare the corpus and train custom word embeddings with controllable variability; you may then probe them for character information following `probe_custom_word2vec/README.md`.
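For context, the sketch below shows only the embedding-training step on an already-prepared corpus, using gensim's word2vec as a stand-in; the corpus preparation with controllable variability and the actual training recipe are the ones described in `custom_embeds/README.md`.

```python
# Minimal sketch of training word embeddings on an already-prepared corpus.
# The corpus here is a placeholder; see custom_embeds/README.md for how the
# real corpus with controllable subword variability is built.
from gensim.models import Word2Vec

# Placeholder corpus: one list of (possibly subword) tokens per sentence.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["token", "##ization", "can", "split", "words"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, workers=2, epochs=20)
model.save("custom_word2vec.model")  # probe these vectors afterwards
print(model.wv["cat"][:5])  # inspect a few dimensions of one embedding
```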