https://github.com/wellecks/overlap
Tool for n-gram overlap analysis between test and training sequences
- Host: GitHub
- URL: https://github.com/wellecks/overlap
- Owner: wellecks
- License: mit
- Created: 2023-10-10T00:57:49.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2023-10-17T03:44:19.000Z (about 1 year ago)
- Last Synced: 2024-04-28T04:28:25.339Z (8 months ago)
- Language: Jupyter Notebook
- Size: 9.59 MB
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Overlap
Checks overlap between inputs or outputs from a test set (e.g. MATH), and a corpus (e.g. open-web-math).
Example:
```bash
python check_overlap.py --test-dataset MATH \
--test-key input \
--dataset open-web-math/open-web-math \
--ngram-n 30
```
This command checks whether 30-grams from `MATH` `input` sequences appear in `open-web-math`.
See `notebooks/analysis.ipynb` for an example usage of the output.
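To illustrate the idea of an n-gram overlap check, here is a minimal sketch. It is not the implementation in `check_overlap.py`: the function names (`ngrams`, `build_ngram_index`, `find_overlaps`) are hypothetical, and whitespace tokenization is an assumption made only to keep the example short.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Iterator[NGram]:
    """Yield every contiguous window of n tokens (nothing if too short)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_ngram_index(corpus_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect all n-grams of the corpus into a set for fast lookup.

    Assumption: documents are tokenized by splitting on whitespace.
    """
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index.update(ngrams(doc.split(), n))
    return index


def find_overlaps(
    test_seqs: Iterable[str], index: Set[NGram], n: int
) -> Dict[int, List[NGram]]:
    """Map each test sequence position to its n-grams found in the index."""
    hits: Dict[int, List[NGram]] = {}
    for i, seq in enumerate(test_seqs):
        matched = [g for g in ngrams(seq.split(), n) if g in index]
        if matched:
            hits[i] = matched
    return hits


# Toy data with n=3 so the overlap is easy to see by eye.
corpus = ["the quick brown fox jumps over the lazy dog"]
tests = ["a quick brown fox appears", "nothing in common here"]
index = build_ngram_index(corpus, 3)
hits = find_overlaps(tests, index, 3)
print(hits)
```

Only the first test sequence shares a 3-gram with the toy corpus, so it is the only entry in `hits`. An in-memory set like this would not scale to a large corpus; it is only meant to show what "an n-gram from the test set appears in the corpus" means.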
### Model generations (Llemma lm-evaluation-harness)
```bash
python check_overlap.py --test-dataset /path/to/output.json \
--test-key input \
--dataset open-web-math/open-web-math \
--ngram-n 30
```
Here, `output.json` is produced by the [Llemma `lm-evaluation-harness`](https://github.com/wellecks/lm-evaluation-harness). \
The JSON file must have a sequence stored at an `unprocessed_answers` key in the `metadata`. \
The `minerva_math_xyz` tasks yield JSON that adheres to this format.
See `notebooks/analysis.ipynb` for an example usage of the output.
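Before running the check, it can help to confirm that a JSON file actually contains `unprocessed_answers` under a `metadata` entry. The sketch below is an assumption-laden helper, not part of this tool: the exact nesting of the harness output is not described here, so the hypothetical `iter_unprocessed_answers` function searches the parsed JSON recursively instead of assuming a layout. The sample JSON is invented for the example.

```python
import json
from typing import Any, Iterator


def iter_unprocessed_answers(node: Any) -> Iterator[Any]:
    """Yield every value stored at metadata -> unprocessed_answers.

    Walks dicts and lists recursively, because the nesting of the
    real output file is not assumed to be known.
    """
    if isinstance(node, dict):
        meta = node.get("metadata")
        if isinstance(meta, dict) and "unprocessed_answers" in meta:
            yield meta["unprocessed_answers"]
        for value in node.values():
            yield from iter_unprocessed_answers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_unprocessed_answers(item)


# Made-up sample, NOT the real harness output layout.
sample = json.loads(
    '{"results": [{"metadata": {"unprocessed_answers": ["x = 2"]}},'
    ' {"metadata": {}}]}'
)
found = list(iter_unprocessed_answers(sample))
print(found)
```

For a real file, replace the sample with `json.load(open("/path/to/output.json"))`; an empty `found` list would suggest the file does not match the expected format.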
### Authors:
- Sean Welleck, Keiran Paster
### Llemma
This tool was developed as part of the Llemma project.
Llemma's analysis is saved in the `llemma` branch.
### Citation:
Please cite the following:
```
@misc{azerbayev2023llemma,
title={Llemma: An Open Language Model For Mathematics},
author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
year={2023},
eprint={2310.10631},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```