https://github.com/stanfordnlp/sindhi-tokenization

Sindhi tokenization data from ISRA

Host: GitHub
URL: https://github.com/stanfordnlp/sindhi-tokenization
Owner: stanfordnlp
License: apache-2.0
Created: 2023-09-08T06:54:50.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-09-08T07:03:51.000Z (almost 2 years ago)
Last Synced: 2024-12-30T20:16:04.864Z (6 months ago)
Size: 355 KB
Stars: 0
Watchers: 10
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# sindhi-tokenization
Sindhi tokenization data from ISRA

A collection of text files, with token and sentence boundaries marked
in the tkns_ and stns_ files respectively.

A tool in [Stanza](https://github.com/stanfordnlp/stanza),
`convert_text_files.py`, processes this data into a CoNLL-style
format suitable for training a tokenizer.
(The other annotations are left blank.)
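
As a rough illustration of what "other annotations left blank" can look like, the sketch below writes already-tokenized sentences in the 10-column CoNLL-U layout, filling every column except ID and FORM with `_`. This is not Stanza's `convert_text_files.py`: the function name `to_conllu` and the in-memory input (a list of token lists) are assumptions made for the example, and it does not read the tkns_ or stns_ files themselves.

```python
def to_conllu(sentences):
    """Render tokenized sentences as CoNLL-U text with blank annotations.

    `sentences` is a list of sentences, each a list of token strings.
    (Hypothetical helper for illustration only.)
    """
    blocks = []
    for sent_id, tokens in enumerate(sentences, start=1):
        lines = [f"# sent_id = {sent_id}"]
        for idx, form in enumerate(tokens, start=1):
            # Columns: ID, FORM, then LEMMA, UPOS, XPOS, FEATS,
            # HEAD, DEPREL, DEPS, MISC, all left unannotated as "_".
            lines.append("\t".join([str(idx), form] + ["_"] * 8))
        blocks.append("\n".join(lines))
    # Each sentence, including the last, is followed by a blank line.
    return "\n\n".join(blocks) + "\n\n"


if __name__ == "__main__":
    print(to_conllu([["tok1", "tok2"], ["tok3"]]), end="")
```

Only the token and sentence boundaries carry information in this output, which is all a tokenizer needs for training.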
