Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists by shjwudp
A curated list of projects in awesome lists by shjwudp.
https://github.com/shjwudp/c4-dataset-script
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data cleaning scripts focused on CommonCrawl data processing. It includes Chinese data processing and the cleaning methods from MassiveText.
commoncrawl dataset massivetext nlp python spark
Last synced: 02 Dec 2024
https://github.com/shjwudp/megabyte
A PyTorch implementation of MEGABYTE, a multi-scale transformer architecture that is tokenization-free and uses sub-quadratic attention. Paper: https://arxiv.org/abs/2305.07185
deep-learning language-model sub-quadratic-attention tokenization-free
Last synced: 02 Dec 2024
https://github.com/shjwudp/blueprint-trainer
Scaffolding for sequence model training research.
Last synced: 02 Dec 2024