Projects in Awesome Lists by bigscience-workshop
A curated list of projects in awesome lists by bigscience-workshop.
https://github.com/bigscience-workshop/petals
🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
bloom chatbot deep-learning distributed-systems falcon gpt guanaco language-models large-language-models llama machine-learning mixtral neural-networks nlp pipeline-parallelism pretrained-models pytorch tensor-parallelism transformer volunteer-computing
Last synced: 13 May 2025
https://github.com/bigscience-workshop/promptsource
Toolkit for creating, sharing and using natural language prompts.
machine-learning natural-language-processing nlp
Last synced: 14 May 2025
https://github.com/bigscience-workshop/megatron-deepspeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Last synced: 15 May 2025
https://github.com/bigscience-workshop/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Last synced: 27 Mar 2025
https://github.com/bigscience-workshop/bigscience
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
machine-learning models nlp training
Last synced: 16 May 2025
https://github.com/bigscience-workshop/xmtf
Crosslingual Generalization through Multitask Finetuning
bloom bloomz instruction-tuning language-models large-language-models mt0 multilingual-nlp multitask-learning t5 zero-shot-learning
Last synced: 09 Apr 2025
https://github.com/bigscience-workshop/biomedical
Tools for curating biomedical training data for large-scale language modeling
Last synced: 15 May 2025
https://github.com/bigscience-workshop/t-zero
Reproduce results and replicate training for T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)
Last synced: 28 Oct 2025
https://github.com/bigscience-workshop/data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
dataset large-language-models multilingual
Last synced: 09 Apr 2025
https://github.com/bigscience-workshop/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
Last synced: 14 Jan 2026
https://github.com/bigscience-workshop/data_tooling
Tools for managing datasets for governance and training.
Last synced: 27 Jan 2026
https://github.com/bigscience-workshop/lam
Libraries, Archives and Museums (LAM)
Last synced: 26 Feb 2025
https://github.com/bigscience-workshop/multilingual-modeling
BLOOM+1: Adapting the BLOOM model to support a new, unseen language
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/evaluation
Code and Data for Evaluation WG
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/data_sourcing
This directory gathers the tools developed by the Data Sourcing Working Group
Last synced: 24 Oct 2025
https://github.com/bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/carbon-footprint
A repository for `codecarbon` logs.
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/bloom-dechonk
A repo for running model shrinking experiments
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/catalogue_data
Scripts to prepare catalogue data
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/historical_texts
BigScience working group on language models for historical texts
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/pii_processing
PII Processing code to detect and remediate PII in BigScience datasets. Reference implementation for the PII Hackathon
Last synced: 26 Apr 2025
https://github.com/bigscience-workshop/bibliography
A list of BigScience publications
Last synced: 26 Feb 2025
https://github.com/bigscience-workshop/evaluation-robustness-consistency
Tools for evaluating model robustness and consistency
Last synced: 16 Jul 2025
https://github.com/bigscience-workshop/scaling-laws-tokenization
scaling-laws-tokenization
Last synced: 26 Feb 2025
https://github.com/bigscience-workshop/datasets_stats
Generate statistics over datasets used in the context of BigScience
Last synced: 23 Feb 2026
https://github.com/bigscience-workshop/shadesofbias
Evaluation for Shades of Bias in Text
Last synced: 26 Feb 2025