{"id":27969025,"url":"https://github.com/prismadic/hygiene","last_synced_at":"2025-10-13T01:03:16.339Z","repository":{"id":212182019,"uuid":"730897742","full_name":"Prismadic/hygiene","owner":"Prismadic","description":"A payload compression toolkit that makes it easy to create ideal data structures for LLMs; from training data to chain payloads.","archived":false,"fork":false,"pushed_at":"2024-03-14T15:27:08.000Z","size":6686,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-13T01:03:06.330Z","etag":null,"topics":["compression-methods","data-preprocessing","data-structures","llm-chain","llm-finetuning","llm-inference"],"latest_commit_sha":null,"homepage":"https://prismadic.github.io/hygiene/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Prismadic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-12T23:13:01.000Z","updated_at":"2024-02-27T19:51:51.000Z","dependencies_parsed_at":"2023-12-13T00:20:37.098Z","dependency_job_id":"b9842a71-ba5f-4266-8a1d-75d24d656ecb","html_url":"https://github.com/Prismadic/hygiene","commit_stats":null,"previous_names":["prismadic/hygiene"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/Prismadic/hygiene","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Fhygiene","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Fhygiene/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/Prismadic%2Fhygiene/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Fhygiene/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Prismadic","download_url":"https://codeload.github.com/Prismadic/hygiene/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prismadic%2Fhygiene/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013885,"owners_count":26085325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression-methods","data-preprocessing","data-structures","llm-chain","llm-finetuning","llm-inference"],"created_at":"2025-05-07T21:08:11.069Z","updated_at":"2025-10-13T01:03:16.324Z","avatar_url":"https://github.com/Prismadic.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n   \u003cimg height=\"250\" src=\"./hygiene.png\"\u003e\n   \u003cbr\u003e\n   \u003ch3 align=\"center\"\u003ehygiene\u003c/h3\u003e\n   \u003cp align=\"center\"\u003eA payload compression toolkit that makes it easy to create ideal data structures for LLMs.\u003c/p\u003e\n   \u003cp align=\"center\"\u003e\u003ci\u003e~ from training data to chain payloads ~\u003c/i\u003e\u003c/p\u003e\n\u003c/p\u003e\n\n## 🤔 Why?\n\n0. 
Compress (or freeze/reformat) payloads during inference and vector embedding.\n\n1. Get data to look _the way language models expect it to look during prompting_ **no matter the origin or shape of that data** \u003csmall\u003ewhile also being as small as possible\u003c/small\u003e (which starts with the fine-tuning engineer's goal).\n\n2. Provide utilities and connectors to reduce code in language model workflows.\n\n3. Prompt-generated datasets\u003csup\u003e[2] [3]\u003c/sup\u003e in particular are unique, but they come with the same mundane preprocessing routines as any other dataset.\n\n\n## 💾 Installation\n\n``` bash\npip install llm-hygiene\n```\nor\n``` bash\npython3 setup.py install\n```\n\n\n\n## 🤷 Usage\n\n``` python\nPython 3.11.2 (main, Mar 24 2023, 00:16:47) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin\nType \"help\", \"copyright\", \"credits\" or \"license\" for more information.\n\u003e\u003e\u003e import hygiene\n\u003e\u003e\u003e import json  # needed by calculate_ratio below\n\u003e\u003e\u003e from hygiene import Singleton\n\u003e\u003e\u003e # Example payloads in several shapes\n\u003e\u003e\u003e singletons = [\n        {\"name\": \"John\", \"age\": 30, \"city\": \"New York\"},\n        '{\"name\": \"John\", \"age\": 30, \"city\": \"New York\"}',\n        list({\"name\": \"John\", \"age\": 30, \"city\": \"New York\"}),\n        [{\"name\": \"John\", \"age\": 30, \"city\": \"New York\"}]\n    ]\n\u003e\u003e\u003e milvus_payload_examples = [\n        {\"count\": 10, \"sizes\": [35, 36, 38]},\n        {\"price\": 11.99, \"ratings\": [9.1, 9.2, 9.4]},\n        {\"is_delivered\": True, \"responses\": [False, False, True, False]},\n        {\"name\": \"Alice\", \"friends\": [\"bob\", \"eva\", \"jack\"]},\n        {\"location\": {\"lon\": 52.5200, \"lat\": 13.4050},\n            \"cities\": [\n                {\"lon\": 51.5072, \"lat\": 0.1276},\n                {\"lon\": 40.7128, \"lat\": 74.0060}\n            ]\n        }\n    ]\n\u003e\u003e\u003e def calculate_ratio(string, json_obj):\n        string_size = len(string.encode('utf-8'))\n        json_size = 
len(json.dumps(json_obj).encode('utf-8'))\n        ratio = string_size / json_size\n        print(f'JSON-\u003eYAML bytes ratio: {ratio}')\n\u003e\u003e\u003e boxing = Singleton.boxing()\n\u003e\u003e\u003e for each in singletons:\n         package = boxing.Payload(data=each, fmt=\"yml\")\n         payload = package.deliver()\n         print(payload)\n         calculate_ratio(payload, each)\nage: 30\ncity: New York\nname: John\n\nJSON-\u003eYAML bytes ratio: 0.723404255319149\nage: 30\ncity: New York\nname: John\n\nJSON-\u003eYAML bytes ratio: 0.576271186440678\n- name\n- age\n- city\n\nJSON-\u003eYAML bytes ratio: 0.8695652173913043\n- age: 30\n  city: New York\n  name: John\n\nJSON-\u003eYAML bytes ratio: 0.8163265306122449\n\u003e\u003e\u003e for each in milvus_payload_examples:\n         package = boxing.Payload(data=each, fmt=\"yml\")\n         payload = package.deliver()\n         print(payload)\n         calculate_ratio(payload, each)\ncount: 10\nsizes:\n- 35\n- 36\n- 38\n\nJSON-\u003eYAML bytes ratio: 0.8888888888888888\nprice: 11.99\nratings:\n- 9.1\n- 9.2\n- 9.4\n\nJSON-\u003eYAML bytes ratio: 0.9090909090909091\nis_delivered: true\nresponses:\n- false\n- false\n- true\n- false\n\nJSON-\u003eYAML bytes ratio: 0.953125\nfriends:\n- bob\n- eva\n- jack\nname: Alice\n\nJSON-\u003eYAML bytes ratio: 0.7692307692307693\ncities:\n- lat: 0.1276\n  lon: 51.5072\n- lat: 74.006\n  lon: 40.7128\nlocation:\n  lat: 13.405\n  lon: 52.52\n\nJSON-\u003eYAML bytes ratio: 0.8512396694214877\n```\n\n## 🥅 Goals\n\n- Provide an extremely robust, complete dataset for fine-tuning a **small language model** on payload structures\u003csup\u003e[2]\u003c/sup\u003e\n- Create a fine-tuning dataset for Seq2Seq inference based on collation of the previous dataset\u003csup\u003e[2]\u003c/sup\u003e\n- Use datasets to make models for embedding vectors and training LLMs on pristine \"Instruct\"-type chains-of-thought\u003csup\u003e[3]\u003c/sup\u003e\n- Provide all of the preprocessing 
tools to do this within this very package\n\n### ⚡️ Advantages\n\n- suits structured and unstructured data, but **also careless** data 👉 natural language workflows\n- atomized, low-level conversions for items belonging to massive datasets (memory-safe if used correctly)\n- tiny footprint in your project with _few_ dependencies\n- super-easy\n- fast\n\n## ⌨️ Working on\n\n- [ ] integrating with Milvus\n- [ ] integrating with embeddings\u003csup\u003e[1]\u003c/sup\u003e\n- [x] finishing this readme\n- [x] pip package\n\n\u003chr\u003e\n\n### ✍️ Citations\n\n[1] **\"MTEB: Massive Text Embedding Benchmark\"**\n\n_Niklas Muennighoff_\n\nhttps://github.com/huggingface/blog/blob/main/mteb.md\n\n[2] **\"Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data\"**\n\n_Xu, Canwen and Guo, Daya and Duan, Nan and McAuley, Julian_\n\nhttps://arxiv.org/abs/2304.01196\n\n[3] **\"Training language models to follow instructions with human feedback\"**\n\n_Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe_\n\nhttps://arxiv.org/abs/2203.02155\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprismadic%2Fhygiene","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprismadic%2Fhygiene","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprismadic%2Fhygiene/lists"}