{"id":22780363,"url":"https://github.com/glavin001/data2aitextbook","last_synced_at":"2025-10-05T21:45:50.844Z","repository":{"id":189315688,"uuid":"680376531","full_name":"Glavin001/Data2AITextbook","owner":"Glavin001","description":"🚀 Automatically convert unstructured data into a high-quality 'textbook' format, optimized for fine-tuning Large Language Models (LLMs)","archived":false,"fork":false,"pushed_at":"2023-10-15T00:49:39.000Z","size":4035,"stargazers_count":26,"open_issues_count":11,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-15T14:53:26.678Z","etag":null,"topics":["ai","llm","question-generation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Glavin001.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-19T03:46:07.000Z","updated_at":"2025-01-30T09:46:45.000Z","dependencies_parsed_at":"2023-10-15T21:12:44.374Z","dependency_job_id":null,"html_url":"https://github.com/Glavin001/Data2AITextbook","commit_stats":null,"previous_names":["glavin001/data-maker-llm","glavin001/data2aitextbook"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Glavin001/Data2AITextbook","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glavin001%2FData2AITextbook","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glavin001%2FData2AITextbook/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glavin001%2FData2AITextbook/release
s","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glavin001%2FData2AITextbook/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Glavin001","download_url":"https://codeload.github.com/Glavin001/Data2AITextbook/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glavin001%2FData2AITextbook/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278526239,"owners_count":26001325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm","question-generation"],"created_at":"2024-12-11T20:12:57.142Z","updated_at":"2025-10-05T21:45:50.826Z","avatar_url":"https://github.com/Glavin001.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data2AITextbook\n\n## 🎯 Goal\n\n\u003e Automatically convert unstructured data into a high-quality 'textbook' format, optimized for fine-tuning Large Language Models (LLMs) 🚀\n\n## About\n\nInspired by [Textbooks Are All You Need](https://arxiv.org/pdf/2306.11644.pdf), which produced [`phi-1`](https://huggingface.co/microsoft/phi-1) LLM trained with \"textbook quality\" data.\n\n### Principles\n\n- **Flexible**: Any form of unstructured data (e.g. 
speeches, blogs, code, existing textbooks, etc.)\n- **Grounded**: Trusts your data over model's pre-existing knowledge and doesn't make up new data unless explicitly asked\n- **Efficient**: Highest density of learning-per-training-token, leverages best practices \u0026 understanding of language model training\n- **Enhanced**: Increase capabilities of trained models versus simply training on the raw input text\n\n## Phases\n\nThis project is broken into 2 phases:\n- **Ingestion \u0026 Generation**: Extract learning objectives; generate high-quality 1️⃣ lessons and 2️⃣ exercises per knowledge type and cognitive process; knowledge augmentation (e.g. paraphrase, inverse, etc.).\n- **Training**: Curriculum learning to optimally train a new language model. Training should get progressively harder and leverage knowledge learned earlier in the curriculum.\n\n### 1. Ingestion \u0026 Generation\n\n- **Input**: Unstructured text data\n- **Output**: Training \u0026 test dataset, with rich meta-data (e.g. dependencies/relationships, knowledge type, cognitive process, etc.)\n\n🤝 Combining the best practices from teaching humans and training language models.\n\nInspiration for educating:\n- Humans:\n    - Bloom's Taxonomy for learning objectives\n        - Learn more about Bloom's Taxonomy at https://www.celt.iastate.edu/instructional-strategies/effective-teaching-practices/revised-blooms-taxonomy/ and https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/\n    - Curriculum Learning\n    - Encoding Specificity Principle\n- Language models:\n    - Prompting/Inference\n        - [Chain of Thought](https://arxiv.org/abs/2201.11903)\n        - [Graph of Thoughts](https://arxiv.org/abs/2308.09687)\n        - [Self Consistency](https://arxiv.org/abs/2203.11171)\n    - Training datasets\n        - [Less Is More for Alignment (LIMA)](https://browse.arxiv.org/pdf/2305.11206.pdf)\n        - TinyStories\n        - Textbooks are all you need\n        - Physics of Language Models Part 3.1 and 
3.2\n    - and many, *many* more.\n\n### Learning Objectives\n\nBloom's Taxonomy breaks down learning objectives based on knowledge types and cognitive processes.\nEach combination of the two can be treated differently to maximize the quality of the generated output.\n\n![Bloom's Taxonomy Table](./.github/images/blooms-taxonomy-table.png)\n\n#### Exercises\n\nFor each knowledge type and cognitive process to be taught there are multiple lessons and exercises to consider generating.\nEach of these exercises should have an optimized prompting pipeline that leverages powerful techniques such as Chain-of-Thought, Graph of Thoughts, and Self-Consistency.\n\nThe following is work-in-progress:\n\n|                 | **Remember**       | **Understand**        | **Apply**           | **Analyze**          | **Evaluate**      | **Create**                  |\n|-----------------|--------------------|-----------------------|---------------------|----------------------|-------------------|------------------------------|\n| **Factual**      | Multiple Choice\u003cbr\u003eTrue/False\u003cbr\u003eFlashcards\u003cbr\u003eLabeling\u003cbr\u003eListing   | Summary Writing\u003cbr\u003eParaphrasing\u003cbr\u003eExplanation  | Matching\u003cbr\u003eIdentification  | Categorization   | Ranking  | Listing (newly synthesized)  |\n| **Conceptual**  | Flashcards (concepts)\u003cbr\u003eMatching  | Explanation\u003cbr\u003eInterpretation  | Problem-Solving\u003cbr\u003eCase Studies  | Compare and Contrast\u003cbr\u003eCategorization  | Critical Review\u003cbr\u003eAssessment  | Designing\u003cbr\u003ePlanning  |\n| **Procedural**  | Labeling (steps)\u003cbr\u003eMultiple Choice (next step)  | Summary Writing (process)\u003cbr\u003eExplanation (how-to)  | Demonstration\u003cbr\u003eSimulation  | Flowcharting\u003cbr\u003eError Analysis  | Assessment (procedures)\u003cbr\u003eRecommendation  | Programming\u003cbr\u003eDesigning (new process)  |\n| **Metacognitive** | Listing (strategies)  | 
Paraphrasing (strategies)  | Role-Playing (strategies)  | Investigation\u003cbr\u003eDebate  | Self-Assessment\u003cbr\u003eJudgment  | Planning\u003cbr\u003eStoryboarding  |\n\n\n### 2. Training Curriculum\n\n- **Input**: Training \u0026 test dataset\n- **Output**: Trained language model, using curriculum learning by grouping \u0026 ordering training dataset based on meta-data\n\nThe entire dataset will be consumed; however, the next chapter of data will only be unlocked once the model in training ✅ passes a sufficient % of exercises from the test dataset.\n\n```mermaid\ngraph LR\n\n    subgraph Chapter1[\"1. Foundational Knowledge\"]\n        subgraph Knowledge_Types1[\"Knowledge Types\"]\n            Factual1[\"Factual\"]\n        end\n        subgraph CognitiveProcesses1[\"Cognitive Processes\"]\n            Remember1[\"Remember\"]\n            Understand1[\"Understand\"]\n        end\n    end\n\n    subgraph Chapter2[\"2. Basic Concepts and Procedures\"]\n        subgraph Knowledge_Types2[\"Knowledge Types\"]\n            Conceptual2[\"Conceptual\"]\n            Procedural2[\"Procedural\"]\n        end\n        subgraph CognitiveProcesses2[\"Cognitive Processes\"]\n            Understand2[\"Understand\"]\n            Apply2[\"Apply\"]\n        end\n    end\n\n    subgraph Chapter3[\"3. Interconnecting Concepts\"]\n        subgraph Knowledge_Types3[\"Knowledge Types\"]\n            Conceptual3[\"Conceptual\"]\n            Procedural3[\"Procedural\"]\n        end\n        subgraph CognitiveProcesses3[\"Cognitive Processes\"]\n            Analyze3[\"Analyze\"]\n        end\n    end\n\n    subgraph Chapter4[\"4. 
Practical Application\"]\n        subgraph Knowledge_Types4[\"Knowledge Types\"]\n            Procedural4[\"Procedural\"]\n            Metacognitive4[\"Metacognitive\"]\n        end\n        subgraph CognitiveProcesses4[\"Cognitive Processes\"]\n            Apply4[\"Apply\"]\n            Evaluate4[\"Evaluate\"]\n        end\n    end\n\n    subgraph Chapter5[\"5. Critical Examination\"]\n        subgraph Knowledge_Types5[\"Knowledge Types\"]\n            Conceptual5[\"Conceptual\"]\n            Metacognitive5[\"Metacognitive\"]\n        end\n        subgraph CognitiveProcesses5[\"Cognitive Processes\"]\n            Analyze5[\"Analyze\"]\n            Evaluate5[\"Evaluate\"]\n        end\n    end\n\n    subgraph Chapter6[\"6. Building New Knowledge\"]\n        subgraph Knowledge_Types6[\"Knowledge Types\"]\n            Metacognitive6[\"Metacognitive\"]\n            Conceptual6[\"Conceptual\"]\n        end\n        subgraph CognitiveProcesses6[\"Cognitive Processes\"]\n            Create6[\"Create\"]\n        end\n    end\n\n    subgraph Chapter7[\"7. 
Integration and Reflection\"]\n        subgraph Knowledge_Types7[\"Knowledge Types\"]\n            Factual7[\"Factual\"]\n            Conceptual7[\"Conceptual\"]\n            Procedural7[\"Procedural\"]\n            Metacognitive7[\"Metacognitive\"]\n        end\n        subgraph CognitiveProcesses7[\"Cognitive Processes\"]\n            Remember7[\"Remember\"]\n            Understand7[\"Understand\"]\n            Apply7[\"Apply\"]\n            Analyze7[\"Analyze\"]\n            Evaluate7[\"Evaluate\"]\n            Create7[\"Create\"]\n        end\n    end\n\n    Model[\"Trained Language Model\"]\n\n    %% Edges\n    Chapter1--\u003e|\"✅ Pass Tests\"|Chapter2\n    Chapter2--\u003e|\"✅ Pass Tests\"|Chapter3\n    Chapter3--\u003e|\"✅ Pass Tests\"|Chapter4\n    Chapter4--\u003e|\"✅ Pass Tests\"|Chapter5\n    Chapter5--\u003e|\"✅ Pass Tests\"|Chapter6\n    Chapter6--\u003e|\"✅ Pass Tests\"|Chapter7\n    Chapter7--\u003e|\"✅ Pass Tests\"|Model\n```\n## Datasets \u0026 Models\n\nCheck out my Huggingface profile for a list of datasets \u0026 models I've created: https://huggingface.co/Glavin001\n\nMany will be created for the [Expertise by AI](https://github.com/Glavin001/Expertise-by-AI) project, where you can learn how to train custom models with your own data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglavin001%2Fdata2aitextbook","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fglavin001%2Fdata2aitextbook","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fglavin001%2Fdata2aitextbook/lists"}