{"id":14130598,"url":"https://github.com/cognitivetech/ollama-ebook-summary","last_synced_at":"2025-04-11T23:15:52.267Z","repository":{"id":211676886,"uuid":"729716514","full_name":"cognitivetech/ollama-ebook-summary","owner":"cognitivetech","description":"LLM for Long Text Summary (Comprehensive Bulleted Notes)","archived":false,"fork":false,"pushed_at":"2025-01-19T16:29:10.000Z","size":2655,"stargazers_count":534,"open_issues_count":3,"forks_count":38,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-11T23:15:45.409Z","etag":null,"topics":["generative-ai","gpt","llm","localai","localgpt","ollama","ollama-app","privategpt","privategpt4linux","summarization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cognitivetech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-10T05:42:52.000Z","updated_at":"2025-04-11T08:05:29.000Z","dependencies_parsed_at":"2023-12-21T04:23:57.489Z","dependency_job_id":"92f5a6a9-f7a0-44d1-ab95-dea569245436","html_url":"https://github.com/cognitivetech/ollama-ebook-summary","commit_stats":null,"previous_names":["cognitivetech/llm-book-summarization","cognitivetech/llm-long-text-summary","cognitivetech/llm-long-text-summarization","cognitivetech/ollama-ebook-summary"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cognitivetech%2Follama-ebook-summary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cognitivetech%2Fol
lama-ebook-summary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cognitivetech%2Follama-ebook-summary/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cognitivetech%2Follama-ebook-summary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cognitivetech","download_url":"https://codeload.github.com/cognitivetech/ollama-ebook-summary/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248492884,"owners_count":21113163,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["generative-ai","gpt","llm","localai","localgpt","ollama","ollama-app","privategpt","privategpt4linux","summarization"],"created_at":"2024-08-15T21:01:00.377Z","updated_at":"2025-04-11T23:15:52.227Z","avatar_url":"https://github.com/cognitivetech.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Bulleted Notes Book Summaries\n\n_Built With: Python 3.11.9_\n\n## Introduction\nThis project creates bulleted-notes summaries of books and other long texts, particularly EPUB and PDF files that have ToC metadata available.\n\nWhen the ebooks contain appropriate metadata, we can automate the extraction of chapters from most books and split them into ~2000-token chunks, with fallbacks in case we are unable to access a document outline.\n\n### Why 2000 tokens?\n[*Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models*](https://huggingface.co/papers/2402.14848) 
(2024-02-19; Mosh Levy, Alon Jacoby, Yoav Goldberg) suggests that reasoning capacity drops off pretty sharply from 250 to 1000 tokens, starting to flatten out between 2000-3000 tokens.\n\n![](https://i.imgur.com/nyDkAzP.png)\n\nThis corresponds to my own experience while summarizing many long documents using local LLMs.\n\nYou can check the [deprecated walkthroughs and rankings](notes/depreciated/) for more background on how I got here.\n\n### Comparison with RAG\n\nSimilar to Retrieval Augmented Generation (RAG), we split the document into many parts, so they fit into the context. The difference is that RAG systems try to determine which chunk is the best one to ask their question to. Instead, we ask the same questions to *every part of the document*.\n\nIt's very important for unlocking the full capabilities of LLMs without relying on a multitude of third-party apps.\n\n## Contents\n- [Setup](#setup)\n  - [Python Environment](#python-environment)\n  - [Install Dependencies](#install-dependencies)\n  - [Download Models](#download-models)\n  - [Update Config File](#update-config-file-_configyaml)\n- [Usage](#usage)\n  - [Convert E-book to chunked CSV or TXT](#convert-e-book-to-chunked-csv-or-txt)\n  - [Generate Summary](#generate-summary)\n- [Semi-Manual with Prototypes](#semi-manual-with-prototypes)\n- [Models](#models)\n  - [Ollama](#ollama)\n  - [HuggingFace](#huggingface)\n- [Check your Document Outline](#check-your-ebook-for-document-outline)\n  - [Firefox](#firefox)\n  - [Brave](#brave)\n- [Disclaimer](#disclaimer)\n- [Inspiration](#inspiration)\n- [Resources](#resources)\n\n## Setup\n### Python Environment\n\nBefore starting, ensure you have Python 3.11.9 installed. If not, you can use conda or pyenv to manage Python versions:\n\n**Using conda:**\n1. Install Anaconda from: https://www.anaconda.com/download/success\n2. Create a new environment: `conda create -n book_summary python=3.11.9`\n3. 
Activate the environment: `conda activate book_summary`\n\n**Using pyenv:**\n1. Install pyenv: https://github.com/pyenv/pyenv#installation\n2. Install Python 3.11.9: `pyenv install 3.11.9`\n3. Set local version: `pyenv local 3.11.9`\n\n### Install Dependencies\n```\npip install -r requirements.txt\n```\n- [Install Ollama](https://github.com/ollama/ollama?tab=readme-ov-file#ollama)\n\n### Download Models\n\n#### 1. **Download a copy of Mistral Instruct v0.2 Bulleted Notes Fine-Tune**\n\n`ollama pull cognitivetech/obook_summary:q6_k`\n\n#### 2. **Download a title model**\n\n##### a) *Download a preconfigured model*\n\n`ollama pull cognitivetech/obook_title:q4_k_m`\n\nFor your convenience, Mistral 7b 0.3 is packaged with the necessary message history for title creation. \n\n***or***\n\n##### b) *Append this* [message history](Modelfile) *to the Modelfile of your choice*\n\n#### 3. **Download a general-purpose model**\n`ollama pull gemma2`\n\n### Update Config File `_config.yaml`\n\nEnsure the defaults are set accordingly! \n\n\u003e This area is subject to change and may differ from the documentation. **Make sure you have the models on your system as noted in `summary`, `general`, and `title` in the current [_config.yaml](./_config.yaml).** I have to clean up this aspect of the code, but I'm still working on that.\n\n```yaml\ndefaults:\n  prompt: bnotes\n  summary: cognitivetech/obook_summary:q6_k # default model for summaries\n  general: gemma2                           # default model for basic summary\n  title: cognitivetech/obook_title:q4_k_m   # default model for title generation\nprompts:\n  bnotes: # Default Prompt\n    prompt: Write comprehensive bulleted notes summarizing the provided text, with\n      headings and terms in bold.\n  research: # Also for use with summary model\n    prompt: Does this text make any arguments? 
If so list them here.\n  clean:  # The following prompts should be used with a general purpose model.\n    prompt: Repeat back this text exactly, remove only garbage characters that do\n      not contribute to the flow of text. Output only the main text content, condensed\n      onto a single line. If you encounter any chapter boundaries or subheadings,\n      start a new line beginning with its title.\n  concise:\n    prompt: Repeat the provided passage, with Concision.\n  md:\n    prompt: 'Print these notes in proper markdown format, with headings marked as\n      bold with double asterisks and terms in bold also, and bullet points as `-`.\n      Print the notes exactly, word-for-word, do not elaborate, do not add headings\n      with #'\n  sum: # basic\n    prompt: Comprehensive bulleted notes with headings and terms in bold.\n  teacher:\n    prompt: 'Write a list of questions that can be answered by 3rd graders who are\n      reading the provided text. Topics we like to focus on include: Main idea, supporting\n      details, Point of view, Theme, Sequence, Elements of fiction (setting, characters,\n      BME)'\n  quotes:\n    prompt: 'write a few dozen quotes inspired by the provided text'\ntitle_generation:\n  prompt: Write a title with fewer than 11 words to concisely describe this selection.\n```\n\n## Usage \n### Convert E-book to chunked CSV or TXT\n\n#### 1. Use automated script to split your `pdf` or `epub`.\n```bash\npython3 book2text.py ebook-name.epub # or ebook-name.pdf (Epub is preferred)\n```\n\n**This step produces two outputs**:\n- `out/ebook-name.csv` (split by chapter or section)\n- `out/ebook-name_processed.csv` (chunked)\n\n***or***\n\n#### 2. 
Remove or escape all newlines within each chunk, so that the chunks can be placed one per line [in a text file](notes/depreciated/summarize.txt), with each line surrounded by double quotes.\n\u003ca href=\"notes/depreciated/summarize.txt\"\u003e\u003cimg width=\"1163\" alt=\"image\" src=\"https://github.com/user-attachments/assets/6621d209-35ab-40a5-ab7c-3f8324909e43\"\u003e\u003c/a\u003e\n\n\\*_Note: be careful to properly escape or replace any double quotes within each chunk._\n\n### Generate Summary\n\n`$ python3 sum.py --help`\n\n```bash\nUsage: python sum.py [OPTIONS] input_file\n\nOptions:\n-c, --csv        Process a CSV file. Expected columns: Title, Text\n-t, --txt        Process a text file. Each line should be a separate text chunk.\n-m, --model      Model name to use for generation (default from config)\n-p, --prompt     Alias of the prompt to use from config (default from config)\n-v, --verbose    Print markdown output additionally to terminal\n--help           Show this help message and exit.\n\nFor CSV input:\n- Ensure your CSV has 'Title' and 'Text' columns.\n\nFor Text input:\n- Each line should be a chunk of text surrounded by double quote.\n\nThe output CSV will include:\n- Title: Final title chosen or generated\n- Gen: Boolean indicating if the title was generated\n- Text: Original input text\n- model_name: Generated output\n- Time: Processing time in seconds\n- Len: Length of the output\n```    \n\nIf you have your defaults set, all you need to specify is the type of input: manual `text` or automated `csv`. 
\n```\npython3 sum.py -c ebook-name_processed.csv\n```\n\n## Semi-Manual with Prototypes\n\nIn this example, I've used a prototype [split_pdf.py](tools-prototype/split_pdf.py) to split the PDF not only by chapter but also by subsection (producing `ebook-name_extracted.csv`), then manually processed that output (using [vscode](https://code.visualstudio.com/)) to place each chunk [on a single line](notes/depreciated/summarize.txt) surrounded by double quotes.\n\nEventually that will be automated, but it presents challenges, which you will notice, that have prevented me from finishing that tool.\n\n**Split**:\n```\ntools-prototype/split_pdf.py ebook-name.pdf # produces ebook-name_extracted.csv\n```\n\n**Process**:\n```\npython3 sum.py -t ebook-name_extracted.csv\n```\n\n**This step generates two outputs**:\n- `ebook-name_extracted_processed_sum.md` (rendered markdown)\n- `ebook-name_extracted_processed_sum.csv` (csv with: input text, flattened md output, generation time, output length)\n\n## Models\nDownload from one of two sources:\n\n### Ollama\nYou can get any of them right from Ollama, with the template included in all.\nexample: `ollama pull cognitivetech/obook_summary:q5_k_m`\n\n- [obook_summary](https://ollama.com/cognitivetech/obook_summary) - On Ollama.com\n  - `latest` • 7.7GB • Q_8\n  - `q3_k_m` • 3.5GB\n  - `q4_k_m` • 4.4GB\n  - `q5_k_m` • 5.1GB\n  - `q6_k` • 5.9GB (preferred)\n- [obook_title](https://ollama.com/cognitivetech/obook_title) - On Ollama.com\n  - `latest` • 7.7GB • Q_8\n  - `q3_k_m` • 3.5GB\n  - `q4_k_m` • 4.4GB\n  - `q5_k_m` • 5.1GB\n  - `q6_k`   • 5.9GB (preferred)\n\n### HuggingFace\nThere are also complete weights, LoRA, and GGUF on HuggingFace\n- [Mistral Instruct Bulleted Notes](https://huggingface.co/collections/cognitivetech/mistral-instruct-bulleted-notes-v02-66b6e2c16196e24d674b1940) - Collection on HuggingFace\n  - [cognitivetech/Mistral-7B-Inst-0.2-Bulleted-Notes](https://huggingface.co/cognitivetech/Mistral-7B-Inst-0.2-Bulleted-Notes)\n  - 
[cognitivetech/Mistral-7b-Inst-0.2-Bulleted-Notes_GGUF](https://huggingface.co/cognitivetech/cognitivetech/Mistral-7b-Inst-0.2-Bulleted-Notes_GGUF)\n  - [cognitivetech/Mistral-7B-Inst-0.2_Bulleted-Notes_LoRA](https://huggingface.co/cognitivetech/cognitivetech/Mistral-7B-Inst-0.2_Bulleted-Notes_LoRA)\n\n## Check your eBook for Document Outline\n\nHere you can see how to check whether your eBook has the proper formatting. **With ePub it should fail gracefully**.\n\n\\* On rare occasions, even with a clickable ToC, the script will not find it.\n\n### Firefox\n![image](https://github.com/user-attachments/assets/fc618e8c-d3e7-4bbd-aa16-1830fdc75b12)\n\n### Brave \n![image](https://github.com/user-attachments/assets/c4491208-f66b-45cf-9095-f2f919d0fa49)\n\n## Disclaimer\n\nYou are responsible for verifying that the summary tool creates an accurate summary. There are a variety of issues that can interfere with a quality summary, and if you aren't paying attention, they may escape your notice.\n\n**1. References:**\n\nPersonally, I don't trust references from my fine-tuned model without verifying them manually. Maybe this is solved in newer models, but during my testing phase I noticed some bad references with the 7b models I was using. I never tested the quality of the app on references; my personal preference is to remove any long reference sections before summarizing and deal with those separately. I don't think this is a permanent blocker, just an area that I haven't fully dealt with or understood yet.\n\n**2. Other:**\n\nThere are a few other things to watch out for. \n\nOne of the reasons I keep the length of the input and output in the CSV is that it makes it easy to check when a summary is longer than the input, which is a red flag.\n\nWhen the structure of a summary greatly deviates from the others, this can indicate issues with the summary. 
Some of these can be related to special characters, or to the input being too long for the AI to grasp it all.\n\n\n## Inspiration\n\nThe inspiration for this app was my intention to manually summarize a dozen books so I could tie together the psychological theory and practice they discuss and make a cohesive argument based on that information.\n\nI've already read the books a few times, but now I need easy access to the information within so that I can relate it to others in a cohesive fashion.\n\nOriginally, after working on this project manually for a week, I was only a few chapters into my first book, and I could see this was going to take a long time.\n\nOver the next 6 months I began learning how to use LLMs, discovering which were the best for my task, and fine-tuning to deliver production-quality consistency in the results.\n\nNow with this tool, I'm able to review a lot more material more quickly. This is a content curation tool that empowers me to not only learn things but more readily share that knowledge, without having to spend the ages it takes to create quality content.\n\nMoreover, it can be used to create custom datasets based on whatever source materials you throw at it.\n\n## Resources\n* [Summarizing Books](https://openai.com/research/summarizing-books) OpenAI\n\n### Leaderboards\n* [Small LLM Leaderboard](https://huggingface.co/spaces/w601sxs/SLM-Leaderboard) HuggingFace\n* [HuggingFace - Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)\n* [Chatbot Arena Leaderboard](https://chat.lmsys.org/)\n* [Hallucination leaderboard](https://github.com/vectara/hallucination-leaderboard/) 
Vectara\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcognitivetech%2Follama-ebook-summary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcognitivetech%2Follama-ebook-summary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcognitivetech%2Follama-ebook-summary/lists"}