{"id":24556100,"url":"https://github.com/deepfates/bookwyrm","last_synced_at":"2025-07-11T05:34:01.442Z","repository":{"id":242943758,"uuid":"810663597","full_name":"deepfates/bookwyrm","owner":"deepfates","description":"ingest, index, and encode information into one long file 🐉","archived":false,"fork":false,"pushed_at":"2024-06-11T20:29:03.000Z","size":52,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T04:38:26.637Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deepfates.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-05T06:14:31.000Z","updated_at":"2024-08-28T03:43:07.000Z","dependencies_parsed_at":"2024-06-10T21:25:55.296Z","dependency_job_id":null,"html_url":"https://github.com/deepfates/bookwyrm","commit_stats":null,"previous_names":["deepfates/bookwyrm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepfates%2Fbookwyrm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepfates%2Fbookwyrm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepfates%2Fbookwyrm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepfates%2Fbookwyrm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deepfates","download_url":"https://codeload.github.com/deepfates/bookwyrm/ta
r.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243910714,"owners_count":20367538,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-23T04:38:31.620Z","updated_at":"2025-03-16T17:46:07.437Z","avatar_url":"https://github.com/deepfates.png","language":"Python","readme":"# 🐉 bookwyrm \n\nThis is an ingestion pipeline for GitHub repos, websites, documents, and more.\n\nIt takes a list of URLs and outputs a bookwyrm: a long string of docs and embeddings, easily indexed. \n\nIf you've ever wanted to:\n\n- Chat with PDF\n- Chat with repo\n- Chat with video\n- Chat with notebook\n- Chat with website\n- Chat with files\n\nWell, this doesn't let you do that. It just processes them and spits out chunks of text with embeddings.\n\nWhen you want to actually chat with it, that's what [concat](https://github.com/deepfates/concat) is for.\n\n\u003c!-- Describe the different types of data we can scrape --\u003e\n\n## Using with Cog\nUse Cog to run predictions:\n```sh\ncog predict -i urls='[\"https://github.com/replicate/cog\"]'\n```\n\n## Using with Python\n\n### Setting Up the Environment\n1. **Create a new virtual environment**:\n   ```sh\n   python3 -m venv .venv\n   ```\n\n2. **Activate the virtual environment**:\n   - For `bash` or `zsh`:\n     ```sh\n     source .venv/bin/activate\n     ```\n   - For `fish`:\n     ```sh\n     source .venv/bin/activate.fish\n     ```\n\n\n3. **Install the required dependencies**:\n   ```sh\n   pip install -r requirements.txt\n   ```\n\n### Running the Pipeline\n\n1. 
**Run the main script**:\n   ```sh\n   python -m bookwyrm.bookwyrm\n   ```\n\nThis will process the test URLs and save the output to `wyrm.json`.\n\n### Use as a library\n\n```python\nimport asyncio\n\nfrom bookwyrm.bookwyrm import process_documents\n\nurls = [\"https://llm.datasette.io/en/stable/\"]\noutput = asyncio.run(process_documents(urls))\nwith open(\"wyrm.json\", \"w\") as f:\n    f.write(output.to_json())\n```\n\nRun the test script:\n```sh\npython test_script.py\n```\n\nThis should process the URLs and save the output to `wyrm.json`, confirming that your environment is correctly set up.\n\n---\n\n\nModified versions of the following third-party software components are included in this project:\n```\nProject: n-levels-of-rag\nSource: https://github.com/jxnl/n-levels-of-rag\nLicense: MIT\nCopyright: 2024 Jason Liu\nFiles: README.md\n```\n\n```\nProject: 1filellm\nSource: https://github.com/jimmc414/1filellm\nLicense: MIT\nCopyright: 2024 Jim McMillan\nFiles: onefilellm.py\n```\n\n## About\nThe Bookwyrm model is structured as a Python class that contains three main components:\n\n1. **Documents**: A list of DocumentRecord objects, each representing a document with its index, URI, and metadata.\n2. **Chunks**: A list of TextChunk objects, each representing a chunk of text from a document, along with its document index, local index, and global index.\n3. **Embeddings**: A NumPy array containing the embeddings for each text chunk.\n\nThis structure is designed to efficiently store and manage large collections of documents, their textual content, and their corresponding embeddings. 
By chunking the documents into smaller text segments and storing their embeddings, the Bookwyrm model enables efficient similarity searches and retrieval of relevant information from the corpus.\n\nThe separation of documents, chunks, and embeddings allows for flexible processing and manipulation of the data, while the use of NumPy arrays for embeddings facilitates efficient vector operations and similarity calculations.\n\nThe Bookwyrm model can be serialized to and deserialized from JSON format using the `to_json` and `from_json` methods defined in the `Bookwyrm` class.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepfates%2Fbookwyrm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepfates%2Fbookwyrm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepfates%2Fbookwyrm/lists"}