{"id":29698717,"url":"https://github.com/moeki0/baran","last_synced_at":"2025-07-23T10:38:25.813Z","repository":{"id":170187328,"uuid":"646313748","full_name":"moeki0/baran","owner":"moeki0","description":"Text Splitter for Large Language Model (LLM) datasets.","archived":false,"fork":false,"pushed_at":"2025-05-31T11:31:00.000Z","size":56,"stargazers_count":2,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-19T23:31:09.392Z","etag":null,"topics":["ai","gem","llm","markdown","ruby"],"latest_commit_sha":null,"homepage":"https://rubygems.org/gems/baran","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moeki0.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-28T01:33:12.000Z","updated_at":"2025-06-15T01:25:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"41c499cb-3487-4387-a30e-8ac5ae146456","html_url":"https://github.com/moeki0/baran","commit_stats":null,"previous_names":["moekidev/baran","kawakamimoeki/baran","hackluckcat/baran","kawakamidev/baran","moekiorg/baran"],"tags_count":15,"template":false,"template_full_name":null,"purl":"pkg:github/moeki0/baran","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moeki0%2Fbaran","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moeki0%2Fbaran/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moeki0%2Fbaran/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
moeki0%2Fbaran/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moeki0","download_url":"https://codeload.github.com/moeki0/baran/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moeki0%2Fbaran/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266664241,"owners_count":23964930,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-23T02:00:09.312Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","gem","llm","markdown","ruby"],"created_at":"2025-07-23T10:38:25.178Z","updated_at":"2025-07-23T10:38:25.807Z","avatar_url":"https://github.com/moeki0.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Baran\n\n![v](https://badgen.net/rubygems/v/baran)\n![dt](https://badgen.net/rubygems/dt/baran)\n![license](https://badgen.net/github/license/kawakamimoeki/baran)\n\nText Splitter for Large Language Model datasets.\n\nTo avoid token constraints and improve the accuracy of vector search in the Large Language Model, it is necessary to divide the document. 
This gem splits text using the strategy you choose.\n\n## Features\n\nBaran provides efficient text splitting capabilities with the following key features:\n\n- **Chunk Size Control**: Split text into specified sizes\n- **Overlap Management**: Maintain continuity between chunks with configurable overlap\n- **Context Preservation**: Respect semantic boundaries in text\n- **Metadata Support**: Attach metadata to each chunk\n- **Multiple Splitting Strategies**: Character-based, recursive, sentence-based, and Markdown-aware splitting\n\n## Installation\n\n### Using Bundler\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'baran'\n```\n\nAnd then execute:\n\n    $ bundle install\n\n### Direct Installation\n\n    $ gem install baran\n\n## Quick Start\n\n```ruby\nrequire 'baran'\n\n# Basic text splitting\nsplitter = Baran::CharacterTextSplitter.new(chunk_size: 500, chunk_overlap: 50)\nchunks = splitter.chunks(\"Your long text here...\")\n\n# Access chunk data\nchunks.each do |chunk|\n  puts \"Text: #{chunk[:text]}\"\n  puts \"Position: #{chunk[:cursor]}\"\nend\n```\n\n## Usage\n\n### Default Parameters\n\n- `chunk_size`: 1024 (characters)\n- `chunk_overlap`: 64 (characters)\n\n### Character Text Splitter\n\nSplitting on a single specified separator string.\n\n```ruby\nsplitter = Baran::CharacterTextSplitter.new(\n    chunk_size: 1024,\n    chunk_overlap: 64,\n    separator: \"\\n\\n\"\n)\nchunks = splitter.chunks(text, metadata: { source: \"document.txt\" })\n# =\u003e [{ cursor: 0, text: \"...\", metadata: { source: \"document.txt\" } }, ...]\n```\n\n### Recursive Character Text Splitter\n\nSplitting recursively on the specified separators, using the first separator in the list that occurs in the text.\n\n```ruby\nsplitter = Baran::RecursiveCharacterTextSplitter.new(\n    chunk_size: 1024,\n    chunk_overlap: 64,\n    separators: [\"\\n\\n\", \"\\n\", \" \", \"\"]\n)\nchunks = splitter.chunks(text, metadata: { type: \"article\" })\n# =\u003e [{ cursor: 0, text: \"...\", 
metadata: { type: \"article\" } }, ...]\n```\n\n### Sentence Text Splitter\n\nSplitting text by sentence boundaries (periods, exclamation marks, question marks).\n\n```ruby\nsplitter = Baran::SentenceTextSplitter.new(\n    chunk_size: 2000,\n    chunk_overlap: 200\n)\nchunks = splitter.chunks(text)\n# =\u003e [{ cursor: 0, text: \"Complete sentence.\", metadata: nil }, ...]\n```\n\n### Markdown Text Splitter\n\nSplitting by Markdown structure with awareness of headers, code blocks, and other elements.\n\n```ruby\nsplitter = Baran::MarkdownSplitter.new(\n    chunk_size: 1500,\n    chunk_overlap: 150\n)\nchunks = splitter.chunks(markdown_text, metadata: { format: \"markdown\" })\n# =\u003e [{ cursor: 0, text: \"# Header\\n\\nContent...\", metadata: { format: \"markdown\" } }, ...]\n```\n\nSplit with the following priority:\n\n```ruby\n[\n    \"\\n# \",         # h1\n    \"\\n## \",        # h2\n    \"\\n### \",       # h3\n    \"\\n#### \",      # h4\n    \"\\n##### \",     # h5\n    \"\\n###### \",    # h6\n    \"```\\n\\n\",      # code block\n    \"\\n\\n***\\n\\n\",  # horizontal rule\n    \"\\n\\n---\\n\\n\",  # horizontal rule\n    \"\\n\\n___\\n\\n\",  # horizontal rule\n    \"\\n\\n\",         # paragraph break\n    \"\\n\",           # line break\n    \" \",            # space\n    \"\"              # character\n]\n```\n\n## Advanced Usage\n\n### Working with Metadata\n\n```ruby\nsplitter = Baran::RecursiveCharacterTextSplitter.new\ndocument_text = File.read('document.txt')\n\nchunks = splitter.chunks(\n  document_text,\n  metadata: {\n    source: 'document.txt',\n    created_at: Time.now,\n    author: 'Author Name'\n  }\n)\n\nchunks.each do |chunk|\n  puts \"Text: #{chunk[:text]}\"\n  puts \"Position: #{chunk[:cursor]}\"\n  puts \"Source: #{chunk[:metadata][:source]}\"\nend\n```\n\n### Processing Large Documents\n\n```ruby\nclass DocumentProcessor\n  def initialize\n    @splitter = Baran::RecursiveCharacterTextSplitter.new(\n      chunk_size: 1000,\n      
chunk_overlap: 100\n    )\n  end\n\n  def process_file(file_path)\n    content = File.read(file_path)\n    \n    chunks = @splitter.chunks(\n      content,\n      metadata: {\n        file_path: file_path,\n        file_size: File.size(file_path),\n        processed_at: Time.now\n      }\n    )\n\n    chunks.each_with_index do |chunk, index|\n      save_to_vector_store(chunk, index)\n    end\n  end\n\n  private\n\n  def save_to_vector_store(chunk, index)\n    # Your vector storage logic here\n    puts \"Saved chunk #{index}: #{chunk[:text].length} chars\"\n  end\nend\n```\n\n### Comparing Splitting Strategies\n\n```ruby\ntext = File.read('sample.md')\n\n# Character-based splitting\nchar_splitter = Baran::CharacterTextSplitter.new(chunk_size: 500)\nchar_chunks = char_splitter.chunks(text)\n\n# Recursive splitting\nrecursive_splitter = Baran::RecursiveCharacterTextSplitter.new(chunk_size: 500)\nrecursive_chunks = recursive_splitter.chunks(text)\n\n# Markdown-aware splitting\nmd_splitter = Baran::MarkdownSplitter.new(chunk_size: 500)\nmd_chunks = md_splitter.chunks(text)\n\nputs \"Character-based: #{char_chunks.length} chunks\"\nputs \"Recursive: #{recursive_chunks.length} chunks\"\nputs \"Markdown-aware: #{md_chunks.length} chunks\"\n```\n\n## API Reference\n\n### TextSplitter (Base Class)\n\nBase class for all text splitters.\n\n#### Methods\n\n##### `initialize(chunk_size: 1024, chunk_overlap: 64)`\n\n- `chunk_size` (Integer): Maximum characters per chunk\n- `chunk_overlap` (Integer): Characters to overlap between chunks\n\n##### `chunks(text, metadata: nil)`\n\nReturns an array of chunk hashes with `:text`, `:cursor`, and optional `:metadata` keys.\n\n### CharacterTextSplitter\n\nSplits text using a specified separator.\n\n#### Additional Parameters\n\n- `separator` (String): Character(s) to split on (default: \"\\n\\n\")\n\n### RecursiveCharacterTextSplitter\n\nRecursively splits text using multiple separators in priority order.\n\n#### Additional Parameters\n\n- 
`separators` (Array): Array of separators in priority order (default: [\"\\n\\n\", \"\\n\", \" \"])\n\n### SentenceTextSplitter\n\nSplits text at sentence boundaries using regex pattern matching.\n\nDetects sentences ending with `.`, `!`, or `?` followed by whitespace or end of string.\n\n### MarkdownSplitter\n\nSplits Markdown text while preserving document structure.\n\nInherits from `RecursiveCharacterTextSplitter` with Markdown-specific separators.\n\n## Best Practices\n\n### Choosing Chunk Size\n\n```ruby\n# For GPT-3.5 (4K context window)\nsmall_splitter = Baran::RecursiveCharacterTextSplitter.new(chunk_size: 500)\n\n# For GPT-4 (8K context window)\nmedium_splitter = Baran::RecursiveCharacterTextSplitter.new(chunk_size: 1000)\n\n# For Claude-2 (100K context window)\nlarge_splitter = Baran::RecursiveCharacterTextSplitter.new(chunk_size: 4000)\n```\n\n### Setting Overlap\n\n```ruby\n# General documents: 5-10% of chunk size\ngeneral_splitter = Baran::CharacterTextSplitter.new(\n  chunk_size: 1000,\n  chunk_overlap: 100  # 10%\n)\n\n# Technical documents: Higher overlap for better context\ntechnical_splitter = Baran::RecursiveCharacterTextSplitter.new(\n  chunk_size: 800,\n  chunk_overlap: 150  # ~19%\n)\n```\n\n### Choosing the Right Splitter\n\n- **CharacterTextSplitter**: Simple documents with consistent structure\n- **RecursiveCharacterTextSplitter**: General-purpose text splitting\n- **SentenceTextSplitter**: When sentence integrity is important\n- **MarkdownSplitter**: For Markdown documents and documentation\n\n## Error Handling\n\n```ruby\nbegin\n  # This will raise an error\n  invalid_splitter = Baran::TextSplitter.new(\n    chunk_size: 100,\n    chunk_overlap: 100  # overlap \u003e= chunk_size\n  )\nrescue RuntimeError =\u003e e\n  puts \"Error: #{e.message}\"\n  # =\u003e \"Cannot have chunk_overlap \u003e= chunk_size\"\nend\n```\n\n## Performance Considerations\n\nFor large files, consider streaming processing:\n\n```ruby\ndef 
process_large_file(file_path)\n  splitter = Baran::RecursiveCharacterTextSplitter.new\n  \n  File.foreach(file_path, \"\\n\\n\") do |paragraph|\n    chunks = splitter.chunks(paragraph)\n    chunks.each { |chunk| yield chunk }\n  end\nend\n\nprocess_large_file('huge_document.txt') do |chunk|\n  # Process each chunk individually\n  save_to_database(chunk)\nend\n```\n\n## Version Information\n\n- **Current Version**: 0.2.1\n- **Ruby Requirement**: \u003e= 2.6.0\n- **License**: MIT\n\n## Development\n\nAfter checking out the repo, run `bin/setup` to install dependencies. Then, run `bundle exec rake` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.\n\nTo run tests:\n\n```bash\nbundle exec rake\n```\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/kawakamimoeki/baran. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/kawakamimoeki/baran/blob/main/CODE_OF_CONDUCT.md).\n\n## License\n\nThe gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).\n\n## Code of Conduct\n\nEveryone interacting in the Baran project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/kawakamimoeki/baran/blob/main/CODE_OF_CONDUCT.md).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoeki0%2Fbaran","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoeki0%2Fbaran","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoeki0%2Fbaran/lists"}