{"id":15925824,"url":"https://github.com/clearedge-ai/clearedge","last_synced_at":"2025-03-24T14:32:34.857Z","repository":{"id":230121628,"uuid":"772312972","full_name":"Clearedge-AI/clearedge","owner":"Clearedge-AI","description":"Build a RAG preprocessing pipeline ","archived":false,"fork":false,"pushed_at":"2024-04-07T02:43:02.000Z","size":25927,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-19T03:56:13.358Z","etag":null,"topics":["document-parser","haystack","langchain","llamaindex","llm","ocr","pdf","pdf-ocr-extraction","pdf-to-json","pdf-to-text","rag-pipeline","retrieval-augmented-generation","table-detection","table-recognition"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Clearedge-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-15T00:12:50.000Z","updated_at":"2024-12-30T22:00:25.000Z","dependencies_parsed_at":"2024-10-23T01:02:47.013Z","dependency_job_id":null,"html_url":"https://github.com/Clearedge-AI/clearedge","commit_stats":{"total_commits":49,"total_committers":2,"mean_commits":24.5,"dds":"0.061224489795918324","last_synced_commit":"364c79b51876aefafb2d8f0f1e2c3afd9ef6fb87"},"previous_names":["clearedge-ai/clearedge"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clearedge-AI%2Fclearedge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/Clearedge-AI%2Fclearedge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clearedge-AI%2Fclearedge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clearedge-AI%2Fclearedge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Clearedge-AI","download_url":"https://codeload.github.com/Clearedge-AI/clearedge/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245289747,"owners_count":20591125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-parser","haystack","langchain","llamaindex","llm","ocr","pdf","pdf-ocr-extraction","pdf-to-json","pdf-to-text","rag-pipeline","retrieval-augmented-generation","table-detection","table-recognition"],"created_at":"2024-10-06T22:04:45.619Z","updated_at":"2025-03-24T14:32:33.501Z","avatar_url":"https://github.com/Clearedge-AI.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Clearedge-AI/clearedge/blob/main/notebooks/quicktour.ipynb)\n## Overview\n\nClearedge is a Python package designed to simplify the process of extracting raw text and metadata from documents. You can use it to retrieve not only the text but also valuable metadata including titles, subheadings, page numbers, file names, bounding box (bbox) coordinates, and chunk types. 
Whether you're working on document analysis, data extraction projects, or building a RAG app with an LLM, Clearedge provides a straightforward and efficient solution.\n\n## Features\n\n- Text Extraction: Extract raw text from documents (currently PDF only; other file types are coming soon).\n- Metadata Retrieval: Obtain metadata such as subheadings, page numbers, file names, bounding boxes, and more.\n- Bounding Box Coordinates: Access bbox coordinates for text chunks, enabling spatial analysis of text placement within documents.\n- Chunk Type Identification: Identify types of text chunks (e.g., table, text, and more) for advanced content analysis.\n- Support for Multiple Formats (coming soon): Compatible with popular document formats, ensuring broad usability.\n\n## Installation\n\n### Prerequisites\n\nTo install clearedge, you will need Python 3.8 or later.\n\nBecause clearedge uses Tesseract for OCR, you also need to install it as a system dependency.\n\nOn macOS, run:\n```shell\nbrew install tesseract\n```\n\nOn Ubuntu, run:\n```shell\nsudo apt install tesseract-ocr\n```\n\n### Latest release\n\nYou can then install the latest release of the package from [PyPI](https://pypi.org/project/clearedge/) as follows:\n\n```bash\npip install clearedge\n```\n\n## Quick Start\n\nHere's a simple example to get you started with clearedge:\n\n```python\nfrom clearedge.reader.pdf import process_pdf\n\n# Call the extractor with the path to your document\nchunks = process_pdf('/path/to/your/document.pdf', use_ocr=True) # do not add use_ocr for faster processing. 
output is less accurate without ocr.\n\n# Extract text and metadata\nfor chunk in chunks:\n    text, metadata = chunk.text, chunk.metadata\n    print(text) # Accessing extracted text\n    print(metadata.to_dict()) # Accessing metadata\n```\n\n## Documentation\n\nFor more detailed information on all the features and functionalities of Clearedge, please refer to the official documentation (coming soon).\n\n## Contributing\n\nContributions to Clearedge are welcome! If you have suggestions for improvements or bug fixes, please feel free to:\n\n- Open an issue to discuss what you would like to change.\n- Submit pull requests for us to review.\n\n## Citation\n\nIf you wish to cite this project, feel free to use this [BibTeX](http://www.bibtex.org/) reference:\n\n```bibtex\n@misc{clearedge2024,\n    title={clearedge: RAG preprocessor},\n    author={Clearedge AI},\n    year={2024},\n    publisher = {GitHub},\n    howpublished = {\\url{https://github.com/Clearedge-AI/clearedge}}\n}\n```\n\n## License\n\nClearedge is released under the Apache 2.0 License. See the [`LICENSE`](https://github.com/Clearedge-AI/clearedge?tab=Apache-2.0-1-ov-file#readme) file for more details.\n\n## Acknowledgments\n\nThis project was inspired by the need for a simple, yet comprehensive tool for document analysis and metadata extraction. We thank all contributors and users for their support and feedback. Clearedge aims to be a valuable tool for developers, researchers, and anyone involved in processing and analyzing document content. We hope it simplifies your projects and helps you achieve your goals more efficiently.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclearedge-ai%2Fclearedge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclearedge-ai%2Fclearedge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclearedge-ai%2Fclearedge/lists"}