{"id":30687623,"url":"https://github.com/codinglabsong/anyfile-agent","last_synced_at":"2026-05-07T10:33:19.361Z","repository":{"id":307344061,"uuid":"1026834273","full_name":"codinglabsong/anyfile-agent","owner":"codinglabsong","description":"Ask any questions about your own documents via natural language! Supports many file extensions.","archived":false,"fork":false,"pushed_at":"2025-08-16T02:02:09.000Z","size":421,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-02T00:07:47.161Z","etag":null,"topics":["agent","duckdb","embeddings","gradio","natural-language","ocr","persistent-memory","persistent-storage","prompt-engineering","rag","rag-chatbot","sql","sqlqueries"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codinglabsong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-26T17:56:58.000Z","updated_at":"2025-08-16T02:02:12.000Z","dependencies_parsed_at":"2025-07-31T09:04:55.753Z","dependency_job_id":null,"html_url":"https://github.com/codinglabsong/anyfile-agent","commit_stats":null,"previous_names":["codinglabsong/anyfile-agent"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/codinglabsong/anyfile-agent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codinglabsong%2Fanyfile-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codinglabsong%2Fanyfile-agent/tags","releases_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codinglabsong%2Fanyfile-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codinglabsong%2Fanyfile-agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codinglabsong","download_url":"https://codeload.github.com/codinglabsong/anyfile-agent/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codinglabsong%2Fanyfile-agent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32733581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-07T02:14:30.463Z","status":"ssl_error","status_checked_at":"2026-05-07T02:14:29.405Z","response_time":62,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","duckdb","embeddings","gradio","natural-language","ocr","persistent-memory","persistent-storage","prompt-engineering","rag","rag-chatbot","sql","sqlqueries"],"created_at":"2025-09-02T00:03:37.195Z","updated_at":"2026-05-07T10:33:19.344Z","avatar_url":"https://github.com/codinglabsong.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Anyfile-Agent\nAnyfile-Agent lets you query your own documents using natural language. 
It indexes a folder of files, converts CSV and Excel sheets into a DuckDB database, and performs semantic search via vector retrieval. Built with LangChain/LangGraph, this interactive LLM agent combines RAG-based retrieval and SQL querying so you can “chat” with your data.\n\n## Features\n- **Multi-format ingestion** – Images are processed through OCR so their text is indexed. PDFs, Word docs, PowerPoint, Markdown, HTML, and plain text are split into searchable chunks. \n- **Data summarization** – CSV and Excel files are loaded into DuckDB tables. Summary cards for each table are added to the vector index.\n- **Embeddings \u0026 retrieval** – Documents are embedded with `GoogleGenerativeAIEmbeddings` and stored in a FAISS vector database for fast top-k semantic search.\n- **SQL integration** – The agent can issue DuckDB queries over your uploaded spreadsheets. Only `SELECT` and `PRAGMA` statements are allowed for safety.\n- **Prompt engineering** – System prompts and tool descriptions were iteratively tuned to guide the RAG‑based agent through schema inspection, query planning, and result synthesis.\n- **Persistent conversations** – The agent saves its conversation history with you to SQLite with a `thread_id` so that you can resume or switch between chats.\n- **Gradio App** – Run a user-friendly interface for interactive sessions.\n- **Developer Tools \u0026 CI** – Linting with ruff and black, unit tests with pytest, end‐to‐end smoke tests in GitHub Actions.\n\n## Installation\n1. Install system packages needed for OCR (first time only):\n   ```bash\n   sudo apt update\n   sudo apt install -y tesseract-ocr libtesseract-dev\n   ```\n2. Install the Python package and dependencies:\n   ```bash\n   pip install -e .\n   pip install -r requirements-dev.txt  # optional dev tools\n   ```\n\n## Usage\n### CLI\n1. Place the documents you want to search under `data/` directory.\n2. Run the agent. 
The first run may take a while as it loads and indexes the files:\n   ```bash\n   bash scripts/run_agent.sh --thread_id 12345 --ask \"What kinds of files have I provided?\" --load_data\n   ```\n   Additional options are available via:\n   ```bash\n   bash scripts/run_agent.sh --help\n   ```\n3. For later sessions, omit `--load_data` to reuse the existing FAISS index and DuckDB database. \n4. If you have added more documents under `data/`, please load them again using `--load_data`.\n\n### Gradio App\nRun the App Locally:\n```bash\npython app.py\n```\n- Visit the printed URL (e.g., `http://127.0.0.1:7860`) to interact with the agent.\n\n## Supported File Types\n- Text documents: PDF, DOCX, PPTX, Markdown, HTML, TXT\n- Images: PNG, JPG, JPEG, TIFF (text processed via OCR)\n- Spreadsheets: CSV, XLSX\n   * For best results with XLSX, use a simple tabular layout—one header row, uniform columns, and no merged cells or custom formatting. You can have multiple sheets.\n\n## Example Results\n### User Friendly Gradio Interface\n![Gradio Interface Example](assets/gradio2.png)\n\n### Multi-Step Reasoning with Tool Use\n\u003cdiv style=\"max-height:400px; overflow-y:auto; border:1px solid #ccc; padding:8px;\"\u003e\n  \u003cpre\u003e\u003ccode class=\"language-bash\"\u003e\n================================ Human Message =================================\n\nCan you generate some mathematically useful information across these excel files?\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_schema (aea9b712-105d-4253-9fdb-0ef4eebc79b3)\n Call ID: aea9b712-105d-4253-9fdb-0ef4eebc79b3\n  Args:\n    table_names: file_example_xlsx_50__sheet1\n================================= Tool Message =================================\nName: sql_db_schema\n\n\nCREATE TABLE file_example_xlsx_50__sheet1 (\n        \"0\" BIGINT, \n        \"First Name\" VARCHAR, \n        \"Last Name\" VARCHAR, \n        \"Gender\" VARCHAR, \n        \"Country\" 
VARCHAR, \n        \"Age\" BIGINT, \n        \"Date\" VARCHAR, \n        \"Id\" BIGINT\n)\n\n/*\n3 rows from file_example_xlsx_50__sheet1 table:\n0       First Name      Last Name       Gender  Country Age     Date    Id\n1       Dulce   Abril   Female  United States   32      15/10/2017      1562\n2       Mara    Hashimoto       Female  Great Britain   25      16/08/2016      1582\n3       Philip  Gent    Male    France  36      21/05/2015      2587\n*/\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_schema (78cbce3d-fb28-42d6-80bd-eb77d277f7f9)\n Call ID: 78cbce3d-fb28-42d6-80bd-eb77d277f7f9\n  Args:\n    table_names: customers_100\n================================= Tool Message =================================\nName: sql_db_schema\n\n\nCREATE TABLE customers_100 (\n        \"Index\" BIGINT, \n        \"Customer Id\" VARCHAR, \n        \"First Name\" VARCHAR, \n        \"Last Name\" VARCHAR, \n        \"Company\" VARCHAR, \n        \"City\" VARCHAR, \n        \"Country\" VARCHAR, \n        \"Phone 1\" VARCHAR, \n        \"Phone 2\" VARCHAR, \n        \"Email\" VARCHAR, \n        \"Subscription Date\" DATE, \n        \"Website\" VARCHAR\n)\n\n/*\n3 rows from customers_100 table:\nIndex   Customer Id     First Name      Last Name       Company City    Country Phone 1 Phone 2 Email   Subscription Date       Website\n1       DD37Cf93aecA6Dc Sheryl  Baxter  Rasmussen Group East Leonard    Chile   229.077.5154    397.884.0519x718        zunigavanessa@smith.info        2020-08-24      http://www.stephenson.com/\n2       1Ef7b82A4CAAD10 Preston Lozano  Vega-Gentry     East Jimmychester       Djibouti        5153435776      686-620-1820x944        vmata@colon.com 2021-04-23      http://www.hobbs.com/\n3       6F94879bDAfE5a6 Roy     Berry   Murillo-Perry   Isabelborough   Antigua and Barbuda     +1-539-402-0259 (496)978-3969x58947     beckycarr@hogan.com     2020-03-25      
http://www.lawrence.com/\n*/\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query_checker (9c7c199a-e8d5-45f8-acef-996eb9039681)\n Call ID: 9c7c199a-e8d5-45f8-acef-996eb9039681\n  Args:\n    query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;\n================================= Tool Message =================================\nName: sql_db_query_checker\n\nSELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query (e1413327-1faf-41f1-bb90-8ea283bd52d8)\n Call ID: e1413327-1faf-41f1-bb90-8ea283bd52d8\n  Args:\n    query: SELECT AVG(Age) AS AverageAge, COUNT(CASE WHEN Gender = 'Male' THEN 1 END) AS MaleCount, COUNT(CASE WHEN Gender = 'Female' THEN 1 END) AS FemaleCount FROM file_example_xlsx_50__sheet1;\n================================= Tool Message =================================\nName: sql_db_query\n\n[(33.26, 12, 38)]\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query_checker (d6a472b0-24e7-4103-bf5e-4b7453a6c70a)\n Call ID: d6a472b0-24e7-4103-bf5e-4b7453a6c70a\n  Args:\n    query: SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;\n================================= Tool Message =================================\nName: sql_db_query_checker\n\nSELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query (5a4d951c-4fce-4cc8-98ad-764d91ab6b78)\n Call ID: 
5a4d951c-4fce-4cc8-98ad-764d91ab6b78\n  Args:\n    query: SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 GROUP BY Country ORDER BY CustomerCount DESC LIMIT 5;\n================================= Tool Message =================================\nName: sql_db_query\n\n[('Solomon Islands', 4), ('Netherlands', 2), ('Sri Lanka', 2), ('United Arab Emirates', 2), ('Belarus', 2)]\n================================== Ai Message ==================================\n\nFrom the `file_example_xlsx_50__sheet1` table:\n*   The average age is 33.26 years.\n*   There are 12 males and 38 females.\n\nFrom the `customers_100` table, the top 5 countries by customer count are:\n*   Solomon Islands: 4 customers\n*   Netherlands: 2 customers\n*   Sri Lanka: 2 customers\n*   United Arab Emirates: 2 customers\n*   Belarus: 2 customers\n  \u003c/code\u003e\u003c/pre\u003e\n\u003c/div\u003e\n\n### Persistent Memory and Database\n\u003cdiv style=\"max-height:400px; overflow-y:auto; border:1px solid #ccc; padding:8px;\"\u003e\n  \u003cpre\u003e\u003ccode class=\"language-bash\"\u003e\n================================ Human Message =================================\n\nCould you please add the customer counts for these 5 countries?\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query_checker (7a0f341a-b860-4398-997e-3352f77c081e)\n Call ID: 7a0f341a-b860-4398-997e-3352f77c081e\n  Args:\n    query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)\n================================= Tool Message =================================\nName: sql_db_query_checker\n\nSELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY 
Country)\n================================== Ai Message ==================================\nTool Calls:\n  sql_db_query (6fe70492-8dc5-42c7-92f2-3de22caf15ab)\n Call ID: 6fe70492-8dc5-42c7-92f2-3de22caf15ab\n  Args:\n    query: SELECT SUM(CustomerCount) FROM (SELECT Country, COUNT(*) AS CustomerCount FROM customers_100 WHERE Country IN ('Solomon Islands', 'Netherlands', 'Sri Lanka', 'United Arab Emirates', 'Belarus') GROUP BY Country)\n================================= Tool Message =================================\nName: sql_db_query\n\n[(12,)]\n================================== Ai Message ==================================\n\nThe total customer count for Solomon Islands, Netherlands, Sri Lanka, United Arab Emirates, and Belarus is 12.\n  \u003c/code\u003e\u003c/pre\u003e\n\u003c/div\u003e\n\n## Code Hygiene Tools \u0026 Testing\nRun formatting checks and unit tests with:\n```bash\npre-commit run --all-files\npytest\n```\n\n## Repository Structure\n- `src/any_chatbot/` – core modules for indexing, tools, and agent\n- `data/` – directory to add your files for CLI interface.\n- `scripts/` – helper script to launch the agent\n- `notebooks/` – example notebooks for experiments\n- `tests/` – unit tests for the indexing and tool utilities\n\n## Requirements\n- Python 3.10+\n- A Google Gemini API key (`GOOGLE_API_KEY` environment variable)\n\n## Contributing\nContributions are welcome! 
Feel free to open issues or pull requests.\n\n## Acknowledgements\n- [LangChain Docs](https://python.langchain.com/docs/introduction/)\n- [LVMH Annual Report](https://www.lvmh.com/en/investors)\n\n## License\nThis project is licensed under the [MIT License](LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodinglabsong%2Fanyfile-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodinglabsong%2Fanyfile-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodinglabsong%2Fanyfile-agent/lists"}