{"id":19883066,"url":"https://github.com/taivop/agentreader","last_synced_at":"2025-05-02T14:33:01.616Z","repository":{"id":170739495,"uuid":"644842392","full_name":"taivop/agentreader","owner":"taivop","description":"Simple web browsing for your Langchain agent. ","archived":false,"fork":false,"pushed_at":"2023-05-29T21:56:04.000Z","size":24,"stargazers_count":32,"open_issues_count":2,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-07T02:42:00.425Z","etag":null,"topics":["agent","autogpt","gpt","langchain","llm","tool"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taivop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-24T11:22:18.000Z","updated_at":"2024-12-09T01:21:01.000Z","dependencies_parsed_at":"2024-02-06T05:32:47.225Z","dependency_job_id":null,"html_url":"https://github.com/taivop/agentreader","commit_stats":null,"previous_names":["taivop/agentreader"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taivop%2Fagentreader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taivop%2Fagentreader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taivop%2Fagentreader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taivop%2Fagentreader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taivop","download_url":"
https://codeload.github.com/taivop/agentreader/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252053936,"owners_count":21687196,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","autogpt","gpt","langchain","llm","tool"],"created_at":"2024-11-12T17:19:18.137Z","updated_at":"2025-05-02T14:33:01.601Z","avatar_url":"https://github.com/taivop.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# agentreader - a minimal drop-in browser tool for LLM agents\n\nAgentreader is a simple drop-in Python module that gives your LLM agents the ability to read the Internet.\n\nFeatures:\n\n* Returns plain text instead of raw HTML.\n* No API key needed. Just copy-paste `reader.py` into your project.\n* Implements Langchain's `BaseTool` interface -- drop the tool into any existing agent.\n* Extracts page title, authors, and other metadata.\n* Supports paging through results to respect your context window limits.\n* Supports both single-input tools (only URL) and [structured tools](https://python.langchain.com/en/latest/modules/agents/agents/examples/structured_chat.html) that give your agent more control over the output.\n\n## Usage\n\nCopy-paste the `reader.py` file into your project.\n\nInstall dependencies:\n\n```bash\npip install langchain trafilatura newspaper3k\n```\n\nImport and initialize the tool:\n\n```python\nfrom reader import SimpleReaderTool\nreader_tool = SimpleReaderTool()\n```\n\nThat's it! 
You can now add `reader_tool` to the list of tools available to your agent.\n\nFor full examples of usage, and details about the multi-input (\"structured\") tools, see the [Usage guide](Usage_guide.ipynb).\n\n## Background\n\nWhile trying to make robust autonomous agents in the style of [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT), I ran into two problems, both common to LLM agents in general.\n\n* The raw HTML output from `requests.get` wastes tokens and inserts a lot of confusing tokens. The relevant part of the website is usually the body text.\n* Using Playwright introduces too many possible actions, which confuses the agent -- it starts to use tools unnecessarily. Also, it's finicky to integrate.\n\nAgentreader is a solution to both. Really it is just a very thin wrapper around libraries that extract text from a website.\n\n## But does it work?\n\nIn the [Usage guide](Usage_guide.ipynb) notebook, one of the examples works well and the other does not. However, I think that is mostly because the default Langchain agents are very un-optimized. I originally used it in a heavily-modified two-stage agent (AutoGPT-style) where I got it to work very well in combination with a SerpAPI-based search tool. If it doesn't work for you, tinker with the prompts, remove every unnecessary tool and part of the prompt, and you may see better results.\n\n## Tests\n\nOne-off: `pip install pytest`.\n\n```bash\npytest test_reader.py\n```\n\n## Contributing\n\nI've structured this repo in the expectation that you will do heavy customization to how it works -- prompt engineering, adding support for specific websites, or even replacing the underlying text-extraction libraries. 
That is because key prompts are baked into the code: `ToolInput` introduces strings that describe input arguments, and output string templates describe the format of the output.\n\nIt would of course be possible to generalize these, but with this repo I favor simplicity over generality.\n\nThat said: if you discover something valuable -- either an additional feature, or a way to make this library even simpler -- then open a ticket/PR and let's discuss!\n\n\n## TODOs\n- [ ] reduce the number of dependencies (`trafilatura` has a dependency conflict with e.g. `openai/evals`).\n- [ ] extract metadata more consistently - currently doesn't work when falling back to `newspaper3k`\n- [ ] better support for e.g. Twitter and other popular non-article content\n- [ ] README.md: objective comparison against Playwright and `requests.get` - on features, output token counts for a few websites, etc\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaivop%2Fagentreader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaivop%2Fagentreader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaivop%2Fagentreader/lists"}