{"id":14965673,"url":"https://github.com/sabber-slt/netextract","last_synced_at":"2025-10-25T12:30:43.596Z","repository":{"id":252580196,"uuid":"840846741","full_name":"sabber-slt/NetExtract","owner":"sabber-slt","description":"NetExtract: Efficiently extract core content from any webpage and convert it to clean, LLM-optimized Markdown with a simple API.","archived":false,"fork":false,"pushed_at":"2024-08-18T14:01:37.000Z","size":1008,"stargazers_count":26,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-30T01:19:30.493Z","etag":null,"topics":["api","crawling","gemma2","llm","markdown","puppeteer"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sabber-slt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-10T21:42:46.000Z","updated_at":"2024-10-20T12:24:47.000Z","dependencies_parsed_at":"2024-08-10T22:44:41.018Z","dependency_job_id":"360f4b0e-3710-41c9-a15d-99c34dcac726","html_url":"https://github.com/sabber-slt/NetExtract","commit_stats":{"total_commits":12,"total_committers":1,"mean_commits":12.0,"dds":0.0,"last_synced_commit":"aa5410db87af68530fc00da79aa317866966807f"},"previous_names":["sabber-slt/netextract"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sabber-slt%2FNetExtract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sabber-slt%2FNetExtract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sabber-slt%2FNetExtract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sabber-slt%2FNetExtract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sabber-slt","download_url":"https://codeload.github.com/sabber-slt/NetExtract/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238137875,"owners_count":19422718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","crawling","gemma2","llm","markdown","puppeteer"],"created_at":"2024-09-24T13:35:04.617Z","updated_at":"2025-10-25T12:30:43.138Z","avatar_url":"https://github.com/sabber-slt.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003e\u003cstrong\u003eNetExtract\u003c/strong\u003e\u003c/h1\u003e\n  \u003cp\u003eNetExtract is crafted to extract core content from webpages and convert it into clean, LLM-friendly text. Leveraging the power of Express.js, TypeScript, and Puppeteer, it offers a streamlined API for efficient content extraction and transformation, making it an invaluable tool for enhancing LLM and RAG systems with up-to-date web information and API web scraping.\u003c/p\u003e\n\u003c/div\u003e\n\n![preview](./assets/x.png)\n\n## Features\n\n1. Core Content Extraction: Seamlessly extracts essential content from any URL.\n2. Markdown Conversion: Converts webpage content into clean, well-formatted Markdown.\n3. Social Media Scraping: Efficiently scrapes and formats X (Twitter) posts.\n4. Simple API Integration: Easily integrates with existing systems.\n5. LLM-Powered Conversion: Utilizes open-source large language models to enhance the extraction and conversion process, ensuring high-quality output.\n\n## 📖 Usage\n\nTo use NetExtract, prepend the API endpoint to your desired URL:\n\n```bash\nhttp://{your_address}/api?url={url}\n```\n\n## 🗂️ Getting started with Docker\n\n```bash\ngit clone https://github.com/sabber-slt/NetExtract\ncd NetExtract\n```\n\nThen run the application with Docker:\n\n```bash\ndocker compose up -d\n```\n\n## ⚡️ Acknowledgments\n\n- Inspired by jina.ai\n- Built with Node.js, Express.js, TypeScript, and Puppeteer\n\n## 🧩 Structure\n\n```\n.\n├── cookie\n│   └── twitter.json            # Twitter cookie for X (Twitter) post scraping\n├── docs                        # Documentation files\n├── search                      # Searxng engine\n├── src                         # Source code\n│   ├── interfaces              # TypeScript interfaces\n│   ├── lib                     # Utility libraries\n│   ├── routes                  # Express route handlers\n│   ├── services                # Core service layer for business logic\n│   ├── utils                   # Helper functions and utilities\n│   └── app.ts                  # Main application entry point\n├── .env                        # Environment variables\n├── .gitignore                  # Git ignored files\n├── .prettierignore             # Prettier ignored files\n├── .prettierrc.js              # Prettier configuration\n├── app.log                     # Log file\n├── Dockerfile                  # Dockerfile\n├── docker-compose.yaml         # Docker Compose configuration\n├── package.json                # Node.js project metadata\n├── README.md                   # Project README\n├── tsconfig.json               # TypeScript configuration\n└── yarn.lock                   # Yarn lockfile for dependency management\n\n```\n\n## 🤝 Contributing\n\nI welcome and appreciate contributions! If you'd like to contribute, please feel free to submit issues, fork the repository, and send pull requests.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsabber-slt%2Fnetextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsabber-slt%2Fnetextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsabber-slt%2Fnetextract/lists"}