{"id":15997258,"url":"https://github.com/YektaDev/Krawler","last_synced_at":"2025-10-21T05:30:22.168Z","repository":{"id":207971978,"uuid":"717656395","full_name":"YektaDev/Krawler","owner":"YektaDev","description":"A configurable HTML Crawler written in Kotlin (JVM), powered by Coroutines, Kotlin Serialization (JSON), Ktor Client, Exposed, and SQLite.","archived":false,"fork":false,"pushed_at":"2023-11-18T18:15:18.000Z","size":3773,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-10-08T08:05:02.680Z","etag":null,"topics":["crawl","crawler","crawlers","crawling"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YektaDev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-11-12T06:09:13.000Z","updated_at":"2024-09-23T19:09:52.000Z","dependencies_parsed_at":"2023-11-18T20:25:30.435Z","dependency_job_id":"19efa6ef-6dac-484d-863b-3c77c340b061","html_url":"https://github.com/YektaDev/Krawler","commit_stats":null,"previous_names":["yektadev/krawler"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YektaDev%2FKrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YektaDev%2FKrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YektaDev%2FKrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YektaDev%2FKrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/YektaDev","download_url":"https://codeload.github.com/YektaDev/Krawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237436564,"owners_count":19309933,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawl","crawler","crawlers","crawling"],"created_at":"2024-10-08T08:01:34.766Z","updated_at":"2025-10-21T05:30:21.261Z","avatar_url":"https://github.com/YektaDev.png","language":"Kotlin","funding_links":[],"categories":["Kotlin"],"sub_categories":[],"readme":"# Krawler: Asynchronous Kotlin Crawler 🚀\n\n## Overview\n\nKrawler is a fully configurable and asynchronous HTML Crawler written in Kotlin (JVM). 
Powered by **Coroutines**,\n**Kotlin Serialization (JSON)**, **Ktor Client**, **Exposed**, **SQLite**, and **SQLite JDBC**, Krawler provides a way\nto easily scrape HTML webpages.\n\n## Features\n\n- **Asynchronous Processing**: Utilizing Kotlin's coroutines, Krawler is designed for high-performance, concurrent web\n  crawling.\n\n- **Configurability**: Krawler is highly customizable through the `krawler_config.json` file, placed at the project\n  path.\n\n- **Extensive Logging**: Verbose logs can be enabled via the configuration file.\n\n- **Persisting Errors**: Errors during the crawling process are stored in the `CrawlErrors` table (with the necessary\n  metadata) and printed to the standard output.\n\n## Database Schema\n\nKrawler uses the following tables to persist data:\n\n```\nCrawlActivities : IntIdTable() {\n  varchar(\"sessionId\", 100)\n  long(\"atEpochSeconds\")\n  varchar(\"type\", 50)\n}\n\nCrawlErrors : IntIdTable() {\n  varchar(\"sessionId\", 100)\n  long(\"atEpochSeconds\")\n  text(\"url\")\n  text(\"error\")\n}\n\nCrawlingStates : IntIdTable() {\n  varchar(\"sessionId\", 100)\n  text(\"url\")\n  integer(\"depth\")\n  long(\"priority\")\n}\n\nWebpages : IntIdTable() {\n  varchar(\"sessionId\", 100)\n  long(\"atEpochSeconds\")\n  text(\"url\")\n  text(\"html\")\n}\n```\n\n## Configuration\n\nKrawler is highly customizable through the `krawler_config.json` file, placed at the project path. 
Below is a sample\nconfiguration containing all settings:\n\n  ```json\n  {\n  \"seeds\": [\n    \"https://en.wikipedia.org/wiki/NASA\"\n  ],\n  \"filter\": {\n    \"#\": \"dev.yekta.krawler.model.CrawlingFilter.Whitelist\",\n    \"allowPatterns\": [\n      \"https://en\\\\.wikipedia\\\\.org/wiki/.*\"\n    ]\n  },\n  \"depth\": 8,\n  \"maxPages\": 100,\n  \"maxPageSizeKb\": null,\n  \"concurrentConnections\": 16,\n  \"verbose\": true,\n  \"shouldFollowRedirects\": true,\n  \"userAgent\": \"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\",\n  \"connectTimeoutMs\": 6000,\n  \"readTimeoutMs\": 6000,\n  \"retriesOnServerError\": 0,\n  \"customHeaders\": null\n}\n  ```\n\n+ **`seeds`**: Starting URLs for crawling.\n+ **`filter`**: Crawling filter configuration, either **Whitelist** or **Blacklist**.\n+ **`depth`**: Maximum depth of crawling.\n+ **`maxPages`**: Maximum number of pages to crawl.\n+ **`maxPageSizeKb`**: Maximum page size in kilobytes.\n+ **`concurrentConnections`**: Number of concurrent connections for crawling.\n+ **`verbose`**: Enable verbose logging.\n+ **`shouldFollowRedirects`**: Specify if redirects should be followed.\n+ **`userAgent`**: User agent string for HTTP requests.\n+ **`connectTimeoutMs`**: Connection timeout in milliseconds.\n+ **`readTimeoutMs`**: Read timeout in milliseconds.\n+ **`retriesOnServerError`**: Number of retries on server errors (`5xx`).\n+ **`customHeaders`**: Additional custom headers for HTTP requests.\n\n## Good Next Steps\n\nThings that would benefit Krawler the most:\n\n+ Implementing Pause/Resume\n    + _Hint:_ The `UrlPool` is the only state that isn't currently being persisted but needs to be, in order to be able\n      to restore paused sessions.\n+ Config: `respectRobotsTxt: Boolean`\n+ Config: `consecutiveErrorsToPause: Int?`\n\n## Disclaimer\n\nKrawler was conceived and brought to life over a weekend, starting as a pet project. 
It was initially planned\nas a component of the coursework for the Web \u0026 Search Engines course at Yazd University, but then grew exponentially due\nto a sudden desire to make a \"good thing\" out of it! It's important to note that no explicit guarantees are made\nregarding its correctness, functionality, support, or any other aspect. With that in mind, happy Krawling!\n\n## License\n\nPlease **refer to [LICENSE](./LICENSE)** to view the project's license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYektaDev%2FKrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYektaDev%2FKrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYektaDev%2FKrawler/lists"}