{"id":13716960,"url":"https://github.com/adileo/MicroFrontier","last_synced_at":"2025-05-07T06:31:47.670Z","repository":{"id":57296620,"uuid":"423119298","full_name":"adileo/MicroFrontier","owner":"adileo","description":"A lightweight crawler frontier implementation in TypeScript using Redis.","archived":false,"fork":false,"pushed_at":"2021-11-04T21:40:00.000Z","size":269,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-22T15:46:59.977Z","etag":null,"topics":["crawler","frontier","microservice","redis","robots-txt","spider"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adileo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-31T10:44:11.000Z","updated_at":"2024-11-18T14:54:08.000Z","dependencies_parsed_at":"2022-09-02T07:40:16.041Z","dependency_job_id":null,"html_url":"https://github.com/adileo/MicroFrontier","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adileo%2FMicroFrontier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adileo%2FMicroFrontier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adileo%2FMicroFrontier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adileo%2FMicroFrontier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adileo","download_url":"https://codeload.github.com/adileo/MicroFrontier/tar.gz/refs/
heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252826886,"owners_count":21810198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","frontier","microservice","redis","robots-txt","spider"],"created_at":"2024-08-03T00:01:16.043Z","updated_at":"2025-05-07T06:31:47.284Z","avatar_url":"https://github.com/adileo.png","language":"TypeScript","funding_links":[],"categories":["Built with Micro"],"sub_categories":["Utilities"],"readme":"# MicroFrontier \u0026middot; [![npm](https://img.shields.io/npm/dm/microfrontier.svg?style=flat-square)](https://npm-stat.com/charts.html?package=microfrontier) [![npm version](https://img.shields.io/npm/v/microfrontier.svg?style=flat-square)](https://www.npmjs.com/package/microfrontier) ![Docker Pulls](https://img.shields.io/docker/pulls/adileo/microfrontier?style=flat-square) ![Docker Image Size (tag)](https://img.shields.io/docker/image-size/adileo/microfrontier/latest?style=flat-square) [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg?style=flat-square)](https://www.gnu.org/licenses/gpl-3.0)\n\n\n\nA web crawler frontier implementation in TypeScript backed by Redis.\nMicroFrontier is a scalable and distributed frontier implemented through Redis Queues.\n\n- [x] Fast Ingestion \u0026 High throughput\n- [x] Easy to use HTTP Microservice or Javascript Client\n- [x] Multiple configurable priority queues\n- [x] Customizable stochastic function for priority queue picking\n- [x] Politeness Policy: Per-Hostname crawl rate limit or default fallback delay\n- [x] 
Multi-processing \u0026 concurrency support\n- [ ] Prioritization Strategy: breadth-first crawl, depth-first crawl, PageRank, etc. - TODO\n- [ ] URL Re-visit policy - TODO\n- [ ] URL canonicalization and Bloom filtering - TODO\n- [ ] URL Selection Policy - TODO\n\n\u003cbr\u003e\n\nMicroFrontier is inspired by the Mercator Frontier\u003csup\u003e[1](#footnote1)\u003c/sup\u003e\n\n![Queue](./docs/images/queue.png)\n\n## Why do you need MicroFrontier?\n\nThe frontier essentially answers a simple question: \"What URL should I crawl next?\"\nThis seems like a simple problem until you realize that you have to consider many factors:\n\n- Multiple crawlers should be able to work concurrently without overlapping\n- You have to be polite to websites (DDoSing a website isn't fun)\n- You have to visit a web page just once, or once in a while\n- Some pages are more important than others and should be crawled early on, while others are just spider traps\n\nSince I couldn't find a lightweight multipurpose frontier implementation, I made MicroFrontier, hoping that it could help researchers in the field.\n\n\n## Usage\n\nMicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker instance working as a microservice.\n\n### Command line usage\nInstall microfrontier with:\n```\nnpm i -g microfrontier\n```\nRun microfrontier:\n```bash\nmicrofrontier --host localhost --port 8090 --redis:host localhost --redis:port 6379\n# see the configuration section below for additional parameters\n```\n\n\n### As a JavaScript library\n\n```bash\nnpm i microfrontier\n\n# or\n\nyarn add microfrontier\n```\nSee the examples below for using the JavaScript client.\n\n### Docker\n```\ndocker pull adileo/microfrontier\n```\nYou can configure the Docker instance with the environment variables described below.\n\n## Configuration\n\n| ENV VAR  | CLI PARAMS | Description |\n| ------------- | --- |------------- |\n| host  | --host | Host name on which to start the microservice HTTP server. 
\u003cbr\u003eDefault value: `127.0.0.1`   |\n| port  | --port | Port on which to start the microservice HTTP server.\u003cbr\u003e Default value: `8090`   |\n| redis_host | --redis:host | Redis server host.\u003cbr\u003e Default value: `127.0.0.1`   |\n| redis_port | --redis:port | Redis server port.\u003cbr\u003e Default value: `6379`   |\n| redis_* | --redis:* | Parameters are interpreted by `nconf` and passed to `ioredis` as the client config.  |\n| config_frontierName | --config:frontierName | Prefix used for Redis keys.  |\n| config_* | --config:* | Parameters are interpreted by `nconf`; an example with the default values is shown below.  |\n\n```typescript\n{\n    frontierName: 'frontier', // Example ENV: config_frontierName=frontier\n    priorities: { // Example ENV: config_priorities={\"high\":{\"probability\":0.6},...}\n        'high':     {probability: 0.6},\n        'normal':   {probability: 0.3},\n        'low':      {probability: 0.1},\n    },\n    defaultCrawlDelay: 1000 // Example ENV: config_defaultCrawlDelay=1000\n}\n```\n\n# How to\n## Adding a URL to the frontier\nVia HTTP\n```bash\ncurl --location --request POST 'http://127.0.0.1:8090/frontier' \\\n--header 'Content-Type: application/json' \\\n--data-raw '{\n    \"url\": \"http://www.example.com\",\n    \"priority\": \"normal\",\n    \"meta\": {\n        \"foo\": \"bar\"\n    }\n}'\n```\nVia SDK\n```javascript\nimport { URLFrontier } from \"microfrontier\"\n\nconst frontier = new URLFrontier(config)\n\nfrontier.add(\"http://www.example.com\", \"normal\", {\"foo\": \"bar\"}).then(() =\u003e {\n    console.log('URL added')\n})\n```\n\n## Getting a URL from the frontier\n```bash\ncurl --location --request GET 'http://127.0.0.1:8090/frontier'\n```\n```javascript\nimport { URLFrontier } from \"microfrontier\"\n\nconst frontier = new URLFrontier(config)\n\nfrontier.get().then((item) =\u003e {\n    // {url: \"http://www.example.com\", meta: {\"foo\":\"bar\"}}\n})\n```\n\n## Per Hostname Rate-Limit\nImplemented, 
documentation WIP\n\n## Scaling the frontend queue workers\nImplemented, documentation WIP\n\n## Getting the number of enqueued URLs (for a hostname)\nImplemented, documentation WIP\n\n\u003cbr\u003e\n\n# Citations\n\n\u003ca id=\"footnote1\"\u003e[1]\u003c/a\u003e: [High-Performance Web Crawling](http://www.cs.cornell.edu/courses/cs685/2002fa/mercator.pdf) - Marc Najork, Allan Heydon\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadileo%2FMicroFrontier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadileo%2FMicroFrontier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadileo%2FMicroFrontier/lists"}