{"id":14970823,"url":"https://github.com/nvk681/gumo","last_synced_at":"2026-03-16T22:03:07.113Z","repository":{"id":57259658,"uuid":"355261035","full_name":"nvk681/Gumo","owner":"nvk681","description":"A crawler that extracts data from a dynamic webpage. Written in node js. ","archived":false,"fork":false,"pushed_at":"2022-07-27T23:47:44.000Z","size":943,"stargazers_count":21,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-11T04:41:56.279Z","etag":null,"topics":["crawler","elasticsearch","neo4j","nodejs"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nvk681.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-06T16:39:19.000Z","updated_at":"2023-11-06T15:56:15.000Z","dependencies_parsed_at":"2022-08-24T21:52:16.186Z","dependency_job_id":null,"html_url":"https://github.com/nvk681/Gumo","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvk681%2FGumo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvk681%2FGumo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvk681%2FGumo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nvk681%2FGumo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nvk681","download_url":"https://codeload.github.com/nvk681/Gumo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":
219862885,"owners_count":16555951,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","elasticsearch","neo4j","nodejs"],"created_at":"2024-09-24T13:44:12.015Z","updated_at":"2026-03-16T22:03:07.107Z","avatar_url":"https://github.com/nvk681.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🕸️Gumo\n\n*\"Gumo\" (蜘蛛) is Japanese for \"spider\".*\n\n[![npm version](https://badge.fury.io/js/gumo.svg)](//npmjs.com/package/gumo) [![CI](https://github.com/nvk681/Gumo/actions/workflows/ci.yml/badge.svg)](https://github.com/nvk681/Gumo/actions/workflows/ci.yml) [![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://gumo.mit-license.org/)\n\n## Overview 👓\n\nA web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. 
Written in NodeJS.\n\n## Table of Contents 📖\n\n- [🕸️Gumo](#️gumo)\n  - [Overview 👓](#overview-)\n  - [Table of Contents 📖](#table-of-contents-)\n  - [Features 🌟](#features-)\n  - [Requirements 📋](#requirements-)\n- [Installation 🏗️](#installation-️)\n  - [Usage 👨‍💻](#usage-)\n  - [Development 🛠️](#development-)\n  - [Configuration ⚙️](#configuration-️)\n  - [ElasticSearch ⚡](#elasticsearch-)\n  - [GraphDB ☋](#graphdb-)\n    - [Nodes](#nodes)\n    - [Relationships](#relationships)\n  - [Changelog](#changelog)\n  - [TODO ☑️](#todo-️)\n\n## Features 🌟\n\n- Crawl hyperlinks present on the pages of any domain and its subdomains.\n- Scrape meta-tags and body text from every page.\n- Store entire sitemap in a GraphDB (currently supports Neo4J).\n- Store page content in ElasticSearch for easy full-text lookup.\n\n## Requirements 📋\n\n- **Node.js** ≥ 24.0.0 (LTS). Pinned in `package.json` (`engines`) and `.nvmrc` for [nvm](https://github.com/nvm-sh/nvm) users.\n- **Neo4j** 4.0+ when using the graph (constraint syntax requires it).\n\n## Installation 🏗️\n\n[![NPM](https://nodei.co/npm/gumo.png?mini=true)](https://nodei.co/npm/gumo/)\n\n1. Use Node 24+ (e.g. `nvm use` if you have [nvm](https://github.com/nvm-sh/nvm) and the repo’s `.nvmrc`).\n2. 
Install dependencies (uses `package-lock.json` for reproducible installs):\n\n   ```bash\n   npm install\n   ```\n   Or in CI: `npm ci`.\n\n## Usage 👨‍💻\n\nFrom code:\n\n```js\n// 1: import the module\nconst gumo = require('gumo')\n\n// 2: instantiate the crawler\nlet cron = new gumo()\n\n// 3: call the configure method and pass the configuration options\ncron.configure({\n    'neo4j': { // replace with your details or remove if not required\n        'url' : 'neo4j://localhost',\n        'user' : 'neo4j',\n        'password' : 'gumo123'\n    },\n    'elastic': { // replace with your details or remove if not required\n        'url' : 'http://localhost:9200',\n        'index' : 'myIndex'\n    },\n    'crawler': {\n        'url': 'https://www.example.com',\n    }\n});\n\n// 4: start crawling\ncron.insert()\n```\n\n**Note:** The config params passed to `cron.configure` above are the default values. See [Configuration](#configuration-️) for all options.\n\nWhen using Gumo as a dependency (e.g. `require('gumo')` with no `config.json` in your project), in-package defaults are used so the module loads; pass your Elasticsearch, Neo4j, and crawler settings via `configure()` before calling `insert()`.\n\n### Development 🛠️\n\n| Script   | Description                          |\n| -------- | ------------------------------------ |\n| `npm run dev` | Run the crawler (`node index.js`).   |\n| `npm run lint` | Run [ESLint](https://eslint.org/) on the project (see `eslint.config.js`). |\n| `npm test`    | Run tests (placeholder until tests are added). |\n\nCI runs on [GitHub Actions](https://github.com/nvk681/Gumo/actions) (Node 24, lint + test) on push/PR to `main`/`master`.\n\n## Configuration ⚙️\n\nThe behavior of the crawler can be customized by passing a custom configuration object to the `configure()` method. 
The following are the attributes which can be configured:\n\n| Attribute ( * - Mandatory )                    | Type          | Accepted Values  | Description                                                                                | Default Value           | Default Behavior                                                         |\n| :---------------------------- | :------------ | :--------------- | :----------------------------------------------------------------------------------------- | :---------------------- | :----------------------------------------------------------------------- |\n| * crawler.url                   | string        |                  | Base URL to start scanning from                                                            | \"\" (empty string)       | Module is disabled                                                       |\n| crawler.Cookie                | string        |                  | Cookie string to be sent with each request (useful for pages that require auth)            | \"\" (empty string)       | Cookies will not be attached to the requests                             |\n| crawler.saveOutputAsHtml      | string        | \"Yes\"/\"No\"       | Whether or not to store scraped content as HTML files in the output/html/ directory        | \"No\"                    | Saving output as HTML files is disabled                                  |\n| crawler.saveOutputAsJson      | string        | \"Yes\"/\"No\"       | Whether or not to store scraped content as JSON files in the output/json/ directory        | \"No\"                    | Saving output as JSON files is disabled                                  |\n| crawler.maxRequestsPerSecond  | int           | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second                      | 5000                    |                                                          
                 |\n| crawler.maxConcurrentRequests | int           | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000                    |                                                                          |\n| crawler.whiteList             | Array(string) |                  | If populated, only these URLs will be traversed                                            | [] (empty array)        | All URLs with the same hostname as the \"url\" attribute will be traversed |\n| crawler.blackList             | Array(string) |                  | If populated, these URLs will be ignored                                                   | [] (empty array)        |                                                                          |\n| crawler.depth                 | int           | range: 1 to 999  | Depth up to which nested hyperlinks will be followed                                       | 3                       |                                                                          |\n| * elastic.url                   | string        |                  | URI of the ElasticSearch instance to connect to                                            | \"http://localhost:9200\" |                                                                          |\n| * elastic.index                 | string        |                  | The name of the ElasticSearch index to store results in                                    | \"myIndex\"               |                                                                          |\n| * neo4j.url                     | string        |                  | The URI of a running Neo4J instance (uses the Bolt driver to connect)                      | \"neo4j://localhost\"     |                                                                          |\n| * neo4j.user                    | string        |                  | Neo4J server username                           
                                           | \"neo4j\"                 |                                                                          |\n| * neo4j.password                | string        |                  | Neo4J server password                                                                      | \"gumo123\"               |                                                                          |\n\n## ElasticSearch ⚡\n\nPage content is stored with the URL and a hash. The index is set via the `elastic.index` config (or `config.json`). If the index does not exist, it is created. Gumo uses the official `@elastic/elasticsearch` client; each page is indexed with **id** = hash and **document** = the page object (no separate type field).\n\n## GraphDB ☋\n\nThe sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:\n\n### Nodes\n\n- **Label**: Page\n- **Properties**:\n\n| Property Name | Type   | Description                                                                                                 |\n| :------------ | :----- | :---------------------------------------------------------------------------------------------------------- |\n| pid           | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |\n| link          | String | URL of the current page                                                                                     |\n| parent        | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |\n| title         | String | Page title as it appears in the page header                                                                 |\n\n### Relationships\n\n| Name       | Direction                | Condition         |\n| :--------- | :----------------------- | :---------------- |\n| links_to   | (a)-[r1:links_to]-\u003e(b)   | b.link 
= a.parent |\n| links_from | (b)-[r2:links_from]-\u003e(a) | b.link = a.parent |\n\n## Changelog\n\nSee [CHANGELOG.md](CHANGELOG.md) for version history and upgrading notes (e.g. Node 24, Elasticsearch client, Neo4j driver in v2.0.0).\n\n## TODO ☑️\n\n- [ ] Make it executable from CLI\n- [x] Allow passing config parameters when invoking Gumo\n- [x] CI (GitHub Actions, Node 24, lint + test)\n- [ ] Write more tests\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvk681%2Fgumo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvk681%2Fgumo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvk681%2Fgumo/lists"}