https://github.com/nvk681/gumo
A crawler that extracts data from a dynamic webpage. Written in node js.
- Host: GitHub
- URL: https://github.com/nvk681/gumo
- Owner: nvk681
- License: mit
- Created: 2021-04-06T16:39:19.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2022-07-27T23:47:44.000Z (almost 4 years ago)
- Last Synced: 2024-10-11T04:41:56.279Z (over 1 year ago)
- Topics: crawler, elasticsearch, neo4j, nodejs
- Language: JavaScript
- Homepage:
- Size: 921 KB
- Stars: 21
- Watchers: 2
- Forks: 0
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# 🕸️Gumo
*"Gumo" (蜘蛛) is Japanese for "spider".*
[npm package](https://npmjs.com/package/gumo) | [CI workflow](https://github.com/nvk681/Gumo/actions/workflows/ci.yml) | [MIT license](https://gumo.mit-license.org/)
## Overview
A web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. Written in NodeJS.
## Table of Contents
- [🕸️Gumo](#️gumo)
- [Overview](#overview)
- [Table of Contents](#table-of-contents)
- [Features](#features)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage 👨‍💻](#usage-)
- [Development 🛠️](#development-)
- [Configuration](#configuration)
- [ElasticSearch ⚡](#elasticsearch-)
- [GraphDB](#graphdb)
- [Nodes](#nodes)
- [Relationships](#relationships)
- [Changelog](#changelog)
- [TODO](#todo)
## Features
- Crawl hyperlinks present on the pages of any domain and its subdomains.
- Scrape meta-tags and body text from every page.
- Store entire sitemap in a GraphDB (currently supports Neo4J).
- Store page content in ElasticSearch for easy full-text lookup.
## Requirements
- **Node.js** ≥ 24.0.0 (LTS). Pinned in `package.json` (`engines`) and `.nvmrc` for [nvm](https://github.com/nvm-sh/nvm) users.
- **Neo4j** 4.0+ when using the graph (constraint syntax requires it).
## Installation
1. Use Node 24+ (e.g. `nvm use` if you have [nvm](https://github.com/nvm-sh/nvm) and the repo's `.nvmrc`).
2. Install dependencies (uses `package-lock.json` for reproducible installs):
```bash
npm install
```
Or in CI: `npm ci`.
## Usage 👨‍💻
From code:
```js
// 1: import the module
const gumo = require('gumo')

// 2: instantiate the crawler
const cron = new gumo()

// 3: call the configure method and pass the configuration options
cron.configure({
  'neo4j': { // replace with your details or remove if not required
    'url': 'neo4j://localhost',
    'user': 'neo4j',
    'password': 'gumo123'
  },
  'elastic': { // replace with your details or remove if not required
    'url': 'http://localhost:9200',
    'index': 'myIndex'
  },
  'crawler': {
    'url': 'https://www.example.com',
  }
})

// 4: start crawling
cron.insert()
```
**Note:** The config params passed to `cron.configure` above are the default values. See [Configuration](#configuration) for all options.
When using Gumo as a dependency (e.g. `require('gumo')` with no `config.json` in your project), in-package defaults are used so the module loads; pass your Elasticsearch, Neo4j, and crawler settings via `configure()` before calling `insert()`.
### Development 🛠️
| Script | Description |
| -------- | ------------------------------------ |
| `npm run dev` | Run the crawler (`node index.js`). |
| `npm run lint` | Run [ESLint](https://eslint.org/) on the project (see `eslint.config.js`). |
| `npm test` | Run tests (placeholder until tests are added). |
CI runs on [GitHub Actions](https://github.com/nvk681/Gumo/actions) (Node 24, lint + test) on push/PR to `main`/`master`.
## Configuration
The behavior of the crawler can be customized by passing a custom configuration object to the `configure()` method. The following attributes can be configured:
| Attribute ( * - Mandatory ) | Type | Accepted Values | Description | Default Value | Default Behavior |
| :---------------------------- | :------------ | :--------------- | :----------------------------------------------------------------------------------------- | :---------------------- | :----------------------------------------------------------------------- |
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4J server username | "neo4j" | |
| * neo4j.password | string | | Neo4J server password | "gumo123" | |
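
As a rough illustration of the table above, the sketch below builds a `crawler` section with a few non-default values and checks it against the documented ranges and accepted values. The `validateCrawlerConfig` helper is hypothetical and not part of Gumo, and the `blackList` entry assumes that list items are full URLs.

```js
// Hypothetical helper (not part of Gumo): checks a `crawler` config section
// against the ranges and accepted values listed in the configuration table.
function validateCrawlerConfig(crawler) {
  const errors = [];
  const inRange = (value, min, max) =>
    Number.isInteger(value) && value >= min && value <= max;

  // crawler.url is the only mandatory crawler attribute.
  if (typeof crawler.url !== 'string' || crawler.url === '') {
    errors.push('crawler.url is mandatory');
  }
  if ('depth' in crawler && !inRange(crawler.depth, 1, 999)) {
    errors.push('crawler.depth must be an integer from 1 to 999');
  }
  for (const key of ['maxRequestsPerSecond', 'maxConcurrentRequests']) {
    if (key in crawler && !inRange(crawler[key], 1, 5000)) {
      errors.push(`crawler.${key} must be an integer from 1 to 5000`);
    }
  }
  for (const key of ['saveOutputAsHtml', 'saveOutputAsJson']) {
    if (key in crawler && !['Yes', 'No'].includes(crawler[key])) {
      errors.push(`crawler.${key} must be "Yes" or "No"`);
    }
  }
  for (const key of ['whiteList', 'blackList']) {
    const list = crawler[key];
    const valid = Array.isArray(list) && list.every((s) => typeof s === 'string');
    if (key in crawler && !valid) {
      errors.push(`crawler.${key} must be an array of strings`);
    }
  }
  return errors;
}

// Example values only; the attribute names come from the table above.
const crawlerConfig = {
  url: 'https://www.example.com',
  depth: 2,
  maxRequestsPerSecond: 10,
  maxConcurrentRequests: 5,
  saveOutputAsJson: 'Yes',
  blackList: ['https://www.example.com/logout'], // assumed format: full URLs
};

const problems = validateCrawlerConfig(crawlerConfig);
console.log(problems.length === 0 ? 'config looks valid' : problems.join('\n'));
```

A section that passes this check could then be supplied as the `crawler` key of the object given to `configure()`.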
## ElasticSearch ⚡
Page content is stored with the URL and a hash. The index is set via the `elastic.index` config (or `config.json`). If the index does not exist, it is created. Gumo uses the official `@elastic/elasticsearch` client; each page is indexed with **id** = hash and **document** = the page object (no separate type field).
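
To make the "id = hash, document = page object" shape concrete, here is a minimal sketch of what such an index request might look like. The field names on `page` (`hash`, `url`, `text`), the index name `gumo-pages`, and the `buildIndexRequest` helper are all assumptions for illustration; Gumo's actual page schema is not documented here.

```js
// Hypothetical page object: the `hash`, `url`, and `text` field names are
// assumptions for illustration, not Gumo's documented schema.
const page = {
  hash: 'a1b2c3',
  url: 'https://www.example.com/',
  text: 'Example Domain',
};

// Hypothetical helper: builds a request whose id is the page hash and whose
// document is the page object itself, with no separate type field.
function buildIndexRequest(index, page) {
  return { index, id: page.hash, document: page };
}

const request = buildIndexRequest('gumo-pages', page);
console.log(JSON.stringify(request, null, 2));
```

Assuming a version 8 `@elastic/elasticsearch` client instance named `client`, an object of this shape could be passed to `await client.index(request)`.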
## GraphDB
The sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:
### Nodes
- **Label**: Page
- **Properties**:
| Property Name | Type | Description |
| :------------ | :----- | :---------------------------------------------------------------------------------------------------------- |
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |
### Relationships
| Name | Direction | Condition |
| :--------- | :----------------------- | :---------------- |
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
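
Once a crawl has populated the graph, the structure above can be queried with Cypher. The sketch below only builds a parameterized query from the `Page` label, the `links_to` relationship, and the `link` and `title` properties described above; the `linksToQuery` helper is hypothetical, and the query has not been run against a Gumo-generated graph.

```js
// Hypothetical helper: returns the Cypher text and parameters for listing the
// pages at the far end of a given page's `links_to` relationships.
function linksToQuery(link) {
  return {
    text:
      'MATCH (a:Page {link: $link})-[:links_to]->(b:Page) ' +
      'RETURN b.link AS link, b.title AS title',
    parameters: { link },
  };
}

const query = linksToQuery('https://www.example.com');
console.log(query.text);
console.log(query.parameters);
```

With the official `neo4j-driver` package, a query like this would typically be executed through `session.run(query.text, query.parameters)` on an open session.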
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for version history and upgrading notes (e.g. Node 24, Elasticsearch client, Neo4j driver in v2.0.0).
## TODO
- [ ] Make it executable from CLI
- [x] Allow passing config parameters when invoking Gumo
- [x] CI (GitHub Actions, Node 24, lint + test)
- [ ] Write more tests