naivebot: attempt to mimic googlebot behaviour in nodejs with nightmarejs
- Host: GitHub
- URL: https://github.com/josepedrodias/naivebot
- Owner: JosePedroDias
- Created: 2017-07-21T10:14:26.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-07-22T08:15:17.000Z (over 7 years ago)
- Last Synced: 2024-11-20T13:38:18.172Z (2 months ago)
- Topics: crawler, googlebot, nightmarejs, nodejs, robots
- Language: JavaScript
- Size: 3.91 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# naivebot
Simulates googlebot visiting pages, kinda.
This is very experimental, naive, and possibly a plain wrong approach.
I'm not publishing this as an npm module because it's much easier to edit the hooks in index.js
itself than to create override capabilities for them.

## config
Edit the `config.json` file.

```js
{
    "domain": "pixels.camp",          // domain to scrape
    "userAgent": "",                  // user agent to set (TODO)
    "resolution": [800, 600],         // screen resolution to use
    "pages": ["https://pixels.camp/"] // initial pages (kinda like sitemap.xml)
}
```

## current crawling behaviour
Pages and their screenshots are persisted to the `pages` directory.
The bootstrapped `toVisit` array of pages comes from `config.json`.
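For illustration, that bootstrap might look something like this. This is a sketch only, assuming the config shown above; the variable names are assumptions, not the project's actual ones:

```js
// illustrative bootstrap; not the actual index.js
const Nightmare = require('nightmare');
const config = require('./config.json');

const nightmare = Nightmare({ show: false })
  .viewport(config.resolution[0], config.resolution[1]);

const toVisit = config.pages.slice(); // pages still to scrape
const visited = new Set();            // pages already scraped
```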
While that array has elements, scraping continues.
Each scrape consists of several promises being fulfilled (a rough sketch follows the lists below):

* waitPageReady - resolves once the page is deemed ready. currently waits 5 secs.
* atPageStart - something to do once page is ready. ex: dismiss modal.
* indexFollowCriteria - returns an object with booleans for `index` and `follow`, which work like their robots counterparts, i.e., index saves the scraped page and follow adds found links to `toVisit`.

Notice that most of these hooks receive and return an object with:
* nightmare - the nightmare instance
* o - scraped data from the page
* state - scraping state.
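Put together, the loop could look roughly like the sketch below, continuing the bootstrap sketch above. Everything here is an assumption about index.js rather than its actual code: the `indexFollowCriteria` body, the `pathToFilename` helper and the `scrapePage` function (sketched at the end of this section) are all hypothetical.

```js
// a sketch of the crawl loop; not the actual index.js
const fs = require('fs');

// illustrative helper: file-system friendly version of the page path
const pathToFilename = url =>
  url.replace(/^https?:\/\//, '').replace(/[^a-zA-Z0-9.-]+/g, '_');

// hypothetical hook body: honour noindex/nofollow like the robots meta tag
function indexFollowCriteria({ nightmare, o, state }) {
  const robots = (o.mRobots || '').toLowerCase();
  return {
    index: robots.indexOf('noindex') === -1,
    follow: robots.indexOf('nofollow') === -1
  };
}

async function crawl() {
  if (!fs.existsSync('pages')) { fs.mkdirSync('pages'); } // output directory

  while (toVisit.length > 0) {
    const url = toVisit.shift();
    if (visited.has(url)) { continue; }
    visited.add(url);

    // waitPageReady: currently just a fixed 5 second wait
    await nightmare.goto(url).wait(5000);
    // atPageStart would run here, e.g. to dismiss a modal

    const o = await nightmare.evaluate(scrapePage); // scrapePage is sketched below
    const state = { toVisit, visited };
    const { index, follow } = indexFollowCriteria({ nightmare, o, state });

    if (index) { // persist the scraped page and its screenshot
      const name = pathToFilename(url);
      fs.writeFileSync(`pages/${name}.json`, JSON.stringify(o, null, 2));
      await nightmare.screenshot(`pages/${name}.png`);
    }
    if (follow) { // enqueue same-domain links for later visits
      o.links
        .filter(href => href.indexOf(config.domain) !== -1)
        .forEach(href => toVisit.push(href));
    }
  }
  await nightmare.end();
}

crawl();
```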
Indexed pages are stored to `<name>.json` and their screenshot to `<name>.png`,
where `<name>` is a file-system-friendly version of the page path.

This is the object persisted for every page marked for storage:
```js
{
  url : location.href
  title : document.title
  text : document.body.innerText
  html : document.documentElement.outerHTML
  mRobots : // meta robots
  mTitle : // meta title
  mKeywords : // meta keywords
  mDescription : // meta description
  h1 : // first h1's inner text
  afterH1 : // inner text of element after first h1
  links : // array of <a> hrefs
}
```

Notice the `h1` and `afterH1` fields, which are attempts to elect alternate titles and descriptions.
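One way to collect that object is a single `evaluate` call running in the page context. The sketch below is the hypothetical `scrapePage` helper referenced in the loop sketch above, not the project's actual extraction code:

```js
// runs in the page context via nightmare.evaluate(scrapePage)
function scrapePage() {
  // hypothetical helper: read a <meta name="..."> content attribute
  const meta = name => {
    const el = document.querySelector('meta[name="' + name + '"]');
    return el ? el.getAttribute('content') : '';
  };
  const h1 = document.querySelector('h1');
  return {
    url: location.href,
    title: document.title,
    text: document.body.innerText,
    html: document.documentElement.outerHTML,
    mRobots: meta('robots'),
    mTitle: meta('title'),
    mKeywords: meta('keywords'),
    mDescription: meta('description'),
    h1: h1 ? h1.innerText : '',
    afterH1: h1 && h1.nextElementSibling ? h1.nextElementSibling.innerText : '',
    // naive link collection: every <a href> in the document
    links: Array.prototype.slice.call(document.querySelectorAll('a[href]'))
      .map(a => a.href)
  };
}
```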
## TODO
* investigate how googlebot determines that a page has loaded, or find an alternate clever approach
* check if the links googlebot scrapes are as naive as ours (`<a>`s in the page body)
* improve path processing - support #! paths
* (less relevant) map robots.txt and sitemap.xml to config.json

## references
* https://developers.google.com/search/reference/robots_meta_tag#valid-indexing--serving-directives
* https://github.com/segmentio/nightmare/blob/master/Readme.md
* https://segment.com/blog/ui-testing-with-nightmare/
* https://varvy.com/googlebot.html
* https://support.google.com/webmasters/answer/96569?hl=en