https://github.com/webrecorder/autobrowser
- Host: GitHub
- URL: https://github.com/webrecorder/autobrowser
- Owner: webrecorder
- License: apache-2.0
- Created: 2018-08-22T20:23:22.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-10-27T22:46:47.000Z (over 4 years ago)
- Last Synced: 2025-04-14T14:27:23.706Z (16 days ago)
- Language: Python
- Size: 396 KB
- Stars: 10
- Watchers: 9
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# autobrowser
Webrecorder's high-fidelity browser-based crawler
## Configuration
autobrowser is configured primarily through the environment variables listed below.
The application reads their values in [autobrowser/automation/details.py](https://github.com/webrecorder/autobrowser/blob/master/autobrowser/automation/details.py).
**Note**: boolean values (flags/switches) accept the following forms:
- true: `1`, `true`, `yes`, `y`, `ok`, `on`
- false: `0`, `false`, `no`, `n`, `nok`, `off`, or the environment variable is not set

#### General
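The boolean forms listed in the note above could be interpreted along the following lines. This is a minimal sketch, not the code in `details.py`: the helper name `env_bool`, the case-insensitive matching, the `default` parameter, and the error raised for unrecognized values are all assumptions made for illustration.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "y", "ok", "on"}
FALSE_VALUES = {"0", "false", "no", "n", "nok", "off"}


def env_bool(name, environ=None, default=False):
    """Interpret an environment variable as a boolean flag (hypothetical helper).

    A variable that is not set yields `default`, which is False unless the
    caller passes the default documented for that particular setting.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumption: surrounding whitespace and letter case are ignored.
    value = raw.strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Assumption: the README does not say how other values are treated,
    # so this sketch rejects them instead of guessing.
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")
```

For a setting that documents its own default, such as `CRAWL_NO_NETCACHE`, the call might look like `env_bool("CRAWL_NO_NETCACHE", default=True)`.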
REDIS_URL
- The URL to be used when connecting to redis (string)
- Defaults to `redis://localhost`

CHROME_OPTS
- A JSON string used by `LocalDriver` to launch a browser (string)

CDP_PORT
- The port to be used when communicating with a browser via the CDP (number)
- Defaults to `9222`

AUTO_ID **Required when crawling**
- The id of the entire automation (string)

REQ_ID **Required**
- The id of the request used to start this part of the entire automation (string)

#### Shepherd
SHEPARD_HOST
- The URL that shepherd is listening on (string)
- Defaults to `http://shepherd:9020`

BROWSER_ID
- The id of the browser to be used when requesting one from shepherd (string)
- Defaults to `chrome:67`

BROWSER_HOST
- The host name of the browser running in a container (string)

REQ_BROWSER_PATH
- The path to the shepherd endpoint for requesting browsers (string)
- Defaults to `/request_browser/`

INIT_BROWSER_PATH
- The path to the shepherd endpoint for initializing new browsers (string)
- Defaults to `/init_browser?reqid=`

GET_BROWSER_INFO_PATH
- The path to the shepherd endpoint for requesting information about a browser (string)
- Defaults to `/info/`
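
To illustrate how the Shepherd settings above might fit together, the sketch below reads them with their documented defaults and joins them into full URLs. The class `ShepherdConfig` and its methods are hypothetical, and the way each path is combined with a browser id or request id is an assumption inferred from the shape of the defaults, not confirmed by this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ShepherdConfig:
    """Hypothetical container for the Shepherd-related settings."""

    host: str
    browser_id: str
    req_browser_path: str
    init_browser_path: str
    browser_info_path: str

    @classmethod
    def from_env(cls, environ=None):
        env = os.environ if environ is None else environ
        return cls(
            # The variable name is spelled as it appears in this README.
            host=env.get("SHEPARD_HOST", "http://shepherd:9020"),
            browser_id=env.get("BROWSER_ID", "chrome:67"),
            req_browser_path=env.get("REQ_BROWSER_PATH", "/request_browser/"),
            init_browser_path=env.get("INIT_BROWSER_PATH", "/init_browser?reqid="),
            browser_info_path=env.get("GET_BROWSER_INFO_PATH", "/info/"),
        )

    def request_browser_url(self):
        # Assumption: the browser id is appended to the request path.
        return f"{self.host}{self.req_browser_path}{self.browser_id}"

    def init_browser_url(self, reqid):
        # Assumption: the request id completes the `reqid=` query parameter.
        return f"{self.host}{self.init_browser_path}{reqid}"

    def browser_info_url(self, reqid):
        # Assumption: the request id is appended to the info path.
        return f"{self.host}{self.browser_info_path}{reqid}"
```

With no variables set, `ShepherdConfig.from_env().request_browser_url()` would produce `http://shepherd:9020/request_browser/chrome:67` under these assumptions.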
#### Crawling
CRAWL_NO_NETCACHE
- Whether the browser's network cache should be disabled (bool)
- Defaults to `true`

NAV_TO
- The navigation timeout (time value in seconds)
- Defaults to `30`

WAIT_FOR_Q
- How long the crawler tab should wait for the frontier queue to become populated (time value in seconds)
- Defaults to `-1` (forever)
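
The interplay between `WAIT_FOR_Q` and `WAIT_FOR_Q_POLL_RATE` (the next setting) can be pictured as a polling loop with an optional deadline. The sketch below is synchronous for brevity and is only an illustration: the function `wait_for_queue` and the `is_populated` callback are hypothetical names, and the real crawler tab may be structured quite differently.

```python
import time


def wait_for_queue(is_populated, wait_for_q=-1, poll_rate=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll `is_populated()` until it returns True or the wait budget runs out.

    A negative `wait_for_q` means wait forever, mirroring the documented
    default of -1. Returns True if the queue became populated and False
    if the deadline passed first.
    """
    deadline = None if wait_for_q < 0 else clock() + wait_for_q
    while True:
        if is_populated():
            return True
        if deadline is not None and clock() >= deadline:
            return False
        # Sleep for one poll interval before checking again.
        sleep(poll_rate)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; a caller would normally leave them at their defaults.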

WAIT_FOR_Q_POLL_RATE
- The interval between checks of the frontier queue (time value in seconds)
- Defaults to `5`

BEHAVIOR_RUN_TIME
- How long behaviors are allowed to run (time value in seconds)
- Defaults to `60`

NUM_TABS
- How many tabs should be created per connected browser (number)
- Defaults to `1`

TAB_TYPE
- Which tab type should be used (`BehaviorTab` or `CrawlerTab`)
- Defaults to `BehaviorTab`

#### Behaviors
BEHAVIOR_API_URL
- The base URL to be used for interaction with the behaviors API (string)
- Defaults to `http://localhost:3030`

FETCH_BEHAVIOR_ENDPOINT
- The URL of the behaviors API endpoint for retrieving just the behavior's JavaScript (string)
- Defaults to `{BEHAVIOR_API_URL}/behavior?url=`

FETCH_BEHAVIOR_INFO_ENDPOINT
- The URL of the behaviors API endpoint for retrieving just the behavior's info (string)
- Defaults to `{BEHAVIOR_API_URL}/info?url=`

SCREENSHOT_API_URL
- The URL to which screenshots of the page are sent after a behavior has run (string)
- **Note**: acts as a flag indicating that screenshots are to be taken

SCREENSHOT_TARGET_URI **Required if SCREENSHOT_API_URL is provided**
- The URL for the resource record for the screenshots (string)

SCREENSHOT_FORMAT
- The type of screenshot to be taken, `png` or `jpg` (string)
- Defaults to `png`
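
The settings above imply two small rules: the two fetch endpoints derive their defaults from `BEHAVIOR_API_URL`, and `SCREENSHOT_API_URL` doubles as the switch that turns screenshots on and makes `SCREENSHOT_TARGET_URI` mandatory. A minimal sketch of those rules follows. The function names, the returned dictionaries, and the choice to raise `ValueError` are assumptions for illustration, not the project's actual code.

```python
import os


def behavior_endpoints(environ=None):
    """Resolve the behavior API endpoints (hypothetical helper)."""
    env = os.environ if environ is None else environ
    base = env.get("BEHAVIOR_API_URL", "http://localhost:3030")
    return {
        # Each endpoint falls back to a URL built from the base URL.
        "fetch_behavior": env.get("FETCH_BEHAVIOR_ENDPOINT", f"{base}/behavior?url="),
        "fetch_info": env.get("FETCH_BEHAVIOR_INFO_ENDPOINT", f"{base}/info?url="),
    }


def screenshot_settings(environ=None):
    """Return None when screenshots are off, else the resolved settings."""
    env = os.environ if environ is None else environ
    api_url = env.get("SCREENSHOT_API_URL")
    if not api_url:
        # No API URL means the screenshot flag is off.
        return None
    target_uri = env.get("SCREENSHOT_TARGET_URI")
    if not target_uri:
        raise ValueError(
            "SCREENSHOT_TARGET_URI is required when SCREENSHOT_API_URL is set"
        )
    fmt = env.get("SCREENSHOT_FORMAT", "png")
    if fmt not in ("png", "jpg"):
        # Assumption: formats other than the two documented ones are rejected.
        raise ValueError(f"unsupported SCREENSHOT_FORMAT: {fmt!r}")
    return {"api_url": api_url, "target_uri": target_uri, "format": fmt}
```

Checking the dependent variable at load time, as sketched here, surfaces a missing `SCREENSHOT_TARGET_URI` before any crawl starts instead of when the first screenshot is sent.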

SCREENSHOT_DIMENSIONS
- The dimensions of the screenshot to be taken (number)
- Format: width height, space or comma separated
- Defaults to the natural width and height of the page's content

#### JavaScript Expressions
BEHAVIOR_ACTION_EXPRESSION
- The expression used to initiate the next action of a behavior (string)
- Defaults to: `window.$WRIteratorHandler$()`

BEHAVIOR_PAUSED_EXPRESSION
- The expression used to determine if the running behavior is in the paused state (string)
- Defaults to: `window.$WBBehaviorPaused === true`

PAUSE_BEHAVIOR_EXPRESSION
- The expression used to pause a running behavior
- Defaults to: `window.$WBBehaviorPaused = true`

UNPAUSE_BEHAVIOR_EXPRESSION
- The expression used to un-pause a running behavior
- Defaults to: `window.$WBBehaviorPaused = false`

PAGE_URL_EXPRESSION
- The expression used to determine the URL of the page (string)
- Defaults to: `window.location.href`

OUTLINKS_EXPRESSION
- The expression used to retrieve the outlinks collected by the running behavior (string)
- Defaults to: `window.$wbOutlinks$`

CLEAR_OUTLINKS_EXPRESSION
- The expression used to clear the outlinks collected by the running behavior (string)
- Defaults to: `window.$wbOutlinkSet$.clear()`

NO_OUT_LINKS_EXPRESS
- The expression used to indicate to the behavior that it is not to collect outlinks (string)
- Defaults to: `window.$WBNOOUTLINKS = true`
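
Since these expressions are plain JavaScript strings, one plausible use is to send them to the page through the Chrome DevTools Protocol method `Runtime.evaluate`. The sketch below only builds the JSON message and does not open a connection. The helper names, the subset of defaults copied into `DEFAULT_EXPRESSIONS`, and the use of `returnByValue` are assumptions for illustration; this README does not describe how autobrowser actually evaluates the expressions.

```python
import json
import os

# A subset of the defaults documented above, copied verbatim.
DEFAULT_EXPRESSIONS = {
    "BEHAVIOR_PAUSED_EXPRESSION": "window.$WBBehaviorPaused === true",
    "PAUSE_BEHAVIOR_EXPRESSION": "window.$WBBehaviorPaused = true",
    "UNPAUSE_BEHAVIOR_EXPRESSION": "window.$WBBehaviorPaused = false",
    "PAGE_URL_EXPRESSION": "window.location.href",
    "OUTLINKS_EXPRESSION": "window.$wbOutlinks$",
}


def expression(name, environ=None):
    """Return the configured expression, falling back to its default."""
    env = os.environ if environ is None else environ
    return env.get(name, DEFAULT_EXPRESSIONS[name])


def evaluate_message(message_id, expr):
    """Build a CDP `Runtime.evaluate` request for `expr` as a JSON string."""
    return json.dumps({
        "id": message_id,
        "method": "Runtime.evaluate",
        # returnByValue asks the browser to serialize the result as JSON.
        "params": {"expression": expr, "returnByValue": True},
    })
```

For example, `evaluate_message(1, expression("PAGE_URL_EXPRESSION"))` yields a message that asks the page for `window.location.href` unless the variable overrides it.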