https://github.com/minyk/nifi-headlessbrowser-processor
Headless browser processor for Apache Nifi
browser nifi processor
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/minyk/nifi-headlessbrowser-processor
- Owner: minyk
- Created: 2016-05-31T10:50:14.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2020-10-12T19:46:44.000Z (over 5 years ago)
- Last Synced: 2024-03-18T23:54:36.850Z (almost 2 years ago)
- Topics: browser, nifi, processor
- Language: Java
- Size: 26.4 KB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-nifi - minyk/nifi-headlessbrowser-processor - Returns the page source in its current state to FlowFile, including any DOM updates that occurred after page load (Processors and Bundles / Mailing List Best Of)
README
Nifi Headless Browser Processor
================================
**Currently, only the `Url Provided: true` configuration has been tested.**
* Returns the page source in its current state to FlowFile, including any DOM updates that occurred after page load.
* Uses [JBrowserDriver](https://github.com/MachinePublishers/jBrowserDriver).
# Prerequisites
* JRE with JavaFX
  * OpenJDK 8 does not include JavaFX.
  * Use Oracle JDK or Zulu JDK FX.
* `fontconfig` package installed on the OS.
  * `yum install fontconfig` or `apt install fontconfig`
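
One quick way to confirm the JavaFX prerequisite is to ask the JVM whether it can see a JavaFX class. The sketch below is not part of this processor; `JavaFxCheck` and its methods are hypothetical names used for illustration. It only reports on the JVM that runs it, so it should be run with the same JRE that NiFi uses.

```java
// Hypothetical helper, not part of the processor:
// reports whether JavaFX classes are visible to the running JVM.
final class JavaFxCheck {

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, JavaFxCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** javafx.application.Platform is a core JavaFX class, so it serves as a marker. */
    static boolean isJavaFxAvailable() {
        return isClassPresent("javafx.application.Platform");
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JavaFX available: " + isJavaFxAvailable());
    }
}
```

If the check prints `false`, the JRE in use most likely lacks JavaFX and one of the JDKs listed above is needed.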
# Configurations
Most of these properties are passed to JBrowserDriver when the driver is created.
* Configurations
  * Host: Host for the browser, given as a hostname or IP address.
  * Url Provided: If `true`, the processor reads the target URL from the `Page URL` property. If `false`, the input FlowFile must contain the URL.
  * Page URL: URL to process. Used only when `Url Provided` is `true`.
  * Timezone: Timezone for the browser. Select from the dropdown list.
  * Port Range: Port range for JBrowserDriverServer. The number of ports in the range should be a multiple of three.
  * ~~Javascript: Script to run after page loading. Currently, EL is not supported.~~
    * Removed for now due to a timing issue.
* Relationships
  * success: Success relationship of this processor. The FlowFile contains the page source of the input URL.
  * failure: Failure relationship of this processor.
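
The two rules above that are easiest to get wrong are the `Port Range` size and the choice of URL source. The following is a minimal sketch of both, not the processor's actual code. `ConfigSketch` and its methods are hypothetical, the `start-end` range syntax is an assumption, "multiple of three" is read as applying to the number of ports in the range, and the URL is assumed to arrive as the FlowFile content when `Url Provided` is `false`.

```java
import java.net.URI;

// Hypothetical illustration of the configuration rules; not the processor's code.
final class ConfigSketch {

    /** Counts the ports in an inclusive range written as "start-end" (assumed syntax). */
    static int portCount(String range) {
        String[] parts = range.trim().split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected start-end, got: " + range);
        }
        int start = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        if (start < 1 || end > 65535 || start > end) {
            throw new IllegalArgumentException("Invalid port range: " + range);
        }
        return end - start + 1;
    }

    /** True when the number of ports in the range is a multiple of three. */
    static boolean isValidPortRange(String range) {
        return portCount(range) % 3 == 0;
    }

    /**
     * Picks the target URL: the Page URL property when Url Provided is true,
     * otherwise the text carried by the incoming FlowFile (assumed to be its content).
     */
    static URI resolveTarget(boolean urlProvided, String pageUrl, String flowFileText) {
        String raw = urlProvided ? pageUrl : flowFileText;
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("No URL available");
        }
        return URI.create(raw.trim());
    }

    public static void main(String[] args) {
        System.out.println(isValidPortRange("10000-10008")); // 9 ports
        System.out.println(isValidPortRange("10000-10007")); // 8 ports
        System.out.println(resolveTarget(true, "https://example.com/", null));
    }
}
```

Under these assumptions a range such as `10000-10008` (nine ports) passes the check, while `10000-10007` (eight ports) does not.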
# TODOs
- [ ] Test the `Url Provided: false` configuration.
- [ ] Add some attributes to the result FlowFile.
  - [x] Source URL
  - [ ] Page Title
  - [ ] Etc.
- [ ] Add the capability to execute JavaScript after page loading.