https://github.com/jonesrussell/page-prowler
Experimental mashup of golang commands to scrape the web
scraping-api scraping-websites
- Host: GitHub
- URL: https://github.com/jonesrussell/page-prowler
- Owner: jonesrussell
- License: mit
- Created: 2021-09-03T01:04:58.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2024-11-23T12:57:05.000Z (over 1 year ago)
- Last Synced: 2024-11-23T13:35:09.828Z (over 1 year ago)
- Topics: scraping-api, scraping-websites
- Language: HTML
- Homepage:
- Size: 11.1 MB
- Stars: 1
- Watchers: 4
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: news/service.go
- License: LICENSE
- Security: SECURITY.md
# Page Prowler
Page Prowler is a tool that finds and extracts links from websites based on specified search terms. You can run it directly from the command line or start the Echo web server, which exposes an API. The API uses the Asynq library to manage queued crawl jobs.
## Usage
```page-prowler [command]```
## Commands
- **api**: Starts the API server.
- **matchlinks**: Crawls the specified website and extracts links that match the provided terms. Can be run from the command line or via a POST request to `/v1/matchlinks` on the API server.
- **clearlinks**: Clears the Redis set for a given siteid.
- **getlinks**: Gets the list of links for a given siteid.
- **worker**: Starts the Asynq worker.
- **help**: Displays help about any command.
## Building
To install Page Prowler, clone the repository and build the binary using the following commands:
```
git clone https://github.com/jonesrussell/page-prowler.git
cd page-prowler
go build
```
Alternatively, you can use the provided Makefile to build the project:
```make all```
This command runs the `fmt`, `lint`, `test`, and `build` targets defined in the Makefile.
### Command Line
To search for matchlinks from the command line, use the following command:
```bash
./page-prowler matchlinks --url="https://www.example.com" --searchterms="keyword1,keyword2" --siteid=siteID --maxdepth=1 --debug
```
Replace `"https://www.example.com"` with the URL you want to crawl, `"keyword1,keyword2"` with the search terms you want to look for, `siteID` with your site ID, and `1` with the maximum depth of the crawl.
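As a rough illustration of what "links that match the provided terms" could mean, the sketch below splits a comma-separated `--searchterms` value and keeps the links that contain at least one term. The case-insensitive substring rule and the helper names `splitTerms` and `matchLinks` are assumptions made for this example; the project's actual matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTerms turns a comma-separated search-terms value into trimmed,
// lower-cased terms and drops empty entries.
func splitTerms(raw string) []string {
	var terms []string
	for _, t := range strings.Split(raw, ",") {
		t = strings.ToLower(strings.TrimSpace(t))
		if t != "" {
			terms = append(terms, t)
		}
	}
	return terms
}

// matchLinks returns the links that contain at least one term.
// Assumption: a case-insensitive substring match, chosen only for illustration.
func matchLinks(links, terms []string) []string {
	var matched []string
	for _, link := range links {
		lower := strings.ToLower(link)
		for _, term := range terms {
			if strings.Contains(lower, term) {
				matched = append(matched, link)
				break // one matching term is enough
			}
		}
	}
	return matched
}

func main() {
	terms := splitTerms("keyword1,keyword2")
	links := []string{
		"https://www.example.com/news/keyword1-story",
		"https://www.example.com/about",
	}
	for _, link := range matchLinks(links, terms) {
		fmt.Println(link)
	}
}
```

With the sample data in `main`, only the first link is kept because it is the only one containing a search term.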
### API
To start the API server, use the following command:
```bash
./page-prowler api
```
Then, you can send a POST request to start a crawl:
```bash
curl -X POST -H "Content-Type: application/json" -d '{
"URL": "https://www.example.com",
"SearchTerms": "keyword1,keyword2",
"CrawlSiteID": "siteID",
"MaxDepth": 3,
"Debug": true
}' http://localhost:3000/matchlinks
```
Again, replace `"https://www.example.com"` with the URL you want to crawl, `"keyword1,keyword2"` with the search terms you want to look for, `siteID` with your site ID, and `3` with the maximum depth of the crawl.
## Configuration
Page Prowler uses a `.env` file for configuration. You can specify the Redis host, port, and password in this file. For example:
```bash
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_AUTH=yourpassword
```
## Contributing
Contributions are welcome! Please feel free to submit a pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.