https://github.com/codefeedr/maven-scraper
📈 Python script to incrementally push Maven releases to Kafka.
- Host: GitHub
- URL: https://github.com/codefeedr/maven-scraper
- Owner: codefeedr
- Created: 2019-06-28T09:46:38.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2019-08-08T14:23:29.000Z (almost 7 years ago)
- Last Synced: 2025-01-22T01:41:29.572Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Incremental Maven releases to Kafka
**NOTE: This scraper is not working anymore since maven-repository.com
is down.**
This Python script scrapes [maven-repository.com](maven-repository) and
forwards the releases it finds to Kafka. The script requires four
arguments: `start_date`, `topic`, `bootstrap_servers` and `sleep_time`.
It scrapes all releases dating back to `start_date` and pushes them to
the Kafka `topic` running on `bootstrap_servers`. It then repeats every
`sleep_time` seconds, with `start_date` set to the date of the latest
release, i.e. it scrapes incremental updates on Maven releases.
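
The incremental loop described above can be sketched roughly as follows. This is only an illustration, not the code in `scraper.py`: `run_incremental`, `fetch_releases`, `publish` and `max_rounds` are hypothetical names, and treating releases as new only when they are strictly later than the cursor is an assumption.

```python
import time
from datetime import datetime
from typing import Callable, Iterable, Optional

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def run_incremental(
    start_date: datetime,
    fetch_releases: Callable[[datetime], Iterable[dict]],  # hypothetical scraper
    publish: Callable[[dict], None],  # hypothetical Kafka sender
    sleep_time: float,
    max_rounds: Optional[int] = None,  # hypothetical; None means run forever
) -> datetime:
    """Publish releases newer than the cursor, then advance the cursor."""
    cursor = start_date
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # Keep only releases strictly newer than the cursor (assumption),
        # ordered oldest first so the last one is the latest release.
        new = sorted(
            (
                r
                for r in fetch_releases(cursor)
                if datetime.strptime(r["date"], DATE_FORMAT) > cursor
            ),
            key=lambda r: datetime.strptime(r["date"], DATE_FORMAT),
        )
        for release in new:
            publish(release)
        if new:
            # The next round starts from the date of the latest release seen.
            cursor = datetime.strptime(new[-1]["date"], DATE_FORMAT)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(sleep_time)
    return cursor
```

Passing the scraper and the sender in as callables keeps the loop testable without a live website or a Kafka broker.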
## Prerequisites
Install all dependencies:
```bash
python3 -m venv venv
. ./venv/bin/activate
pip install requests BeautifulSoup4 kafka-python
```
## How To Run
```bash
usage: Scrape Maven releases to Kafka. [-h]
start_date topic bootstrap_servers
sleep_time
```
For example:
```sh
python scraper.py '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
```
This scrapes releases dating back to `2019-06-24 14:05:50`, plus any
incremental updates, and pushes them to `cf_mvn_releases` located at
`localhost:29092`. Incremental updates are checked every `60` seconds.
**Note**: `start_date` must be in `%Y-%m-%d %H:%M:%S` format. Multiple
bootstrap servers should be `,` separated. Sleep time is in _seconds_.
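
As a rough sketch of how these argument rules could be enforced, the parser below validates the date format and splits the server list. It is not taken from `scraper.py`: the helper names are hypothetical, and parsing `sleep_time` as an integer is an assumption.

```python
import argparse
from datetime import datetime
from typing import List

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_start_date(value: str) -> datetime:
    # Raises ValueError on a malformed date, which argparse reports as a usage error.
    return datetime.strptime(value, DATE_FORMAT)


def parse_bootstrap_servers(value: str) -> List[str]:
    # "host1:9092,host2:9092" -> ["host1:9092", "host2:9092"]
    return [server.strip() for server in value.split(",") if server.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape Maven releases to Kafka.")
    parser.add_argument("start_date", type=parse_start_date)
    parser.add_argument("topic")
    parser.add_argument("bootstrap_servers", type=parse_bootstrap_servers)
    parser.add_argument("sleep_time", type=int)  # seconds; integer is an assumption
    return parser
```

For example, `build_parser().parse_args()` would read the four positional arguments in the same order as the command shown above.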
## Sample data
Data is sent in the following format:
```json
{
"groupId": "com.g2forge.alexandria",
"artifactId": "alexandria",
"version": "0.0.9",
"date": "2019-06-24 14:42:49"
}
```
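
A message in this format could be encoded and decoded along the following lines. This is a minimal sketch: `encode_release` and `decode_release` are hypothetical helpers, and UTF-8 encoded JSON is an assumption about the wire format rather than something this README states.

```python
import json
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def encode_release(release: dict) -> bytes:
    # Serialize a release record to JSON bytes (UTF-8 is an assumption).
    return json.dumps(release).encode("utf-8")


def decode_release(payload: bytes) -> dict:
    # Parse the JSON payload and convert the "date" string to a datetime.
    release = json.loads(payload.decode("utf-8"))
    release["date"] = datetime.strptime(release["date"], DATE_FORMAT)
    return release
```

With kafka-python, functions like these could be passed as `value_serializer` to `KafkaProducer` and as `value_deserializer` to `KafkaConsumer`.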
## Run in Docker
```sh
docker build -t mvn-scraper .
docker run mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
```
Alternatively, run the image published on [Docker Hub](https://hub.docker.com/r/wzorgdrager/mvn-scraper):
```sh
docker run wzorgdrager/mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
```