https://github.com/matteo-grella/tweetdeck-scraper
A tool that continuously scrapes TweetDeck tweets
scraper tweetdeck tweets-extraction tweetstream
- Host: GitHub
- URL: https://github.com/matteo-grella/tweetdeck-scraper
- Owner: matteo-grella
- License: apache-2.0
- Created: 2020-02-13T09:56:03.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-02-13T10:02:21.000Z (over 6 years ago)
- Last Synced: 2025-04-06T20:53:25.370Z (about 1 year ago)
- Topics: scraper, tweetdeck, tweets-extraction, tweetstream
- Language: Python
- Size: 10.7 KB
- Stars: 18
- Watchers: 2
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TweetDeck Scraper Docker Container
A Docker container wrapping a tool that continuously scrapes [TweetDeck](https://tweetdeck.twitter.com/) tweets, stores them in Elasticsearch, and adds the scraped IDs to a RabbitMQ queue.
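
The flow described above (scrape, store, enqueue IDs) can be sketched with in-memory stand-ins. This is only an illustration, not the project's code: `process_batch` is a hypothetical helper, a plain `dict` stands in for the Elasticsearch index, a `queue.Queue` stands in for the RabbitMQ queue, and skipping already-seen IDs is an assumption.

```python
import queue
from typing import Dict, Iterable


def process_batch(tweets: Iterable[dict], index: Dict[str, dict], id_queue: queue.Queue) -> int:
    """Store unseen tweets in `index`, enqueue their ids, and return how many were new."""
    new_count = 0
    for tweet in tweets:
        tweet_id = tweet["id"]
        if tweet_id in index:
            # Assumption for this sketch: a tweet seen in an earlier pass is skipped.
            continue
        index[tweet_id] = tweet   # stand-in for indexing the document
        id_queue.put(tweet_id)    # stand-in for publishing the id to the queue
        new_count += 1
    return new_count


index: Dict[str, dict] = {}
ids: queue.Queue = queue.Queue()

# Two scraping passes with one overlapping tweet (made-up sample data).
first = process_batch([{"id": "1", "text": "a"}, {"id": "2", "text": "b"}], index, ids)
second = process_batch([{"id": "2", "text": "b"}, {"id": "3", "text": "c"}], index, ids)
print(first, second, ids.qsize())  # 2 1 3
```
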
The following fields are extracted for each tweet:
- Id
- Publish date
- Download date
- Author
- Language
- Text
- Full body
- Image url
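
To make the field list concrete, here is a minimal sketch of how one scraped tweet could be modeled. The class name, the attribute names, the optional image URL, and the sample values are all assumptions made for illustration. The actual schema used by the tool may differ.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScrapedTweet:
    """Hypothetical record mirroring the fields listed above."""
    id: str
    publish_date: datetime
    download_date: datetime
    author: str
    language: str
    text: str
    full_body: str
    image_url: Optional[str] = None  # assumed optional: not every tweet has an image

    def to_document(self) -> dict:
        """Return a JSON-friendly dict with dates as ISO 8601 strings."""
        doc = asdict(self)
        doc["publish_date"] = self.publish_date.isoformat()
        doc["download_date"] = self.download_date.isoformat()
        return doc


# Made-up sample values, for illustration only.
tweet = ScrapedTweet(
    id="1",
    publish_date=datetime(2020, 2, 13, 9, 56, tzinfo=timezone.utc),
    download_date=datetime(2020, 2, 13, 10, 2, tzinfo=timezone.utc),
    author="example_user",
    language="en",
    text="Hello world",
    full_body="<div>Hello world</div>",
)
print(tweet.to_document()["publish_date"])  # 2020-02-13T09:56:00+00:00
```
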
## Configuration
Sensitive settings are configured in a `.env` file. They are exported as environment variables when the container runs, so they can be changed even after the container is built:
Twitter account parameters:

    $ cd tweetdeck-scraper
    $ echo "TWITTER_USERNAME=username" > .env
    $ echo "TWITTER_PASSWORD=password" >> .env

Elasticsearch settings:

    $ echo "ES_HOST=localhost" >> .env
    $ echo "ES_PORT=9200" >> .env
    $ echo "ES_CURRENT=index_name" >> .env
    $ echo "ES_USERNAME=username" >> .env
    $ echo "ES_SECRET=password" >> .env

RabbitMQ settings:

    $ echo "RMQ_HOST=localhost" >> .env
    $ echo "RMQ_PORT=5672" >> .env
    $ echo "RMQ_QUEUE=tweetdeck" >> .env
    $ echo "RMQ_USERNAME=username" >> .env
    $ echo "RMQ_PASSWORD=password" >> .env

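
For reference, the resulting `.env` file is a plain list of `KEY=VALUE` lines. The sketch below shows one way such a file could be parsed and grouped by prefix in Python. `parse_env_text` and `with_prefix` are hypothetical helpers written for this example. The scraper itself presumably reads these values from the process environment instead.

```python
from typing import Dict, Mapping


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            values[key.strip()] = value.strip()
    return values


def with_prefix(env: Mapping[str, str], prefix: str) -> Dict[str, str]:
    """Group variables by prefix, e.g. 'RMQ_' gives {'host': ..., 'queue': ...}."""
    return {k[len(prefix):].lower(): v for k, v in env.items() if k.startswith(prefix)}


sample = """
ES_HOST=localhost
ES_PORT=9200
RMQ_HOST=localhost
RMQ_QUEUE=tweetdeck
"""
env = parse_env_text(sample)
print(with_prefix(env, "RMQ_"))  # {'host': 'localhost', 'queue': 'tweetdeck'}
```
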
Additional settings can be found inside `tweetdeck_scraper/settings.py` and should be modified before building the container.
**DEBUG**
When `True`, additional information is logged.
**LOG_PATH**
The path of the log file.
**SCRAPE_INTERVAL**
Seconds between scraping actions.
**COLUMNS**
Which columns to scrape. The value can be `'ALL'` or a list of XPath expressions.
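
As an illustration of how these four settings fit together, here is a sketch with made-up values. The log path, the interval, the `ALL_COLUMNS_XPATH` selector, and the `column_xpaths` helper are assumptions for this example and are not taken from `tweetdeck_scraper/settings.py`.

```python
from typing import List, Sequence, Union

# Example values only; check settings.py for the real defaults.
DEBUG = False
LOG_PATH = "/tmp/tweetdeck-scraper.log"
SCRAPE_INTERVAL = 30  # seconds between scraping actions
COLUMNS: Union[str, Sequence[str]] = "ALL"

# Hypothetical selector matching every column; the real XPath may differ.
ALL_COLUMNS_XPATH = "//section[contains(@class, 'column')]"


def column_xpaths(columns: Union[str, Sequence[str]]) -> List[str]:
    """Turn a COLUMNS value ('ALL' or a list of XPaths) into a list of XPaths."""
    if columns == "ALL":
        return [ALL_COLUMNS_XPATH]
    if isinstance(columns, str):
        raise ValueError("COLUMNS must be 'ALL' or a list of XPath strings")
    return list(columns)


print(column_xpaths(COLUMNS))
print(column_xpaths(["//section[1]", "//section[3]"]))
```
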
## Usage
Build the container:

    $ ./build-docker.sh

Run the container:

    $ ./run-docker.sh

Enjoy :)
-----
Legal
=====
It is your responsibility to ensure that your use of tweetdeck-scraper does not violate applicable laws.
Licensing
=====
TweetDeck Scraper is licensed under the Apache License, Version 2.0. See
[LICENSE](https://github.com/matteo-grella/tweetdeck-scraper/blob/master/LICENSE) for the full
license text.