https://github.com/matteo-grella/tweetdeck-scraper

A tool that continuously scrapes TweetDeck tweets

scraper tweetdeck tweets-extraction tweetstream


Host: GitHub
URL: https://github.com/matteo-grella/tweetdeck-scraper
Owner: matteo-grella
License: apache-2.0
Created: 2020-02-13T09:56:03.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-02-13T10:02:21.000Z (over 6 years ago)
Last Synced: 2025-04-06T20:53:25.370Z (about 1 year ago)
Topics: scraper, tweetdeck, tweets-extraction, tweetstream
Language: Python
Size: 10.7 KB
Stars: 18
Watchers: 2
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# TweetDeck Scraper Docker Container

A Docker container wrapping a tool that continuously scrapes [TweetDeck](https://tweetdeck.twitter.com/) tweets, stores them in Elasticsearch and adds the scraped IDs to a RabbitMQ queue.

The following fields are extracted for each tweet:

- Id

- Publish date

- Download date

- Author

- Language

- Text

- Full body

- Image URL
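
As an illustration only, the fields above could be modelled as a small record type. The class name, attribute names and types below are assumptions made for this sketch; the actual document schema used by the scraper is not shown in this README.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ScrapedTweet:
    """Hypothetical container for one scraped tweet (names are illustrative)."""

    id: str
    publish_date: datetime
    download_date: datetime
    author: str
    language: str
    text: str
    full_body: str
    image_url: Optional[str] = None  # assumed optional: not every tweet has an image

    def to_document(self) -> dict:
        """Return a JSON-friendly dict, with dates as ISO 8601 strings."""
        doc = asdict(self)
        doc["publish_date"] = self.publish_date.isoformat()
        doc["download_date"] = self.download_date.isoformat()
        return doc


# Placeholder values, not real scraped data.
example = ScrapedTweet(
    id="1",
    publish_date=datetime(2020, 2, 13, 9, 56, tzinfo=timezone.utc),
    download_date=datetime(2020, 2, 13, 10, 2, tzinfo=timezone.utc),
    author="example_user",
    language="en",
    text="hello",
    full_body="<div>hello</div>",
)
print(example.to_document())
```

Keeping the two dates as separate fields preserves the difference between when a tweet was published and when it was downloaded.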

## Configuration

Sensitive settings are configured in a `.env` file. They are exported as environment variables when the container runs, so they can be changed even after the container is built:

Twitter account params:

    $ cd tweetdeck-scraper

    $ echo "TWITTER_USERNAME=username" > .env

    $ echo "TWITTER_PASSWORD=password" >> .env

Elasticsearch settings:

    $ echo "ES_HOST=localhost" >> .env

    $ echo "ES_PORT=9200" >> .env

    $ echo "ES_CURRENT=index_name" >> .env

    $ echo "ES_USERNAME=username" >> .env

    $ echo "ES_SECRET=password" >> .env

RabbitMQ settings:

    $ echo "RMQ_HOST=localhost" >> .env

    $ echo "RMQ_PORT=5672" >> .env

    $ echo "RMQ_QUEUE=tweetdeck" >> .env

    $ echo "RMQ_USERNAME=username" >> .env

    $ echo "RMQ_PASSWORD=password" >> .env
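
Inside the container, values like these would typically be read from the process environment. The following is a minimal sketch of that pattern, not the project's actual code: the `load_config` helper and its fail-early behaviour are assumptions, and only a subset of the variable names from the commands above is used.

```python
import os
from typing import Mapping

# A subset of the variable names written to `.env` above.
REQUIRED_VARS = ("ES_HOST", "ES_PORT", "RMQ_HOST", "RMQ_PORT", "RMQ_QUEUE")
# Variables that should be converted from strings to integers.
INT_VARS = {"ES_PORT", "RMQ_PORT"}


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: collect settings from the environment, failing early."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    # Environment values are always strings, so ports are converted here.
    return {
        name: int(env[name]) if name in INT_VARS else env[name]
        for name in REQUIRED_VARS
    }
```

Calling `load_config()` with no arguments reads `os.environ`; passing a plain dictionary instead makes the helper easy to exercise without touching the real environment.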

Additional settings can be found inside `tweetdeck_scraper/settings.py` and should be modified before building the container.

**DEBUG**

When `True`, additional information is logged.

**LOG_PATH**

The path of the log file.

**SCRAPE_INTERVAL**

Seconds between scraping actions.

**COLUMNS**

Which columns to scrape. The value can be `'ALL'` or a list of XPath expressions.
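
To make the roles of `SCRAPE_INTERVAL` and `COLUMNS` concrete, here is a hedged sketch of how a polling loop might consume them. Everything apart from the two setting names is an assumption: the example values, the `resolve_columns` and `run` helpers, the XPath string, the in-memory de-duplication, and the injected `scrape`, `store` and `publish` callables, which stand in for the project's real scraping, Elasticsearch and RabbitMQ code that this README does not show.

```python
import time
from typing import Callable, Iterable, List, Optional, Set, Union

# Illustrative values only; the real ones live in tweetdeck_scraper/settings.py.
SCRAPE_INTERVAL = 5  # seconds between scraping actions
COLUMNS: Union[str, List[str]] = "ALL"

# Hypothetical XPath standing in for "every column on the page".
ALL_COLUMNS_XPATH = "//section[contains(@class, 'column')]"


def resolve_columns(columns: Union[str, List[str]]) -> List[str]:
    """Turn the COLUMNS setting into a list of XPath expressions."""
    if columns == "ALL":
        return [ALL_COLUMNS_XPATH]
    if isinstance(columns, str):
        raise ValueError("COLUMNS must be 'ALL' or a list of XPath expressions")
    return list(columns)


def run(
    scrape: Callable[[List[str]], Iterable[dict]],
    store: Callable[[dict], None],
    publish: Callable[[str], None],
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll at SCRAPE_INTERVAL, handling each tweet id at most once per run."""
    seen: Set[str] = set()
    xpaths = resolve_columns(COLUMNS)
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for tweet in scrape(xpaths):
            if tweet["id"] in seen:
                continue  # already handled in an earlier cycle
            seen.add(tweet["id"])
            store(tweet)          # e.g. index the document in Elasticsearch
            publish(tweet["id"])  # e.g. push the id onto the RabbitMQ queue
        cycle += 1
        sleep(SCRAPE_INTERVAL)
    return len(seen)
```

Passing the three callables in as arguments keeps the loop independent of any particular browser driver or client library, and `max_cycles` exists only so the sketch can be run for a bounded number of iterations.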

## Usage

Build the container

    $ ./build-docker.sh

Run the container

    $ ./run-docker.sh

Enjoy :)

-----

Legal
=====

It is your responsibility to ensure that your use of tweetdeck-scraper does not violate applicable laws.

Licensing
=====

TweetDeck Scraper is licensed under the Apache License, Version 2.0. See
[LICENSE](https://github.com/matteo-grella/tweetdeck-scraper/blob/master/LICENSE) for the full
license text.
