https://github.com/harshstats/etl-project-for-twitter-data-using-airflow
https://github.com/harshstats/etl-project-for-twitter-data-using-airflow
Last synced: about 1 month ago
JSON representation
- Host: GitHub
- URL: https://github.com/harshstats/etl-project-for-twitter-data-using-airflow
- Owner: HarshStats
- License: apache-2.0
- Created: 2022-10-24T09:30:41.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-03-04T13:41:55.000Z (over 2 years ago)
- Last Synced: 2025-01-24T20:38:04.257Z (over 1 year ago)
- Language: Python
- Size: 21.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Twitter Data Extraction and Pipeline Project
## Overview
This project demonstrates how to build a data pipeline that extracts data from Twitter, processes it using Python, and deploys the workflow on Apache Airflow hosted on an AWS EC2 instance. The final data is stored in Amazon S3. This README outlines the project steps, technologies used, and guidelines for replication.
## Prerequisites
- **Laptop with Internet Access**: Basic computing and internet capabilities are required.
- **Python Installation**: Python version 3.5 or above must be installed on your local system.
- **AWS Account**: An AWS account is needed to access EC2 and S3 services. AWS offers a 12-month free trial that can be utilized.
- **Basic Python Knowledge**: Familiarity with Python is necessary as the project focuses on leveraging Airflow and AWS rather than teaching Python basics.
## Project Structure
The project is divided into three sections:
1. **Setting Up the Prerequisites**: Outlines the initial requirements and setups.
2. **Understanding the Concepts**: Introduces Apache Airflow and its components like DAGs and operators.
3. **Execution**: Covers the step-by-step implementation of the project, including code snippets and detailed instructions.
## Tools and Technologies Used
- **Twitter API**: For data extraction.
- **Python**: Used for data processing and interactions with the Twitter API and AWS services.
- **Apache Airflow**: Manages the workflow of the data pipeline.
- **AWS EC2**: Hosts Apache Airflow.
- **Amazon S3**: Stores the processed data.
## Steps to Replicate the Project
1. **Extract Data from Twitter**: Obtain Twitter API credentials and use the Tweepy library in Python for data extraction.
2. **Transform Data**: Utilize Pandas in Python for data transformation.
3. **Deploy on Apache Airflow**: Set up an AWS EC2 instance for Airflow hosting and configure the data pipeline workflow.
4. **Store Data in Amazon S3**: Set up an S3 bucket and configure the Airflow job to store the output data.
## Challenges and Solutions
- Various execution errors were tackled through research and community support.
- Persistence and discipline are highlighted as essential for tackling learning curves associated with new technologies.
## Additional Resources
- Installation guides for Python, AWS account setup, and Airflow documentation are provided to aid in project setup.
## Contribution
Contributions are welcome! Feel free to fork this project, submit pull requests, or suggest improvements through issues.
## Acknowledgements
This project was inspired by the practical application of integrating various technologies to implement a real-world data pipeline.