https://github.com/nullqwertyuiop/tweet-crawler
Python tool using Playwright to intercept Twitter responses and parse tweets.
- Host: GitHub
- URL: https://github.com/nullqwertyuiop/tweet-crawler
- Owner: nullqwertyuiop
- License: mit
- Created: 2024-10-11T10:53:17.000Z
- Default Branch: main
- Last Pushed: 2024-10-19T14:29:07.000Z
- Last Synced: 2024-10-21T04:23:53.380Z
- Topics: playwright, playwright-python, twitter
- Language: Python
- Homepage:
- Size: 38.1 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Tweet Crawler




Tweet Crawler is a Python-based web scraping tool that leverages Playwright to intercept responses from Twitter and parse them into manipulable dataclasses. This project allows users to extract comprehensive tweet data either in guest mode or authenticated mode (via cookies).
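
To make the "parse into dataclasses" idea concrete, here is a minimal sketch using only the standard library. The `User` and `Tweet` classes, the `parse_tweet` helper, and the JSON shape are all hypothetical stand-ins chosen for illustration; they are not the project's actual models, and real Twitter responses are structured differently.

```python
import json
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical author model, not the project's real class.
    username: str
    display_name: str


@dataclass
class Tweet:
    # Hypothetical tweet model, not the project's real class.
    text: str
    author: User
    media_urls: list[str] = field(default_factory=list)
    likes: int = 0
    retweets: int = 0


def parse_tweet(payload: dict) -> Tweet:
    """Convert a decoded JSON payload (made-up shape) into a Tweet."""
    user = payload["user"]
    stats = payload.get("stats", {})
    return Tweet(
        text=payload["text"],
        author=User(
            username=user["username"],
            display_name=user["display_name"],
        ),
        media_urls=list(payload.get("media", [])),
        likes=int(stats.get("likes", 0)),
        retweets=int(stats.get("retweets", 0)),
    )


# Stand-in for the body of an intercepted response.
raw_body = json.dumps(
    {
        "text": "hello world",
        "user": {"username": "example_user", "display_name": "Example"},
        "media": ["https://example.com/a.jpg"],
        "stats": {"likes": 3},
    }
)

tweet = parse_tweet(json.loads(raw_body))
print(tweet.author.username, tweet.likes)
```

Working with typed fields such as `tweet.author.username` rather than nested dictionary lookups is the main convenience this style of output offers.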
## Features
- **Guest Mode**:
  - Fetch basic tweet details from a given status link without authentication.
  - Extract data such as tweet content, user details, media, and reaction statistics.
- **Authenticated Mode** (requires cookies):
  - Access additional tweet details including reply threads.
  - Retrieve a more extensive dataset by using user-specific cookie information.

## Installation
### Install as a VCS Dependency
Tweet Crawler can be installed as a VCS dependency in your project.
Here is how you can add it to your project using [PDM](https://pdm-project.org/):
1. **Install Dependencies**

   Ensure you have Python (version 3.10 or higher) and [PDM](https://pdm-project.org/) installed. Then, run:

   ```bash
   pdm add "git+https://github.com/nullqwertyuiop/tweet-crawler.git@main"
   ```

2. **Set Up Playwright**

   Download the browser binaries Playwright needs by running:

   ```bash
   pdm run playwright install
   ```

### Clone Directly from GitHub
1. **Clone the Repository**

   ```bash
   git clone https://github.com/nullqwertyuiop/tweet-crawler.git
   cd tweet-crawler
   ```

2. **Install Dependencies**

   Ensure you have Python (version 3.10 or higher) and [PDM](https://pdm-project.org/) installed. Then, run:

   ```bash
   pdm install
   ```

3. **Set Up Playwright**

   Download the browser binaries Playwright needs by running:

   ```bash
   pdm run playwright install
   ```

## Usage
### Spinning Up an Async Playwright Instance
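Tweet Crawler needs an async Playwright instance to interact with the browser.
Here's an example of how to create one:

```python
from playwright.async_api import async_playwright

# TwitterStatusCrawler is provided by tweet-crawler; import it from the package.
# This snippet must run inside an async function (e.g. one started with asyncio.run).
url: str = ...  # URL of the tweet to crawl

async with async_playwright() as p:
    browser = await p.chromium.launch()
    context = await browser.new_context()
    # Create the page from the context so cookies added to the context apply to it.
    page = await context.new_page()
    crawler = TwitterStatusCrawler(page, url)
```

### Running in Guest Mode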
To crawl tweets as a guest (without replies), simply run:
```python
await crawler.run()
```

### Running with Cookies
To fetch replies and extended information, you need to provide your Twitter cookies.
The following example adds cookies to the browser context from environment variables:

> [!CAUTION]
> Never hardcode your cookies directly in the code. Doing so can expose your sensitive information.
> Use environment variables or a secure method to store them.

```python
import os

from playwright.async_api import BrowserContext

context: BrowserContext  # the browser context the crawler's page belongs to

await context.add_cookies(
    [
        {
            "name": "auth_token",
            "value": os.environ["AUTH_TOKEN"],
            "domain": ".x.com",
            "path": "/",
            # Expiry as a Unix timestamp in seconds
            "expires": float(os.environ["AUTH_TOKEN_EXPIRES"]),
            "httpOnly": True,
            "sameSite": "None",
            "secure": True,
        },
        {
            "name": "ct0",
            "value": os.environ["CT0"],
            "domain": ".x.com",
            "path": "/",
            "expires": float(os.environ["CT0_EXPIRES"]),
            "httpOnly": False,
            "sameSite": "Lax",
            "secure": True,
        },
    ]
)
```

Then, you can run the crawler as usual:
```python
await crawler.run()
```

## Data Output
The data is parsed into Python dataclasses for easy handling and manipulation. The following information can be extracted:
- **Tweet Content**: The text of the tweet.
- **User Information**: Username and profile details of the tweet author.
- **Media**: Links to any media (images, videos, etc.) included in the tweet.
- **Statistics**: Number of likes, retweets, and other reaction metrics.
- **Replies**: (Authenticated mode only) Full threads of replies to the tweet.

## Contributing
Contributions are welcome! Feel free to open issues or submit pull requests with improvements. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/YourFeature`)
3. Commit your Changes (`git commit -m 'Add some feature'`)
4. Push to the Branch (`git push origin feature/YourFeature`)
5. Open a Pull Request

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Disclaimer
This tool is intended for educational and research purposes only. Please ensure you comply with Twitter's terms of service and any applicable laws before using this tool to scrape data from their platform. Use responsibly.