Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ruggsea/bluesky-firehose-py
A Python library/CLI for collecting and archiving posts from the Bluesky social network using the Jetstream API.
- Host: GitHub
- URL: https://github.com/ruggsea/bluesky-firehose-py
- Owner: ruggsea
- License: mit
- Created: 2024-12-05T01:29:55.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-01-22T00:50:59.000Z (13 days ago)
- Last Synced: 2025-01-22T01:24:53.621Z (13 days ago)
- Topics: archiving, atproto, blsky, bluesky, firehose, jetstream, scraping
- Language: Python
- Homepage:
- Size: 17.6 KB
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Bluesky Firehose Archiver
A Python library for collecting and archiving posts from the Bluesky social network using the [Jetstream API](https://github.com/bluesky-social/jetstream). This tool connects to Bluesky's firehose and saves posts in an organized file structure.
## Features
- Connects to Bluesky's Jetstream websocket API
- Archives posts in JSONL format, organized by date and hour
- Optional real-time post text streaming to stdout
- Automatic reconnection on connection loss
- Efficient batch processing and disk operations
- Debug mode for detailed logging
- Optional handle resolution (disabled by default)

## Installation
1. Clone the repository:
```bash
git clone https://github.com/yourusername/bluesky-firehose-archiver.git
```

2. Navigate to the project directory:
```bash
cd bluesky-firehose-archiver
```

3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

4. Install dependencies:
```bash
pip install -r requirements.txt
```

5. For development/testing:
```bash
pip install -r requirements-dev.txt
```

## Usage
### As a Command Line Tool
Basic usage:
```bash
python src/main.py
```

Available command line options:
```bash
python src/main.py [options]

Options:
--username Bluesky username (optional)
--password Bluesky password (optional)
--debug Enable debug output
--stream Stream post text to stdout in real-time
--measure-rate Track and display posts per minute rate
--get-handles Resolve handles while archiving (not recommended)
--cursor Unix microseconds timestamp to start playback from
--archive-all Archive all records in their original format (not just posts)
```

Note: Authentication (username/password) is currently implemented but not required for basic operation; future versions will use it to fetch additional user and post metadata. Handle resolution is disabled by default because rate limiting makes it considerably slower; resolving handles from DIDs after collection is recommended instead.
### As a Library
You can use the archiver in two ways:
1. Archive posts to disk:
```python
from archiver import BlueskyArchiver
import asyncio

async def main():
    archiver = BlueskyArchiver(debug=True, stream=True)
    await archiver.archive_posts()  # This will save posts to disk

asyncio.run(main())
```

2. Stream posts in your code:
```python
from archiver import BlueskyArchiver
import asyncio

async def main():
    archiver = BlueskyArchiver()
    async for post in archiver.stream_posts():
        # Process each post as it arrives
        print(f"New post from @{post['handle']}: {post['record']['text']}")

        # Example: Filter posts containing specific text
        if "python" in post['record']['text'].lower():
            # Do something with Python-related posts
            process_python_post(post)

asyncio.run(main())
```

3. **Run Archiving and Streaming Concurrently**:
```python
from archiver import BlueskyArchiver
import asyncio

async def main():
    archiver = BlueskyArchiver(debug=True, stream=True, measure_rate=True)
    async for post in archiver.run_stream():
        # Process each post as it arrives
        print(f"New post from @{post['handle']}: {post['record']['text']}")

        # Example: Additional processing
        # process_post(post)

asyncio.run(main())
```

### Example Use Cases:
- Real-time content analysis
- Custom filtering and processing
- Integration with other services
- Building real-time dashboards
- Research and data collection

## Data Storage
Posts are automatically saved in JSONL (JSON Lines) format, organized by date and hour:
```
data/
└── YYYY-MM/
    └── DD/
        └── posts_YYYYMMDD_HH.jsonl
```

Each JSONL file contains one post per line in JSON format with the following structure:
```json
{
  "handle": "user.bsky.social",
  "timestamp": "2024-03-15T01:23:45.678Z",
  "record": {
    "text": "Post content",
    "createdAt": "2024-03-15T01:23:45.678Z",
    ...
  },
  "rkey": "unique-record-key",
  ...
}
```

## Project Structure
```
├── src/
│   ├── main.py          # Entry point and CLI interface
│   └── archiver.py      # Core archiving logic
├── data/                # Archived posts storage
├── requirements.txt     # Project dependencies
└── README.md            # This file
```

## License
MIT License
### Playback Feature
The archiver supports playback from a specific point in time using the Jetstream cursor functionality. To use this feature:
```bash
# Start archiving from a specific timestamp (Unix microseconds)
python src/main.py --cursor 1725911162329308
```

Notes about playback:
- The cursor should be a Unix timestamp in microseconds
- Playback will start from the specified time and continue to real-time
- You can find the timestamp in the saved posts' `time_us` field

### Complete Record Archiving
By default, the archiver only saves post records. To archive all record types (posts, likes, follows, etc.) in their original format:
```bash
python src/main.py --archive-all
```

This will:
- Save all records without filtering by collection
- Preserve the original JSON structure from the firehose
- Store files in the `data_everything` directory
- Include all record types (posts, likes, follows, profiles, etc.)

The records are saved in JSONL format with the original structure:
```json
{
  "did": "did:plc:abcd...",
  "time_us": 1234567890,
  "kind": "commit",
  "commit": {
    "rev": "...",
    "operation": "create",
    "collection": "app.bsky.feed.post",
    "rkey": "...",
    "record": { ... }
  }
}
```
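Once records are archived, each JSONL file can be consumed line by line. As a sketch (not part of this project), the helper below tallies records per collection, assuming the event shape shown above; events without a `commit` object are counted under their `kind`:

```python
import json
from collections import Counter
from pathlib import Path


def count_collections(jsonl_path: Path) -> Counter:
    """Count archived records per collection (e.g. app.bsky.feed.post)."""
    counts = Counter()
    with jsonl_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            commit = record.get("commit")
            if commit:
                # Commit events carry the collection name
                counts[commit.get("collection", "unknown")] += 1
            else:
                # Other kinds (e.g. "identity", "account") lack a commit
                counts[record.get("kind", "unknown")] += 1
    return counts
```

This gives a quick breakdown of what a `--archive-all` run actually captured, without loading a whole file into memory.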