{"id":31378683,"url":"https://github.com/william1nguyen/dblab","last_synced_at":"2025-09-28T06:55:30.451Z","repository":{"id":315371805,"uuid":"1058588373","full_name":"william1nguyen/dblab","owner":"william1nguyen","description":"Lab for database sharding concepts and large-scale data techniques","archived":false,"fork":false,"pushed_at":"2025-09-18T15:48:40.000Z","size":910,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-28T06:55:29.419Z","etag":null,"topics":["data-sharding","database","large-scale"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/william1nguyen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-17T09:35:29.000Z","updated_at":"2025-09-18T15:51:27.000Z","dependencies_parsed_at":"2025-09-18T08:40:39.661Z","dependency_job_id":"72c2a983-527a-4261-b661-d2e5216e4cfa","html_url":"https://github.com/william1nguyen/dblab","commit_stats":null,"previous_names":["william1nguyen/data-sharding","william1nguyen/dblab"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/william1nguyen/dblab","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/william1nguyen%2Fdblab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/william1nguyen%2Fdblab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/william1nguyen%2Fdblab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/william1nguyen%2Fdblab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/william1nguyen","download_url":"https://codeload.github.com/william1nguyen/dblab/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/william1nguyen%2Fdblab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277335327,"owners_count":25800966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-28T02:00:08.834Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-sharding","database","large-scale"],"created_at":"2025-09-28T06:55:27.833Z","updated_at":"2025-09-28T06:55:30.434Z","avatar_url":"https://github.com/william1nguyen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DBLab\n\n[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://python.org)\n[![PostgreSQL](https://img.shields.io/badge/postgresql-12+-336791.svg)](https://postgresql.org)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n[![Status](https://img.shields.io/badge/Project-Learning-orange)](#)\n\nA simple learning project for experimenting with database sharding concepts and large-scale data techniques.\n\n---\n\n## Quick Start (TL;DR)\n\nCopy-paste these 3 lines to start experimenting:\n\n```bash\ngit clone https://github.com/yourusername/data-sharding.git\ncd data-sharding \u0026\u0026 make start-dev-env\nmake dev\n```\n\nWatch the progress bars as data gets generated and distributed across shards 🚀\n\n## Demo\n\n### Progress Tracking\n\n\u003cimg src=\"docs/demo/progress-tracking.png\" alt=\"Data Generation Progress\" width=\"70%\"/\u003e\n\n### Performance Benchmarks\n\n\u003cimg src=\"docs/demo/benchmark-result.png\" alt=\"Benchmark Results\" width=\"70%\"/\u003e\n\n### Database Connection Test\n\n\u003cimg src=\"docs/demo/connection-test.png\" alt=\"Database Connection Test\" width=\"70%\"/\u003e\n\n### Complete Workflow\n\n\u003cimg src=\"docs/demo/full-process.png\" alt=\"Complete Workflow Demo\" width=\"70%\"/\u003e\n\n## Table of Contents\n\n1. [Purpose](#purpose)\n2. [Features](#features)\n3. [Prerequisites](#prerequisites)\n4. [Installation](#installation)\n5. [Configuration](#configuration)\n6. [Usage](#usage)\n7. [Project Structure](#project-structure)\n8. [Development Commands](#development-commands)\n9. [Notes](#notes)\n10. [TODO](#todo)\n\n---\n\n## Purpose\n\nThis project helps me understand database sharding fundamentals by:\n\n- **Database Sharding Simulation** - Simple hash-based data distribution across PostgreSQL instances\n- **Synthetic Data Generation** - Create realistic test datasets to experiment with\n- **Performance Comparison** - Basic benchmarking between single vs sharded database queries\n- **Large Dataset Handling** - Learn techniques for processing millions of records efficiently\n\nIt's a personal laboratory for understanding how database scaling works in practice.\n\n---\n\n## Features\n\n- Generate user data using Faker library\n- Distribute data across configurable number of PostgreSQL shards\n- Real-time progress tracking with multiple progress bars\n- Simple modulo-based hash sharding implementation\n- Basic query performance benchmarking suite\n- Docker-based development environment setup\n- Batch processing for memory-efficient data handling\n\n---\n\n## Prerequisites\n\nEnsure your development environment is ready:\n\n- Python 3.8+ with pip or uv package manager\n- Docker and Docker Compose for database containers\n- Basic understanding of PostgreSQL and database concepts\n- Terminal/command line familiarity\n\n---\n\n## Installation\n\n### Clone and Setup\n\n```bash\ngit clone https://github.com/yourusername/data-sharding.git\ncd data-sharding\n```\n\n### Start Development Environment\n\n```bash\n# Start PostgreSQL containers\nmake start-dev-env\n\n# Run the application\nmake dev\n```\n\n---\n\n## Configuration\n\n### Environment Variables\n\nSet up your database connections in `.env` file:\n\n```bash\n# Main database (coordinator)\nMAINDB_URL=postgresql://user:password@localhost:5432/maindb\n\n# Shard databases (comma-separated)\nSHARD_URLS=postgresql://user:password@localhost:5433/shard0,postgresql://user:password@localhost:5434/shard1\n\n# Data generation settings\nMAX_GEN_USERS=10000\n```\n\nSee [.env.example](.env.example) for complete configuration template.\n\n### Database Schema\n\nDatabase tables are automatically created using SQL scripts in `/sql`:\n\n- `sql/init-main.sql` - Main database schema\n- `sql/init-shard.sql` - Shard database schema\n\n---\n\n## Usage\n\n### Basic Workflow\n\nThe main script runs everything automatically:\n\n```bash\nmake dev\n```\n\nThis performs:\n\n1. **Connection Testing** - Verifies all database connections\n2. **Data Generation** - Creates synthetic users in main database\n3. **Data Migration** - Distributes users to shards using hash-based routing\n4. **Performance Benchmarking** - Runs query tests on both architectures\n\n### Customizing Data Volume\n\nControl the number of users via environment variable:\n\n```bash\n# Generate 50,000 users\nMAX_GEN_USERS=50000 make dev\n\n# Generate 1 million users\nMAX_GEN_USERS=1000000 make dev\n```\n\n### Individual Operations\n\nWhile the main script runs everything, you can also:\n\n```bash\n# Start only the databases\nmake start-dev-env\n\n# Stop databases when done\nmake stop-dev-env\n\n# Check Docker containers\ndocker-compose -f docker-compose.yml -p data-sharding ps\n```\n\n## Project Structure\n\n```\ndata-sharding/\n├── src/\n│   ├── database.py     # Core sharding logic and operations\n│   ├── models.py       # User data model with dataclass\n│   ├── env.py         # Environment configuration loader\n│   ├── main.py        # Application entry point\n│   └── utils/\n│       └── progress.py # Progress bar utilities\n├── sql/\n│   ├── init-main.sql   # Main database schema\n│   └── init-shard.sql  # Shard database schema\n├── docker-compose.yml  # PostgreSQL containers\n├── Makefile           # Development commands\n├── .env.example       # Configuration template\n└── README.md         # This file\n```\n\n---\n\n## Development Commands\n\n### Database Management\n\n```bash\n# Start PostgreSQL databases\nmake start-dev-env\n\n# Stop databases\nmake stop-dev-env\n```\n\n### Application\n\n```bash\n# Run the main application\nmake dev\n\n# Run with custom user count\nMAX_GEN_USERS=100000 make dev\n```\n\n### Docker Operations\n\n```bash\n# View running containers\ndocker-compose -f docker-compose.yml -p data-sharding ps\n\n# View logs\ndocker-compose -f docker-compose.yml -p data-sharding logs\n\n# Connect to database directly\ndocker exec -it data-sharding_maindb_1 psql -U postgres -d maindb\n```\n\n---\n\n## Notes\n\n- **Memory Usage** - Large datasets (1M+ users) require sufficient system memory\n- **Docker Resources** - Ensure Docker has adequate CPU and memory allocation\n- **Progress Tracking** - Each shard gets its own progress bar during migration\n- **Hash Distribution** - Uses simple `user_id % shard_count` for data routing\n- **Connection Pooling** - Currently uses basic connection management\n- **Data Persistence** - Database data persists between container restarts\n\n---\n\n## TODO\n\n- [ ] Add data rollback functionality for testing different sharding scenarios\n- [ ] Implement range-based partitioning as alternative to hash-based sharding\n- [ ] Create simple connection pooling to handle more concurrent operations\n- [ ] Add basic monitoring dashboard for query performance tracking\n- [ ] Support for different data types (orders, products, transactions, etc.)\n- [ ] Implement data migration tools for resharding experiments\n- [ ] Add read replica simulation for read/write separation testing\n- [ ] Basic caching layer integration (Redis) for performance comparison\n- [ ] Support for testing with different shard counts dynamically\n- [ ] Simple backup and restore functionality for experiment snapshots\n- [ ] Query optimization techniques and indexing strategy experiments\n- [ ] Load testing scenarios with concurrent users simulation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliam1nguyen%2Fdblab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwilliam1nguyen%2Fdblab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwilliam1nguyen%2Fdblab/lists"}