{"id":28762347,"url":"https://github.com/difu/discostar","last_synced_at":"2025-06-17T08:08:02.873Z","repository":{"id":298141368,"uuid":"915587135","full_name":"difu/discostar","owner":"difu","description":"Discogs Statistics Researcher","archived":false,"fork":false,"pushed_at":"2025-06-16T18:10:59.000Z","size":117,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-16T19:27:49.031Z","etag":null,"topics":["discogs","discogs-api","discogs-client"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/difu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-12T09:04:35.000Z","updated_at":"2025-06-16T18:11:03.000Z","dependencies_parsed_at":"2025-06-09T16:43:43.341Z","dependency_job_id":"a8cec775-56e3-4117-8acd-d9e12509658c","html_url":"https://github.com/difu/discostar","commit_stats":null,"previous_names":["difu/discostar"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/difu/discostar","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/difu%2Fdiscostar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/difu%2Fdiscostar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/difu%2Fdiscostar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/difu%2Fdiscostar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/difu","downlo
ad_url":"https://codeload.github.com/difu/discostar/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/difu%2Fdiscostar/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260318682,"owners_count":22991121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["discogs","discogs-api","discogs-client"],"created_at":"2025-06-17T08:08:02.227Z","updated_at":"2025-06-17T08:08:02.861Z","avatar_url":"https://github.com/difu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🎵 DiscoStar\n\nA powerful Python CLI tool for analyzing your personal record collection using Discogs data. 
DiscoStar combines XML data dumps with real-time API calls to provide deep insights into your music collection.\n\n## ✨ Features\n\n- **Hybrid Data Approach**: Combines Discogs XML dumps for reference data with API calls for your personal collection\n- **Collection Sync**: Sync your personal collection from Discogs API with real-time progress tracking\n- **High-Performance Ingestion**: Memory-efficient XML parsing with batch processing (10,000+ records/second)\n- **Rate-Limited API Client**: Respects Discogs API limits with configurable SSL handling\n- **Real-time Progress Tracking**: Visual progress indicators and detailed status reporting\n- **Robust Error Handling**: Comprehensive error recovery with sub-1% error rates\n- **Local Database**: SQLite for development, with Azure PostgreSQL support for production\n- **CLI Interface**: Clean command-line interface for all operations\n- **Analytics Engine**: Comprehensive collection analysis with multiple output formats\n- **Web Interface**: Future Flask-based web dashboard (coming soon)\n- **Cloud Ready**: Terraform infrastructure for Azure deployment\n\n## 📊 Analytics Features\n\nDiscoStar provides comprehensive analytics for your music collection with multiple output formats:\n\n### Available Analyses\n- **Collection Summary**: Overview statistics (total releases, artists, labels, year range)\n- **Decade Analysis**: Distribution by decade (prevents duplicate counting of the same album)\n- **Top Artists**: Most collected artists in your collection\n- **Top Labels**: Most collected record labels\n- **Longest Tracks**: Find the longest tracks in your collection\n- **Multiple Copies**: Identify albums where you own multiple variants/pressings\n- **Genre Analysis**: Breakdown by genre and subgenre\n- **Format Analysis**: Distribution by format (vinyl, CD, digital, etc.)\n- **Year Analysis**: Most collected years\n- **Artist Collaborations**: Find releases where two artists collaborated\n\n### Output Formats\n- 
**Human-readable**: Formatted tables for terminal display\n- **CSV**: For spreadsheet analysis and external visualization tools\n- **JSON**: For programmatic use and integration with other tools\n\n### Usage Examples\n\n```bash\n# Basic collection summary\ndiscostar analytics\n\n# Decade analysis with CSV output for visualization\ndiscostar analytics --type decades --format csv --output decades.csv\n\n# Top 10 artists in JSON format\ndiscostar analytics --type top-artists --limit 10 --format json\n\n# Find collaborations between Miles Davis and John Coltrane\ndiscostar analytics --type collaborations --artist1 \"Miles Davis\" --artist2 \"John Coltrane\"\n\n# Run all analyses and save comprehensive report\ndiscostar analytics --type all --output collection_report.txt\n\n# Export genre data for external analysis\ndiscostar analytics --type genres --format csv --limit 30 --output genres.csv\n```\n\n### Advanced Features\n- **Smart duplicate handling**: Decade analysis uses earliest release year for each master to prevent duplicate counting\n- **Flexible limits**: Customize result limits for top-N analyses\n- **File output**: Save results directly to files for further processing\n- **Real-time validation**: Checks for collection data before running analyses\n\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- Python 3.9 or higher\n- Discogs account and API token\n\n### Installation\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/difu/discostar.git\ncd discostar\n```\n\n2. Create a virtual environment:\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n3. Install dependencies:\n```bash\npip install -r requirements-dev.txt\n```\n\n4. Set up configuration:\n```bash\ncp .env.example .env\n# Edit .env with your Discogs API token and username\n```\n\n5. 
Initialize the database:\n```bash\ndiscostar init\n```\n\n### Basic Usage\n\n```bash\n# Download Discogs XML dumps\ndiscostar download-dumps\n\n# Import XML data into database\ndiscostar ingest-data\n\n# Sync your personal collection from Discogs API\ndiscostar sync-collection\n\n# Check ingestion and sync status\ndiscostar status\n\n# Analyze your collection\ndiscostar analytics\n```\n\n## ⚡ Performance Metrics\n\nDiscoStar is optimized for processing large Discogs datasets efficiently:\n\n### XML Ingestion Performance\n- **Processing Speed**: ~10,000 records/second\n- **Memory Efficiency**: Uses iterative XML parsing for files \u003e1GB\n- **Error Rate**: \u003c0.001% (sub-1% error tolerance)\n- **Batch Processing**: Configurable batch sizes (default: 1,000 records)\n- **Progress Tracking**: Real-time updates every 10,000 records\n\n### Database Performance\n- **Batch Commits**: Every 10,000 records to optimize transaction overhead\n- **Memory Usage**: Minimal memory footprint with streaming processing\n- **Storage**: SQLite for local development, PostgreSQL for production scale\n\n### API Performance\n- **Collection Sync**: 603 collection items synced in ~8 seconds\n- **Rate Limiting**: 60 requests/minute with 1-second minimum between requests\n- **Error Recovery**: Automatic retry logic for transient API failures\n- **Progress Tracking**: Real-time statistics during sync operations\n\n### Benchmark Results\nTested with Discogs June 2025 XML dumps on a MacBook Pro M4:\n- **Artists**: 1,060,000+ records processed in ~2 minutes\n- **Collection Sync**: 603 personal collection items in ~8 seconds\n- **Releases**: Estimated 8+ million records (full dataset)\n- **Labels**: Estimated 1.5+ million records\n- **Masters**: Estimated 2+ million records\n\n## 🔗 Database Schema \u0026 Relationships\n\nDiscoStar uses a normalized database schema with both JSON fields and relational join tables for optimal flexibility:\n\n### Data Storage Approach\n- **JSON Fields**: Store raw 
Discogs data in JSON format for completeness\n- **Join Tables**: Normalized relationships for efficient queries and analytics\n- **Hybrid Benefits**: Maintains data integrity while enabling complex SQL queries\n\n### Join Tables\nDiscoStar automatically populates join tables during release ingestion:\n\n| Table | Purpose | Example Query |\n|-------|---------|---------------|\n| **`release_artists`** | Artist-release relationships with roles | Find all releases by producer |\n| **`release_labels`** | Label-release relationships with catalog numbers | Group releases by label |\n| **`tracks`** | Individual track listings with positions | Search for specific songs |\n\n### Relationship Processing\n```bash\n# Automatic: Join tables populated during release ingestion\ndiscostar ingest-data --type releases\n\n# Manual: Process existing releases to populate join tables  \ndiscostar process-relationships\n\n# Check results\ndiscostar status  # Shows join table counts\n```\n\n### Query Examples\nWith join tables populated, you can run complex analytics:\n\n```sql\n-- Find all releases where Artist X collaborated with Artist Y\nSELECT r.title FROM releases r\nJOIN release_artists ra1 ON r.id = ra1.release_id  \nJOIN release_artists ra2 ON r.id = ra2.release_id\nWHERE ra1.artist_id = 1 AND ra2.artist_id = 2;\n\n-- Count releases by label\nSELECT l.name, COUNT(*) FROM labels l\nJOIN release_labels rl ON l.id = rl.label_id\nGROUP BY l.name ORDER BY COUNT(*) DESC;\n\n-- Find longest tracks in collection\nSELECT r.title, t.title, t.duration FROM tracks t\nJOIN releases r ON t.release_id = r.id\nORDER BY t.duration_seconds DESC LIMIT 10;\n\n-- Find favorite decade based on collection (earliest version of each master release only)\n-- - Groups your music collection by decade using the earliest release year for\n-- each album you own. 
This prevents duplicate counting when you own multiple pressings of the same album\n-- (e.g., original + remaster), giving you accurate statistics about\n-- which decades your music taste favors most.\n WITH earliest_releases AS (\n      SELECT\n          r.master_id,\n          MIN(\n              COALESCE(\n                  CAST(strftime('%Y', r.released) AS INTEGER),\n                  m.year,\n                  CAST(json_extract(uc.basic_information, '$.year') AS INTEGER)\n              )\n          ) as earliest_year\n      FROM releases r\n      INNER JOIN user_collection uc ON r.id = uc.release_id\n      LEFT JOIN masters m ON r.master_id = m.id\n      WHERE r.master_id IS NOT NULL\n        AND (\n            r.released IS NOT NULL OR\n            m.year IS NOT NULL OR\n            json_extract(uc.basic_information, '$.year') IS NOT NULL\n        )\n      GROUP BY r.master_id\n\n      UNION ALL\n\n      -- Include releases without master_id (standalone releases)\n      SELECT\n          NULL as master_id,\n          COALESCE(\n              CAST(strftime('%Y', r.released) AS INTEGER),\n              CAST(json_extract(uc.basic_information, '$.year') AS INTEGER)\n          ) as earliest_year\n      FROM releases r\n      INNER JOIN user_collection uc ON r.id = uc.release_id\n      WHERE r.master_id IS NULL\n        AND (\n            r.released IS NOT NULL OR\n            json_extract(uc.basic_information, '$.year') IS NOT NULL\n        )\n  ),\n  decade_counts AS (\n      SELECT\n          (earliest_year / 10) * 10 as decade_start,\n          COUNT(*) as release_count\n      FROM earliest_releases\n      WHERE earliest_year IS NOT NULL\n      GROUP BY (earliest_year / 10) * 10\n  )\n  SELECT\n      decade_start,\n      (decade_start || 's') as decade,\n      release_count,\n      ROUND(100.0 * release_count / SUM(release_count) OVER(), 2) as percentage\n  FROM decade_counts\n  ORDER BY release_count DESC;\n\n-- Find releases where you own multiple copies - 
Identifies albums in your collection where you own more than one pressing or version. Groups by master release to show\n--  unique albums with multiple copies, helping you track duplicates, variants, and different pressings of the same album (e.g., original vinyl + remaster + special\n--  edition).\n\n  WITH duplicate_releases AS (\n      SELECT\n          r.master_id,\n          m.title as master_title,\n          COUNT(DISTINCT uc.release_id) as copy_count  -- COUNT DISTINCT release_ids\n      FROM releases r\n      INNER JOIN user_collection uc ON r.id = uc.release_id\n      INNER JOIN masters m ON r.master_id = m.id\n      WHERE r.master_id IS NOT NULL AND r.master_id \u003e 0\n      GROUP BY r.master_id, m.title\n      HAVING COUNT(DISTINCT uc.release_id) \u003e 1  -- Use DISTINCT here too\n  )\n  SELECT\n      master_title as release_name,\n      copy_count\n  FROM duplicate_releases\n  ORDER BY copy_count DESC, master_title\n  LIMIT 5;\n```\n\n## 💾 Storage Strategy\n\nDiscoStar offers flexible release data management to balance completeness with performance:\n\n### Release Storage Options\n\n| Strategy | Records | Use Case | Storage | Query Speed |\n|----------|---------|----------|---------|-------------|\n| **`all`** | 8M+ releases | Complete dataset, discovery | ~2GB+ | Slower |\n| **`skip`** | 0 releases | Collection-only analysis | ~50MB | Fastest |\n| **`collection_only`** | 100s-1000s | Personal collection focus | ~100MB | Fast |\n| **`collection_only` + masters** | 1000s-10000s | Collection + all variants | ~200MB | Fast |\n\n### 🆕 Master Release Expansion\n\n**NEW FEATURE**: For `collection_only` strategy, you can now include all releases linked to masters in your collection. 
This gives you comprehensive coverage of all pressings, remasters, and variants of albums you own.\n\n**Example**: If you own \"Abbey Road\" (1969 UK pressing), enabling master expansion will also include:\n- Abbey Road (1969 US pressing)\n- Abbey Road (1987 CD remaster)\n- Abbey Road (2019 anniversary edition)\n- All other official releases of the album\n\n### Recommended Workflow\n\n```bash\n# Option 1: Start with essential data only\necho \"strategy: skip\" \u003e\u003e config/settings.yaml\ndiscostar ingest-data --type artists,labels,masters\n# Later: sync collection via API\n\n# Option 2: Import everything, optimize later  \ndiscostar ingest-data  # All data including 8M+ releases\n# After collection sync:\ndiscostar optimize-db --clean-unused  # Remove unused releases\n```\n\n### Configuration\n\nEdit `config/settings.yaml`:\n```yaml\ningestion:\n  releases:\n    strategy: \"collection_only\"  # or \"all\", \"skip\"\n    include_master_releases: true  # Include all pressings of albums in collection\n```\n\n### Master Expansion Workflow\n\n```bash\n# 1. Set up collection-only strategy with master expansion\necho \"ingestion:\n  releases:\n    strategy: 'collection_only'\n    include_master_releases: true\" \u003e\u003e config/settings.yaml\n\n# 2. Sync your collection first\ndiscostar sync-collection\n\n# 3. Import releases with master expansion\ndiscostar ingest-data --type releases\n\n# 4. 
Check results\ndiscostar status  # Shows collection + master variant counts\n```\n\n\n## 🏗️ Architecture\n\n```\ndiscostar/\n├── src/\n│   ├── core/           # Shared business logic\n│   │   ├── database/   # Database models and operations\n│   │   ├── discogs/    # API client and XML processing\n│   │   ├── analytics/  # Statistical analysis\n│   │   └── utils/      # Utilities and configuration\n│   ├── cli/            # Command-line interface\n│   └── web/            # Web interface (future)\n├── infrastructure/     # Azure deployment resources\n├── tests/             # Test suite\n├── data/              # Local data storage\n└── config/            # Configuration files\n```\n\n## 🔧 Configuration\n\nDiscoStar uses YAML configuration with environment variable overrides.\n\n### Available CLI Commands\n\n```bash\n# Core commands\ndiscostar init                    # Initialize database and directories\ndiscostar download-dumps          # Download all XML dumps\ndiscostar ingest-data            # Import XML data into database\ndiscostar sync-collection        # Sync your collection from Discogs API\ndiscostar status                 # Show database and sync status\n\n# Analytics commands\ndiscostar analytics                     # Basic collection summary\ndiscostar analytics --type all          # Run all available analyses\ndiscostar analytics --type decades --format csv  # Decade analysis as CSV\ndiscostar analytics --type collaborations --artist1 \"Artist1\" --artist2 \"Artist2\"\n\n# Collection sync options\ndiscostar sync-collection --force       # Force refresh of collection data\ndiscostar sync-wantlist                 # Sync wantlist (coming soon)\n\n# Advanced XML ingestion options\ndiscostar download-dumps --type artists  # Download specific dump type\ndiscostar ingest-data --type releases    # Import specific data type\ndiscostar ingest-data --force            # Force re-ingestion\ndiscostar clear-data --type artists      # Clear specific data type\n\n# 
Relationship processing (join tables)\ndiscostar process-relationships          # Populate join tables from release JSON data\n\n# Collection-only workflow guidance\ndiscostar collection-workflow            # Interactive guide for collection-only setup\n\n# Database optimization (after collection sync)\ndiscostar optimize-db --clean-unused     # Remove releases not in collections\n\n# Master release expansion options\ndiscostar ingest-data --include-masters  # CLI override for master expansion\n\n# Verbose logging\ndiscostar -v \u003ccommand\u003e           # Enable detailed logging\n```\n\n### Environment Variables\n\n```bash\n# Required\nDISCOGS_API_TOKEN=your_discogs_api_token\nDISCOGS_USERNAME=your_username\n\n# Optional\nDATABASE_URL=sqlite:///data/discostar.db\nAZURE_STORAGE_CONNECTION_STRING=your_azure_connection\n```\n\n### Configuration File\n\nSee `config/settings.yaml` for detailed configuration options including:\n- Database settings\n- Discogs API configuration and rate limiting\n- SSL verification settings (for development environments)\n- Logging configuration\n- Cache settings\n- XML ingestion batch processing parameters\n\n## 🧪 Development\n\n### Running Tests\n\n```bash\npytest\n```\n\n### Code Quality\n\n```bash\n# Format code\nblack src/ tests/\n\n# Lint code\nflake8 src/ tests/\n\n# Type checking\nmypy src/\n```\n\n### Project Structure\n\n- **Core Modules**: Business logic separated into focused modules\n- **CLI Interface**: Click-based command structure\n- **Database Layer**: SQLAlchemy models matching Discogs schema\n- **API Client**: Async HTTP client with rate limiting and error handling\n- **Collection Sync**: Real-time synchronization with progress tracking\n- **Async Processing**: aiohttp for concurrent API operations\n- **Testing**: Pytest with async support\n\n## ☁️ Deployment\n\n### Azure Deployment\n\n__TODO__ nothing done yet 😊\n\n1. Configure Azure credentials\n2. 
Deploy infrastructure:\n```bash\ncd infrastructure/terraform\nterraform init\nterraform plan\nterraform apply\n```\n\n3. Deploy application:\n```bash\n# Build and push Docker container\ndocker build -t discostar .\n# Deploy to Azure Container Instances\n```\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature-name`\n3. Make your changes\n4. Add tests for new functionality\n5. Run the test suite: `pytest`\n6. Format code: `black src/ tests/`\n7. Submit a pull request\n\n## 📝 License\n\nThis project is licensed under the GNU General Public License v3.0 (GPL-3.0) - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- [Discogs](https://www.discogs.com/) for providing the comprehensive music database and API\n- The open-source community for the excellent Python libraries that make this project possible\n\n## 📞 Support\n\n- Create an [issue](https://github.com/difu/discostar/issues) for bug reports or feature requests\n- Check the [documentation](https://github.com/difu/discostar/wiki) for detailed guides\n\n---\n\n**DiscoStar** - Illuminate your music collection with data-driven insights! ⭐\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdifu%2Fdiscostar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdifu%2Fdiscostar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdifu%2Fdiscostar/lists"}