{"id":13573159,"url":"https://github.com/philipmat/discogs-xml2db","last_synced_at":"2025-05-16T10:07:53.086Z","repository":{"id":49093806,"uuid":"2640517","full_name":"philipmat/discogs-xml2db","owner":"philipmat","description":"Imports the discogs.com monthly XML dumps into databases","archived":false,"fork":false,"pushed_at":"2025-03-13T09:10:13.000Z","size":1585,"stargazers_count":222,"open_issues_count":25,"forks_count":81,"subscribers_count":17,"default_branch":"develop","last_synced_at":"2025-04-19T13:39:59.862Z","etag":null,"topics":["discogs","python"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/philipmat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2011-10-25T01:33:32.000Z","updated_at":"2025-04-17T05:18:55.000Z","dependencies_parsed_at":"2024-11-05T07:32:08.032Z","dependency_job_id":"03b52a40-eed3-4f81-9750-61ccb31dc319","html_url":"https://github.com/philipmat/discogs-xml2db","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipmat%2Fdiscogs-xml2db","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipmat%2Fdiscogs-xml2db/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipmat%2Fdiscogs-xml2db/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/philipmat%2Fdiscogs-xml2db/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/philipmat","download_url":"https://codeload.github.com/philipmat/discogs-xml2db/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254509476,"owners_count":22082891,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["discogs","python"],"created_at":"2024-08-01T15:00:30.971Z","updated_at":"2025-05-16T10:07:48.070Z","avatar_url":"https://github.com/philipmat.png","language":"C#","funding_links":[],"categories":["C# #"],"sub_categories":[],"readme":"# discogs-xml2db v2\n\ndiscogs-xml2db is a Python program for importing [discogs data dumps](https://data.discogs.com/)\ninto several databases.\n\nVersion 2 is a rewrite of the original *discogs-xml2db*\n(referred to here as the *classic* version).  \nIt is based on a [branch by RedApple](https://github.com/redapple/discogs-xml2db)\nand it is several times faster.\n\nIt currently supports MySQL and PostgreSQL as target databases.\nInstructions for importing into MongoDB and CouchDB are also provided, though these are untested.  \nLet us know how it goes!\n\n## Experimental version\n\nIn parallel to the original Python codebase, we're working on a parser/exporter\nthat's even faster. 
This is a complete rewrite in C# and initial results are highly\npromising:\n\n| File | Record Count | Python | C# |\n| --- | ---: | :---: | :---: |\n| discogs_20200806_artists.xml.gz  |  7,046,615 | 6:22    | 2:35 |\n| discogs_20200806_labels.xml.gz   |  1,571,873 | 1:15    | 0:22 |\n| discogs_20200806_masters.xml.gz  |  1,734,371 | 3:56    | 1:57 |\n| discogs_20200806_releases.xml.gz | 12,867,980 | 1:45:16 | 42:38 |\n\nIf you're interested in testing this version, read more about it\nin the [.NET Parser README](./alternatives/dotnet/README.md) or grab\nthe appropriate binaries from the\n[Releases page](https://github.com/philipmat/discogs-xml2db/releases).\n\nWhile this version does not yet have complete feature parity with the Python\nversion, the core export-to-csv is there and it is likely that it will\neventually replace the Python version.\n\n![DotNet Build](https://github.com/philipmat/discogs-xml2db/workflows/DotNet%20Build/badge.svg)\n\n## Running discogs-xml2db\n\n![Build Status - develop](https://github.com/philipmat/discogs-xml2db/workflows/Python%20build%20check/badge.svg)\n\n### Requirements\n\n**discogs-xml2db requires Python 3 (minimum 3.6)** and some Python modules.  \nAdditionally, the bash shell is used for automating some tasks.  
\n\nImporting to some databases may require additional dependencies;\nsee the documentation for your target database below.\n\nIt is best to create a [Python virtual environment](https://docs.python.org/3/library/venv.html)\nso that the required modules are installed in a safe\nlocation that does not require elevated permissions:\n\n```sh\n# Create a virtual environment\n$ python3 -m venv .discogsenv\n\n# Activate virtual environment\n# On Linux/macOS:\n$ source .discogsenv/bin/activate\n# On Windows, in PowerShell:\n$ .discogsenv\\Scripts\\Activate.ps1\n\n# Install requirements:\n(.discogsenv) $ pip3 install -r requirements.txt\n```\n\nInstallation instructions for other platforms can be found in the [pip documentation](https://pip.pypa.io/en/stable/installing/).\n\n### Downloading discogs dumps\n\nDownload the latest dump files manually from [discogs](https://data.discogs.com/)\nor run `get_latest_dumps.sh`.\n\nTo check the files' integrity, download the appropriate checksum file from\n[https://data.discogs.com/](https://data.discogs.com/),\nplace it in the same directory as the dumps and compare the checksums.\n\n```sh\n# run in folder where the data dump files have been downloaded\n$ sha256sum -c discogs_*_CHECKSUM.txt\n```\n\n### Converting dumps to CSV\n\nRun `run.py` to convert the dump files to csv.\n\nThere are two run modes:\n\n1. You can point it to a directory where the discogs dump files are\n   and use one or multiple `--export` options to indicate which files to process:\n\n```sh\n# ensure the virtual environment is active\n#   --bz2        compresses the resulting csv files\n#   --apicounts  provides more accurate progress counts\n#   --output     folder where the csv files are written\n#   dump-dir     folder where the data dumps are\n(.discogsenv) $ python3 run.py \\\n  --bz2 \\\n  --apicounts \\\n  --export artist --export label --export master --export release \\\n  --output csv-dir \\\n  dump-dir\n```\n\n2. 
You can specify the individual files instead:\n\n```sh\n# ensure the virtual environment is active\n#   --bz2        compresses the resulting csv files\n#   --apicounts  provides more accurate progress counts\n#   --output     folder where the csv files are written\n(.discogsenv) $ python3 run.py \\\n  --bz2 \\\n  --apicounts \\\n  --output csv-dir \\\n  path/to/discogs_20200806_artists.xml.gz path/to/discogs_20200806_labels.xml.gz\n```\n\n`run.py` takes the following arguments:\n\n- `--export`: the types of dump files to export: \"artist\", \"label\", \"master\", \"release\".  \n  It matches the names of the dump files, e.g. \"discogs_20200806_*artist*s.xml.gz\".  \n  Not needed if the individual files are specified.\n- `--bz2`: Compresses the output csv files using the bz2 compression library.\n- `--limit=\u003clines\u003e`: Limits the export to the given number of entities.\n- `--apicounts`: Makes the progress report more accurate by getting total amounts from the Discogs API.\n- `--output`: the folder in which to store the csv files; defaults to the current directory.\n\nThe exporter provides progress information in real time:\n\n```text\nProcessing      labels:  99%|█████████████████████████████████████████▊| 1523623/1531339 [01:41\u003c00:00, 14979.04labels/s]\nProcessing     artists: 100%|████████████████████████████████████████▊| 6861991/6894139 [09:02\u003c00:02, 12652.23artists/s]\nProcessing    releases:  78%|█████████████████████████████▌        | 9757740/12560177 [2:02:15\u003c36:29, 1279.82releases/s]\n```\n\nThe total amount and percentages might be off a bit as the exact amount is not known while reading the file.  \nSpecifying `--apicounts` will provide more accurate predictions by getting the latest amounts from the Discogs API.\n\n### Importing\n\nIf `pv` is available, it will be used to display progress during import.  \nTo install it, run `$ sudo apt-get install pv` on Ubuntu and Debian or check the\n[installation instructions for other platforms](http://www.ivarch.com/programs/pv.shtml).  
\n\nExample output if using `pv`:\n\n```sh\n$ mysql/importcsv.sh 2020-05-01/csv/*\nartist_alias.csv.bz2: 12,5MiB 0:00:03 [3,75MiB/s] [===================================\u003e] 100%\nartist.csv.bz2:  121MiB 0:00:29 [4,09MiB/s] [=========================================\u003e] 100%\nartist_image.csv.bz2:  7,3MiB 0:00:01 [3,72MiB/s] [===================================\u003e] 100%\nartist_namevariation.csv.bz2: 2,84MiB 0:00:01 [2,76MiB/s] [==\u003e                         ] 12% ETA 0:00:07\n```\n\n#### Importing into PostgreSQL\n\n```sh\n# install PostgreSQL libraries (might be required for next step)\n$ sudo apt-get install libpq-dev\n\n# install the PostgreSQL package for python\n# ensure the virtual environment has been activated\n(.discogsenv) $ pip3 install -r postgresql/requirements.txt\n\n# Configure PostgreSQL username, password, database, ...\n$ nano postgresql/postgresql.conf\n\n# Create database tables\n(.discogsenv) $ python3 postgresql/psql.py \u003c postgresql/sql/CreateTables.sql\n\n# Import CSV files\n(.discogsenv) $ python3 postgresql/importcsv.py /csvdir/*\n\n# Configure primary keys and constraints, build indexes\n(.discogsenv) $ python3 postgresql/psql.py \u003c postgresql/sql/CreatePrimaryKeys.sql\n(.discogsenv) $ python3 postgresql/psql.py \u003c postgresql/sql/CreateFKConstraints.sql\n(.discogsenv) $ python3 postgresql/psql.py \u003c postgresql/sql/CreateIndexes.sql\n```\n\n#### Importing into MySQL\n\n```sh\n# Configure MySQL username, password, database, ...\n$ nano mysql/mysql.conf\n\n# Create database tables\n$ mysql/exec_sql.sh \u003c mysql/CreateTables.sql\n\n# Import CSV files\n$ mysql/importcsv.sh /csvdir/*\n\n# Configure primary keys and build indexes\n$ mysql/exec_sql.sh \u003c mysql/AssignPrimaryKeys.sql\n```\n\n#### Importing into MongoDB\n\nThe CSV files can be imported into MongoDB using\n[mongoimport](https://docs.mongodb.com/manual/reference/program/mongoimport/).\n\n```sh\nmongoimport --db=discogs --collection=releases 
--type=csv --headerline --file=release.csv\n```\n\n#### Importing into CouchDB\n\nCouchDB only supports importing JSON files.  \n[`couchimport`](https://github.com/glynnbird/couchimport) can be used to convert\nthe CSV files to JSON and import them into CouchDB,\nas explained in [this tutorial](https://medium.com/codait/simple-csv-import-for-couchdb-71616200b095).\n\n## Comparison to classic discogs-xml2db\n\n*speedup* is many times faster than *classic* because it uses a different approach:\n\n1. The discogs xml dumps are first converted into one csv file per database table.\n2. These csv files are then imported into the different target databases (bulk load).  \n   This is different from *classic* discogs-xml2db which loads records into the database\n   one by one while parsing the xml file, waiting on the database after every row.\n\n*speedup* requires less disk space than *classic* as it can work while the dump files are still compressed.\nWhile the uncompressed dumps for May 2020 take up 57GB of space the compressed dumps are only 8.8GB.\nThe dumps can be deleted after converting them to compressed CSV files (6.1GB).\n\nAs many databases can import CSV files out of the box it should be easy\nto add support for more databases to discogs-xml2db *speedup* in the future.\n\n### Database schema changes\n\nThe database schema was changed in v2.0 to be more consistent and normalize some more data.\nThe following things changed compared to *classic* `discogs-xml2db`:\n\n- renamed table: `releases_labels` =\u003e `release_label`\n- renamed table: `releases_formats` =\u003e `release_format`\n- renamed table: `releases_artists` =\u003e `release_artist`\n- renamed table: `tracks_artists` =\u003e `release_track_artist`\n- renamed table: `track` =\u003e `release_track`\n- renamed column: `release_artists.join_relation` =\u003e `release_artist.join_string`\n- renamed column: `release_track_artist.join_relation` =\u003e `release_track_artist.join_string`\n- renamed column: 
`release_format.format_name` =\u003e `release_format.name`\n- renamed column: `label.contactinfo` =\u003e `label.contact_info`\n- renamed column: `label.parent_label` =\u003e `label.parent_name`\n- added: `label` has new `parent_id` field\n- added: `release_label` has extra fields\n- moved: `aliases` now in `artist_alias` table\n- moved: `tracks_extra_artists` now in `track_artist` table with extra flag\n- moved: `releases_extra_artists` now in `release_track_artist` table with extra flag\n- moved: `release.genres` now in own `release_genre` table\n- moved: `release.styles` now in own `release_style` table\n- moved: `release.barcode` now in `release_identifier` table\n- moved: `artist.anv` fields now in `artist_namevariation` table\n- moved: `artist.url` fields now in `artist_url` table\n- removed: `release_format.position` no longer exists, but the `id` field can be used to preserve order when a release has multiple formats.\n- `release_track_artist` now uses `tmp_track_id` to match `tmp_track` in `release_track`\n\n### Running discogs-xml2db classic\n\nTo run the classic version of discogs-xml2db, check out the v1.99 git tag.  \nIt contains both the classic and the speed-up versions.\n\nPlease be aware that the classic version is no longer maintained.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilipmat%2Fdiscogs-xml2db","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphilipmat%2Fdiscogs-xml2db","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphilipmat%2Fdiscogs-xml2db/lists"}