{"id":20723570,"url":"https://github.com/networks-learning/stackexchange-dump-to-postgres","last_synced_at":"2025-04-23T17:27:21.233Z","repository":{"id":25379918,"uuid":"28808182","full_name":"Networks-Learning/stackexchange-dump-to-postgres","owner":"Networks-Learning","description":"Python scripts to import StackExchange data dump into Postgres DB.","archived":false,"fork":false,"pushed_at":"2022-07-06T19:28:01.000Z","size":65,"stargazers_count":87,"open_issues_count":6,"forks_count":29,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-16T17:17:07.315Z","etag":null,"topics":["data-dump","database","postgres","python","stackexchange-dump","stackoverflow-data"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Networks-Learning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-01-05T10:32:22.000Z","updated_at":"2025-03-16T23:02:06.000Z","dependencies_parsed_at":"2022-07-10T12:01:10.148Z","dependency_job_id":null,"html_url":"https://github.com/Networks-Learning/stackexchange-dump-to-postgres","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fstackexchange-dump-to-postgres","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fstackexchange-dump-to-postgres/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Networks-Learning%2Fstackexchange-dump-to-postgres/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/Networks-Learning%2Fstackexchange-dump-to-postgres/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Networks-Learning","download_url":"https://codeload.github.com/Networks-Learning/stackexchange-dump-to-postgres/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250479580,"owners_count":21437388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-dump","database","postgres","python","stackexchange-dump","stackoverflow-data"],"created_at":"2024-11-17T04:09:08.004Z","updated_at":"2025-04-23T17:27:21.215Z","avatar_url":"https://github.com/Networks-Learning.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StackOverflow data to postgres\n\nThis is a quick script to move the Stackoverflow data from the [StackExchange\ndata dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres\nSQL database.\n\nSchema hints are taken from [a post on\nMeta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)\nand from [StackExchange Data Explorer](http://data.stackexchange.com).\n\n## Quickstart\n\nInstall requirements, create a new database (e.g. 
`beerSO` below), and use the `load_into_pg.py` script:\n\n``` console\n$ pip install -r requirements.txt\n...\nSuccessfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0\n$ createdb beerSO\n$ python load_into_pg.py -s beer -d beerSO\n```\n\nThis will download compressed files from\n[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load\nall the tables at once.\n\n\n## Advanced Usage\n\nYou can use a custom database name as well. Make sure to explicitly give it\nwhile executing the script later.\n\nEach table's data is archived in an XML file. The available tables vary across\nthe history of the dumps. `load_into_pg.py` knows how to handle the following tables:\n\n- `Badges`.\n- `Posts`.\n- `Tags` (not present in earliest dumps).\n- `Users`.\n- `Votes`.\n- `PostLinks`.\n- `PostHistory`.\n- `Comments`.\n\nYou can manually download the files to the folder from which the program is\nexecuted: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In\nsome old dumps, the filenames use different letter case.\n\nThen load each file with e.g. `python load_into_pg.py -t Badges`.\n\nAfter all the initial tables have been created:\n\n``` console\n$ psql beerSO \u003c ./sql/final_post.sql\n```\n\nFor some additional indexes and tables, you can also execute the following:\n\n``` console\n$ psql beerSO \u003c ./sql/optional_post.sql\n```\n\nIf you give a schema name using the `-n` switch, all the tables will be moved\nto the given schema. This schema will be created by the script.\n\nThe paths are not changed in the final scripts `sql/final_post.sql` and\n`sql/optional_post.sql`. To run them, first set the _search_path_ to your\nschema name: `SET search_path TO \u003cmyschema\u003e;`\n\n\n## Caveats and TODOs\n\n - It prepares some indexes and views which may not be necessary for your analysis.\n - The `Body` field in the `Posts` table is NOT populated by default. 
You have to use the `--with-post-body` argument to include it.\n - The `EmailHash` field in the `Users` table is NOT populated.\n\n### Sept 2011 data dump\n\n - The `tags.xml` file is missing from the data dump. Hence, the `PostTag` and `UserTagQA` tables will be empty after `final_post.sql`.\n - The `ViewCount` in `Posts` is sometimes equal to an `empty` value. It is replaced by `NULL` in those cases.\n\n\n## Acknowledgement\n\n - [@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.\n - [@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into the 2020s.\n - [@rdrg109](https://github.com/rdrg109) simplified handling of non-public schemas and fixed bugs associated with re-importing tables.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworks-learning%2Fstackexchange-dump-to-postgres","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetworks-learning%2Fstackexchange-dump-to-postgres","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetworks-learning%2Fstackexchange-dump-to-postgres/lists"}