{"id":27183177,"url":"https://github.com/datatimp/brfss","last_synced_at":"2025-06-14T11:06:18.017Z","repository":{"id":284392761,"uuid":"954795704","full_name":"datatimp/brfss","owner":"datatimp","description":"BRFSS in Postgres for faster, less memory-intensive queries","archived":false,"fork":false,"pushed_at":"2025-04-16T19:21:59.000Z","size":335,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-17T04:17:50.403Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datatimp.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-25T16:18:56.000Z","updated_at":"2025-04-16T19:22:02.000Z","dependencies_parsed_at":"2025-04-01T14:37:51.348Z","dependency_job_id":"bbcca839-c5c8-40ca-99fd-fb292b47a5c5","html_url":"https://github.com/datatimp/brfss","commit_stats":null,"previous_names":["wrinklerelease/brfss","datatimp/brfss"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/datatimp/brfss","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatimp%2Fbrfss","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatimp%2Fbrfss/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatimp%2Fbrfss/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatimp%2Fbrfss/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/owners/datatimp","download_url":"https://codeload.github.com/datatimp/brfss/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datatimp%2Fbrfss/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259804867,"owners_count":22913902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-09T15:55:33.785Z","updated_at":"2025-06-14T11:06:18.001Z","avatar_url":"https://github.com/datatimp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This repository provides instructions for loading the CDC's BRFSS data for 2022 and 2023 into a PostgreSQL database and accessing it from RStudio for analysis. \n\nBefore getting into that, let me jump right into what is, perhaps, of greatest utility: the codebooks for [2023](https://github.com/datatimp/brfss/blob/main/brfss-2023/supplemental/llcp-2023-codebook.md) and [2022](https://github.com/datatimp/brfss/blob/main/brfss-2022/supplemental/llcp-2022-codebook.md) re-formatted in markdown. Markdown's navigable outline feature makes these documents much more manageable.\n\nThe schema presented here divides the BRFSS into multiple tables grouped by section name (as indicated in the LLCP codebook). The resulting SQL db for each year is around half the size of the XPT file. This size reduction, along with the ability to pull only the table or column you need into R, means a faster, more responsive analysis environment. 
\n\n\u003cbr\u003e\n\n# Obtain the BRFSS File\n\nThe CDC hosts the data file in either `.xpt` or `.asc` format. Either format can be downloaded, zipped, from [this location](https://www.cdc.gov/brfss/annual_data/annual_2023.html). Scroll down to the section entitled **Data Files** and choose the format you prefer. I worked with the `.xpt` file as the `.asc` requires a separate file (known as a dictionary file) to correctly assign column names and so on.\n\n\u003cbr\u003e\n\n# Convert XPT to a SQL Friendly Format\n\nI found two methods:\n\n**1. Using a simple `.r` script in RStudio**  \nThe `.r` file I used to convert the XPT file to CSV is included. The resulting csv was slightly larger than the one from method 2, but it didn't contain extra rows, which made it easier to work with.\n\n**2. SAS Universal Viewer**  \nSAS produces the free [SAS Universal Viewer](https://support.sas.com/downloads/browse.htm?cat=74) that will open `.xpt` files and export `.csv`. You'll need an account to download the tool. The csv file it produced was nearly 100MB smaller than the version produced by RStudio; however, it contained 78 _extra rows_ added to the bottom of the file.[^1]\n\n\u003cbr\u003e\n\n# Prepare the CSV\n\n## Acknowledging Discrepancies\n\nAs of the publication of this readme, both years represented here have columns present in the dataset but missing from the codebook (all are listed at the bottom of this readme). I dealt with these columns by never extracting them from the master csv in the first place. 2023 also has a variable in the codebook that is not in the dataset (`rcsborg1`); it was simply ignored when making the individual csv files.\n\n\u003cbr\u003e\n\n# SQL Schema \u0026 Splitting the CSV\n\nMy Postgres schema is based on the codebook, which lists the SAS Variable names along with the Section Name. Each Section is its own table, and each column is a SAS Variable in that section. \n\nTaking the 2023 dataset as an example:\n1. 
Using the `brfss-var-sec-type-2023.csv` file provided in the supplemental folder, cut the master csv file into sections. In Terminal, `cut -d ',' -f[column number(s)] input-file.csv \u003e output-file.csv` does the job. So, `cut -d ',' -f216,217,218,219,220,221,222,223,224,225,226,227,228 input-file.csv \u003e ace.csv` returns what will become the `ace` table as a csv.\n\n2. Once all individual csv files have been created and placed in the `/data` folder, run the `create_table_and_co.py` script, making sure the paths to your files and output are correct. This will produce the commands you'll use to create the tables in SQL and copy the data in using psql, or you can use the `.sql` files I generated. \n\n\u003cbr\u003e\n\n# NULL and Blank Values\n\nThe conversion from XPT to CSV converted the blank values to `NA`. Most of the columns in the SQL db are of datatype INTEGER, which means `NA` isn't compatible. I chose to convert each `NA` to `NULL` when loading the data into Postgres and my psql commands reflect this.\n\n\u003cbr\u003e\n\n# Setting up a Docker PostgreSQL instance\n\nInstall Docker Desktop and get the latest PostgreSQL image. The `docker-compose.yml` files found in the `brfss-2022` and `brfss-2023` directories contain the instructions for starting up your Postgres db. The compose file is structured to give a persistent db even if the container is stopped and restarted. \n\n\u003e [!CAUTION]\n\u003e If you want to keep your Postgres data, spin the container down with `docker-compose down`. Do _not_ use `docker-compose down -v` unless you want to scrub completely and re-initialize. The `-v` flag scrubs the persistent volumes. You can see these volumes with `docker volume ls`. 
Nothing listed means no persistent docker volumes are present.\n\nSince you started the docker container without an `init.db`, the Postgres db exists but has no table or variable data in it.\n\nI solved this by adding the tables in dBeaver, then using psql in terminal to copy the data over. Here are the psql commands:\n\n```shell\n# first: list running containers to find your container name\ndocker ps\n\n# second: open a psql session inside the container\ndocker exec -it your_container_name psql -U your_user_name -d your_db_name\n\n# third: at the psql prompt, copy each csv into its table, reading NA as NULL\n\\copy table_name (\"column-name\",\"column-name\") from '/docker-entrypoint-initdb.d/file_name.csv' WITH (FORMAT CSV, HEADER, NULL 'NA');\n```\n\nCheck to see if any of your tables are empty\n```sql\n-- A 0 indicates no rows have been inserted; thus, empty table\nSELECT schemaname, relname, n_tup_ins \nFROM pg_stat_all_tables \nWHERE schemaname = 'public' \nORDER BY n_tup_ins;\n```\n\nBackup and restore the Postgres db.\n```shell\n# backup sql db\ndocker exec -t your-container-name pg_dump -U username your-db-name \u003e db_name_backup.sql\n\n# restore db (feed the dump file to psql; pg_dump only writes backups)\ndocker exec -i your-container-name psql -U username -d your-db-name \u003c db_name_backup.sql\n```\n\n\u003cbr\u003e\n\n# Connecting to the database from RStudio\n\nEstablish the db connection\n```R\n# install needed packages \ninstall.packages(\"DBI\")\ninstall.packages(\"RPostgres\")\n\n# call packages on script execution\nlibrary(DBI)\nlibrary(RPostgres)\n\n# Replace with your actual database credentials\ncon \u003c- dbConnect(RPostgres::Postgres(),\n                 dbname = \"your_database_name\",\n                 host = \"your_host\",      # e.g., \"localhost\" or an IP address\n                 port = 5432,             # default PostgreSQL port\n                 user = \"your_username\",\n                 password = \"your_password\")\n\n```\n\nCall tables or variables as needed\n```R\n# Replace 'your_table_name' with the actual table name\ndata \u003c- dbReadTable(con, \"your_table_name\")\n\n# Replace 'your_table_name' with the actual table name and specify the 
columns you want\nquery \u003c- \"SELECT column1, column2 FROM your_table_name\"\ndata \u003c- dbGetQuery(con, query)\n```\n\nOnce finished, disconnect\n```R\ndbDisconnect(con)\n```\n\n\u003cbr\u003e\n\n# Supplemental Material\n\nThe following supplemental material is provided for each year.\n\n**Codebook**: The codebook gives the variable name, location, and frequency of values for all reporting areas combined for the landline and cell phone data set. The CDC distributes the codebook as an `.html` file which can be found on their site. Through python scripts, I've converted the codebook into a `.md` file with section and question headers, which greatly reduce seek time when looking up a variable, its SAS code and answer codes.\n\n\u003cbr\u003e\n\n# Handling Errors\n\nDiscrepancies between the CDC's codebook and the dataset existed for both years. \n\nI omitted the columns from my final SQL database that didn't appear in the codebook. \n\n## 2022\n\n| SAS Variable | 1-Based Column No. | In Codebook | In Dataset | Remediation   |\n|--------------|---------------|-------------|------------|--------------------|\n| `diabage4`   | 57            | No          | Yes        | Not included in db |\n| `numphon4`   | 62            | No          | Yes        | Not included in db |\n| `cpdemo1c`   | 63            | No          | Yes        | Not included in db |\n| `usemrjn4`   | 210           | No          | Yes        | Not included in db |\n\n## 2023\n\n| SAS Variable | 1-Based Column No. 
| In Codebook | In Dataset | Remediation          |\n|--------------|---------------|-------------|------------|---------------------------|\n| `rcsborg1`   | NA            | Yes         | No         | Omitted                   |\n| `lndsxbrt`   | 19            | No          | Yes        | Not included in db        |\n| `celsxbrt`   | 25            | No          | Yes        | Not included in db        |\n| `birthsex`   | 205           | No          | Yes        | Not included in db        |\n| `trnsgndr`   | 208           | No          | Yes        | Not included in db        |\n| `usemrjn4`   | 215           | No          | Yes        | Not included in db        |\n| `rcsgend1`   | 252           | No          | Yes        | Not included in db        |\n| `rcsxbrth`   | 253           | No          | Yes        | Not included in db        |\n\n\n\n[^1]: Taking the LLCP2023.XPT file as an example: the original XPT file, when read into RStudio, contains 433,323 rows. The csv file produced by SAS Universal Viewer contains 78 more rows (rows 433,325 - 443,401) most likely due to metadata being written in as extra rows. \u003cbr\u003e\u003cbr\u003e A quick check in `terminal` of the csv created by the SAS viewer using `sed -n '433325p' file-name.csv` shows us empty data for that row. Use `sed -n '433324p' file-name.csv` to see the difference. Line 433,324 is the last line of BRFSS data (the csv starts count on the header row, while R doesn't when reading in an XPT, hence the 1 digit difference in the last line). These need to be trimmed off before going further. For this reason I stuck with the R script.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatatimp%2Fbrfss","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatatimp%2Fbrfss","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatatimp%2Fbrfss/lists"}