{"id":30369392,"url":"https://github.com/dsacms/npd_nucc_slurp","last_synced_at":"2026-02-15T22:03:38.301Z","repository":{"id":306879520,"uuid":"1024475356","full_name":"DSACMS/npd_nucc_slurp","owner":"DSACMS","description":"because the CSV makes calculating the parent difficult.","archived":false,"fork":false,"pushed_at":"2025-08-15T17:03:38.000Z","size":1629,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-15T19:26:35.919Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DSACMS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-22T18:56:17.000Z","updated_at":"2025-08-15T17:03:41.000Z","dependencies_parsed_at":"2025-07-29T23:05:26.146Z","dependency_job_id":"a7d7f128-209d-433a-968a-5ca9045ad257","html_url":"https://github.com/DSACMS/npd_nucc_slurp","commit_stats":null,"previous_names":["dsacms/ndh_nucc_slurp","dsacms/npd_nucc_slurp"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DSACMS/npd_nucc_slurp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSACMS%2Fnpd_nucc_slurp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSACMS%2Fnpd_nucc_slurp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSACMS%2Fnpd_nucc_slurp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSACMS%2Fnpd_nucc_slurp/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DSACMS","download_url":"https://codeload.github.com/DSACMS/npd_nucc_slurp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DSACMS%2Fnpd_nucc_slurp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271252993,"owners_count":24726918,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-20T02:00:09.606Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-20T02:15:31.897Z","updated_at":"2026-02-15T22:03:38.294Z","avatar_url":"https://github.com/DSACMS.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NUCC Slurp\n\nA comprehensive toolkit for scraping and analyzing NUCC (National Uniform Claim Committee) taxonomy data from the official taxonomy website.\n\n## Approach and Purpose\n\nThe [web version](https://taxonomy.nucc.org/) of the NUCC taxonomy code list has slightly different data than the CSV download version.\nFirst, it has explicit parent linkage on a per-taxonomy basis. 
Second, it includes rows for groupings that are not directly in the CSV (the title of the grouping can be inferred, but the description of the grouping cannot).\n\nAs a result, this project both slurps the web version and processes the CSV version to create a merged picture of the NUCC code set.\nIt also creates a path CSV file for fast numerical querying, and a CSV file that documents the various \"sources\" in the notes sections.\n\n## Overview\n\nThis project provides a complete pipeline for extracting, processing, and analyzing NUCC taxonomy codes and their hierarchical relationships. The NUCC taxonomy is used to classify healthcare provider types and specialties in the United States.\n\n## Scripts and Execution Order\n\nThe scripts should be executed in the following order:\n\n### 1. `scrape_nucc_ancestors.py`\n\n**Purpose**: Scrapes the main NUCC taxonomy website to extract hierarchical relationships between codes.\n\n**What it does**:\n\n- Fetches HTML from \u003chttps://taxonomy.nucc.org/\u003e\n- Parses the JavaScript treenodes data structure\n- Extracts all ancestor-child relationships in the taxonomy hierarchy\n- Creates self-referencing relationships (each code is its own ancestor)\n\n**Output**: `data/nucc_parent_code.csv` with columns:\n\n- `ancestor_nucc_code_id`: The ancestor code ID\n- `child_nucc_code_id`: The child code ID\n\n**Usage**:\n\n```bash\npython3 scrape_nucc_ancestors.py\n```\n\n### 2. 
`scrape_nucc_nodes.py`\n\n**Purpose**: Scrapes detailed information for each individual taxonomy code from the NUCC API.\n\n**What it does**:\n\n- Reads all unique node IDs from `data/nucc_parent_code.csv`\n- Downloads detailed information for each node from the NUCC API\n- Parses HTML content to extract structured data (name, definition, notes, etc.)\n- Caches HTML snippets in `data/tables/` for analysis\n- Uses intelligent caching to avoid re-downloading recently fetched data\n\n**Output**:\n\n- `data/nucc_codes.csv` with detailed code information\n- `data/tables/node_*.html` files containing raw HTML snippets\n\n**Usage**:\n\n```bash\npython3 scrape_nucc_nodes.py\n```\n\n### 3. `parse_nucc_sources.py`\n\n**Purpose**: Extracts and structures source information from the notes column of the NUCC codes.\n\n**What it does**:\n\n- Parses the `code_notes` column from `data/nucc_codes.csv`\n- Extracts source citations that follow the pattern \"Source: text [date: note]\"\n- Automatically extracts URLs from source text\n- Handles multiple sources per code\n- Creates normalized source records\n\n**Output**: `data/nucc_sources.csv` with columns:\n\n- `nucc_code_id`: The NUCC code ID\n- `full_source_text`: Complete source text\n- `source_date`: Date from source citation\n- `source_date_note`: Note from source citation\n- `extracted_urls`: URLs found in source text\n\n**Usage**:\n\n```bash\npython3 parse_nucc_sources.py\n```\n\n### 4. 
`compare_nucc_data.py`\n\n**Purpose**: Compares scraped data with official NUCC taxonomy CSV files to identify differences.\n\n**What it does**:\n\n- Loads both the scraped data and an official NUCC taxonomy CSV\n- Performs outer join on taxonomy codes\n- Identifies codes that exist in only one dataset\n- Creates a merged dataset with all available information\n- Generates summary statistics and reports\n\n**Output**:\n\n- `data/merged_nucc_data.csv`: Combined dataset from both sources\n- `data/nucc_comparison_summary.txt`: Summary report of differences\n\n**Usage**:\n\n```bash\npython3 compare_nucc_data.py --download_csv /path/to/official/nucc_taxonomy.csv --scrapped_csv ./data/nucc_codes.csv\n```\n\n## Data Files Generated\n\n- `data/nucc_parent_code.csv`: Hierarchical relationships between codes\n- `data/nucc_codes.csv`: Detailed information for each taxonomy code\n- `data/nucc_sources.csv`: Structured source information\n- `data/merged_nucc_data.csv`: Comparison between scraped and official data\n- `data/nucc_comparison_summary.txt`: Summary of data comparison\n- `data/tables/`: Directory containing raw HTML snippets for each code\n\n## Requirements\n\nInstall the required Python packages:\n\n```bash\npip install -r requirements.txt\n```\n\n## Features\n\n- **Intelligent Caching**: Avoids re-downloading recently fetched data\n- **Robust Error Handling**: Gracefully handles network issues and parsing errors\n- **Future-Proof Parsing**: Automatically detects and includes new data fields\n- **URL Extraction**: Automatically extracts and normalizes URLs from source text\n- **Data Validation**: Includes data cleaning and validation steps\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsacms%2Fnpd_nucc_slurp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdsacms%2Fnpd_nucc_slurp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdsacms%2Fnpd_nucc_slurp/lists"}