{"id":20915259,"url":"https://github.com/brooksian/censusecon","last_synced_at":"2025-07-22T04:04:34.694Z","repository":{"id":182374179,"uuid":"130394201","full_name":"BrooksIan/CensusEcon","owner":"BrooksIan","description":"Data Mining Census ECON using Apache Spark","archived":false,"fork":false,"pushed_at":"2020-09-23T17:53:40.000Z","size":120,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-13T10:43:51.586Z","etag":null,"topics":["spark","sparksql","zeppelin-notebook"],"latest_commit_sha":null,"homepage":"https://www.census.gov/econ/cfs/pums.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BrooksIan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-20T17:31:15.000Z","updated_at":"2022-02-11T15:37:02.000Z","dependencies_parsed_at":"2023-07-19T19:16:21.660Z","dependency_job_id":null,"html_url":"https://github.com/BrooksIan/CensusEcon","commit_stats":null,"previous_names":["brooksian/censusecon"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BrooksIan/CensusEcon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrooksIan%2FCensusEcon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrooksIan%2FCensusEcon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrooksIan%2FCensusEcon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrooksIan%2FCensusEcon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BrooksIan","download_url":"h
ttps://codeload.github.com/BrooksIan/CensusEcon/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrooksIan%2FCensusEcon/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266424014,"owners_count":23926123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["spark","sparksql","zeppelin-notebook"],"created_at":"2024-11-18T16:13:48.290Z","updated_at":"2025-07-22T04:04:34.658Z","avatar_url":"https://github.com/BrooksIan.png","language":"Python","readme":"## Data Science in Apache Spark\n### Census - Econ Workbook\n#### Report Building\n\n**Level**: Easy\n**Language**: Scala\n\n**Requirements**: \n- [HDP 2.6.X]\n- Spark 2.x\n\n**Author**: Ian Brooks\n**Follow**: [LinkedIn - Ian Brooks PhD](https://www.linkedin.com/in/ianrbrooksphd/)\n\n![Census](https://yt3.ggpht.com/a-/AN66SAwtUzDFUsvm7MJszIS3ccgglaFGrw1_Ye16ew=s900-mo-c-c0xffffffff-rj-k-no \"Census\")\n\n\n## File Description\n\nThe PUM file includes 20 variables for all usable shipment records collected by the CFS – a total of 4,547,661 shipments from approximately 60,000 responding establishments.  
The information included on each shipment record is:\n*\tShipment Origin\n    *\tState \n    *\tMetropolitan Area\n*\tShipment Destination (in US)\n    *\tState\n    *\tMetropolitan Area\n*\tNAICS industry classification of the shipper\n*\tQuarter in 2012 in which the shipment was made\n*\tType of commodity \n*\tMode of transportation\n*\tThe value of the shipment (dollars)\n*\tThe weight of the shipment (pounds)\n*\tThe great circle distance between the shipment origin and US destination (in miles)\n*\tThe routed distance between the shipment origin and US destination (in miles) \n*\tWhether or not the shipment required temperature control during transportation\n*\tWhether or not the shipment was an export\n*\tIf an export, the final export destination country\n*\tHazardous material code\n*\tShipment tabulation weighting factor \n\n## Column Descriptions\n\n* SHIPMT_ID\tShipment identifier\t\n* ORIG_STATE\tFIPS state code of shipment origin\t\n* ORIG_MA\tMetro area of shipment origin\t\n* ORIG_CFS_AREA\tCFS Area of shipment origin\tConcatenation of ORIG_STATE and ORIG_MA (ex: 24-12580)\n* DEST_STATE\tFIPS state code of shipment destination\t01-56\n* DEST_MA\tMetro area of shipment destination\tSee Note (1)\n* DEST_CFS_AREA\tCFS Area of shipment destination\tConcatenation of DEST_STATE and DEST_MA (ex: 01-142)\n* NAICS\tIndustry classification of shipper\tSee Note (2)\n* QUARTER\tQuarter of 2012 in which the shipment occurred\t1, 2, 3, 4\n* SCTG\t2-digit SCTG commodity code of the shipment\tSee Note (3)\n* MODE\tMode of transportation of the shipment\tSee Note (4)\n* SHIPMT_VALUE\tValue of the shipment in dollars\t0 - 999,999,999\n* SHIPMT_WGHT\tWeight of the shipment in pounds\t0 - 999,999,999\n* SHIPMT_DIST_GC\tGreat circle distance between shipment origin and destination (in miles)\t0 - 99,999\n* SHIPMT_DIST_ROUTED\tRouted distance between shipment origin and destination (in miles)\t0 - 99,999\n* TEMP_CNTL_YN\tTemperature controlled shipment - Yes or No\tY, 
N\n* EXPORT_YN\tExport shipment - Yes or No\tY, N\n* EXPORT_CNTRY\tExport final destination\t\n\t* C = Canada\n\t* M = Mexico\n\t* O = Other country\n\t* N = Not an export\n* HAZMAT\tHazardous material (HAZMAT) code\t\n\t* P = Class 3.0 Hazmat (flammable liquids)\n\t* H = Other Hazmat\n\t* N = Not Hazmat\n* WGT_FACTOR\tShipment tabulation weighting factor.  (This factor is also an estimate of the total number of shipments represented by the PUM file shipment.)\t0 – 975,000.0\n\n## Pre-Run Instructions\n\n### For HDP with Apache Zeppelin\n1. Log into Apache Ambari.\n\n2. In Ambari, select \"Files View\" and upload all of the CSV files to the /tmp/ directory. For assistance, please use the following [tutorial](https://fr.hortonworks.com/tutorial/loading-and-querying-data-with-hadoop/).\n\n3. Upload the source data file [CFS 2012 CSV](https://www.census.gov/data/datasets/2012/econ/cfs/historical-datasets.html) to HDFS in the /tmp directory.\n\n4. Upload all of the helper files to HDFS in the /tmp directory:\n\na. CFS_2012_table_CFSArea.csv\n\nb. CFS_2012_table_ModeofTrans.csv\n\nc. CFS_2012_table_NAICS.csv\n\nd. CFS_2012_table_SCTG.csv\n\ne. CFS_2012_table_StateCodes.csv\n\n\n5. In Zeppelin, import the Zeppelin Note [JSON file](https://github.com/BrooksIan/CensusEcon). For assistance, please use the following [tutorial](https://hortonworks.com/tutorial/getting-started-with-apache-zeppelin/).\n\n### For Cloudera Data Science Workbench\n1. Log into CDSW and upload the project.\n2. Open a terminal in a session and run the loaddata.sh script.\n\n## License\nUnlike most Apache projects, which use the Apache License, this project uses The Star And Thank Author License (SATA). 
Please see the [LICENSE](LICENSE) file for more information.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrooksian%2Fcensusecon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrooksian%2Fcensusecon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrooksian%2Fcensusecon/lists"}