https://github.com/brooksian/censussipp
Reproducing Census SIPP Reports Using Apache Spark
- Host: GitHub
- URL: https://github.com/brooksian/censussipp
- Owner: BrooksIan
- License: unlicense
- Created: 2019-03-06T21:10:13.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-03-06T21:10:19.000Z (about 6 years ago)
- Last Synced: 2025-01-19T16:16:02.189Z (3 months ago)
- Topics: spark, sparksql, zeppelin-notebook
- Language: Shell
- Size: 1.95 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Data Science in Apache Spark
### Census - SIPP Workbook
#### Report Building

**Level**: Easy

**Language**: Scala

**Requirements**:
- HDP 2.6.X
- Spark 2.x

**Author**: Ian Brooks

**Follow** [LinkedIn - Ian Brooks PhD](https://www.linkedin.com/in/ianrbrooksphd/)
## Source File Description
Upload the source data file [CFS 2012 csv](https://www.census.gov/econ/cfs/pums.html) to HDFS in the /tmp directory.

## Column Descriptions
## Pre-Run Instructions
### For HDP with Apache Zeppelin
1. Log into Apache Ambari.
2. In Ambari, select "Files View" and upload all of the CSV files to the /tmp/ directory. For assistance, please use the following [tutorial](https://fr.hortonworks.com/tutorial/loading-and-querying-data-with-hadoop/).
3. Upload the source data file [CFS 2012 csv](https://www.census.gov/econ/cfs/pums.html) to HDFS in the /tmp directory.
4. Upload all of the helper files to HDFS in the /tmp directory:
   a. SIPP08A.csv
   b. SIPP08B.csv
   c. SIPP08C.csv
   d. SIPP08D.csv
5. Download the Zeppelin Note [JSON file](https://github.com/BrooksIan/CensusSIPP) and import it into Zeppelin. For assistance, please use the following [tutorial](https://hortonworks.com/tutorial/getting-started-with-apache-zeppelin/).
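
As an alternative to the Ambari Files View, the helper-file upload in the steps above could be scripted from a shell on a cluster node. The following is only a minimal sketch and is not the repository's loaddata.sh: it assumes the `hdfs` client is on the PATH and that the four helper CSVs sit in the current directory. The names `upload_file` and `DRY_RUN` are hypothetical and introduced here for illustration. By default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: copy the SIPP helper files into HDFS under /tmp.
# Assumption: the CSV files are in the current directory and `hdfs` is on PATH.
set -euo pipefail

HDFS_DIR="/tmp"   # target directory named in the instructions
HELPER_FILES=(SIPP08A.csv SIPP08B.csv SIPP08C.csv SIPP08D.csv)

# DRY_RUN=1 (the default here) prints each command instead of running it.
# Set DRY_RUN=0 to perform the upload.
DRY_RUN="${DRY_RUN:-1}"

upload_file() {
  local file="$1"
  if [ "${DRY_RUN}" = "1" ]; then
    echo "hdfs dfs -put -f ${file} ${HDFS_DIR}/"
  else
    # -f overwrites an existing file of the same name in HDFS
    hdfs dfs -put -f "${file}" "${HDFS_DIR}/"
  fi
}

for f in "${HELPER_FILES[@]}"; do
  upload_file "$f"
done

# List the target directory to confirm the upload (skipped in dry-run mode).
if [ "${DRY_RUN}" != "1" ]; then
  hdfs dfs -ls "${HDFS_DIR}"
fi
```

Running it with `DRY_RUN=0` performs the upload. The source data file from step 3 can be added to `HELPER_FILES` once it has been downloaded under a known name.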
### For Cloudera Data Science Workbench
1. Log into CDSW and upload the project
2. Open a terminal on a session and run the loaddata.sh script.

## License
Unlike other Apache projects, which use the Apache License, this project uses an advanced and modern license named The Star And Thank Author License (SATA). Please see the [LICENSE](LICENSE) file for more information.