https://github.com/varunu28/aadhar-dataset-analysis
Data analysis of AADHAR dataset using Apache Spark
- Host: GitHub
- URL: https://github.com/varunu28/aadhar-dataset-analysis
- Owner: varunu28
- Created: 2017-11-29T17:36:44.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-03-30T22:52:09.000Z (over 7 years ago)
- Last Synced: 2025-04-23T01:57:51.087Z (6 months ago)
- Topics: analysis, scala, spark, spark-sql
- Language: Scala
- Size: 1.82 MB
- Stars: 7
- Watchers: 1
- Forks: 9
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# AADHAR-Dataset-Analysis
Data analysis of AADHAR dataset using Apache Spark
#### Technologies Used
- Spark
- Scala
- Spark SQL
- Linux Shell Scripting
#### Initial Data Cleaning
- Removing the header row containing column names (done using Scala)
- Replacing NULL values with 0 (done using UNIX `sed`)
#### Creating a DataFrame
The DataFrame used for the analysis is created from a case class whose fields correspond to the column names in the input data.
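
The cleaning and DataFrame steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the `AadhaarRecord` case class, its field names and order, the comma delimiter, and the `AadhaarLoader` object are all hypothetical and need to be adapted to the real input file. Empty fields are replaced with 0 in Scala here, standing in for the `sed` step.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical schema: field names and order are assumptions, not taken from the repository.
case class AadhaarRecord(
  registrar: String,
  enrolmentAgency: String,
  state: String,
  district: String,
  gender: String,
  age: Int,
  aadhaarGenerated: Int,
  enrolmentRejected: Int
)

object AadhaarLoader {
  // Treat empty or "NULL" numeric fields as 0, mirroring the cleaning rule above.
  def toIntOrZero(s: String): Int = {
    val t = s.trim
    if (t.isEmpty || t.equalsIgnoreCase("null")) 0 else t.toInt
  }

  // Parse one comma-separated line. The limit of -1 keeps trailing empty fields.
  // Note: a plain split does not handle quoted fields that contain commas.
  def parseLine(line: String): AadhaarRecord = {
    val f = line.split(",", -1).map(_.trim)
    AadhaarRecord(f(0), f(1), f(2), f(3), f(4),
      toIntOrZero(f(5)), toIntOrZero(f(6)), toIntOrZero(f(7)))
  }

  // Read the file, drop the header row, and convert each line to a typed row.
  def load(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    val lines = spark.sparkContext.textFile(path)
    val header = lines.first() // the first line holds the column names
    lines.filter(_ != header).map(parseLine).toDF()
  }
}
```

With a `SparkSession` named `spark`, `AadhaarLoader.load(spark, path)` would return a DataFrame whose columns match the case class fields.
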
## Questions Answered about data
#### Count for number of participants and count for each gender
- Number of Male Participants = 102037
- Number of Female Participants = 120225
- Total Number of Participants = 222281
- Number of records with unspecified gender (T) = 19
#### Count the number of identities (Aadhaar) generated by each Enrollment Agency and get Top 3
- CSC SPV : 85088
- Rajcomp Info Services Ltd : 16356
- Mahaonline Limited : 7749
#### Top 10 districts with maximum identities generated for both Male and Female
- East Champaran : 3700
- Jaipur : 3144
- West Champaran : 2619
- East Khasi Hills : 2481
- Siwan : 2402
- Muzaffarpur : 2250
- Bharatpur : 1999
- Agra : 1865
- Ahmedabad : 1851
- Shrawasti : 1810
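
A ranking like the one above is typically a group, sum, sort, and limit. The sketch below shows one way to do this with the DataFrame API; the `Rankings` object, the `rankBy` helper, and the column names `district` and `aadhaarGenerated` are hypothetical and not taken from the repository.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

object Rankings {
  // Sum `valueCol` per `groupCol` and keep the n largest groups
  // (or the n smallest when `ascending` is true).
  def rankBy(df: DataFrame, groupCol: String, valueCol: String,
             n: Int, ascending: Boolean = false): DataFrame = {
    val totals = df.groupBy(col(groupCol)).agg(sum(col(valueCol)).as("total"))
    // The group name is a secondary sort key so that ties are ordered deterministically.
    val ordered =
      if (ascending) totals.orderBy(col("total").asc, col(groupCol).asc)
      else totals.orderBy(col("total").desc, col(groupCol).asc)
    ordered.limit(n)
  }
}
```

For example, `Rankings.rankBy(df, "district", "aadhaarGenerated", 10).show()` would print the ten districts with the highest totals, and the same call with an agency column and `n = 3` would rank enrollment agencies.
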
#### Bottom 10 districts with minimum identities generated for both Male and Female
- Serchhip : 0
- Yanam : 1
- Nicobar : 1
- North Sikkim : 1
- Dibang Valley : 1
- Anjaw : 1
- Tirap : 2
- Mokokchung : 2
- North Cachar Hills : 2
- Narayanpur : 3
*Comparing the top 10 and bottom 10 shows that well-known districts are readily covered by Aadhaar enrolment, while work still needs to be done in remote areas.*
#### Top 3 States With number of identities generated for both Male and Female
- Uttar Pradesh : 50254
- Bihar : 29842
- Rajasthan : 20744
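
Since Spark SQL is among the technologies listed, a state-level ranking like the one above could also be written as a SQL query over a temporary view. This is only a sketch: the view name `aadhaar`, the `StateRanking` object, and the column names `state` and `aadhaarGenerated` are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StateRanking {
  // Register the DataFrame as a temporary view and rank states with plain SQL.
  def topStates(spark: SparkSession, df: DataFrame, n: Int): DataFrame = {
    df.createOrReplaceTempView("aadhaar") // hypothetical view name
    spark.sql(
      s"""SELECT state, SUM(aadhaarGenerated) AS total
         |FROM aadhaar
         |GROUP BY state
         |ORDER BY total DESC
         |LIMIT $n""".stripMargin)
  }
}
```

Replacing `DESC` with `ASC` in the query would return the states with the fewest identities instead.
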
#### Bottom 3 States With number of identities generated for both Male and Female
- Lakshadweep : 14
- Dadra and Nagar Haveli : 27
- Daman and Diu : 45
#### Top 3 States With number of identities generated for Female
- Uttar Pradesh : 26063
- Bihar : 15353
- Rajasthan : 11404
#### Bottom 3 States With number of identities generated for Female
- Lakshadweep : 6
- Others : 17
- Dadra and Nagar Haveli : 21
#### Top 3 States With number of identities generated for Male
- Uttar Pradesh : 24191
- Bihar : 14489
- Rajasthan : 9340
#### Bottom 3 States With number of identities generated for Male
- Dadra and Nagar Haveli : 6
- Lakshadweep : 8
- Daman and Diu : 17
*The gender-wise distribution largely follows the same trend as the overall distribution.*
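
One way to compare the per-gender totals side by side is to pivot on the gender column, which gives one row per state with a column per gender. The sketch below assumes hypothetical column names `state`, `gender`, and `aadhaarGenerated`, and assumes the gender codes are `"M"` and `"F"`; the `GenderBreakdown` object is likewise illustrative only.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

object GenderBreakdown {
  // One row per state, with the summed identities for each gender in its own column.
  // Listing the pivot values explicitly avoids an extra pass to discover them.
  def byState(df: DataFrame): DataFrame =
    df.groupBy("state")
      .pivot("gender", Seq("M", "F")) // assumed gender codes
      .agg(sum(col("aadhaarGenerated")))
      .na.fill(0L) // states with no rows for a gender get 0 instead of null
}
```

The resulting `M` and `F` columns can then be sorted independently to obtain the per-gender top and bottom states.
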