https://github.com/varunu28/aadhar-dataset-analysis

# AADHAR-Dataset-Analysis
Data analysis of AADHAR dataset using Apache Spark

#### Technologies Used
- Spark
- Scala
- Spark SQL
- Linux Shell Scripting

#### Initial Data Cleaning

- Removed the header row containing the column names (done in Scala)
- Replaced NULL values with 0, on the assumption that a missing value means zero (done with UNIX `sed`)
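The two cleaning steps above can be sketched in Scala as follows. This is a minimal illustration, not the repository's actual code: the file path, the object name `CleanAadhaar`, the helper `zeroFillNulls`, and the choice to treat both empty fields and the literal text `NULL` as missing are all assumptions. The NULL replacement is shown in Scala here only to keep the example in one language; the project itself did that step with `sed`.

```scala
import org.apache.spark.sql.SparkSession

object CleanAadhaar {
  // Rewrites empty fields and the literal "NULL" (both assumed encodings of a
  // missing value) as "0". Kept as a pure function so it can be checked without Spark.
  def zeroFillNulls(line: String): String =
    line.split(",", -1).map { field =>
      val t = field.trim
      if (t.isEmpty || t.equalsIgnoreCase("NULL")) "0" else field
    }.mkString(",")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aadhaar-clean")
      .master("local[*]")
      .getOrCreate()

    val raw = spark.sparkContext.textFile("aadhaar_data.csv") // hypothetical path
    val header = raw.first()                                  // first line is assumed to be the header
    val cleaned = raw.filter(_ != header).map(zeroFillNulls)  // drop header, then zero-fill

    cleaned.take(5).foreach(println)
    spark.stop()
  }
}
```

Splitting with a limit of `-1` keeps trailing empty fields, so a line that ends in a comma still yields the right number of columns.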

#### Creating a DataFrame

The DataFrame used for the analysis is created from a case class whose fields correspond to the column names in the input data.
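A minimal sketch of this step, assuming a reduced set of columns. The case class `Enrolment`, its field names, the column order, and the file path are hypothetical; the real case class would mirror the header of the actual input file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical subset of columns; field names and order are assumptions.
case class Enrolment(agency: String, state: String, district: String,
                     gender: String, generated: Int)

object BuildDataFrame {
  // Parses one cleaned CSV line into the case class.
  def parse(line: String): Enrolment = {
    val f = line.split(",", -1)
    Enrolment(f(0), f(1), f(2), f(3), f(4).trim.toInt)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aadhaar-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF() on an RDD of case classes

    val df: DataFrame = spark.sparkContext
      .textFile("aadhaar_cleaned.csv") // hypothetical path to the cleaned file
      .map(parse)
      .toDF()

    df.printSchema()
    df.createOrReplaceTempView("aadhaar") // makes the data queryable from Spark SQL
    spark.stop()
  }
}
```

Defining the case class at the top level, outside the object, lets Spark derive the schema from its fields.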

## Questions Answered About the Data

#### Count for number of participants and count for each gender
- Number of Male Participants = 102037
- Number of Female Participants = 120225
- Total Number of Participants = 222281
- Number of records with unspecified gender (T) = 19
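A per-gender breakdown like the one above could be computed roughly as follows. The README does not state whether the figures are record counts or sums of generated identities, so this sketch reports both. The column names `gender` and `generated` are assumptions, and the rows are toy data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit, sum}

object GenderCounts {
  // One row per gender code with the number of records and the summed identities.
  def byGender(df: DataFrame): DataFrame =
    df.groupBy("gender")
      .agg(count(lit(1)).as("records"), sum("generated").as("identities"))
      .orderBy("gender")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gender-counts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows for illustration only, not taken from the real dataset.
    val df = Seq(("M", 3), ("F", 2), ("F", 4), ("T", 1)).toDF("gender", "generated")

    byGender(df).show()
    println(s"Total records: ${df.count()}")
    spark.stop()
  }
}
```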

#### Count the number of identities (Aadhaar) generated by each Enrollment Agency and get the Top 3
- CSC SPV : 85088
- Rajcomp Info Services Ltd : 16356
- Mahaonline Limited : 7749
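Since the project lists Spark SQL among its technologies, a ranking like this one can be expressed as a SQL query over a temporary view. The view name `aadhaar`, the column names `agency` and `generated`, and the agency names in the toy data are assumptions made for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TopAgencies {
  // Sums identities per agency from a temp view named "aadhaar" (assumed name)
  // and keeps the n largest totals.
  def topAgencies(spark: SparkSession, n: Int): DataFrame =
    spark.sql(
      s"""SELECT agency, SUM(generated) AS identities
         |FROM aadhaar
         |GROUP BY agency
         |ORDER BY identities DESC
         |LIMIT $n""".stripMargin)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("top-agencies")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows for illustration only, not taken from the real dataset.
    Seq(("Agency A", 5), ("Agency B", 2), ("Agency A", 1), ("Agency C", 9))
      .toDF("agency", "generated")
      .createOrReplaceTempView("aadhaar")

    topAgencies(spark, 3).show(truncate = false)
    spark.stop()
  }
}
```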

#### Top 10 districts with maximum identities generated for both Male and Female
- East Champaran : 3700
- Jaipur : 3144
- West Champaran : 2619
- East Khasi Hills : 2481
- Siwan : 2402
- Muzaffarpur : 2250
- Bharatpur : 1999
- Agra : 1865
- Ahmedabad : 1851
- Shrawasti : 1810

#### Bottom 10 districts with minimum identities generated for both Male and Female
- Serchhip : 0
- Yanam : 1
- Nicobar : 1
- North Sikkim : 1
- Dibang Valley : 1
- Anjaw : 1
- Tirap : 2
- Mokokchung : 2
- North Cachar Hills : 2
- Narayanpur : 3

*Comparing the top 10 and bottom 10 shows that well-known districts are comparatively easy to cover when issuing Aadhaar, while work still needs to be done in remote areas.*
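The top and bottom district rankings differ only in sort direction. A minimal sketch, assuming columns named `district` and `generated` and using toy rows:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object DistrictRanking {
  // Total identities per district across all genders.
  def totals(df: DataFrame): DataFrame =
    df.groupBy("district").agg(sum("generated").as("identities"))

  // n districts with the most identities.
  def top(df: DataFrame, n: Int): DataFrame =
    totals(df).orderBy(col("identities").desc).limit(n)

  // n districts with the fewest identities.
  def bottom(df: DataFrame, n: Int): DataFrame =
    totals(df).orderBy(col("identities").asc).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("district-ranking")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows for illustration only, not taken from the real dataset.
    val df = Seq(("District A", 3), ("District B", 0), ("District C", 1), ("District A", 2))
      .toDF("district", "generated")

    top(df, 10).show(truncate = false)
    bottom(df, 10).show(truncate = false)
    spark.stop()
  }
}
```

Summing the generated column, rather than counting rows, is what allows a district to appear with a total of 0.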

#### Top 3 States by number of identities generated for both Male and Female
- Uttar Pradesh : 50254
- Bihar : 29842
- Rajasthan : 20744

#### Bottom 3 States by number of identities generated for both Male and Female
- Lakshadweep : 14
- Dadra and Nagar Haveli : 27
- Daman and Diu : 45

#### Top 3 States by number of identities generated for Female
- Uttar Pradesh : 26063
- Bihar : 15353
- Rajasthan : 11404

#### Bottom 3 States by number of identities generated for Female
- Lakshadweep : 6
- Others : 17
- Dadra and Nagar Haveli : 21

#### Top 3 States by number of identities generated for Male
- Uttar Pradesh : 24191
- Bihar : 14489
- Rajasthan : 9340

#### Bottom 3 States by number of identities generated for Male
- Dadra and Nagar Haveli : 6
- Lakshadweep : 8
- Daman and Diu : 17

*The gender-wise distribution follows the same trend as the overall distribution.*
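One way to compare the per-gender and combined state totals side by side is a pivot on the gender column. This is a sketch under assumptions, not the repository's method: the column names `state`, `gender`, and `generated` and the codes `F` and `M` are assumed, and the rows are toy data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object StateByGender {
  // One row per state with female, male, and combined (F + M) identity totals.
  def byState(df: DataFrame): DataFrame =
    df.groupBy("state")
      .pivot("gender", Seq("F", "M")) // one output column per listed gender code
      .sum("generated")
      .na.fill(0)                     // states with no rows for a gender get 0
      .withColumn("total", col("F") + col("M"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("state-by-gender")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows for illustration only, not taken from the real dataset.
    val df = Seq(("State A", "F", 3), ("State A", "M", 2), ("State B", "F", 1))
      .toDF("state", "gender", "generated")

    byState(df).orderBy(col("total").desc).show(truncate = false)
    spark.stop()
  }
}
```

Sorting the result by `F`, `M`, or `total` gives the three rankings from a single table, which makes it easy to see whether they follow the same order.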