https://github.com/anshul619/big-data
airflow apache-spark batch-processing big-data data-analytics data-engineering data-lake data-warehouse hadoop looker map-reduce redshift stream-processing
- Host: GitHub
- URL: https://github.com/anshul619/big-data
- Owner: Anshul619
- Created: 2025-08-08T08:46:54.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-08-08T14:22:37.000Z (5 months ago)
- Last Synced: 2025-08-08T16:15:49.989Z (5 months ago)
- Topics: airflow, apache-spark, batch-processing, big-data, data-analytics, data-engineering, data-lake, data-warehouse, hadoop, looker, map-reduce, redshift, stream-processing
- Homepage:
- Size: 281 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
# Layers in Big Data Architecture
| Layer | Description | Remarks |
|--------------------------|------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| Data Ingestion/Streaming | [Bring your data](DataIngestion.md) into your data platform. | |
| Data Processing | Create your [data processing pipelines](DataProcessing). | [Apache Spark vs MapReduce vs Flink vs Storm vs Kafka Streams](DataProcessing/SparkVsMapReduceVsFlinkVsStorm.md) |
| Data Cataloging | Store your metadata. | |
| Data Storage             | Store your [structured and unstructured data](DataStorage).                                    | [Data warehouses vs. data lakes](DataStorage/DataWarehousesVsLake.md)                                            |
| Data Consumption | Enable your user personas for [purpose-built analytics](DataConsumption) and machine learning. | |
| Security and governance  | Protect your data across the layers and manage data access.                                    |                                                                                                                  |
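
To make the layers above concrete, here is a minimal in-memory sketch of how they might hand data to each other. Everything in it is an assumption made for illustration: the `DataPlatform` class, its method names, the `user,amount` input format, and the role names are hypothetical and do not come from this repository or from any real framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataPlatform:
    """Toy stand-in for the layers in the table above (hypothetical names)."""
    storage: Dict[str, List[dict]] = field(default_factory=dict)  # Data Storage
    catalog: Dict[str, List[str]] = field(default_factory=dict)   # Data Cataloging
    readers: Dict[str, Set[str]] = field(default_factory=dict)    # Security and governance

    def ingest(self, raw_lines: List[str]) -> List[dict]:
        """Data Ingestion: parse raw 'user,amount' lines into records."""
        records = []
        for line in raw_lines:
            user, amount = line.split(",")
            records.append({"user": user.strip(), "amount": float(amount)})
        return records

    def process(self, records: List[dict]) -> List[dict]:
        """Data Processing: aggregate the total amount per user."""
        totals: Dict[str, float] = {}
        for record in records:
            totals[record["user"]] = totals.get(record["user"], 0.0) + record["amount"]
        return [{"user": u, "total": t} for u, t in sorted(totals.items())]

    def store(self, dataset: str, rows: List[dict], allowed: Set[str]) -> None:
        """Storage + Cataloging + Governance: keep rows, column names, and readers."""
        self.storage[dataset] = rows
        self.catalog[dataset] = sorted(rows[0].keys()) if rows else []
        self.readers[dataset] = set(allowed)

    def query(self, dataset: str, role: str) -> List[dict]:
        """Data Consumption: return rows only to roles allowed to read them."""
        if role not in self.readers.get(dataset, set()):
            raise PermissionError(f"role {role!r} may not read {dataset!r}")
        return self.storage[dataset]


platform = DataPlatform()
records = platform.ingest(["alice, 10.5", "bob, 3", "alice, 4.5"])
platform.store("spend_per_user", platform.process(records), allowed={"analyst"})
print(platform.catalog["spend_per_user"])           # ['total', 'user']
print(platform.query("spend_per_user", "analyst"))  # alice -> 15.0, bob -> 3.0
```

In a real platform each method would be a separate service (for example a queue for ingestion and an object store for storage); the sketch only shows the order in which the layers touch the data.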
# General Use Cases of Big Data Processing
| Use Case | Processing Type | Remarks |
|--------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| :star: Fraud Detection | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Fraud detection systems need to determine if the usage patterns of a credit card have unexpectedly changed, and block the card if it is likely to have been stolen. |
| :star: Financial Stock Market | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Trading systems need to examine price changes in a financial market and execute trades according to specified rules. |
| Log analytics                        | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Log files generated by servers or applications. |
| User events in apps (e.g. clickstreams) | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Customer interaction data from a web or mobile application. |
| Manufacturing Systems | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Manufacturing systems need to monitor the status of machines in a factory, and quickly identify the problem if there is a malfunction. |
| Military Systems                     | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Military and intelligence systems need to track the activities of a potential aggressor, and raise the alarm if there are signs of an attack. |
| Stream Analytics                     | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Measuring the rate of some type of event (how often it occurs per time interval); calculating the rolling average of a value over some time period; comparing current statistics to previous time intervals (e.g. to detect trends or to alert on metrics that are unusually high or low compared to the same time last week). |
| :star: Data from IoT sensors | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | Internet of Things (IoT), ad tech, gaming etc. |
| Payment Processing Systems | [Stream Processing](DataProcessing/ProcessingTypes/StreamProcessing.md) | |
| :star: ETL Pipeline | [Batch Processing](DataProcessing/ProcessingTypes/BatchProcessing.md) | [Read more](ETL.md) |
| Building indexes for search DBs | [Batch Processing](DataProcessing/ProcessingTypes/BatchProcessing.md) | [Apache Hadoop](ApacheHadoop/Readme.md) can be used to build indexes for [Lucene/Solr](https://github.com/Anshul619/HLD-System-Designs/blob/main/1_Databases/9_Search-Databases/Readme.md). |
| Recommendation System                | [Batch Processing](DataProcessing/ProcessingTypes/BatchProcessing.md)   | [50-100 MapReduce jobs](DataProcessing/ApacheMapReduce/Readme.md) are used for the recommendation system at Google. |
| Ranking System | [Batch Processing](DataProcessing/ProcessingTypes/BatchProcessing.md) | |
| Machine learning systems | [Batch Processing](DataProcessing/ProcessingTypes/BatchProcessing.md) | Example - Classifiers (spam filters, anomaly detection, image recognition etc.) |
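
The stream analytics row above (event rate, rolling average, comparison with an earlier interval) can be sketched with a simple sliding window. This is a minimal single-process illustration, not how a stream processor such as Flink or Kafka Streams implements windows. The class `SlidingWindowStats`, the helper `is_unusual`, the 60-second window, and the 50% tolerance are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindowStats:
    """Keeps the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)

    def add(self, timestamp: float, value: float) -> None:
        """Record an event; timestamps are assumed to arrive in order."""
        self._events.append((timestamp, value))
        # Evict events that have fallen out of the window.
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            self._events.popleft()

    def rate(self) -> float:
        """Events per second within the current window."""
        return len(self._events) / self.window_seconds

    def rolling_average(self) -> Optional[float]:
        """Mean value within the current window, or None if it is empty."""
        if not self._events:
            return None
        return sum(value for _, value in self._events) / len(self._events)


def is_unusual(current: float, baseline: float, tolerance: float = 0.5) -> bool:
    """True if `current` deviates from `baseline` by more than `tolerance` (a fraction)."""
    return abs(current - baseline) > tolerance * baseline


stats = SlidingWindowStats(window_seconds=60)
for ts, latency_ms in [(0, 100.0), (20, 200.0), (50, 300.0), (70, 400.0)]:
    stats.add(ts, latency_ms)

print(stats.rolling_average())                             # 300.0 (event at t=0 was evicted)
print(stats.rate())                                        # 0.05 events per second
print(is_unusual(stats.rolling_average(), baseline=150.0)) # True
```

The baseline passed to `is_unusual` would typically be the same statistic from an earlier interval, such as the same hour last week.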
# Various Services in Data layers

# How can we define big data?
| | Remarks |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data Volume | **100s of TB** to PB-scale and higher |
| Architecture | Parallel processing is often involved, using [Hadoop](ApacheHadoop/Readme.md), [Spark](DataProcessing/ApacheSpark/Readme.md), or [data warehouse](DataStorage/DataWarehouses/Readme.md) platforms. |
| Necessity | **Processing of data sets too large** for operational databases |
| Nominally    | Big data tech is sometimes imposed on small data problems                                                                                                                                   |
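
The parallel processing in the Architecture row usually follows a map, shuffle, reduce shape. The sketch below simulates that shape in a single Python process by building a small inverted index. It does not use the Hadoop or Spark APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: collapse the values for one key into a sorted posting list."""
    return word, sorted(doc_ids)


def build_index(documents: Dict[str, str]) -> Dict[str, List[str]]:
    # Each map_phase call depends only on its own document, which is what
    # lets a real cluster run the map step on many machines at once.
    mapped = [pair for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


docs = {"d1": "big data pipelines", "d2": "stream data", "d3": "big batch jobs"}
index = build_index(docs)
print(index["data"])  # ['d1', 'd2']
print(index["big"])   # ['d1', 'd3']
```

On a dataset that fits in one process like this, the extra machinery buys nothing, which is the point of the "Nominally" row: the pattern pays off only when the data is too large for a single machine.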
# Read more
- [AWS Summit ASEAN 2023 | Simplify data management with modern data architecture on AWS (INSO203)](https://www.youtube.com/watch?v=hwF0AZaUc6U)
- [What is Data Pipeline? | Why Is It So Popular?](https://www.youtube.com/watch?v=kGT4PcTEPP8)