https://github.com/cleberzumba/hadoop-in-pseudodistributed-mode

Installation and Configuration of the Big Data Environment with Hadoop and Spark

hadoop spark


Host: GitHub
URL: https://github.com/cleberzumba/hadoop-in-pseudodistributed-mode
Owner: cleberzumba
Created: 2024-04-22T12:43:40.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-05-02T14:31:12.000Z (9 months ago)
Last Synced: 2024-11-08T07:42:53.592Z (3 months ago)
Topics: hadoop, spark
Homepage:
Size: 5.69 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

This repository covers the creation, assembly, installation, configuration and documentation of a Big Data environment that runs Hadoop in pseudo-distributed mode together with Apache Spark, two widely used frameworks for Big Data processing. It also highlights how this environment can be useful for Big Data processing:

*Title: Creating a Big Data environment with Hadoop in Pseudo-Distributed Mode and Apache Spark*

### Creation and Assembly:
From the physical assembly of the machines to the network and storage configuration, every detail was carefully planned. The infrastructure was scaled to support large volumes of data and high processing loads.

### Installation and Configuration:
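Apache Hadoop is the heart of this Big Data environment, providing a distributed file system (HDFS) and a resource management and job scheduling layer (YARN) on which distributed computations run. In pseudo-distributed mode, all Hadoop daemons run as separate processes on a single machine. Additionally, Apache Spark was installed for in-memory processing and near-real-time data analysis.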
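
As a minimal sketch of what the pseudo-distributed configuration involves, the script below generates the two core property files. The values `hdfs://localhost:9000` for `fs.defaultFS` and `1` for `dfs.replication` follow the Apache Hadoop single-node setup guide. The helper names and the target directory are assumptions for illustration and are not taken from this repository.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_configuration(properties):
    """Return a Hadoop-style <configuration> element for the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return root


def write_hadoop_config(conf_dir):
    """Write core-site.xml and hdfs-site.xml for a single-node setup.

    conf_dir is a hypothetical target; on a real install the files
    usually live under $HADOOP_HOME/etc/hadoop.
    """
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    files = {
        # NameNode address: every daemon runs on localhost in this mode.
        "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
        # Only one DataNode exists, so each block is stored exactly once.
        "hdfs-site.xml": {"dfs.replication": 1},
    }
    for filename, properties in files.items():
        tree = ET.ElementTree(build_configuration(properties))
        tree.write(str(conf_dir / filename), encoding="utf-8", xml_declaration=True)
    return sorted(files)
```

Calling `write_hadoop_config("conf-sketch")` writes both files into a scratch directory, so they can be reviewed before being copied into the real Hadoop configuration directory.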

### Utility and Benefits:
This environment offers a series of significant benefits for Big Data processing:

- Real-Time Processing: With Spark's streaming APIs, optionally fed by a message broker such as Apache Kafka, we are able to process streams of data in near real time, enabling fast analysis and timely decision-making.

- Advanced Analytics: Apache Spark provides a distributed processing environment for advanced analytics, from complex queries to machine learning and graph processing.

- Latency Reduction: Because Apache Spark keeps intermediate data in memory instead of writing it to disk between processing stages, we are able to reduce the latency of queries and analyses, speeding up response times for our users.
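
These benefits only materialize once a Spark application is submitted to the environment. The sketch below assembles a `spark-submit` command that targets YARN, assuming Spark has been pointed at the local Hadoop configuration. The script name `wordcount.py`, the HDFS input path, and the helper `build_submit_command` are hypothetical and do not come from this repository.

```python
import shlex


def build_submit_command(app_path, master="yarn", deploy_mode="client",
                         executor_memory="1g", app_args=()):
    """Assemble a spark-submit argument list for a job on the local YARN."""
    return [
        "spark-submit",
        "--master", master,                    # run on YARN instead of local[*]
        "--deploy-mode", deploy_mode,          # "client" keeps the driver in this shell
        "--executor-memory", executor_memory,  # small default for a single machine
        app_path,
        *app_args,
    ]


# Hypothetical application and input path, for illustration only.
command = build_submit_command(
    "wordcount.py",
    app_args=["hdfs://localhost:9000/user/demo/input"],
)
print(shlex.join(command))
```

Returning a list rather than a single string means the command can be passed directly to `subprocess.run` without shell quoting issues.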

### Documentation and Maintenance:
A detailed manual was created to document all steps of installing, configuring and using the cluster. This ensures that everyone has access to the information needed to operate and maintain the system effectively.
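
For day-to-day maintenance, one routine check is whether all Hadoop daemons are up. The sketch below parses the output of the JDK's `jps` tool and reports which of the five daemons normally expected in pseudo-distributed mode are missing. The function name and the sample output (including its process IDs) are made up for illustration.

```python
EXPECTED_DAEMONS = {
    "NameNode", "DataNode", "SecondaryNameNode",  # HDFS
    "ResourceManager", "NodeManager",             # YARN
}


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in `jps` output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # lines look like "<pid> <ClassName>"
            running.add(parts[1])
    return sorted(EXPECTED_DAEMONS - running)


# Invented sample: YARN's NodeManager and the SecondaryNameNode are down.
sample = """\
4321 NameNode
4455 DataNode
4790 ResourceManager
5012 Jps
"""
print(missing_daemons(sample))  # ['NodeManager', 'SecondaryNameNode']
```

On a live system the input could come from `subprocess.run(["jps"], capture_output=True, text=True).stdout`.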

### Conclusion:
In summary, this Big Data environment with Hadoop in pseudo-distributed mode and Apache Spark is a powerful tool for dealing with the challenges of Big Data processing. From near-real-time stream processing to advanced analytics, it lets you run a wide range of data experiments.