https://github.com/cleberzumba/hadoop-in-pseudodistributed-mode
Installation and Configuration of the Big Data Environment with Hadoop and Spark
- Host: GitHub
- URL: https://github.com/cleberzumba/hadoop-in-pseudodistributed-mode
- Owner: cleberzumba
- Created: 2024-04-22T12:43:40.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-05-02T14:31:12.000Z (9 months ago)
- Last Synced: 2024-11-08T07:42:53.592Z (3 months ago)
- Topics: hadoop, spark
- Size: 5.69 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This project covers the creation, assembly, installation, configuration, and documentation of a Big Data environment that runs Hadoop in Pseudodistributed Mode together with Apache Spark, two widely used frameworks for Big Data processing. It also highlights how this environment can be useful for Big Data processing.
*Title: Creating a Big Data environment with Hadoop Pseudodistributed Mode with Apache Spark*
### Creation and Assembly:
From the physical assembly of the machines to the network and storage configuration, every detail was carefully planned. The infrastructure was scaled to support large volumes of data and high processing loads.
### Installation and Configuration:
Apache Hadoop is the heart of this Big Data environment, providing a distributed file system (HDFS) and a resource management and job scheduling layer (YARN). Additionally, Apache Spark was installed for in-memory processing and real-time data analysis.
### Utility and Benefits:
This environment offers a series of significant benefits for Big Data processing:
- Real-Time Processing: With Apache Kafka, we are able to process streams of data in real-time, enabling instant analysis and timely decision-making.
- Advanced Analytics: Apache Spark provides a distributed processing environment for advanced analytics, from complex queries to machine learning and graph processing.
- Latency Reduction: With the distributed processing capacity of Apache Spark, we are able to reduce the latency of queries and analyses, speeding up response times for our users.
### Documentation and Maintenance:
A detailed manual was created to document all steps of installing, configuring and using the cluster. This ensures that everyone has access to the information needed to operate and maintain the system effectively.
### Conclusion:
In summary, this Big Data environment with Hadoop Pseudodistributed Mode and Apache Spark is a practical tool for tackling the challenges of Big Data processing. From real-time processing to advanced analytics, it lets you run a variety of data-driven experiments.
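To make the Installation and Configuration step more concrete, here is a minimal sketch of the two settings that usually define a pseudo-distributed Hadoop setup. The property values (`hdfs://localhost:9000` and a replication factor of `1`) are assumptions based on the common single-node defaults, not values confirmed by this repository, and the helper `build_hadoop_config` is a hypothetical name introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def build_hadoop_config(properties: dict[str, str]) -> str:
    """Render a Hadoop-style <configuration> document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root, space="  ")  # pretty-print; available in Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Assumed values: typical single-node defaults, not taken from this repository.
CORE_SITE = {"fs.defaultFS": "hdfs://localhost:9000"}  # NameNode on the local machine
HDFS_SITE = {"dfs.replication": "1"}  # a single DataNode, so one replica per block

if __name__ == "__main__":
    print(build_hadoop_config(CORE_SITE))
    print(build_hadoop_config(HDFS_SITE))
```

In a typical installation, the two rendered documents correspond to `etc/hadoop/core-site.xml` and `etc/hadoop/hdfs-site.xml` under the Hadoop directory. The project's own manual remains the reference for the values actually used.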