https://github.com/touero/ctenopharyngodon-idella
Hadoop, MapReduce Distributed Crawling of Data Information from All Chinese Universities.
- Host: GitHub
- URL: https://github.com/touero/ctenopharyngodon-idella
- Owner: touero
- License: apache-2.0
- Created: 2023-04-10T08:39:34.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-10-16T02:35:59.000Z (2 months ago)
- Last Synced: 2024-12-07T14:22:03.673Z (15 days ago)
- Topics: fastapi, hadoop, hadoop-mapreduce, java, mapreduce, maven, scraping
- Language: Java
- Homepage:
- Size: 3.75 MB
- Stars: 140
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ctenopharyngodon-idella
## Repository Introduction
Distributed crawling of data on all Chinese universities with Hadoop MapReduce.

MapReduce-based distributed crawlers are commonly built on Jsoup, but Jsoup cannot parse data that is loaded by JavaScript. This repository therefore uses Fastjson to crawl data on all Chinese universities, running the crawler as a MapReduce job in the Hadoop ecosystem.

The current development environment is Windows 10 with a simulated Hadoop setup, so the project has not been tested on Linux or macOS. What is known so far is that on Linux the paths are HDFS paths. If you are interested, please submit issues or PRs.
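The map/reduce split behind such a crawler can be sketched in plain Java, without Hadoop. This is only an in-memory illustration under stated assumptions: the class name `CrawlSketch`, the `province` field, and the stubbed `fetchJson` method are hypothetical and not taken from this repository. A real job would use Hadoop's `Mapper` and `Reducer` classes and a JSON library rather than the regular expression used here.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CrawlSketch {
    // Hypothetical stand-in for an HTTP request that returns JSON for one school ID.
    static String fetchJson(int schoolId) {
        String province = (schoolId % 2 == 0) ? "Beijing" : "Shanghai";
        return "{\"id\":" + schoolId + ",\"province\":\"" + province + "\"}";
    }

    // Extracts a string field from a flat JSON object.
    // A regex is only adequate for this toy payload; it is not a JSON parser.
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\":\"([^\"]*)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    // Map phase: one input record (a school ID) becomes one (province, id) pair.
    static Map.Entry<String, Integer> map(int schoolId) {
        String json = fetchJson(schoolId);
        return new AbstractMap.SimpleEntry<>(field(json, "province"), schoolId);
    }

    // Shuffle and reduce phase: group the pairs by key, then count the values per key.
    static Map<String, Long> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
            Collectors.groupingBy(e -> e.getKey(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = IntStream.rangeClosed(1, 5)
            .mapToObj(CrawlSketch::map)
            .collect(Collectors.toList());
        System.out.println(reduce(pairs)); // prints {Beijing=2, Shanghai=3}
    }
}
```

Because each `map` call depends only on its own school ID, the IDs can be split across worker nodes, which is what makes the crawl suitable for a distributed framework.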
![img.png](.about/img.png)
This repository contains:
1. Building a simulated distributed environment under Windows
2. Crawling 掌上高考
3. Data Storage

## Install
This project uses [Java](https://www.java.com/) and [Git](https://git-scm.com/). Go check them out if you don't have them installed locally.
```shell
git clone https://github.com/weiensong/ScrapySchoolAll.git
```

## Usage
- A truly distributed environment
```shell
# Build the job JAR with Maven
mvn package
# On the master node, where HDFS is available, submit the job
hadoop jar PackageName.jar
```
- Distributed environment simulated on Windows
- Run `initTest.bat` directly as administrator
- ```command
cd /d "%~dp0"
copy hadoop.dll C:\Windows\System32
cd /src/main/java/job
javac MyJob.java
java MyJob
```
## Related Repository
- [hadoop](https://github.com/apache/hadoop): Apache Hadoop
- [opsariichthys-bidens](https://github.com/weiensong/opsariichthys-bidens): Basic information API for universities across China.

## Related Efforts
- [Hadoop](https://hadoop.apache.org/)
- [Maven Central Repository](https://mvnrepository.com/)
- [掌上高考](https://www.gaokao.cn/)

## Maintainers
[@weiensong](https://github.com/weiensong)
## Contributing
Feel free to dive in! [Open an issue](https://github.com/weiensong/ScrapySchoolAll/issues) or submit PRs.
Java code follows the [Google Java Style Guide](https://google.github.io/styleguide/javaguide.html).
### Contributors
This project exists thanks to all the people who contribute.

## License
[Apache License 2.0](https://github.com/weiensong/ctenopharyngodon-idella/blob/master/LICENSE) © weiensong