https://github.com/silasener/web-scraping-academia
This project aims to facilitate access to information from the Springer academic website using web scraping.
java mongodb mongodb-database mongorepository nosql spring-boot web webscraping
- Host: GitHub
- URL: https://github.com/silasener/web-scraping-academia
- Owner: silasener
- Created: 2024-02-24T10:02:39.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2024-09-10T10:23:31.000Z (2 months ago)
- Last Synced: 2024-09-10T11:53:15.711Z (2 months ago)
- Topics: java, mongodb, mongodb-database, mongorepository, nosql, spring-boot, web, webscraping
- Language: Java
- Homepage:
- Size: 131 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
# Web Scraping Academia
## Introduction
This project aims to facilitate access to information from the Springer academic website using web scraping. It uses a MongoDB database with Elasticsearch query structures and is implemented with the Java Spring framework, a MongoDB repository, and Jsoup for web scraping. MongoDB Compass and NoSQL queries are used for database management.

## Technologies
- Java Spring framework
- MongoDB database
- Jsoup for web scraping
- MongoDB Compass for database management
- NoSQL queries for database operations

## Project Screens
1. **Main Screen:**
2. **Detail Screen:**
## Project Overview
The project consists of three main components:

1. **Web Scraping:**
- Retrieves information from the [Springer](https://link.springer.com/) academic website based on user-entered keywords.
- Displays details of at least the top 10 academic publications on a custom-built web page.
- Utilizes HTML parsing or request methods to access the desired data from the Springer website.
- Downloads PDF files for each publication.

2. **Database:**
- Stores the scraped data using MongoDB.
- Required publication information includes:
  - Publication ID
  - Publication title
  - Author names
  - Publication type (research paper, review, conference, book, etc.)
  - Publication date
  - Publisher name
  - Keywords (searched on the academic search engine)
  - Keywords (related to the article)
  - Abstract
  - References
  - Citation count
  - DOI number (if available)
  - URL address
- MongoDB Compass and NoSQL queries are used for database management.

3. **Web Interface:**
- Creates a web page to display the retrieved publication information.
- Provides a text area for users to enter keywords for searching publications.
- Initially displays all records from the database upon page load.
- Enables dynamic searching with automatic spelling correction suggestions.
- Includes dynamic filtering options based on various attributes of publications.
- Allows sorting by publication date and citation count.

## Usage
1. **Installation:**
- Clone the repository:
```
git clone https://github.com/silasener/web-scraping-academia.git
```
- Navigate to the project directory:
```
cd web-scraping-academia
```
- Install dependencies:
```
// Add commands to install any dependencies if needed
```

2. **Running the Application:**
- Start the application:
```
// Add commands to start the application
```

3. **Accessing the Web Interface:**
- Once the application is running, access the web interface by navigating to `http://localhost:<port>` in your web browser, where `<port>` is the port the application listens on (Spring Boot defaults to 8080 unless configured otherwise).

4. **Using the Web Interface:**
- Enter keywords in the provided text area to search for publications.
- Browse through the displayed publications and click on a publication title for detailed information.
- Use the dynamic filtering options to refine the displayed publications.
- Sort the publications by publication date or citation count.

## MongoDB Compass and NoSQL Queries
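The filtering and sorting that the web interface offers can be prototyped in plain Java before being turned into database queries. The following is a minimal sketch, not code from this repository: the class `PublicationQueries`, its `Publication` type, the method names, and the sample records are all hypothetical, and only a few of the stored publication fields are modelled. In MongoDB the same behaviour would typically be a `find` with a filter document followed by a `sort`.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (an assumption, not part of the project): models a few
// publication fields and the filter/sort behaviour described in this README.
class PublicationQueries {

    // Illustrative subset of the stored publication attributes.
    static final class Publication {
        final String title;
        final String type;
        final LocalDate publicationDate;
        final int citationCount;

        Publication(String title, String type, LocalDate publicationDate, int citationCount) {
            this.title = title;
            this.type = type;
            this.publicationDate = publicationDate;
            this.citationCount = citationCount;
        }
    }

    // Keeps only publications of the given type, ignoring letter case.
    static List<Publication> filterByType(List<Publication> all, String type) {
        return all.stream()
                .filter(p -> p.type.equalsIgnoreCase(type))
                .collect(Collectors.toList());
    }

    // Returns a new list ordered from newest to oldest publication date.
    static List<Publication> sortByDateDesc(List<Publication> all) {
        return all.stream()
                .sorted(Comparator.comparing((Publication p) -> p.publicationDate).reversed())
                .collect(Collectors.toList());
    }

    // Returns a new list ordered from most cited to least cited.
    static List<Publication> sortByCitationsDesc(List<Publication> all) {
        return all.stream()
                .sorted(Comparator.comparingInt((Publication p) -> p.citationCount).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<Publication> sample = Arrays.asList(
                new Publication("Sample Review", "review", LocalDate.of(2021, 3, 1), 40),
                new Publication("Sample Paper", "research paper", LocalDate.of(2023, 7, 15), 12),
                new Publication("Sample Talk", "conference", LocalDate.of(2019, 11, 5), 95));

        sortByCitationsDesc(sample)
                .forEach(p -> System.out.println(p.citationCount + "  " + p.title));
    }
}
```

Each method returns a new list and leaves its input untouched, so a filter and a sort can be chained in either order, much as a query would combine a filter with a sort key.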
Include instructions or examples of using MongoDB Compass and NoSQL queries for managing the database.

## Contributors
- [Emre Terzi](https://github.com/emretterzi)