https://github.com/bkraad47/guardian_crawler

A simple Java web crawler instance for GlassFish
glassfish guardian java jsoup mongodb webcrawler

Last synced: 2 months ago
Host: GitHub
URL: https://github.com/bkraad47/guardian_crawler
Owner: bkraad47
Created: 2017-05-10T18:02:02.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-05-18T07:33:04.000Z (about 8 years ago)
Last Synced: 2025-02-15T22:42:51.288Z (4 months ago)
Topics: glassfish, guardian, java, jsoup, mongodb, webcrawler
Language: Java
Homepage:
Size: 4.73 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Guardian AU webcrawler - Badruddin Kamal (Raad)

## EC2 Link - http://ec2-34-210-127-161.us-west-2.compute.amazonaws.com:8080/guardiancrawler

This simple Java application crawls The Guardian Australia website and extracts headlines, article text, authors and dates, which it stores in MongoDB.
The application also implements a RESTful search service that searches for keywords in specific stored fields. A scheduler re-crawls The Guardian website
every night at 2 AM (US West time), so the stored data stays up to date.
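
The nightly 2 AM rule can be sketched with the JDK alone. This is an illustration, not the project's scheduler, which uses Quartz. It assumes "US west time" means the `America/Los_Angeles` zone, and it uses `java.time`, which needs Java 8 or later even though the project lists Java 7. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class NightlySchedule {
    // Assumption: "US west time" is taken to be the Pacific time zone.
    static final ZoneId US_WEST = ZoneId.of("America/Los_Angeles");
    static final LocalTime RUN_AT = LocalTime.of(2, 0);

    /** Returns the first 2 AM (US West) strictly after the given instant. */
    static ZonedDateTime nextRun(ZonedDateTime now) {
        ZonedDateTime local = now.withZoneSameInstant(US_WEST);
        // On the spring-forward day 2 AM does not exist; atZone shifts it to 3 AM.
        ZonedDateTime candidate = local.toLocalDate().atTime(RUN_AT).atZone(US_WEST);
        if (!candidate.isAfter(local)) {
            candidate = local.toLocalDate().plusDays(1).atTime(RUN_AT).atZone(US_WEST);
        }
        return candidate;
    }

    /** How long a scheduler would have to wait before starting the next crawl. */
    static Duration delayUntilNextRun(ZonedDateTime now) {
        return Duration.between(now, nextRun(now));
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now();
        System.out.println("Next crawl at " + nextRun(now)
                + " (in " + delayUntilNextRun(now).toMinutes() + " minutes)");
    }
}
```

With Quartz, a cron trigger such as `0 0 2 * * ?` bound to the same time zone would express the same rule without the manual date arithmetic.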

*** The DB has been changed to mLab from compose for testing/hosting purposes ***

*** The code is multi-threaded and will only work if the number of threads running on the processor is less than the maximum number of MongoDB connections ***
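
One way to respect that constraint is to size the crawler's thread pool from the connection limit. The sketch below is an assumption about how this could be done, not the project's actual threading code. `CrawlerPool`, `workerCount` and the limit passed in are hypothetical names and values.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class CrawlerPool {
    /**
     * Picks a worker count that is at most the CPU count and always
     * strictly below the database connection limit.
     */
    static int workerCount(int availableProcessors, int maxDbConnections) {
        if (maxDbConnections < 2) {
            throw new IllegalArgumentException(
                    "Need at least 2 connections to stay strictly below the limit");
        }
        return Math.max(1, Math.min(availableProcessors, maxDbConnections - 1));
    }

    /** Creates a fixed pool so the number of crawler threads can never exceed the cap. */
    static ExecutorService newPool(int maxDbConnections) {
        int workers = workerCount(Runtime.getRuntime().availableProcessors(), maxDbConnections);
        return Executors.newFixedThreadPool(workers);
    }

    public static void main(String[] args) {
        // Hypothetical limit; use the value your MongoDB plan actually allows.
        ExecutorService pool = newPool(10);
        pool.shutdown();
    }
}
```

A fixed pool is used here because it gives a hard upper bound on concurrent workers, which is what the constraint requires.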

## Dependencies

1. MongoDB (mLab)
2. JSoup
3. Gson
4. Quartz
5. Glassfish
6. Java 7

## Running / Deployment

Build the .war, deploy it to your GlassFish server, and make sure the mLab database is reachable from that server. A deployment test runs to check that the environment is suitable.

## Restful Search

A GET API that takes `type` and `query` as query parameters, runs the search against MongoDB, and returns the matching results as a JSON array.

Endpoint[GET] - http://ec2-34-210-127-161.us-west-2.compute.amazonaws.com:8080/guardiancrawler/search

`type` can be `author`, `headline`, `date` or `text`.

`query` is the string you want to search for.

Test Example - http://ec2-34-210-127-161.us-west-2.compute.amazonaws.com:8080/guardiancrawler/search?type=headline&query=Trump
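
A client can assemble such a URL programmatically. The sketch below is a hypothetical helper, not part of the project. It checks `type` against the four documented values on the client side and percent-encodes `query`, so searches containing spaces or symbols stay valid. The base URL in `main` is a placeholder for wherever the application is deployed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

final class SearchUrl {
    // The four search types documented in this README.
    static final List<String> TYPES = Arrays.asList("author", "headline", "date", "text");

    /** Builds a search URL of the form {baseUrl}/search?type=...&query=... */
    static String build(String baseUrl, String type, String query) {
        if (!TYPES.contains(type)) {
            throw new IllegalArgumentException("Unsupported type: " + type);
        }
        try {
            // URLEncoder produces form encoding, so a space becomes '+'.
            return baseUrl + "/search?type=" + type
                    + "&query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required on every JVM", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder base URL; substitute your own deployment.
        System.out.println(build("http://localhost:8080/guardiancrawler", "headline", "Trump"));
    }
}
```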

## Test

A simple deployment test is run to ensure resource availability.
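
The README does not describe what the deployment test checks. As one possible illustration of a resource-availability check, the sketch below tries to open a TCP connection to a host and port within a timeout. `ReachabilityCheck`, the host name and the port are hypothetical, and a successful connection only shows that the port is open, not that the database credentials work.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class ReachabilityCheck {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: treat all as unavailable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder host and port; use the values from your own database URI.
        System.out.println(canConnect("db.example.com", 27017, 3000));
    }
}
```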
