Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ashwanthkumar/crawler-intro
Getting started with Crawler4J and Scala
https://github.com/ashwanthkumar/crawler-intro
Last synced: about 10 hours ago
JSON representation
Getting started with Crawler4J and Scala
- Host: GitHub
- URL: https://github.com/ashwanthkumar/crawler-intro
- Owner: ashwanthkumar
- Created: 2014-05-17T08:47:50.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2014-05-17T13:52:31.000Z (over 10 years ago)
- Last Synced: 2024-04-14T09:19:01.069Z (7 months ago)
- Language: Scala
- Size: 193 KB
- Stars: 0
- Watchers: 2
- Forks: 17
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
Crawler Assignment
========================The purpose of this assignment is to get started in Crawler4J with Scala.
## Deliverables
* Link to your fork of this assignment (Yes you need to fork this project and make changes on your repo) with the changes (refer next section for the changes)
* Final CSV file containing atleast 100 product information details.## Changes to be done in the code
* Support for css / js / image / other junk exclude filters. (Check **BaseCrawler.scala**)
* Add Crawler for the site (site name will be sent separately in the email)
* Support for Proxy (check **CrawlDriver.scala**) - If everybody starts crawling from office, the site can identify that and block our IP Address. Since all outgoing requests originate from the same IP. Will share the proxy details in the email separately.
* Parser unit tests for the new site(s).
* Fix the existing unit test for Flipkart as well.## Resources
- [JSoup CSS Selector Syntax](http://jsoup.org/cookbook/extracting-data/selector-syntax)
- [Crawler4J Source Code](http://code.google.com/p/crawler4j/source/browse/)
- [Proxy Server](http://en.wikipedia.org/wiki/Proxy_server)
- http://computer.howstuffworks.com/firewall4.htm## More Resources on CSS Selectors
- [MDN CSS Selector](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started/Selectors)
- [document.querySelector](https://developer.mozilla.org/en-US/docs/Web/API/document.querySelector)
- [W3 Selector Draft](http://www.w3.org/TR/selectors-api/)
- [30 Selectors you need to know](http://code.tutsplus.com/tutorials/the-30-css-selectors-you-must-memorize--net-16048)