https://github.com/maxpleaner/tagger
scrapes descriptions & tags for LinkedIn companies
Last synced: 9 months ago
- Host: GitHub
- URL: https://github.com/maxpleaner/tagger
- Owner: MaxPleaner
- Created: 2016-02-15T23:54:40.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2016-09-04T17:39:49.000Z (over 9 years ago)
- Last Synced: 2025-03-24T06:30:52.119Z (about 1 year ago)
- Language: JavaScript
- Homepage:
- Size: 984 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
## Tagger
scrapes descriptions & tags for LinkedIn companies
------
#### Setup
The app requires one environment variable, `MONKEY_LEARN_TOKEN`, which is an API key from
[monkeylearn.com](http://monkeylearn.com). MonkeyLearn is used to extract tags from text.
Other than that, it's standard Rails:
`clone`, `bundle`, `rake db:create db:migrate`, then visit `localhost:3000`
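
As a minimal sketch of the token requirement above, the check below fails fast when `MONKEY_LEARN_TOKEN` is missing or blank. The helper name `monkey_learn_token` is hypothetical and does not come from this repository; only the variable name is taken from the setup notes.

```ruby
# Hypothetical helper (not part of the repository): return the MonkeyLearn
# API key, or raise a clear error when it is missing or blank.
def monkey_learn_token(env = ENV)
  token = env["MONKEY_LEARN_TOKEN"].to_s.strip
  raise KeyError, "MONKEY_LEARN_TOKEN is not set" if token.empty?
  token
end

# Example use, e.g. from an initializer (assumption):
#   token = monkey_learn_token
```

Passing the environment as an argument keeps the check easy to exercise with a plain hash instead of the real `ENV`.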
#### Basic usage
- Use the HTML interface at `localhost:3000`
- Scraper command: call `SelfScraper.begin("google")` in `rails c` while the server is running
#### Details on scraper
You can load some seed data by running `rake db:data:load`.
You can start the crawler by running `SelfScraper.begin("google")` in `rails console`. This example starts
at the Google page and follows "related" pages from there.
- The crawler interacts with the server automatically using `Mechanize`.
- Eventually it will loop and stop finding new companies.
- This happens because of bidirectional linking in LinkedIn's 'people also clicked' sections.
- There are often small groups of companies that all link back to each other.
- When this happens, restart the scraper with a new company name.
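
The looping behaviour described in the list above can be illustrated with a small in-memory crawl. This is only a sketch: the `RELATED` hash is invented sample data standing in for the 'people also clicked' links, and `crawl` is a hypothetical function, not the repository's `SelfScraper`.

```ruby
require "set"

# Invented sample data standing in for "people also clicked" links.
# Two closed groups: companies in one group never link to the other.
RELATED = {
  "google"    => ["microsoft", "apple"],
  "microsoft" => ["google", "apple"],
  "apple"     => ["google", "microsoft"],
  "acme"      => ["initech"],
  "initech"   => ["acme"]
}.freeze

# Breadth-first crawl that skips companies it has already seen.
# It stops once every reachable company has been visited.
def crawl(start, related)
  visited = Set.new([start])
  queue = [start]
  until queue.empty?
    company = queue.shift
    related.fetch(company, []).each do |neighbor|
      # Set#add? returns nil when the element is already present.
      queue << neighbor if visited.add?(neighbor)
    end
  end
  visited.to_a
end
```

Starting from `"google"` only ever reaches the three companies in its group, so finding `"acme"` or `"initech"` requires a second run with a new starting name, which mirrors the restart described above.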
Note that the 'clear cache' button on the HTML site actually wipes the
entire database.
This is an HTML scraper, not an authenticated API application, so it is probably subject to
stricter rate limits. `999` errors mean the IP address is being throttled.
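
One common way to cope with throttling is to retry with exponential backoff. The sketch below is an assumption about how that could look, not something the repository does: `fetch_with_backoff` is a hypothetical helper, and `fetch` is any callable returning a status code, so no `Mechanize` calls are shown.

```ruby
# Hypothetical helper (not part of the repository): call `fetch` until it
# returns something other than 999, waiting longer between attempts.
# `sleeper` is injectable so the waiting can be replaced, e.g. when testing.
def fetch_with_backoff(fetch, max_attempts: 4, base_delay: 1.0,
                       sleeper: ->(seconds) { sleep(seconds) })
  max_attempts.times do |attempt|
    status = fetch.call
    return status unless status == 999
    # Double the delay each time: base_delay, 2x, 4x, ...
    sleeper.call(base_delay * (2**attempt)) if attempt < max_attempts - 1
  end
  999 # still throttled after all attempts
end
```

If every attempt is throttled the helper returns `999`, at which point waiting longer or switching to a different IP address is likely the only option.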
The bulk of the code is in [`application_controller.rb`](https://github.com/MaxPleaner/tagger/blob/master/app/controllers/application_controller.rb),
[`pages_controller.rb`](https://github.com/MaxPleaner/tagger/blob/master/app/controllers/pages_controller.rb),
[`pages/root.html.erb`](https://github.com/MaxPleaner/tagger/blob/master/app/views/pages/root.html.erb),
and [`application.rb`](https://github.com/MaxPleaner/tagger/blob/master/config/application.rb).