https://github.com/austinoboyle/scrape-linkedin-selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
- Host: GitHub
- URL: https://github.com/austinoboyle/scrape-linkedin-selenium
- Owner: austinoboyle
- License: mit
- Created: 2018-02-22T02:21:14.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-10-16T16:44:50.000Z (about 2 years ago)
- Last Synced: 2024-12-21T06:08:32.495Z (13 days ago)
- Topics: linkedin, python, scrape, scraper, scraping, selenium, selenium-webdriver, web-scraper, web-scraping
- Language: HTML
- Homepage:
- Size: 269 KB
- Stars: 466
- Watchers: 26
- Forks: 164
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# scrape_linkedin
## Introduction
`scrape_linkedin` is a Python package that scrapes all details from public LinkedIn
profiles, turning the data into structured JSON. You can scrape companies
and user profiles with this package.

**Warning**: LinkedIn has strong anti-scraping policies and may blacklist IPs making
unauthenticated or unusual requests.

## Table of Contents
- [scrape_linkedin](#scrapelinkedin)
  - [Introduction](#introduction)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
    - [Install with pip](#install-with-pip)
    - [Install from source](#install-from-source)
    - [Tests](#tests)
  - [Getting & Setting LI_AT](#getting--setting-liat)
    - [Getting LI_AT](#getting-liat)
    - [Setting LI_AT](#setting-liat)
  - [Examples](#examples)
  - [Usage](#usage)
    - [Command Line](#command-line)
    - [Python Package](#python-package)
      - [Profiles](#profiles)
      - [Companies](#companies)
      - [config](#config)
  - [Scraping in Parallel](#scraping-in-parallel)
    - [Example](#example)
    - [Configuration](#configuration)
  - [Issues](#issues)

## Installation
### Install with pip
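Run `pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git`

(GitHub no longer serves the unauthenticated `git://` protocol, so the `git+https://` form is used here.)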
### Install from source
Clone the repository with `git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git`,
then run `python setup.py install` from inside the cloned directory.
### Tests
Tests are (so far) run only on static HTML files: one is a LinkedIn
profile, and the other is used to test some utility functions.

## Getting & Setting LI_AT
Because of LinkedIn's anti-scraping measures, you must make your Selenium
browser look like an actual user. To do this, you need to add the `li_at` cookie
to the Selenium session.

### Getting LI_AT
1. Navigate to www.linkedin.com and log in
2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect
   element)
3. Select the appropriate tab for your browser (**Application** on Chrome,
   **Storage** on Firefox)
4. Click the **Cookies** dropdown on the left-hand menu, and select the
   `www.linkedin.com` option
5. Find and copy the li_at **value**

### Setting LI_AT
There are two ways to set your li_at cookie:

1. Set the LI_AT environment variable
   - `$ export LI_AT=YOUR_LI_AT_VALUE`
   - **On Windows**: `C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE`
2. Pass the cookie as a parameter to the Scraper object.
   > `>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:`

A cookie value passed directly to the Scraper **will override your
environment variable** if both are set.

## Examples
See [`/examples`](https://github.com/austinoboyle/scrape-linkedin-selenium/tree/master/examples)
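As a complement to the repository examples, here is a minimal sketch of the cookie precedence described under Setting LI_AT, combined with a scrape that saves its result to disk. `resolve_li_at` and `scrape_profile_to_json` are hypothetical helper names, not part of the package. The sketch assumes that `to_dict()` returns JSON-serializable data and that a working Selenium driver is installed.

```python
import json
import os


def resolve_li_at(cookie=None, env=None):
    """Return the li_at value to use: an explicit cookie wins over LI_AT.

    Hypothetical helper that mirrors the documented precedence; it is not
    the library's own implementation.
    """
    env = os.environ if env is None else env
    value = cookie if cookie is not None else env.get("LI_AT")
    if not value:
        raise ValueError("No li_at cookie: pass cookie=... or set LI_AT")
    return value


def scrape_profile_to_json(user, output_path, cookie=None, timeout=10):
    """Scrape one profile and write it to output_path as JSON.

    Needs scrape_linkedin, a Selenium driver, and a valid li_at cookie.
    """
    # Imported lazily so resolve_li_at can be used without the package.
    from scrape_linkedin import ProfileScraper

    with ProfileScraper(cookie=resolve_li_at(cookie), timeout=timeout) as scraper:
        profile = scraper.scrape(user=user)

    # Assumption: the dict returned by to_dict() is JSON-serializable.
    with open(output_path, "w") as out:
        json.dump(profile.to_dict(), out, indent=2)
```

Calling `scrape_profile_to_json('austinoboyle', 'profile.json')` would then use the `LI_AT` environment variable unless a `cookie` argument is supplied.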
## Usage
### Command Line
scrape_linkedin comes with a command-line module, `scrapeli`, created
using [click](http://click.pocoo.org/5/).

**Note: CLI only works with Personal Profiles as of now.**
Options:

- --url : Full URL of the profile you want to scrape
- --user: www.linkedin.com/in/USER
- --driver: choose Browser type to use (Chrome/Firefox), **default: Chrome**
- -a --attribute : return only a specific attribute (default: return all
  attributes)
- -i --input_file : Raw path to html file of the profile you want to scrape
- -o --output_file: Raw path to output file for structured json profile (just
  prints results by default)
- -h --help : Show this screen.

Examples:

- Get Austin O'Boyle's profile info: `$ scrapeli --user=austinoboyle`
- Get only the skills of Austin O'Boyle: `$ scrapeli --user=austinoboyle -a skills`
- Parse stored html profile and save json output: `$ scrapeli -i /path/file.html -o output.json`

### Python Package
#### Profiles
Use the `ProfileScraper` component to scrape profiles.

```python
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())
```

`Profile` - the class that has properties to access all information pulled from
a profile. It also has a `to_dict()` method that returns all of the data as a dict.

```python
# Assumes Profile is importable from the package root, like ProfileScraper
from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...} ,{...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...],...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}
```

**Structure of the fields scraped**
- personal_info
  - name
  - company
  - school
  - headline
  - followers
  - summary
  - websites
  - phone
  - connected
  - image
- skills
- experiences
  - volunteering
  - jobs
  - education
- interests
- accomplishments
  - publications
  - certifications
  - patents
  - courses
  - projects
  - honors
  - test scores
  - languages
  - organizations

#### Companies
Use the `CompanyScraper` component to scrape companies.

```python
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())
```

`Company` - the class that has properties to access all information pulled from
a company profile. There will be three properties: overview, jobs, and life.
**Overview is the only one currently implemented.**

```python
# Assumes Company is importable from the package root, like CompanyScraper
from scrape_linkedin import Company

# Backslashes continue the multi-item `with` statement across lines
with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview, jobs, life)

print(company.overview)
# {...}
```

**Structure of the fields scraped**
- overview
  - name
  - company_size
  - specialties
  - headquarters
  - founded
  - website
  - description
  - industry
  - num_employees
  - type
  - image
- jobs **NOT YET IMPLEMENTED**
- life **NOT YET IMPLEMENTED**

#### config
Pass these keyword arguments into the constructor of your Scraper to override
default values. You may (for example) want to decrease/increase the timeout if
your internet is very fast/slow.

- _cookie_ **`{str}`**: li_at cookie value (overrides env variable)
  - **default: `None`**
- _driver_ **`{selenium.webdriver}`**: driver type to use
  - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: kwargs to pass to driver constructor
  - **default: `{}`**
- _scroll_pause_ **`{float}`**: time(s) to pause during scroll increments
  - **default: `0.1`**
- _scroll_increment_ **`{int}`**: num pixels to scroll down each time
  - **default: `300`**
- _timeout_ **`{float}`**: default time to wait for async content to load
  - **default: `10`**

## Scraping in Parallel
New in version 0.2: built-in parallel scraping functionality. Note that the
up-front cost of starting a browser session is high, so for this to be
beneficial, you will want to be scraping many (> 15) profiles.

### Example
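The threshold mentioned above can be turned into a simple guard. The sketch below is only an illustration: `choose_num_instances` is a hypothetical helper, not part of `scrape_linkedin`, and the cutoff of 15 items and cap of 4 browsers are simply the figures used in this README.

```python
def choose_num_instances(num_items, min_items=15, max_instances=4):
    """Return how many browser instances to start; 1 means scrape serially.

    Hypothetical helper: parallel scraping is only chosen when there are
    more than min_items items, since each browser session is costly to start.
    """
    if num_items <= min_items:
        return 1
    # Never start more browsers than max_instances (or than there are items).
    return min(max_instances, num_items)
```

A result of 1 suggests a single scraper is enough; otherwise the value can be passed as `num_instances` to `scrape_in_parallel`, as in the library example that follows.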
```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft']  # ...and so on

# Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
```

### Configuration
**Parameters:**

- _scraper_type_ **`{scrape_linkedin.Scraper}`**: Scraper to use
- _items_ **`{list}`**: List of items to be scraped
- _output_file_ **`{str}`**: path to output file
- _num_instances_ **`{int}`**: number of parallel instances of selenium to run
- _temp_dir_ **`{str}`**: name of temporary directory to use to store data from intermediate steps
  - **default: `'tmp_data'`**
- _driver_ **`{selenium.webdriver}`**: driver to use for scraping
  - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: dict of keyword arguments to pass to the driver function.
  - **default: `scrape_linkedin.utils.HEADLESS_OPTIONS`**
- _\*\*kwargs_ **`{any}`**: extra keyword arguments to pass to the `scraper_type` constructor for each job

## Issues
Report bugs and feature requests
[here](https://github.com/austinoboyle/scrape-linkedin-selenium/issues).