# scrape_linkedin

## Introduction

`scrape_linkedin` is a Python package that scrapes all details from public LinkedIn
profiles and company pages, turning the data into structured JSON.

**Warning**: LinkedIn has strong anti-scraping measures; it may blacklist IPs that make
unauthenticated or unusual requests.

## Table of Contents

- [scrape_linkedin](#scrapelinkedin)
  - [Introduction](#introduction)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
    - [Install with pip](#install-with-pip)
    - [Install from source](#install-from-source)
    - [Tests](#tests)
  - [Getting & Setting LI_AT](#getting--setting-liat)
    - [Getting LI_AT](#getting-liat)
    - [Setting LI_AT](#setting-liat)
  - [Examples](#examples)
  - [Usage](#usage)
    - [Command Line](#command-line)
    - [Python Package](#python-package)
      - [Profiles](#profiles)
      - [Companies](#companies)
      - [config](#config)
  - [Scraping in Parallel](#scraping-in-parallel)
    - [Example](#example)
    - [Configuration](#configuration)
  - [Issues](#issues)

## Installation

### Install with pip

Run `pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git`

### Install from source

`git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git`

Run `python setup.py install`

### Tests

Tests are (so far) only run against static HTML files: one is a LinkedIn
profile; the other is used to test some utility functions.
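To run them (assuming a standard pytest setup, which the README does not specify), run `python -m pytest` from the repository root.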

## Getting & Setting LI_AT

Because of LinkedIn's anti-scraping measures, you must make your Selenium
browser look like an actual user. To do this, you need to add the `li_at` cookie
to the Selenium session.

### Getting LI_AT

1. Navigate to www.linkedin.com and log in
2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect
element)
3. Select the appropriate tab for your browser (**Application** on Chrome,
**Storage** on Firefox)
4. Click the **Cookies** dropdown on the left-hand menu, and select the
`www.linkedin.com` option
5. Find and copy the `li_at` **value**

### Setting LI_AT

There are two ways to set your `li_at` cookie:

1. Set the `LI_AT` environment variable
   - Linux/macOS: `$ export LI_AT=YOUR_LI_AT_VALUE`
   - **On Windows**: `C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE`
2. Pass the cookie as a parameter to the Scraper object:
   `>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:`

A cookie value passed directly to the Scraper **will override your
environment variable** if both are set.
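A minimal sketch combining both approaches (the cookie value is a placeholder; `ProfileScraper` and `scrape` are covered under [Usage](#usage)):

```python
import os

from scrape_linkedin import ProfileScraper

# Option 1: set the LI_AT environment variable (here from within Python)
os.environ['LI_AT'] = 'YOUR_LI_AT_VALUE'

# Option 2: pass the cookie directly; this overrides the LI_AT env variable
with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
    profile = scraper.scrape(user='austinoboyle')
```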

## Examples

See [`/examples`](https://github.com/austinoboyle/scrape-linkedin-selenium/tree/master/examples)

## Usage

### Command Line

scrape_linkedin comes with a command-line module, `scrapeli`, created
using [click](http://click.pocoo.org/5/).

**Note: the CLI only works with personal profiles as of now.**

Options:

- `--url`: full URL of the profile you want to scrape
- `--user`: the USER portion of a profile URL (www.linkedin.com/in/USER)
- `--driver`: browser type to use (Chrome/Firefox), **default: Chrome**
- `-a`, `--attribute`: return only a specific attribute (default: return all
  attributes)
- `-i`, `--input_file`: raw path to an HTML file of the profile you want to scrape
- `-o`, `--output_file`: raw path to an output file for the structured JSON profile
  (just prints results by default)
- `-h`, `--help`: show this screen

Examples:

- Get Austin O'Boyle's profile info: `$ scrapeli --user=austinoboyle`
- Get only the skills of Austin O'Boyle: `$ scrapeli --user=austinoboyle -a skills`
- Parse stored html profile and save json output: `$ scrapeli -i /path/file.html -o output.json`

### Python Package

#### Profiles

Use the `ProfileScraper` component to scrape profiles.

```python
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
    print(profile.to_dict())
```

`Profile` is the class that exposes all of the information pulled from a
profile as properties. It also has a `to_dict()` method that returns all of the data as a dict.

```python
from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...}, {...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...], ...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}
```

**Structure of the fields scraped**

- personal_info
  - name
  - company
  - school
  - headline
  - followers
  - summary
  - websites
  - email
  - phone
  - connected
  - image
- skills
- experiences
  - volunteering
  - jobs
  - education
- interests
- accomplishments
  - publications
  - certifications
  - patents
  - courses
  - projects
  - honors
  - test scores
  - languages
  - organizations

#### Companies

Use the `CompanyScraper` component to scrape companies.

```python
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
    print(company.to_dict())
```

`Company` is the class that exposes all of the information pulled from a
company profile as properties. There will be three properties: overview, jobs, and life.
**Overview is the only one currently implemented.**

```python
from scrape_linkedin import Company

# Company is assumed to take raw HTML strings, like Profile above
with open('overview.html', 'r') as overview, \
     open('jobs.html', 'r') as jobs, \
     open('life.html', 'r') as life:
    company = Company(overview.read(), jobs.read(), life.read())

print(company.overview)
# {...}
```

**Structure of the fields scraped**

- overview
  - name
  - company_size
  - specialties
  - headquarters
  - founded
  - website
  - description
  - industry
  - num_employees
  - type
  - image
- jobs **NOT YET IMPLEMENTED**
- life **NOT YET IMPLEMENTED**

#### config

Pass these keyword arguments into the constructor of your Scraper to override
default values. You may (for example) want to decrease/increase the timeout if
your internet is very fast/slow. See the sketch after this list.

- _cookie_ **`{str}`**: `li_at` cookie value (overrides the env variable)
  - **default: `None`**
- _driver_ **`{selenium.webdriver}`**: driver type to use
  - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: kwargs to pass to the driver constructor
  - **default: `{}`**
- _scroll_pause_ **`{float}`**: time (in seconds) to pause during scroll increments
  - **default: `0.1`**
- _scroll_increment_ **`{int}`**: number of pixels to scroll down each time
  - **default: `300`**
- _timeout_ **`{float}`**: default time to wait for async content to load
  - **default: `10`**
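For instance, a minimal sketch (the parameter values here are illustrative, not recommendations) that slows scrolling and extends the timeout for a slow connection:

```python
from scrape_linkedin import ProfileScraper

# Illustrative values: pause longer between smaller scroll increments, and
# wait up to 30 seconds for asynchronous content to load.
with ProfileScraper(scroll_pause=0.5, scroll_increment=200, timeout=30) as scraper:
    profile = scraper.scrape(user='austinoboyle')
```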

## Scraping in Parallel

New in version 0.2: built-in parallel scraping functionality. Note that the
up-front cost of starting a browser session is high, so for this to be
beneficial, you will want to be scraping many (> 15) profiles.

### Example

```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, output to 'companies.json', use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
```

### Configuration

**Parameters:**

- _scraper_type_ **`{scrape_linkedin.Scraper}`**: Scraper to use
- _items_ **`{list}`**: list of items to be scraped
- _output_file_ **`{str}`**: path to output file
- _num_instances_ **`{int}`**: number of parallel instances of selenium to run
- _temp_dir_ **`{str}`**: name of temporary directory to use to store data from
  intermediate steps
  - **default: `'tmp_data'`**
- _driver_ **`{selenium.webdriver}`**: driver to use for scraping
  - **default: `selenium.webdriver.Chrome`**
- _driver_options_ **`{dict}`**: dict of keyword arguments to pass to the driver function
  - **default: `scrape_linkedin.utils.HEADLESS_OPTIONS`**
- _\*\*kwargs_ **`{any}`**: extra keyword arguments to pass to the `scraper_type`
  constructor for each job (see the sketch below)
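For example, a sketch of forwarding a cookie to every scraper instance via `**kwargs` (the cookie value is a placeholder):

```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

# Any extra keyword argument, such as cookie, is passed through to each
# CompanyScraper constructor.
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=['facebook', 'google'],
    output_file="companies.json",
    num_instances=2,
    cookie='YOUR_LI_AT_VALUE',
)
```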

## Issues

Report bugs and feature requests
[here](https://github.com/austinoboyle/scrape-linkedin-selenium/issues).