Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/austinoboyle/scrape-linkedin-selenium

`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.

linkedin python scrape scraper scraping selenium selenium-webdriver web-scraper web-scraping

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/austinoboyle/scrape-linkedin-selenium
Owner: austinoboyle
License: mit
Created: 2018-02-22T02:21:14.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2022-10-16T16:44:50.000Z (over 1 year ago)
Last Synced: 2024-01-30T04:46:32.195Z (5 months ago)
Topics: linkedin, python, scrape, scraper, scraping, selenium, selenium-webdriver, web-scraper, web-scraping
Language: HTML
Homepage:
Size: 269 KB
Stars: 425
Watchers: 24
Forks: 163
Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-stars - scrape-linkedin-selenium - turning the data into structured json. | austinoboyle | 447 | (HTML)

README

# scrape_linkedin

## Introduction

`scrape_linkedin` is a Python package that scrapes all details from public LinkedIn profiles and turns the data into structured JSON. It can scrape both company pages and user profiles.

**Warning**: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.

## Table of Contents

- [scrape_linkedin](#scrapelinkedin)
  - [Introduction](#introduction)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
    - [Install with pip](#install-with-pip)
    - [Install from source](#install-from-source)
    - [Tests](#tests)
  - [Getting & Setting LI_AT](#getting--setting-liat)
    - [Getting LI_AT](#getting-liat)
    - [Setting LI_AT](#setting-liat)
  - [Examples](#examples)
  - [Usage](#usage)
    - [Command Line](#command-line)
    - [Python Package](#python-package)
      - [Profiles](#profiles)
      - [Companies](#companies)
      - [config](#config)
  - [Scraping in Parallel](#scraping-in-parallel)
    - [Example](#example)
    - [Configuration](#configuration)
  - [Issues](#issues)

## Installation

### Install with pip

Run `pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git`

### Install from source

Clone the repository, then install it from the repository root:

`git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git`

`cd scrape-linkedin-selenium`

`python setup.py install`

### Tests

Tests currently run only against static HTML files: one is a LinkedIn profile, and the other is used to test some utility functions.

## Getting & Setting LI_AT

Because of LinkedIn's anti-scraping measures, your Selenium browser must look like an actual user. To achieve this, add the `li_at` cookie to the Selenium session.

### Getting LI_AT

1.  Navigate to www.linkedin.com and log in
2.  Open the browser developer tools (Ctrl-Shift-I, or right click -> Inspect Element)
3.  Select the appropriate tab for your browser (**Application** on Chrome, **Storage** on Firefox)
4.  Click the **Cookies** dropdown in the left-hand menu and select the `www.linkedin.com` option
5.  Find and copy the `li_at` **value**

### Setting LI_AT

There are two ways to set your li_at cookie:

1.  Set the LI_AT environment variable
    -   `$ export LI_AT=YOUR_LI_AT_VALUE`
    -   **On Windows**: `C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE`
2.  Pass the cookie as a parameter to the Scraper object.
    > `>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:`

A cookie value passed directly to the Scraper **will override your environment variable** if both are set.
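The precedence rule can be sketched in a few lines. This is an illustration only, not the package's actual code: `resolve_li_at` is a hypothetical helper that mirrors the documented behaviour (an explicit argument first, then the `LI_AT` environment variable), and the cookie values are placeholders.

```python
import os


def resolve_li_at(cookie=None):
    """Return the li_at value to use (hypothetical helper, not part of scrape_linkedin)."""
    # An explicitly passed cookie wins over the environment variable.
    if cookie is not None:
        return cookie
    value = os.environ.get("LI_AT")
    if not value:
        raise ValueError("No li_at cookie found: pass cookie=... or set LI_AT")
    return value


os.environ["LI_AT"] = "value-from-env"          # placeholder, not a real cookie
print(resolve_li_at())                          # value-from-env
print(resolve_li_at(cookie="value-from-arg"))   # value-from-arg
```

Failing early when neither source is set is a choice made for this sketch; the package itself may handle a missing cookie differently.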

## Examples

See [`/examples`](https://github.com/austinoboyle/scrape-linkedin-selenium/tree/master/examples)

## Usage

### Command Line

`scrape_linkedin` comes with a command-line tool, `scrapeli`, built with [click](http://click.pocoo.org/5/).

**Note: The CLI currently works only with personal profiles.**

Options:

-   `--url`: full URL of the profile you want to scrape
-   `--user`: the USER part of www.linkedin.com/in/USER
-   `--driver`: browser type to use (Chrome/Firefox), **default: Chrome**
-   `-a`, `--attribute`: return only a specific attribute (default: return all attributes)
-   `-i`, `--input_file`: raw path to the HTML file of the profile you want to scrape
-   `-o`, `--output_file`: raw path to the output file for the structured JSON profile (prints results by default)
-   `-h`, `--help`: show this screen

Examples:

-   Get Austin O'Boyle's profile info: `$ scrapeli --user=austinoboyle`

-   Get only the skills of Austin O'Boyle: `$ scrapeli --user=austinoboyle -a skills`

-   Parse stored html profile and save json output: `$ scrapeli -i /path/file.html -o output.json`
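When `scrapeli` is called from a script, it can help to assemble the argument list programmatically. The sketch below is only an illustration: `build_scrapeli_command` is a hypothetical helper that uses the flags listed above, and its rule of exactly one source (user, URL, or input file) is an assumption of this sketch rather than documented CLI behaviour. It builds the command but does not run it.

```python
import shlex


def build_scrapeli_command(user=None, url=None, attribute=None,
                           input_file=None, output_file=None):
    """Build a scrapeli argument list (hypothetical helper, documented flags only)."""
    # Assumption of this sketch: exactly one profile source is given.
    sources = [s for s in (user, url, input_file) if s is not None]
    if len(sources) != 1:
        raise ValueError("Pass exactly one of user, url or input_file")

    cmd = ["scrapeli"]
    if user is not None:
        cmd.append(f"--user={user}")
    if url is not None:
        cmd.append(f"--url={url}")
    if input_file is not None:
        cmd += ["-i", input_file]
    if attribute is not None:
        cmd += ["-a", attribute]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


cmd = build_scrapeli_command(user="austinoboyle", attribute="skills")
print(shlex.join(cmd))  # scrapeli --user=austinoboyle -a skills
```

Once the package is installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`.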

### Python Package

#### Profiles

Use the `ProfileScraper` class to scrape profiles.

```python
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())
```

`Profile` is the class whose properties expose all information pulled from a profile. It also has a `to_dict()` method that returns all of the data as a dict.

    from scrape_linkedin import Profile  # assumes Profile is exported at package level

    with open('profile.html', 'r') as profile_file:
        profile = Profile(profile_file.read())

    print(profile.skills)
    # [{...} ,{...}, ...]
    print(profile.experiences)
    # {jobs: [...], volunteering: [...],...}
    print(profile.to_dict())
    # {personal_info: {...}, experiences: {...}, ...}

**Structure of the fields scraped**

-   personal_info
    -   name
    -   company
    -   school
    -   headline
    -   followers
    -   summary
    -   websites
    -   email
    -   phone
    -   connected
    -   image
-   skills
-   experiences
    -   volunteering
    -   jobs
    -   education
-   interests
-   accomplishments
    -   publications
    -   certifications
    -   patents
    -   courses
    -   projects
    -   honors
    -   test scores
    -   languages
    -   organizations
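As a rough illustration of working with this structure, the sketch below checks which of the top-level sections listed above are present in a `to_dict()` result before serializing it. The `sample` dict is made-up placeholder data, and `missing_sections` is a hypothetical helper, not part of the package.

```python
import json

# Top-level keys taken from the field list above.
EXPECTED_SECTIONS = ("personal_info", "skills", "experiences",
                     "interests", "accomplishments")


def missing_sections(profile_dict):
    """Return the expected top-level sections that are absent from the dict."""
    return [key for key in EXPECTED_SECTIONS if key not in profile_dict]


# Placeholder data standing in for the result of profile.to_dict().
sample = {
    "personal_info": {"name": "Jane Doe", "headline": "Example headline"},
    "skills": [],
    "experiences": {"jobs": [], "education": [], "volunteering": []},
}

print(missing_sections(sample))  # ['interests', 'accomplishments']
print(json.dumps(sample, indent=2))
```

A check like this makes it easier to notice when a scrape came back incomplete, for example because the page did not finish loading.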

#### Companies

Use the `CompanyScraper` class to scrape companies.

```python
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())
```

`Company` is the class whose properties expose all information pulled from a company profile. It will have three properties: `overview`, `jobs`, and `life`. **Overview is the only one currently implemented.**

    from scrape_linkedin import Company  # assumes Company is exported at package level

    # Backslashes continue the multi-file `with` statement across lines
    with open('overview.html', 'r') as overview, \
            open('jobs.html', 'r') as jobs, \
            open('life.html', 'r') as life:
        company = Company(overview, jobs, life)

    print(company.overview)
    # {...}

**Structure of the fields scraped**

-   overview
    -   name
    -   company_size
    -   specialties
    -   headquarters
    -   founded
    -   website
    -   description
    -   industry
    -   num_employees
    -   type
    -   image
-   jobs **NOT YET IMPLEMENTED**
-   life **NOT YET IMPLEMENTED**

#### config

Pass these keyword arguments to the constructor of your Scraper to override the default values. For example, you may want to decrease the timeout if your connection is very fast, or increase it if your connection is slow.

-   _cookie_ **`{str}`**: li_at cookie value (overrides env variable)
    -   **default: `None`**
-   _driver_ **`{selenium.webdriver}`**: driver type to use
    -   **default: `selenium.webdriver.Chrome`**
-   _driver_options_ **`{dict}`**: kwargs to pass to driver constructor
    -   **default: `{}`**
-   _scroll_pause_ **`{float}`**: time in seconds to pause between scroll increments
    -   **default: `0.1`**
-   _scroll_increment_ **`{int}`**: number of pixels to scroll down each time
    -   **default: `300`**
-   _timeout_ **`{float}`**: default time to wait for async content to load
    -   **default: `10`**
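A minimal sketch of overriding these defaults, assuming a slow connection. The values in `SLOW_NETWORK_CONFIG` are illustrative guesses rather than recommendations, and `merged_config` and `scrape_user` are hypothetical helpers. The package is imported inside the function so that the configuration can be inspected without launching a browser.

```python
# Assumed values for a slow connection; tune them for your own setup.
SLOW_NETWORK_CONFIG = {
    "scroll_pause": 0.5,      # package default: 0.1
    "scroll_increment": 200,  # package default: 300
    "timeout": 30,            # package default: 10
}


def merged_config(**overrides):
    """Combine the base config with per-call overrides (overrides win)."""
    return {**SLOW_NETWORK_CONFIG, **overrides}


def scrape_user(user, **overrides):
    """Scrape one profile using the merged config and return it as a dict."""
    # Imported lazily: requires scrape_linkedin, Selenium and a browser driver.
    from scrape_linkedin import ProfileScraper

    with ProfileScraper(**merged_config(**overrides)) as scraper:
        profile = scraper.scrape(user=user)
    return profile.to_dict()


print(merged_config(timeout=60))
```

Calling `scrape_user('austinoboyle', timeout=60)` would then start a browser session with the longer timeout, provided the `li_at` cookie is available.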

## Scraping in Parallel

New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so for this to be beneficial you will want to be scraping many (> 15) profiles.

### Example

```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft']  # ...add more companies here

# Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
```

### Configuration

**Parameters:**

-   _scraper_type_ **`{scrape_linkedin.Scraper}`**: Scraper to use
-   _items_ **`{list}`**: List of items to be scraped
-   _output_file_ **`{str}`**: path to output file
-   _num_instances_ **`{int}`**: number of parallel instances of selenium to run
-   _temp_dir_ **`{str}`**: name of temporary directory to use to store data from intermediate steps
    -   **default: `'tmp_data'`**
-   _driver_ **`{selenium.webdriver}`**: driver to use for scraping
    -   **default: `selenium.webdriver.Chrome`**
-   _driver_options_ **`{dict}`**: dict of keyword arguments to pass to the driver function.
    -   **default: `scrape_linkedin.utils.HEADLESS_OPTIONS`**
-   _\*\*kwargs_ **`{any}`**: extra keyword arguments to pass to the `scraper_type` constructor for each job
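The parameters above can be collected in one place before the call. The sketch below makes several assumptions: `parallel_kwargs` and `scrape_profiles` are hypothetical helpers, capping the instance count at the number of items is a choice made here rather than package behaviour, and `ProfileScraper` is assumed to be accepted as `scraper_type` in the same way as `CompanyScraper`.

```python
def parallel_kwargs(scraper_type, items, output_file, num_instances=4,
                    **scraper_kwargs):
    """Collect keyword arguments for scrape_in_parallel (hypothetical helper)."""
    items = list(items)
    if not items:
        raise ValueError("items must not be empty")
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    return {
        "scraper_type": scraper_type,
        "items": items,
        "output_file": output_file,
        # Starting more browsers than there are items only adds start-up cost.
        "num_instances": min(num_instances, len(items)),
        # Extra keywords are forwarded to the scraper constructor for each job.
        **scraper_kwargs,
    }


def scrape_profiles(users, output_file="profiles.json"):
    """Scrape many profiles in parallel with a longer timeout."""
    # Imported lazily: requires scrape_linkedin, Selenium and a browser driver.
    from scrape_linkedin import ProfileScraper, scrape_in_parallel

    scrape_in_parallel(
        **parallel_kwargs(ProfileScraper, users, output_file, timeout=20)
    )
```

Keeping the argument assembly separate from the call makes it possible to check the settings without opening any browser sessions.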

## Issues

Report bugs and feature requests [here](https://github.com/austinoboyle/scrape-linkedin-selenium/issues).