{"id":13807534,"url":"https://github.com/austinoboyle/scrape-linkedin-selenium","last_synced_at":"2025-04-04T19:13:03.113Z","repository":{"id":38898990,"uuid":"122420141","full_name":"austinoboyle/scrape-linkedin-selenium","owner":"austinoboyle","description":"`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles \u0026 company pages - turning the data into structured json.","archived":false,"fork":false,"pushed_at":"2022-10-16T16:44:50.000Z","size":275,"stargazers_count":480,"open_issues_count":17,"forks_count":164,"subscribers_count":25,"default_branch":"master","last_synced_at":"2025-03-28T18:15:15.083Z","etag":null,"topics":["linkedin","python","scrape","scraper","scraping","selenium","selenium-webdriver","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/austinoboyle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-22T02:21:14.000Z","updated_at":"2025-03-12T07:01:41.000Z","dependencies_parsed_at":"2023-01-19T16:15:23.980Z","dependency_job_id":null,"html_url":"https://github.com/austinoboyle/scrape-linkedin-selenium","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinoboyle%2Fscrape-linkedin-selenium","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinoboyle%2Fscrape-linkedin-selenium/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinoboyle%2Fscrape-linkedin-selenium/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/austinoboyle%2Fscrape-linkedin-selenium/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/austinoboyle","download_url":"https://codeload.github.com/austinoboyle/scrape-linkedin-selenium/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247234923,"owners_count":20905854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["linkedin","python","scrape","scraper","scraping","selenium","selenium-webdriver","web-scraper","web-scraping"],"created_at":"2024-08-04T01:01:26.519Z","updated_at":"2025-04-04T19:13:03.093Z","avatar_url":"https://github.com/austinoboyle.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"# scrape_linkedin\n\n## Introduction\n\n`scrape_linkedin` is a python package to scrape all details from public LinkedIn\nprofiles, turning the data into structured json. 
You can scrape companies\nand user profiles with this package.\n\n**Warning**: LinkedIn has strong anti-scraping policies and may blacklist IPs making\nunauthenticated or unusual requests.\n\n## Table of Contents\n\n\u003c!--ts--\u003e\n\n- [scrape_linkedin](#scrapelinkedin)\n  - [Introduction](#introduction)\n  - [Table of Contents](#table-of-contents)\n  - [Installation](#installation)\n    - [Install with pip](#install-with-pip)\n    - [Install from source](#install-from-source)\n    - [Tests](#tests)\n  - [Getting \u0026 Setting LI_AT](#getting--setting-liat)\n    - [Getting LI_AT](#getting-liat)\n    - [Setting LI_AT](#setting-liat)\n  - [Examples](#examples)\n  - [Usage](#usage)\n    - [Command Line](#command-line)\n    - [Python Package](#python-package)\n      - [Profiles](#profiles)\n      - [Companies](#companies)\n      - [config](#config)\n  - [Scraping in Parallel](#scraping-in-parallel)\n    - [Example](#example)\n    - [Configuration](#configuration)\n  - [Issues](#issues)\n\n\u003c!-- Added by: austinoboyle, at: 2018-05-06T20:13-04:00 --\u003e\n\n\u003c!--te--\u003e\n\n## Installation\n\n### Install with pip\n\nRun `pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git`\n\n### Install from source\n\n`git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git`\n\nRun `python setup.py install` from the cloned directory\n\n### Tests\n\nTests are (so far) only run on static HTML files: one is a LinkedIn\nprofile, and the other is used only to test some utility functions.\n\n## Getting \u0026 Setting LI_AT\n\nBecause of LinkedIn's anti-scraping measures, you must make your Selenium\nbrowser look like an actual user. To do this, you need to add the li_at cookie\nto the Selenium session.\n\n### Getting LI_AT\n\n1.  Navigate to www.linkedin.com and log in\n2.  Open browser developer tools (Ctrl-Shift-I or right click -\u003e inspect\n    element)\n3.  
Select the appropriate tab for your browser (**Application** on Chrome,\n    **Storage** on Firefox)\n4.  Click the **Cookies** dropdown on the left-hand menu, and select the\n    `www.linkedin.com` option\n5.  Find and copy the li_at **value**\n\n### Setting LI_AT\n\nThere are two ways to set your li_at cookie:\n\n1.  Set the LI_AT environment variable\n    -   `$ export LI_AT=YOUR_LI_AT_VALUE`\n    -   **On Windows**: `C:/foo/bar\u003e set LI_AT=YOUR_LI_AT_VALUE`\n2.  Pass the cookie as a parameter to the Scraper object.\n    \u003e `\u003e\u003e\u003e with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:`\n\nA cookie value passed directly to the Scraper **will override your\nenvironment variable** if both are set.\n\n## Examples\n\nSee [`/examples`](https://github.com/austinoboyle/scrape-linkedin-selenium/tree/master/examples)\n\n## Usage\n\n### Command Line\n\n`scrape_linkedin` comes with a command-line tool, `scrapeli`, built\nwith [click](http://click.pocoo.org/5/).\n\n**Note: The CLI currently works only with personal profiles.**\n\nOptions:\n\n-   --url : Full URL of the profile you want to scrape\n-   --user : Username portion of the profile URL (www.linkedin.com/in/USER)\n-   --driver : Browser type to use (Chrome/Firefox), **default: Chrome**\n-   -a --attribute : Return only a specific attribute (default: return all\n    attributes)\n-   -i --input_file : Raw path to the HTML file of the profile you want to scrape\n-   -o --output_file : Raw path to the output file for the structured JSON profile (just\n    prints results by default)\n-   -h --help : Show this screen.\n\nExamples:\n\n-   Get Austin O'Boyle's profile info: `$ scrapeli --user=austinoboyle`\n-   Get only the skills of Austin O'Boyle: `$ scrapeli --user=austinoboyle -a skills`\n-   Parse a stored HTML profile and save the JSON output: `$ scrapeli -i /path/file.html -o output.json`\n\n### Python Package\n\n#### Profiles\n\nUse the `ProfileScraper` component to scrape profiles.\n\n```python\nfrom scrape_linkedin import 
ProfileScraper\n\nwith ProfileScraper() as scraper:\n    profile = scraper.scrape(user='austinoboyle')\nprint(profile.to_dict())\n```\n\n`Profile` is the class whose properties give access to all information pulled from\na profile. It also has a `to_dict()` method that returns all of the data as a dict.\n\n    with open('profile.html', 'r') as profile_file:\n        profile = Profile(profile_file.read())\n\n    print(profile.skills)\n    # [{...}, {...}, ...]\n    print(profile.experiences)\n    # {jobs: [...], volunteering: [...], ...}\n    print(profile.to_dict())\n    # {personal_info: {...}, experiences: {...}, ...}\n\n**Structure of the fields scraped**\n\n-   personal_info\n    -   name\n    -   company\n    -   school\n    -   headline\n    -   followers\n    -   summary\n    -   websites\n    -   email\n    -   phone\n    -   connected\n    -   image\n-   skills\n-   experiences\n    -   volunteering\n    -   jobs\n    -   education\n-   interests\n-   accomplishments\n    -   publications\n    -   certifications\n    -   patents\n    -   courses\n    -   projects\n    -   honors\n    -   test scores\n    -   languages\n    -   organizations\n\n#### Companies\n\nUse the `CompanyScraper` component to scrape companies.\n\n```python\nfrom scrape_linkedin import CompanyScraper\n\nwith CompanyScraper() as scraper:\n    company = scraper.scrape(company='facebook')\nprint(company.to_dict())\n```\n\n`Company` is the class whose properties give access to all information pulled from\na company profile. 
There will be three properties: overview, jobs, and life.\n**Overview is the only one currently implemented.**\n\n    with open('overview.html', 'r') as overview, \\\n            open('jobs.html', 'r') as jobs, \\\n            open('life.html', 'r') as life:\n        company = Company(overview, jobs, life)\n\n    print(company.overview)\n    # {...}\n\n**Structure of the fields scraped**\n\n-   overview\n    -   name\n    -   company_size\n    -   specialties\n    -   headquarters\n    -   founded\n    -   website\n    -   description\n    -   industry\n    -   num_employees\n    -   type\n    -   image\n-   jobs **NOT YET IMPLEMENTED**\n-   life **NOT YET IMPLEMENTED**\n\n#### config\n\nPass these keyword arguments into the constructor of your Scraper to override\ndefault values. You may (for example) want to decrease/increase the timeout if\nyour internet is very fast/slow.\n\n-   _cookie_ **`{str}`**: li_at cookie value (overrides env variable)\n    -   **default: `None`**\n-   _driver_ **`{selenium.webdriver}`**: driver type to use\n    -   **default: `selenium.webdriver.Chrome`**\n-   _driver_options_ **`{dict}`**: kwargs to pass to driver constructor\n    -   **default: `{}`**\n-   _scroll_pause_ **`{float}`**: time (in seconds) to pause between scroll increments\n    -   **default: `0.1`**\n-   _scroll_increment_ **`{int}`**: number of pixels to scroll down each time\n    -   **default: `300`**\n-   _timeout_ **`{float}`**: default time to wait for async content to load\n    -   **default: `10`**\n\n## Scraping in Parallel\n\nNew in version 0.2: built-in parallel scraping functionality. 
Note that the\nup-front cost of starting a browser session is high, so for this to be\nbeneficial, you will want to scrape many (\u003e 15) profiles.\n\n### Example\n\n```python\nfrom scrape_linkedin import scrape_in_parallel, CompanyScraper\n\ncompanies = ['facebook', 'google', 'amazon', 'microsoft', ...]\n\n# Scrape all companies, output to 'companies.json' file, use 4 browser instances\nscrape_in_parallel(\n    scraper_type=CompanyScraper,\n    items=companies,\n    output_file=\"companies.json\",\n    num_instances=4\n)\n```\n\n### Configuration\n\n**Parameters:**\n\n-   _scraper_type_ **`{scrape_linkedin.Scraper}`**: Scraper to use\n-   _items_ **`{list}`**: List of items to be scraped\n-   _output_file_ **`{str}`**: path to output file\n-   _num_instances_ **`{int}`**: number of parallel instances of Selenium to run\n-   _temp_dir_ **`{str}`**: name of the temporary directory used to store data from intermediate steps\n    -   **default: `'tmp_data'`**\n-   _driver_ **`{selenium.webdriver}`**: driver to use for scraping\n    -   **default: `selenium.webdriver.Chrome`**\n-   _driver_options_ **`{dict}`**: dict of keyword arguments to pass to the driver function\n    -   **default: `scrape_linkedin.utils.HEADLESS_OPTIONS`**\n-   _\\*\\*kwargs_ **`{any}`**: extra keyword arguments to pass to the `scraper_type` constructor for each job\n\n## Issues\n\nReport bugs and feature requests\n[here](https://github.com/austinoboyle/scrape-linkedin-selenium/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faustinoboyle%2Fscrape-linkedin-selenium","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faustinoboyle%2Fscrape-linkedin-selenium","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faustinoboyle%2Fscrape-linkedin-selenium/lists"}