https://github.com/hansel-7/linkedin_stalker

Python tool to scrape LinkedIn company posts using Playwright
https://github.com/hansel-7/linkedin_stalker

automation linkedin-scraper playwright python web-scraping

Last synced: about 2 months ago
JSON representation

Python tool to scrape LinkedIn company posts using Playwright

Host: GitHub
URL: https://github.com/hansel-7/linkedin_stalker
Owner: hansel-7
License: mit
Created: 2025-12-05T08:08:08.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-12-05T08:13:56.000Z (6 months ago)
Last Synced: 2025-12-25T13:38:02.438Z (6 months ago)
Topics: automation, linkedin-scraper, playwright, python, web-scraping
Language: Python
Homepage:
Size: 16.6 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          # LinkedIn Stalker - LinkedIn Post Scraper

A Python tool to scrape LinkedIn company page posts and user activity feeds using Playwright. Extract and organize posts from multiple companies for competitive analysis, market research, or content monitoring.

> ⚠️ **Disclaimer**: This tool is for educational and research purposes only. Users must comply with LinkedIn's Terms of Service and applicable laws.

## Features

- ✅ Extracts top N posts from LinkedIn feeds (default: 10)

- ✅ Cookie-based authentication

- ✅ Duplicate detection

- ✅ Repost filtering

- ✅ Multiple URL support

- ✅ Text output format

## Installation

1. Install dependencies:

```bash

pip install -r requirements.txt

```

2. Install Playwright browsers:

```bash

playwright install chromium

```

## Setup

### Step 1: Extract LinkedIn Cookies

Run the cookie extractor to save your LinkedIn session:

```bash

python get_linkedin_cookies.py

```

This will:

1. Open a browser window

2. Navigate to LinkedIn

3. Wait for you to log in manually

4. Save your cookies to `linkedin_cookies.json`

### Step 2: Configure Companies and URLs

Edit `linkedin_urls.json` to add company names and their LinkedIn URLs:

```json

[

  ["OpenAI", "https://www.linkedin.com/company/openai/posts/"],

  ["Anthropic", "https://www.linkedin.com/company/anthropic/posts/"],

  ["Google DeepMind", "https://www.linkedin.com/company/google-deepmind/posts/"]

]

```

**Format:** Each entry is an array with two elements:

1. Company/Person name (string)

2. LinkedIn URL (string)

## Usage

### Scrape Multiple URLs

Run the main scraper to process all URLs from `linkedin_urls.json`:

```bash

python scrape_linkedin.py

```

Results will be saved to `linkedin_output.txt`, organized by company with clear section separators.

### Scrape Single URL (Manual)

Use the agent module directly in Python:

```python

from linkedin_agent import get_linkedin_updates

# Get top 10 posts

posts = get_linkedin_updates("https://www.linkedin.com/company/openai/posts/", max_posts=10)

for post in posts:

    print(f"Position: {post['position']}")

    print(f"Content: {post['text']}")

    print(f"URL: {post['url']}")

    print()

```

## Configuration

### Number of Posts

Change the number of posts to extract by editing `scrape_linkedin.py`:

```python

posts = get_linkedin_updates(url, max_posts=20)  # Get 20 posts instead of 10

```

### Date Filtering (Optional)

If you want to filter by date (experimental, disabled by default):

```python

posts = get_linkedin_updates(url, max_posts=10, max_days=30)

```

## Output Format

The output file (`linkedin_output.txt`) is organized by company:

```

================================================================================

LINKEDIN SCRAPING RESULTS

Generated: 2024-12-05 10:30:00

Total Companies: 2

================================================================================

████████████████████████████████████████████████████████████████████████████████

COMPANY: COMPANY NAME 1

████████████████████████████████████████████████████████████████████████████████

URL: https://www.linkedin.com/company/...

Scraped: 2024-12-05 10:30:15

────────────────────────────────────────────────────────────────────────────────

📊 Total Posts Found: 10

┌─ POST #1 ──────────────────────────────────────────────────────────────────

│ Position in feed: 1

│

│ Content:

│ [Post content here...]

└───────────────────────────────────────────────────────────────────────────

[Additional posts...]

████████████████████████████████████████████████████████████████████████████████

COMPANY: COMPANY NAME 2

████████████████████████████████████████████████████████████████████████████████

[...]

```

## Project Structure

```

linkedin_stalker/

├── linkedin_agent.py              # Core scraping module

├── scrape_linkedin.py             # Batch scraper for multiple companies

├── get_linkedin_cookies.py        # Cookie extraction helper

├── debug_linkedin.py              # Debug mode scraper

├── linkedin_urls.json             # List of companies and URLs to scrape

├── linkedin_cookies.json.example  # Example cookie file structure

├── requirements.txt               # Python dependencies

├── LINKEDIN_STRUCTURE.md          # LinkedIn HTML structure documentation

├── README.md                      # This file

├── LICENSE                        # MIT License

└── .gitignore                     # Git ignore rules

# Generated files (gitignored):

├── linkedin_cookies.json          # Your session cookies

├── linkedin_output.txt            # Scraping results

└── debug.html                     # Debug output

```

## Troubleshooting

### Browser doesn't open

Make sure Playwright browsers are installed:

```bash

playwright install chromium

```

### No posts found

1. Check that `linkedin_cookies.json` exists and has valid cookies

2. Try running `get_linkedin_cookies.py` again to refresh cookies

3. Make sure you're logged into LinkedIn in the browser when extracting cookies

### Duplicates or missing content

The script includes deduplication logic and filters out:

- Reposts

- Image placeholders

- Action buttons (Like, Comment, Share)

- Metadata text

If posts are still being missed, try increasing `max_posts` parameter.

## Notes

- **Legal & Ethical Use**: This tool is for educational purposes and personal research. LinkedIn may rate-limit or block automated access. Always respect LinkedIn's Terms of Service and robots.txt.

- **Cookie Expiration**: Session cookies expire after some time. Re-run `get_linkedin_cookies.py` if scraping fails.

- **Detection Avoidance**: The script uses non-headless mode by default to reduce detection risk.

- **Rate Limiting**: Add delays between requests when scraping multiple companies to avoid rate limits.

## Disclaimer

This tool is for educational and research purposes only. Users are responsible for ensuring their use complies with LinkedIn's Terms of Service and applicable laws. The authors are not responsible for any misuse of this software.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT - see [LICENSE](LICENSE) file for details

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/hansel-7/linkedin_stalker

Awesome Lists containing this project

README