https://github.com/toofancodes/scrapperathleticscontacts
StaffScrapper is a smart and flexible web scraper built for one job: collecting staff contact details from athletics department websites. Whether you're pulling emails, job titles, or phone numbers — even from JavaScript-heavy or obfuscated sites — this tool handles it with ease. Designed for marketing teams, outreach coordinators, and data analys
beautifulsoup beautifulsoup4 python requests scrapping-python selenium selenium-webdriver webdriver-manager
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/toofancodes/scrapperathleticscontacts
- Owner: toofanCodes
- License: mit
- Created: 2025-04-11T06:57:11.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-04-11T07:44:17.000Z (about 1 month ago)
- Last Synced: 2025-04-11T09:40:51.542Z (about 1 month ago)
- Topics: beautifulsoup, beautifulsoup4, python, requests, scrapping-python, selenium, selenium-webdriver, webdriver-manager
- Language: Python
- Homepage: https://www.linkedin.com/in/saranpavuluri/
- Size: 11.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### ✅ `README.md` for `staffScrapper_Apr2025.py`
# 🏫 Staff Directory Scraper for Athletic Websites

This Python script automates the extraction of staff contact information (such as name, email, title, department, and phone number) from college athletics staff directories. It is designed to handle diverse and dynamic HTML structures, including those rendered with JavaScript.

---
## 📌 Features
- ✅ Scrapes data from multiple URLs using a `.csv` input list
- ✅ Handles complex HTML structures (including JavaScript-rendered content with Selenium)
- ✅ Extracts:
  - Full name
  - Position / title
  - Email address (even from obfuscated JS formats)
  - Phone number
  - Associated department or sport (when available)
  - Source URL
- ✅ Gracefully handles errors and logs them for debugging
- ✅ Supports headless scraping for automation pipelines

---
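
The fields listed under Features could map onto one record per staff member. The sketch below is only an illustration of that shape: `StaffRecord`, `write_records`, and the column names are assumptions, not names taken from `staffScrapper_Apr2025.py`.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class StaffRecord:
    # Hypothetical column names; the script's real output header may differ.
    name: str
    title: str = ""
    email: str = ""
    phone: str = ""
    department: str = ""
    source_url: str = ""


def write_records(records, out):
    """Write records as CSV, with a header row, to an open text file object."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(StaffRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
```

Fields that a page does not provide simply stay empty, which keeps every output row the same width. When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.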
## 📂 File Structure
| File | Description |
|------|-------------|
| `staffScrapper_Apr2025.py` | Main scraper script |
| `target_urls.csv` | Input CSV with URLs (one per line) |
| `staff_directory.csv` | Output file with scraped data |
| `scrape_errors.txt` | Error log listing failed URLs and parsing issues |

---
## 🛠 Requirements
Install dependencies using:
```bash
pip install -r requirements.txt
```

### `requirements.txt` content:
```
requests
beautifulsoup4
selenium
webdriver-manager
```

---
## 📥 Usage
1. Prepare a CSV file named `target_urls.csv` with this structure:
```
https://example.edu/staff-directory
https://another.edu/staff-directory
...
```

> **Note**: No header row is required.

2. Run the script:
```bash
python staffScrapper_Apr2025.py
```

3. Output will be saved as:
   - `staff_directory.csv`: extracted contact info
   - `scrape_errors.txt`: any URLs that couldn't be processed

---
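
The usage steps above amount to a small driver loop: read URLs from the headerless CSV, scrape each one, and log failures instead of stopping. The following is a minimal sketch of that flow, not the script's actual code. `load_urls`, `log_error`, `run`, and the `scrape` callback are hypothetical names, and the tab-separated log format is an assumption.

```python
import csv


def load_urls(csv_path):
    """Return URLs from the first column of a headerless CSV, skipping blank rows."""
    urls = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if row and row[0].strip():
                urls.append(row[0].strip())
    return urls


def log_error(log_path, url, error):
    """Append one tab-separated line per failed URL to the error log."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(f"{url}\t{error}\n")


def run(csv_path, log_path, scrape):
    """Call scrape(url) for each URL; collect rows and log failures."""
    rows = []
    for url in load_urls(csv_path):
        try:
            rows.extend(scrape(url))
        except Exception as error:  # keep going so one bad site does not end the run
            log_error(log_path, url, error)
    return rows
```

Passing the scraping function in as a parameter keeps the loop easy to test with a stub before pointing it at real sites.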
## 🧠 How It Works
- Tries multiple parsing strategies (table, definition list, generic row matching)
- Uses `Selenium` headless Chrome if the page is JavaScript-heavy
- Identifies email patterns even when obfuscated with JS `document.write` or `innerText` replacement
- Categorizes staff into departments based on headings where possible

---
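
To illustrate the email de-obfuscation step mentioned under How It Works, here is one possible approach using only the standard `re` module. It is a sketch under stated assumptions, not the script's implementation: it handles only `document.write(...)` calls whose arguments are quoted string literals joined with `+`, and the names `emails_from_script`, `EMAIL_RE`, `DOC_WRITE_RE`, and `STRING_LITERAL_RE` are hypothetical.

```python
import re

# Deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Captures the argument text of document.write(...) up to the first ")".
DOC_WRITE_RE = re.compile(r"document\.write\((.*?)\)", re.DOTALL)
# Matches a single- or double-quoted string literal (no escape handling).
STRING_LITERAL_RE = re.compile(r"""(['"])(.*?)\1""", re.DOTALL)


def emails_from_script(script_text):
    """Rebuild strings passed to document.write and return any emails found."""
    found = []
    for call_args in DOC_WRITE_RE.findall(script_text):
        # Concatenate the literals in order, ignoring the "+" operators between them.
        joined = "".join(text for _quote, text in STRING_LITERAL_RE.findall(call_args))
        for email in EMAIL_RE.findall(joined):
            if email not in found:
                found.append(email)
    return found
```

For example, `emails_from_script("document.write('jdoe' + '@' + 'example.edu');")` returns `['jdoe@example.edu']`. Addresses assembled from JavaScript variables, or literals that contain a closing parenthesis, are not recovered by this sketch.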
## ⚠️ Known Limitations
- Pages with extreme JavaScript complexity may not be 100% compatible
- Obfuscated email formats beyond standard patterns may be missed
- Sites using CAPTCHAs or anti-bot protection are unsupported

## 👤 Author
**Jaya Saran Teja Pavuluri**
[GitHub](https://github.com/toofanCodes)
📧 [email protected]

---
## 📝 License
MIT License – do what you want, just don't spam the scraped contacts 😉