https://github.com/elijahcroft/web_crawler
A PHP-based web crawler that efficiently scrapes and indexes web content. It follows links, retrieves data, and stores it in a structured format. Features include customizable crawl depth, domain targeting, and robots.txt compliance. Perfect for building data pipelines or conducting web scraping projects.
datascraper php7 webscraping
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/elijahcroft/web_crawler
- Owner: elijahcroft
- Created: 2024-12-08T20:49:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-08T21:36:20.000Z (over 1 year ago)
- Last Synced: 2025-02-13T19:40:45.592Z (over 1 year ago)
- Topics: datascraper, php7, webscraping
- Language: PHP
- Homepage:
- Size: 134 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PHP Web Crawler
This is a basic web crawler built in PHP that navigates through web pages starting from a given URL. It extracts metadata such as titles, descriptions, and keywords from the pages it visits.
## Features
- Crawls web pages recursively starting from a given URL.
- Extracts the following metadata:
  - Page Title
  - Meta Description
  - Meta Keywords
- Resolves relative links to absolute URLs.
- Handles edge cases like JavaScript links and duplicate URLs.
- Outputs metadata in JSON format.
- Tracks all visited URLs.
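
The metadata extraction and link resolution listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the function names `extractMetadata` and `resolveUrl` and the JSON keys are hypothetical, the sketch assumes PHP's bundled DOM extension is available, and the link resolver deliberately ignores `../` segments and query-only links.

```php
<?php
// Hypothetical sketch: names and output keys are illustrative, not taken from the repository.

/**
 * Extract the title, meta description, and meta keywords from an HTML string.
 */
function extractMetadata(string $html, string $url): array
{
    // Collect parser warnings internally; real-world HTML is often malformed.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $titles = $doc->getElementsByTagName('title');
    $title = $titles->length > 0 ? trim($titles->item(0)->textContent) : '';

    $description = '';
    $keywords = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Attribute values such as "Keywords" and "keywords" are treated alike.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description') {
            $description = $meta->getAttribute('content');
        } elseif ($name === 'keywords') {
            $keywords = $meta->getAttribute('content');
        }
    }

    return [
        'title' => $title,
        'description' => $description,
        'keywords' => $keywords,
        'url' => $url,
    ];
}

/**
 * Turn a link found on a page into an absolute URL, or return null when the
 * link should not be crawled. Assumes $base is a valid absolute http(s) URL.
 */
function resolveUrl(string $href, string $base)
{
    $href = trim($href);
    // Skip empty links, in-page anchors, and non-HTTP schemes such as javascript:.
    if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto|tel):/i', $href)) {
        return null;
    }
    // Already absolute: nothing to resolve.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    // Protocol-relative link: reuse the scheme of the base URL.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    // Root-relative link: append to the origin.
    if ($href[0] === '/') {
        return $origin . $href;
    }
    // Path-relative link: resolve against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

Passing the array returned by `extractMetadata` to `json_encode` gives the JSON form, and keeping resolved URLs as keys of an array (for example `$visited[$url] = true`) is one simple way to skip duplicates.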
## Prerequisites
- PHP installed on your system (version 7.0 or higher).
- A local or live server to host and test the crawler.
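
A quick way to confirm these prerequisites is a small pre-flight script such as the one below. It is a sketch only: `checkEnvironment` is a hypothetical helper, not part of the repository, and the check for the DOM extension assumes the crawler parses HTML with it.

```php
<?php
// Hypothetical pre-flight check; not part of the repository.

/**
 * Return a list of human-readable problems; an empty list means the checks passed.
 */
function checkEnvironment(string $minVersion = '7.0.0'): array
{
    $problems = [];
    if (version_compare(PHP_VERSION, $minVersion, '<')) {
        $problems[] = 'PHP ' . $minVersion . ' or higher is required; found ' . PHP_VERSION;
    }
    // Assumption: the crawler relies on the bundled DOM extension to parse HTML.
    if (!extension_loaded('dom')) {
        $problems[] = 'The DOM extension is not loaded';
    }
    return $problems;
}

$problems = checkEnvironment();
echo empty($problems) ? "Environment looks OK\n" : implode("\n", $problems) . "\n";
```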
## How to Use
1. Clone the repository or download the script.
2. Place the script in your web server's document root (e.g., `htdocs` for XAMPP).
3. Update the `$start` variable with the URL you want to crawl. For example:
```php
$start = "http://example.com";