
https://github.com/elijahcroft/web_crawler

A PHP-based web crawler that efficiently scrapes and indexes web content. It follows links, retrieves data, and stores it in a structured format. Features include customizable crawl depth, domain targeting, and robots.txt compliance. Perfect for building data pipelines or conducting web scraping projects.

datascraper php7 webscraping




# PHP Web Crawler

This is a basic web crawler built in PHP that navigates through web pages starting from a given URL. It extracts metadata such as titles, descriptions, and keywords from the pages it visits.
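
As a rough illustration of the metadata extraction described above, here is a minimal sketch using PHP's built-in `DOMDocument`. The function name `extract_metadata` and the sample HTML are hypothetical and are not taken from this repository; the actual script may extract these fields differently.

```php
<?php
// Hypothetical helper (not from the repository): pulls the title,
// meta description, and meta keywords out of an HTML string.
function extract_metadata(string $html): array
{
    // Collect parse warnings internally; real-world HTML is often malformed.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length > 0
        ? trim($titleNodes->item(0)->textContent)
        : '';

    $description = '';
    $keywords = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Attribute values such as "Description" are matched case-insensitively.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description') {
            $description = $meta->getAttribute('content');
        } elseif ($name === 'keywords') {
            $keywords = $meta->getAttribute('content');
        }
    }

    return [
        'title' => $title,
        'description' => $description,
        'keywords' => $keywords,
    ];
}

// Sample input, made up for this example.
$html = '<html><head><title>Example</title>'
      . '<meta name="description" content="A sample page">'
      . '<meta name="keywords" content="php, crawler"></head><body></body></html>';

echo json_encode(extract_metadata($html), JSON_PRETTY_PRINT), PHP_EOL;
```

Missing tags simply yield empty strings here, which keeps the JSON shape the same for every page.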

## Features

- Crawls web pages recursively starting from a given URL.
- Extracts the following metadata:
  - Page Title
  - Meta Description
  - Meta Keywords
- Resolves relative links to absolute URLs.
- Handles edge cases like JavaScript links and duplicate URLs.
- Outputs metadata in JSON format.
- Tracks all visited URLs.
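
The link-handling features above (resolving relative links, skipping JavaScript links, and avoiding duplicates) could be sketched roughly as follows. The helper `resolve_link`, the sample links, and the `$visited` / `$queue` variables are assumptions made for illustration, not the repository's own code. The sketch also does not normalize `../` segments, which a fuller implementation would need to do.

```php
<?php
// Hypothetical helper (not from the repository): returns an absolute URL
// for $href relative to $base, or null when the link should be skipped.
// No return type is declared because nullable types need PHP 7.1+.
function resolve_link(string $href, string $base)
{
    $href = trim($href);

    // Skip empty links, in-page anchors, and non-HTTP schemes.
    if ($href === '' || $href[0] === '#'
        || preg_match('/^(javascript|mailto|tel):/i', $href)) {
        return null;
    }

    // Already absolute: use as-is.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts = parse_url($base);
    if (!isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link, e.g. //cdn.example.com/x
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative link, e.g. /contact
    if ($href[0] === '/') {
        return $origin . $href;
    }

    // Path-relative link: resolve against the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Sample data, made up for this example.
$base = 'http://example.com/docs/index.html';
$links = ['about.html', '/contact', 'javascript:void(0)',
          'about.html', '#top', 'http://example.org/'];

$visited = []; // URLs already seen, used as a set via array keys
$queue = [];   // URLs still to crawl, in discovery order
foreach ($links as $href) {
    $url = resolve_link($href, $base);
    if ($url === null || isset($visited[$url])) {
        continue; // skipped link or duplicate
    }
    $visited[$url] = true;
    $queue[] = $url;
}

echo json_encode($queue, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Using array keys for `$visited` makes the duplicate check a constant-time `isset` lookup rather than a scan with `in_array`.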

## Prerequisites

- PHP 7.0 or higher installed on your system.
- A local or live server to host and test the crawler.

## How to Use

1. Clone the repository or download the script.
2. Place the script in your web server's document root (e.g., `htdocs` for XAMPP).
3. Update the `$start` variable with the URL you want to crawl. For example:
```php
$start = "http://example.com";