https://github.com/elijahcroft/web_crawler

A PHP-based web crawler that efficiently scrapes and indexes web content. It follows links, retrieves data, and stores it in a structured format. Features include customizable crawl depth, domain targeting, and robots.txt compliance. Perfect for building data pipelines or conducting web scraping projects.


Host: GitHub
URL: https://github.com/elijahcroft/web_crawler
Owner: elijahcroft
Created: 2024-12-08T20:49:16.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-08T21:36:20.000Z (over 1 year ago)
Last Synced: 2025-02-13T19:40:45.592Z (over 1 year ago)
Topics: datascraper, php7, webscraping
Language: PHP
Homepage:
Size: 134 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# PHP Web Crawler

This is a basic web crawler built in PHP that navigates through web pages starting from a given URL. It extracts metadata such as titles, descriptions, and keywords from the pages it visits.

## Features

- Crawls web pages recursively starting from a given URL.
- Extracts the following metadata:
  - Page Title
  - Meta Description
  - Meta Keywords
- Resolves relative links to absolute URLs.
- Handles edge cases such as `javascript:` links and duplicate URLs.
- Outputs metadata in JSON format.
- Tracks all visited URLs.
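
The metadata extraction and JSON output listed above could look roughly like the sketch below. It is an illustration only, not the repository's code: the function name `extractMetadata` and the array keys `url`, `title`, `description`, and `keywords` are assumptions, and it relies on PHP's built-in `DOMDocument` extension.

```php
<?php
// Hypothetical helper (not taken from the repository): reads the title,
// meta description, and meta keywords from an HTML string.
function extractMetadata(string $html, string $url): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; suppress the parser warnings.
    @$doc->loadHTML($html);

    // Use the first <title> element if the page has one.
    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length > 0
        ? trim($titleNodes->item(0)->textContent)
        : '';

    $description = '';
    $keywords = '';
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Compare the name attribute case-insensitively.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description') {
            $description = trim($meta->getAttribute('content'));
        } elseif ($name === 'keywords') {
            $keywords = trim($meta->getAttribute('content'));
        }
    }

    return [
        'url'         => $url,
        'title'       => $title,
        'description' => $description,
        'keywords'    => $keywords,
    ];
}

// Example call with an inline HTML string instead of a fetched page.
$sample = '<html><head><title>Example</title>'
    . '<meta name="description" content="A sample page.">'
    . '<meta name="keywords" content="php, crawler">'
    . '</head><body></body></html>';

echo json_encode(
    extractMetadata($sample, 'http://example.com'),
    JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
), PHP_EOL;
```

A crawler would call a helper like this once per fetched page and collect the resulting arrays before encoding them as JSON.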

## Prerequisites

- PHP installed on your system (version 7.0 or higher).
- A local or live server to host and test the crawler.

## How to Use

1. Clone the repository or download the script.
2. Place the script in your server's directory (e.g., `htdocs` for XAMPP or similar).
3. Update the `$start` variable with the URL you want to crawl. For example:
```php
$start = "http://example.com";
```
