https://github.com/soulyma/web_crawler

A focused web crawler to extract and structure Arabic content from web pages. Designed for researchers, data analysts, and developers working on Arabic language datasets.

beautifulsoup4 crawler csv data json python structured-data

Host: GitHub
URL: https://github.com/soulyma/web_crawler
Owner: Soulyma
Created: 2024-12-04T16:34:22.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-12-04T17:00:56.000Z (7 months ago)
Last Synced: 2025-02-06T20:24:05.150Z (5 months ago)
Topics: beautifulsoup4, crawler, csv, data, json, python, structured-data
Language: Python
Homepage:
Size: 18.6 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Arabic Web Content Crawler
### Overview
This project provides tools to crawl web pages, extract Arabic content, and convert the extracted data into structured formats like JSON and CSV. It is designed for researchers, data analysts, and developers working with Arabic datasets and content analysis.

### The project includes:

- A Web Crawler: Extracts Arabic content from a given website.
- Data Conversion Tool: Converts the crawled JSON data into CSV format for further analysis.

### Features

**Web Crawler:**
- Crawls a specified website starting from a given URL.
- Extracts Arabic content and organizes it into structured sections based on headings and paragraphs.
- Saves the extracted content into a JSON file.
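
The heading-and-paragraph grouping described above could look roughly like the sketch below. It is not the project's actual code: the project lists beautifulsoup4 as a dependency, while this sketch swaps in the standard-library `html.parser` to stay dependency-free. The names `SectionParser`, `extract_sections`, and `ARABIC_RE` are hypothetical, and treating the basic Arabic Unicode block (U+0600 to U+06FF) as "Arabic content" is an assumption.

```python
import re
from html.parser import HTMLParser

# Assumption: a paragraph counts as Arabic if it contains at least one
# character from the basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Collects Arabic paragraphs, each labeled with the most recent heading."""

    def __init__(self):
        super().__init__()
        self.sections = []   # list of {"section": ..., "text": ...}
        self._heading = ""   # text of the most recent heading
        self._tag = None     # heading or <p> tag currently being read
        self._buffer = []    # text fragments collected for that tag

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag == "p":
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing an inline tag such as <b>; keep collecting
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag in HEADINGS:
            self._heading = text
        elif ARABIC_RE.search(text):
            self.sections.append({"section": self._heading, "text": text})
        self._tag = None


def extract_sections(html):
    """Return the Arabic paragraphs of an HTML string, grouped by heading."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sample = """
<h2>المقدمة</h2>
<p>هذه فقرة تتحدث عن الموضوع.</p>
<p>English only paragraph.</p>
<h2>الخاتمة</h2>
<p>هذه فقرة <b>تلخص</b> الموضوع.</p>
"""
sections = extract_sections(sample)
```

Each resulting entry has the same `section` and `text` keys as the items under `Content` in the example JSON structure; the English-only paragraph is skipped.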

**Data Converter:**
- Converts the crawled JSON data into a CSV file.
- Ensures proper encoding (UTF-8 with BOM) for compatibility with tools like Excel.
- Includes structured headers such as "Document ID", "URL", "Title", "Section", and "Text".
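
As an illustration of the conversion step, here is a minimal sketch that flattens each document's sections into one CSV row apiece. It assumes the JSON layout shown under Example Output; the function names `json_to_rows` and `convert` are hypothetical and may not match the repository's code. Python's `utf-8-sig` codec is what writes the BOM.

```python
import csv
import json

# Column headers as listed in the feature description.
HEADERS = ["Document ID", "URL", "Title", "Section", "Text"]


def json_to_rows(documents):
    """Yield one row (as a dict) per section of each crawled document."""
    for doc in documents:
        for part in doc.get("Content", []):
            yield {
                "Document ID": doc.get("Document ID"),
                "URL": doc.get("URL"),
                "Title": doc.get("Title"),
                "Section": part.get("section"),
                "Text": part.get("text"),
            }


def convert(json_path, csv_path):
    """Read crawled JSON from json_path and write a BOM-prefixed CSV."""
    with open(json_path, encoding="utf-8") as src:
        documents = json.load(src)
    # "utf-8-sig" prepends the BOM; newline="" is what the csv module expects.
    with open(csv_path, "w", encoding="utf-8-sig", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=HEADERS)
        writer.writeheader()
        writer.writerows(json_to_rows(documents))
```

With the file names used in this README, a call would be `convert("crawled_data.json", "crawled_data.csv")`. A document with two sections produces two rows that repeat its Document ID, URL, and Title.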

**Prerequisites**
- Python Version: Python 3.8 or higher.
- beautifulsoup4
- requests

### Example Output:
The crawler saves the extracted data to `crawled_data.json` in the specified directory.

**Example JSON Structure**
```json
[
  {
    "Document ID": 1,
    "URL": "https://example.com/article1",
    "Title": "عنوان المقالة",
    "Content": [
      {
        "section": "المقدمة",
        "text": "هذه فقرة تتحدث عن الموضوع."
      },
      {
        "section": "الخاتمة",
        "text": "هذه فقرة تلخص الموضوع."
      },
      ...
    ]
    ...
  }
]
```

The converter saves the data to `crawled_data.csv` in the specified directory.

**Example CSV Structure**

| Document ID | URL | Title | Section | Text |
|-------------|----------------------------|------------------|-------------|------------------------------|
| 1 | https://example.com/article1 | عنوان المقالة | المقدمة | هذه فقرة تتحدث عن الموضوع. |
