https://github.com/fern-aerell/web-crawling-to-txt
A simple web crawling application that can traverse URLs, extract text content, and save the results in TXT format.
- Host: GitHub
- URL: https://github.com/fern-aerell/web-crawling-to-txt
- Owner: Fern-Aerell
- License: MIT
- Created: 2024-08-25T06:01:05.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-08-25T13:19:06.000Z (11 months ago)
- Last Synced: 2024-08-26T12:13:19.945Z (11 months ago)
- Topics: beautifulsoup4, crawling, python, requests, scraping, txt, web-crawling, web-scraping
- Language: Python
- Homepage:
- Size: 315 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Web Crawling To TXT
This project is an asynchronous web crawling application written in Python. The application can crawl a website, collect valid URLs, and extract content from each URL.
## Features
- Asynchronous URL crawling within the same domain
- Text content extraction from each web page
- Saving crawl results in TXT format
- Cleaning extracted text

## Requirements
To run this application, you need:
- Python 3.x
- Several Python libraries that can be installed using pip:
- aiohttp
- beautifulsoup4
- lxml

You can install all dependencies by running:
```sh
pip install aiohttp beautifulsoup4 lxml
```

## Usage
To run the application, use the following command in the terminal:
```sh
python webcrawling2txt.py <url> <output_name>
```

Where:
- `<url>` is the base URL of the website you want to crawl
- `<output_name>` is the output file name (without the .txt extension)

Example:
```sh
python webcrawling2txt.py https://www.example.com crawl_results
```

The crawl results will be saved in a TXT file named `crawl_results.txt`.
## Project Structure
- `webcrawling2txt.py`: Main file containing all functions for web crawling
- `clean_text()`: Function to clean the extracted text
- `crawl_url()`: Asynchronous function for crawling URLs
- `crawl_website()`: Main function that performs crawling and saves the results
- `main()`: Function to handle command-line arguments and run the crawling process (all four functions are sketched below)
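The repository page does not include the implementation itself, so the following is a minimal sketch of how these four functions might fit together, assuming aiohttp for fetching and BeautifulSoup with the lxml parser for extraction. The function names mirror the project structure above, but the bodies are illustrative, not the project's actual code:

```python
import asyncio
import re
import sys
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup


def clean_text(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


async def crawl_url(session: aiohttp.ClientSession, url: str):
    # Fetch one page, returning its cleaned text and its same-domain links.
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "lxml")
    text = clean_text(soup.get_text(separator=" "))
    links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
    domain = urlparse(url).netloc
    return text, {link for link in links if urlparse(link).netloc == domain}


async def crawl_website(base_url: str, output_name: str) -> None:
    seen = {base_url}
    frontier = [base_url]
    async with aiohttp.ClientSession() as session:
        with open(f"{output_name}.txt", "w", encoding="utf-8") as f:
            while frontier:
                # Fetch the whole frontier concurrently; exceptions are
                # returned as values so one bad page cannot abort the crawl.
                results = await asyncio.gather(
                    *(crawl_url(session, url) for url in frontier),
                    return_exceptions=True,
                )
                next_frontier = []
                for url, result in zip(frontier, results):
                    if isinstance(result, Exception):
                        continue  # skip pages that failed to load or parse
                    text, links = result
                    f.write(f"{url}\n{text}\n\n")
                    for link in links - seen:
                        seen.add(link)
                        next_frontier.append(link)
                frontier = next_frontier


def main() -> None:
    if len(sys.argv) != 3:
        print("Usage: python webcrawling2txt.py <url> <output_name>")
        sys.exit(1)
    asyncio.run(crawl_website(sys.argv[1], sys.argv[2]))


if __name__ == "__main__":
    main()
```

The `asyncio.gather` call is what lets many pages download concurrently, which is the performance characteristic described in the notes below.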
## Notes
- Make sure to comply with the policies and terms of service of the websites you crawl; a robots.txt check is sketched after this list.
- Use this application responsibly and ethically.
- This application uses asyncio and aiohttp for asynchronous crawling, which improves performance on websites with many pages.
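As a concrete illustration of the first note: the standard library's `urllib.robotparser` can check whether a URL may be fetched before crawling it. The project does not state that it performs this check, so this is an optional addition rather than part of the repository:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    # Build the robots.txt location from the URL's scheme and host.
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # no reachable robots.txt; proceed with care
    return parser.can_fetch(user_agent, url)


# Example: skip a URL the site has asked crawlers to avoid.
if not allowed_by_robots("https://www.example.com/private/page"):
    print("Skipping disallowed URL")
```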
## Contribution
Contributions to this project are welcome. If you have suggestions or improvements, feel free to submit a pull request or open an issue.
## License
[MIT License](LICENSE)