https://github.com/fern-aerell/web-crawling-to-txt
A simple web crawling application that can traverse URLs, extract text content, and save the results in TXT format.
- Host: GitHub
- URL: https://github.com/fern-aerell/web-crawling-to-txt
- Owner: Fern-Aerell
- License: MIT
- Created: 2024-08-25T06:01:05.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-08-25T13:19:06.000Z (11 months ago)
- Last Synced: 2024-08-26T12:13:19.945Z (11 months ago)
- Topics: beautifulsoup4, crawling, python, requests, scraping, txt, web-crawling, web-scraping
- Language: Python
- Homepage:
- Size: 315 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Web Crawling To TXT
This project is an asynchronous web crawling application written in Python. The application can crawl a website, collect valid URLs, and extract content from each URL.
## Features
- Asynchronous URL crawling within the same domain
- Text content extraction from each web page
- Saving crawl results in TXT format
- Cleaning extracted text

## Requirements
To run this application, you need:
- Python 3.x
- Several Python libraries that can be installed using pip:
- aiohttp
- beautifulsoup4
- lxml

You can install all dependencies by running:
```sh
pip install aiohttp beautifulsoup4 lxml
```

## Usage
To run the application, use the following command in the terminal:
```sh
python webcrawling2txt.py <url> <output_name>
```

Where:
- `<url>` is the base URL of the website you want to crawl
- `<output_name>` is the output file name (without the .txt extension)

Example:
```sh
python webcrawling2txt.py https://www.example.com crawl_results
```

The crawl results will be saved in a TXT file named `crawl_results.txt`.
## Project Structure
- `webcrawling2txt.py`: Main file containing all functions for web crawling
- `clean_text()`: Function to clean the extracted text
- `crawl_url()`: Asynchronous function for crawling URLs
- `crawl_website()`: Main function that performs crawling and saves the results
- `main()`: Function to handle command-line arguments and run the crawling process (all four functions are sketched below)
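The repository page does not include the implementation itself, so the following is a minimal sketch of how these four functions might fit together, assuming aiohttp for fetching and BeautifulSoup with the lxml parser for extraction. The function names mirror the project structure above, but the bodies are illustrative, not the project's actual code:

```python
import asyncio
import re
import sys
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup


def clean_text(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


async def crawl_url(session: aiohttp.ClientSession, url: str):
    # Fetch one page, returning its cleaned text and its same-domain links.
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, "lxml")
    text = clean_text(soup.get_text(separator=" "))
    links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
    domain = urlparse(url).netloc
    return text, {link for link in links if urlparse(link).netloc == domain}


async def crawl_website(base_url: str, output_name: str) -> None:
    seen = {base_url}
    frontier = [base_url]
    async with aiohttp.ClientSession() as session:
        with open(f"{output_name}.txt", "w", encoding="utf-8") as f:
            while frontier:
                # Fetch the whole frontier concurrently; exceptions are
                # returned as values so one bad page cannot abort the crawl.
                results = await asyncio.gather(
                    *(crawl_url(session, url) for url in frontier),
                    return_exceptions=True,
                )
                next_frontier = []
                for url, result in zip(frontier, results):
                    if isinstance(result, Exception):
                        continue  # skip pages that failed to load or parse
                    text, links = result
                    f.write(f"{url}\n{text}\n\n")
                    for link in links - seen:
                        seen.add(link)
                        next_frontier.append(link)
                frontier = next_frontier


def main() -> None:
    if len(sys.argv) != 3:
        print("Usage: python webcrawling2txt.py <url> <output_name>")
        sys.exit(1)
    asyncio.run(crawl_website(sys.argv[1], sys.argv[2]))


if __name__ == "__main__":
    main()
```

The `asyncio.gather` call is what lets many pages download concurrently, which is the performance characteristic described in the notes below.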
## Notes
- Make sure to comply with the policies and terms of service of the websites you crawl; a robots.txt check is sketched after this list.
- Use this application responsibly and ethically.
- This application uses asyncio and aiohttp for asynchronous crawling, which improves performance on websites with many pages.
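As a concrete illustration of the first note: the standard library's `urllib.robotparser` can check whether a URL may be fetched before crawling it. The project does not state that it performs this check, so this is an optional addition rather than part of the repository:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    # Build the robots.txt location from the URL's scheme and host.
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # no reachable robots.txt; proceed with care
    return parser.can_fetch(user_agent, url)


# Example: skip a URL the site has asked crawlers to avoid.
if not allowed_by_robots("https://www.example.com/private/page"):
    print("Skipping disallowed URL")
```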
## Contribution
Contributions to this project are welcome. If you have suggestions or improvements, feel free to submit a pull request or open an issue.
## License
[MIT License](LICENSE)