Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Check if a URL exists and is reachable
https://github.com/alexmili/reachable
- Host: GitHub
- URL: https://github.com/alexmili/reachable
- Owner: AlexMili
- License: mit
- Created: 2024-08-15T15:06:07.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-02T21:44:41.000Z (about 1 month ago)
- Last Synced: 2024-12-02T22:01:52.873Z (about 1 month ago)
- Topics: crawler, health-check, monitoring, reachability, webscraping
- Language: Python
- Homepage: https://pypi.org/project/reachable/
- Size: 49.8 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
**Reachable** checks if a URL exists and is reachable.
# Features
- Use a `HEAD` request instead of `GET` to save some bandwidth (see the sketch after this list)
- Follow redirects
- Handle local redirects (without full URL in `location` header)
- Record all the URLs of the redirection chain
- Check if the redirected URL matches the TLD of the source URL
- Detect Cloudflare protection
- Avoid basic bot detectors
- Use a random Chrome user agent
- Wait between consecutive requests to the same host
- Include `Host` header
- Use of HTTP/2
- Detect parking domains
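As a rough sketch of the core idea behind the first feature (a minimal illustration assuming `httpx`, not necessarily the library's actual implementation), a `HEAD`-based check might look like:

```python
import httpx


def naive_head_check(url: str, timeout: float = 10.0) -> bool:
    # Hypothetical helper for illustration only.
    try:
        # HEAD retrieves headers only, saving the bandwidth of a full GET
        response = httpx.head(url, follow_redirects=True, timeout=timeout)
        return response.status_code < 400
    except httpx.HTTPError:
        return False
```

A real implementation also has to handle servers that reject `HEAD`, typically by falling back to `GET`.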
# Installation

You can install it with pip:
```bash
pip install reachable
```
Or clone this repository and run:
```bash
cd reachable/
pip install -e .
```

# Usage
## Simple URL
```python
from reachable import is_reachable
result = is_reachable("https://google.com")
```

The output will look like this:
```json
{
"original_url": "https://google.com",
"final_url": "https://www.google.com/",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.google.com/"],
"final_url": "https://www.google.com/",
"tld_match": true
}
}
```
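Since the result exposes `success`, `final_url`, and `error_name` as shown above, a typical follow-up check might be:

```python
if result["success"]:
    print(f"Reachable, final URL: {result['final_url']}")
else:
    print(f"Unreachable: {result['error_name']} (status: {result['status_code']})")
```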
## Multiple URLs

```python
from reachable import is_reachable
result = is_reachable(["https://google.com", "http://bing.com"])
```

The output will look like this:
```json
[
{
"original_url": "https://google.com",
"final_url": "https://www.google.com/",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.google.com/"],
"final_url": "https://www.google.com/",
"tld_match": true
}
},
{
"original_url": "http://bing.com",
"final_url": "https://www.bing.com/?toWww=1&redig=16A78C94",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.bing.com:443/?toWww=1&redig=16A78C94"],
"final_url": "https://www.bing.com/?toWww=1&redig=16A78C94",
"tld_match": true
}
}
]
```
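Because the result is a list of dictionaries with the fields shown above, filtering out the unreachable URLs is a one-liner:

```python
# `result` comes from is_reachable([...]) as in the example above
dead_urls = [r["original_url"] for r in result if not r["success"]]
print(dead_urls)
```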
## Async

```python
import asyncio
from reachable import is_reachable_async

result = asyncio.run(is_reachable_async("https://google.com"))
```
or
```python
import asyncio
from reachable import is_reachable_async

urls = ["https://google.com", "https://bing.com"]

try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No event loop exists yet, so we create one
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

try:
    result = loop.run_until_complete(
        asyncio.gather(*[is_reachable_async(url) for url in urls])
    )
finally:
    loop.close()
```
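On Python 3.7+, when no event loop is already running, the same batch can be written more compactly with `asyncio.run` (a sketch under that assumption):

```python
import asyncio

from reachable import is_reachable_async

urls = ["https://google.com", "https://bing.com"]


async def main():
    # Run all checks concurrently and collect the results in order
    return await asyncio.gather(*[is_reachable_async(url) for url in urls])


result = asyncio.run(main())
```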
### Handling high volumes with `TaskPool`

If you want to process a large number of URLs (more than about 500), you will quickly hit the limits of your hardware and/or OS, because only a limited number of connections can be open at once.
To work around this, you can use the `TaskPool` class. It uses asyncio semaphores to limit the number of coroutines running at once: a lock is acquired when a worker starts and released when it finishes, so a fixed number of workers stays busy without overwhelming the OS.
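As a generic illustration of this semaphore pattern (not the library's own `TaskPool` implementation), a bounded gather could look like this:

```python
import asyncio


async def bounded_gather(coros, limit: int = 100):
    # Hypothetical helper for illustration: caps how many coroutines run at once.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # acquired before the coroutine starts, released when it finishes
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

The library's `TaskPool` packages this idea together with result collection; the example below uses it directly: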
```python
import asyncio

from reachable import is_reachable_async
from reachable.client import AsyncClient
from reachable.pool import TaskPool

urls = ["https://google.com", "https://bing.com"]


async def worker(url, client):
    result = await is_reachable_async(url, client=client)
    return result


async def workers_builder(urls, pool_size: int = 100):
    async with AsyncClient() as client:
        tasks = TaskPool(workers=pool_size)

        for url in urls:
            await tasks.put(worker(url, client=client))

        await tasks.join()

    return tasks._results
try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No event loop exists yet, so we create one
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

try:
    result = loop.run_until_complete(workers_builder(urls))
    print(result)
finally:
    loop.close()
```