Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Check if a URL exists and is reachable
https://github.com/alexmili/reachable
- Host: GitHub
- URL: https://github.com/alexmili/reachable
- Owner: AlexMili
- License: mit
- Created: 2024-08-15T15:06:07.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-02T21:44:41.000Z (about 1 month ago)
- Last Synced: 2024-12-02T22:01:52.873Z (about 1 month ago)
- Topics: crawler, health-check, monitoring, reachability, webscraping
- Language: Python
- Homepage: https://pypi.org/project/reachable/
- Size: 49.8 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
**Reachable** checks if a URL exists and is reachable.
# Features
- Use a `HEAD` request instead of `GET` to save some bandwidth (see the sketch after this list)
- Follow redirects
- Handle local redirects (without full URL in `location` header)
- Record all the URLs of the redirection chain
- Check if the redirected URL matches the TLD of the source URL
- Detect Cloudflare protection
- Avoid basic bot detectors
- Use a random Chrome user agent
- Wait between consecutive requests to the same host
- Include `Host` header
- Use of HTTP/2
- Detect parking domains
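As a rough sketch of the core idea behind the first feature (a minimal illustration assuming `httpx`, not necessarily the library's actual implementation), a `HEAD`-based check might look like:

```python
import httpx


def naive_head_check(url: str, timeout: float = 10.0) -> bool:
    # Hypothetical helper for illustration only.
    try:
        # HEAD retrieves headers only, saving the bandwidth of a full GET
        response = httpx.head(url, follow_redirects=True, timeout=timeout)
        return response.status_code < 400
    except httpx.HTTPError:
        return False
```

A real implementation also has to handle servers that reject `HEAD`, typically by falling back to `GET`.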
# Installation

You can install it with pip:
```bash
pip install reachable
```
Or clone this repository and run:
```bash
cd reachable/
pip install -e .
```

# Usage
## Simple URL
```python
from reachable import is_reachable
result = is_reachable("https://google.com")
```

The output will look like this:
```json
{
"original_url": "https://google.com",
"final_url": "https://www.google.com/",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.google.com/"],
"final_url": "https://www.google.com/",
"tld_match": true
}
}
```
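Since the result exposes `success`, `final_url`, and `error_name` as shown above, a typical follow-up check might be:

```python
if result["success"]:
    print(f"Reachable, final URL: {result['final_url']}")
else:
    print(f"Unreachable: {result['error_name']} (status: {result['status_code']})")
```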
## Multiple URLs

```python
from reachable import is_reachable
result = is_reachable(["https://google.com", "http://bing.com"])
```

The output will look like this:
```json
[
{
"original_url": "https://google.com",
"final_url": "https://www.google.com/",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.google.com/"],
"final_url": "https://www.google.com/",
"tld_match": true
}
},
{
"original_url": "http://bing.com",
"final_url": "https://www.bing.com/?toWww=1&redig=16A78C94",
"response": null,
"status_code": 200,
"success": true,
"error_name": null,
"cloudflare_protection": false,
"redirect": {
"chain": ["https://www.bing.com:443/?toWww=1&redig=16A78C94"],
"final_url": "https://www.bing.com/?toWww=1&redig=16A78C94",
"tld_match": true
}
}
]
```
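Because the result is a list of dictionaries with the fields shown above, filtering out the unreachable URLs is a one-liner:

```python
# `result` comes from is_reachable([...]) as in the example above
dead_urls = [r["original_url"] for r in result if not r["success"]]
print(dead_urls)
```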
## Async

```python
import asyncio
from reachable import is_reachable_async

result = asyncio.run(is_reachable_async("https://google.com"))
```
or
```python
import asyncio
from reachable import is_reachable_async

urls = ["https://google.com", "https://bing.com"]

try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No event loop exists yet, so we create one
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

try:
    result = loop.run_until_complete(
        asyncio.gather(*[is_reachable_async(url) for url in urls])
    )
finally:
    loop.close()
```
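On Python 3.7+, when no event loop is already running, the same batch can be written more compactly with `asyncio.run` (a sketch under that assumption):

```python
import asyncio

from reachable import is_reachable_async

urls = ["https://google.com", "https://bing.com"]


async def main():
    # Run all checks concurrently and collect the results in order
    return await asyncio.gather(*[is_reachable_async(url) for url in urls])


result = asyncio.run(main())
```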
### Handling high volumes with `TaskPool`

If you want to process a large number of URLs (more than about 500), you will quickly hit the limits of your hardware and/or OS, because only a limited number of connections can be open at once.
To work around this, you can use the `TaskPool` class. It uses asyncio semaphores to limit the number of coroutines running at once: a lock is acquired when a worker starts and released when it finishes, so a fixed number of workers stays busy without overwhelming the OS.
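As a generic illustration of this semaphore pattern (not the library's own `TaskPool` implementation), a bounded gather could look like this:

```python
import asyncio


async def bounded_gather(coros, limit: int = 100):
    # Hypothetical helper for illustration: caps how many coroutines run at once.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # acquired before the coroutine starts, released when it finishes
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

The library's `TaskPool` packages this idea together with result collection; the example below uses it directly: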
```python
import asyncio

from reachable import is_reachable_async
from reachable.client import AsyncClient
from reachable.pool import TaskPool

urls = ["https://google.com", "https://bing.com"]


async def worker(url, client):
    result = await is_reachable_async(url, client=client)
    return result


async def workers_builder(urls, pool_size: int = 100):
    async with AsyncClient() as client:
        tasks = TaskPool(workers=pool_size)

        for url in urls:
            await tasks.put(worker(url, client=client))

        await tasks.join()

    return tasks._results
try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No event loop exists yet, so we create one
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)

try:
    result = loop.run_until_complete(workers_builder(urls))
    print(result)
finally:
    loop.close()
```