https://github.com/codedotjs/urlist

A Python script that extracts URLs from a text file containing a list of websites.

URLs from URL(s)

---

### Purpose

- This script reads a text file containing a list of websites, extracts all the URLs found on each one, and saves them in JSON format.
- Handles missing URL schemes (e.g. a bare `example.com`) and resolves relative URLs so that every saved link is absolute.
- Uses multithreading to concurrently process multiple websites, so it's fast!
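
To illustrate the scheme and relative-URL handling described above, here is a minimal sketch using only the standard library. It is not the code from `extractor.py`: the names `ensure_scheme`, `LinkCollector`, and `extract_links` are hypothetical, the `https` default is an assumption, and the standard `html.parser` stands in for BeautifulSoup so the example has no dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def ensure_scheme(url, default="https"):
    """Prepend a scheme when an input line lacks one (hypothetical helper)."""
    url = url.strip()
    if url.startswith("//"):      # protocol-relative, e.g. //example.com
        return f"{default}:{url}"
    if "://" not in url:          # bare host, e.g. example.com
        return f"{default}://{url}"
    return url


class LinkCollector(HTMLParser):
    """Collects absolute http(s) targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # urljoin resolves relative paths against the page URL
        absolute = urljoin(self.base_url, href)
        # Skip mailto:, javascript:, and other non-web schemes
        if urlparse(absolute).scheme in ("http", "https"):
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return every absolute link found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

For example, `extract_links('<a href="/about">About</a>', ensure_scheme("example.com"))` resolves the relative path against the repaired base URL.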

---

### Usage

- Install the required modules

```sh
$ pip install aiohttp beautifulsoup4 fake_useragent
```

- Download the script

```sh
$ curl -OL https://raw.githubusercontent.com/CodeDotJS/urlist/master/extractor.py
```

- Run

```sh
$ python extractor.py
```

__Note:__ If you need to save all the links present in the JSON to a text file, you can download the `generateTxt.py` helper script:

```sh
$ curl -OL https://raw.githubusercontent.com/CodeDotJS/urlist/master/generateTxt.py
```
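
The JSON-to-text conversion can be sketched as follows. This is not the contents of `generateTxt.py`, and the layout of the JSON output is not documented here, so the sketch assumes nothing about its shape: it walks any nesting of objects and arrays and keeps every string value that looks like an HTTP(S) URL. The function names and file paths are hypothetical.

```python
import json


def collect_urls(node):
    """Recursively yield every string value that looks like an HTTP(S) URL."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")):
            yield node
    elif isinstance(node, dict):
        # Only values are inspected; dictionary keys are ignored
        for value in node.values():
            yield from collect_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_urls(item)


def json_to_text(json_path, txt_path):
    """Write one URL per line to txt_path and return how many were written."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    # dict.fromkeys removes duplicates while keeping first-seen order
    urls = list(dict.fromkeys(collect_urls(data)))
    with open(txt_path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
    return len(urls)
```

Calling `json_to_text("links.json", "links.txt")` with your own file names would produce a plain list with one URL per line.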

### Reason

I needed a tool to generate thousands of active URLs and dump them as JSON, so I built one.

### License

MIT