https://github.com/codedotjs/urlist

A Python script that extracts URLs from a text file containing a list of websites.

collection extractor python scraping urls

Host: GitHub
URL: https://github.com/codedotjs/urlist
Owner: CodeDotJS
License: mit
Created: 2023-07-05T07:17:41.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2023-07-05T22:54:10.000Z (over 2 years ago)
Last Synced: 2025-03-20T00:41:20.924Z (10 months ago)
Topics: collection, extractor, python, scraping, urls
Language: Python
Homepage:
Size: 37.1 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

URLs from URL(s)

---

### Purpose

- This script extracts all the URLs found on the websites listed in a text file and saves them in JSON format.
- Handles missing schemes and fixes relative URLs to ensure accurate results.
- Uses multithreading to process multiple websites concurrently, so it's fast!
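
To illustrate what handling a missing scheme and fixing relative URLs can look like, here is a minimal sketch using only the standard library. It is not taken from `extractor.py`; the function names `ensure_scheme` and `resolve_link` and the default of `https` are assumptions made for this example.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def ensure_scheme(url: str, default_scheme: str = "https") -> str:
    """Prepend a scheme when an input line lacks one (e.g. 'example.com').

    Hypothetical helper: the default scheme is an assumption for this sketch.
    """
    url = url.strip()
    if "://" not in url:
        return f"{default_scheme}://{url.lstrip('/')}"
    return url


def resolve_link(page_url: str, href: str) -> Optional[str]:
    """Turn an href found on page_url into an absolute http(s) URL.

    Returns None for links that are not web URLs (mailto:, javascript:, ...).
    """
    absolute = urljoin(page_url, href.strip())
    parsed = urlparse(absolute)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return absolute
    return None


if __name__ == "__main__":
    base = ensure_scheme("example.com/blog/")
    for href in ("../about", "/contact", "mailto:hi@example.com"):
        print(href, "->", resolve_link(base, href))
```

`urljoin` does the relative-to-absolute resolution, so the only custom logic is the scheme check before and after it.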

---

### Usage

- Install the required modules

```sh
$ pip install aiohttp beautifulsoup4 fake_useragent
```

- Download the script

```sh
$ curl -OL https://raw.githubusercontent.com/CodeDotJS/urlist/master/extractor.py
```

- Run

```sh
$ python extractor.py
```

__Note:__ If you need to save all the links present in the JSON to a text file, you can download the helper script `generateTxt.py`:

```sh
$ curl -OL https://raw.githubusercontent.com/CodeDotJS/urlist/master/generateTxt.py
```
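
For readers who want to see the idea behind such a conversion, the following is a rough sketch, not the contents of `generateTxt.py`. The layout of the JSON output is not documented here, so the sketch makes no assumption about it and simply walks every nested value looking for `http(s)` strings. The names `iter_urls` and `json_to_text` and both file paths are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_urls(node: Any) -> Iterator[str]:
    """Yield every http(s) string found among the values of decoded JSON."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")):
            yield node
    elif isinstance(node, dict):
        # Only values are inspected; dictionary keys are ignored.
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)


def json_to_text(json_path: str, text_path: str) -> int:
    """Write one URL per line to text_path and return how many were written."""
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    urls = list(dict.fromkeys(iter_urls(data)))
    with open(text_path, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
    return len(urls)
```

Calling `json_to_text("links.json", "links.txt")` (both paths are placeholders) would write the de-duplicated links one per line.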

### Reason

I needed a tool to generate thousands of active URLs and dump them as JSON, so I built one.

### License

MIT
