https://github.com/hasdata/find-urls-from-any-domain

This repository provides practical examples of website link scraping using Python and Node.js.

Host: GitHub
URL: https://github.com/hasdata/find-urls-from-any-domain
Owner: HasData
Created: 2025-05-19T13:12:00.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-19T13:34:22.000Z (about 1 year ago)
Last Synced: 2026-04-27T12:35:17.610Z (3 months ago)
Topics: ai-extraction, crawler, hasdata-api, nodejs, python, sitemap-parser, url-extraction, web-crawling, web-scraping
Language: JavaScript
Homepage: https://hasdata.com/blog/find-all-urls-on-a-domain
Size: 328 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

![Python](https://img.shields.io/badge/python-3.10+-blue)

![Node.js](https://img.shields.io/badge/node.js-18+-green)

# Web Crawling & Scraping Examples (Python & Node.js)

[![HasData banner](banner.png)](https://hasdata.com/)

This repository contains practical examples of website link collection using **Python** and **Node.js**. It covers different methods: from basic sitemap parsing with `requests` to crawling entire websites and scraping Google SERPs with HasData’s API.

## Table of Contents

1. [Requirements](#requirements)

2. [Project Structure](#project-structure)

3. [Scraping & Crawling Examples](#scraping--crawling-examples)

   * [Sitemap Scraping (Requests)](#sitemap-scraping-requests)

   * [Sitemap Scraping (HasData)](#sitemap-scraping-hasdata)

   * [Full Website Crawling (HasData)](#full-website-crawling-hasdata)

   * [Crawling with AI Extraction (HasData)](#crawling-with-ai-extraction-hasdata)

   * [Google SERP Scraping (HasData)](#google-serp-scraping-hasdata)

## Requirements

**Python 3.10+** or **Node.js 18+**

### Python Setup

Required packages:

* `requests`

Install:

```bash
pip install requests
```

### Node.js Setup

Required packages:

* `axios`

Install:

```bash
npm install axios
```

## Project Structure

```
web-scraping-examples/
│
├── python/
│   ├── sitemap_scraper_requests.py
│   ├── sitemap_scraper_hasdata.py
│   ├── crawler_hasdata.py
│   ├── crawler_ai_hasdata.py
│   ├── google_serp_scraper_hasdata.py
│
├── nodejs/
│   ├── sitemap_scraper_requests.js
│   ├── sitemap_scraper_hasdata.js
│   ├── crawler_hasdata.js
│   ├── crawler_ai_hasdata.js
│   ├── google_serp_scraper_hasdata.js
│
└── README.md
```

Each script is focused on a specific use case. No frameworks. Just clean and minimal examples to get things done.

## Scraping & Crawling Examples

Read the full article on [scraping URLs from any website](https://hasdata.com/blog/find-all-urls-on-a-domain).

### Sitemap Scraping (Requests)

A basic script that fetches and parses a sitemap XML using `requests` and `xml.etree.ElementTree`. No external services involved. Good for simple sites with clean sitemaps.

Values to change in the script:

| Parameter     | Description                  | Example                                      |
| ------------- | ---------------------------- | -------------------------------------------- |
| `sitemap_url` | URL of the sitemap to scrape | `'https://demo.nopcommerce.com/sitemap.xml'` |
| `output_file` | File name to save links      | `'sitemap_links.txt'`                        |
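
The approach described in this section can be sketched as follows. This is a minimal illustration, not the repository's actual script: the function names `extract_links` and `scrape_sitemap` are hypothetical, and the sketch assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_links(xml_data):
    """Return the text of every <loc> element in a sitemap (str or bytes)."""
    root = ET.fromstring(xml_data)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


def scrape_sitemap(sitemap_url, output_file):
    """Download a sitemap, extract its links, and write one link per line."""
    import requests  # third-party; imported lazily so extract_links works without it

    response = requests.get(sitemap_url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    # Passing bytes lets the parser honor the encoding declared in the XML.
    links = extract_links(response.content)
    with open(output_file, "w", encoding="utf-8") as fh:
        fh.write("\n".join(links))
    return links
```

Calling `scrape_sitemap('https://demo.nopcommerce.com/sitemap.xml', 'sitemap_links.txt')` would use the example values from the table. If the URL points to a sitemap index rather than a plain sitemap, the extracted links are the child sitemap URLs, not page URLs.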

### Sitemap Scraping (HasData)

Uses HasData's API to process a sitemap and extract its links. This approach scales more easily and still works when the sitemap is large or split across multiple files.

Values to change in the script:

| Parameter    | Description                  | Example                                      |
| ------------ | ---------------------------- | -------------------------------------------- |
| `API_KEY`    | Your HasData API key         | `'111-1111-11-1'`                            |
| `sitemapUrl` | URL of the sitemap to scrape | `'https://demo.nopcommerce.com/sitemap.xml'` |
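
For comparison, when a sitemap is split across multiple files, the root document is a `<sitemapindex>` whose `<loc>` entries point to child sitemaps rather than pages. The sketch below shows the recursion a plain-Python script would need in that case. It does not call HasData and is not taken from the repository: `collect_urls` and the injectable `fetch` callable are hypothetical names, and the standard sitemaps.org namespace is assumed.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes | str],
    seen: set[str] | None = None,
) -> list[str]:
    """Recursively collect page URLs, following sitemap index files."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", NS)
        if loc.text and loc.text.strip()
    ]

    # A <sitemapindex> root lists child sitemaps; a <urlset> root lists pages.
    if root.tag.endswith("sitemapindex"):
        urls: list[str] = []
        for child in locs:
            urls.extend(collect_urls(child, fetch, seen))
        return urls
    return locs
```

With `requests` installed, `fetch` could be `lambda u: requests.get(u, timeout=30).content`. Keeping the download step injectable makes the traversal easy to test offline.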

### Full Website Crawling (HasData)

Launches a full crawl of a website using [HasData’s crawler](https://docs.hasdata.com/scrapers/websites-crawler/quickstart). Useful when the sitemap is missing or incomplete. Returns all discovered URLs.

Values to change in the script:

| Parameter       | Description                         | Example                            |
| --------------- | ----------------------------------- | ---------------------------------- |
| `API_KEY`       | Your HasData API key                | `'111-1111-11-1'`                  |
| `payload.limit` | Max number of links to collect      | `20`                               |
| `payload.urls`  | List of URLs to crawl               | `['https://demo.nopcommerce.com']` |
| `output_path`   | Filename to save the collected URLs | `'results_.json'`                  |
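
As a rough sketch of how the parameters above fit together, the code below builds the request body from `limit` and `urls` and saves a result to `output_path`. The helper names are hypothetical, and two details are assumptions rather than facts from this README: the `endpoint` argument is a placeholder you would take from the HasData crawler quickstart, and the `x-api-key` header name should be checked against the same docs. Polling for job completion and retrieving results are not covered here.

```python
import json


def build_crawl_payload(urls, limit):
    """Build the request body from the `payload.urls` and `payload.limit` values."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return {"limit": limit, "urls": list(urls)}


def start_crawl(endpoint, api_key, payload):
    """POST the payload to the crawler.

    ASSUMPTION: `endpoint` and the `x-api-key` header name are placeholders;
    confirm both in the HasData documentation before use.
    """
    import requests  # third-party; imported lazily so the helpers work without it

    response = requests.post(
        endpoint,
        json=payload,
        headers={"x-api-key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def save_results(results, output_path):
    """Write the collected data to a JSON file."""
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
```

For the example values in the table, `build_crawl_payload(['https://demo.nopcommerce.com'], 20)` produces the body that `start_crawl` would send.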

### Crawling with AI Extraction (HasData)

Works like the full website crawl above, but adds AI-powered content extraction. Use `aiExtractRules` to define the data you want from each page. Well suited to structured scraping.

Values to change in the script:

| Parameter        | Description                        | Example                   |
| ---------------- | ---------------------------------- | ------------------------- |
| `API_KEY`        | Your HasData API key               | `'111-1111-11-1'`         |
| `urls`           | List of URLs to crawl              | `["https://example.com"]` |
| `limit`          | Max number of pages to crawl       | `20`                      |
| `aiExtractRules` | JSON schema for AI content parsing | See script                |
| `outputFormat`   | Desired output format(s)           | `["json", "text"]`        |

### Google SERP Scraping (HasData)

Sends a search query to HasData and gets back links from Google search results. No browser automation needed. Simple and fast way to collect SERP data.

Values to change in the script:

| Parameter     | Description                | Example                         |
| ------------- | -------------------------- | ------------------------------- |
| `api_key`     | Your HasData API key       | `'YOUR-API-KEY'`                |
| `query`       | Search query for Google    | `'site:hasdata.com inurl:blog'` |
| `location`    | Search location            | `'Austin,Texas,United States'`  |
| `deviceType`  | Device type for search     | `'desktop'`                     |
| `num_results` | Number of results to fetch | `100`                           |
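
The example `query` combines two Google search operators: `site:` restricts results to one domain, and `inurl:` keeps only URLs containing a given string. A small helper for composing such queries might look like the sketch below; `build_site_query` is a hypothetical name and is not part of the repository or of HasData's API.

```python
def build_site_query(domain: str, inurl: str = "") -> str:
    """Compose a Google query limited to one domain, optionally filtered by URL text."""
    domain = domain.strip()
    if not domain or " " in domain:
        raise ValueError("domain must be a non-empty host name without spaces")
    parts = [f"site:{domain}"]
    if inurl.strip():
        parts.append(f"inurl:{inurl.strip()}")
    return " ".join(parts)


print(build_site_query("hasdata.com", "blog"))  # site:hasdata.com inurl:blog
```

The returned string can be passed as the `query` value from the table.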
