
# MassDownload

**A Chrome/Edge extension that scrapes Google, Bing, or sitemap.xml — across all pages — and bulk-downloads the files it finds, into a searchable local library.**

[![CI](https://github.com/bogdanpricop/MassDownload/actions/workflows/ci.yml/badge.svg)](https://github.com/bogdanpricop/MassDownload/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Latest release](https://img.shields.io/github/v/release/bogdanpricop/MassDownload)](https://github.com/bogdanpricop/MassDownload/releases/latest)
[![Manifest V3](https://img.shields.io/badge/Manifest-V3-4285F4?logo=googlechrome&logoColor=white)](https://developer.chrome.com/docs/extensions/mv3/intro/)
[![TypeScript](https://img.shields.io/badge/TypeScript-strict-3178C6?logo=typescript&logoColor=white)](https://www.typescriptlang.org/)
[![Code size](https://img.shields.io/github/languages/code-size/bogdanpricop/MassDownload)](https://github.com/bogdanpricop/MassDownload)

[Quickstart](#30-second-quickstart) • [Install](#install) • [How it works](#how-it-works) • [Comparison](#comparison) • [Use cases](#use-cases) • [FAQ](#faq) • [Contributing](#contributing)

---

## 30-second quickstart

After [installing](#install):

1. Click the **MassDownload** icon in your browser toolbar — the side panel opens
2. Fill in **Site** = `data.gov` (or any domain), **Filetype** = `pdf`, leave Source on Google
3. Click **Search** — the extension paginates through Google in a real background tab
4. Pick the files you want and click **Download selected** — they land in `Downloads/MassDownload/data.gov/`
5. Click **Library** to browse a per-host HTML index of everything you've collected, with live search

That's the whole loop. The rest of this README covers tuning, alternative sources (Bing, sitemap.xml), and edge cases.

---

## What it does

You search Google for `site:example.gov filetype:pdf` and you want every PDF — across **all** result pages, not just the first ten. Without MassDownload you'd:

1. Open Google
2. Type the query
3. Click each PDF
4. Paginate through 10+ Google pages
5. Repeat for the next site

With MassDownload:

1. Click the toolbar icon → side panel opens
2. Site = `example.gov`, filetype = `pdf` → **Search**
3. All matching PDFs across all result pages are listed (with title and snippet)
4. **Download selected** → files saved in parallel to `Downloads/MassDownload/example.gov/`
5. Click **Library** → searchable HTML index of everything you've downloaded from that site

Saved searches let you re-run a query in one click. Sitemap mode finds files Google never indexed. Auto-fallback to Bing handles Google CAPTCHA gracefully.

## Screenshots

- **Quick Search form in the side panel**: site, filetype, source, keywords. Saved searches re-run in one click.
- **Scan results with already-in-library badges**: the counter shows new vs already-downloaded. Files in the library are unchecked by default.
- **Per-host library.html with live search**: single-file HTML with live search, sort, filter, and `file://` links to local downloads.

## Install

### Option A — Pre-built zip (no Node.js required) ⭐ recommended for users

1. Go to the [latest release](https://github.com/bogdanpricop/MassDownload/releases/latest).
2. Download `MassDownload-vX.Y.Z.zip` from the release assets.
3. Extract the archive somewhere stable (e.g. `C:\Users\you\Apps\MassDownload\`). **Don't pick a temp folder** — the browser keeps loading the extension from this path, so deleting the folder later breaks it.
4. Open the extension manager:
   - Chrome / Brave / Vivaldi: `chrome://extensions`
   - Microsoft Edge: `edge://extensions`
5. Toggle **Developer mode**.
6. Click **Load unpacked** and select the extracted folder (the one that contains `manifest.json`).
7. Pin the MassDownload icon to the toolbar — click it on any tab to open the side panel.

To update later: download the new zip, replace the folder contents, then click the **🔄 Reload** button on the extension's card.

### Option B — Build from source (developers)

Requires Node.js 18+ and npm.

```bash
git clone https://github.com/bogdanpricop/MassDownload.git
cd MassDownload
npm install
npm run build # outputs dist/ (Chrome / Edge)
npm run build:firefox # outputs dist-firefox/ (experimental)
```

Then in your browser: `chrome://extensions` → **Developer mode** → **Load unpacked** → select `dist/` (or `dist-firefox/` for Firefox via `about:debugging`).

### Dev with HMR

```bash
npm run dev
```

Reload the extension once after the first dev run; the side panel hot-reloads on changes.

---

## How it works

### Four scan sources

| Source | When to use | How |
|---|---|---|
| **Google** (default) | You want what Google indexes; site has no public sitemap | Builds `site:X filetype:Y` query, paginates with `start=0,10,20…` via a real background tab (loading→complete cycles, random 0.8–2s delays — looks like a human, not a bot). **Auto-falls back to Bing on CAPTCHA.** |
| **Bing** | Google rate-limited; alternative result set | Uses `bing.com/search?q=site:X filetype:Y`, paginates `first=1,51,101…&count=50`. Bing rarely CAPTCHAs. |
| **Sitemap.xml** | Site has a sitemap; you want **everything**, not just indexed pages | Reads `robots.txt` for `Sitemap:` directives, falls back to `/sitemap.xml`, `/sitemap_index.xml`. Recursively follows sitemap-index trees (max depth 3, max 50 sitemap files). Supports gzipped `.xml.gz`. |
| **Crawl** | No sitemap, Google indexing is patchy, but the site has internal links | BFS from the homepage following intra-domain `<a href>` links up to depth 2 and 100 pages, harvesting any URLs that match your filetype filter. |
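
The Google and Bing pagination schemes in the table can be sketched as below. This is a minimal illustration, not the extension's actual code: the function names, the `www.` hostnames, and the uniform distribution of the delay are assumptions; only the `site:X filetype:Y` query shape, the `start`/`first`/`count` parameters, and the 0.8–2 s delay range come from the table.

```typescript
// Hypothetical helpers; names are illustrative, not taken from the extension's source.
type Engine = "google" | "bing";

/** Builds the `site:X filetype:Y` query, with optional keywords appended. */
function buildQuery(site: string, filetype: string, keywords = ""): string {
  return [`site:${site}`, `filetype:${filetype}`, keywords.trim()]
    .filter((part) => part.length > 0)
    .join(" ");
}

/** Returns the search URL for a zero-based result page index. */
function pageUrl(engine: Engine, query: string, page: number): string {
  const q = encodeURIComponent(query);
  if (engine === "google") {
    // Google: start=0,10,20,...
    return `https://www.google.com/search?q=${q}&start=${page * 10}`;
  }
  // Bing: first=1,51,101,... with 50 results per page
  return `https://www.bing.com/search?q=${q}&first=${page * 50 + 1}&count=50`;
}

/** Random pause between page loads, in milliseconds (0.8 to 2 seconds). */
function randomDelayMs(): number {
  return 800 + Math.random() * 1200;
}
```

For example, `pageUrl("bing", buildQuery("data.gov", "pdf"), 2)` yields a URL ending in `&first=101&count=50`.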

The side panel also has a **Scan tab** button: parses the active tab as Google SERP if the URL matches, otherwise extracts every `a[href]` from the page (generic mode).
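
For the Sitemap.xml source in the table above, discovery starts from `robots.txt`. A rough sketch of that first step follows, assuming the file has already been fetched as text. `sitemapsFromRobots` and `sitemapCandidates` are hypothetical names, and the real parser in `parsers/sitemap.ts` may differ.

```typescript
/** Extracts sitemap URLs from the text of a robots.txt file. */
function sitemapsFromRobots(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    // The directive name is case-insensitive: "Sitemap: https://..."
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) found.push(match[1]);
  }
  return found;
}

/** Candidate sitemap URLs: robots.txt directives first, then the conventional paths. */
function sitemapCandidates(origin: string, robotsTxt: string | null): string[] {
  const fromRobots = robotsTxt ? sitemapsFromRobots(robotsTxt) : [];
  if (fromRobots.length > 0) return fromRobots;
  return [`${origin}/sitemap.xml`, `${origin}/sitemap_index.xml`];
}
```

Each candidate would then be fetched and, if it turns out to be a sitemap index, followed recursively within the depth and file-count limits listed in the table.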

### Smart filename

When Google/Bing supply a result title, it's used as the filename instead of the URL pathname:

- URL `https://example.com/cgi-bin/dl.php?id=7821` + title `"Decision 312/2024"` → file `Decision 312_2024.pdf`
- URL `https://example.com/papers/foo.pdf` (no title) → file `foo.pdf`
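
The two examples above could be produced by something like the following sketch. The set of replaced characters and the `download.<ext>` fallback are assumptions, and `smartFilename` is an illustrative name rather than the function in `parsers/filters.ts`.

```typescript
/** Replaces characters that are unsafe in filenames and collapses whitespace. */
function sanitize(name: string): string {
  return name.replace(/[\\/:*?"<>|]/g, "_").replace(/\s+/g, " ").trim();
}

/** Prefers the search-result title; falls back to the last URL path segment. */
function smartFilename(url: string, title: string | undefined, ext: string): string {
  if (title && title.trim().length > 0) {
    const base = sanitize(title);
    return base.toLowerCase().endsWith(`.${ext}`) ? base : `${base}.${ext}`;
  }
  // No title: strip the query string and fragment, then take the last segment.
  const path = url.split(/[?#]/)[0];
  const last = path.substring(path.lastIndexOf("/") + 1);
  return sanitize(last) || `download.${ext}`;
}
```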

### Per-host library

Every successfully downloaded file is recorded in `chrome.storage.local` with metadata: title, snippet description, query that found it, source engine, host, timestamp, size. There are **two views** of the library:

1. **In-extension editable page** (click **Library** in the side panel) — full search, filters by host/source/tag, plus per-entry editing: custom title, tags, notes, remove from library, show in folder. JSON / CSV / portable-HTML export.
2. **Standalone `Downloads/MassDownload/{host}/library.html`** — auto-regenerated after each batch (and on every edit). Live search, sort, filter, `file://` links to local files. **Fully self-contained** — open from USB stick, email to a colleague, archive. Zero external dependencies.

The on-disk HTML is read-only; tags/notes/custom titles are added in the in-extension page and propagate to the on-disk version on save.

### Scan-time dedup

Results that are already in the library appear with an **"in library"** badge and are unchecked by default. You can still opt to re-download (e.g. to refresh content).
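
One way to implement this check is to canonicalize each result URL and look it up in the set of library keys. The sketch below assumes a simple canonical form (lowercased host, no fragment, no trailing slash); the extension's actual canonicalization rules are not documented here, and the type and function names are hypothetical.

```typescript
interface ScanResult {
  url: string;
  title?: string;
}

interface MarkedResult extends ScanResult {
  inLibrary: boolean; // drives the "in library" badge
  checked: boolean;   // default checkbox state in the results list
}

/** Assumed canonical form: normalized scheme and host, no fragment, no trailing slash. */
function canonicalize(url: string): string {
  const u = new URL(url);
  u.hash = "";
  const s = u.toString();
  return s.endsWith("/") ? s.slice(0, -1) : s;
}

/** Marks results already present in the library and unchecks them by default. */
function markResults(results: ScanResult[], libraryKeys: ReadonlySet<string>): MarkedResult[] {
  return results.map((r) => {
    const inLibrary = libraryKeys.has(canonicalize(r.url));
    return { ...r, inLibrary, checked: !inLibrary };
  });
}
```

Because the flag only changes the default checkbox state, the user can still tick an "in library" row to re-download it.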

### Resumable downloads

The download queue is persisted to `chrome.storage.local` after every few settled items. If the service worker is evicted mid-batch (Chrome does this aggressively under memory pressure), opening the side panel surfaces a *"Resume previous queue?"* bar showing pending count and start age. Click **Resume** and only the un-finished items are processed.
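
The bookkeeping behind this can be sketched with two small pure functions. The snapshot shape, the names, and the choice of three for "every few" are all assumptions; in the extension the snapshot would be written with `chrome.storage.local.set` and read back when the side panel opens.

```typescript
type ItemState = "pending" | "done" | "failed";

interface QueueItem {
  url: string;
  state: ItemState;
}

interface QueueSnapshot {
  startedAt: number; // epoch milliseconds when the batch started
  items: QueueItem[];
}

// Assumption: the README only says "every few settled items".
const PERSIST_EVERY = 3;

/** True when the snapshot should be written after `settledCount` items have settled. */
function shouldPersist(settledCount: number): boolean {
  return settledCount > 0 && settledCount % PERSIST_EVERY === 0;
}

/** What the "Resume previous queue?" bar needs: unfinished items and batch age. */
function resumeInfo(snapshot: QueueSnapshot, now: number): { pending: QueueItem[]; ageMs: number } {
  return {
    pending: snapshot.items.filter((item) => item.state === "pending"),
    ageMs: now - snapshot.startedAt,
  };
}
```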

### Pre-flight HEAD check

Optional setting (default off). Before each download, sends a HEAD request with a 3-second timeout. URLs that return 404/410 are marked `skipped` and don't consume a download slot. Most useful for sitemap mode where stale URLs are common; adds ~200ms per file otherwise.
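
A pre-flight check along these lines can be written with the standard `fetch` API and `AbortSignal.timeout`. This is a sketch under assumptions: the function names are hypothetical, and treating a timeout or network error as "go ahead and try the download" is a guess at reasonable behavior, not something the README specifies.

```typescript
type PreflightResult = "ok" | "skipped";

/** Only 404 and 410 cause a skip; every other status lets the download proceed. */
function classifyStatus(status: number): PreflightResult {
  return status === 404 || status === 410 ? "skipped" : "ok";
}

/** Sends a HEAD request that is aborted after `timeoutMs` (3 seconds by default). */
async function preflight(url: string, timeoutMs = 3000): Promise<PreflightResult> {
  try {
    const res = await fetch(url, { method: "HEAD", signal: AbortSignal.timeout(timeoutMs) });
    return classifyStatus(res.status);
  } catch {
    // Assumption: on timeout or network error, let the real download decide.
    return "ok";
  }
}
```

Some servers reject HEAD with 405 while serving GET normally, which is one reason to skip only on 404/410 rather than on any non-2xx status.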

### Recurring scheduled re-scans

Click the **⌚** icon next to a saved search to set an interval in days. The extension wakes up via `chrome.alarms`, re-runs the scan headlessly (no UI required), downloads only files that aren't already in the library, and optionally fires a desktop notification on new files. The alarm is owned by the browser, so it survives service-worker eviction.
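
`chrome.alarms.create` takes its period in minutes, so an interval in days has to be converted. A minimal sketch of that mapping follows; the `rescan:` name prefix and both function names are hypothetical, and the `chrome.*` calls are shown only as comments because they run solely inside an extension.

```typescript
interface AlarmSpec {
  name: string;
  periodInMinutes: number;
}

// Hypothetical prefix used to tell re-scan alarms apart from any others.
const ALARM_PREFIX = "rescan:";

/** Maps a saved-search id and an interval in days to alarm parameters. */
function alarmFor(searchId: string, intervalDays: number): AlarmSpec {
  if (!Number.isFinite(intervalDays) || intervalDays <= 0) {
    throw new RangeError("intervalDays must be a positive number");
  }
  return { name: ALARM_PREFIX + searchId, periodInMinutes: intervalDays * 24 * 60 };
}

/** Recovers the saved-search id from an alarm name, or null for unrelated alarms. */
function searchIdFromAlarm(name: string): string | null {
  return name.startsWith(ALARM_PREFIX) ? name.slice(ALARM_PREFIX.length) : null;
}

// In the service worker (not executed in this sketch):
//   const spec = alarmFor(search.id, 7);
//   chrome.alarms.create(spec.name, { periodInMinutes: spec.periodInMinutes });
//   chrome.alarms.onAlarm.addListener((alarm) => {
//     const id = searchIdFromAlarm(alarm.name);
//     if (id !== null) { /* re-run the saved search with this id */ }
//   });
```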

### JSON / CSV export

The library HTML has **↓ JSON** and **↓ CSV** buttons that download whatever is currently filtered/sorted on screen. CSV uses RFC 4180 quoting + UTF-8 BOM (Excel-friendly); JSON is pretty-printed.
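
RFC 4180 quoting plus a UTF-8 BOM can be sketched in a few lines. The function names are illustrative and the column set in the usage note is made up; the quoting rules themselves (wrap fields containing commas, quotes, or line breaks; double any embedded quotes; CRLF line endings) follow the RFC.

```typescript
// U+FEFF at the start of the file lets Excel detect UTF-8.
const BOM = "\uFEFF";

/** Quotes a field only when it contains a comma, a double quote, CR, or LF. */
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

/** Serializes a header and rows to CSV with CRLF line endings and a leading BOM. */
function toCsv(header: string[], rows: string[][]): string {
  const lines = [header, ...rows].map((row) => row.map(csvField).join(","));
  return BOM + lines.join("\r\n") + "\r\n";
}
```

Calling `toCsv(["title", "url"], [["Budget, 2024", "https://example.gov/b.pdf"]])` quotes only the first data field, since it is the only one containing a comma.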

### Saving without per-file prompts

The extension calls `chrome.downloads.download({ saveAs: false })` so files save automatically. Default destination is `Downloads/MassDownload/{host}/`. Use the **Pick…** button to choose a different subfolder once via a system dialog — it sticks for all subsequent downloads.
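
The options object passed to `chrome.downloads.download` might be built as in this sketch. `downloadRequest` is a hypothetical helper; `url`, `filename` (a path relative to the browser's Downloads directory), and `saveAs` are real fields of the API's options, and the `{host}` expansion mirrors the default subfolder described above.

```typescript
interface DownloadRequest {
  url: string;
  filename: string; // relative to the browser's Downloads directory
  saveAs: boolean;
}

/** Expands the `{host}` placeholder and builds the download options. */
function downloadRequest(
  url: string,
  filename: string,
  subfolder = "MassDownload/{host}"
): DownloadRequest {
  const host = new URL(url).hostname;
  const folder = subfolder.replace(/\{host\}/g, host);
  return { url, filename: `${folder}/${filename}`, saveAs: false };
}

// In the service worker (not executed in this sketch):
//   chrome.downloads.download(downloadRequest(link.url, "foo.pdf"));
```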

If your browser still asks *"What do you want to do with X.pdf?"* for each file, the global setting needs disabling:

| Browser | Path | Setting |
|---|---|---|
| **Microsoft Edge** | `edge://settings/downloads` | *Ask me what to do with each download* |
| **Google Chrome** | `chrome://settings/downloads` | *Ask where to save each file before downloading* |
| **Brave / Vivaldi** | `brave://settings/downloads` / `vivaldi://settings/downloads` | same as Chrome |

The side panel has an **Open browser download settings** link that takes you straight there.

---

## Comparison

How does MassDownload stack up against alternatives?

| Tool | SERP autopagination | Anti-CAPTCHA strategy | Sitemap fallback | Library + search | Browser session | Free |
|---|:-:|:-:|:-:|:-:|:-:|:-:|
| **MassDownload** | ✅ | ✅ Real tab + delays | ✅ | ✅ Per-host HTML | ✅ | ✅ |
| **DownThemAll!** | ❌ Per-page only | n/a | ❌ | ❌ | ✅ | ✅ |
| **Simple Mass Downloader** | ❌ URL templates only | n/a | ❌ | ❌ | ✅ | ✅ |
| **Web Scraper.io** | ⚠️ DIY config | ❌ | ⚠️ DIY | ❌ | ✅ | Free tier |
| **wget --recursive** | ❌ | n/a | ⚠️ Manual | ❌ | ❌ | ✅ |
| **SerpAPI + script** | ✅ via API | ✅ | ❌ | DIY | ❌ | ❌ ($50+/mo) |

**Pick MassDownload if** you want a single-click flow from "search a domain" to "files on disk + searchable library", without paying for a SaaS or scripting bash.

**Pick DownThemAll** if you have a single page already open and just want to bulk-pull all links from it.

**Pick wget** for a full recursive site mirror in CLI.

**Pick SerpAPI** if you need this at industrial scale and don't mind paying.

---

## Use cases

### 📚 Researchers / journalists
Collect every public PDF from a government or institutional site for analysis. Sitemap mode catches documents the search engines don't surface.

### ⚖️ Legal / paralegal
Scrape court decisions, executor notices, notary records from public registries. Per-host library with description snippets makes finding a specific case fast.

### 🔍 OSINT
Quick intel sweep on a domain — what documents has this site published? Sitemap fallback works on small sites with poor Google coverage.

### 👨‍💻 Developers
Open-source MV3 reference: side panel + offscreen DOMParser + service-worker download queue + tab-based stealth scraping. Fork and extend.

---

## Project layout

```
src/
├── background.ts # service worker — scan dispatcher + download queue
├── offscreen.html / .ts # DOMParser host (MV3 service workers can't use DOMParser directly)
├── sidepanel/
│ ├── sidepanel.html # Quick Search form, saved list, results, progress
│ ├── sidepanel.ts # UI state + long-lived port to background
│ ├── sidepanel.css
│ └── folderPicker.ts # one-shot Save-As dialog → relative subfolder
├── parsers/
│ ├── google.ts # SERP URL helpers (extraction runs in-tab)
│ ├── bing.ts # SERP parsing + /ck/a?u= base64 redirect unwrap
│ ├── sitemap.ts # urlset / sitemapindex extraction + robots.txt
│ ├── queryBuilder.ts # site/filetype/keywords → search URL
│ └── filters.ts # extension match, canonicalize/dedup, smart filename
├── library/
│ ├── manifest.ts # CRUD on chrome.storage.local['library']
│ └── htmlGenerator.ts # standalone HTML view (search/sort/filter)
├── downloader.ts # parallel queue + cancel + one-shot retry
├── messages.ts # typed messages (sidepanel ↔ background ↔ offscreen)
├── storage.ts # settings + saved searches
└── types.ts # LinkInfo, LibraryEntry, SearchQuery, Settings
```

## Settings

| Field | Default | Range / notes |
|---|---|---|
| Filetype(s) | `pdf` | Comma-separated. Used as both query filter (`filetype:pdf`) and post-scan extension filter |
| Source | Google | google / sitemap / bing |
| Keywords | empty | Free text appended to query |
| Exclude | empty | Comma-separated, each becomes `-term` in query |
| Parallel downloads | 5 | 1 – 20 |
| Max pages | 20 | 1 – 50 (also caps sitemap files visited) |
| Subfolder | `MassDownload/{host}` | Supports `{host}` placeholder |
| Saved searches | (managed via UI) | up to 30 stored |

All persist via `chrome.storage.local`.
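
The numeric ranges in the table suggest a normalization step when settings are loaded. The sketch below is one hedged way to do it: the `Settings` field names are guesses (the real type lives in `types.ts`), while the defaults and the 1–20 and 1–50 ranges are taken from the table.

```typescript
interface Settings {
  filetypes: string[];
  source: "google" | "sitemap" | "bing";
  parallelDownloads: number; // 1 to 20
  maxPages: number;          // 1 to 50
  subfolder: string;
}

const DEFAULTS: Settings = {
  filetypes: ["pdf"],
  source: "google",
  parallelDownloads: 5,
  maxPages: 20,
  subfolder: "MassDownload/{host}",
};

/** Rounds and clamps a number into [min, max]; non-finite input yields the fallback. */
function clamp(value: number, min: number, max: number, fallback: number): number {
  if (!Number.isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, Math.round(value)));
}

/** Fills in defaults and forces numeric fields into their documented ranges. */
function normalizeSettings(stored: Partial<Settings>): Settings {
  const merged = { ...DEFAULTS, ...stored };
  return {
    ...merged,
    parallelDownloads: clamp(merged.parallelDownloads, 1, 20, DEFAULTS.parallelDownloads),
    maxPages: clamp(merged.maxPages, 1, 50, DEFAULTS.maxPages),
  };
}
```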

---

## Limitations

- **Google CAPTCHA on heavy use**: real-tab navigation reduces but doesn't eliminate it. When you hit it, the tab is brought to the foreground for you to solve, then the next scan reuses that tab with a clean session.
- **Sitemaps may lie**: some sites list URLs that 404. Failed downloads are reported in the log; the queue continues.
- **Files behind login**: downloads follow your existing browser cookies — works only if you're already logged in to that site.
- **JavaScript-only sites**: if a page renders link lists purely in JS, sitemap mode is the only reliable option.
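- **Resume after crash**: not automatic. If the service worker is killed mid-queue, in-flight downloads continue (Chrome owns them), but the remaining items are processed only after you reopen the side panel and click **Resume** (see [Resumable downloads](#resumable-downloads)).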

## Permissions

| Permission | Why it's needed |
|---|---|
| `sidePanel` | Side panel UI |
| `downloads` | Save files, regenerate library.html |
| `activeTab` + `scripting` | Read links from the current tab; in-page Google SERP extraction |
| `storage` | Persist settings, saved searches, library manifest, resumable queue |
| `offscreen` | Run `DOMParser` on Bing HTML / sitemap XML / crawled pages |
| `tabs` | Read active tab URL, manage scan tab |
| `alarms` | Wake up periodically to re-run scheduled saved searches |
| `notifications` | Show a desktop alert when a scheduled run downloads new files |
| `<all_urls>` host | Fetch Google in any locale, Bing, any site's robots.txt and sitemap |

**No telemetry. No remote config. No data leaves your browser** except the HTTP requests required to fetch search pages and download the files you select.

---

## FAQ

**Q: Does this work on Firefox?**
A: As of v0.3.0 there's an **experimental** Firefox build (`npm run build:firefox`). It uses `sidebar_action` instead of `chrome.sidePanel`, parses HTML inline (no offscreen document), and sets a `browser_specific_settings.gecko` block. It compiles and loads via `about:debugging`, but hasn't been daily-driven yet. Bug reports welcome.

**Q: Will this get me in trouble with Google?**
A: It uses your real browser session for SERP scraping (no API key, no rotation). At low volume (a few sites a day) you'll rarely hit CAPTCHA. At high volume Google may temporarily challenge you — solve it in the tab the extension surfaces, then continue.

**Q: Does it scrape sites behind login?**
A: It downloads via your existing browser cookies, so if you're logged in, yes. The extension itself never asks for or stores credentials.

**Q: Where does the library data live?**
A: `chrome.storage.local['library']` — a flat map keyed by canonical URL. The HTML view at `Downloads/MassDownload/{host}/library.html` is regenerated from this on each download batch. Wipe by uninstalling the extension or via the storage inspector in DevTools.

**Q: Can I edit titles or add tags from the library HTML?**
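A: Not from the on-disk HTML, which is read-only. Edit titles, tags, and notes on the in-extension **Library** page; the on-disk file is regenerated on save. Editing directly in the HTML, with sync back to the extension, is on the roadmap.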

---

## Contributing

Issues and PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md).

For a feature idea or bug, [open an issue](https://github.com/bogdanpricop/MassDownload/issues/new/choose).

## License

[MIT](LICENSE) © 2026 Bogdan Pricop