https://github.com/teamnsrg/mida

MIDA: A Tool for Measuring the Internet

chrome chromedp crawling devtools golang web


Host: GitHub
URL: https://github.com/teamnsrg/mida
Owner: teamnsrg
License: mit
Created: 2018-12-18T16:20:04.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2023-03-07T02:37:22.000Z (about 2 years ago)
Last Synced: 2024-11-05T21:45:22.570Z (7 months ago)
Topics: chrome, chromedp, crawling, devtools, golang, web
Language: Go
Size: 562 KB
Stars: 18
Watchers: 9
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# MIDA: A Tool for Measuring the Web

[![Go](https://github.com/teamnsrg/mida/actions/workflows/go.yml/badge.svg)](https://github.com/teamnsrg/mida/actions/workflows/go.yml)
[![Go Report Card](https://goreportcard.com/badge/github.com/teamnsrg/mida)](https://goreportcard.com/report/github.com/teamnsrg/mida)

MIDA is meant to be a general tool for web measurement projects. It is built in Go
on top of Chrome/Chromium and the DevTools protocol, giving it a realistic vantage point
to study the web and fine-grained access to information provided by Chrome Developer Tools.

---

## Getting Started

Getting started with MIDA is easy! First, install:

```bash
$ wget files.mida.sprai.org/setup.py
$ sudo python3 setup.py
```

Now we are ready to visit a site and collect some data:
```bash
$ mida go example.org
```

You can find the results of your crawl in the `results/` directory.

## Easy At-Scale Crawling

One major benefit of MIDA is the ability to run large-scale, highly configurable crawls
without writing your own crawler code. Here is an example of a single MIDA command that
crawls the Alexa Top 100K and gathers a few specific types of data:

```bash
$ mida go -f https://files.mida.sprai.org/toplists/alexa.lst -n100000 -c8 --all-resources --screenshot --dom
```

Breaking this down by argument:

`-f https://files.mida.sprai.org/toplists/alexa.lst`: The list of sites to crawl, here a list of the Alexa Top Websites.
You can read the list from a local file or fetch one hosted on the web.

`-n100000`: Read the top 100,000 entries from the list.

`-c8`: Run with 8 parallel crawlers (browser instances).

`--all-resources`: Gather all of the actual files/resources required to render the web page.
Beware, this takes a lot of space!

`--screenshot`: Capture a screenshot of each website once its load event fires (no screenshot is taken if the event never fires).

`--dom`: Capture a JSON representation of the DOM for each website visited.
