{"id":13567026,"url":"https://github.com/medialab/minet","last_synced_at":"2025-04-13T07:50:11.365Z","repository":{"id":34242840,"uuid":"169059797","full_name":"medialab/minet","owner":"medialab","description":"A webmining CLI tool \u0026 library for python.","archived":false,"fork":false,"pushed_at":"2025-04-09T12:31:57.000Z","size":16470,"stargazers_count":317,"open_issues_count":99,"forks_count":27,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-04-13T07:49:57.073Z","etag":null,"topics":["cli","python","webmining"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/medialab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-02-04T10:11:57.000Z","updated_at":"2025-04-09T10:30:46.000Z","dependencies_parsed_at":"2023-10-30T10:36:16.849Z","dependency_job_id":"203a0696-168b-4060-9015-b35732786220","html_url":"https://github.com/medialab/minet","commit_stats":{"total_commits":2780,"total_committers":24,"mean_commits":"115.83333333333333","dds":0.07553956834532372,"last_synced_commit":"38188054ed0c6be4f2b73e8b52fd6e272c204e66"},"previous_names":[],"tags_count":197,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/medialab%2Fminet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/medialab%2Fminet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/medialab%2Fminet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m
edialab%2Fminet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/medialab","download_url":"https://codeload.github.com/medialab/minet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248681494,"owners_count":21144700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","python","webmining"],"created_at":"2024-08-01T13:02:21.899Z","updated_at":"2025-04-13T07:50:11.343Z","avatar_url":"https://github.com/medialab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![Build Status](https://github.com/medialab/minet/workflows/Tests/badge.svg)](https://github.com/medialab/minet/actions) [![DOI](https://zenodo.org/badge/169059797.svg)](https://zenodo.org/badge/latestdoi/169059797) [![download number](https://static.pepy.tech/badge/minet)](https://pepy.tech/project/minet)\n\n![Minet](docs/img/minet.png)\n\n**minet** is a webmining command line tool \u0026 library for python (\u003e= 3.8) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, YouTube, Twitter, Media Cloud etc.\n\nIt adopts a very simple approach to various webmining problems by letting you perform a wide array of tasks from the comfort of the command line. 
No database needed: raw CSV files should be sufficient to do most of the work.\n\nIn addition, **minet** also exposes its high-level programmatic interface as a python library so you remain free to use its utilities to suit your use-cases better.\n\n**minet** is developed by [médialab SciencesPo](https://github.com/medialab/) research engineers and is the consolidation of more than a decade of webmining practices targeted at social sciences.\n\nAs such, it has been designed to be:\n\n1. **low-tech**, as it requires minimal resources such as memory, CPUs or hard drive space and should be able to work on any low-cost PC.\n2. **fault-tolerant**, as it is able to recover when the network is bad and retry HTTP calls when suitable. What's more, most minet commands can be resumed if aborted and are designed to run for a long time (think days or months) without leaking memory.\n3. **unix-compliant**, as it can be piped easily and knows how to work with the usual streams.\n\n**Shortcuts**: [Command line documentation](./docs/cli.md), [Python library documentation](./docs/lib.md).\n\n![fetch](./docs/img/fetch.gif)\n\n_How to cite?_\n\n**minet** is published on [Zenodo](https://zenodo.org/) as [10.5281/zenodo.4564399](http://doi.org/10.5281/zenodo.4564399).\n\nYou can cite it thusly:\n\n\u003e Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, Amélie Pellé, Laura Miguel, César Pichon, \u0026 Kelly Christensen. (2019, October 14). Minet, a webmining CLI tool \u0026 library for python. Zenodo. 
http://doi.org/10.5281/zenodo.4564399\n\n## Whirlwind tour\n\n```bash\n# Downloading a large number of urls as fast as possible\nminet fetch url -i urls.csv \u003e report.csv\n\n# Extracting raw text from the downloaded HTML files\nminet extract -i report.csv -I downloaded \u003e extracted.csv\n\n# Scraping the urls found in the downloaded HTML files\nminet scrape urls -i report.csv -I downloaded \u003e scraped_urls.csv\n\n# Parsing \u0026 normalizing the scraped urls\nminet url-parse scraped_url -i scraped_urls.csv \u003e parsed_urls.csv\n\n# Scraping data from Twitter\nminet twitter scrape tweets \"from:medialab_ScPo\" \u003e tweets.csv\n\n# Printing a command's help\nminet twitter scrape -h\n\n# Searching videos on YouTube\nminet youtube search -k \"MY-YT-API-KEY\" \"médialab\" \u003e videos.csv\n```\n\n## Summary\n\n- [What it does](#what-it-does)\n- [Documented use cases](#documented-use-cases)\n- [Features (from a technical standpoint)](#features-from-a-technical-standpoint)\n- [Installation](#installation)\n- [Upgrading](#upgrading)\n- [Uninstallation](#uninstallation)\n- [Documentation](#documentation)\n- [Contributing](#contributing)\n\n## What it does\n\nMinet can single-handedly:\n\n- Extract URLs from a text file (or a table)\n- Parse URLs (get useful information, with Facebook- and YouTube-specific stuff)\n- Join two CSV files by matching the columns containing URLs\n- From a list of URLs, resolve their redirections\n  - ...and check their HTTP status\n  - ...and download the HTML\n  - ...and extract hyperlinks\n  - ...and extract the text content and other metadata (title...)\n  - ...and scrape structured data (using a declarative language to define your heuristics)\n- Crawl (using a declarative language to define a browsing behavior, and what to harvest)\n- Mine or search:\n  - _[Bluesky](https://bsky.app/)_ (requires a free user account)\n  - _[Mediacloud](https://mediacloud.org/)_ (requires free API access)\n  - _[Twitter](https://twitter.com)_ 
(requires free API access)\n  - _[Wikipedia](https://www.wikipedia.org)_\n  - _[YouTube](https://www.youtube.com/)_ (requires free API access)\n- Scrape (without requiring special access, often just a user account):\n  - _[Instagram](https://www.instagram.com/)_\n  - _[Reddit](https://www.reddit.com/)_\n  - _[Telegram](https://telegram.org/)_\n  - _[TikTok](https://www.tiktok.com)_\n  - _[Twitter](https://twitter.com)_\n  - _[Google Drive](https://drive.google.com)_ (spreadsheets etc.)\n- Grab \u0026 dump cookies from your browser\n- Dump _[Hyphe](https://hyphe.medialab.sciences-po.fr/)_ data\n\n## Documented use cases\n\n- [Fetching a large amount of urls](./docs/cookbook/fetch.md)\n- [Joining 2 CSV files by urls](./docs/cookbook/url_join.md)\n- [Using minet from a Jupyter notebook](./docs/cookbook/notebooks/Minet%20in%20a%20Jupyter%20notebook.ipynb) (_very useful to experiment with the tool or teach students_)\n- [Downloading images associated with a given hashtag on Twitter](./docs/cookbook/twitter_images.md)\n- [Scraping DSL Tutorial](./docs/cookbook/scraping_dsl.md)\n\n## Features (from a technical standpoint)\n\n- Multithreaded, memory-efficient fetching from the web.\n- Multithreaded, scalable crawling.\n- Multiprocessed raw text content extraction from HTML pages.\n- Multiprocessed scraping from HTML pages.\n- URL-related heuristics utilities such as extraction, normalization and matching.\n- Data collection from various APIs such as [YouTube](https://www.youtube.com/).\n\n## Installation\n\n**minet** can be installed as a standalone CLI tool (currently only on mac \u003e= 10.14, ubuntu \u0026 similar) by running the following command in your terminal:\n\n```shell\ncurl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash\n```\n\nDon't trust us enough to pipe the result of an HTTP request into `bash`? 
We wouldn't either, so feel free to read the installation script [here](./scripts/install.sh) and run it on your end if you prefer.\n\nOn ubuntu \u0026 similar you might need to install `curl` and `unzip` before running the installation script if you don't already have them:\n\n```shell\nsudo apt-get install curl unzip\n```\n\nOtherwise, **minet** can be installed directly as a python CLI tool and library using pip:\n\n```shell\npip install minet\n```\n\nFinally, if you want to install the standalone binaries by yourself (even for windows) you can find them in each release [here](https://github.com/medialab/minet/releases).\n\n## Upgrading\n\nTo upgrade the standalone version, simply run the install script once again:\n\n```shell\ncurl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash\n```\n\nTo upgrade the python version you can use pip thusly:\n\n```shell\npip install -U minet\n```\n\n## Uninstallation\n\nTo uninstall the standalone version:\n\n```shell\ncurl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash\n```\n\nTo uninstall the python version:\n\n```shell\npip uninstall minet\n```\n\n## Documentation\n\n- [minet as a command line tool](./docs/cli.md)\n- [minet as a python library](./docs/lib.md)\n\n## Contributing\n\nTo contribute to **minet** you can check out [this](./CONTRIBUTING.md) documentation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmedialab%2Fminet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmedialab%2Fminet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmedialab%2Fminet/lists"}