https://github.com/nightmachinery/get_the_nini
Ninisite Scraper: Fetches all pages of a Ninisite discussion and formats in org-mode, Markdown, or JSON
- Host: GitHub
- URL: https://github.com/nightmachinery/get_the_nini
- Owner: NightMachinery
- Created: 2025-08-20T04:15:58.000Z
- Default Branch: master
- Last Pushed: 2025-08-21T05:11:03.000Z
- Last Synced: 2025-10-07T11:15:09.019Z
- Language: HTML
- Size: 416 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.org
#+TITLE: get-the-nini: Ninisite Post Scraper
A command-line tool for scraping discussion threads from the Ninisite website. It can take a topic ID or a full URL and save the entire conversation into a single, well-structured file.
* *Code*: [[file:get_the_nini/main.py]]
* *Purpose*: This tool is designed to archive and analyze discussion threads from ninisite.com, converting them into portable and easy-to-read formats.
* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: *Org-mode*, *Markdown*, and *JSON*.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
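To illustrate the automatic pagination listed above, here is a minimal sketch. It is not the tool's actual implementation. ~fetch_page~ is a hypothetical callable standing in for the real network and parsing code, and stopping at the first empty page is an assumption about how the end of a topic could be detected.

#+begin_src python
from typing import Callable, Iterator, List


def iter_all_posts(
    fetch_page: Callable[[int], List[str]],
    max_pages: int = 1000,
) -> Iterator[str]:
    """Yield posts page by page until a page comes back empty.

    max_pages is a safety limit so a misbehaving fetcher cannot loop forever.
    """
    for page_number in range(1, max_pages + 1):
        posts = fetch_page(page_number)
        if not posts:
            return  # assumed end-of-topic signal
        yield from posts


# Hypothetical stand-in for the real fetcher: two pages of fake posts.
FAKE_TOPIC = {1: ["post 1", "post 2"], 2: ["post 3"]}


def fake_fetch(page_number: int) -> List[str]:
    return FAKE_TOPIC.get(page_number, [])


all_posts = list(iter_all_posts(fake_fetch))
print(all_posts)
#+end_src

Because ~iter_all_posts~ is a generator, a caller can process each post as soon as its page arrives instead of waiting for the whole topic.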
* *Installation*
This tool can be installed from PyPI using pip.
*Prerequisites*
1. *Python 3*: Ensure you have Python 3 installed.
2. *Pandoc*: The ~pypandoc~ library is used to convert HTML to other formats, so Pandoc must be installed and available on your system's PATH. See the [[https://pandoc.org/installing.html][Pandoc installation instructions]].
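One way to confirm the Pandoc prerequisite before running the scraper is a quick PATH lookup. The sketch below uses only the standard library's ~shutil.which~ and does not go through ~pypandoc~. The helper name ~find_pandoc~ is made up for this example.

#+begin_src python
import shutil
from typing import Optional


def find_pandoc() -> Optional[str]:
    """Return the full path of the pandoc executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


pandoc_path = find_pandoc()
if pandoc_path is None:
    print("pandoc not found on PATH; see https://pandoc.org/installing.html")
else:
    print(f"pandoc found at {pandoc_path}")
#+end_src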
*Install with pip*
To install the package, run the following command in your terminal:
#+begin_src sh
pip install get-the-nini
#+end_src
Or install the latest version from git:
#+begin_src sh :eval never
pip install 'git+https://github.com/NightMachinery/get_the_nini.git'
#+end_src
* *Usage*
Once installed, run the script from the command line with a topic ID or a full URL.
*Syntax*
#+begin_src sh
get-the-nini TOPIC_ID_OR_URL [OPTIONS]
#+end_src
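Since the command accepts either a bare topic ID or a full URL, the input has to be normalized to an ID at some point. The following is a sketch of one way to do that, not the code in ~main.py~. The function name ~parse_topic_id~ is hypothetical, and the URL pattern is assumed from the ~/discussion/topic/<id>/~ form used in this README's examples.

#+begin_src python
import re

# URL shape assumed from the README's example URL; other shapes are not handled.
_TOPIC_URL_RE = re.compile(r"/discussion/topic/([0-9]+)")


def parse_topic_id(topic: str) -> str:
    """Return the numeric topic ID from a bare ID or a topic URL."""
    topic = topic.strip()
    if re.fullmatch(r"[0-9]+", topic):
        return topic  # already a bare ID
    match = _TOPIC_URL_RE.search(topic)
    if match is None:
        raise ValueError(f"cannot find a topic ID in {topic!r}")
    return match.group(1)


print(parse_topic_id("11473285"))
print(parse_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))
#+end_src

Both calls print ~11473285~. Anything that is neither a run of digits nor a matching URL raises ~ValueError~ rather than being guessed at.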
*Examples*
1. *Scrape by Topic ID (default Org-mode output)*
This command scrapes the discussion for topic ID ~11473285~ and saves it to an automatically generated file named ~ninisite_11473285.org~.
#+begin_src sh
get-the-nini 11473285
#+end_src
2. *Scrape using a full URL*
#+begin_src sh
get-the-nini "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src
3. *Specify an output file and format (Markdown)*
The format can be inferred from the file extension or specified explicitly with ~--format~.
#+begin_src sh
get-the-nini 11473285 -o output.md
#+end_src
4. *Output as JSON to stdout*
Use ~-o -~ to write to standard output, which can then be redirected to a file.
#+begin_src sh
get-the-nini 11473285 --format json -o - > ninisite_11473285.json
#+end_src
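The examples above suggest a simple precedence for picking the output format: an explicit ~--format~ wins, then the output file's extension, then the Org-mode default. The sketch below models that rule under stated assumptions. The function names and the internal labels ~"org"~ and ~"markdown"~ are hypothetical (only ~json~ appears as a ~--format~ value in this README), and the ~ninisite_<id>~ file-name pattern is inferred from the default file name and the example files.

#+begin_src python
from pathlib import Path
from typing import Optional

# Assumed mapping, built from the three extensions this README mentions.
EXTENSION_TO_FORMAT = {".org": "org", ".md": "markdown", ".json": "json"}
FORMAT_TO_EXTENSION = {fmt: ext for ext, fmt in EXTENSION_TO_FORMAT.items()}


def resolve_format(output: Optional[str], explicit: Optional[str] = None) -> str:
    """Pick a format: explicit choice first, then file extension, then Org-mode."""
    if explicit is not None:
        return explicit
    if output is not None and output != "-":  # "-" means stdout, no extension
        suffix = Path(output).suffix.lower()
        if suffix in EXTENSION_TO_FORMAT:
            return EXTENSION_TO_FORMAT[suffix]
    return "org"


def default_output_path(topic_id: str, fmt: str) -> str:
    """Build the automatically generated file name for a topic."""
    return f"ninisite_{topic_id}{FORMAT_TO_EXTENSION[fmt]}"


print(resolve_format("output.md"))
print(resolve_format("-", explicit="json"))
print(default_output_path("11473285", resolve_format(None)))
#+end_src

Treating ~-~ as "no extension available" is what makes an explicit format necessary when writing JSON or Markdown to standard output.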
* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.
*Org-mode (.org)*
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]
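Streaming output means each page is written as soon as it has been scraped, rather than after the whole topic has been collected. The sketch below shows the general idea only. ~stream_org~ is a hypothetical name, and the ~* Post N~ heading layout is an assumption, not the structure of the example file linked above.

#+begin_src python
import io
from typing import Iterable, List, TextIO


def stream_org(pages: Iterable[List[str]], out: TextIO) -> int:
    """Write each page's posts as soon as the page is available.

    Returns the number of posts written.
    """
    count = 0
    for posts in pages:
        for post in posts:
            count += 1
            out.write(f"* Post {count}\n{post}\n")
        out.flush()  # make the finished page visible to anyone watching the file
    return count


# In-memory demonstration with two fake pages.
buffer = io.StringIO()
written = stream_org([["first"], ["second", "third"]], buffer)
print(written)
print(buffer.getvalue(), end="")
#+end_src

Passing a generator as ~pages~ keeps memory use proportional to one page, which is why this style suits large topics.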
*Markdown (.md)*
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]
*JSON (.json)*
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
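A first step in programmatic analysis is to look at what the JSON file contains. This README does not document the schema, so the sketch below assumes no field names and only reports the top-level structure. ~describe_json~ is a hypothetical helper written for this example.

#+begin_src python
import json
from typing import Any, List


def describe_json(path: str) -> List[str]:
    """Return one line per top-level entry: key name and Python type of its value."""
    with open(path, encoding="utf-8") as handle:
        data: Any = json.load(handle)
    if isinstance(data, dict):
        return [f"{key}: {type(value).__name__}" for key, value in data.items()]
    if isinstance(data, list):
        return [f"(top level) list: {len(data)} items"]
    return [f"(top level) {type(data).__name__}"]
#+end_src

From a checkout of the repository, ~describe_json("examples/ninisite_11473285.json")~ would list the keys actually present in the example file, which is a safer starting point than guessing field names.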