https://github.com/nightmachinery/get_the_nini

Ninisite Scraper: Fetches all pages of a Ninisite discussion and formats in org-mode, Markdown, or JSON

#+TITLE: get-the-nini: Ninisite Post Scraper

A command-line tool for scraping discussion threads from the Ninisite website. It accepts a topic ID or a full URL and saves the entire conversation to a single, well-structured file.

* *Code*: [[file:get_the_nini/main.py]]

* *Purpose*: This tool archives discussion threads from ninisite.com by converting them into portable, easy-to-read formats that are also convenient for later analysis.

* *Features*
- Scrape entire discussion threads by Topic ID or URL.
- Automatically handles pagination.
- Outputs in multiple formats: **Org-mode**, **Markdown**, and **JSON**.
- Extracts rich metadata including topic title, author, categories, views, and post dates.
- Preserves the structure of posts, including replies and quoted content.
- Streaming output for Org-mode, ideal for large topics or viewing progress live.
- Progress bar during page fetching.
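
As a rough illustration of how pagination and streaming output can fit together, here is a minimal sketch. It is not taken from `get_the_nini/main.py`: the names `iter_posts`, `stream_posts`, and `fetch_page` are hypothetical, and the rule "stop at the first empty page" is an assumption made for the example.
#+begin_src python
import io
from typing import Callable, Iterable, Iterator, List, TextIO


def iter_posts(fetch_page: Callable[[int], List[str]], max_pages: int = 1000) -> Iterator[str]:
    """Yield posts page by page, stopping at the first empty page."""
    for page_number in range(1, max_pages + 1):
        posts = fetch_page(page_number)  # hypothetical fetcher: page number -> list of posts
        if not posts:
            break  # assumption: an empty page marks the end of the topic
        yield from posts


def stream_posts(posts: Iterable[str], out: TextIO) -> int:
    """Write each post as soon as it is available; return how many were written."""
    count = 0
    for post in posts:
        out.write(post + "\n")
        out.flush()  # flush so progress is visible while later pages are still loading
        count += 1
    return count


# Stand-in for network access: two pages of placeholder posts.
fake_pages = {1: ["first post", "second post"], 2: ["third post"]}
buffer = io.StringIO()
written = stream_posts(iter_posts(lambda n: fake_pages.get(n, [])), buffer)
print(written)  # 3
#+end_src
Because `iter_posts` is a generator, nothing beyond the current page has to be held in memory, which is the property that makes streaming attractive for large topics.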

* *Installation*
This tool can be installed from PyPI using pip.

**Prerequisites**
1. **Python 3**: Ensure you have Python 3 installed.
2. **Pandoc**: The `pypandoc` library is used to convert HTML to the other formats, so Pandoc must be installed and available on your system's PATH. See the [[https://pandoc.org/installing.html][Pandoc installation instructions]].
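
To confirm the Pandoc prerequisite before installing, a small check along these lines may help. It only uses the standard library; the helper name `pandoc_available` is made up for this example and is not part of the package.
#+begin_src python
import shutil


def pandoc_available(executable: str = "pandoc") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if pandoc_available():
    print("pandoc found at", shutil.which("pandoc"))
else:
    print("pandoc not found on PATH; install it before running get-the-nini")
#+end_src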

**Install with pip**
To install the package, run the following command in your terminal:
#+begin_src sh
pip install get-the-nini
#+end_src

Or install the latest version from git:
#+begin_src sh :eval never
pip install 'git+https://github.com/NightMachinery/get_the_nini.git'
#+end_src

* *Usage*
Once installed, the script can be run from the command line, providing a topic ID or a full URL.

**Syntax**
#+begin_src sh
get-the-nini [OPTIONS] <topic-id-or-url>
#+end_src

**Examples**

1. **Scrape by Topic ID (Default Org-mode output)**
This command will scrape the discussion for topic ID `11473285` and save it to an automatically generated file named `ninisite_11473285.org`.
#+begin_src sh
get-the-nini 11473285
#+end_src

2. **Scrape using a full URL**
#+begin_src sh
get-the-nini "https://www.ninisite.com/discussion/topic/11473285/"
#+end_src
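
Both invocations above refer to the same topic. The following sketch shows one way a topic ID could be recovered from either form. It is an illustration only: `extract_topic_id` is a hypothetical helper, and the URL pattern is inferred from the example URL rather than from the tool's source.
#+begin_src python
import re


def extract_topic_id(target: str) -> str:
    """Return the numeric topic ID from a bare ID or a topic URL."""
    target = target.strip()
    if target.isdigit():
        return target  # already a bare topic ID
    # Assumed URL shape, based on the example: .../discussion/topic/<digits>/
    match = re.search(r"/discussion/topic/(\d+)", target)
    if match is None:
        raise ValueError(f"cannot find a topic ID in {target!r}")
    return match.group(1)


print(extract_topic_id("https://www.ninisite.com/discussion/topic/11473285/"))  # 11473285
#+end_src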

3. **Specify an output file and format (Markdown)**
The format can be inferred from the file extension, or specified explicitly with `--format`.
#+begin_src sh
get-the-nini 11473285 -o output.md
#+end_src
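
The precedence described here (an explicit `--format` wins, otherwise the file extension decides, otherwise the Org-mode default applies) can be sketched as follows. The function and the format names `org` and `markdown` are assumptions for illustration; only `json` appears as a `--format` value in this README.
#+begin_src python
from pathlib import Path
from typing import Optional

# Assumed mapping; the real tool may use different format names.
EXTENSION_TO_FORMAT = {".org": "org", ".md": "markdown", ".json": "json"}


def infer_format(output_path: str, explicit_format: Optional[str] = None, default: str = "org") -> str:
    """Pick an output format: explicit choice first, then file extension, then the default."""
    if explicit_format is not None:
        return explicit_format
    suffix = Path(output_path).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix, default)


print(infer_format("output.md"))  # markdown
#+end_src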

4. **Output as JSON to stdout**
Use `-o -` to direct output to standard output, which can be redirected to a file.
#+begin_src sh
get-the-nini 11473285 --format json -o - > ninisite_11473285.json
#+end_src

* *Output Formats & Examples*
The scraper can produce output in three different formats. Below are links to examples generated from the same topic.

**Org-mode (.org)**
A highly structured and readable plain-text format, perfect for use in Emacs. This is the default format and supports streaming output directly to a file as pages are scraped.
- *Example*: [[file:examples/ninisite_11473285.org]]

**Markdown (.md)**
A popular lightweight markup language for easy conversion to HTML and other formats.
- *Example*: [[file:examples/ninisite_11473285.md]]

**JSON (.json)**
A structured data format that includes all metadata and post content, suitable for programmatic analysis or integration into other systems.
- *Example*: [[file:examples/ninisite_11473285.json]]
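
As a starting point for programmatic analysis, the sketch below inspects the top level of a JSON document without assuming anything about its schema, which is not documented in this README. `describe_json` is a hypothetical helper, and the sample strings are placeholders, not real scraper output.
#+begin_src python
import json


def describe_json(text: str) -> str:
    """Summarize the top-level structure of a JSON document."""
    data = json.loads(text)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


# Placeholder inputs; a real run would read the exported file instead, e.g.
# describe_json(open("ninisite_11473285.json", encoding="utf-8").read())
print(describe_json('{"b": 1, "a": 2}'))  # object with keys: a, b
print(describe_json("[1, 2, 3]"))         # array with 3 items
#+end_src
Listing the top-level keys first is a cheap way to learn the actual field names before writing any analysis code against them.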