https://github.com/vberlier/mcwiki

A scraping library for the Minecraft Wiki.
mcwiki minecraft minecraft-wiki scraping wiki

Host: GitHub
URL: https://github.com/vberlier/mcwiki
Owner: vberlier
License: mit
Created: 2020-09-19T17:15:24.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-09-26T12:13:14.000Z (over 1 year ago)
Last Synced: 2025-04-13T22:40:39.255Z (2 months ago)
Topics: mcwiki, minecraft, minecraft-wiki, scraping, wiki
Language: Python
Homepage:
Size: 140 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 9
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE

# mcwiki

[![GitHub Actions](https://github.com/vberlier/mcwiki/workflows/CI/badge.svg)](https://github.com/vberlier/mcwiki/actions)

[![PyPI](https://img.shields.io/pypi/v/mcwiki.svg)](https://pypi.org/project/mcwiki/)

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mcwiki.svg)](https://pypi.org/project/mcwiki/)

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)

> A scraping library for the [Minecraft Wiki](https://minecraft.fandom.com/wiki/Minecraft_Wiki).

```python
import mcwiki

page = mcwiki.load("Data Pack")
print(page["pack.mcmeta"].extract(mcwiki.TREE))
```

```
[TAG_Compound]
The root object.
└─ pack
   [TAG_Compound]
   Holds the data pack information.
   ├─ description
   │  [TAG_String, TAG_List, TAG_Compound]
   │  A JSON text that appears when hovering over the data pack's name in
   │  the list given by the /datapack list command, or when viewing the pack
   │  in the Create World screen.
   └─ pack_format
      [TAG_Int]
      Pack version: If this number does not match the current required
      number, the data pack displays a warning and requires additional
      confirmation to load the pack. Requires 4 for 1.13–1.14.4. Requires 5
      for 1.15–1.16.1. Requires 6 for 1.16.2–1.16.5. Requires 7 for 1.17.
```

## Introduction

The Minecraft Wiki is a well-maintained source of information but is a bit too organic to be used as anything more than a reference. This project tries its best to make it possible to locate and extract the information you're interested in and use it as a programmatic source of truth for developing Minecraft-related tooling.

### Features

- Easily navigate through page sections

- Extract paragraphs, code blocks and recursive tree-like hierarchies

- Create custom extractors or extend the provided ones

## Installation

The package can be installed with `pip`.

```bash
$ pip install mcwiki
```

## Getting Started

The `load` function allows you to load a page from the Minecraft Wiki. The page can be specified by providing a URL or simply the title of the page.

```python
mcwiki.load("https://minecraft.fandom.com/wiki/Data_Pack")
mcwiki.load("Data Pack")
```
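To illustrate why a title and a URL can point at the same page, here is a minimal sketch of turning a title into a wiki URL. The `page_url` helper and the `WIKI_BASE` constant are hypothetical names introduced for this example. This is not necessarily how `mcwiki.load` resolves titles internally.

```python
from urllib.parse import quote

# Base URL taken from the example above; the constant name is an assumption.
WIKI_BASE = "https://minecraft.fandom.com/wiki/"


def page_url(title_or_url: str) -> str:
    """Hypothetical helper: map a page title to a wiki URL, or pass a URL through."""
    if title_or_url.startswith(("http://", "https://")):
        return title_or_url
    # Wiki page names conventionally use underscores in place of spaces.
    name = title_or_url.strip().replace(" ", "_")
    return WIKI_BASE + quote(name)


print(page_url("Data Pack"))
```

Under these assumptions both forms resolve to the same address, which is why either one can be used to identify a page.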

You can use the `load_file` function to read from a page downloaded locally or the `from_markup` function if you already have the html loaded in a string.

```python
mcwiki.load_file("Data_Pack.html")

html = open("Data_Pack.html", encoding="utf-8").read()  # page markup as a string
mcwiki.from_markup(html)
```

## Extracting Data

There are 4 built-in extractors. Extractors are instantiated with a CSS selector and define a `process` method that produces an item for each element returned by the selector.

| Extractor    | Type                    | Extracted Item                                            |
| ------------ | ----------------------- | --------------------------------------------------------- |
| `PARAGRAPH`  | `TextExtractor("p")`    | String containing the text content of a paragraph         |
| `CODE`       | `TextExtractor("code")` | String containing the text content of a code span         |
| `CODE_BLOCK` | `TextExtractor("pre")`  | String containing the text content of a code block        |
| `TREE`       | `TreeExtractor()`       | An instance of `mcwiki.Tree` containing the treeview data |

Page sections can invoke extractors by using the `extract` and `extract_all` methods. The `extract` method will return the first item in the page section or `None` if the extractor couldn't extract anything.

```python
print(page.extract(mcwiki.PARAGRAPH))
```

```
Custom advancements in data packs of a Minecraft world store the advancement data for that world as separate JSON files.
```

You can use the `index` argument to specify which paragraph to extract.

```python
print(page.extract(mcwiki.PARAGRAPH, index=1))
```

```
All advancement JSON files are structured according to the following format:
```

The `extract_all` method will return a lazy sequence-like container of all the items the extractor could extract from the page section.

```python
for paragraph in page.extract_all(mcwiki.PARAGRAPH):
    print(paragraph)
```

You can use the `limit` argument or slice the returned sequence to limit the number of extracted items.

```python
# Both yield exactly the same list
paragraphs = page.extract_all(mcwiki.PARAGRAPH)[:10]
paragraphs = list(page.extract_all(mcwiki.PARAGRAPH, limit=10))
```
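The practical benefit of a lazy container is that limiting the result avoids processing the remaining elements. The following standard-library sketch shows that idea in isolation. It does not use mcwiki: `extract_items` is a hypothetical stand-in for an extractor, and whether mcwiki's container behaves exactly this way is an assumption.

```python
from itertools import islice

processed = []  # records which elements were actually processed


def extract_items(elements):
    """Hypothetical lazy extractor: processes one element at a time."""
    for element in elements:
        processed.append(element)
        yield element.upper()


elements = [f"paragraph {i}" for i in range(100)]

# Only the first three elements are processed; the other 97 are never touched.
first_three = list(islice(extract_items(elements), 3))

print(first_three)
print(len(processed))
```

Because the generator is consumed on demand, asking for three items costs three processing steps, regardless of how many elements the selector matched.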

## Tree Structures

The `TREE` extractor returns recursive tree-like hierarchies. You can use the `children` property to iterate through the direct children of a tree.

```python
def print_nodes(tree: mcwiki.Tree):
    for key, node in tree.children:
        print(key, node.text, node.icons)
        print_nodes(node.content)

print_nodes(section.extract(mcwiki.TREE))
```

Folded entries are automatically fetched, inlined, and cached. This means that iterating over the `children` property can yield a node that has already been visited, so make sure to guard against infinite recursion where appropriate.
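One way to guard against revisiting nodes is to remember which nodes have already been expanded. The sketch below is not part of mcwiki: `Tree` and `Node` are hypothetical stand-ins that only mimic the `children`, `text`, `icons`, and `content` attributes used in this README, and `walk` is an illustrative helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


# Hypothetical stand-ins, not mcwiki's actual classes.
@dataclass
class Tree:
    entries: list[tuple[str, "Node"]] = field(default_factory=list)

    @property
    def children(self) -> Iterator[tuple[str, "Node"]]:
        return iter(self.entries)


@dataclass
class Node:
    text: str = ""
    icons: tuple[str, ...] = ()
    content: Tree = field(default_factory=Tree)


def walk(tree: Tree, depth: int = 0, seen: set[int] | None = None):
    """Yield (depth, key, node) for each entry, expanding every node at most once."""
    if seen is None:
        seen = set()
    for key, node in tree.children:
        yield depth, key, node
        if id(node) in seen:
            continue  # already expanded elsewhere, so do not recurse again
        seen.add(id(node))
        yield from walk(node.content, depth + 1, seen)


# Build a tiny tree whose only node contains itself, then walk it safely.
root = Tree()
a = Node(text="a node that refers back to itself")
root.entries.append(("a", a))
a.content.entries.append(("loop", a))

for depth, key, node in walk(root):
    print("  " * depth + key)
```

Tracking nodes by identity keeps the sketch independent of how nodes compare or hash. A depth limit would be an alternative guard when revisiting a node is acceptable but unbounded recursion is not.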

Tree nodes have 3 attributes that can all be empty:

- The `text` attribute holds the text content of the node

- The `icons` attribute is a tuple that stores the names of the icons associated with the node

- The `content` attribute is a tree containing the children of the node

You can transform the tree into a shallow dictionary with the `as_dict` method.

```python
# Both yield exactly the same dictionary
nodes = tree.as_dict()
nodes = dict(tree.children)
```

## Contributing

Contributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses [`poetry`](https://python-poetry.org/).

```bash
$ poetry install
```

You can run the tests with `poetry run pytest`.

```bash
$ poetry run pytest
```

The project must type-check with [`pyright`](https://github.com/microsoft/pyright). If you're using VSCode, the [`pylance`](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command line.

```bash
$ npm run watch
$ npm run check
```

The code follows the [`black`](https://github.com/psf/black) code style. Import statements are sorted with [`isort`](https://pycqa.github.io/isort/).

```bash
$ poetry run isort mcwiki tests
$ poetry run black mcwiki tests
$ poetry run black --check mcwiki tests
```

---

License - [MIT](https://github.com/vberlier/mcwiki/blob/main/LICENSE)
