https://github.com/microwebstacks/content-structure

Content Structure collect all your markdown files meta data and parses the Abstract Syntax Tree of each file
https://github.com/microwebstacks/content-structure

ast cms content markdown remark structure

Last synced: 4 months ago
JSON representation

Content Structure collect all your markdown files meta data and parses the Abstract Syntax Tree of each file

Host: GitHub
URL: https://github.com/microwebstacks/content-structure
Owner: MicroWebStacks
License: mit
Created: 2023-10-16T16:03:55.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-06-15T10:09:18.000Z (almost 2 years ago)
Last Synced: 2024-10-29T14:22:11.608Z (over 1 year ago)
Topics: ast, cms, content, markdown, remark, structure
Language: JavaScript
Homepage: https://www.npmjs.com/package/content-structure
Size: 2.78 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Content Structure
Parsed markdown is stored in SQLite tables that can be used for rendering and database content management.

![overview](./docs/overview.drawio.svg)

The overview diagram above shows the main `content-structure` operations (AST parsing and assets collection in sqlite DB) and shows an example ussage of the resulting dataset which is web framework agnostic.

![design](./docs/design.drawio.svg)

A more detailed flow with focus on versioned and non versionned tables.

## Deepwiki
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/MicroWebStacks/content-structure)

# Install
prerequisites
- choco
- node-gyp
see https://github.com/nodejs/node-gyp?tab=readme-ov-file#on-windows

# Usage
```shell
npm install content-structure
```
this project is a library, for full config, see the example in [example/parse.js](example/parse.js)

collect all data by running this once
```javascript
import {collect} from 'content-structure'

await collect({
rootdir:rootdir,
contentdir:join(rootdir,"content"),
file_link_ext:["svg","webp","png","jpeg","jpg","xlsx","glb"],
outdir:join(rootdir,".structure")
})
```
see demo with
```cmd
>pnpm run demo
> node parse.js

content_dir : C:\dev\MicroWebStacks\content-structure\example\content
searching for files with extensions : *.md
Structure DB tables and row counts:
- asset_info: 19
- assets: 19
- blob_store: 14
- documents: 30
- items: 82
```

# Documentation
Content Structure produces a relational snapshot of every markdown run using the schema declared in [`catalog.yaml`](./catalog.yaml).
The catalog defines a single `structure` dataset whose tables are optimized for rendering, search indexing, and asset management. Each run populates these tables under `.structure/structure.db`.

### Table overview
| Table | Purpose | Relationships |
| --- | --- | --- |
| `documents` | Canonical row per markdown entry. Stores stable ids, routing metadata, and leftover front matter via the `meta_data` JSON column. | `items`, `assets`, and `asset_info` reference `documents.sid`. |
| `items` | Flattened AST stream in reading order. Each row keeps `body_text` for simple rendering plus an optional serialized AST subtree for nested constructs (stored in `ast`). | References `documents` via `doc_sid`; `assets` rows connect items to blobs when an AST node produces a file. |
| `assets` | Run-specific join table so consumers can tell which document referenced which asset at a given `version_id`. | Bridge between `documents` and `asset_info`; also carries the `blob_uid` for quick payload lookups. |
| `asset_info` | Deduplicated description of every asset (code blocks, tables, linked files, etc.) regardless of run. | Points to the owning document (`parent_doc_uid`) and the physical payload via `blob_uid`. |
| `blob_store` | Source of truth for payloads. Large blobs are stored under `blobs/YYYY/MM/ff/hash` and referenced by path, while small blobs inline their bytes (compressed when eligible). | `asset_info`/`assets` link to blobs through `blob_uid`. |

The catalog is intentionally compact: fields are named to match DOM concerns (`slug`, `url_type`, `level`), content analysis (`headings`, `links`, `code`), and asset lifecycle (`first_seen`, `last_seen`). Instead of memorizing every column, browse [`catalog.yaml`](./catalog.yaml) whenever you need the exact types or to extend the dataset. Downstream tools can rely on the catalog as the authoritative contract when generating queries, migrations, or analytics dashboards.

### Document behavior highlights
- **Metadata folding** – Any front matter not mapped to a declared column is serialized into `documents.meta_data`, keeping schemas manageable without losing context.
- **Automatic ordering** – Documents inherit incremental `order` values scoped to their directory level unless you pin them explicitly. This keeps navigation menus stable even when markdown files are added later.
- **Mixed routing** – Folder-style (`readme.md` or matching filenames) and file-style URLs coexist. `url_type` reveals which variant was used to generate the url.

### Item and asset lifecycle
- Paragraphs, headings, tables, code blocks, and images are all represented in `items`. Simple rows expose fully extracted text; nested structures store their sanitized AST so you can re-render bold or embedded assets without reparsing the original markdown.
- Every asset mentioned by an item produces two entries: a durable definition in `asset_info` and a run-scoped membership row in `assets`. The membership row ties the asset to both the document and its blob so you can know exactly when something was added, removed, or reused.
- Blob payloads avoid bloat with configurable thresholds: large files stream to disk under `blobs/`, while smaller text blobs can be compressed inline and served straight from SQLite.

Refer back to the catalog for exhaustive field notes, and treat the tables above as the primary contract between your markdown source and any rendering or analytics layers.

## Config parameters
the config parameter is optional and do have default values
* `rootdir` : defaults to current working directory. The path where to find the a `content` directory.
* `outdir` : defaults to `.structure`. Relative output directory is the location where all output data will be generated, which is relative to the root directory.
* `folder_single_doc` : defaults to `false`. When `true`, each folder is treated as a single document and the first YAML/YML file contributes overrides plus `meta_data` fields.
* `external_storage_kb` : defaults to `512`. Blobs larger than this size (in KB) are written to disk under `blobs////`.
* `inline_compression_kb` : defaults to `32`. Inline blobs bigger than or equal to this size are eligible for gzip compression before being stored inside the `blob_store` table.
* `file_compress_ext` : defaults to `["txt","md","json","csv","tsv","yaml","yml"]`. Inline blobs are compressed only if their source extension (when known) appears in this list.

## Generated output
* `.structure/structure.db` : a SQLite database (powered by better-sqlite3).
The database exposes the tables `documents`, `items`, `assets`, `asset_info`, and `blob_store`.
* `blobs/year/month/prefix/hash` path for all files larger than `config.external_storage_kb`

# Publish how to

```bash
git tag v2.2.1 && git push origin v2.2.1
```

delete tags
```bash
git push origin :refs/tags/v2.2.0
git tag -d v2.2.0
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/microwebstacks/content-structure

Awesome Lists containing this project

README