https://github.com/bokuweb/lawrenceanum

Host: GitHub
URL: https://github.com/bokuweb/lawrenceanum
Owner: bokuweb
License: other
Created: 2026-05-07T15:43:00.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-06-26T17:25:09.000Z (7 days ago)
Last Synced: 2026-06-26T19:14:57.267Z (7 days ago)
Language: Rust
Homepage: https://bokuweb.github.io/lawrenceanum/
Size: 1.98 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Roadmap: docs/roadmap.md

# lawrenceanum

A static-hosted JSON API + WASM-SQLite-powered SPA for Japanese statute data
(法令), built on top of e-Gov 法令API. GitHub Actions periodically pulls the
upstream data, the Rust CLI (`lawpub`) normalizes it into stable JSON, and the
result is served from GitHub Pages.

Detailed design: [docs/plan.md](docs/plan.md).

## What you get

- Static JSON API at `https://.github.io//...`
  - `index.json`, `manifest.json`, `health.json`
  - `laws/index.json`, `laws/{law_id}/{current,versions,timeline}.json`
  - `laws/{law_id}/revisions/{rev_id}.json`, `laws/{law_id}/articles/{art_id}.json`
  - `updates/latest.json`, `updates/{YYYY-MM-DD}.json`
  - `kanpo/{YYYY-MM-DD}/index.json`
  - `sitemap.xml`, `robots.txt`, `laws/all.ndjson`
- A React SPA at the same origin that consumes those JSON files
  - HashRouter, deep-linkable to any law / article / version
  - Browser-side full-text search via **WASM SQLite (sql.js) + FTS5**
  - Cross-reference graph: `第○条` → article links, backlinks panel,
    cross-law jumps for `民法第七百九条` style references
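Because the API is just a fixed directory layout, a client only needs to build paths. The following is a minimal sketch of such a path builder; `LawApi` and its methods are hypothetical names for illustration, and the base URL is whatever origin the site is published under.

```rust
/// Builds URLs for the static JSON layout listed above.
/// `LawApi` is an illustrative helper, not part of the `lawpub` crates.
pub struct LawApi {
    base: String,
}

impl LawApi {
    /// `base` is the site root; a trailing slash is tolerated.
    pub fn new(base: &str) -> Self {
        Self { base: base.trim_end_matches('/').to_string() }
    }

    /// `laws/{law_id}/current.json`
    pub fn current(&self, law_id: &str) -> String {
        format!("{}/laws/{}/current.json", self.base, law_id)
    }

    /// `laws/{law_id}/revisions/{rev_id}.json`
    pub fn revision(&self, law_id: &str, rev_id: &str) -> String {
        format!("{}/laws/{}/revisions/{}.json", self.base, law_id, rev_id)
    }

    /// `laws/{law_id}/articles/{art_id}.json`
    pub fn article(&self, law_id: &str, art_id: &str) -> String {
        format!("{}/laws/{}/articles/{}.json", self.base, law_id, art_id)
    }

    /// `updates/{YYYY-MM-DD}.json`
    pub fn updates(&self, date: &str) -> String {
        format!("{}/updates/{}.json", self.base, date)
    }
}
```

The identifiers passed in (`law_id`, `rev_id`, `art_id`) are opaque strings here; their actual format is defined by the generated JSON, not by this sketch.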

## Local quickstart

```bash
cargo build --release -p lawpub-cli
./target/release/lawpub update --public public --cache .cache --provider mock
./target/release/lawpub validate --public public

# Serve the SPA on top
cd figma && pnpm install
pnpm dev          # http://localhost:5173/   (a custom Vite middleware
                  # serves ../public/*.json so the SPA reads live JSON)

# or production build:
pnpm build        # writes index.html + assets/ next to the JSON
```

Generated files under `public/`:

```
public/
├── index.json / manifest.json / health.json / sitemap.xml / robots.txt
├── laws/
│   ├── index.json
│   ├── all.ndjson
│   └── {law_id}/
│       ├── current.json
│       ├── versions.json
│       ├── timeline.json
│       ├── revisions/{rev_id}.json
│       └── articles/{art_id}.json
├── updates/{latest.json,{YYYY-MM-DD}.json}
├── kanpo/{YYYY-MM-DD}/index.json
├── schema/{law-document,manifest,updates}.json
├── search.db                                  # SQLite + FTS5
├── index.html / assets/                       # SPA build output
state/latest.json                              # cron-managed pointer
```

## CLI surface

```text
lawpub update         --public public --cache .cache [--provider http|mock] [--date YYYY-MM-DD] [--force]
lawpub fetch-update   --date YYYY-MM-DD --cache .cache
lawpub fetch-range    --from YYYY-MM-DD --to YYYY-MM-DD --cache .cache [--provider http|mock]
lawpub fetch-bulk     --category N [--limit M] --cache .cache [--provider http|mock]
lawpub build-json     --input .cache --output public
lawpub build-index    --output public
lawpub kanpo-fetch    --date YYYY-MM-DD --cache .cache
lawpub kanpo-link     --output public
lawpub validate       --public public
lawpub status         --public public --cache .cache
```

The provider defaults to `http` and uses `https://laws.e-gov.go.jp/api/1` (v1
API; v2 has a different path scheme: `/api/2/laws`, `/api/2/law_data/{id}`).
Override with `LAWPUB_PROVIDER` and `LAWPUB_EGOV_BASE_URL`.

## Workspace layout

| crate | purpose |
|---|---|
| `crates/egov-client`     | e-Gov fetcher (`HttpProvider`, `MockProvider`) |
| `crates/law-normalizer`  | LawXML → normalized `LawDocument` |
| `crates/kanpo-client`    | 官報 site scraper (Phase 3, mock for now) |
| `crates/kanpo-linker`    | amendment ↔ 官報 PDF matching with confidence score |
| `crates/search-index`    | bigram tokenizer + SQLite FTS5 builder + ref-graph extractor |
| `crates/lawpub-cli`      | the `lawpub` binary |

## Browser search (WASM SQLite + FTS5 over Cloudflare R2)

`lawpub` emits `public/search.db` (SQLite + FTS5, ~1.5 GB at full bulk) at
build time. The SPA reads it through **sql.js-httpvfs** (sqlite.org's
Emscripten WASM build + an HTTP-Range VFS). Each query downloads only the
SQLite **pages** (4 KB) it needs, typically 100-300 KB per query, so the
1.5 GB DB stays remote.
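To make the page arithmetic concrete, here is a minimal sketch of how a 1-indexed SQLite page number maps to an HTTP `Range` header value. It is an illustration only, not sql.js-httpvfs's actual code (which may batch or read ahead), and `page_range_header` is a hypothetical helper name.

```rust
/// Byte range covered by a 1-indexed SQLite page, formatted as an HTTP `Range` value.
/// Returns `None` for page 0 (SQLite page numbers start at 1) or a zero page size.
pub fn page_range_header(page_no: u64, page_size: u64) -> Option<String> {
    if page_no == 0 || page_size == 0 {
        return None;
    }
    let start = (page_no - 1) * page_size;
    let end = start + page_size - 1; // HTTP byte ranges are inclusive on both ends
    Some(format!("bytes={}-{}", start, end))
}
```

With 4 KB pages, a query that touches 50 pages transfers about 200 KB, which is in line with the 100-300 KB figure above.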

Hosting options:

| Option | search.db location | When to use |
|---|---|---|
| **GitHub Pages only (default)** | `public/search.db` (same origin) | OK for tiny demos (<50 MB); pushes to GitHub have a hard 100 MB per-file limit |
| **Cloudflare R2 (recommended)**  | `https:///search.db` via `VITE_SEARCH_DB_URL` | Production / full bulk. R2 free tier (10 GB storage + free egress) covers personal use indefinitely |
| Turso / D1                       | Their HTTP API | Only if edge-replicated reads matter |

### R2 setup (one-time)

1. Sign up for Cloudflare (free). R2 dashboard → **Create bucket** (e.g.
   `lawrenceanum`).
2. Bucket settings → **Public access** → enable "r2.dev subdomain". Note the
   public URL `https://pub-.r2.dev`.
3. Bucket settings → **CORS policy** → allow your Pages origin:

   ```json
   [
     {
       "AllowedOrigins": ["https://.github.io"],
       "AllowedMethods": ["GET"],
       "AllowedHeaders": ["range", "if-match", "if-none-match"],
       "ExposeHeaders": ["content-length", "content-range", "etag"],
       "MaxAgeSeconds": 86400
     }
   ]
   ```

4. R2 → **Manage R2 API tokens** → create token with **Object Read & Write**
   on that single bucket.
5. GitHub repo → Settings → Secrets and variables → Actions → add:

   | Secret | Example |
   |---|---|
   | `R2_ACCOUNT_ID`       | your account id |
   | `R2_ACCESS_KEY_ID`    | from step 4 |
   | `R2_SECRET_ACCESS_KEY`| from step 4 |
   | `R2_BUCKET`           | `lawrenceanum` |
   | `R2_ENDPOINT`         | `https://.r2.cloudflarestorage.com` |
   | `R2_PUBLIC_URL`       | `https://pub-.r2.dev` |

When `R2_BUCKET` and `R2_ENDPOINT` are both set, the workflow uploads
`search.db` to R2 after `validate`, removes it from the Pages artifact, and
builds the SPA with `VITE_SEARCH_DB_URL=$R2_PUBLIC_URL/search.db`. With the
secrets unset, everything still works (`search.db` stays in `public/`).

- Indexed at the article level. The FTS5 virtual table has columns
  `law_id` / `article_id` / `article_no` / `caption` / `title_tokens` /
  `content_tokens`.
- Japanese is pre-tokenized as **character bigrams**
  (`crates/search-index::tokenize` and
  `figma/src/app/data/search-engine::tokenize` are kept in lockstep).
- Queries go through the same bigram tokenizer; FTS5 `snippet()` produces
  highlighted excerpts.
- A `meta` table stores `built_at` / `law_count` / `article_count` /
  `ref_count`.
- A `refs` table stores cross-references between articles:

  ```sql
  CREATE TABLE refs (
    from_law_id TEXT, from_article_id TEXT,
    to_law_id   TEXT, to_article_id   TEXT,
    ref_text TEXT,
    ref_type TEXT  -- 'self_article' | 'previous_article' | 'next_article' | 'cross_law'
  );
  ```

  Extraction uses Aho-Corasick (`MatchKind::LeftmostLongest`) to keep build
  time linear in body length × match count even with thousands of laws.
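The character-bigram tokenization named in the list above can be sketched as follows. This is a minimal illustration, not the code in `crates/search-index`: it assumes whitespace separates runs and that a one-character run is emitted as a single token, and the real tokenizer may normalize text differently. `bigram_tokens` is a hypothetical name.

```rust
/// Split text into overlapping two-character tokens (character bigrams).
/// Assumption: whitespace separates runs, and a one-character run is kept as one token.
pub fn bigram_tokens(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    for run in text.split_whitespace() {
        let chars: Vec<char> = run.chars().collect();
        if chars.len() == 1 {
            tokens.push(chars[0].to_string());
            continue;
        }
        // Slide a window of two characters across the run.
        for pair in chars.windows(2) {
            tokens.push(pair.iter().collect());
        }
    }
    tokens
}
```

Running the same function over both the indexed text and the query is what lets a query such as `損害賠償` match as the token sequence `損害`, `害賠`, `賠償` without a dictionary-based segmenter.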

The browser exposes `getOutgoingRefs` / `getIncomingRefs` / `getRefsForLaw` and
the Browse detail view linkifies article text in place. Clicking a reference
scrolls to `#article_id`; cross-law references navigate to
`/laws/{other_id}#{article_id}`. Each article header also lists incoming
references as backlinks.
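Resolving a reference such as `第七百九条` to a link target requires turning the kanji numeral into a number. The sketch below shows one way to do that for values below 10,000. It is an assumption about how such parsing could work, not the project's extractor; `kanji_to_number` and `article_number` are hypothetical names, and branch suffixes such as `条の二` are deliberately not handled.

```rust
/// Parse a kanji numeral below 10,000 made of 一..九 and 十/百/千.
/// Returns `None` for empty input, unknown characters, or two digits in a row.
pub fn kanji_to_number(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    let mut total = 0u32;
    let mut digit = 0u32; // pending digit waiting for a unit such as 百
    for c in s.chars() {
        let d = match c {
            '一' => 1, '二' => 2, '三' => 3, '四' => 4, '五' => 5,
            '六' => 6, '七' => 7, '八' => 8, '九' => 9,
            _ => 0,
        };
        if d > 0 {
            if digit != 0 {
                return None; // positional forms like "二三" are out of scope
            }
            digit = d;
            continue;
        }
        let unit = match c {
            '十' => 10,
            '百' => 100,
            '千' => 1000,
            _ => return None,
        };
        // A bare unit means one of it: "十" is 10, "百" is 100.
        total += digit.max(1) * unit;
        digit = 0;
    }
    Some(total + digit)
}

/// Extract the article number from text shaped exactly like "第<numeral>条".
pub fn article_number(reference: &str) -> Option<u32> {
    let inner = reference.strip_prefix('第')?.strip_suffix('条')?;
    kanji_to_number(inner)
}
```

For a cross-law reference like `民法第七百九条`, a caller would first split off the law name and then pass the remainder to `article_number`.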

`/search` lazy-loads sql-wasm (~320 KB gzip) + `search.db` on first navigation.
It falls back to a mock filter when the DB is unreachable, so local dev still
works.

Inspired by ellisii's [`jp-tokenizer-bigram`](../ellisii/crates/jp-tokenizer-bigram/)
and [`store-sqlite`](../ellisii/crates/store-sqlite/).

## Web UI (static SPA)

`figma/` doubles as the design source-of-truth and the actual UI implementation
(Vite + React + Tailwind v4 + shadcn/ui). It builds straight into the same
`public/` directory the JSON lives in.

- `base: './'` so assets are relative, which works on any GitHub Pages sub-path
- `outDir: ../public`, `emptyOutDir: false` so the JSON survives a Vite build
- `publicDir: false` to avoid copying assets into themselves
- Dev mode: `lawpubJsonDevServer` Vite middleware serves `../public/*.json`
  on the fly so `pnpm dev` sees live data without a separate server
- Lazy-loaded chart bundle (recharts ≈ 420 KB) via `React.lazy`, kept out of
  the initial dashboard render

### CI step order

1. `lawpub update` writes JSON via atomic `public.tmp/` → rename
2. `lawpub kanpo-link` overlays 官報 matches on each `timeline.json`,
   recomputes `manifest.json`
3. **Change detection**: read the `changed` field of `state/last_run.json`;
   if `false`, skip the rest
4. `pnpm build` adds `index.html` + `assets/` to `public/` (JSON untouched)
5. `lawpub validate` cross-checks every manifest entry's sha256
6. `actions/configure-pages` → `actions/upload-pages-artifact`
7. `git commit && git push` (`public/` plus `state/latest.json`)
8. Separate `deploy` job runs `actions/deploy-pages`
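Step 1's atomic replacement, together with the `public.bak/` rollback described under Failure handling, can be sketched with `std::fs::rename`. This is a minimal sketch under the assumption that all three directories live on the same filesystem, since `rename` does not work across mount points; `swap_dir` is a hypothetical name, not the function `lawpub` uses.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replace `live` with `staged`, parking the old tree in `backup` until the swap succeeds.
/// On failure the old tree is moved back so `live` is never left missing.
pub fn swap_dir(staged: &Path, live: &Path, backup: &Path) -> io::Result<()> {
    // Clear any stale backup left over from an earlier, interrupted run.
    if backup.exists() {
        fs::remove_dir_all(backup)?;
    }
    let had_live = live.exists();
    if had_live {
        fs::rename(live, backup)?;
    }
    if let Err(err) = fs::rename(staged, live) {
        // Roll back: restore the previous tree before reporting the error.
        if had_live {
            let _ = fs::rename(backup, live);
        }
        return Err(err);
    }
    if had_live {
        fs::remove_dir_all(backup)?;
    }
    Ok(())
}
```

Writing everything into the staging directory first means readers of `public/` only ever see a complete old tree or a complete new one.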

## Auto-update via GitHub Actions

`update-law-data.yml` is driven by three triggers:

| Trigger | Behaviour |
|---|---|
| `schedule` (JST 06:30 / 12:30 / 18:30 / 00:30) | Pull latest e-Gov diff, commit + deploy if anything changed |
| `push` (merge to `main`) | Rebuild SPA over the existing committed `public/` and redeploy. **No** e-Gov fetch, **no** auto-commit |
| `workflow_dispatch` | Pick `provider` / `date` / `force` / `from_date` / `to_date` / `bulk_category` / `bulk_limit` |
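GitHub Actions evaluates `schedule` cron expressions in UTC, so the JST times in the table have to be shifted by nine hours when written into a workflow. A small sketch of that conversion follows; `jst_to_utc` is a hypothetical helper, and the actual cron lines in `update-law-data.yml` are not reproduced here.

```rust
/// Convert a JST (UTC+9) wall-clock time to UTC.
/// Returns (hour, minute, day_offset); day_offset is -1 when the UTC time
/// falls on the previous calendar day.
pub fn jst_to_utc(hour: u32, minute: u32) -> (u32, u32, i32) {
    assert!(hour < 24 && minute < 60, "time out of range");
    if hour >= 9 {
        (hour - 9, minute, 0)
    } else {
        (hour + 24 - 9, minute, -1)
    }
}
```

The 12:30 JST slot maps to 03:30 UTC, which matches the `ran_at` value `2026-05-09T03:30:00Z` in the sample `state/last_run.json`.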

Auto-commits use `GITHUB_TOKEN`, which by GitHub policy does not re-trigger
workflows, so a cron auto-commit cannot create a deploy loop.

### Change detection (no-op suppression)

`lawpub update` writes `state/last_run.json` (gitignored) on every run:

```json
{
  "version": 1,
  "ran_at": "2026-05-09T03:30:00Z",
  "provider": "http",
  "dates": ["2026-05-06", "2026-05-07", "2026-05-08", "2026-05-09"],
  "new_xmls": 0,
  "errors": [],
  "changed": false
}
```

If the sha256-deduped revision store (`.cache/revisions/`) gained no new XMLs
**and** `public/manifest.json` already exists, the run reports `changed=false`
and every downstream step (build / commit / deploy) is skipped. So idle hours
on the e-Gov side do not bloat git history.
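The rule above reduces to a single boolean expression. The sketch below restates it under the assumption that only those two inputs matter; `RunSummary` and `is_changed` are hypothetical names, and flags such as `--force` are not modelled.

```rust
/// The two facts the change-detection rule depends on.
pub struct RunSummary {
    /// Number of XMLs newly added to the sha256-deduped revision store.
    pub new_xmls: usize,
    /// Whether `public/manifest.json` already exists.
    pub manifest_exists: bool,
}

/// A run is unchanged only when nothing new was fetched AND a manifest is already published.
pub fn is_changed(run: &RunSummary) -> bool {
    !(run.new_xmls == 0 && run.manifest_exists)
}
```

Note that a first run with no manifest still counts as changed even when `new_xmls` is 0, so an empty `public/` is never mistaken for an up-to-date one.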

### Failure handling

- HttpProvider retries each request three times with exponential backoff. A
  failed date is logged in `errors` and other dates keep going (plan §14).
- `public/` is replaced atomically via `public.tmp/` → `public.bak/` →
  rename. A failure mid-swap is rolled back from the backup.
- `concurrency: update-law-json` serializes overlapping schedule + dispatch
  runs.
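The retry behaviour in the first bullet can be sketched generically. This is not `HttpProvider`'s code: it assumes "three times" means three attempts in total and that the delay doubles after each failure; `retry_with_backoff` is a hypothetical name and the base delay is up to the caller.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `op` up to `max_attempts` times, doubling the delay after each failure.
/// Returns the first success, or the last error once attempts are exhausted.
/// `op` receives the 1-based attempt number.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential: base, 2x base, 4x base, ...
                attempt += 1;
            }
        }
    }
}
```

A caller that wants the "other dates keep going" behaviour would wrap each date's fetch in this function and push the final `Err` into an `errors` list instead of aborting the loop.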

### Manual triggers

```bash
# Single date (overrides the auto state-based range)
gh workflow run update-law-data.yml -f date=2026-05-01

# Range backfill (fill in dates before cron started)
gh workflow run update-law-data.yml \
  -f from_date=2024-04-01 -f to_date=2026-05-09

# Bulk fetch (one-shot collection of every law in a category)
#   1 = 憲法・法律 (Constitution and Acts)
#   2 = 政令・勅令 (Cabinet Orders and Imperial Ordinances)
#   3 = 府省令・規則 (Ministerial Ordinances and Rules)
gh workflow run update-law-data.yml -f bulk_category=1
gh workflow run update-law-data.yml -f bulk_category=2 -f bulk_limit=500

# Force a redeploy without touching e-Gov
gh workflow run update-law-data.yml -f force=true
```

### One-time amendment-history backfill (e-Gov API v2)

`/api/1/lawdata/{id}` (currently used by bulk/cron) only returns the law's *current*
snapshot, with no historical revisions. To populate the timeline with actual amendment
history (e.g. 民法 has ~33 revisions going back to the Heisei era), we use e-Gov API v2's
`/law_revisions/{id}` endpoint. This is a one-time backfill done **locally** (not in
Actions) because it makes ~9000 requests and would be slow / risky in CI.

```bash
# 1. Smoke-test on a few laws first. The IDs come from public/laws/index.json
#    (auto-committed by Actions), so this runs on a fresh checkout without .cache.
./target/release/lawpub fetch-revisions --from-public ./public --limit 5

# 2. Full backfill. Concurrency 2 is e-Gov-friendly (CloudFront rate-limits at ~4+).
#    Resumes if interrupted; existing per-law JSONs are skipped (use --force to redo).
./target/release/lawpub fetch-revisions --from-public ./public --concurrency 2

#    Alternative: when .cache/revisions/ is already populated locally:
# ./target/release/lawpub fetch-revisions --all --concurrency 2

# 3. Pack the per-law JSONs into a single jsonl for shipping.
./target/release/lawpub bundle-revisions-meta --mode pack \
  --dir .cache/revisions_meta --file .cache/revisions_meta.jsonl

# 4. Upload to R2 via wrangler (uses your `wrangler login` session; no R2
#    access key needed locally). CI's "Restore revisions_meta from R2" step
#    later pulls the same object back via the S3 API.
export R2_BUCKET=
pnpm install               # installs wrangler (root devDependency)
pnpm upload-revisions-meta # = wrangler r2 object put "$R2_BUCKET/revisions_meta.jsonl" ...

# 5. Trigger a force rebuild so build-json picks up the new meta.
gh workflow run "Update law JSON" -f force=true
```

The upload uses `wrangler` (a root `devDependency`); `pnpm upload-revisions-meta`
wraps `wrangler r2 object put ... --remote`. CI reads the object back with the
S3 API + `R2_*` secrets, so the upload and download paths differ but hit the same
R2 object.

After this, the cron path (`lawpub update`) refreshes the meta for *only* the
laws updated that day, so the timeline stays fresh without re-running the
full backfill.

Priority is `bulk_category > from_date/to_date > date > automatic state-based`.
Bulk runs do thousands of requests × 200 ms throttle, so the workflow's
`timeout-minutes` is 360. If a bulk run dies partway through, the in-job
`.cache/revisions/` still holds whatever it managed to fetch and `build-json`
will produce a partial `public/`.
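The input priority stated above can be sketched as a small resolver. The type and function names (`DispatchInputs`, `FetchMode`, `resolve_mode`) are hypothetical, and the sketch assumes a range is only selected when both `from_date` and `to_date` are present; the workflow's real handling of a half-specified range is not documented here.

```rust
/// Inputs a `workflow_dispatch` run may carry; all are optional.
#[derive(Default)]
pub struct DispatchInputs {
    pub bulk_category: Option<u8>,
    pub from_date: Option<String>,
    pub to_date: Option<String>,
    pub date: Option<String>,
}

/// What the run will actually fetch.
#[derive(Debug, PartialEq)]
pub enum FetchMode {
    Bulk { category: u8 },
    Range { from: String, to: String },
    SingleDate(String),
    /// State-based range derived from the previous run.
    Auto,
}

/// Apply the priority: bulk_category > from_date/to_date > date > automatic.
pub fn resolve_mode(inputs: &DispatchInputs) -> FetchMode {
    if let Some(category) = inputs.bulk_category {
        return FetchMode::Bulk { category };
    }
    if let (Some(from), Some(to)) = (&inputs.from_date, &inputs.to_date) {
        return FetchMode::Range { from: from.clone(), to: to.clone() };
    }
    if let Some(date) = &inputs.date {
        return FetchMode::SingleDate(date.clone());
    }
    FetchMode::Auto
}
```

Checking the inputs in a fixed order with early returns keeps the precedence readable and makes the fallback to the automatic range explicit.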

## Status

Up and running on Pages. The cron is incremental from the moment it starts;
historical revisions only accumulate going forward unless you explicitly
backfill via `bulk_category=N`, `from_date=…/to_date=…`, or the one-time v2
`fetch-revisions` backfill described above. The e-Gov v1 API used by the cron
path has no endpoint that returns historical revisions of a single law (only
the current version + a daily-update list), so without the v2 backfill deeper
history requires the daily snapshots to keep stacking up over time.

## License

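- **Personal use**: Non-commercial personal use (including use, modification,
  and redistribution for learning, research, verification, or hobby purposes)
  is free of charge and unrestricted.
- **Commercial use**: Use for commercial purposes requires prior inquiry and
  permission.

See [LICENSE](LICENSE) for details.

Note that the statute data handled by this repository is derived from public
data sources such as the e-Gov 法令API. The terms of use for the data itself
are governed by each provider's terms.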