Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/guardian/typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
https://github.com/guardian/typerighter

production

Last synced: 5 days ago
JSON representation

Even if you’re the right typer, couldn’t hurt to use Typerighter!

Awesome Lists containing this project

README

        

# Typerighter

image

Typerighter is the server-side part of a service to check a document against a set of user-defined rules. It's designed to work like a spelling or grammar checker. It contains two services, the [checker](https://checker.typerighter.gutools.co.uk/) and the [rule manager](https://manager.typerighter.gutools.co.uk/) – see [architecture](#architecture) for more information.

We use it at the Guardian to check content against our style guide. Max Walker, the subeditor who inspired the creation of Typerighter, has written an introduction [here](https://www.theguardian.com/help/insideguardian/2020/nov/20/introducing-typerighter-making-life-easier-for-journalists-and-stories-better-for-readers).

To understand our goals for the tool, see the [vision document](./vision.md).

For setup, see [the docs directory](./docs/).

For an example of a Typerighter client (the part that presents the spellcheck-style interface to the user), see [prosemirror-typerighter](https://github.com/guardian/prosemirror-typerighter).

## How it works: an overview

The Typerighter Rule Manager produces a JSON artefact (stored in S3) which is ingested by the Checker service. This artefact represents all the rules in our system, currently including user-defined regex rules, user-defined Language Tool pattern rules (defined as XML) and Language Tool core rules (pre-defined rules from Language Tool). Historically, rules were derived from a Google Sheet, rather than the Rule Manager.

Each rule in the service corresponds to a `Matcher` that receives the document and passes back a list of `RuleMatch`. We have the following `Matcher` implementations:

- `RegexMatcher` uses regular expressions
- `LanguageToolMatcher` is powered by the [LanguageTool](https://languagetool.org/) project, and uses a combination of native LanguageTool rules and user-defined XML rules as its corpus

Matches contain the range that match applies to, a description of why the match has occurred, and any relevant suggestions – see the `RuleMatch` interface for the full description.

## Architecture

### Roles

- Rule owner: a person responsible for maintaining the rules that Typerighter consumes.
- Rule user: a person checking their copy with the checker service.

The system consists of two Scala services:
- The rule-manager service, which is responsible for the lifecycle of Typerighter's corpus of rules, and publishes them as an artefact
- The checker service, which consumes that artefact and responds to requests to check copy against the corpus of rules with matches.

They're arranged like so:

```mermaid
flowchart LR
checker[Checker service]
manager[Manager service]
sheet[Google Sheet]
client[Typerighter client]
s3[(typerighter-rules.json)]
db[(Postgres DB)]
owner{{Rule owner role}}
user{{Rule user role}}

sheet--"Get rules"-->manager
manager--"Write rules"-->db
db--"Read rules"--> manager
manager--"Write rule artefact"-->s3
checker--"Read rule artefact"-->s3
client--"Request matches"-->checker

owner-."Force manager to re-fetch sheet".->manager
user-."Request document check".->client
owner-."Edit rules".->sheet
```

### The checker service

Typerighter's built to manage document checks of every kind, include checks that we haven't yet thought of. To that end, a `MatcherPool` is instantiated for each running checker service, which is responsible for managing incoming checks, including parallelism, backpressure, and ensuring that our checks are given to the appropriate matchers.

A `MatcherPool` accepts any matcher instance that satisfies the `Matcher` trait. Two core `Matcher` implementations include `RegexMatcher`, that checks copy with regular expressions, and `LanguageToolMatcher`, that checks copy with an instance of a `JLanguageTool`. The `MatcherPool` is excited to accommodate new matchers in the future! Here's a diagram to illustrate:
```mermaid
flowchart TD
CH(["Check requests"])
MP-."matches[]".->CH
MP[MatcherPool]--has many--->MS
CH-.document.->MP
subgraph MS[Matchers]
R[RegexMatcher]
L[LanguageToolMatcher]
F[...FancyHypotheticalAIMatcher]
end
```

## Implementation

Both the Checker and Rule Manager services are built in Scala with the Play framework. Data in the Rule Manager is stored in a Postgres database, queried via ScalikeJDBC.

Google credentials are fetched from SSM using AWS Credentials or Instance Role.

It's worth noting that, at the moment, there are a fair few assumptions built into this repository that are Guardian-specific:
- We assume the use of AWS cloud services, and default to the `eu-west-1` region. This is configurable on a [per-project](https://github.com/guardian/typerighter/blob/main/apps/checker/conf/application.conf) basis with the [configuration parameter `aws.region`](https://github.com/guardian/typerighter/blob/fa90ef260cd71e0f4fa1b893d7bba9b87ff828ef/apps/common-lib/src/main/scala/com/gu/typerighter/lib/CommonConfig.scala#L16).
- Building and deployment is handled by riff-raff, [the Guardian's deployment platform](https://github.com/guardian/riff-raff).
- Configuration is handled by [simple-configuration](https://github.com/guardian/simple-configuration).

We'd be delighted to participate in discussions, or consider PRs, that aimed to make Typerighter easier to use in a less institionally specific context.

## Integration

The [prosemirror-typerighter](https://github.com/guardian/prosemirror-typerighter) plugin provides an integration for the [Prosemirror](https://prosemirror.net) rich text editor.

If you'd like to provide your own integration, this service will function as a standalone REST platform, but you'll need to use [pan-domain-authentication](https://github.com/guardian/pan-domain-authentication) to provide a valid auth cookie with your requests.

## Upgrading LanguageTool

LanguageTool has core rules that we use, and as we upgrade LT, these could change underneath us.

There's a script to see if rules have changed as a result of an upgrade in ./script/js/compare-rule-xml.js.

## Formatting

### Prettier formatting

Prettier is installed in the client app using the Guardian's recommended [config](https://github.com/guardian/csnx/tree/main/libs/%40guardian/prettier). To format files you can run `npm run format:write`. A formatting check will run as part of CI.

To configure the IntelliJ Prettier plugin to format on save see the guide [here](https://www.jetbrains.com/help/idea/prettier.html#ws_prettier_configure). To configure the VS Code Prettier plugin see [here](https://github.com/prettier/prettier-vscode#format-on-save).

### Scala formatting
Typerighter uses [Scalafmt](https://scalameta.org/scalafmt/) to ensure consistent linting across all Scala files.

To lint all files you can run `sbt scalafmtAll`
To confirm all files are linted correctly, you can run `sbt scalafmtCheckAll`

You can configure your IDE to format scala files on save according to the linting rules defined in [.scalafmt.conf](.scalafmt.conf)

For intellij there is a guide to set up automated linting on save [here](https://www.jetbrains.com/help/idea/work-with-scala-formatter.html#scalafmt_config) and [here](https://scalameta.org/scalafmt/docs/installation.html). For visual studio code with metals see [here](https://scalameta.org/scalafmt/docs/installation.html#vs-code)

### Automatic formatting

The project contains a pre-commit hook which will automatically run the Scala formatter on all staged files. To enable this, run `./script/setup` from the root of the project.

## Developer how-tos

### Connecting to the rule-manager database in CODE or PROD

Sometimes it's useful to connect to the databases running in AWS to inspect the data locally.

We can use `ssm-scala` to create an SSH tunnel that exposes the remote database on a local port. For example, to connect to the CODE database, we can run:

```bash
ssm ssh -x -t typerighter-rule-manager,CODE -p composer --rds-tunnel 5000:rule-manager-db,CODE
```

You should then be able to connect the database on `localhost:5000`. You'll need to use the username and password specified in [AWS parameter store](https://eu-west-1.console.aws.amazon.com/systems-manager/parameters/?region=eu-west-1&tab=Table) at `/${STAGE}/flexible/typerighter-rule-manager/db.default.username` and `db.default.password`.

Don't forget to kill the connection once you're done! Here's a handy one-liner: `kill $(lsof -ti {PORT_NUMBER})`