# Mnemosyne

> Mnemosyne (mnɛːmosýːnɛː) is the Greek goddess of memory. "Mnemosyne" is derived
> from the same source as the word mnemonic, that being the Greek word mnēmē,
> which means "remembrance, memory".

Given an effectful operation `F[A]`, this library can be used to wrap the
computation so that it's only executed once[*](#limitations). It achieves this
by remembering which operations have already been executed and what their result
`A` was.

It is based on two main concepts:

- `id`: An identifier for the operation. This can be any value that is both
consistent and unique for each operation. _Only one[*](#limitations)_
operation with the same `id` will be allowed to run.
- `contextId`: A context identifier, usually the type of operation being
performed. This is used to keep different types of operation separate within a
service, so that they can all use the same `id` (e.g. they are all processing
the same entity).

The library works across multiple nodes that share the same `contextId`. The
persistence is based on [DynamoDB](https://aws.amazon.com/dynamodb/) and its
strong write consistency. The same concept can be applied to [Apache
Cassandra](http://cassandra.apache.org/) or any other database that
provides these two features:

- Strong write consistency
- Upsert with the previous record values returned
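
To make the second requirement concrete, the sketch below shows what such an
upsert could look like against DynamoDB with the AWS SDK v2 for Java: asking
for `ReturnValue.ALL_OLD` makes a single write also hand back the record as it
was before the write. The table and attribute names are illustrative only; the
library's built-in repository (described below) does this for you.

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, ReturnValue, UpdateItemRequest}

// Illustrative only: write the new attempt and atomically get back the
// previous record (if any) in a single call. Table and attribute names
// are hypothetical.
def recordAttempt(
    client: DynamoDbClient,
    id: String,
    contextId: String
): Option[Map[String, AttributeValue]] = {
  val request = UpdateItemRequest.builder
    .tableName("deduplication")
    .key(
      Map(
        "id"        -> AttributeValue.builder.s(id).build,
        "contextId" -> AttributeValue.builder.s(contextId).build
      ).asJava
    )
    .updateExpression("SET startedAt = :now")
    .expressionAttributeValues(
      Map(":now" -> AttributeValue.builder.n(System.currentTimeMillis.toString).build).asJava
    )
    .returnValues(ReturnValue.ALL_OLD) // return the record as it was before this write
    .build

  val previous = client.updateItem(request).attributes
  if (previous.isEmpty) None else Some(previous.asScala.toMap)
}
```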

## Versions

| Version | Description              |
|---------|--------------------------|
| 0.x     | Old logic                |
| 1.x     | Old logic, Cats Effect 3 |
| 2.x     | New logic                |
| 3.x     | New logic, Cats Effect 3 |

## Add the dependency to your project

You'll need to add our public Maven repository:

```scala
resolvers += "Artifactory maven" at "https://kaluza.jfrog.io/artifactory/maven"
```

Then add this snippet to your `build.sbt`:

```scala
libraryDependencies += "com.ovoenergy.comms" %% "deduplication" % "$VERSION"
```

An [example terraform file](example.tf) is provided for provisioning the
backing DynamoDB table.

## How to use it

The two main abstractions in the library are `Deduplication` and
`DeduplicationContext`.

`Deduplication` is the entry point.
It holds the global configuration and a reference to the data storage,
and would usually be instantiated once at the beginning of your program.

`DeduplicationContext` is used to deduplicate a single effectful operation.
You can have as many contexts as you like, as long as they each have a
different `contextId`.
New contexts can be instantiated by calling `.context()` on a
`Deduplication` instance.

Once you have an instance of `DeduplicationContext`, you can use
`context.protect(id: ID, fa: F[A])` to wrap your side effects.

#### DynamoDB backend

In order to create a new `Deduplication` instance you will need an
implementation of `ProcessRepo`, the data backend responsible for storing
which operations have been executed and what their results were.

The library has a built-in implementation of `ProcessRepo` that uses DynamoDB
and [Meteor](https://d2a4u.github.io/meteor/) under the hood. You can use it by
creating an instance of `Deduplication` with the
`com.ovoenergy.comms.deduplication.meteor.MeteorDeduplication` factory object.

If you use the meteor implementation, make sure the meteor codecs for the
return type of your operations are in scope when you instantiate new
contexts.

#### Example

In the following example a service consumes a stream of events and performs
two effectful operations on each of them.

Using the library we make sure[*](#limitations) that, if the same event is
consumed more than once, the side effects are not re-executed.

```scala
import cats.effect._
import com.ovoenergy.comms.deduplication
import com.ovoenergy.comms.deduplication.meteor.MeteorDeduplication
import com.ovoenergy.comms.deduplication.meteor.codecs._
import fs2.Stream
import meteor.CompositeKeysTable
import meteor.syntax._

// Placeholder event type for this example
trait MyEvent {
  def id: String
}

// A stream of events.
// This could contain duplicates itself, or the same event could be present
// in different streams across service instances
val events: Stream[IO, MyEvent] = ???

// Effectful operations that need to be executed for each event
def sendEmail(evt: MyEvent): IO[String] = ???
def storeEmail(evt: MyEvent, sendId: String): IO[Unit] = ???

// Global configuration
val dedupConf: deduplication.Config = ???
val dedupTable: CompositeKeysTable[String, String] = ???

val dedupResource = for {
  client <- meteor.Client.resource[IO]
  dedup <- MeteorDeduplication.resource[IO, String, String](
    client,
    dedupTable,
    dedupConf
  )
} yield dedup

dedupResource.use { deduplication =>

  // Create two contexts to deduplicate the operations separately
  val sendEmailCtx = deduplication.context[String]("sendEmail")
  val storeEmailCtx = deduplication.context[Unit]("storeEmail")

  events
    .evalMap { evt =>
      for {
        // Wrap the operations in a protect call.
        // protect returns the stored result if sendEmail(evt) was already
        // executed in a different thread or process
        sendId <- sendEmailCtx.protect(evt.id, sendEmail(evt))
        _ <- storeEmailCtx.protect(evt.id, storeEmail(evt, sendId))
      } yield ()
    }
    .compile
    .drain
}
```

## How does it work

The library is based on a two-phase commit strategy. It records when an
operation starts being executed within a context and when it completes.
It provides a `protect` method that wraps an effectful operation to
guarantee[*](#limitations) that it runs only once per `id` within each
`contextId`.

The DynamoDB table has this structure:

- `id`: S - The unique identifier of the execution
- `contextId`: S - The unique identifier of the context
- `startedAt`: N - The datetime when the operation started being processed
- `result`: M - An object containing the result of the execution
- `expiresOn`: N - The datetime when the process result will expire
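
As a mental model, a stored record can be pictured as the following case
class. This is an illustrative sketch, not the library's internal
representation: `result` is absent while the operation is in flight and is
populated, together with `expiresOn`, when it completes.

```scala
import java.time.Instant

// Hypothetical shape of a stored record, mirroring the table structure above
final case class ProcessRecord[A](
  id: String,                 // unique identifier of the execution
  contextId: String,          // unique identifier of the context
  startedAt: Instant,         // when the current attempt started
  result: Option[A],          // None while the operation is still running
  expiresOn: Option[Instant]  // set together with result on completion
)
```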

Each time a context with a given `contextId` attempts to execute an operation
identified by `id`, it upserts a record with `id`, `contextId` and `startedAt`
into the table. If a record with the given `id` and `contextId` is already
present, its value is returned to the library; otherwise nothing is returned.
After the operation has run successfully, the library marks it as completed by
storing the `result` and `expiresOn` fields.

The `expiresOn` field allows old data to be cleaned up and the same operation
to re-run after some time.

When the library attempts to start a process, these scenarios can happen:

1. The operation has never been processed before (no previous record found)
2. A previous attempt has timed out (`result` is absent and `startedAt +
maxProcessingTime` is in the past)
3. The operation has already completed (`result` is present)
4. The operation is still being processed (`result` is absent and `startedAt +
maxProcessingTime` is in the future)

In cases (1) and (2) the library allows the operation to run. In case (3) the
library does not run the operation again and returns the stored result. In
case (4) the library waits for the in-flight process to either complete or
time out before making a decision.
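
This decision logic can be sketched as a pure function over the hypothetical
`ProcessRecord` shown earlier (again, the names here are illustrative, not the
library's internal API):

```scala
import java.time.{Duration, Instant}

// The possible outcomes of attempting to start an operation
sealed trait Outcome[+A]
case object Run extends Outcome[Nothing]                       // execute the operation
final case class ReturnStored[A](result: A) extends Outcome[A] // skip, reuse the stored result
case object WaitAndPoll extends Outcome[Nothing]               // poll until completed or timed out

def decide[A](
    record: Option[ProcessRecord[A]],
    maxProcessingTime: Duration,
    now: Instant
): Outcome[A] = record match {
  case None =>
    Run // (1) never processed before
  case Some(ProcessRecord(_, _, _, Some(result), _)) =>
    ReturnStored(result) // (3) already completed
  case Some(r) if r.startedAt.plus(maxProcessingTime).isBefore(now) =>
    Run // (2) the previous attempt timed out
  case _ =>
    WaitAndPoll // (4) still in flight
}
```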

## How to configure it

A `deduplication.Config` is required in order to create an instance of
`Deduplication`. The following parameters are available:

- `maxProcessingTime: FiniteDuration`: The time after which a pending operation
will be considered stale and a new one will be allowed to take over.
- `ttl: Option[FiniteDuration]`: The time after which a successful operation
will be considered expired and allowed to run again. If `None` the operation
will never expire.
- `pollStrategy: Config.PollStrategy`: The delay strategy to use when polling
for the status of a running operation. The `PollStrategy` object provides
helper methods for creating one easily (see the sketch below).
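
As an illustration, a configuration could be built as follows. This assumes a
case-class-like constructor with the parameters listed above, and the
`PollStrategy` helper name is hypothetical; check the library source for the
exact API:

```scala
import scala.concurrent.duration._
import com.ovoenergy.comms.deduplication

val dedupConf = deduplication.Config(
  maxProcessingTime = 5.minutes, // attempts older than this are considered stale
  ttl = Some(30.days),           // completed results expire after 30 days
  pollStrategy = deduplication.Config.PollStrategy.linear(50.millis) // hypothetical helper
)
```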

A `contextId` needs to be assigned to each instance of `DeduplicationContext`.
It uniquely identifies the type of operation being run. If more than one
service instance uses the same `contextId`, the library will
ensure[*](#limitations) that each operation is only executed once across all of
them.

## Limitations

Unsurprisingly, this library doesn't achieve perfect _exactly once_ execution,
but it makes a best effort at it.

Its purpose is to __limit__ duplication as much as possible while making sure
that all operations are executed _at least once_.

You should still make sure your system is resilient to duplicates.