Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vitalibo/cetus
Cost-effective REST API for dense and slowly changing data
- Host: GitHub
- URL: https://github.com/vitalibo/cetus
- Owner: vitalibo
- Created: 2024-06-22T11:07:59.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-20T14:10:46.000Z (about 2 months ago)
- Last Synced: 2024-09-29T07:01:40.621Z (about 1 month ago)
- Topics: cloudfront, spark
- Language: Python
- Homepage:
- Size: 98.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Cetus
![status](https://github.com/vitalibo/cetus/actions/workflows/ci.yaml/badge.svg)
Cetus is a high-performance, low-latency, globally scalable, and cost-effective REST API for dense and slowly changing data.
The core idea revolves around precomputing responses and leveraging cloud storage (S3) and CDN (CloudFront) to serve
these responses, thus avoiding the need for always-on web servers, databases, and caches.

### Problem Statement
Let's assume we have a dataset that updates once a day, and we want to expose this data via a REST API that can handle
high traffic globally. The traditional approach would be to have a web server that queries the database and caches the
response. This approach has one major drawback: the web server must always be on, even if the data is updated once a
day. This leads to high costs and operational complexity.

### Initial Idea
1. Once a day, we precompute the responses using Apache Spark batch processing and store them in Amazon S3.
The file path in S3 is determined by the request URL, and the file content is the response body.
2. We use Amazon CloudFront to serve these responses globally with low latency and high availability.
Amazon CloudFront caches the response body, keyed by the request URL.
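To make the idea concrete, here is a rough sketch of how the daily job could publish one S3 object per possible request URL. It is illustrative only: the bucket name, key scheme, and record fields are made up and are not taken from this repository.

```python
# Hypothetical sketch of the initial idea: one S3 object per possible request URL.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "cetus-responses"  # made-up bucket name


def publish_response(record: dict) -> None:
    # e.g. GET /users/123/stats is served from s3://cetus-responses/users/123/stats
    key = f"users/{record['user_id']}/stats"
    body = json.dumps({"impressions": record["impressions"], "clicks": record["clicks"]})
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
```

With the bucket behind CloudFront as the origin, each object is served directly as the response body; the sections below describe why this naive layout needed to change.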
#### Challenges

###### Challenge 1: a large number of files in S3
Initially, each possible response was stored in an individual file. This led to a large number of small files, resulting
in longer write times. To address this, we grouped responses into larger files. The content within each file is
organized in a way that allows for efficient indexing. Each response is stored with a fixed length, enabling easy
calculation of the offset for any given dimensions.
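A minimal sketch of this layout, assuming a made-up 256-byte record size and an hour-of-day "file" dimension (neither is the project's actual format):

```python
import json

RECORD_LENGTH = 256  # assumed fixed record size in bytes
HOURS_PER_DAY = 24   # example low-cardinality, dense dimension


def pack(responses: list[dict]) -> bytes:
    """Concatenate responses, each padded with spaces to RECORD_LENGTH bytes."""
    return b"".join(
        json.dumps(r).encode("utf-8").ljust(RECORD_LENGTH, b" ") for r in responses
    )


def offset(day_index: int, hour: int) -> int:
    """Byte offset of the record for (day_index, hour) inside the packed file."""
    return (day_index * HOURS_PER_DAY + hour) * RECORD_LENGTH
```

Because every record occupies the same number of bytes, the offset of a response follows directly from its dimension values, with no index to maintain.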
###### Challenge 2: efficient response lookup

To efficiently look up responses, we need to determine the file and offset for a given request URL. We use a
Lambda@Edge function to parse the request URL and determine the file and offset, then use S3 byte-range requests to
retrieve the response body. The Lambda@Edge function is deployed globally, ensuring low latency for all requests.

### Final Implementation
1. Once a day, Apache Spark batch processing precomputes responses and stores them in S3. Responses are grouped into
larger files to reduce write times. Each response is stored with a fixed length to enable efficient indexing. The
file path in S3 partially reflects the request URL; the remaining parts are encoded within the file content.
2. Amazon CloudFront serves responses globally with low latency and high availability. Before retrieving a response,
CloudFront triggers a Lambda@Edge function to determine the file and offset. The Lambda@Edge function parses the
request URL and calculates the file and offset, then uses S3 byte-range requests to retrieve the response body.
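A condensed sketch of that lookup step is shown below. The bucket name, key layout, record size, and the `lookup` helper are assumptions for illustration; this is not the actual Lambda@Edge handler from this repository.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "cetus-responses"  # made-up bucket name
RECORD_LENGTH = 256         # must match the record size used when the files were written


def lookup(user_id: str, date: str, hour: int) -> str:
    key = f"users/{user_id}/{date}"             # "path" dimensions select the file
    start = hour * RECORD_LENGTH                # "file" dimensions select the slot
    end = start + RECORD_LENGTH - 1             # S3 byte ranges are inclusive
    obj = s3.get_object(Bucket=BUCKET, Key=key, Range=f"bytes={start}-{end}")
    return obj["Body"].read().decode("utf-8").rstrip()  # trim the padding
```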
![Diagram](https://fwtbbmf399.execute-api.us-east-1.amazonaws.com/Prod/svg?source=https://raw.githubusercontent.com/vitalibo/cetus/main/readme.md&name=diagram.svg)

*Sequence diagram: a client request hits CloudFront; on a cache miss, the Lambda@Edge function calculates the response offset in the file, issues an S3 byte-range request, trims any padding, and returns the response to CloudFront. A Glue job writes the precomputed files to S3.*

### Usage
#### Setup
In order to deploy the solution in your AWS account, you need to create a new stage file in the `infrastructure/stages`
directory. Name the file with the name of the environment you want to deploy to. For example, if you want to deploy to
the `dev` environment, create a file named `dev.yml`. After creating the file, you can deploy the solution by running
the following command:

```bash
make clean build && make deploy environment=<environment> profile=<profile>
```

Where `<environment>` is the name of the environment you want to deploy to and `<profile>` is the name of the AWS CLI
profile you want to use. If you don't provide a profile name, the default profile will be used. For example, to deploy
the `dev` stage with the default profile, run `make clean build && make deploy environment=dev`.

#### Generate test data
To generate test data, you can use the following command:
```bash
python3 tests/generate.py --frac 0.1 sample-0.1.csv
```

#### Configure
To modify the final file structure, you can update the `transform` section. The following options are available (see the example after this list):
- `dimensions.path` - list of dimensions that will be included in a path of the final file.
- `dimensions.file` - list of dimensions that will be used to define position offset in the final file.
Here we should use dimensions with low cardinality whose value combinations are densely populated.
- `dimensions.body` - list of dimensions that will be included in the body of the response.
- `metrics` - list of metrics that will be included in the body of the response.
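As an illustration only, a `transform` section could look like the example below; the dimension and metric names are invented, and the exact schema should be checked against the stage files in this repository.

```yaml
transform:
  dimensions:
    path: [country, user_id]       # become part of the S3 file path (request URL)
    file: [date, hour]             # low-cardinality, dense dimensions -> offset within the file
    body: [device]                 # included in the response body
  metrics: [impressions, clicks]   # included in the response body
```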
#### Cleanup

To remove the solution from your AWS account, you need to delete all files in the S3 bucket and then delete the
CloudFormation stack. Because the solution uses a Lambda@Edge function, deleting the stack will fail on the first
attempt. You need to wait for one hour and then delete the stack again.