Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vitalibo/cetus
Cost-effective REST API for dense and slowly changing data
- Host: GitHub
- URL: https://github.com/vitalibo/cetus
- Owner: vitalibo
- Created: 2024-06-22T11:07:59.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-20T14:10:46.000Z (about 2 months ago)
- Last Synced: 2024-09-29T07:01:40.621Z (about 1 month ago)
- Topics: cloudfront, spark
- Language: Python
- Homepage:
- Size: 98.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# Cetus
![status](https://github.com/vitalibo/cetus/actions/workflows/ci.yaml/badge.svg)
Cetus is a high-performance, low-latency, globally scalable, and cost-effective REST API for dense and slowly changing data.
The core idea revolves around precomputing responses and leveraging cloud storage (S3) and CDN (CloudFront) to serve
these responses, thus avoiding the need for always-on web servers, databases, and caches.

### Problem Statement
Let's assume we have a dataset that updates once a day, and we want to expose this data via a REST API that can handle
high traffic globally. The traditional approach would be to have a web server that queries the database and caches the
response. This approach has one major drawback: the web server must always be on, even if the data is updated once a
day. This leads to high costs and operational complexity.

### Initial Idea
1. Once a day, we precompute the responses using Apache Spark batch processing and store them in Amazon S3.
The file path in S3 is determined by the request URL, and the file content is the response body.
2. We use Amazon CloudFront to serve these responses globally with low latency and high availability.
Amazon CloudFront caches the response body, keyed by the request URL.
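To make the idea concrete, here is a rough sketch of how the daily job could publish one S3 object per possible request URL. It is illustrative only: the bucket name, key scheme, and record fields are made up and are not taken from this repository.

```python
# Hypothetical sketch of the initial idea: one S3 object per possible request URL.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "cetus-responses"  # made-up bucket name


def publish_response(record: dict) -> None:
    # e.g. GET /users/123/stats is served from s3://cetus-responses/users/123/stats
    key = f"users/{record['user_id']}/stats"
    body = json.dumps({"impressions": record["impressions"], "clicks": record["clicks"]})
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
```

With the bucket behind CloudFront as the origin, each object is served directly as the response body; the sections below describe why this naive layout needed to change.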
#### Challenges

###### Challenge 1: a large number of files in S3
Initially, each possible response was stored in an individual file. This led to a large number of small files, resulting
in longer write times. To address this, we grouped responses into larger files. The content within each file is
organized in a way that allows for efficient indexing. Each response is stored with a fixed length, enabling easy
calculation of the offset for any given dimensions.
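A minimal sketch of this layout, assuming a made-up 256-byte record size and an hour-of-day "file" dimension (neither is the project's actual format):

```python
import json

RECORD_LENGTH = 256  # assumed fixed record size in bytes
HOURS_PER_DAY = 24   # example low-cardinality, dense dimension


def pack(responses: list[dict]) -> bytes:
    """Concatenate responses, each padded with spaces to RECORD_LENGTH bytes."""
    return b"".join(
        json.dumps(r).encode("utf-8").ljust(RECORD_LENGTH, b" ") for r in responses
    )


def offset(day_index: int, hour: int) -> int:
    """Byte offset of the record for (day_index, hour) inside the packed file."""
    return (day_index * HOURS_PER_DAY + hour) * RECORD_LENGTH
```

Because every record occupies the same number of bytes, the offset of a response follows directly from its dimension values, with no index to maintain.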
###### Challenge 2: efficient response lookup

To efficiently look up responses, we need to determine the file and offset for a given request URL. We use a
Lambda@Edge function to parse the request URL and determine the file and offset, then use S3 byte-range requests to
retrieve the response body. The Lambda@Edge function is deployed globally, ensuring low latency for all requests.

### Final Implementation
1. Once a day, Apache Spark batch processing precomputes responses and stores them in S3. Responses are grouped into
larger files to reduce write times. Each response is stored with a fixed length to enable efficient indexing. The
file path in S3 partially reflects the request URL; the remaining parts are encoded within the file content.
2. Amazon CloudFront serves responses globally with low latency and high availability. Before retrieving a response,
CloudFront triggers a Lambda@Edge function to determine the file and offset. The Lambda@Edge function parses the
request URL and calculates the file and offset, then uses S3 byte-range requests to retrieve the response body.
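A condensed sketch of that lookup step is shown below. The bucket name, key layout, record size, and the `lookup` helper are assumptions for illustration; this is not the actual Lambda@Edge handler from this repository.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "cetus-responses"  # made-up bucket name
RECORD_LENGTH = 256         # must match the record size used when the files were written


def lookup(user_id: str, date: str, hour: int) -> str:
    key = f"users/{user_id}/{date}"             # "path" dimensions select the file
    start = hour * RECORD_LENGTH                # "file" dimensions select the slot
    end = start + RECORD_LENGTH - 1             # S3 byte ranges are inclusive
    obj = s3.get_object(Bucket=BUCKET, Key=key, Range=f"bytes={start}-{end}")
    return obj["Body"].read().decode("utf-8").rstrip()  # trim the padding
```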
![Diagram](https://fwtbbmf399.execute-api.us-east-1.amazonaws.com/Prod/svg?source=https://raw.githubusercontent.com/vitalibo/cetus/main/readme.md&name=diagram.svg)

*Sequence diagram: a client request hits CloudFront; on a cache miss, the Lambda@Edge function calculates the response offset in the file, issues an S3 byte-range request, trims any padding, and returns the response to CloudFront. A Glue job writes the precomputed files to S3.*

### Usage
#### Setup
In order to deploy the solution in your AWS account, you need to create a new stage file in the `infrastructure/stages`
directory. Name the file with the name of the environment you want to deploy to. For example, if you want to deploy to
the `dev` environment, create a file named `dev.yml`. After creating the file, you can deploy the solution by running
the following command:

```bash
make clean build && make deploy environment=<environment> profile=<profile>
```

Where `<environment>` is the name of the environment you want to deploy to and `<profile>` is the name of the AWS CLI
profile you want to use. If you don't provide a profile name, the default profile will be used. For example, to deploy
the `dev` stage with the default profile, run `make clean build && make deploy environment=dev`.

#### Generate test data
To generate test data, you can use the following command:
```bash
python3 tests/generate.py --frac 0.1 sample-0.1.csv
```

#### Configure
To modify the final file structure, you can update the `transform` section. The following options are available (see the example after this list):
- `dimensions.path` - list of dimensions that will be included in a path of the final file.
- `dimensions.file` - list of dimensions that will be used to define position offset in the final file.
Here we should use dimensions with low cardinality whose value combinations are densely populated.
- `dimensions.body` - list of dimensions that will be included in the body of the response.
- `metrics` - list of metrics that will be included in the body of the response.
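As an illustration only, a `transform` section could look like the example below; the dimension and metric names are invented, and the exact schema should be checked against the stage files in this repository.

```yaml
transform:
  dimensions:
    path: [country, user_id]       # become part of the S3 file path (request URL)
    file: [date, hour]             # low-cardinality, dense dimensions -> offset within the file
    body: [device]                 # included in the response body
  metrics: [impressions, clicks]   # included in the response body
```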
#### Cleanup

To remove the solution from your AWS account, you need to delete all files in the S3 bucket and then delete the
CloudFormation stack. Because the solution uses a Lambda@Edge function, deleting the stack will fail on the first
attempt. You need to wait for one hour and then delete the stack again.