Serverless Project Streaming and Parsing S3 files
https://github.com/drmikecrowe/serverless-s3-streaming-example
- Host: GitHub
- URL: https://github.com/drmikecrowe/serverless-s3-streaming-example
- Owner: drmikecrowe
- License: mit
- Created: 2019-09-22T12:10:57.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-01-24T00:41:06.000Z (almost 2 years ago)
- Last Synced: 2023-03-10T19:55:34.137Z (almost 2 years ago)
- Language: TypeScript
- Size: 3.18 MB
- Stars: 5
- Watchers: 2
- Forks: 5
- Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE
# Serverless Project Streaming and Parsing S3 files
This repo illustrates how to stream a large file from S3 and split it into separate S3 files, removing the prior output files before the new ones are written.
## Goals
1. Parse a large file without loading the whole file into memory
2. Remove old data when new data arrives
3. Wait for all these secondary streams to finish uploading to S3
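The first goal drives the whole design: the upload is consumed as a stream and parsed row by row, so the full file is never buffered in memory. Below is a minimal, hypothetical sketch of that pattern using the aws-sdk v2 client and the `@fast-csv/parse` package; the repo's actual handler and CSV library may differ.

```typescript
import * as AWS from "aws-sdk";
import { parse } from "@fast-csv/parse";

const s3 = new AWS.S3();

// Stream the uploaded CSV row by row -- the whole file is never held in memory.
export async function countRows(bucket: string, key: string): Promise<number> {
    return new Promise((resolve, reject) => {
        let rows = 0;
        s3.getObject({ Bucket: bucket, Key: key })
            .createReadStream()              // readable stream of the S3 object body
            .pipe(parse({ headers: true }))  // emits one object per CSV row
            .on("error", reject)
            .on("data", () => (rows += 1))
            .on("end", () => resolve(rows));
    });
}
```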
## Managing Complex Timing

* Writing to S3 is slow. You must ensure you wait until the S3 upload is complete
* We can't start writing to S3 **until** all the old files are deleted.
* We don't know how many output files will be created, so we must wait until the input file has finished processing before we start waiting for the outputs to finish
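To make the ordering concrete, here is a hedged sketch (aws-sdk v2, illustrative names, not the repo's `FileHandler`) of clearing an old S3 prefix. An upload into that prefix should only be started after this promise resolves, and every `s3.upload(...).promise()` should be collected so it can be awaited before the function returns.

```typescript
import * as AWS from "aws-sdk";

const s3 = new AWS.S3();

// Delete every object under `prefix`; uploads for that prefix must not start
// until this promise resolves.
export async function clearPrefix(bucket: string, prefix: string): Promise<void> {
    let token: string | undefined;
    do {
        const page = await s3
            .listObjectsV2({ Bucket: bucket, Prefix: prefix, ContinuationToken: token })
            .promise();
        const objects = (page.Contents ?? []).map(o => ({ Key: o.Key! }));
        if (objects.length) {
            await s3.deleteObjects({ Bucket: bucket, Delete: { Objects: objects } }).promise();
        }
        token = page.NextContinuationToken;
    } while (token);
}
```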
## Demonstration Problem Statement

* A school district central computer uploads all the grades for the district for a semester
* The data file has the following headers:
* `School,Semester,Grade,Subject,Class,Student Name,Score`
* Process the uploaded file, splitting it into the following structure:
* Semester/School/Grade
* Create a file called Subject-Class.csv with all the grades for that class
* For this simulation, the central computer can update an entire Semester by uploading a new file. This could work differently depending on the application: for instance, if the central computer could upload the grades for a specific Semester + School, then we could update [this line](https://github.com/drmikecrowe/serverless-s3-streaming-example/blob/master/src/lib/FileHandler.ts#L191) with revised criteria to clear only that block of data.
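Purely as an illustration of that key layout (the field names follow the CSV headers above; none of this is the repo's actual code), the output key and the prefix cleared on a new upload might be derived like this:

```typescript
// Hypothetical row shape matching School,Semester,Grade,Subject,Class,Student Name,Score
interface GradeRow {
    School: string;
    Semester: string;
    Grade: string;
    Subject: string;
    Class: string;
    "Student Name": string;
    Score: string;
}

// Semester/School/Grade/Subject-Class.csv
export function outputKey(row: GradeRow): string {
    return `${row.Semester}/${row.School}/${row.Grade}/${row.Subject}-${row.Class}.csv`;
}

// The prefix cleared when new data arrives; narrowing it to Semester/School would
// implement the "only clear that block of data" variation described above.
export function prefixToClear(row: GradeRow, perSchool = false): string {
    return perSchool ? `${row.Semester}/${row.School}/` : `${row.Semester}/`;
}
```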
Here's the general outline of the demo program flow:

* Open the S3 file as a Stream (`readStream`)
* Create a `csvStream` from the input `readStream`
* Pipe `readStream` to `csvStream`
* While we have New Lines
    * Is this line for a new school (i.e. new CSV file)?
        * Start a PassThru stream (`passThruStream`)
        * Does this line start a new Semester (top-level folder we're replacing) in S3?
            * Start deleting S3 folder
        * Are all files deleted?
            * Use `s3.upload` with `Body`=`passThruStream` to upload the file
    * Write New Line to the `passThruStream`
* Loop thru all `passThruStream` streams and close/end
* Wait for all `passThruStream` streams to finish writing to S3
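Condensing that outline into code, here is a minimal sketch of the PassThru-plus-`s3.upload` pattern with aws-sdk v2; the CSV parsing and folder deletion are omitted, and helper names like `streamFor` and `finish` are illustrative rather than the repo's API.

```typescript
import * as AWS from "aws-sdk";
import { PassThrough } from "stream";

const s3 = new AWS.S3();

// One PassThrough per output file, keyed by S3 key, plus the matching upload promises.
const streams = new Map<string, PassThrough>();
const uploads: Promise<AWS.S3.ManagedUpload.SendData>[] = [];

function streamFor(bucket: string, key: string): PassThrough {
    let stream = streams.get(key);
    if (!stream) {
        stream = new PassThrough();
        // s3.upload starts consuming the stream immediately and resolves once the
        // object is fully written -- this is what we wait on at the end.
        uploads.push(s3.upload({ Bucket: bucket, Key: key, Body: stream }).promise());
        streams.set(key, stream);
    }
    return stream;
}

export function writeRow(bucket: string, key: string, line: string): void {
    streamFor(bucket, key).write(line + "\n");
}

export async function finish(): Promise<void> {
    // Close every PassThrough so each upload sees end-of-stream...
    for (const stream of streams.values()) stream.end();
    // ...then wait for S3 to confirm every upload before returning.
    await Promise.all(uploads);
}
```

The key point matches the last two bullets above: every stream must be ended before `Promise.all(uploads)` can resolve.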
## Environment Variables

```
BUCKET=(your s3 bucket name)
```

## yarn Commands
* `yarn build:test`: Build fake CSV data in `fixtures/`
* `yarn test`: Run a local test outputting files to `/tmp/output` instead of S3
* `yarn deploy:dev`: Run `serverless deploy` with (stage=dev) to deploy function to AWS Lambda
* `yarn deploy:prod`: Run `serverless deploy --stage prod` to deploy function to AWS Lambda
* `yarn logs:dev`: Pull the AWS CloudWatch logs for the latest stage=dev run
* `yarn logs:prod`: Pull the AWS CloudWatch logs for the latest stage=prod run
* `yarn upload:small:tiny`: Upload `fixtures/master-data-tiny.csv` to S3 `${BUCKET}/dev/uploads`
* `yarn upload:small:dev`: Upload `fixtures/master-data-small.csv` to S3 `${BUCKET}/dev/uploads`
* `yarn upload:medium:dev`: Upload `fixtures/master-data-medium.csv` to S3 `${BUCKET}/dev/uploads`
* `yarn upload:large:dev`: Upload `fixtures/master-data-large.csv` to S3 `${BUCKET}/dev/uploads`
* `yarn upload:tiny:prod`: Upload `fixtures/master-data-tiny.csv` to S3 `${BUCKET}/prod/uploads`
* `yarn upload:small:prod`: Upload `fixtures/master-data-small.csv` to S3 `${BUCKET}/prod/uploads`
* `yarn upload:medium:prod`: Upload `fixtures/master-data-medium.csv` to S3 `${BUCKET}/prod/uploads`
* `yarn upload:large:prod`: Upload `fixtures/master-data-large.csv` to S3 `${BUCKET}/prod/uploads`

## Validating S3 files
The following commands will download the processed S3 files and run the same validations as `yarn test`:
* **NOTE**: This assumes you've already run `yarn upload:small:dev`
```bash
mkdir -p /tmp/s3files
aws s3 cp s3://${BUCKET}/dev/processed /tmp/s3files --recursive
ts-node test/fileValidators.ts fixtures/master-data-small.csv /tmp/s3files/
```