https://github.com/Borewit/tokenizer-s3

Amazon S3 tokenizer
Host: GitHub
URL: https://github.com/Borewit/tokenizer-s3
Owner: Borewit
Created: 2020-01-01T18:11:40.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-10-28T15:48:01.000Z (almost 2 years ago)
Last Synced: 2024-10-29T16:41:58.706Z (almost 2 years ago)
Language: TypeScript
Size: 2.56 MB
Stars: 8
Watchers: 3
Forks: 6
Open Issues: 5
Metadata Files:
- Readme: README.md

[![Node.js CI](https://github.com/Borewit/tokenizer-s3/actions/workflows/nodejs-ci.yml/badge.svg?branch=master)](https://github.com/Borewit/tokenizer-s3/actions/workflows/nodejs-ci.yml)

[![CodeQL](https://github.com/Borewit/tokenizer-s3/actions/workflows/github-code-scanning/codeql/badge.svg?branch=master)](https://github.com/Borewit/tokenizer-s3/actions/workflows/github-code-scanning/codeql)

[![NPM version](https://img.shields.io/npm/v/@tokenizer/s3.svg)](https://npmjs.org/package/@tokenizer/s3)

[![npm downloads](https://img.shields.io/npm/dm/@tokenizer/s3.svg)](https://npmcharts.com/compare/@tokenizer/s3,@tokenizer/range,streaming-http-token-reader?start=300)

[![Known Vulnerabilities](https://snyk.io/test/github/Borewit/tokenizer-s3/badge.svg?targetFile=package.json)](https://snyk.io/test/github/Borewit/tokenizer-s3?targetFile=package.json)

# @tokenizer/s3

The tokenizer-s3 module enables seamless integration with [Amazon Web Services (AWS) S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html), allowing you to read and tokenize data from S3 objects in a streaming fashion. This module extends the functionality of the strtok3 tokenizer by providing support for chunked S3 data access.

## Features

- **Streaming Support**: Efficiently read and tokenize data from Amazon S3 objects using streaming, which is ideal for handling large files without loading them entirely into memory.
- **Integration with [strtok3](https://github.com/Borewit/strtok3)**: Works seamlessly with the [strtok3](https://github.com/Borewit/strtok3) tokenizer to process S3 data streams, making it easy to handle various tokenization tasks.
- **Flexible Access**: Provides options to configure S3 access, allowing for customized tokenization workflows based on your specific needs.
- **Promise-Based API**: Utilizes a promise-based API for easy integration into modern asynchronous workflows.

## Installation

```shell
npm install @tokenizer/s3
```

## Sponsor

If you appreciate my work and want to support the development of open-source projects like [music-metadata](https://github.com/Borewit/music-metadata), [file-type](https://github.com/sindresorhus/file-type), and [listFix()](https://github.com/Borewit/listFix), consider becoming a sponsor or making a small contribution.

Your support helps sustain ongoing development and improvements.

[Become a sponsor to Borewit](https://github.com/sponsors/Borewit)


## API Documentation

### `makeChunkedTokenizerFromS3`

Initialize a tokenizer, with support for random access, from an Amazon S3 client for use in extracting metadata from media files.

#### Function Signature

```ts
function makeChunkedTokenizerFromS3(s3: S3Client, objRequest: GetObjectRequest, options?: IS3Options): Promise<IRandomAccessTokenizer>
```

Reads from S3 in chunks, which makes random access possible.

#### Parameters

- `s3` (`S3Client`):
  The S3 client used to make requests to Amazon S3.

  > [!NOTE]
  > To configure AWS client authentication see [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).

- `objRequest` (`GetObjectRequest`):
  The S3 object request containing details about the S3 object to fetch.
  This includes properties like the bucket name and object key.

- `options` (`IS3Options`, optional):
  Optional settings for the tokenizer.

#### Returns

- `Promise<IRandomAccessTokenizer>`:
  A Promise that resolves to an instance of `IRandomAccessTokenizer`.
  This tokenizer can be used to extract metadata from the specified media file in the S3 object.
  It supports [random access](https://en.wikipedia.org/wiki/Random_access) reads.
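
To make the idea of chunked random access more concrete, the sketch below shows how a read at an arbitrary offset could be mapped to a chunk-aligned HTTP `Range` header value. This is an illustration under assumptions, not the module's actual implementation: `chunkAlignedRange`, its parameters, and the chunk size of 1024 bytes are hypothetical names and values chosen for this example.

```typescript
// Hypothetical helper, not part of @tokenizer/s3: computes the chunk-aligned
// HTTP Range header value that covers a read of `length` bytes at `offset`.
function chunkAlignedRange(offset: number, length: number, chunkSize: number, objectSize: number): string {
  if (length <= 0 || offset < 0 || offset >= objectSize) {
    throw new RangeError('Read request is outside the object');
  }
  // Round the start down and the end up to the nearest chunk boundary,
  // without running past the end of the object.
  const first = Math.floor(offset / chunkSize) * chunkSize;
  const endExclusive = Math.min(Math.ceil((offset + length) / chunkSize) * chunkSize, objectSize);
  // HTTP byte ranges are inclusive on both ends.
  return `bytes=${first}-${endExclusive - 1}`;
}

// A 12-byte read at the start of a 5000-byte object fetches the first chunk.
console.log(chunkAlignedRange(0, 12, 1024, 5000)); // bytes=0-1023
// A read near the end fetches only the last, shorter chunk.
console.log(chunkAlignedRange(4990, 10, 1024, 5000)); // bytes=4096-4999
```

Aligning requests to chunk boundaries is one common way to make repeated small reads cacheable; whether and how this module caches chunks is not stated here.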

### `makeStreamingTokenizerFromS3`

Initialize a tokenizer from an Amazon S3 client for use in extracting metadata from media files.

#### Function Signature

```ts
function makeStreamingTokenizerFromS3(s3: S3Client, objRequest: GetObjectRequest): Promise<ITokenizer>
```

Reads from S3 as a stream.

#### Parameters

- `s3` (`S3Client`):
  The S3 client used to make requests to Amazon S3.

  > [!NOTE]
  > To configure AWS client authentication see [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).

- `objRequest` (`GetObjectRequest`):
  The S3 object request containing details about the S3 object to fetch.
  This includes properties like the bucket name and object key.

#### Returns

 

- `Promise<ITokenizer>`:
  A Promise that resolves to an instance of `ITokenizer`.

  This tokenizer can be used to extract metadata from the specified media file in the S3 object.
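
The practical difference from the chunked variant is that a stream is consumed front to back. The sketch below illustrates that constraint with an in-memory stand-in. It assumes that a streaming source can skip forward but cannot return to bytes it has already passed; `ForwardOnlyReader` is a hypothetical class written for this example and is not the `ITokenizer` interface.

```typescript
// Hypothetical in-memory stand-in for a forward-only source.
// It is not the ITokenizer interface; it only models the ordering constraint.
class ForwardOnlyReader {
  private position = 0;
  private readonly data: Uint8Array;

  constructor(data: Uint8Array) {
    this.data = data;
  }

  // Reads up to `length` bytes starting at `offset`.
  // Offsets behind the current position are rejected.
  read(offset: number, length: number): Uint8Array {
    if (offset < this.position) {
      throw new Error(`Cannot seek backwards from ${this.position} to ${offset}`);
    }
    const end = Math.min(offset + length, this.data.length);
    this.position = end;
    return this.data.subarray(offset, end);
  }
}

const reader = new ForwardOnlyReader(Uint8Array.from([10, 20, 30, 40, 50, 60]));
// Skips bytes 0..1 and returns the bytes 30, 40, 50.
console.log(Array.from(reader.read(2, 3)));

try {
  reader.read(0, 2); // bytes 0..1 have already been passed
} catch (error) {
  console.log((error as Error).message);
}
```

If a parser needs to jump back and forth within a file, the random-access tokenizer from `makeChunkedTokenizerFromS3` is likely the better fit.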

## Compatibility

Module: version [0.3.0](https://github.com/Borewit/tokenizer-s3/releases/tag/v0.3.0) migrated from [CommonJS](https://en.wikipedia.org/wiki/CommonJS) to [pure ECMAScript Module (ESM)](https://gist.github.com/sindresorhus/a39789f98801d908bbc7ff3ecc99d99c).

The distributed JavaScript codebase is compliant with the [ECMAScript 2020 (11th Edition)](https://en.wikipedia.org/wiki/ECMAScript_version_history#11th_Edition_%E2%80%93_ECMAScript_2020) standard.

This module requires a [Node.js ≥ 16](https://nodejs.org/en/about/previous-releases) engine.

It can also be used in a browser environment when bundled with a module bundler.

For TypeScript CommonJS backward compatibility, you can use [load-esm](https://github.com/Borewit/load-esm).

## Examples

### Determine S3 file type

Determine the file type (based on its content) of a file stored in Amazon S3:

```js
import { fileTypeFromTokenizer } from 'file-type';
import { fromEnv } from '@aws-sdk/credential-providers';
import { S3Client } from '@aws-sdk/client-s3';
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';

(async () => {
  // Initialize S3 client
  const s3 = new S3Client({
    region: 'eu-west-2',
    credentials: fromEnv(),
  });

  // Initialize S3 tokenizer
  const s3Tokenizer = await makeChunkedTokenizerFromS3(s3, {
    Bucket: 'affectlab',
    Key: '1min_35sec.mp4'
  });

  // Figure out what kind of file it is
  const fileType = await fileTypeFromTokenizer(s3Tokenizer);
  console.log(fileType);
})();
```

See also [example at file-type](https://github.com/sindresorhus/file-type#filetypefromtokenizertokenizer).
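
As a rough illustration of what content-based detection means, the sketch below checks a few leading bytes for the `ftyp` box type that MP4 files typically carry at byte offsets 4 to 7. It is a deliberately simplified example and not how `file-type` is implemented, which performs many more checks; `looksLikeMp4` and the sample bytes are hypothetical.

```typescript
// Simplified illustration only; file-type performs far more thorough checks.
function looksLikeMp4(header: Uint8Array): boolean {
  if (header.length < 8) {
    return false;
  }
  // ISO base media files typically start with a box whose type,
  // stored at byte offsets 4..7, is the ASCII string "ftyp".
  const boxType = Array.from(header.subarray(4, 8), (byte) => String.fromCharCode(byte)).join('');
  return boxType === 'ftyp';
}

// Hypothetical sample: a 4-byte box size followed by "ftyp" and the brand "isom".
const sample = Uint8Array.from([0x00, 0x00, 0x00, 0x18, 0x66, 0x74, 0x79, 0x70, 0x69, 0x73, 0x6f, 0x6d]);
console.log(looksLikeMp4(sample)); // true
```

Because a check like this needs only a small part of the object, reading on demand through a tokenizer can avoid downloading the whole file.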

### Reading audio metadata from Amazon S3

Retrieve audio metadata from an S3 object using [music-metadata](https://github.com/Borewit/music-metadata):

```js
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';
import { S3Client } from '@aws-sdk/client-s3';
import { parseFromTokenizer } from 'music-metadata/lib/core';

/**
 * Retrieve metadata from Amazon S3 object
 * @param s3 S3 client
 * @param objRequest S3 object request
 * @param options `tokenizer-s3` options
 * @return Metadata
 */
async function parseS3Object(s3, objRequest, options) {
  const s3Tokenizer = await makeChunkedTokenizerFromS3(s3, objRequest);
  return parseFromTokenizer(s3Tokenizer, options);
}

(async () => {
  const s3 = new S3Client({});

  const metadata = await parseS3Object(s3, {
    Bucket: 'standing0media',
    Key: '01 Where The Highway Takes Me.mp3'
  });

  console.log(metadata);
})();
```

A module implementation of this example can be found in [@music-metadata/s3](https://github.com/Borewit/music-metadata-s3).

## Dependency graph

![dependency graph](doc/dependency.svg)