https://github.com/willshiao/transcript-parser

Parses plaintext speech/debate/radio transcripts into JavaScript objects.

hacktoberfest javascript nodejs npm parse parser transcript transcript-parser

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/willshiao/transcript-parser
Owner: willshiao
License: mit
Created: 2016-04-09T01:53:11.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2020-05-01T05:51:34.000Z (almost 6 years ago)
Last Synced: 2025-04-20T02:57:55.795Z (10 months ago)
Topics: hacktoberfest, javascript, nodejs, npm, parse, parser, transcript, transcript-parser
Language: JavaScript
Homepage:
Size: 182 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE


transcript-parser
=================

[![Build Status](https://travis-ci.org/willshiao/transcript-parser.svg?branch=master)](https://travis-ci.org/willshiao/transcript-parser)

[![Coverage Status](https://coveralls.io/repos/github/willshiao/transcript-parser/badge.svg?branch=master)](https://coveralls.io/github/willshiao/transcript-parser?branch=master)

[![npm](https://img.shields.io/npm/v/transcript-parser.svg?maxAge=2592000)](https://www.npmjs.com/package/transcript-parser)

[![Known Vulnerabilities](https://snyk.io/test/github/willshiao/transcript-parser/badge.svg)](https://snyk.io/test/github/willshiao/transcript-parser)

- [Description](#description)

- [Usage](#usage)

- [Config](#config)

- [Documentation](#documentation)

  * [\.parseStream()](#parsestream)

  * [\.parseOneSync()](#parseonesync)

  * [\.parseOne()](#parseone)

  * [\.resolveAliasesSync()](#resolvealiasessync)

  * [\.resolveAliases()](#resolvealiases)

- [Example](#example)

## Description

Parses plaintext speech/debate/radio transcripts into JavaScript objects. It is still in early development. Pull requests are welcome.

Tests can be run with `npm test` and a benchmark can be run with `npm run benchmark`. For a full coverage report using [Istanbul](https://github.com/gotwarlost/istanbul), run `npm run travis-test`.

Tested for Node.js >= v4.4.6

## Usage

`npm install transcript-parser`

```node
'use strict';

const fs = require('fs');
const TranscriptParser = require('transcript-parser');

const tp = new TranscriptParser();

// Synchronous example
const parsed = tp.parseOneSync(fs.readFileSync('transcript.txt', 'utf8'));
console.log(parsed);

// Asynchronous example (read as UTF-8 so parseOne() receives a string)
fs.readFile('transcript.txt', 'utf8', (err, data) => {
  if(err) return console.error('Error:', err);
  tp.parseOne(data, (err, parsed) => {
    if(err) return console.error('Error:', err);
    console.log(parsed);
  });
});

// Stream example
const stream = fs.createReadStream('transcript.txt', 'utf8');
tp.parseStream(stream, (err, parsed) => {
  if(err) return console.error('Error:', err);
  console.log(parsed);
});
```

## Config

The constructor for `TranscriptParser` accepts a settings object.

- `removeActions`

    + default: `true`

    + Specifies if the parser should remove actions (e.g. `(APPLAUSE)`).

- `removeAnnotations`

    + default: `true`

    + Specifies if the parser should remove annotations (surrounded by `[]`).

- `removeTimestamps`

    + default: `true`

    + **Always treated as `true` when `removeAnnotations` is `true`**, since timestamps are also enclosed in `[]`.

    + Specifies if the parser should remove timestamps (in the `[##:##:##]` format).

- `removeUnknownSpeakers`

    + default: `false`

    + Specifies if the parser should remove lines that have no associated speaker.

    + If `false`, lines that have no associated speaker will be stored under the key `none`.

- `blacklist`

    + default: `[]`

    + A list of speakers (as strings) that the parser should ignore.

- `aliases`

    + default: `{}`

    + An object with the real name as the key and an `Array` of the aliases' regular expressions as the value.

    + Example: `{ "Mr. Robot": [ /[A-Z\ ]*SLATER[A-Z\ ]*/ ] }`

        * Renames all speakers who match the regex to "Mr. Robot".

- `regex` _(>= v0.7.1)_

  + `newLine`

    * default: `/(?:\r?\n)+/` (`RegExp` literal)

    * The regular expression used to match new line separators (CRLF, LF).

    * Should be set to match multiple consecutive separators for the fastest parsing.

    * Example: `/\|/`

      - Uses a single pipe (`|`) symbol to indicate a new line instead of the traditional LF or CRLF.
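The effect of the two `newLine` patterns above can be tried without the parser. The sketch below is only an illustration: it assumes the separator behaves like an argument to `String.prototype.split`, which is not necessarily how the parser applies it internally, and the sample transcripts are made up.

```javascript
// Default separator from the settings list: one or more LF or CRLF breaks.
const defaultNewLine = /(?:\r?\n)+/;
// Alternative separator from the example: a single pipe character.
const pipeNewLine = /\|/;

// Made-up sample transcripts (not from the project).
const crlfTranscript = 'A: Hello.\r\n\r\nB: Hi there.\nA: Bye.';
const pipeTranscript = 'A: Hello.|B: Hi there.|A: Bye.';

// Consecutive line breaks collapse into a single split point,
// so the default pattern produces no empty entries here.
const linesFromDefault = crlfTranscript.split(defaultNewLine);
const linesFromPipe = pipeTranscript.split(pipeNewLine);

console.log(linesFromDefault);
console.log(linesFromPipe);
```

Both calls yield the same three lines, which is why a pattern that swallows repeated separators saves the parser from handling blank lines one at a time.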

Settings can be changed after object creation by changing the corresponding properties of `tp.settings`, where `tp` is an instance of `TranscriptParser`.
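As a rough illustration, a settings object combining several of the options above might look like the following. The keys are taken from the list; the values (including the `MODERATOR` speaker name) are example choices only, and the constructor call is shown in comments because it needs the package to be installed.

```javascript
// Illustrative settings object; the keys come from the Config list,
// the values are example choices rather than recommendations.
const settings = {
  removeActions: false,         // keep actions such as (APPLAUSE)
  removeAnnotations: true,      // strip [bracketed] annotations
  removeUnknownSpeakers: true,  // drop lines with no associated speaker
  blacklist: ['MODERATOR'],     // hypothetical speaker name to ignore
  aliases: {
    // Speakers matching this regex are renamed to "Mr. Robot".
    'Mr. Robot': [/[A-Z\ ]*SLATER[A-Z\ ]*/]
  },
  regex: {
    newLine: /\|/               // treat a pipe as the line separator
  }
};

// With the package installed, the object would be passed to the constructor:
//   const TranscriptParser = require('transcript-parser');
//   const tp = new TranscriptParser(settings);
// and a setting could be changed later through tp.settings:
//   tp.settings.removeActions = true;
```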

## Documentation

### .parseStream()

The `parseStream()` method parses a [`Stream`](https://nodejs.org/api/stream.html) and passes an object representing it to the callback.

This is the preferred method for parsing streams asynchronously as it doesn't have to load the entire transcript into memory (unlike `parseOne()`).

#### Syntax

`tp.parseStream(stream, callback)`

##### Parameters

- `stream`

    + The `Readable` stream to read.

- `callback(err, data)`

    + A callback to be executed on function completion or error.

### .parseOneSync()

The `parseOneSync()` method parses a string and returns an object representing it.

#### Syntax

`tp.parseOneSync(transcript)`

##### Parameters

- `transcript`

    + The transcript, as a `string`.

### .parseOne()

The `parseOne()` method parses a string asynchronously and passes an object representing it to the callback.

#### Syntax

`tp.parseOne(transcript, callback)`

##### Parameters

- `transcript`

    + The transcript, as a `string`.

- `callback(err, data)`

    + A callback to be executed on function completion or error.

### .resolveAliasesSync()

The `resolveAliasesSync()` method resolves all aliases specified in the configuration passed to the `TranscriptParser`'s constructor (see above).

Renames the names in the `order` list to match the new names in the transcript. Note that there is a significant performance penalty, so don't use this method unless you need it.

#### Syntax

`tp.resolveAliasesSync(data)`

##### Parameters

- `data`

    + The transcript object after being parsed.
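To make the renaming concrete, here is a minimal standalone sketch of the idea. `renameSpeakers` is a hypothetical helper written for this illustration, not the library's implementation, and the sample data is made up; it assumes the parsed object has the `speaker` and `order` shape shown in the Example section.

```javascript
// Hypothetical helper, NOT the library's code: renames speakers whose
// names match an alias regex, in both `speaker` and `order`.
function renameSpeakers(data, aliases) {
  // Map a speaker name to its real name, or return it unchanged.
  const resolveName = (name) => {
    for (const realName of Object.keys(aliases)) {
      if (aliases[realName].some((regex) => regex.test(name))) return realName;
    }
    return name;
  };

  const resolved = { speaker: {}, order: data.order.map(resolveName) };
  for (const name of Object.keys(data.speaker)) {
    const target = resolveName(name);
    // Naive merge: lines from several aliases of one speaker are appended
    // per alias, not interleaved by their original position.
    resolved.speaker[target] = (resolved.speaker[target] || []).concat(data.speaker[name]);
  }
  return resolved;
}

// Made-up parsed transcript and the alias setting from the Config section.
const parsedExample = {
  speaker: { 'CHRISTIAN SLATER': ['Hello, friend.'], B: ['Hi.'] },
  order: ['CHRISTIAN SLATER', 'B']
};
const aliasExample = { 'Mr. Robot': [/[A-Z\ ]*SLATER[A-Z\ ]*/] };

const renamed = renameSpeakers(parsedExample, aliasExample);
console.log(renamed.order); // [ 'Mr. Robot', 'B' ]
```

Every speaker name is tested against every alias regex, which gives a sense of where the performance penalty mentioned above could come from.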

 

### .resolveAliases()

The `resolveAliases()` method resolves all aliases specified in the configuration passed to the `TranscriptParser`'s constructor (see above).

Renames the names in the `order` list to match the new names in the transcript. Note that there is a significant performance penalty, so don't use this method unless you need it.

#### Syntax

`tp.resolveAliases(data, callback)`

##### Parameters

- `data`

    + The transcript object after being parsed.

- `callback(err, resolved)`

    + A callback to be executed on function completion or error.

## Example

### Input

```
A: I like Node.js.
A: I also like C#.
B: I like Node.js too!
A: I especially like the Node Package Manager.
```

### Output

```node
{
  speaker: {
    A: [
      'I like Node.js.',
      'I also like C#.',
      'I especially like the Node Package Manager.'
    ],
    B: ['I like Node.js too!']
  },
  order: ['A', 'A', 'B', 'A']
}
```
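The output keeps each speaker's lines grouped under `speaker`, while `order` records who spoke when. As a sketch of how the two fit together, the hypothetical helper below (`toLines`, not part of the library) replays `order` to rebuild the original line sequence. It assumes, as the example suggests, that each speaker's array is stored in spoken order.

```javascript
// Output object copied from the example above.
const output = {
  speaker: {
    A: [
      'I like Node.js.',
      'I also like C#.',
      'I especially like the Node Package Manager.'
    ],
    B: ['I like Node.js too!']
  },
  order: ['A', 'A', 'B', 'A']
};

// Hypothetical helper: walks `order`, taking each speaker's next unused line.
function toLines(parsed) {
  const nextIndex = {}; // per-speaker cursor into parsed.speaker[name]
  return parsed.order.map((name) => {
    const i = nextIndex[name] || 0;
    nextIndex[name] = i + 1;
    return `${name}: ${parsed.speaker[name][i]}`;
  });
}

const rebuilt = toLines(output);
console.log(rebuilt.join('\n'));
```

For this example the rebuilt lines match the four input lines.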
