https://github.com/andrefs/node-cetem-publico

A wrapper for CETEMPúblico, an European Portuguese corpus of news extracts from the newspaper Público, with 180 million words tagged automatically using PALAVRAS.
https://github.com/andrefs/node-cetem-publico

Last synced: about 2 months ago
JSON representation

A wrapper for CETEMPúblico, an European Portuguese corpus of news extracts from the newspaper Público, with 180 million words tagged automatically using PALAVRAS.

Host: GitHub
URL: https://github.com/andrefs/node-cetem-publico
Owner: andrefs
Created: 2019-04-23T17:22:51.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2022-12-08T23:26:05.000Z (over 3 years ago)
Last Synced: 2025-06-14T21:03:56.381Z (about 1 year ago)
Language: JavaScript
Homepage: https://www.linguateca.pt/CETEMPublico/
Size: 122 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          
# cetem-publico

A wrapper for CETEMPúblico, an European Portuguese corpus of news extracts from the newspaper Público, with 180 million words tagged automatically using PALAVRAS.

## Installation

```bash

$ npm install cetem-publico

```

This will download this module, but it won't download the corpus file,

and it will fail if you try to use it. Use the

[cp.download](#cpdownload) method to download the corpus file

(12GB).

## Usage

___

**This is still a work in progress, API is subject to change without

warning.**

Do you have suggestions? Send me a message or a pull request on

GitHub!

___

```js

const {CETEMPublico} = require('cetem-publico');

const cp = new CETEMPublico();

// cp.download(); // to download the corpus file

async function procLines(){

  for await (const line of cp.lines()){

    // do something with line

  }

}

async function procTokens(){

  for await (const token of cp.tokens()){

    // do something with token

  }

}

async function procSentences(){

  for await (const sent of cp.sentences()){

    // do something with sent

  }

}

async function procParagraphs(){

  for await (const par of cp.paragraphs()){

    // do something with par

  }

}

async function procExtracts(){

  for await (const ext of cp.extracts()){

    // do something with ext

  }

}

```

## Methods

### new CETEMPublico(file)

### new CETEMPublico(opts)

### new CETEMPublico(file, opts)

* `file`: a string containing the path to a local CETEMPublico file. If not provided, the file will be loaded from `$HOME/.cetem-publico/CETEMPublicoAnotado2019.gz`.

* `opts`: see [Options](#options-todo).

### cp.download()

Download a copy of the CETEMPublico corpus from

https://www.linguateca.pt/CETEMPublico/download/, compresses it using

Gzip and stores it in

`$HOME/.cetem-publico/CETEMPublicoAnotado2019.gz`. If file already

exists, it print a warning message and does nothing.

The whole file is 12GB, so this takes some time.

You can monitor the download progress by listening to the

`dl_progress` event. Example:

```

cp.on('dl_progress', state => {

  ({

    fileName,

    speed,

    percent,

    elapsed,

    remaining,

    transf,

    total

  } = state);

  process.stdout.write(`${fileName}\t${speed}\t${percent}%\t${elapsed}/${remaining}\t${transf}/${total}\r`);

});

Returns a `Promise`.

```

### cp.lines(opts)

Returns an `AsyncGenerator` object where each item is a string

containing a line of the original corpus file.

You can monitor the progress of the corpus reading process by listening to the

`read_progress` event. This is valid for any of the corpus reading

functions (`cp.lines`, `cp.tokens`, `cp.sentences`, `cp.paragraphs` and `cp.extracts`). Example:

```

cp.on('read_progress', state => {

  ({

    speed,

    percent,

    elapsed,

    remaining,

    transf,

    total

  } = state);

  process.stdout.write(`Progress: ${speed}\t${percent}%\t${elapsed}/${remaining}\t${transf}/${total}\r`);

});

```

### cp.tokens(opts)

Returns an `AsyncGenerator` object where each item is a Token object

containing one token from the original corpus file.

### cp.sentences(opts)

Returns an `AsyncGenerator` object where each item is a Sentence

object containing a sentence (`` tag) of the original corpus file.

### cp.paragraphs(opts)

Returns an `AsyncGenerator` object where each item is a Paragraph

object containing a paragraph (`
 tag)` of the original corpus file.

### cp.extracts(opts)

Returns an `AsyncGenerator` object where each item is an Extract

object containing an extract (`` tag) of the original corpus file.

## Events

### dl_progress

Event emitted while downloading the corpus file.

```

cp.on('dl_progress', state => {})

```

`state` is an object containing the following fields:

* `fileName`: name of the file being downloaded (default:

  `CETEMPublicoAnotado2019.gz`)

* `speed`: download speed (in bytes per second)

* `percent`: percentage of the file already downloaded

* `elapsed`: time passed (in seconds)

* `remaining`: time left (in seconds)

* `transf`: total transferred bytes

* `total`: total size of the file (in bytes)

### dl_end

Event emitted when download ends.

### read_progress

Event emitted while processing the corpus file.

```

cp.on('read_progress', state => {})

```

`state` is an object containing the following fields:

* `speed`: read speed (in bytes per second)

* `percent`: percentage of the file already read

* `elapsed`: time passed (in seconds)

* `remaining`: time left (in seconds)

* `transf`: total read bytes

* `total`: total size of the file (in bytes)

### read_end

Event emitted when reading ends.

## Options (TODO)

* `noMWEs`: Omit multi-word expressions

* `simplMWEs`: Simplify MWEs: return their tokens as any other token

* `noTitles`: Omit titles

* `noAuthors`: Omit authors

* `noTitles`: Omit titles

## Classes

### Token

Used to represent the tokens in the original corpus file. In the

format used by CETEMPublico, each token is in an individual line.

#### `new Token(word, info)`

* `word` is the word in the original corpus text

* `info` (all these are optional)

    * `lineNum`: the line number for this token in the original corpus

      file

    * `tokenId`: an ID for this token

    * `section`: the ID of the section the token is in

    * `week`:

    * `lemma`: the lemmatized version of `word`

    * `pos`: the part-of-speech (POS) tag for `word`

    * `other*: an object with all the extra information found in

      CETEMPublico for this token

### MultiWordExpression

CETEMPublico annotates some mult-word expressions using `` tags.

Inside each tag, the tokens which compose the expression, one in each

line. MWEs can have attributes indicating the lemma and the POS tag

for the whole expression.

#### `new MultiWordExpression({lemma, pos}, tokens)`

* `lemma`: the lemma for the multi-word expression

* `pos`: the POS tag for the multi-word expression

* `tokens`: an array of Token objects which make this MWE

### Sentence

In CETEMPublico, a sentence is represented using a `` tag.

Sentences contain a list of tokens (the words in that sentence).

Because some words can form multi-word expressions, inside a

`Sentence` we can find both `Token`s and `MultiWordExpression`s

(which, in turn, have `Token` objects inside).

#### `new Sentence(id, tokens)`

* `id`: an id for the sentence

* `tokens`: an array of tokens and MWEs which form this sentence

### Paragraph

A paragraph, represented in CETEMPublico using the tag `
`.

Paragraphs are composed of a sequence of sentences.

#### `new Paragraph(id, sentences)`

* `id`: an id for the sentence

* `sentences`: an array of sentences which form this paragraph

### Extract

An extract of an news article. Extracts are represented by the tag

`` and contain a sequence of sentences. Optionally, they can also

include a Title and Authors, and the attributes `n` (an id for the

extract), `sec` (the newspaper section it was gathered from) and `sem`

(the week in which it was published).

#### `new Extract({n, sec, sem}, contents)`

* `n`: the number of this extract

* `section`: the section in which the extract was found

* `week`: the week it was published on

* `contents`: an array of Paragraph objects, possibly also including a

  Title and an Authors objects

### Authors

The authors of the article an Extract was gathered from.

#### `new Authors(tokens)`

* `tokens`: an array of `Token` objects, each being an author of the

  article

### Title

The title of the article the Extract belongs to.

#### `new Title(tokens)`

* `tokens`: an array of `Token` objects which make the title

## TODO

* Implement `opts`

* Fix ID in '«' and '»' (these quotation marks don't seem to get

  attributed IDs in the original CETEMPublico)

* Add tests

* Speed up download using `fast-request`?

* Add options to `cp.download`

    * Where to download from

    * Where to download to

* ...

## Acknowledgements

This module only exists thanks to the [Publico](https://www.publico.pt) newspaper and the team responsible for the [CETEMPublico](https://www.linguateca.pt/CETEMPublico/) corpus.

## Bugs and stuff

Open a [GitHub issue](https://github.com/andrefs/node-cetem-publico/issues) or, preferably, send me a pull request.

## License

MIT

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/andrefs/node-cetem-publico

Awesome Lists containing this project

README