{"id":19557343,"url":"https://github.com/andrefs/node-cetem-publico","last_synced_at":"2026-04-30T06:31:05.153Z","repository":{"id":36178021,"uuid":"183063190","full_name":"andrefs/node-cetem-publico","owner":"andrefs","description":"A wrapper for CETEMPúblico, an European Portuguese corpus of news extracts from the newspaper Público, with 180 million words tagged automatically using PALAVRAS.","archived":false,"fork":false,"pushed_at":"2022-12-08T23:26:05.000Z","size":125,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-14T21:03:56.381Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.linguateca.pt/CETEMPublico/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andrefs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-23T17:22:51.000Z","updated_at":"2020-06-08T12:18:10.000Z","dependencies_parsed_at":"2023-01-16T23:11:18.137Z","dependency_job_id":null,"html_url":"https://github.com/andrefs/node-cetem-publico","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/andrefs/node-cetem-publico","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrefs%2Fnode-cetem-publico","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrefs%2Fnode-cetem-publico/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrefs%2Fnode-cetem-publico/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrefs%2Fnode-cetem-publico/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andrefs","download_url":"https://codeload.github.com/andrefs/node-cetem-publico/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrefs%2Fnode-cetem-publico/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261386722,"owners_count":23150869,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T04:41:51.154Z","updated_at":"2026-04-30T06:31:05.094Z","avatar_url":"https://github.com/andrefs.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# cetem-publico\n\nA wrapper for CETEMPúblico, a European Portuguese corpus of news extracts from the newspaper Público, with 180 million words tagged automatically using PALAVRAS.\n\n## Installation\n\n```bash\n$ npm install cetem-publico\n```\n\nThis installs the module, but it does not download the corpus file,\nand the module will fail if you try to use it without one. Use the\n[cp.download](#cpdownload) method to download the corpus file\n(12GB).\n\n## Usage\n\n___\n\n**This is still a work in progress; the API is subject to change without\nwarning.**\n\nDo you have suggestions? 
Send me a message or a pull request on\nGitHub!\n___\n\n\n```js\nconst {CETEMPublico} = require('cetem-publico');\nconst cp = new CETEMPublico();\n\n// cp.download(); // to download the corpus file\n\nasync function procLines(){\n  for await (const line of cp.lines()){\n    // do something with line\n  }\n}\n\nasync function procTokens(){\n  for await (const token of cp.tokens()){\n    // do something with token\n  }\n}\n\nasync function procSentences(){\n  for await (const sent of cp.sentences()){\n    // do something with sent\n  }\n}\n\nasync function procParagraphs(){\n  for await (const par of cp.paragraphs()){\n    // do something with par\n  }\n}\n\nasync function procExtracts(){\n  for await (const ext of cp.extracts()){\n    // do something with ext\n  }\n}\n\n```\n\n## Methods\n\n### new CETEMPublico(file)\n### new CETEMPublico(opts)\n### new CETEMPublico(file, opts)\n\n* `file`: a string containing the path to a local CETEMPublico file. If not provided, the file will be loaded from `$HOME/.cetem-publico/CETEMPublicoAnotado2019.gz`.\n* `opts`: see [Options](#options-todo).\n\n### cp.download()\n\nDownloads a copy of the CETEMPublico corpus from\nhttps://www.linguateca.pt/CETEMPublico/download/, compresses it using\nGzip and stores it in\n`$HOME/.cetem-publico/CETEMPublicoAnotado2019.gz`. If the file already\nexists, it prints a warning message and does nothing.\n\nThe whole file is 12GB, so this takes some time.\n\nYou can monitor the download progress by listening to the\n`dl_progress` event. 
Example:\n\n```\ncp.on('dl_progress', state =\u003e {\n  const {\n    fileName,\n    speed,\n    percent,\n    elapsed,\n    remaining,\n    transf,\n    total\n  } = state;\n\n  process.stdout.write(`${fileName}\\t${speed}\\t${percent}%\\t${elapsed}/${remaining}\\t${transf}/${total}\\r`);\n});\n```\n\nReturns a `Promise`.\n\n### cp.lines(opts)\n\nReturns an `AsyncGenerator` object where each item is a string\ncontaining a line of the original corpus file.\n\nYou can monitor the progress of the corpus reading process by listening to the\n`read_progress` event. This is valid for any of the corpus reading\nfunctions (`cp.lines`, `cp.tokens`, `cp.sentences`, `cp.paragraphs` and `cp.extracts`). Example:\n\n```\ncp.on('read_progress', state =\u003e {\n  const {\n    speed,\n    percent,\n    elapsed,\n    remaining,\n    transf,\n    total\n  } = state;\n\n  process.stdout.write(`Progress: ${speed}\\t${percent}%\\t${elapsed}/${remaining}\\t${transf}/${total}\\r`);\n});\n```\n\n### cp.tokens(opts)\n\nReturns an `AsyncGenerator` object where each item is a Token object\ncontaining one token from the original corpus file.\n\n### cp.sentences(opts)\n\nReturns an `AsyncGenerator` object where each item is a Sentence\nobject containing a sentence (`\u003cs\u003e` tag) of the original corpus file.\n\n### cp.paragraphs(opts)\n\nReturns an `AsyncGenerator` object where each item is a Paragraph\nobject containing a paragraph (`\u003cp\u003e` tag) of the original corpus file.\n\n### cp.extracts(opts)\n\nReturns an `AsyncGenerator` object where each item is an Extract\nobject containing an extract (`\u003cext\u003e` tag) of the original corpus file.\n\n\n## Events\n\n### dl_progress\n\nEvent emitted while downloading the corpus file.\n\n```\ncp.on('dl_progress', state =\u003e {})\n```\n\n`state` is an object containing the following fields:\n\n* `fileName`: name of the file being downloaded (default:\n  `CETEMPublicoAnotado2019.gz`)\n* `speed`: download speed (in bytes per second)\n* 
`percent`: percentage of the file already downloaded\n* `elapsed`: time passed (in seconds)\n* `remaining`: time left (in seconds)\n* `transf`: total transferred bytes\n* `total`: total size of the file (in bytes)\n\n### dl_end\n\nEvent emitted when download ends.\n\n### read_progress\n\nEvent emitted while processing the corpus file.\n\n```\ncp.on('read_progress', state =\u003e {})\n```\n\n`state` is an object containing the following fields:\n\n* `speed`: read speed (in bytes per second)\n* `percent`: percentage of the file already read\n* `elapsed`: time passed (in seconds)\n* `remaining`: time left (in seconds)\n* `transf`: total read bytes\n* `total`: total size of the file (in bytes)\n\n### read_end\n\nEvent emitted when reading ends.\n\n## Options (TODO)\n\n* `noMWEs`: Omit multi-word expressions\n* `simplMWEs`: Simplify MWEs: return their tokens like any other token\n* `noTitles`: Omit titles\n* `noAuthors`: Omit authors\n\n## Classes\n\n### Token\nUsed to represent the tokens in the original corpus file. In the\nformat used by CETEMPublico, each token is on an individual line.\n\n#### `new Token(word, info)`\n\n* `word` is the word in the original corpus text\n* `info` (all these are optional)\n    * `lineNum`: the line number for this token in the original corpus\n      file\n    * `tokenId`: an ID for this token\n    * `section`: the ID of the section the token is in\n    * `week`:\n    * `lemma`: the lemmatized version of `word`\n    * `pos`: the part-of-speech (POS) tag for `word`\n    * `other`: an object with all the extra information found in\n      CETEMPublico for this token\n\n### MultiWordExpression\n\nCETEMPublico annotates some multi-word expressions using `\u003cmwe\u003e` tags.\nInside each tag are the tokens which compose the expression, one on each\nline. 
MWEs can have attributes indicating the lemma and the POS tag\nfor the whole expression.\n\n#### `new MultiWordExpression({lemma, pos}, tokens)`\n\n* `lemma`: the lemma for the multi-word expression\n* `pos`: the POS tag for the multi-word expression\n* `tokens`: an array of Token objects which make up this MWE\n\n### Sentence\n\nIn CETEMPublico, a sentence is represented using a `\u003cs\u003e` tag.\nSentences contain a list of tokens (the words in that sentence).\nBecause some words can form multi-word expressions, inside a\n`Sentence` we can find both `Token`s and `MultiWordExpression`s\n(which, in turn, have `Token` objects inside).\n\n#### `new Sentence(id, tokens)`\n\n* `id`: an id for the sentence\n* `tokens`: an array of tokens and MWEs which form this sentence\n\n### Paragraph\nA paragraph, represented in CETEMPublico using the tag `\u003cp\u003e`.\nParagraphs are composed of a sequence of sentences.\n\n#### `new Paragraph(id, sentences)`\n\n* `id`: an id for the paragraph\n* `sentences`: an array of sentences which form this paragraph\n\n### Extract\n\nAn extract of a news article. Extracts are represented by the tag\n`\u003cext\u003e` and contain a sequence of sentences. 
Optionally, they can also\ninclude a Title and Authors, and the attributes `n` (an id for the\nextract), `sec` (the newspaper section it was gathered from) and `sem`\n(the week in which it was published).\n\n#### `new Extract({n, sec, sem}, contents)`\n\n* `n`: the number of this extract\n* `sec`: the section in which the extract was found\n* `sem`: the week in which it was published\n* `contents`: an array of Paragraph objects, possibly also including a\n  Title and an Authors object\n\n### Authors\n\nThe authors of the article an Extract was gathered from.\n\n#### `new Authors(tokens)`\n\n* `tokens`: an array of `Token` objects, each being an author of the\n  article\n\n### Title\n\nThe title of the article the Extract belongs to.\n\n#### `new Title(tokens)`\n\n* `tokens`: an array of `Token` objects which make up the title\n\n\n## TODO\n\n* Implement `opts`\n* Fix ID in '«' and '»' (these quotation marks don't seem to get\n  attributed IDs in the original CETEMPublico)\n* Add tests\n* Speed up download using `fast-request`?\n* Add options to `cp.download`\n    * Where to download from\n    * Where to download to\n* ...\n\n## Acknowledgements\n\nThis module only exists thanks to the [Publico](https://www.publico.pt) newspaper and the team responsible for the [CETEMPublico](https://www.linguateca.pt/CETEMPublico/) corpus.\n\n## Bugs and stuff\nOpen a [GitHub issue](https://github.com/andrefs/node-cetem-publico/issues) or, preferably, send me a pull request.\n\n## License\n\nMIT\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrefs%2Fnode-cetem-publico","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrefs%2Fnode-cetem-publico","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrefs%2Fnode-cetem-publico/lists"}