An open API service indexing awesome lists of open source software.

https://github.com/stencila/encoda

↔️ A format converter for Stencila documents
https://github.com/stencila/encoda

converter document nodejs pandoc remark semantic

Last synced: 6 months ago
JSON representation

↔️ A format converter for Stencila documents

Awesome Lists containing this project

README

          

# Encoda

## Announcement

For some time our main focus has been a rewrite of [our Stencila platform](https://github.com/stencila/stencila) (`v2`). We are porting over codecs and other functionality from this repo to the `v2` of our Stencila platform.

This means that we are unable to be so active with contributions to this project. eLife have volunteered to continue development on this repository until such a time as the `v2` of Stencila can meet their needs. They only have need of the codecs to convert from JATS to JSON.

From `v2.0.0` of this project we will only continue to support the conversion from JATS to JSON. If you want to use any other formats then please see if they are available in Stencila or use the [v1.0.3](https://github.com/stencila/encoda/releases/tag/v1.0.3) release.

We have chosen at this time to leave the rest of the README.md as it was before the removal of much of the code. This does mean that the README.md is out of date. We may prioritise updating it in future.

##### Codecs for structured, semantic, composable, and executable documents

[![Build status](https://dev.azure.com/stencila/stencila/_apis/build/status/stencila.encoda?branchName=master)](https://dev.azure.com/stencila/stencila/_build/latest?definitionId=1&branchName=master)
[![Code coverage](https://codecov.io/gh/stencila/encoda/branch/master/graph/badge.svg)](https://codecov.io/gh/stencila/encoda)
[![NPM](https://img.shields.io/npm/v/@stencila/encoda.svg?style=flat)](https://www.npmjs.com/package/@stencila/encoda)
[![Docs](https://img.shields.io/badge/docs-latest-blue.svg)](https://stencila.github.io/encoda/)

## Introduction

> "A codec is a device or computer program for encoding or decoding a digital data stream or signal. Codec is a portmanteau of coder-decoder. - [Wikipedia](https://en.wikipedia.org/wiki/Codec)

Encoda provides a collection of codecs for converting between, and composing together, documents in various formats. The aim is not to achieve perfect lossless conversion between alternative document formats; there are already several tools for that. Instead the focus of Encoda is to use existing tools to encode and compose semantic documents in alternative formats.

## Formats

As far as possible, Encoda piggybacks on top of existing tools for parsing and serializing documents in various formats. It uses [extensions to schema.org](https://github.com/stencila/schema) as the central data model for all documents and for many formats, it simply transforms the data model of the external tool (e.g. [Pandoc types](https://hackage.haskell.org/package/pandoc-types), [SheetJS spreadsheet model](https://github.com/SheetJS/sheetjs/blob/master/types/index.d.ts)) to that schema ("decoding") and back again ("encoding"). In this sense, you can think of Encoda as a [Rosetta Stone](https://en.wikipedia.org/wiki/Rosetta_Stone) with schema.org at it's centre.

> ⚡ Tip: If a codec for your favorite format is missing below, see if there is already an [issue](https://github.com/stencila/encoda/issues) for it and 👍 or comment. If there is no issue regarding the converter you need, feel free to [create one](https://github.com/stencila/encoda/issues/new).

| Format | Codec | Powered by | Status |
| ---------------------------- | ------------- | ---------------------- | ------ |
| **Text** |
| Plain text | [txt] | [`toString`][tostring] | ✔ |
| Markdown | [md] | [Remark] | ✔ |
| LaTex | [latex] | [Pandoc][pandoc-org] | α |
| Microsoft Word | [docx] | [Pandoc][pandoc-org] | β |
| Google Docs | [gdoc] | [`JSON`][json-api] | β |
| Open Document Text | [odt] | [Pandoc][pandoc-org] | α |
| HTML | [html] | [jsdom], [hyperscript] | ✔ |
| JATS XML | [jats] | [xml-js] | ✔ |
| | [jats-pandoc] | [Pandoc][pandoc-org] | β |
| Portable Document Format | [pdf] | [pdf-lib], [Puppeteer] | β |
| **Math** |
| TeX | [tex] | [mathconverter] | ✔ |
| MathML | [mathml] | [MathJax] | ✔ |
| **Visualization** |
| Plotly | [plotly] | [Plotly.js] | ✔ |
| Vega / Vega-Lite | [vega] | [Vega][vega-io] | ✔ |
| **Bibliographic** |
| Citation Style Language JSON | [csl] | [Citation.js] | ✔ |
| BibTeX | [bib] | [Citation.js] | ✔ |
| **Notebooks** |
| Jupyter | [ipynb] | [`JSON`][json-api] | ✔ |
| RMarkdown | [xmd] | [Remark] | ✔ |
| **Spreadsheets** |
| Microsoft Excel | [xlsx] | [SheetJS] | β |
| Open Document Spreadsheet | [ods] | [SheetJS] | β |
| **Tabular data** |
| CSV | [csv] | [SheetJS] | β |
| Tabular Data Package | [tdp] | [datapackage-js] | α |
| **Collections** |
| Filesystem Directory | [dir] | [`fs`][fs] | β |
| **Data interchange, other** |
| JSON | [json] | [`JSON`][json-api] | ✔ |
| JSON-LD | [jsonld] | [jsonld.js] | ✔ |
| JSON5 | [json5] | [json5][json5-org] | ✔ |
| YAML | [yaml] | [js-yaml] | ✔ |
| Pandoc | [pandoc] | [Pandoc][pandoc-org] | ✔ |
| Reproducible PNG | [rpng] | [Puppeteer] | ✔ |
| XML | [xml] | [xml-js] | ✔ |

[bib]: src/codecs/bib
[crossref]: src/codecs/crossref
[csl]: src/codecs/csl
[csv]: src/codecs/csv
[dar]: src/codecs/dar
[dir]: src/codecs/dir
[dmagic]: src/codecs/dmagic
[docx]: src/codecs/docx
[doi]: src/codecs/doi
[elife]: src/codecs/elife
[gdoc]: src/codecs/gdoc
[html]: src/codecs/html
[http]: src/codecs/http
[ipynb]: src/codecs/ipynb
[jats-pandoc]: src/codecs/jats-pandoc
[jats]: src/codecs/jats
[json]: src/codecs/json
[json5]: src/codecs/json5
[jsonld]: src/codecs/jsonld
[latex]: src/codecs/latex
[mathml]: src/codecs/mathml
[md]: src/codecs/md
[ods]: src/codecs/ods
[odt]: src/codecs/odt
[orcid]: src/codecs/orcid
[pandoc]: src/codecs/pandoc
[pdf]: src/codecs/pdf
[plos]: src/codecs/plos
[plotly]: src/codecs/plotly
[rpng]: src/codecs/rpng
[tdp]: src/codecs/tdp
[tex]: src/codecs/tex
[txt]: src/codecs/txt
[vega]: src/codecs/vega
[xlsx]: src/codecs/xlsx
[xmd]: src/codecs/xmd
[xml]: src/codecs/xml
[yaml]: src/codecs/yaml

[citation.js]: https://citation.js.org/
[datapackage-js]: https://github.com/frictionlessdata/datapackage-js
[fs]: https://nodejs.org/api/fs.html
[hyperscript]: https://github.com/hyperhype/hyperscript
[js-yaml]: https://github.com/nodeca/js-yaml#readme
[jsdom]: https://github.com/jsdom/jsdom
[json-api]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON
[json5-org]: https://json5.org/
[jsonld.js]: https://github.com/digitalbazaar/jsonld.js
[mathconverter]: https://github.com/oerpub/mathconverter
[mathjax]: https://www.mathjax.org/
[pandoc-org]: https://pandoc.org/
[pdf-lib]: https://pdf-lib.js.org/
[plotly.js]: https://plotly.com/javascript/
[puppeteer]: https://pptr.dev/
[remark]: https://remark.js.org/
[sheetjs]: https://sheetjs.com
[tostring]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/toString
[vega-io]: https://vega.github.io/vega/
[xml-js]: https://github.com/nashwaan/xml-js#readme

**Key**

- ✗: Not yet implemented
- α: Alpha, initial implementation
- β: Beta, ready for user testing
- ✔: Ready for production use

## Publishers

Several of the codecs in Encoda, deal with fetching content from a particular publisher. For example, to get an eLife article and read it in Markdown:

```bash
stencila convert https://elifesciences.org/articles/45187v2 ye-et-al-2019.md
```

Some of these publisher codecs deal with meta data. e.g.

```bash
stencila convert "Watson and Crick 1953" - --from crossref --to yaml
```

```yaml
type: Article
title: Genetical Implications of the Structure of Deoxyribonucleic Acid
authors:
- familyNames:
- WATSON
givenNames:
- J. D.
type: Person
- familyNames:
- CRICK
givenNames:
- F. H. C.
type: Person
datePublished: '1953,5'
isPartOf:
issueNumber: '4361'
isPartOf:
volumeNumber: '171'
isPartOf:
title: Nature
type: Periodical
type: PublicationVolume
type: PublicationIssue
```

| Source | Codec | Base codec/s | Status | Coverage |
| ---------------------- | ---------- | ------------------------------------ | ------ | ----------------- |
| **General** |
| HTTP | [http] | Based on `Content-Type` or extension | β | ![][http-cov] |
| **`Person`** |
| ORCID | [orcid] | `jsonld` | β | ![][orcid-cov] |
| **`Article` metadata** |
| DOI | [doi] | `csl` | β | ![][doi-cov] |
| Crossref | [crossref] | `jsonld` | β | ![][crossref-cov] |
| **`Article` content** |
| eLife | [elife] | `jats` | β | ![][elife-cov] |
| PLoS | [plos] | `jats` | β | ![][plos-cov] |

## Install

The easiest way to use Encoda is to install the [`stencila` command line tool](https://github.com/stencila/stencila). Encoda powers `stencila convert`, and other commands, in that CLI. However, the version of Encoda in `stencila`, can lag behind the version in this repo. So if you want the latest functionality, install Encoda as a Node.js package:

```bash
npm install @stencila/encoda --global
```

## Use

Encoda is intended to be used primarily as a library for other applications. However, it comes with a simple command line script which allows you to use the `convert` function directly.

### Converting files

```bash
encoda convert notebook.ipynb notebook.docx
```

Encoda will determine the input and output formats based on the file extensions. You can override these using the `--from` and `--to` options. e.g.

```bash
encoda convert notebook.ipynb notebook.xml --to jats
```

You can also convert to more than one file / format (in this case the `--to` argument only applies to the first output file) e.g.

```bash
encoda convert report.docx report.Rmd report.html report.jats
```

### Converting folders

You can decode an entire directory into a `Collection`. Encoda will traverse the directory, including subdirectories, decoding each file matching your glob pattern. You can then encode the `Collection` using the `dir` codec into a tree of HTML files e.g.

```bash
encoda convert myproject myproject-published --to dir --pattern '**/*.{rmd, csv}'
```

### Converting command line input

You can also read content from the first argument. In that case, you'll need to specifying the `--from` format e.g.

```bash
encoda convert "{type: 'Paragraph', content: ['Hello world!']}" --from json5 paragraph.md
```

You can send output to the console by using `-` as the second argument and specifying the `--to` format e.g.

```bash
encoda convert paragraph.md - --to yaml
```

| Option | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--from` | The format of the input content e.g. `--from md` |
| `--to` | The format for the output content e.g. `--to html` |
| `--theme` | The theme for the output (only applies to HTML, PDF and RPNG output) e.g. `--theme eLife`. Either a [Thema theme name](https://github.com/stencila/thema#available-themes) or a path/URL to a directory containing a `styles.css` and a `index.js` file. |
| `--standalone` | Generate a standalone document, not a fragment (default `true`) |
| `--bundle` | Bundle all assets (e.g images, CSS and JS) into the document (default `false`) |
| `--debug` | Print debugging information |

### Using with Executa

Encoda exposes the `decode` and `encode` methods of the [Executa](https://github.com/stencila/executa) API. Register Encoda so that it can be discovered by other executors on your machine,

```bash
npm run register
```

You can then use Encoda as a plugin for Executa that provides additional format conversion capabilities. For example, you can use the `query` REPL on a Markdown document:

```bash
npx executa query CHANGELOG.md --repl
```

You can then use the REPL to explore the structure of the document and do things like create summary documents from it. For example, lets say from some reason we wanted to create a short JATS XML file with the five most recent releases of this package:

```
jmp > %format jats
jmp > %dest latest-releases.jats.xml
jmp > {type: 'Article', content: content[? type==`Heading` && depth==`1`] | [1:5]}
```

Which creates the `latest-major-releases.jats.xml` file:

```xml










0.80.0 (2019-09-30)


...
```

You can query a document in any format supported by Encoda. As another example, lets' fetch a CSV file from Github and get the names of it's columns:

```bash
npx executa query https://gist.githubusercontent.com/jncraton/68beb88e6027d9321373/raw/381dcf8c0d4534d420d2488b9c60b1204c9f4363/starwars.csv --repl
🛈 INFO encoda:http Fetching "https://gist.githubusercontent.com/jncraton/68beb88e6027d9321373/raw/381dcf8c0d4534d420d2488b9c60b1204c9f4363/starwars.csv"
jmp > columns[].name
[
'SetID',
'Number',
'Variant',
'Theme',
'Subtheme',
'Year',
'Name',
'Minifigs',
'Pieces',
'UKPrice',
'USPrice',
'CAPrice',
'EUPrice',
'ImageURL',
'Owned',
'Wanted',
'QtyOwned',
]
jmp >
```

See the `%help` REPL command for more examples.

Note: If you have [`executa`](https://github.com/stencila/executa) installed globally, then the `npx` prefix above is not necessary.

## Documentation

Self-hoisted (documentation converted from various formats to html) and API documentation (generated from source code) is available at: https://stencila.github.io/encoda.

## Develop

Check how to [contribute back to the project](https://github.com/stencila/encoda/blob/master/CONTRIBUTING.md). All PRs are most welcome! Thank you!

Clone the repository and install a development environment:

```bash
git clone https://github.com/stencila/encoda.git
cd encoda
npm install
```

You can manually test conversion using current TypeScript `src` using:

```bash
npm start -- convert simple.md simple.html
```

That can be slow because the TypeScript has to be compiled on the fly (using `ts-node`). Alternatively, compile the TypeScript to JavaScript first, and then run `node` on the `dist` folder:

```bash
npm run build:dist
node dist convert simple.md simple.html
```

If you are using VSCode, you can use the [Auto Attach feature](https://code.visualstudio.com/docs/nodejs/nodejs-debugging#_auto-attach-feature) to attach to the CLI when running the `debug` NPM script:

```bash
npm run debug -- convert simple.gdoc simple.ipynb
```

A simple script to convert JATS to JSON:

```bash
cat simple-jats.xml | npm run convert-jats --silent > simple.json
```

## Testing

### Running tests locally

Run the test suite using:

```bash
npm test
```

Or, run a single test file e.g.

```bash
npx jest tests/xlsx.test.ts --watch
```

To display debug logs during testing set the environment variable `DEBUG=1`, e.g.

```bash
DEBUG=1 npm test
```

To get coverage statistics:

```bash
npm run cover
```

There's also a `Makefile` if you prefer to run tasks that way e.g.

```bash
make lint cover
```

### Running test in Docker

You can also test this package using with a Docker container:

```bash
npm run test:docker
```

## Writing tests

#### Recording and using network fixtures

As far as possible, tests should be able to run with no network access. We use [Nock Back](https://github.com/nock/nock#nock-back) to record and play back network requests and responses. Use the `nockRecord` helper function for this with the convention of starting the fixture file with `nock-record-` e.g.

```ts
const stopRecording = await nockRecord('nock-record-.json')
// Do some things that connect to the interwebs
stopRecording()
```

Note that the HTTP fetcher implements caching so that you may need to remove the cache for the recording of fixtures to work e.g. `rm -rf /tmp/stencila/encoda/cache/`.

If there are changes in the URLs that your test fetches, or you want to check that your test is still works against an external API that may have changed, remove the Nock recording and rerun the test e.g.,

```sh
rm src/codecs/elife/__fixtures__/nock-record-*.json
npx jest src/codecs/elife/ --testTimeout 30000
```

## Contribute

We 💕 contributions! All contributions: ideas 🤔, examples 💡, bug reports 🐛, documentation 📖, code 💻, questions 💬. See [CONTRIBUTING.md](CONTRIBUTING.md) for more on where to start. You can also provide your feedback on the [Community Forum](https://community.stenci.la)
and [Gitter channel](https://gitter.im/stencila/stencila).

## Contributors



Aleksandra Pawlik

💻 📖 🐛

Nokome Bentley

💻 📖 🐛

Jacqueline

📖 🎨

Hamish Mackenzie

💻 📖

Alex Ketch

💻 📖 🎨

Ben Shaw

💻 🐛

Phil Neff

🐛



Raniere Silva

📖

Lorenzo Cangiano

🐛

FAtherden-eLife

🐛 🎨

Giorgio Sironi

👀

Add a contributor...

To add youself, or someone else, to the above list, either,

1. Ask the [@all-contributors bot](https://allcontributors.org/docs/en/bot/overview) to do it for you by commenting on an issue or PR like this:

> @all-contributors please add @octocat for bugs, tests and code

2. Use the [`all-contributors` CLI](https://allcontributors.org/docs/en/cli/overview) to do it yourself:

```bash
npx all-contributors add octocat bugs, tests, code
```

See the list of [contribution types](https://allcontributors.org/docs/en/emoji-key).

## Acknowledgments

Encoda relies on many awesome opens source tools (see `package.json` for the complete list). We are grateful ❤ to their developers and contributors for all their time and energy. In particular, these tools do a lot of the heavy lifting 💪 under the hood.

| Tool | Use |
| ------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ![Ajv](https://ajv.js.org/images/ajv_logo.png) | [Ajv](https://ajv.js.org/) is "the fastest JSON Schema validator for Node.js and browser". Ajv is not only fast, it also has an impressive breadth of functionality. We use Ajv for the `validate()` and `coerce()` functions to ensure that ingested data is valid against the Stencila [schema](https://github.com/stencila/schema). |
| ![Citation.js](https://avatars0.githubusercontent.com/u/41587916?s=200&v=4) | [Citation.js] converts bibliographic formats like BibTeX, BibJSON, DOI, and Wikidata to CSL-JSON. We use it to power the codecs for those formats and APIs. |
| ![Frictionless Data](https://avatars0.githubusercontent.com/u/5912125?s=200&v=4) | [`datapackage-js`](https://github.com/frictionlessdata/datapackage-js) from the team at [Frictionless Data](https://frictionlessdata.io/) is a Javascript library for working with [Data Packages](https://frictionlessdata.io/specs/data-package/). It does a lot of the work in converting between Tabular Data Packages and Stencila Datatables. |
| ![Glitch Digital](https://avatars1.githubusercontent.com/u/16604593?s=200&v=4) | Glitch Digital's [`structured-data-testing-tool`](https://github.com/glitchdigital/structured-data-testing-tool) is a library and command line tool to help inspect and test for Structured Data. We use it to check that the HTML generated by Encoda can be read by bots 🤖 |
| ![Pa11y](https://pa11y.org/resources/brand/logo.svg) | [Pa11y](https://pa11y.org/) provides a range of free and open source tools to help designers and developers make their web pages more accessible. We use [`pa11y`](https://github.com/pa11y/pa11y) to test that HTML generated produced by Encoda meets the [Web Content Accessibility Guidelines (WCAG)](https://www.w3.org/TR/WCAG20/) and [Axe](https://dequeuniversity.com/rules/axe/3.5) rule set. |
| **Pandoc** | [Pandoc] is a "universal document converter". It's able to convert between an impressive number of formats for textual documents. Our [Typescript definitions for Pandoc's AST](https://github.com/stencila/encoda/blob/c400d798e6b54ea9f88972b038489df79e38895b/src/pandoc-types.ts) allow us to leverage this functionality from within Node.js while maintaining type safety. Pandoc powers our converters for Word, JATS and Latex. We have contributed to Pandoc, including developing its [JATS reader](https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/JATS.hs). |
| ![Puppeteer](https://user-images.githubusercontent.com/10379601/29446482-04f7036a-841f-11e7-9872-91d1fc2ea683.png) | [Puppeteer] is a Node library which provides a high-level API to control Chrome. We use it to take screenshots of HTML snippets as part of generating rPNGs and we plan to use it for [generating PDFs](https://github.com/stencila/encoda/issues/53). |
| ![Remark](https://avatars2.githubusercontent.com/u/16309564?s=200&v=4) | [Remark] is an ecosystem of plugins for processing Markdown. It's part of the [unified](https://unifiedjs.github.io/) framework for processing text with syntax trees - a similar approach to Pandoc but in Javascript. We use Remark as our Markdown parser because of it's extensibility. |
| ![SheetJs](https://sheetjs.com/sketch128.png) | [SheetJs] is a Javascript library for parsing and writing various spreadsheet formats. We use their [community edition](https://github.com/sheetjs/js-xlsx) to power converters for CSV, Excel, and Open Document Spreadsheet formats. They also have a [pro version](https://sheetjs.com/pro) if you need extra support and functionality. |

Many thanks ❤ to the [Alfred P. Sloan Foundation](https://sloan.org) and [eLife](https://elifesciences.org) for funding development of this tool.