Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mooz/node-icu-charset-detector

A simple binding of ICU character set detection for Node.js
https://github.com/mooz/node-icu-charset-detector

Last synced: 2 days ago
JSON representation

A simple binding of ICU character set detection for Node.js

Host: GitHub
URL: https://github.com/mooz/node-icu-charset-detector
Owner: mooz
License: mit
Created: 2012-05-17T06:13:32.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2017-08-17T08:33:16.000Z (over 7 years ago)
Last Synced: 2024-12-23T10:40:08.931Z (8 days ago)
Language: C++
Size: 26.4 KB
Stars: 86
Watchers: 6
Forks: 22
Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # ICU Character Set Detection for Node.js

>Character set detection is the process of determining the character set, or encoding, of character data in an unknown format.

A simple binding of ICU character set detection (http://userguide.icu-project.org/conversion/detection) for Node.js.

## Installation

At first, install `libicu` into your system (See [this instruction](#installing-icu) for details).

After that, install `node-icu-charset-detector` from npm.

    npm install node-icu-charset-detector

### Installing ICU

#### Linux

* Debian (Ubuntu)

    apt-get install libicu-dev

* Gentoo

    emerge icu

* Fedora/CentOS

    yum install libicu-devel

#### OSX

* MacPorts

    port install icu +devel

* Homebrew

```shell

brew install icu4c

brew link icu4c --force

```

If experiencing issues with 'homebrew' installing version 50.1 of icu4c, try the following:

```shell

brew search icu4c

brew tap homebrew/versions

brew versions icu4c

cd $(brew --prefix) && git pull --rebase

git checkout c25fd2f $(brew --prefix)/Library/Formula/icu4c.rb

brew install icu4c

```

* From source

```shell

curl -O http://download.icu-project.org/files/icu4c/52.1/icu4c-52_1-src.tgz

tar xzvf icu4c-4_4_2-src.tgz

cd icu/source

chmod +x runConfigureICU configure install-sh

./runConfigureICU MacOSX

make

sudo make install

xcode-select --install

```

## Usage

### Simple usage

`node-icu-charset-detector` provides a function `detectCharset(buffer)`, where `buffer` is an instance of `Buffer` whose charset should be detected.

```javascript

var charsetDetector = require("node-icu-charset-detector");

var buffer = fs.readFileSync("/path/to/the/file");

var charset = charsetDetector.detectCharset(buffer);

console.log("charset name: " + charset.toString());

console.log("language: " + charset.language);

console.log("detection confidence: " + charset.confidence);

```

`detectCharset(buffer)` returns the detected charset name for `buffer`, and the returned charset name has two extra properties `language` and `confidence`:

- `charset.language`

  - language name for the detected character set.

- `charset.confidence`

  - confidence of the charset detection for `charset`.

### Leveraging node-iconv

Since ICU itself does not have a feature to convert character sets, you may need to use `node-iconv` (https://github.com/bnoordhuis/node-iconv), which has a powerful character sets converting feature.

Here is a simple example to leverage `node-iconv` to convert character sets not supported by Node itself.

```javascript

function bufferToString(buffer) {

  var charsetDetector = require("node-icu-charset-detector");

  var charset = charsetDetector.detectCharset(buffer).toString();

  try {

    return buffer.toString(charset);

  } catch (x) {

    var Iconv = require("iconv").Iconv;

    var charsetConverter = new Iconv(charset, "utf8");

    return charsetConverter.convert(buffer).toString();

  }

}

var buffer = fs.readFileSync("/path/to/the/file");

var bufferString = bufferToString(buffer);

```