Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/juliastring/charsetencodings.jl
Types and Traits for Character Sets, Encodings, and Character Set Encodings
https://github.com/juliastring/charsetencodings.jl
character-set character-set-encodings encoding hacktoberfest unicode unicode-subset
Last synced: about 2 months ago
JSON representation
Types and Traits for Character Sets, Encodings, and Character Set Encodings
- Host: GitHub
- URL: https://github.com/juliastring/charsetencodings.jl
- Owner: JuliaString
- License: mit
- Created: 2018-05-04T16:04:36.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2022-03-08T16:09:02.000Z (almost 3 years ago)
- Last Synced: 2024-11-08T23:05:44.037Z (2 months ago)
- Topics: character-set, character-set-encodings, encoding, hacktoberfest, unicode, unicode-subset
- Language: Julia
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# CharSetEncodings
[pkg-url]: https://github.com/JuliaString/CharSetEncodings.jl.git
[julia-url]: https://github.com/JuliaLang/Julia
[julia-release]:https://img.shields.io/github/release/JuliaLang/julia.svg[release]: https://img.shields.io/github/release/JuliaString/CharSetEncodings.jl.svg
[release-date]: https://img.shields.io/github/release-date/JuliaString/CharSetEncodings.jl.svg[license-img]: http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat
[license-url]: LICENSE.md[gitter-img]: https://badges.gitter.im/Join%20Chat.svg
[gitter-url]: https://gitter.im/JuliaString/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge[checks]: https://img.shields.io/github/checks-status/JuliaString/CharSetEncodings.jl/master
[codecov-url]: https://codecov.io/gh/JuliaString/CharSetEncodings.jl
[codecov-img]: https://codecov.io/gh/JuliaString/CharSetEncodings.jl/branch/master/graph/badge.svg[contrib]: https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat
[![][release]][pkg-url] [![][release-date]][pkg-url] [![][license-img]][license-url] [![contributions welcome][contrib]](https://github.com/JuliaString/CharSetEncodings.jl/issues)
| **Julia Version** | **Unit Tests** | **Coverage** |
|:------------------:|:------------------:|:---------------------:|
| [![][julia-release]][julia-url] | [![][]][] | [![][codecov-img]][codecov-url]
| Julia Latest | [![][checks]][pkg-url] | [![][codecov-img]][codecov-url]## Architecture
This provides the basic types and mode methods for dealing with character sets, encodings,
and character set encodings.## Types
Currently, there are the following types:* `CodeUnitTypes` a Union of the 3 codeunit types (UInt8, UInt16, UInt32) for convenience
* `CharSet` a struct type, which is parameterized by the name of the character set and the type needed to represent a code point
* `Encoding` a struct type, parameterized by the name of the encoding## Built-in Character Sets / Character Set Encodings
* `Binary` For storing non-textual data as a sequence of bytes, 0-0xff* `ASCII` ASCII (Unicode subset, 0-0x7f)
* `Latin` Latin-1 (ISO-8859-1) (Unicode subset, 0-0xff)
* `UCS2` UCS-2 (Unicode subset, 0-0xd7ff, 0xe000-0xffff, BMP only, no surrogates)
* `UTF32` UTF-32 (Full Unicode, 0-0xd7ff, 0xe000-0x10ffff)* `UniPlus` Unvalidated Unicode (i.e. like `String`, can contain invalid codepoints)
* `Text1` Unknown 1-byte character set
* `Text2` Unknown 2-byte character set
* `Text4` Unknown 4-byte character set## Built-in Encodings
* `UTF8Encoding`
* `Native1Byte`
* `Native2Byte`
* `Native4Byte`
* `NativeUTF16`
* `Swapped4Byte`
* `Swapped2Byte`
* `SwappedUTF16`
* `LE2`
* `BE2`
* `LE4`
* `BE4`
* `UTF16LE`
* `UTF16BE`
* `2Byte`
* `4Byte`
* `UTF16`## Built-in CSEs
* `BinaryCSE`, `Text1CSE`, `ASCIICSE`, `LatinCSE`
* `Text2CSE`, `UCS2CSE`
* `Text4CSE`, `UTF32CSE`* `UTF8CSE` `UTF32CharSet`, all valid, using `UTF8Encoding`,
conforming to the Unicode Organization's standard,
i.e. no long encodings, surrogates, or invalid bytes.* `RawUTF8CSE` `UniPlusCharSet`, not validated, using `UTF8Encoding`,
may have invalid sequences, long encodings, encode surrogates and characters
up to `0x7fffffff`* `UTF16CSE` `UTF32CharSet`, all valid, using `UTF16` Encoding (native order),
conforming to the Unicode standard, i.e. no out of order or isolated surrogates.## Internal Unicode subset types
* `_LatinCSE` Indicates has at least 1 character > 0x7f, all <= 0xff
* `_UCS2CSE` Indicates has at least 1 character > 0xff, all <= 0xffff
* `_UTF32CSE` Indicates has at least 1 non-BMP character## API
The `cse` function returns the character set encoding for a string type, string.
Returns `RawUTF8CSE` as a fallback for `AbstractString` (i.e. same as `String`)
The `charset` function returns the character set for a string type, string, character type, or character.
The `encoding` function returns the encoding for a type or string.
The `codeunit` function returns the code unit used for a character set encoding
The `cs"..."` string macro creates a CharSet type with that name
The `enc"..."` string macro creates an Encoding type with that name
The `@cse(cs, enc)` macro creates a character set encoding with the given character set and encodingAlso Exports the helpful constant `Bool` flags `BIG_ENDIAN` and `LITTLE_ENDIAN`