{"id":17832418,"url":"https://github.com/jaynetics/character_set","last_synced_at":"2025-07-19T13:33:48.416Z","repository":{"id":62355200,"uuid":"135935201","full_name":"jaynetics/character_set","owner":"jaynetics","description":"A C-extended Ruby gem to efficiently work with sets of Unicode codepoints.","archived":false,"fork":false,"pushed_at":"2024-01-10T20:00:39.000Z","size":257,"stargazers_count":8,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-05-11T06:41:59.035Z","etag":null,"topics":["characterset","codepoints","performance","ruby","string-manipulation","unicode"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaynetics.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-03T19:46:19.000Z","updated_at":"2024-06-19T01:40:02.428Z","dependencies_parsed_at":"2024-06-19T01:39:58.316Z","dependency_job_id":"b9cdd12b-c888-4248-9c35-3c89115a3018","html_url":"https://github.com/jaynetics/character_set","commit_stats":{"total_commits":123,"total_committers":2,"mean_commits":61.5,"dds":0.008130081300813052,"last_synced_commit":"a626e24769dd9d232a4f8d243b4c8e8ee33a6a1e"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaynetics%2Fcharacter_set","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaynetics%2Fcharacter_set/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaynetics%2Fcharacter_set/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaynetics%2Fcharacter_set/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaynetics","download_url":"https://codeload.github.com/jaynetics/character_set/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243982182,"owners_count":20378605,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["characterset","codepoints","performance","ruby","string-manipulation","unicode"],"created_at":"2024-10-27T19:56:49.733Z","updated_at":"2025-03-19T10:30:55.111Z","avatar_url":"https://github.com/jaynetics.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CharacterSet\n\n[![Gem Version](https://badge.fury.io/rb/character_set.svg)](http://badge.fury.io/rb/character_set)\n[![Build Status](https://github.com/jaynetics/character_set/workflows/tests/badge.svg)](https://github.com/jaynetics/character_set/actions)\n[![Build Status](https://github.com/jaynetics/character_set/workflows/gouteur/badge.svg)](https://github.com/jaynetics/character_set/actions)\n[![Coverage](https://codecov.io/gh/jaynetics/character_set/branch/main/graph/badge.svg?token=oY7gcWNbIN)](https://codecov.io/gh/jaynetics/character_set)\n\nThis is a C-extended Ruby gem to work with sets of Unicode codepoints.\n\nIt can [read](#parseinitialize) and [write](#write) sets of codepoints in various formats and it implements the stdlib `Set` interface for them.\n\nIt also offers a [way of scrubbing and scanning characters in Strings](#interact-with-strings) that is more semantic and consistently offers better performance than `Regexp` and `String` methods from the stdlib for this (see [benchmarks](./BENCHMARK.md)).\n\nMany parts can be used independently, e.g.:\n- `CharacterSet::Character`\n- `CharacterSet::ExpressionConverter`\n- `CharacterSet::Parser`\n- `CharacterSet::Writer`\n\n## Usage\n\n### Usage examples\n\n```ruby\nCharacterSet.url_query.cover?('?a=(b$c;)') # =\u003e true\n\nCharacterSet.non_ascii.delete_in!(string)\n\nCharacterSet.emoji.sample(5) # =\u003e [\"⛷\", \"👈\", \"🌞\", \"♑\", \"⛈\"]\n```\n\n### Parse/Initialize\n\nThese all produce a `CharacterSet` containing `a`, `b` and `c`:\n\n```ruby\nCharacterSet['a', 'b', 'c']\nCharacterSet[97, 98, 99]\nCharacterSet.new('a'..'c')\nCharacterSet.new(0x61..0x63)\nCharacterSet.of('abacababa')\nCharacterSet.parse('[a-c]')\nCharacterSet.parse('\\U00000061-\\U00000063')\n```\n\nIf the gems [`regexp_parser`](https://github.com/ammar/regexp_parser) and [`regexp_property_values`](https://github.com/jaynetics/regexp_property_values) are installed, `Regexp` instances and unicode property names can also be read.\n\n```ruby\nCharacterSet.of(/./) # =\u003e #\u003cCharacterSet (size: 1112064)\u003e\nCharacterSet.of_property('Thai') # =\u003e #\u003cCharacterSet (size: 86)\u003e\n\nrequire 'character_set/core_ext/regexp_ext'\n\n/[\\D\u0026\u0026[:ascii:]\u0026\u0026\\p{emoji}]/.character_set.size # =\u003e 2\n```\n\n### Predefined utility sets\n\n`ascii`, `ascii_alnum`, `ascii_letter`, `assigned`, `bmp`, `crypt`, `emoji`, `newline`, `surrogate`, `unicode`, `url_fragment`, `url_host`, `url_path`, `url_query`, `whitespace`\n\n```ruby\nCharacterSet.ascii # =\u003e #\u003cCharacterSet (size: 128)\u003e\n\n# all can be prefixed with `non_`, e.g.\nCharacterSet.non_ascii\n```\n\n### Interact with Strings\n\n`CharacterSet` can replace some types of `String` handling with better performance than the stdlib.\n\n`#used_by?` and `#cover?` can replace some `Regexp#match?` calls:\n\n```ruby\nCharacterSet.ascii.used_by?('Tüür') # =\u003e true\nCharacterSet.ascii.cover?('Tüür') # =\u003e false\nCharacterSet.ascii.cover?('Tr') # =\u003e true\n```\n\n`#delete_in(!)` and `#keep_in(!)` can replace `String#gsub(!)` and the like:\n\n```ruby\nstring = 'Tüür'\n\nCharacterSet.ascii.delete_in(string) # =\u003e 'üü'\nCharacterSet.ascii.keep_in(string) # =\u003e 'Tr'\nstring # =\u003e 'Tüür'\n\nCharacterSet.ascii.delete_in!(string) # =\u003e 'üü'\nstring # =\u003e 'üü'\nCharacterSet.ascii.keep_in!(string) # =\u003e ''\nstring # =\u003e ''\n```\n\n`#count_in` and `#scan` can replace `String#count` and `String#scan`:\n\n```ruby\nCharacterSet.non_ascii.count_in('Tüür') # =\u003e 2\nCharacterSet.non_ascii.scan('Tüür') # =\u003e ['ü', 'ü']\n```\n\nThere is also a core extension for String interaction.\n```ruby\nrequire 'character_set/core_ext/string_ext'\n\n\"a\\rb\".character_set \u0026 CharacterSet.newline # =\u003e CharacterSet[\"\\r\"]\n\"a\\rb\".uses_character_set?(CharacterSet['ä', 'ö', 'ü']) # =\u003e false\n\"a\\rb\".covered_by_character_set?(CharacterSet.newline) # =\u003e false\n\n# predefined sets can also be referenced via Symbols\n\"a\\rb\".covered_by_character_set?(:ascii) # =\u003e true\n\"a\\rb\".delete_character_set(:newline) # =\u003e 'ab'\n# etc.\n```\n\n### Manipulate\n\nUse [any Ruby Set method](https://ruby-doc.org/stdlib-2.5.1/libdoc/set/rdoc/Set.html), e.g. `#+`, `#-`, `#\u0026`, `#^`, `#intersect?`, `#\u003c`, `#\u003e` etc. to interact with other sets. Use `#add`, `#delete`, `#include?` etc. to change or check for members.\n\nWhere appropriate, methods take both chars and codepoints, e.g.:\n\n```ruby\nCharacterSet['a'].add('b') # =\u003e CharacterSet['a', 'b']\nCharacterSet['a'].add(98) # =\u003e CharacterSet['a', 'b']\nCharacterSet['a'].include?('a') # =\u003e true\nCharacterSet['a'].include?(0x61) # =\u003e true\n```\n\n`#inversion` can be used to create a `CharacterSet` with all valid Unicode codepoints that are not in the current set:\n\n```ruby\nnon_a = CharacterSet['a'].inversion\n# =\u003e #\u003cCharacterSet (size: 1112063)\u003e\n\nnon_a.include?('a') # =\u003e false\nnon_a.include?('ü') # =\u003e true\n\n# surrogate pair halves are not included by default\nCharacterSet['a'].inversion(include_surrogates: true)\n# =\u003e #\u003cCharacterSet (size: 1114112)\u003e\n```\n\n`#case_insensitive` can be used to create a `CharacterSet` where upper/lower case codepoints are supplemented:\n\n```ruby\nCharacterSet['1', 'A'].case_insensitive # =\u003e CharacterSet['1', 'A', 'a']\n```\n\n### Write\n\n```ruby\nset = CharacterSet['a', 'b', 'c', 'j', '-']\n\n# safely printable ASCII chars are not escaped by default\nset.to_s # =\u003e 'a-cj\\x2D'\nset.to_s(escape_all: true) # =\u003e '\\x61-\\x63\\x6A\\x2D'\n\n# brackets may be added\nset.to_s(in_brackets: true) # =\u003e '[a-cj\\x2D]'\n\n# the default escape format is Ruby/ES6 compatible, others are available\nset = CharacterSet['a', 'b', 'c', 'ɘ', '🤩']\nset.to_s # =\u003e 'a-c\\u0258\\u{1F929}'\nset.to_s(format: 'U+') # =\u003e 'a-cU+0258U+1F929'\nset.to_s(format: 'Python') # =\u003e \"a-c\\u0258\\U0001F929\"\nset.to_s(format: 'raw') # =\u003e 'a-cɘ🤩'\n\n# or pass a block\nset.to_s { |char| \"[#{char.codepoint}]\" } # =\u003e \"a-c[600][129321]\"\nset.to_s(escape_all: true) { |c| \"\u003c#{c.hex}\u003e\" } # =\u003e \"\u003c61\u003e-\u003c63\u003e\u003c258\u003e\u003c1F929\u003e\"\n\n# disable abbreviation (grouping of codepoints in ranges)\nset.to_s(abbreviate: false) # =\u003e \"abc\\u0258\\u{1F929}\"\n\n# astral members require some trickery if we want to target environments\n# that are based on UTF-16 or \"UCS-2 with surrogates\", such as JavaScript.\nset = CharacterSet['a', 'b', '🤩', '🤪', '🤫']\n\n# Use #to_s_with_surrogate_ranges e.g. for JavaScript:\nset.to_s_with_surrogate_ranges\n# =\u003e '(?:[ab]|\\uD83E[\\uDD29-\\uDD2B])'\n\n# Or use #to_s_with_surrogate_alternation if such surrogate set pairs\n# don't work in your target environment:\nset.to_s_with_surrogate_alternation\n# =\u003e '(?:[ab]|\\uD83E\\uDD29|\\uD83E\\uDD2A|\\uD83E\\uDD2B)'\n```\n\n### Other features\n\n#### Secure tokens\n\nGenerate secure random strings of characters from a set:\n\n```ruby\nCharacterSet.new('a'..'z').secure_token(8) # =\u003e \"ugwpujmt\"\nCharacterSet.crypt.secure_token # =\u003e \"8.1w7aBT737/pMfcMoO4y2y8/=0xtmo:\"\n```\n\n#### Unicode planes\n\nThere are some methods to check for planes and to handle ASCII, [BMP](https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane) and astral parts:\n```Ruby\nCharacterSet['a', 'ü', '🤩'].ascii_part # =\u003e CharacterSet['a']\nCharacterSet['a', 'ü', '🤩'].ascii_part? # =\u003e true\nCharacterSet['a', 'ü', '🤩'].ascii_only? # =\u003e false\nCharacterSet['a', 'ü', '🤩'].ascii_ratio # =\u003e 0.3333333\nCharacterSet['a', 'ü', '🤩'].bmp_part # =\u003e CharacterSet['a', 'ü']\nCharacterSet['a', 'ü', '🤩'].astral_part # =\u003e CharacterSet['🤩']\nCharacterSet['a', 'ü', '🤩'].bmp_ratio # =\u003e 0.6666666\nCharacterSet['a', 'ü', '🤩'].planes # =\u003e [0, 1]\nCharacterSet['a', 'ü', '🤩'].plane(1) # =\u003e CharacterSet['🤩']\nCharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # =\u003e false\nCharacterSet::Character.new('a').plane # =\u003e 0\n```\n\n## Contributions\n\nFeel free to send suggestions, point out issues, or submit pull requests.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaynetics%2Fcharacter_set","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaynetics%2Fcharacter_set","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaynetics%2Fcharacter_set/lists"}