https://github.com/mathewsanders/Mustard
🌠 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
- Host: GitHub
- URL: https://github.com/mathewsanders/Mustard
- Owner: mathewsanders
- License: mit
- Created: 2016-12-30T18:42:45.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-05-14T20:24:17.000Z (over 6 years ago)
- Last Synced: 2024-04-24T14:14:59.891Z (6 months ago)
- Topics: substrings, swift, tokenizer
- Language: Swift
- Homepage:
- Size: 137 KB
- Stars: 689
- Watchers: 14
- Forks: 18
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ios-star - Mustard - Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it. (Text / Other Testing)
- awesome-ios - Mustard - Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it. (Text / Other Testing)
README
# Mustard
[![GitHub license](https://img.shields.io/badge/license-MIT-lightgrey.svg?style=flat)](https://github.com/mathewsanders/Mustard/blob/master/LICENSE) [![Carthage compatible](https://img.shields.io/badge/Carthage-compatible-4BC51D.svg?style=flat)](https://github.com/Carthage/Carthage) [![Swift Package Manager compatible](https://img.shields.io/badge/Swift%20Package%20Manager-compatible-EF5138%20.svg?style=flat)](https://swift.org/package-manager/)
Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
## Quick start using character sets
Foundation includes the `String` method [`components(separatedBy:)`](https://developer.apple.com/documentation/foundation/nsstring/1413214-components) that allows us to get substrings divided up by certain characters:
````Swift
let sentence = "hello 2017 year"
let words = sentence.components(separatedBy: .whitespaces)
// words.count -> 3
// words = ["hello", "2017", "year"]
````

Mustard provides a similar feature, but with the opposite approach: instead of matching by separators you match by one or more character sets, which is useful if separators simply don't exist:
````Swift
import Mustard

let sentence = "hello2017year"
let words = sentence.components(matchedWith: .letters, .decimalDigits)
// words.count -> 3
// words = ["hello", "2017", "year"]
````

If you want more than just the substrings, you can use the `tokens(matchedWith: CharacterSet...)` method, which returns an array of `TokenType`.

At a minimum, `TokenType` requires the properties `text` (the substring matched) and `range` (the range of the substring in the original string). When using a `CharacterSet` as a tokenizer, the more specific type `CharacterSetToken` is returned, which adds the property `set`: the `CharacterSet` instance that was used to create the match.
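To make the `text`, `range`, and `set` properties concrete, here is a minimal sketch of character-set matching written with Foundation only. It is not Mustard's implementation: `SketchToken` and `sketchTokens(in:matchedWith:)` are hypothetical names invented for this illustration, and the sketch assumes that a run ends whenever the matching set changes or a scalar matches none of the sets.

```swift
import Foundation

// Hypothetical stand-in for a token; this is NOT Mustard's own type.
struct SketchToken {
    let text: String                 // the substring matched
    let range: Range<String.Index>   // where it sits in the original string
    let set: CharacterSet            // the set that produced the match
}

// Groups consecutive unicode scalars that belong to the same character set.
// Scalars that match none of the sets are skipped and end the current run.
func sketchTokens(in string: String, matchedWith sets: CharacterSet...) -> [SketchToken] {
    let scalars = string.unicodeScalars
    var tokens: [SketchToken] = []
    var runStart = scalars.startIndex
    var runSetIndex: Int? = nil      // index into `sets` for the run in progress

    // Records the run in progress (if there is one) as a token ending at `end`.
    func closeRun(at end: String.Index) {
        if let setIndex = runSetIndex {
            let range = runStart..<end
            tokens.append(SketchToken(text: String(string[range]),
                                      range: range,
                                      set: sets[setIndex]))
        }
    }

    var index = scalars.startIndex
    while index < scalars.endIndex {
        // The first set containing this scalar wins; nil means "no match".
        let matched = sets.firstIndex { $0.contains(scalars[index]) }
        if matched != runSetIndex {
            closeRun(at: index)
            runStart = index
            runSetIndex = matched
        }
        index = scalars.index(after: index)
    }
    closeRun(at: scalars.endIndex)
    return tokens
}

let sample = "123Hello world&^45.67"
let sketched = sketchTokens(in: sample, matchedWith: .decimalDigits, .letters)
print(sketched.map { $0.text }) // ["123", "Hello", "world", "45", "67"]
```

With Mustard itself, the equivalent call is `tokens(matchedWith:)`: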
````Swift
import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens: [CharacterSet.Token]
// tokens.count -> 5 (characters '&', '^', and '.' are ignored)
//
// second token..
// tokens[1].text -> "Hello"
// tokens[1].range -> Range(3..<8)
// tokens[1].set -> CharacterSet.letters
//
// last token..
// tokens[4].text -> "67"
// tokens[4].range -> Range(19..<21)
// tokens[4].set -> CharacterSet.decimalDigits
````

## Advanced matching with custom tokenizers
Mustard can do more than match from character sets. You can create your own tokenizers with more sophisticated matching behavior by implementing the `TokenizerType` and `TokenType` protocols.

Here's an example of using `DateTokenizer` ([see example for implementation](Documentation/Template%20tokenizer.md)) that finds substrings that match a `MM/dd/yy` format.
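As a rough illustration of why a template match alone is not enough, the sketch below uses Foundation's `DateFormatter` to decide whether a candidate substring in `MM/dd/yy` form is a real calendar date. This is not the `DateTokenizer` implementation from the linked document: `sketchDate(fromTemplateMatch:)` is a hypothetical helper, and the fixed locale and UTC time zone are assumptions chosen to keep the result reproducible.

```swift
import Foundation

// Hypothetical helper, NOT part of Mustard: returns a Date only when
// `candidate` is a real calendar date written as MM/dd/yy.
func sketchDate(fromTemplateMatch candidate: String) -> Date? {
    let formatter = DateFormatter()
    // Assumptions for this sketch: fixed POSIX locale and UTC, so parsing
    // does not depend on the user's regional settings.
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = TimeZone(secondsFromGMT: 0)
    formatter.dateFormat = "MM/dd/yy"
    // Strict parsing: out-of-range fields such as month 99 are rejected.
    formatter.isLenient = false
    return formatter.date(from: candidate)
}

let validCandidate = sketchDate(fromTemplateMatch: "12/01/17")
let invalidCandidate = sketchDate(fromTemplateMatch: "99/99/99")
print(validCandidate != nil)   // true
print(invalidCandidate == nil) // true
```

A tokenizer built along these lines would first find digits and slashes in the right positions, then keep the match only if a check like this one succeeds.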
`DateTokenizer` returns tokens with the type `DateToken`. Along with the substring text and range, `DateToken` includes a `Date` object corresponding to the date in the substring:
````Swift
import Mustard

let text = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"
let tokens = text.tokens(matchedWith: DateTokenizer())
// tokens: [DateTokenizer.Token]
// tokens.count -> 2
// ('99/99/99' is *not* matched by `DateTokenizer` because it's not a valid date)
//
// first date
// tokens[0].text -> "12/01/17"
// tokens[0].date -> Date(2017-12-01 05:00:00 +0000)
//
// last date
// tokens[1].text -> "12/03/17"
// tokens[1].date -> Date(2017-12-03 05:00:00 +0000)
````

## Documentation & Examples
- [Greedy tokens and tokenizer order](Documentation/Greedy%20tokens%20and%20tokenizer%20order.md)
- [Token types and AnyToken](Documentation/Token%20types%20and%20AnyToken.md)
- [TokenizerType: implementing your own tokenizer](Documentation/TokenizerType%20protocol.md)
- [EmojiTokenizer: matching emoji substrings](Documentation/Matching%20emoji.md)
- [LiteralTokenizer: matching specific substrings](Documentation/Literal%20tokenizer.md)
- [DateTokenizer: tokenizer based on template match](Documentation/Template%20tokenizer.md)
- [Alternatives to using Mustard](Documentation/Alternatives%20to%20using%20Mustard.md)
- [Performance comparisons](Documentation/Performance%20Comparisons.md)

## Roadmap
- [x] Include detailed examples and documentation
- [x] Ability to skip/ignore characters within match
- [x] Include more advanced pattern matching for matching tokens
- [x] Make project logo
- [x] Performance testing / benchmarking against Scanner
- [ ] Include interface for working with Character tokenizers

## Requirements
- Swift 4.1
## Author
Made with :heart: by [@permakittens](http://twitter.com/permakittens)
## Contributing
Feedback and contributions, whether bug fixes or improvements, are welcome. Feel free to submit a pull request or open an issue.
## License
MIT