https://github.com/ewdlop/automatatheory
- Host: GitHub
- URL: https://github.com/ewdlop/automatatheory
- Owner: ewdlop
- Created: 2024-12-03T19:45:46.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-02-16T07:33:19.000Z (4 months ago)
- Last Synced: 2025-02-16T08:23:06.429Z (4 months ago)
- Topics: alan-turing, church-encoding, church-turning-thesis, context-free-grammar, factor-terms, simplifed-expressions
- Language: Python
- Homepage:
- Size: 14.7 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AutomataTheory
## Regular language
```markdown
If you need a **regex pattern that matches alphabetic text, including accented letters**, you can use the following pattern:
```
```regex
[\p{L}\p{M}]+
```

### Explanation:
```markdown
- `\p{L}`: Matches any Unicode letter, including accented characters.
- `\p{M}`: Matches diacritical marks (for combining accents).
- `+`: Ensures it matches one or more characters.
```

### Example Matches:
```markdown
- **Café**
- **résumé**
- **naïve**
- **français**
- **Schrödinger**
```
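
Python's built-in `re` module does not support `\p{L}` or `\p{M}`, so the pattern above cannot be used there directly. The following is a minimal sketch of one way to approximate it with the standard-library `unicodedata` module, checking each character's general category instead. The helper names `is_letter_or_mark` and `matches_letters_and_marks` are hypothetical and not part of this repository.

```python
import unicodedata


def is_letter_or_mark(ch: str) -> bool:
    """Return True if ch has a general category starting with L (letter) or M (mark)."""
    return unicodedata.category(ch)[0] in ("L", "M")


def matches_letters_and_marks(text: str) -> bool:
    """Approximate a full match of the letters-and-marks character class."""
    # The "+" quantifier requires at least one character.
    return len(text) > 0 and all(is_letter_or_mark(ch) for ch in text)


for word in ["Café", "résumé", "naïve", "français", "Schrödinger"]:
    print(word, matches_letters_and_marks(word))

# "e" followed by U+0301 COMBINING ACUTE ACCENT is accepted via the mark category.
print(matches_letters_and_marks("cafe\u0301"))

# Spaces and digits belong to neither category, so this is rejected.
print(matches_letters_and_marks("café 2024"))
```

The third-party `regex` package is another option if `\p{...}` syntax is needed in Python; this sketch stays within the standard library.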
```markdown
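If your regex engine **does not support Unicode properties (`\p{L}` and `\p{M}`)**, you can fall back to an explicit character class. It is narrower than the Unicode version: it covers only ASCII letters and the accented letters of the Latin-1 range.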
```
```regex
[a-zA-ZÀ-ÖØ-öø-ÿ]+
```
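
As a rough illustration of what the fallback class accepts and rejects, here is a small sketch using Python's built-in `re` module. The names `LATIN1_WORD` and `is_latin1_word` are made up for this example.

```python
import re

# Fallback class from the text: ASCII letters plus Latin-1 accented letters.
LATIN1_WORD = re.compile(r"[a-zA-ZÀ-ÖØ-öø-ÿ]+")


def is_latin1_word(text: str) -> bool:
    """Return True if the whole string consists of characters in the class."""
    return LATIN1_WORD.fullmatch(text) is not None


# All letters here fall inside the Latin-1 ranges.
print(is_latin1_word("Schrödinger"))

# "Ł" (U+0141) and "ź" (U+017A) lie outside the ranges.
print(is_latin1_word("Łódź"))

# A decomposed accent (U+0301) is a combining mark, which the class does not cover.
print(is_latin1_word("cafe\u0301"))
```

The last two cases show the trade-off: the explicit class works in engines without Unicode properties, but it misses letters outside Latin-1 and text stored in decomposed form.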
```markdown
The character class explicitly includes:
- `a-zA-Z`: ASCII letters
- `À-Ö`: Uppercase accented letters (U+00C0 to U+00D6)
- `Ø-ö`: The remaining uppercase accented letters, `ß`, and lowercase accented letters up to `ö` (U+00D8 to U+00F6)
- `ø-ÿ`: The remaining lowercase accented letters (U+00F8 to U+00FF)

The gaps between the ranges skip the `×` (U+00D7) and `÷` (U+00F7) signs.
```
```markdown
To create regular expressions that match **Chinese** or **Japanese** characters, you can utilize Unicode script properties, provided your regex engine supports them.

**For Chinese Characters:**
- **Unicode Property Syntax:** Use `\p{Han}` to match any Han character, which includes Chinese ideographs.
```
```regex
\p{Han}+
```
```markdown
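This pattern matches one or more consecutive Han characters. The Han script is shared by Chinese, Japanese (kanji), and Korean (hanja), so the pattern is not specific to Chinese text.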
```
**For Japanese Characters:**
```markdown
Japanese text comprises Hiragana, Katakana, and Kanji (which are also Han characters). To match these:
- **Hiragana:** Use `\p{Hiragana}`
- **Katakana:** Use `\p{Katakana}`
- **Kanji:** Use `\p{Han}`

To match any Japanese character, combine these properties:
```
```regex
[\p{Hiragana}\p{Katakana}\p{Han}]+
```
```markdown
This pattern matches one or more characters that are either Hiragana, Katakana, or Kanji.

**Important Considerations:**
- **Regex Engine Support:** Not all regex engines support Unicode property escapes, and the syntax varies among those that do. Perl and PCRE accept `\p{Han}` as written, Java expects `\p{IsHan}`, and JavaScript (with the `/u` flag) expects `\p{Script=Han}`. Python's built-in `re` module does not support `\p{...}` at all. Always verify your specific environment's capabilities.
- **Alternative Approaches:** If your environment lacks support for Unicode properties, you can use explicit Unicode ranges. For example, to match Chinese characters:
```
```regex
[\u4E00-\u9FFF]+
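```
```markdown
This matches characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF), commonly used for Chinese. Rarer ideographs in the extension blocks, such as Extension A (U+3400 to U+4DBF), fall outside this range.

For Japanese Hiragana and Katakana: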
```
```regex
[\u3040-\u309F\u30A0-\u30FF]+
```
```markdown
This matches characters in the Hiragana and Katakana blocks.

**References:**
- [Regular Expressions for Japanese Text](https://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/)
- [Regular Expression to Find Chinese Characters](https://salesforce.stackexchange.com/questions/127565/regular-expression-to-find-chinese-characters)

These resources provide additional insights and examples for handling Chinese and Japanese text with regular expressions.
```
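
The explicit ranges above can be tried out with Python's built-in `re` module, which understands `\uXXXX` escapes inside patterns. This is a minimal sketch: the names `HAN`, `KANA`, `JAPANESE`, and `find_runs` are hypothetical, and the combined `JAPANESE` class is an assumed explicit-range counterpart of `[\p{Hiragana}\p{Katakana}\p{Han}]+`, not a pattern taken from the references.

```python
import re

# Explicit-range fallbacks, using the block boundaries given in the text.
HAN = re.compile(r"[\u4E00-\u9FFF]+")                # CJK Unified Ideographs
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]+")  # Hiragana and Katakana
# Assumed combination of the three ranges for Japanese text.
JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")


def find_runs(pattern, text):
    """Return every maximal run of characters matched by pattern."""
    return pattern.findall(text)


sample = "Python で漢字とカタカナ and 中文 mixed"

print(find_runs(HAN, sample))       # ['漢字', '中文']
print(find_runs(KANA, sample))      # ['で', 'とカタカナ']
print(find_runs(JAPANESE, sample))  # ['で漢字とカタカナ', '中文']
```

Because Han characters are shared across languages, the `JAPANESE` pattern also picks up the Chinese run `中文`. Code point ranges alone cannot tell the two languages apart.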