https://github.com/bgutter/sylvia

Use phoneme-based regular expressions to find words in the Carnegie-Mellon Pronouncing Dictionary.
https://github.com/bgutter/sylvia

Last synced: about 1 year ago
JSON representation

Use phoneme-based regular expressions to find words in the Carnegie-Mellon Pronouncing Dictionary.

Host: GitHub
URL: https://github.com/bgutter/sylvia
Owner: bgutter
License: mit
Created: 2017-03-26T04:43:56.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2023-12-14T02:38:12.000Z (over 2 years ago)
Last Synced: 2024-10-13T12:07:34.823Z (over 1 year ago)
Language: Python
Size: 7.41 MB
Stars: 33
Watchers: 5
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.org
- License: LICENSE

Awesome Lists containing this project

README

#+TITLE: Sylvia

Search pronunciations in the CMU Pronouncing Dictionary using a reglular-expression like syntax. Input-format regular expressions are lightly preprocessed into Python-format regular expressions, and then mapped over an encoded version of cmudict. Results are sorted by popularity using Peter Norvig's list of word popularities derived from Google's N-Gram dataset.

* Here for the Emacs library?
You can skip this and jump directly into the [[./sylvia-emacs/README.org][sylvia-mode README]]!

* Installation

#+BEGIN_SRC sh
brandon@brandon-babypad-linux ~> pip2 install sylvia
#+END_SRC

* Usage

Interactive Sylvia prompt:

#+BEGIN_SRC
brandon@brandon-babypad-linux ~> python2 -m sylvia

Type 'help' for options, press enter to quit.

sylvia>
#+END_SRC

Run one-off command:

#+BEGIN_SRC
brandon@brandon-babypad-linux ~> python2 -m sylvia -c "regex G #* AE #* IH #* %"
Gravity Graphical Grandchildren Garrison Graphically
Gallegos Gravitate Garretson Gastineau Gallimard
Galligan Grandison Gallivan Glatfelter Garibay
Garelick Garrigan Garriga Gravitates Galipeau
Gavigan Gamelin Gateley Grandillo Galipault
Garringer Gradison Grandchildren's Glastetter Garity
Galliher Gantenbein
#+END_SRC

* Commands

Sylvia's functionality is broken down into various subcommands. These commands can be run from the interactive prompt, or as single-lines directly from your system shell.

** regex

This is the most powerful feature of Sylvia. It allows searches of cmudict based on phoneme patterns.

Sylvia's query format is nearly identical to traditional Python 2 regular expressions, with the exception that it is intended not to match against patterns of characters, but rather patterns of phonemes. To construct a regular expression query for Sylvia, remember the following rules:

1. Whitespace must be used to delimit consecutive phoneme literals. It may also be used anywhere else in the regular expression, as whitespace is meaningless in the context of a phoneme sequence, and will be stripped during preprocessing.
1. `#` is a shortcut for "any consonant sound"
1. `@` is a shortcut for "any vowel sound"
1. `%` is a shortcut for "any syllable", and is equivalant to `#*@#*`
1. Otherwise, whatever flies with Python's regular expression format will work in Sylvia. Just use some common sense, as some things (such as character classes) will be wholly inapplicable to searches in phoneme-space.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> regex {regex tokens}
#+END_SRC

[[http://www.speech.cs.cmu.edu/cgi-bin/cmudict][Consult Carnegie Mellon's cmudict documentation]] to learn more about the phoneme set.

[[https://docs.python.org/2/library/re.html][Consult the Python docs]] to learn more about Python's regex format.

*** Examples

Find words starting with zero or more consonant sounds, followed by the "long E" sound (phoneme IY), followed by zero or more consonant sounds, followed by the "ed" sound (the phoneme sequence EH D):

#+BEGIN_SRC
sylvia> regex #* IY #* EH D
Steelhead Seabed Beachhead Retread Behead
#+END_SRC

Find all six syllable words where the first syllable uses the "short i" sound (phoneme IH), and ends in either the D or P phonemes.

#+BEGIN_SRC
sylvia> regex #*IH%%%%%(D|P)
Differentiated Individualized Deteriorated Institutionalized
Incapacitated Internationalized Interrelationship Misappropriated
Disassociated Discombobulated Insubstantiated
#+END_SRC

Note here that only five % symbols are needed, as a single vowel sound constitutes a single syllable, and we explicitly call out the first vowel sound via IH.

Find all words that start with the R sound, followed by some vowel, followed by the D sound, followed by another vowel, followed by the NG phoneme:

#+BEGIN_SRC
sylvia> regex R@D@NG
Reading Riding Redding Raiding Ridding Reding Rodding Ruding Rawding
#+END_SRC

** lookup

If you just want to lookup the pronunciations for a word, you can do that too. This can be a good way to quickly learn the phonemes for a particular sound when constructing queries. Due to cultural and geographic variations in pronunciation, this command can return multiple sequences.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> lookup {word}
#+END_SRC

*** Examples

#+BEGIN_SRC
sylvia> lookup turkmenistan
T ER K M EH N IH S T AE N
#+END_SRC

#+BEGIN_SRC
sylvia> lookup capture
K AE P CH ER
#+END_SRC

#+BEGIN_SRC
sylvia> lookup tomato
T AH M EY T OW T AH M AA T OW
#+END_SRC

** rhyme

Sylvia can act as a rhyming dictionary, returning words which rhyme with a given word. There are three "rhyme levels", which define how rhymes are determined.

1. *perfect* lists all words which contain the same sequence of phonemes as the given word, including and following the first vowel in the given pronunciation. Before that vowel, the matched words can contain any sounds.
2. *default* is the same as perfect, except that additional consonant sounds can be interspersed between the matched sequence phonemes.
3. *loose* is similar, except it ignores consonant sounds entirely.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> rhyme {rhyme-level} {word}
#+END_SRC

rhyme-level can be omitted if default behavior is desired.

There are plans to improve these models by matching phonemes based on their vocal characteristics. For example, all nasal phonemes may be considered matches by default, or all plosive sounds, etc. The behavior documented above is subject to change at any time.

*** Examples

List words which rhyme with "chatter", using the perfect algorithm.

#+BEGIN_SRC
sylvia> rhyme perfect chatter
Matter Latter Batter Mater Platter
Scatter Flatter Shatter Hatter Splatter
Fatter Patter Antimatter Clatter Spatter
Schlatter Blatter Natter Sater Satter
Slatter Tatter Mcphatter Chitterchatter Smatter
Vanatter Vannater Vatter Vannatter Mcfatter
Wildcatter
#+END_SRC

...using the default algorithm...

#+BEGIN_SRC
sylvia> rhyme chatter
After Chapter Matter Master
Factors Factor Pattern Faster
Matters Webmaster Patterns Adapter
Contractor Contractors Disaster Actor
Masters Latter Chapters Actors
Adapters Lancaster Saturn Adaptor
Pastor Thereafter Tractor Scattered
Disasters Ticketmaster Napster Laughter
Reactor Adaptors Baxter Stratford
Blaster Lantern Bastard Maxtor
Tractors Shattered Plaster Hereafter
Subchapter Batter Broadcasters Antwerp
Raptor Mater Platter Scatter
Hamster Raster Subcontractor Reactors
Pastors Subcontractors Broadcaster Mastered
... many more...
#+END_SRC

...and using the loose algorithm.

#+BEGIN_SRC
sylvia> rhyme loose chatter
After Standard
Standards Rather
Answer Master
Factors Factor
Matters Manner
Hampshire Adapter
Contractors Alexander
Actor Masters
Albert Chapters
Scanner Bachelor
Adverse Amber
Planner Hacker
Scanners Manufactured
Anchor Gathered
Grammar Hazard
Lancaster Hammer
Hazards Bradford
Banners Passwords
Hamburg Ladder
Planners Thereafter
Tractor Wagner
Ballard Disasters
Sanders Ticketmaster
Dancer Dancers
Backward Panthers
Sampler Panther
Backwards Adaptors
Baxter Stratford
Blaster Tavern
...many, many more...
#+END_SRC Password Chapter Matter Cancer Transfer Answers Pattern Faster Webmaster Patterns Contractor Banner Capture Disaster Traveler Latter Packard Answered Actors Transfers Tracker Transferred Commander Adapters Stanford Manufacture Travelers Captured Anger Gather Manor Programmer Madagascar Saturn Adaptor Pastor Flashers Programmers Chancellor Frankfurt Hackers Scattered Handler Chandler Napster Banker Jasper Laughter Captures Bladder Reactor Stafford Manufactures Glamour Blackburn Amherst Lambert Fracture

** infer

Sylvia can infer the pronunciation of unknown words using it's own rule-based text-to-phoneme engine. Don't expect great performance though -- written English is only ostensibly phonetic, and rules-based approaches are not fantastic. Any deep-learning based solution to this problem is likely to beat the snot out of Sylvia's engine.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> infer {word}
#+END_SRC

*** Examples

Infer a pronunciation for the word "rooster", then compare to the value from lookup.

#+BEGIN_SRC
sylvia> infer rooster
R UW S T ER

sylvia> lookup rooster
R UW S T ER
#+END_SRC

Infer pronunciations for some made-up words.

#+BEGIN_SRC
sylvia> infer rafloy
R AE F L OY

sylvia> infer rabbilt
R AE B IH L T

sylvia> infer fliberdoodle
F L IH B ER D UW D AH L
#+END_SRC

** lregex

Sylvia can lookup words based on normal regular expressions. This command doesn't touch on anything phonetic, but may be useful in the same use-cases as Sylvia itself.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> lregex {regex tokens}
#+END_SRC

*** Examples

Find all words /which are spelled/ with a C at the start, a P at the end, and which contain either a T or a D.

#+BEGIN_SRC
sylvia> lregex c.*(t|d).*p
Citizenship Craftsmanship Countertop Courtship Catnip
Citicorp Conservatorship Catsup Crudup Catchup
Colstrip Catnap Cutlip Coltharp
#+END_SRC

** popularity

You can ask Sylvia for the popularity of a word. This value depends on the data-source used when compiling the dictionary, but by default, it is the value in Peter Norvig's word popularity list. Larger values indicate higher popularity (think occurrences, not rank).

Use of this command is as follows:

#+BEGIN_SRC
sylvia> popularity {word}
#+END_SRC

*** Examples

Find the popularity of a popular, typical, and rare word.

#+BEGIN_SRC
sylvia> popularity I
3086225277

sylvia> popularity green
108287905

sylvia> popularity teutonic
301907
#+END_SRC

* Contributing and Other General Notes

For a list of known issues feature ideas, and links to relevant research and documentation, [[./NOTES.org][check out the development notes!]]

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/bgutter/sylvia

Awesome Lists containing this project

README