An open API service indexing awesome lists of open source software.

https://github.com/bgutter/sylvia

Use phoneme-based regular expressions to find words in the Carnegie-Mellon Pronouncing Dictionary.
https://github.com/bgutter/sylvia

Last synced: 10 months ago
JSON representation

Use phoneme-based regular expressions to find words in the Carnegie-Mellon Pronouncing Dictionary.

Awesome Lists containing this project

README

          

#+TITLE: Sylvia

Search pronunciations in the CMU Pronouncing Dictionary using a reglular-expression like syntax. Input-format regular expressions are lightly preprocessed into Python-format regular expressions, and then mapped over an encoded version of cmudict. Results are sorted by popularity using Peter Norvig's list of word popularities derived from Google's N-Gram dataset.

* Here for the Emacs library?
You can skip this and jump directly into the [[./sylvia-emacs/README.org][sylvia-mode README]]!

* Installation

#+BEGIN_SRC sh
brandon@brandon-babypad-linux ~> pip2 install sylvia
#+END_SRC

* Usage

Interactive Sylvia prompt:

#+BEGIN_SRC
brandon@brandon-babypad-linux ~> python2 -m sylvia

Type 'help' for options, press enter to quit.

sylvia>
#+END_SRC

Run one-off command:

#+BEGIN_SRC
brandon@brandon-babypad-linux ~> python2 -m sylvia -c "regex G #* AE #* IH #* %"
Gravity Graphical Grandchildren Garrison Graphically
Gallegos Gravitate Garretson Gastineau Gallimard
Galligan Grandison Gallivan Glatfelter Garibay
Garelick Garrigan Garriga Gravitates Galipeau
Gavigan Gamelin Gateley Grandillo Galipault
Garringer Gradison Grandchildren's Glastetter Garity
Galliher Gantenbein
#+END_SRC

* Commands

Sylvia's functionality is broken down into various subcommands. These commands can be run from the interactive prompt, or as single-lines directly from your system shell.

** regex

This is the most powerful feature of Sylvia. It allows searches of cmudict based on phoneme patterns.

Sylvia's query format is nearly identical to traditional Python 2 regular expressions, with the exception that it is intended not to match against patterns of characters, but rather patterns of phonemes. To construct a regular expression query for Sylvia, remember the following rules:

1. Whitespace must be used to delimit consecutive phoneme literals. It may also be used anywhere else in the regular expression, as whitespace is meaningless in the context of a phoneme sequence, and will be stripped during preprocessing.
1. `#` is a shortcut for "any consonant sound"
1. `@` is a shortcut for "any vowel sound"
1. `%` is a shortcut for "any syllable", and is equivalant to `#*@#*`
1. Otherwise, whatever flies with Python's regular expression format will work in Sylvia. Just use some common sense, as some things (such as character classes) will be wholly inapplicable to searches in phoneme-space.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> regex {regex tokens}
#+END_SRC

[[http://www.speech.cs.cmu.edu/cgi-bin/cmudict][Consult Carnegie Mellon's cmudict documentation]] to learn more about the phoneme set.

[[https://docs.python.org/2/library/re.html][Consult the Python docs]] to learn more about Python's regex format.

*** Examples

Find words starting with zero or more consonant sounds, followed by the "long E" sound (phoneme IY), followed by zero or more consonant sounds, followed by the "ed" sound (the phoneme sequence EH D):

#+BEGIN_SRC
sylvia> regex #* IY #* EH D
Steelhead Seabed Beachhead Retread Behead
#+END_SRC

Find all six syllable words where the first syllable uses the "short i" sound (phoneme IH), and ends in either the D or P phonemes.

#+BEGIN_SRC
sylvia> regex #*IH%%%%%(D|P)
Differentiated Individualized Deteriorated Institutionalized
Incapacitated Internationalized Interrelationship Misappropriated
Disassociated Discombobulated Insubstantiated
#+END_SRC

Note here that only five % symbols are needed, as a single vowel sound constitutes a single syllable, and we explicitly call out the first vowel sound via IH.

Find all words that start with the R sound, followed by some vowel, followed by the D sound, followed by another vowel, followed by the NG phoneme:

#+BEGIN_SRC
sylvia> regex R@D@NG
Reading Riding Redding Raiding Ridding Reding Rodding Ruding Rawding
#+END_SRC

** lookup

If you just want to lookup the pronunciations for a word, you can do that too. This can be a good way to quickly learn the phonemes for a particular sound when constructing queries. Due to cultural and geographic variations in pronunciation, this command can return multiple sequences.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> lookup {word}
#+END_SRC

*** Examples

#+BEGIN_SRC
sylvia> lookup turkmenistan
T ER K M EH N IH S T AE N
#+END_SRC

#+BEGIN_SRC
sylvia> lookup capture
K AE P CH ER
#+END_SRC

#+BEGIN_SRC
sylvia> lookup tomato
T AH M EY T OW T AH M AA T OW
#+END_SRC

** rhyme

Sylvia can act as a rhyming dictionary, returning words which rhyme with a given word. There are three "rhyme levels", which define how rhymes are determined.

1. *perfect* lists all words which contain the same sequence of phonemes as the given word, including and following the first vowel in the given pronunciation. Before that vowel, the matched words can contain any sounds.
2. *default* is the same as perfect, except that additional consonant sounds can be interspersed between the matched sequence phonemes.
3. *loose* is similar, except it ignores consonant sounds entirely.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> rhyme {rhyme-level} {word}
#+END_SRC

rhyme-level can be omitted if default behavior is desired.

There are plans to improve these models by matching phonemes based on their vocal characteristics. For example, all nasal phonemes may be considered matches by default, or all plosive sounds, etc. The behavior documented above is subject to change at any time.

*** Examples

List words which rhyme with "chatter", using the perfect algorithm.

#+BEGIN_SRC
sylvia> rhyme perfect chatter
Matter Latter Batter Mater Platter
Scatter Flatter Shatter Hatter Splatter
Fatter Patter Antimatter Clatter Spatter
Schlatter Blatter Natter Sater Satter
Slatter Tatter Mcphatter Chitterchatter Smatter
Vanatter Vannater Vatter Vannatter Mcfatter
Wildcatter
#+END_SRC

...using the default algorithm...

#+BEGIN_SRC
sylvia> rhyme chatter
After Chapter Matter Master
Factors Factor Pattern Faster
Matters Webmaster Patterns Adapter
Contractor Contractors Disaster Actor
Masters Latter Chapters Actors
Adapters Lancaster Saturn Adaptor
Pastor Thereafter Tractor Scattered
Disasters Ticketmaster Napster Laughter
Reactor Adaptors Baxter Stratford
Blaster Lantern Bastard Maxtor
Tractors Shattered Plaster Hereafter
Subchapter Batter Broadcasters Antwerp
Raptor Mater Platter Scatter
Hamster Raster Subcontractor Reactors
Pastors Subcontractors Broadcaster Mastered
... many more...
#+END_SRC

...and using the loose algorithm.

#+BEGIN_SRC
sylvia> rhyme loose chatter
After Standard Password Chapter
Standards Rather Matter Cancer
Answer Master Transfer Answers
Factors Factor Pattern Faster
Matters Manner Webmaster Patterns
Hampshire Adapter Contractor Banner
Contractors Alexander Capture Disaster
Actor Masters Traveler Latter
Albert Chapters Packard Answered
Scanner Bachelor Actors Transfers
Adverse Amber Tracker Transferred
Planner Hacker Commander Adapters
Scanners Manufactured Stanford Manufacture
Anchor Gathered Travelers Captured
Grammar Hazard Anger Gather
Lancaster Hammer Manor Programmer
Hazards Bradford Madagascar Saturn
Banners Passwords Adaptor Pastor
Hamburg Ladder Flashers Programmers
Planners Thereafter Chancellor Frankfurt
Tractor Wagner Hackers Scattered
Ballard Disasters Handler Chandler
Sanders Ticketmaster Napster Banker
Dancer Dancers Jasper Laughter
Backward Panthers Captures Bladder
Sampler Panther Reactor Stafford
Backwards Adaptors Manufactures Glamour
Baxter Stratford Blackburn Amherst
Blaster Tavern Lambert Fracture
...many, many more...
#+END_SRC

** infer

Sylvia can infer the pronunciation of unknown words using it's own rule-based text-to-phoneme engine. Don't expect great performance though -- written English is only ostensibly phonetic, and rules-based approaches are not fantastic. Any deep-learning based solution to this problem is likely to beat the snot out of Sylvia's engine.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> infer {word}
#+END_SRC

*** Examples

Infer a pronunciation for the word "rooster", then compare to the value from lookup.

#+BEGIN_SRC
sylvia> infer rooster
R UW S T ER

sylvia> lookup rooster
R UW S T ER
#+END_SRC

Infer pronunciations for some made-up words.

#+BEGIN_SRC
sylvia> infer rafloy
R AE F L OY

sylvia> infer rabbilt
R AE B IH L T

sylvia> infer fliberdoodle
F L IH B ER D UW D AH L
#+END_SRC

** lregex

Sylvia can lookup words based on normal regular expressions. This command doesn't touch on anything phonetic, but may be useful in the same use-cases as Sylvia itself.

Use of this command is as follows:

#+BEGIN_SRC
sylvia> lregex {regex tokens}
#+END_SRC

*** Examples

Find all words /which are spelled/ with a C at the start, a P at the end, and which contain either a T or a D.

#+BEGIN_SRC
sylvia> lregex c.*(t|d).*p
Citizenship Craftsmanship Countertop Courtship Catnip
Citicorp Conservatorship Catsup Crudup Catchup
Colstrip Catnap Cutlip Coltharp
#+END_SRC

** popularity

You can ask Sylvia for the popularity of a word. This value depends on the data-source used when compiling the dictionary, but by default, it is the value in Peter Norvig's word popularity list. Larger values indicate higher popularity (think occurrences, not rank).

Use of this command is as follows:

#+BEGIN_SRC
sylvia> popularity {word}
#+END_SRC

*** Examples

Find the popularity of a popular, typical, and rare word.

#+BEGIN_SRC
sylvia> popularity I
3086225277

sylvia> popularity green
108287905

sylvia> popularity teutonic
301907
#+END_SRC

* Contributing and Other General Notes

For a list of known issues feature ideas, and links to relevant research and documentation, [[./NOTES.org][check out the development notes!]]