{"id":13595987,"url":"https://github.com/cbaziotis/ekphrasis","last_synced_at":"2026-01-14T07:45:59.447Z","repository":{"id":17128559,"uuid":"81201748","full_name":"cbaziotis/ekphrasis","owner":"cbaziotis","description":"Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction,  using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).","archived":false,"fork":false,"pushed_at":"2025-06-02T16:34:16.000Z","size":797,"stargazers_count":671,"open_issues_count":21,"forks_count":93,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-09-25T10:24:48.543Z","etag":null,"topics":["nlp","nlp-library","semeval","spell-corrector","spelling-correction","text-processing","text-segmentation","tokenization","tokenizer","word-normalization","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cbaziotis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2017-02-07T11:39:37.000Z","updated_at":"2025-08-28T07:06:33.000Z","dependencies_parsed_at":"2024-06-18T18:36:33.544Z","dependency_job_id":"f7ccd73f-372c-4c25-a209-d98e5c017ea8","html_url":"https://github.com/cbaziotis/ekphrasis","commit_stats":{"total_commits":74,"total_committers":5,"mean_commits":14.8,"dds":0.12162
16216216216,"last_synced_commit":"ccfb9ef214d4e332d6abd266bcfe439ec480fa08"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cbaziotis/ekphrasis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbaziotis%2Fekphrasis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbaziotis%2Fekphrasis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbaziotis%2Fekphrasis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbaziotis%2Fekphrasis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cbaziotis","download_url":"https://codeload.github.com/cbaziotis/ekphrasis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cbaziotis%2Fekphrasis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28413490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","nlp-library","semeval","spell-corrector","spelling-correction","text-processing","text-segmentation","tokenization","tokenizer","word-normalization","word-segmentation"],"created_at":"2024-08-01T16:02:03.618Z","updated_at":"2026-01-14T07:45:59.204Z","avatar_url":"https://github.com/cbaziotis.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"Collection of lightweight text tools, geared towards text from social networks, such as Twitter or Facebook, for tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, \nusing word statistics from 2 big corpora (English Wikipedia, Twitter - 330 million English tweets).\n\n_ekphrasis_ was developed as part of the text processing pipeline for\n_DataStories_ team's submission for _SemEval-2017 Task 4 (English), Sentiment Analysis in Twitter_.\n\nIf you use the library in your research project, please cite the paper \n[\"DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis\"](http://www.aclweb.org/anthology/S17-2126).\n\nCitation:\n```\n@InProceedings{baziotis-pelekis-doulkeridis:2017:SemEval2,\n  author    = {Baziotis, Christos  and  Pelekis, Nikos  and  Doulkeridis, Christos},\n  title     = {DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis},\n  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation 
(SemEval-2017)},\n  month     = {August},\n  year      = {2017},\n  address   = {Vancouver, Canada},\n  publisher = {Association for Computational Linguistics},\n  pages     = {747--754}\n}\n```\n\n**Disclaimer:** The library is no longer actively developed. I will try to resolve important issues, but I can't make any promises.\n\n# Installation\n\nBuild from source:\n```\npip install git+https://github.com/cbaziotis/ekphrasis.git\n```\nor install from PyPI:\n```\npip install ekphrasis -U\n```\n\n# Overview\n\n_ekphrasis_ offers the following functionality:\n\n  1. **Social Tokenizer**. A text tokenizer geared towards social networks (Facebook, Twitter...), \n      which understands complex emoticons, emojis and other unstructured expressions like dates, times and more.\n\n  2. **Word Segmentation**. You can split a long string into its constituent words. Suitable for hashtag segmentation.\n\n  3. **Spell Correction**. You can replace a misspelled word with the most probable candidate word.\n\n  4. **Customization**. Tailor the word segmentation, spell correction and term identification to suit your needs.\n  \n      The word segmentation and spell correction mechanisms operate on top of word statistics collected from a given corpus. We provide word statistics from 2 big corpora (from Wikipedia and Twitter), but you can also generate word statistics from your own corpus. You may need to do that if you are working with domain-specific texts, like biomedical documents. For example, a word describing a technique or a chemical compound may be treated as misspelled when using the word statistics from a general-purpose corpus.\n\n      _ekphrasis_ tokenizes the text based on a list of regular expressions. You can easily enable _ekphrasis_ to identify new entities by simply adding a new entry to the dictionary of regular expressions (`ekphrasis/regexes/expressions.txt`).\n\n  5. **Pre-Processing Pipeline**. 
You can easily combine all of the above steps in order to prepare the text files in your dataset for some kind of analysis or for machine learning.\n  In addition to the aforementioned actions, you can perform text normalization, word annotation (labeling) and more.\n\n\n\n\n## Text Pre-Processing pipeline\n\nYou can easily define a preprocessing pipeline by using the ``TextPreProcessor``. \n\n```python\nfrom ekphrasis.classes.preprocessor import TextPreProcessor\nfrom ekphrasis.classes.tokenizer import SocialTokenizer\nfrom ekphrasis.dicts.emoticons import emoticons\n\ntext_processor = TextPreProcessor(\n    # terms that will be normalized\n    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',\n        'time', 'date', 'number'],\n    # terms that will be annotated\n    annotate={\"hashtag\", \"allcaps\", \"elongated\", \"repeated\",\n        'emphasis', 'censored'},\n    fix_html=True,  # fix HTML tokens\n    \n    # corpus from which the word statistics are going to be used \n    # for word segmentation \n    segmenter=\"twitter\", \n    \n    # corpus from which the word statistics are going to be used \n    # for spell correction\n    corrector=\"twitter\", \n    \n    unpack_hashtags=True,  # perform word segmentation on hashtags\n    unpack_contractions=True,  # Unpack contractions (can't -\u003e can not)\n    spell_correct_elong=False,  # spell correction for elongated words\n    \n    # select a tokenizer. You can use SocialTokenizer, or pass your own;\n    # the tokenizer should take a string as input and return a list of tokens\n    tokenizer=SocialTokenizer(lowercase=True).tokenize,\n    \n    # list of dictionaries, for replacing tokens extracted from the text,\n    # with other expressions. You can pass more than one dictionary.\n    dicts=[emoticons]\n)\n\nsentences = [\n    \"CANT WAIT for the new season of #TwinPeaks ＼(^o^)／!!! #davidlynch #tvseries :)))\",\n    \"I saw the new #johndoe movie and it suuuuucks!!! 
WAISTED $10... #badmovies :/\",\n    \"@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! :-D http://sentimentsymposium.com/.\"\n]\n\nfor s in sentences:\n    print(\" \".join(text_processor.pre_process_doc(s)))\n```\n\nOutput:\n\n```\ncant \u003callcaps\u003e wait \u003callcaps\u003e for the new season of \u003chashtag\u003e twin peaks \u003c/hashtag\u003e ＼(^o^)／ ! \u003crepeated\u003e \u003chashtag\u003e david lynch \u003c/hashtag\u003e \u003chashtag\u003e tv series \u003c/hashtag\u003e \u003chappy\u003e\n\ni saw the new \u003chashtag\u003e john doe \u003c/hashtag\u003e movie and it sucks \u003celongated\u003e ! \u003crepeated\u003e waisted \u003callcaps\u003e \u003cmoney\u003e . \u003crepeated\u003e \u003chashtag\u003e bad movies \u003c/hashtag\u003e \u003cannoyed\u003e\n\n\u003cuser\u003e : can not wait for the \u003cdate\u003e \u003chashtag\u003e sentiment \u003c/hashtag\u003e talks ! yay \u003callcaps\u003e \u003celongated\u003e ! \u003crepeated\u003e \u003claugh\u003e \u003curl\u003e\n```\n\n\nNotes:\n\n* Elongated words are automatically normalized.\n* Spell correction affects performance.\n\n---\n\n### Word Statistics\n_ekphrasis_ provides word statistics (unigrams and bigrams) from 2 big corpora:\n* the English Wikipedia\n* a collection of 330 million English Twitter messages\n\nThese word statistics are required for word segmentation and spell correction.\nMoreover, you can generate word statistics from your own corpus.\nYou can use `ekphrasis/tools/generate_stats.py` to generate statistics from a text file, or from a directory that contains a collection of text files.\nFor example, in order to generate word statistics for [text8](http://mattmahoney.net/dc/textdata.html) (http://mattmahoney.net/dc/text8.zip), you can run:\n\n```\npython generate_stats.py --input text8.txt --name text8 --ngrams 2 --mincount 70 30\n```\n* input: path to file or directory containing the files for calculating the statistics.\n* name: the name of the 
corpus.\n* ngrams: the maximum ngram order for which to calculate statistics.\n* mincount: the minimum count an ngram needs in order to be included. \n  In this case, the mincount for unigrams is 70 and for bigrams is 30.\n\nAfter you run the script, you will see a new directory inside `ekphrasis/stats/` with the statistics of your corpus. \nIn the case of the example above, `ekphrasis/stats/text8/`. \n\n\n\n### Word Segmentation\nThe word segmentation implementation uses the Viterbi algorithm and is based on [CH14](http://norvig.com/ngrams/ch14.pdf) from the book [Beautiful Data (Segaran and Hammerbacher, 2009)](http://shop.oreilly.com/product/9780596157128.do).\nThe implementation requires word statistics in order to identify and separate the words in a string. \nYou can use the word statistics from one of the 2 provided corpora, or from your own corpus.\n\n\n**Example:**\nIn order to perform word segmentation, first you have to instantiate a segmenter with a given corpus, and then just use the `segment()` method:\n```python\nfrom ekphrasis.classes.segmenter import Segmenter\nseg = Segmenter(corpus=\"mycorpus\") \nprint(seg.segment(\"smallandinsignificant\"))\n```\nOutput:\n```\n\u003e small and insignificant\n```\n\nYou can test the output using statistics from the different corpora:\n```python\nfrom ekphrasis.classes.segmenter import Segmenter\n\n# segmenter using the word statistics from English Wikipedia\nseg_eng = Segmenter(corpus=\"english\") \n\n# segmenter using the word statistics from Twitter\nseg_tw = Segmenter(corpus=\"twitter\")\n\nwords = [\"exponentialbackoff\", \"gamedev\", \"retrogaming\", \"thewatercooler\", \"panpsychism\"]\nfor w in words:\n    print(w)\n    print(\"(eng):\", seg_eng.segment(w))\n    print(\"(tw):\", seg_tw.segment(w))\n    print()\n```\nOutput:\n```\nexponentialbackoff\n(eng): exponential backoff\n(tw): exponential back off\n\ngamedev\n(eng): gamedev\n(tw): game dev\n\nretrogaming\n(eng): retrogaming\n(tw): retro 
gaming\n\nthewatercooler\n(eng): the water cooler\n(tw): the watercooler\n\npanpsychism\n(eng): panpsychism\n(tw): pan psych is m\n\n```\n\nFinally, if the word is camelCased or PascalCased, then the algorithm splits the words based on the case of the characters.\n```python\nfrom ekphrasis.classes.segmenter import Segmenter\nseg = Segmenter() \nprint(seg.segment(\"camelCased\"))\nprint(seg.segment(\"PascalCased\"))\n```\nOutput:\n```\n\u003e camel cased\n\u003e pascal cased\n```\n\n### Spell Correction\nThe Spell Corrector is based on [Peter Norvig's spell-corrector](http://norvig.com/spell-correct.html).\nJust like the segmentation algorithm, we utilize word statistics in order to find the most probable candidate.\nBesides the provided statistics, you can use your own.\n\n**Example:**\n\nYou can perform spell correction just like word segmentation.\nFirst you have to instantiate a `SpellCorrector` object \nthat uses the statistics from the corpus of your choice, and then use one of the available methods.\n```python\nfrom ekphrasis.classes.spellcorrect import SpellCorrector\nsp = SpellCorrector(corpus=\"english\") \nprint(sp.correct(\"korrect\"))\n```\nOutput:\n```\n\u003e correct\n```\n\n\n### Social Tokenizer\nThe difficulty in tokenization is to avoid splitting expressions or words that should be kept intact (as one token).\nThis is more important in texts from social networks, with \"creative\" writing and expressions like emoticons, hashtags and so on.\nAlthough there are some tokenizers geared towards Twitter [1],[2], \nthat recognize the Twitter markup and some basic sentiment expressions or simple emoticons, \nour tokenizer is able to identify almost all emoticons, emojis and many complex expressions.\n\nEspecially for tasks such as sentiment analysis, there are many expressions that play a decisive role in identifying the sentiment expressed in text. 
Expressions like these are: \n\n- Censored words, such as ``f**k``, ``s**t``.\n- Words with emphasis, such as ``a *great* time``, ``I don't *think* I ...``.\n- Emoticons, such as ``\u003e:(``, ``:))``, ``\\o/``.\n- Dash-separated words, such as ``over-consumption``, ``anti-american``, ``mind-blowing``.\n\nMoreover, ekphrasis can identify information-bearing expressions. Depending on the task, you may want to preserve / extract them as one token (IR) and then normalize them, since this information may be irrelevant for the task (sentiment analysis). Expressions like these are:\n\n\n-   Dates, such as ``Feb 18th``, ``December 2, 2016``, ``December 2-2016``,\n    ``10/17/94``, ``3 December 2016``, ``April 25, 1995``, ``11.15.16``,\n    ``November 24th 2016``, ``January 21st``.\n-   Times, such as ``5:45pm``, ``11:36 AM``, ``2:45 pm``, ``5:30``.\n-   Currencies, such as ``$220M``, ``$2B``, ``$65.000``, ``€10``, ``$50K``.\n-   Phone numbers.\n-   URLs, such as ``http://www.cs.unipi.gr``, ``https://t.co/Wfw5Z1iSEt``.\n\n**Example**:\n\n```python\nimport nltk\nfrom ekphrasis.classes.tokenizer import SocialTokenizer\n\n\ndef wsp_tokenizer(text):\n    return text.split(\" \")\n\npuncttok = nltk.WordPunctTokenizer().tokenize\n\nsocial_tokenizer = SocialTokenizer(lowercase=False).tokenize\n\nsents = [\n    \"CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! #davidlynch #tvseries :)))\",\n    \"I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies \u003e3:/\",\n    \"@SentimentSymp:  can't wait for the Nov 9 #Sentiment talks!  YAAAAAAY !!! \u003e:-D http://sentimentsymposium.com/.\",\n]\n\nfor s in sents:\n    print()\n    print(\"ORG: \", s)  # original sentence\n    print(\"WSP : \", wsp_tokenizer(s))  # whitespace tokenizer\n    print(\"WPU : \", puncttok(s))  # WordPunct tokenizer\n    print(\"SC : \", social_tokenizer(s))  # social tokenizer\n\n```\n\nOutput:\n\n```\nORG:  CANT WAIT for the new season of #TwinPeaks ＼(^o^)／ yaaaay!!! 
#davidlynch #tvseries :)))\nWSP :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay!!!', '#davidlynch', '#tvseries', ':)))']\nWPU :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#', 'TwinPeaks', '＼(^', 'o', '^)／', 'yaaaay', '!!!', '#', 'davidlynch', '#', 'tvseries', ':)))']\nSC :  ['CANT', 'WAIT', 'for', 'the', 'new', 'season', 'of', '#TwinPeaks', '＼(^o^)／', 'yaaaay', '!', '!', '!', '#davidlynch', '#tvseries', ':)))']\n\nORG:  I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies \u003e3:/\nWSP :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks!!!', 'WAISTED', '$10...', '#badmovies', '\u003e3:/']\nWPU :  ['I', 'saw', 'the', 'new', '#', 'johndoe', 'movie', 'and', 'it', 'suuuuucks', '!!!', 'WAISTED', '$', '10', '...', '#', 'badmovies', '\u003e', '3', ':/']\nSC :  ['I', 'saw', 'the', 'new', '#johndoe', 'movie', 'and', 'it', 'suuuuucks', '!', '!', '!', 'WAISTED', '$10', '.', '.', '.', '#badmovies', '\u003e', '3:/']\n```\n\n\n\n\u003c!-- \n\n---\n_Ekphrasis_ means expression in Greek (Modern Greek:έκφραση, Ancient Greek:ἔκφρασις). \n relies on Regular Expression for the text tokenization.\n\n --\u003e\n\n#### References\n\n[1] K. Gimpel et al., “Part-of-speech tagging for twitter: Annotation, features, and experiments,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, 2011, pp. 42–47.\n\n[2] C. Potts, “Sentiment Symposium Tutorial: Tokenizing,” Sentiment Symposium Tutorial, 2011. [Online]. Available: http://sentiment.christopherpotts.net/tokenizing.html.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbaziotis%2Fekphrasis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcbaziotis%2Fekphrasis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcbaziotis%2Fekphrasis/lists"}