{"id":13528217,"url":"https://github.com/twitter/twitter-korean-text","last_synced_at":"2025-07-09T22:43:05.867Z","repository":{"id":22599685,"uuid":"25941738","full_name":"twitter/twitter-korean-text","owner":"twitter","description":"Korean tokenizer","archived":false,"fork":false,"pushed_at":"2023-04-10T11:33:16.000Z","size":28942,"stargazers_count":859,"open_issues_count":21,"forks_count":175,"subscribers_count":171,"default_branch":"master","last_synced_at":"2025-07-05T23:02:31.417Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/twitter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-10-29T21:16:33.000Z","updated_at":"2025-07-04T06:58:58.000Z","dependencies_parsed_at":"2022-07-12T16:07:29.413Z","dependency_job_id":"117289fa-df77-447a-ac89-53c985388fdf","html_url":"https://github.com/twitter/twitter-korean-text","commit_stats":null,"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"purl":"pkg:github/twitter/twitter-korean-text","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Ftwitter-korean-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Ftwitter-korean-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Ftwitter-korean-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Ftwitter-korean-text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/twitter","download_url":"https://c
odeload.github.com/twitter/twitter-korean-text/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Ftwitter-korean-text/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264504616,"owners_count":23618831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:02:19.723Z","updated_at":"2025-07-09T22:43:05.847Z","avatar_url":"https://github.com/twitter.png","language":"Scala","funding_links":[],"categories":["Scala","1. Tools","人工智能"],"sub_categories":["1.1. Morpheme/형태소 분석기 +  Part of Speech(PoS)/품사 Tagger"],"readme":"## twitter-korean-text [![Coverage Status](https://coveralls.io/repos/twitter/twitter-korean-text/badge.png)](https://coveralls.io/r/twitter/twitter-korean-text)\n[//]: # (Travis has been deactivated: [![Build Status](https://secure.travis-ci.org/twitter/twitter-korean-text.png?branch=master)](http://travis-ci.org/twitter/twitter-korean-text))\n  \n트위터에서 만든 오픈소스 한국어 처리기\n\n* 2017년 4.4 버전 이후의 개발은 http://openkoreantext.org 에서 진행됩니다. \n* We now started an official fork at http://openkoreantext.org as of early 2017. All the development after version 4.4 will be done in open-korean-text.\n\nScala/Java library to process Korean text with a Java wrapper. twitter-korean-text currently provides Korean normalization and tokenization. Please join our community at [Google Forum](https://groups.google.com/forum/#!forum/twitter-korean-text). The intent of this text processor is not limited to short tweet texts.\n\n스칼라로 쓰여진 한국어 처리기입니다. 
현재 텍스트 정규화와 형태소 분석, 스테밍을 지원하고 있습니다. 짧은 트윗은 물론이고 긴 글도 처리할 수 있습니다. 개발에 참여하시고 싶은 분은 [Google Forum](https://groups.google.com/forum/#!forum/twitter-korean-text)에 가입해 주세요. 사용법을 알고자 하시는 초보부터 코드에 참여하고 싶으신 분들까지 모두 환영합니다. \n\ntwitter-korean-text의 목표는 빅데이터 등에서 간단한 한국어 처리를 통해 색인어를 추출하는 데에 있습니다. 완전한 수준의 형태소 분석을 지향하지는 않습니다.\n\ntwitter-korean-text는 normalization, tokenization, stemming, phrase extraction 이렇게 네가지 기능을 지원합니다. \n\n\n**정규화 normalization (입니닼ㅋㅋ -\u003e 입니다 ㅋㅋ, 샤릉해 -\u003e 사랑해)**\n\n* 한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -\u003e 한국어를 처리하는 예시입니다 ㅋㅋ\n\n**토큰화 tokenization**\n\n* 한국어를 처리하는 예시입니다 ㅋㅋ -\u003e 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입Adjective, 니다Eomi ㅋㅋKoreanParticle\n\n**어근화 stemming (입니다 -\u003e 이다)**\n\n* 한국어를 처리하는 예시입니다 ㅋㅋ -\u003e 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle\n\n\n**어구 추출 phrase extraction** \n\n* 한국어를 처리하는 예시입니다 ㅋㅋ -\u003e 한국어, 처리, 예시, 처리하는 예시\n\nIntroductory Presentation: [Google Slides](https://docs.google.com/presentation/d/10CZj8ry03oCk_Jqw879HFELzOLjJZ0EOi4KJbtRSIeU/)\n\n\n## Try it here\n\nGunja Agrawal kindly created a test API webpage for this project: [http://gunjaagrawal.com/langhack/](http://gunjaagrawal.com/langhack/)\n\nGunja Agrawal님이 만들어주신 테스트 웹 페이지 입니다. 
\n[http://gunjaagrawal.com/langhack/](http://gunjaagrawal.com/langhack/)\n\nOpensourced here: [twitter-korean-tokenizer-api](https://github.com/gunjaag/twitter-korean-tokenizer-api)\n\n## API\n[scaladoc](http://twitter.github.io/twitter-korean-text/scaladocs/#com.twitter.penguin.korean.TwitterKoreanProcessor$)\n\n[mavendoc](http://twitter.github.io/twitter-korean-text)\n\n\n## Maven\nTo include this in your Maven-based JVM project, add the following lines to your pom.xml:\n\nMaven을 이용할 경우 pom.xml에 다음의 내용을 추가하시면 됩니다:\n\n```xml\n  \u003cdependency\u003e\n    \u003cgroupId\u003ecom.twitter.penguin\u003c/groupId\u003e\n    \u003cartifactId\u003ekorean-text\u003c/artifactId\u003e\n    \u003cversion\u003e4.4\u003c/version\u003e\n  \u003c/dependency\u003e\n```\n\nThe maven site is available here http://twitter.github.io/twitter-korean-text/ and scaladocs are here http://twitter.github.io/twitter-korean-text/scaladocs/\n\n## Support for other languages.\n### .net \n\n[modamoda](https://github.com/modamoda) kindly offered a .net wrapper: [https://github.com/modamoda/TwitterKoreanProcessorCS](https://github.com/modamoda/TwitterKoreanProcessorCS)\n\n### node.js \n\n[Ch0p](https://github.com/Ch0p) kindly offered a node.js wrapper: [twtkrjs](https://github.com/Ch0p/twtkrjs)\n\n[Youngrok Kim](https://github.com/rokoroku) kindly offered a node.js wrapper: [node-twitter-korean-text](https://github.com/rokoroku/node-twitter-korean-text)\n\n### Python \n\n[Baeg-il Kim](https://github.com/cedar101) kindly offered a Python version: https://github.com/cedar101/twitter-korean-py\n\n[Jaepil Jeong](https://github.com/jaepil) kindly offered a Python wrapper: https://github.com/jaepil/twkorean\n\n* Python Korean NLP project [KoNLPy](https://github.com/konlpy/konlpy) now includes twitter-korean-text. 파이썬에서 쉬운 활용이 가능한 [KoNLPy](https://github.com/konlpy/konlpy) 패키지에 twkorean이 포함되었습니다. 
\n\n### Ruby \n\n[jun85664396](https://github.com/jun85664396) kindly offered a Ruby wrapper: \n[twitter-korean-text-ruby](https://github.com/jun85664396/twitter-korean-text-ruby)\n* This provides access to com.twitter.penguin.korean.TwitterKoreanProcessorJava (Java wrapper).\n\n\n[Jaehyun Shin](https://github.com/keepcosmos) kindly offered a Ruby wrapper: \n[twitter-korean-text-ruby](https://github.com/keepcosmos/twitter-korean-text-ruby)\n* This provides access to com.twitter.penguin.korean.TwitterKoreanProcessor (Original Scala Class).\n\n### Elastic Search\n\n[socurites](https://github.com/socurites)'s Korean analyzer for elasticsearch based on twitter-korean-text: [tkt-elasticsearch](https://github.com/socurites/tkt-elasticsearch)\n\n\n## Get the source 소스를 원하시는 경우\n\nClone the git repo and build using maven.\n\nGit 전체를 클론하고 Maven을 이용하여 빌드합니다.\n\n```bash\ngit clone https://github.com/twitter/twitter-korean-text.git\ncd twitter-korean-text\nmvn compile\n```\n\nOpen 'pom.xml' from your favorite IDE.\n\n## Usage 사용 방법\n\nYou can find these [examples](examples) in examples folder.\n\n[examples](examples) 폴더에 사용 방법 예제 파일이 있습니다. 
\n\nfrom Scala\n```scala\nimport com.twitter.penguin.korean.TwitterKoreanProcessor\nimport com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor.KoreanPhrase\nimport com.twitter.penguin.korean.tokenizer.KoreanTokenizer.KoreanToken\n\nobject ScalaTwitterKoreanTextExample {\n  def main(args: Array[String]) {\n    val text = \"한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어\"\n\n    // Normalize\n    val normalized: CharSequence = TwitterKoreanProcessor.normalize(text)\n    println(normalized)\n    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어\n\n    // Tokenize\n    val tokens: Seq[KoreanToken] = TwitterKoreanProcessor.tokenize(normalized)\n    println(tokens)\n    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))\n\n    // Stemming\n    val stemmed: Seq[KoreanToken] = TwitterKoreanProcessor.stem(tokens)\n\n    println(stemmed)\n    // List(한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4))\n\n    // Phrase extraction\n    val phrases: Seq[KoreanPhrase] = TwitterKoreanProcessor.extractPhrases(tokens, filterSpam = true, enableHashtags = true)\n    println(phrases)\n    // List(한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4))\n  }\n}\n```\n\nfrom Java\n```java\nimport java.util.List;\n\nimport scala.collection.Seq;\n\nimport com.twitter.penguin.korean.TwitterKoreanProcessor;\nimport com.twitter.penguin.korean.TwitterKoreanProcessorJava;\nimport com.twitter.penguin.korean.phrase_extractor.KoreanPhraseExtractor;\nimport com.twitter.penguin.korean.tokenizer.KoreanTokenizer;\n\npublic class JavaTwitterKoreanTextExample {\n  public static void main(String[] args) {\n    String text = \"한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ #한국어\";\n\n    // Normalize\n    
CharSequence normalized = TwitterKoreanProcessorJava.normalize(text);\n    System.out.println(normalized);\n    // 한국어를 처리하는 예시입니다ㅋㅋ #한국어\n\n\n    // Tokenize\n    Seq\u003cKoreanTokenizer.KoreanToken\u003e tokens = TwitterKoreanProcessorJava.tokenize(normalized);\n    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(tokens));\n    // [한국어, 를, 처리, 하는, 예시, 입니, 다, ㅋㅋ, #한국어]\n    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(tokens));\n    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하는(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 입니(Adjective: 12, 2), 다(Eomi: 14, 1), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]\n\n\n    // Stemming\n    Seq\u003cKoreanTokenizer.KoreanToken\u003e stemmed = TwitterKoreanProcessorJava.stem(tokens);\n    System.out.println(TwitterKoreanProcessorJava.tokensToJavaStringList(stemmed));\n    // [한국어, 를, 처리, 하다, 예시, 이다, ㅋㅋ, #한국어]\n    System.out.println(TwitterKoreanProcessorJava.tokensToJavaKoreanTokenList(stemmed));\n    // [한국어(Noun: 0, 3), 를(Josa: 3, 1),  (Space: 4, 1), 처리(Noun: 5, 2), 하다(Verb: 7, 2),  (Space: 9, 1), 예시(Noun: 10, 2), 이다(Adjective: 12, 3), ㅋㅋ(KoreanParticle: 15, 2),  (Space: 17, 1), #한국어(Hashtag: 18, 4)]\n\n\n    // Phrase extraction\n    List\u003cKoreanPhraseExtractor.KoreanPhrase\u003e phrases = TwitterKoreanProcessorJava.extractPhrases(tokens, true, true);\n    System.out.println(phrases);\n    // [한국어(Noun: 0, 3), 처리(Noun: 5, 2), 처리하는 예시(Noun: 5, 7), 예시(Noun: 10, 2), #한국어(Hashtag: 18, 4)]\n\n  }\n}\n```\n\n\n## Basics\n\n[TwitterKoreanProcessor.scala](src/main/scala/com/twitter/penguin/korean/TwitterKoreanProcessor.scala) is the central object that provides the interface for all the features.\n\n[TwitterKoreanProcessor.scala](src/main/scala/com/twitter/penguin/korean/TwitterKoreanProcessor.scala)에 지원하는 모든 기능을 모아 두었습니다. 
\n\n\n## Running Tests\n\n`mvn test` will run our unit tests.\n\n모든 유닛 테스트를 실행하려면 `mvn test`를 이용해 주세요.\n\n\n## Tools\n\nWe provide tools for quality assurance and test resources. They can be found under [src/main/scala/com/twitter/penguin/korean/qa](src/main/scala/com/twitter/penguin/korean/qa) and [src/main/scala/com/twitter/penguin/korean/tools](src/main/scala/com/twitter/penguin/korean/tools).\n\n \n## Contribution\n\nRefer to the [general contribution guide](CONTRIBUTING.md). A project-specific contribution guide will be added later.\n\n[설치 및 수정하는 방법 상세 안내](docs/contribution-guide.md)\n\n\n## Performance 처리 속도\n\nTested on Intel i7 2.3 GHz\n\nInitial loading time (초기 로딩 시간): 2~4 sec\n\nAverage time to parse a chunk (평균 어절 처리 시간): 0.12 ms\n\n\n**Tweets (Avg length ~50 chars)**\n\nTweets|100K|200K|300K|400K|500K|600K|700K|800K|900K|1M\n---|---|---|---|---|---|---|---|---|---|---\nTime in Seconds|57.59|112.09|165.05|218.11|270.54|328.52|381.09|439.71|492.94|542.12\n\nAverage per tweet: 0.54212 ms\n\n**Benchmark test by [KoNLPy](http://konlpy.org/)**\n\n![Benchmark test](http://konlpy.org/ko/v0.4.2/_images/time.png)\n\nFrom [http://konlpy.org/ko/v0.4.2/morph/](http://konlpy.org/ko/v0.4.2/morph/)\n\n\n## Author(s)\n\n* Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin\n\n## License\n\nCopyright 2014 Twitter, Inc.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Ftwitter-korean-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwitter%2Ftwitter-korean-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Ftwitter-korean-text/lists"}