{"id":20020261,"url":"https://github.com/multivacplatform/multivac-nlp","last_synced_at":"2026-05-11T11:40:30.942Z","repository":{"id":91345017,"uuid":"114240355","full_name":"multivacplatform/multivac-nlp","owner":"multivacplatform","description":"Testing and benchmarking some of the existing NLP libraries in Apache Spark","archived":false,"fork":false,"pushed_at":"2019-01-11T16:19:29.000Z","size":12620,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-12T16:11:30.599Z","etag":null,"topics":["nlp","spark","spark-ml","spark-mllib","spark-nlp","spark-sql","stanford-corenlp","word2vec"],"latest_commit_sha":null,"homepage":"https://multivac.iscpif.fr","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/multivacplatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-14T11:06:16.000Z","updated_at":"2019-01-11T16:19:31.000Z","dependencies_parsed_at":"2024-04-22T05:47:49.785Z","dependency_job_id":null,"html_url":"https://github.com/multivacplatform/multivac-nlp","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-nlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-nlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-nlp/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/multivacplatform%2Fmultivac-nlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/multivacplatform","download_url":"https://codeload.github.com/multivacplatform/multivac-nlp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241454266,"owners_count":19965341,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","spark","spark-ml","spark-mllib","spark-nlp","spark-sql","stanford-corenlp","word2vec"],"created_at":"2024-11-13T08:30:46.420Z","updated_at":"2025-11-26T11:05:12.429Z","avatar_url":"https://github.com/multivacplatform.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multivac-nlp [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/multivacplatform/es-punchcard/blob/master/LICENSE.md) [![Build Status](https://travis-ci.org/multivacplatform/multivac-nlp.svg?branch=master)](https://travis-ci.org/multivacplatform/multivac-nlp) [![multivac discuss](https://img.shields.io/badge/multivac-discuss-ff69b4.svg)](https://discourse.iscpif.fr/c/multivac) [![multivac channel](https://img.shields.io/badge/multivac-chat-ff69b4.svg)](https://chat.iscpif.fr/channel/multivac)\n\nTesting and benchmarking some of the existing NLP libraries in Apache Spark\n\n## NLP libraries used in Multivac-NLP\n### spark-nlp\nLatest version: [https://github.com/JohnSnowLabs/spark-nlp](https://github.com/JohnSnowLabs/spark-nlp)\n### Stanford-CoreNLP\nCoreNLP 3.7: [https://github.com/stanfordnlp/CoreNLP](https://github.com/stanfordnlp/CoreNLP)\n\n\n## Functions\n### 
Word2Vec\nTraining Spark ML Word2Vec:\n\n#### Wiki News\n\n* articles: 19750\n* unique tokens: 145349\n* total tokens: 8070537\n\n```\nfindSynonyms: London\n\n+----------+------------------+\n|word      |similarity        |\n+----------+------------------+\n|cologne   |0.8254308104515076|\n|glasgow   |0.820296585559845 |\n|londons   |0.7977003455162048|\n|birmingham|0.7859082818031311|\n+----------+------------------+\n\nfindSynonyms: France\n\n+-----------+------------------+\n|word       |similarity        |\n+-----------+------------------+\n|leipheimer |0.8519462943077087|\n|levi       |0.8423983454704285|\n|spain      |0.8412571549415588|\n|netherlands|0.8346607685089111|\n+-----------+------------------+\n\nfindSynonyms: Monday\n\n+---------+------------------+\n|word     |similarity        |\n+---------+------------------+\n|thursday |0.9608868956565857|\n|wednesday|0.9582304954528809|\n|friday   |0.951666533946991 |\n|tuesday  |0.9416679739952087|\n+---------+------------------+\n```\n\n#### French Political Tweets\n\n* tweets: 18716\n* unique tokens: 53268\n* total tokens: 385948\n\n```\nfindSynonyms: Lundi\n\n+--------+------------------+\n|word    |similarity        |\n+--------+------------------+\n|dcembre |0.5641320943832397|\n|mercredi|0.5640853643417358|\n|vendredi|0.5532589554786682|\n|samedi  |0.5358499884605408|\n+--------+------------------+\n```\n\n\n\n## Environment\n\n* Spark 2.2 Local / IntelliJ\n* Spark 2.2 / Cloudera CDH 5.13 / YARN (cluster - client)\n\n## Code of Conduct\n\nThis, and all github.com/multivacplatform projects, are under the [Multivac Platform Open Source Code of Conduct](https://github.com/multivacplatform/code-of-conduct/blob/master/code-of-conduct.md). Additionally, see the [Typelevel Code of Conduct](http://typelevel.org/conduct) for specific examples of harassing behavior that are not tolerated.\n\n## Copyright and License\n\nCode and documentation copyright (c) 2017-2019 [ISCPIF - CNRS](http://iscpif.fr). 
Code released under the [MIT license](https://github.com/multivacplatform/multivac-nlp/blob/master/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultivacplatform%2Fmultivac-nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultivacplatform%2Fmultivac-nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultivacplatform%2Fmultivac-nlp/lists"}