https://github.com/thjbdvlt/french-word-vectors
word vectors for french
- Host: GitHub
- URL: https://github.com/thjbdvlt/french-word-vectors
- Owner: thjbdvlt
- License: other
- Created: 2024-07-19T17:24:12.000Z (5 months ago)
- Default Branch: sea
- Last Pushed: 2024-08-23T10:24:03.000Z (4 months ago)
- Last Synced: 2024-11-30T20:14:25.474Z (22 days ago)
- Topics: french, gensim, nlp, word2vec, wordembeddings, wordvectors
- Homepage:
- Size: 18.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
__word vectors__ for french.
vectors are trained with [Gensim](https://radimrehurek.com/gensim/) on a patchwork corpus (22 million sentences, 500 million tokens), using the [word2vec](https://radimrehurek.com/gensim/models/word2vec.html) algorithm (CBOW), and have 100 dimensions.
there are two models: one is trained on the whole corpus (with minimal preprocessing); the other is trained on the lemmatized corpus (words only; see [lemmatization](#lemmatization) below).
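
as an illustration of what CBOW means here: the model learns to predict each word from the words around it. the sketch below only builds the (context, target) pairs such a model would see; it is not the training code used for these vectors, and the function name `cbow_pairs`, the window size and the sample sentence are assumptions made for the example.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context, target) pairs: CBOW predicts the target word
    from the words found up to `window` positions around it."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            yield context, target


# hypothetical, already tokenized sentence
sentence = ["elle", "aime", "lire", "des", "livres"]
pairs = list(cbow_pairs(sentence, window=2))

for context, target in pairs:
    print(context, "->", target)
```

in Gensim, CBOW corresponds to `sg=0` and the dimensionality to `vector_size=100` in `Word2Vec`; the other hyperparameters used for these vectors are not given here.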
example
-------

```python
from gensim.models import KeyedVectors
import pprint

# load word vectors
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# most similar words
for mot in (
    "écrire", "lire", "semblable", "humaine", "nature",
):
    print(mot.upper())
    pprint.pprint(wv.most_similar(mot))
```

```txt
ÉCRIRE
[('lire', 0.7958993911743164),
('interpréter', 0.7516734600067139),
('éditer', 0.7383493781089783),
('rédiger', 0.7355990409851074),
('apprendre', 0.7343490123748779),
('utiliser', 0.7281709313392639),
('inventer', 0.7118361592292786),
('employer', 0.7102519273757935),
('écouter', 0.7078548073768616),
('appeler', 0.702791690826416)]

LIRE
[('relire', 0.8466554880142212),
('regarder', 0.8015256524085999),
('écrire', 0.7958993315696716),
('consulter', 0.7852454781532288),
('publier', 0.7608461380004883),
('recopier', 0.7554864883422852),
('feuilleter', 0.7442073822021484),
('rédiger', 0.7388473153114319),
('poster', 0.7297687530517578),
('voir', 0.7152807712554932)]

SEMBLABLE
[('similaire', 0.8344558477401733),
('analogue', 0.8341560363769531),
('comparable', 0.799608051776886),
('identique', 0.7407772541046143),
('ressemblant', 0.738508403301239),
('apparentée', 0.6046147346496582),
('lié', 0.600321888923645),
('différente', 0.5880765318870544),
('assimilable', 0.5812517404556274),
('réfractaire', 0.5703239440917969)]

HUMAINE
[('animale', 0.7537525296211243),
('spirituelle', 0.7472118735313416),
('surnaturelle', 0.7453622817993164),
('immatérielle', 0.7149630188941956),
('réelle', 0.7088828086853027),
('innée', 0.7061097025871277),
('naturelle', 0.703719437122345),
('fondamentale', 0.6810879111289978),
('émotionnelle', 0.6754382252693176),
('sous-jacente', 0.6738153100013733)]

NATURE
[('faune', 0.6992893218994141),
('biodiversité', 0.6951753497123718),
('diversité', 0.6917913556098938),
('perception', 0.691189706325531),
('richesse', 0.6776712536811829),
('dignité', 0.6773139238357544),
('contemplation', 0.6590573191642761),
('matérialité', 0.657234251499176),
('réalité', 0.6533279418945312),
('fragilité', 0.651949942111969)]
```

corpus
------

the corpus is an aggregation of existing corpora (mostly from [Ortolang](https://www.ortolang.fr/fr/accueil/)) available under Creative Commons licenses, public domain books, free texts and material from wikipedia. i did some minimal preprocessing (removed lines, normalized characters, ...).
### linguistic corpora
- [WikiDisc](https://www.ortolang.fr/market/corpora/wikidisc), a corpus of Wikipedia discussions[^1].
- [Corpora Collection Leipzig](https://wortschatz.uni-leipzig.de/en/download/French), corpora for many languages and regions. i downloaded files (30K-1M sentences each) for the following regions: Belgique, Cameroun, Congo, Côte d'Ivoire, France, Luxembourg, Madagascar, Nouvelle Calédonie, Polynésie Française, Suisse, Togo[^2].
- [Corpus Reporterre](https://www.ortolang.fr/market/corpora/corpus-reporterre), corpus made of articles published in the newspaper Reporterre[^3].
- [CFPR, Corpus Français Parlé de nos Régions](https://cfpr.huma-num.fr/), oral corpus (i only used transcriptions).

[^1]: Lydia-Mai Ho-Dac, Veronika Laippala. _Le corpus WikiDisc : ressource pour la caractérisation des discussions en ligne_. Wigham, Ciara R.; Ledegen, Gudrun. Corpus de communication médiée par les réseaux : construction, structuration, analyse., l'Harmattan, pp.107-124, 2017, Humanités numériques, 978-2-343-11212-1.
[^2]: D. Goldhahn, T. Eckart & U. Quasthoff: _Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages_. In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012.
[^3]: Laboratoire de Linguistique et Didactique des Langues Etrangères et Maternelles - EA 609 (LIDILEM) (2024). _Corpus Reporterre_ [Corpus]. ORTOLANG (Open Resources and TOols for LANGuage) - www.ortolang.fr, v1, https://hdl.handle.net/11403/corpus-reporterre/v1.
### the wittgenstein project
books from [the wittgenstein project](https://www.wittgensteinproject.org/w/index.php?title=Main_Page), dedicated to the philosopher ludwig wittgenstein.
- [_conférence sur l'éthique_](https://www.wittgensteinproject.org/w/index.php/Une_conf%C3%A9rence_sur_l%E2%80%99Ethique) (1929)
- [_le cahier bleu_](https://www.wittgensteinproject.org/w/index.php/Blue_Book) and [_le cahier brun_](https://wittgensteinproject.org/w/index.php/Brown_Book) (automatic translation using [deepl](https://www.deepl.com/en/translator))

### les classiques des sciences sociales
books in public domain from [Les classiques des sciences sociales](http://classiques.uqac.ca/).
- marcel mauss, [_les techniques du corps_](https://archive.wikiwix.com/cache/index2.php?url=http%3A%2F%2Fclassiques.uqac.ca%2Fclassiques%2Fmauss_marcel%2Fsocio_et_anthropo%2F6_Techniques_corps%2FTechniques_corps.html#federation=archive.wikiwix.com&tab=url) (1934)
### wikisource
books from [wikisource](https://fr.wikisource.org/wiki/Wikisource:Accueil).
- marcel mauss
    - _essais de sociologie_ (1971)
    - _mélange d'histoire des religions_, with henri hubert (1909)
- simone weil
    - _la condition ouvrière_ (1951, written in 1934-1937)
    - _sur la science_ (1966, written in 1929-1942)
- jack london
    - _l'appel de la forêt_ (1903, transl. 1908)
    - _lettre au juge samuel_ (1910)
    - _construire un feu_ (1910, transl. 1924)
    - _le cabaret de la dernière chance_ (1913, transl. 1926)
    - _la peste écarlate_ (1915, transl. 1924)
    - _le vagabond des étoiles_ (1915, transl. 1925)
- rachilde
    - _refaire l'amour_ (1928)
    - _monsieur vénus_ (1884)
    - _la découverte de l'amérique_ (1919)
- léon tolstoi
    - _qu'est-ce que l'art_ (1898, transl. 1918)
    - _guerre et paix_ (1864-1869, transl. 1903-1904)
    - _anna karénine_ (1873-1877, transl. 1906-1908)
- marcel proust
    - _à la recherche du temps perdu_ (1913-1927, ed. 1946)
- émile durkheim
    - _la division du travail social_ (1893)
    - _sociologie et éducation_ (1922)
- michelle le normand
    - _les couleurs du temps_ (1919)
    - _autour de la maison_ (1916)
    - _enthousiasme_ (1947)
    - _la maison au phlox_ (1941)
    - _la plus belle chose du monde_ (1937)
    - _le nom dans le bronze_ (1933)
    - _la montagne d'hiver_ (1961)

### the anarchist library
anarchist books from [the anarchist library](https://theanarchistlibrary.org/special/index).
- peter gelderloos, [_l'anarchisme fonctionne_](https://fr.anarchistlibraries.net/library/peter-gelderloos-anarchie-fonctionne) (2010, translation by mikail marchand using deepl)
### framabook
books from framabook editions, released under CC-compatible free licenses (CC By, CC By-Sa or Art Libre).
- gee, [_sortilèges et syndicats_](https://archives.framabook.org/working-class-heroic-fantasy/index.html) (2018)
- pouhiou, [_smartarded_](https://archives.framabook.org/smartarded-le-cycle-des-noenautes-ii/index.html) (2012)
- stephane crozat, [_traces_](https://archives.framabook.org/traces/index.html) (2018)
- yann kervran, [_quit'a_](https://archives.framabook.org/qita_01/index.html), vol.1-4 (2020), and [_les enquetes d'ernaud_](https://archives.framabook.org/la-nef-des-loups/index.html), t.1-3 (2017)

### wikipedia
[wikipedia](https://fr.wikipedia.org/) articles.
- article [_donnée_](https://fr.wikipedia.org/wiki/Donn%C3%A9e)
- article [_logiciel libre_](https://fr.wikipedia.org/wiki/Logiciel_libre)
- article [_copier-coller_](https://fr.wikipedia.org/wiki/Copier-coller)
- article [_commun_](https://fr.wikipedia.org/wiki/Communs)
- extracts from the articles [_football_](https://fr.wikipedia.org/wiki/Football) and [_lois du jeu_](https://fr.wikipedia.org/wiki/Lois_du_jeu)

lemmatization
-------------

the lemmatization was done using the [spacy](https://spacy.io/) library.
the [pipeline](https://spacy.io/usage/processing-pipelines) proceeds as follows (i wrote all components):

- tokenization with [quelquhui](https://github.com/thjbdvlt/quelquhui);
- normalization with [presque](https://github.com/thjbdvlt/presque);
- morphologization with [turlututu](https://github.com/thjbdvlt/turlututu);
- lemmatization with [viceverser](https://github.com/thjbdvlt/viceverser).

use with spacy
--------------

to use the vectors with [spacy](https://spacy.io/), one needs to convert the vectors to text format.
```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('model.bin', binary=True)
wv.save_word2vec_format('model.word2vec', binary=False)
```

create the vectors for a pipeline from file:
```bash
spacy init vectors fr model.word2vec vectors
```
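
for reference, the text format written by `save_word2vec_format(..., binary=False)` is a plain-text file: a header line with the number of words and the number of dimensions, then one line per word followed by its values. a minimal reader, as a sketch (the function name `read_word2vec_text` and the sample data are made up for the example; real files are better read with Gensim or spacy):

```python
import io


def read_word2vec_text(stream):
    """Read word2vec text format from a text stream.

    First line: '<number of words> <dimensions>'.
    Following lines: '<word> <value 1> ... <value n>'.
    """
    header = stream.readline().split()
    vocab_size, dims = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        if not line.strip():
            continue  # ignore blank lines
        # split from the right so that only the last `dims` fields are values
        parts = line.rstrip().rsplit(" ", dims)
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of vectors read")
    return vectors


# toy file with 2 words and 3 dimensions (made-up values)
sample = "2 3\nlire 0.1 0.2 0.3\nécrire 0.4 0.5 0.6\n"
vectors = read_word2vec_text(io.StringIO(sample))
print(len(vectors), vectors["écrire"])
```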