https://github.com/sircamp/spark-pspectrum
P-spectrum embedding and sequence relaxation for NLP in Spark
- Host: GitHub
- URL: https://github.com/sircamp/spark-pspectrum
- Owner: sirCamp
- License: apache-2.0
- Created: 2019-08-06T13:09:36.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-18T08:07:12.000Z (about 5 years ago)
- Last Synced: 2024-11-19T13:46:05.680Z (3 months ago)
- Topics: big-data, machine-learning, nlp, nlp-machine-learning, sequence-relaxation, spark, spark-ml, spectrum
- Language: Scala
- Homepage:
- Size: 87.9 KB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Pspectrum and Sequence Relaxation for Apache Spark
[![Build Status](https://travis-ci.com/sirCamp/spark-pspectrum.svg?branch=master)](https://travis-ci.com/sirCamp/spark-pspectrum)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Scala](https://img.shields.io/badge/scala-v2.11.12-blue)](https://img.shields.io/badge/scala-v2.11.12-blue)

This repository provides a package that computes the PSpectrum character embedding on Apache Spark.
It also contains an implementation of string relaxation for Apache Spark.
In other words, it is a big data implementation of PSpectrum kernel encoding and string relaxation.

# Usage
## PSpectrum
```scala
val pspectrumEstimator = new PSpectrum()
pspectrumEstimator.setP(3) // set the degree p of the spectrum
pspectrumEstimator.setInputCol(inputCol) // input column; must be a string column
pspectrumEstimator.setOutputCol(outputCol) // output column; holds a VectorUDT encoded as a SparseVector of Double

// Fit the estimator on the dataset, then apply the fitted model to it
val pspectrumModel = pspectrumEstimator.fit(dataset)
val transformedDataset = pspectrumModel.transform(dataset)
```
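
To make the degree parameter concrete, here is a minimal plain-Scala sketch (no Spark) of what a p-spectrum contains: the count of every contiguous substring of length `p`. The object `PSpectrumSketch` and its helpers `pSpectrumCounts` and `spectrumKernel` are hypothetical names used only for illustration. They are not part of this package, and how the package indexes substrings into its `SparseVector` is not shown here.

```scala
object PSpectrumSketch {
  // Hypothetical helper: count every contiguous substring of length p in s.
  // Strings shorter than p have no such substring, so the result is empty.
  def pSpectrumCounts(s: String, p: Int): Map[String, Int] = {
    require(p > 0, "p must be positive")
    if (s.length < p) Map.empty[String, Int]
    else s.sliding(p).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  }

  // Hypothetical helper: the p-spectrum kernel of two strings is the dot
  // product of their substring-count maps.
  def spectrumKernel(a: String, b: String, p: Int): Int = {
    val ca = pSpectrumCounts(a, p)
    val cb = pSpectrumCounts(b, p)
    ca.map { case (k, n) => n * cb.getOrElse(k, 0) }.sum
  }
}
```

For example, with `p = 3` the string `"banana"` yields the substrings `ban`, `ana`, `nan`, `ana`, so `ana` has count 2 and the other two have count 1.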
## SequenceRelaxation
```scala
import scala.collection.mutable

val characterEncoding = new mutable.HashMap[String, String]()
val upper_consonant_chars = Array("B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "X", "Y", "Z")
val lower_consonant_chars = Array("b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "y", "z")
val upper_vowel_chars = Array("A", "E", "I", "O", "U")
val lower_vowel_chars = Array("a", "e", "i", "o", "u")
val number_chars = Array("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
val symbol_chars = Array("!", "\"", "£", "$", "%", "&", "/", "(", ")", "=", "?", "^", "§", "<", ">", ".", ":", ",", ";", "-", "_", "+", "*", "[", "]", "#")

// Map every character of each class to a single class symbol
upper_consonant_chars.foreach(char => {
  characterEncoding.put(char, "C")
})
lower_consonant_chars.foreach(char => {
  characterEncoding.put(char, "c")
})
upper_vowel_chars.foreach(char => {
  characterEncoding.put(char, "V")
})
lower_vowel_chars.foreach(char => {
  characterEncoding.put(char, "v")
})
number_chars.foreach(char => {
  characterEncoding.put(char, "0")
})
symbol_chars.foreach(char => {
  characterEncoding.put(char, "+")
})

val sequenceRelaxation = new SequenceRelaxation()
sequenceRelaxation.setCharacterEncoding(characterEncoding) // set the character-to-class mapping
sequenceRelaxation.setInputCol(inputCol) // input column; must be a string column
sequenceRelaxation.setOutputCol(outputCol) // output column; holds a VectorUDT encoded as a SparseVector of Double

val transformedDataset = sequenceRelaxation.transform(dataset)
```
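
The effect of such a character-class mapping on a single string can be sketched in plain Scala (no Spark). The object `RelaxationSketch` and its members `charClass`, `encoding` and `relax` are hypothetical names for illustration, not this package's API. Leaving unmapped characters (such as spaces) unchanged is an assumption of the sketch; the package may treat them differently.

```scala
object RelaxationSketch {
  // Hypothetical helper: map every character in `chars` to one class symbol.
  def charClass(chars: String, symbol: String): Map[String, String] =
    chars.map(c => c.toString -> symbol).toMap

  // A reduced version of the encoding built above (fewer symbol characters).
  val encoding: Map[String, String] =
    charClass("BCDFGHJKLMNPQRSTVWXYZ", "C") ++
      charClass("bcdfghjklmnpqrstvwxyz", "c") ++
      charClass("AEIOU", "V") ++
      charClass("aeiou", "v") ++
      charClass("0123456789", "0") ++
      charClass("!?.,;:-_+*", "+")

  // Replace each character by its class symbol.
  // Assumption: characters without a mapping are kept unchanged.
  def relax(s: String, enc: Map[String, String]): String =
    s.map(c => enc.getOrElse(c.toString, c.toString)).mkString
}
```

With this encoding, `RelaxationSketch.relax("Hello, World 42!", RelaxationSketch.encoding)` gives `"Cvccv+ Cvccc 00+"`: words with the same consonant/vowel shape collapse to the same relaxed string.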