https://github.com/sircamp/spark-pspectrum

P-spectrum embedding and sequence relaxation for NLP in Spark
big-data machine-learning nlp nlp-machine-learning sequence-relaxation spark spark-ml spectrum

Last synced: 4 months ago
Host: GitHub
URL: https://github.com/sircamp/spark-pspectrum
Owner: sirCamp
License: apache-2.0
Created: 2019-08-06T13:09:36.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2019-11-18T08:07:12.000Z (over 5 years ago)
Last Synced: 2025-01-20T16:32:22.527Z (6 months ago)
Topics: big-data, machine-learning, nlp, nlp-machine-learning, sequence-relaxation, spark, spark-ml, spectrum
Language: Scala
Homepage:
Size: 87.9 KB
Stars: 3
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# PSpectrum and Sequence Relaxation for Apache Spark

[![Build Status](https://travis-ci.com/sirCamp/spark-pspectrum.svg?branch=master)](https://travis-ci.com/sirCamp/spark-pspectrum)

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

[![Scala](https://img.shields.io/badge/scala-v2.11.12-blue)](https://img.shields.io/badge/scala-v2.11.12-blue)

This package provides a P-spectrum (PSpectrum) computation for character-level embedding, implemented for Apache Spark. A p-spectrum encoding represents a string by the counts of its contiguous substrings of length p.

The package also provides an implementation of string (sequence) relaxation for Apache Spark, which replaces each character of a string with the symbol of its character class.

In other words, this is a big-data implementation of P-spectrum kernel encoding and string relaxation.

# Usage

## PSpectrum

```scala

// inputCol and outputCol are column names; dataset is the Spark Dataset to encode
val pspectrumEstimator = new PSpectrum()
pspectrumEstimator.setP(3) // set the degree p of the spectrum
pspectrumEstimator.setInputCol(inputCol) // set the input column, which must be a string column
pspectrumEstimator.setOutputCol(outputCol) // set the output column; it holds a VectorUDT value encoded as a SparseVector of Double

val pspectrumModel = pspectrumEstimator.fit(dataset)
val transformedDataset = pspectrumModel.transform(dataset)

```
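
To illustrate what a p-spectrum encoding computes, here is a minimal plain-Scala sketch that does not use Spark or this package. The object `PSpectrumSketch` and its methods `spectrum` and `kernel` are hypothetical names introduced for illustration only. The README does not show how the package indexes substrings into positions of the output vector, so a map from substring to count stands in for the sparse vector here.

```scala
// Plain-Scala sketch of a p-spectrum count map (hypothetical helper, not the package's API).
object PSpectrumSketch {
  // Count every contiguous substring of length p in the input string.
  def spectrum(s: String, p: Int): Map[String, Int] = {
    require(p > 0, "p must be positive")
    // Strings shorter than p contain no substring of length p.
    if (s.length < p) Map.empty[String, Int]
    else s.sliding(p).toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
  }

  // The p-spectrum kernel of two strings is the dot product of their count maps.
  def kernel(a: String, b: String, p: Int): Int = {
    val sa = spectrum(a, p)
    val sb = spectrum(b, p)
    sa.iterator.map { case (k, n) => n * sb.getOrElse(k, 0) }.sum
  }
}
```

For example, `PSpectrumSketch.spectrum("banana", 3)` counts `ban` once, `ana` twice and `nan` once, and `PSpectrumSketch.kernel("banana", "bandana", 3)` is 3 because the two strings share `ban` (1 × 1) and `ana` (2 × 1).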

## SequenceRelaxation

```scala

import scala.collection.mutable

// Map every character to the symbol of its character class
val characterEncoding = new mutable.HashMap[String, String]()

val upperConsonantChars = Array("B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "X", "Y", "Z")
val lowerConsonantChars = Array("b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "y", "z")
val upperVowelChars = Array("A", "E", "I", "O", "U")
val lowerVowelChars = Array("a", "e", "i", "o", "u")
val numberChars = Array("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")
val symbolChars = Array("!", "\"", "£", "$", "%", "&", "/", "(", ")", "=", "?", "^", "§", "<", ">", ".", ":", ",", ";", "-", "_", "+", "*", "[", "]", "#")

upperConsonantChars.foreach(char => characterEncoding.put(char, "C")) // upper-case consonants
lowerConsonantChars.foreach(char => characterEncoding.put(char, "c")) // lower-case consonants
upperVowelChars.foreach(char => characterEncoding.put(char, "V")) // upper-case vowels
lowerVowelChars.foreach(char => characterEncoding.put(char, "v")) // lower-case vowels
numberChars.foreach(char => characterEncoding.put(char, "0")) // digits
symbolChars.foreach(char => characterEncoding.put(char, "+")) // symbols

val sequenceRelaxation = new SequenceRelaxation()
sequenceRelaxation.setCharacterEncoding(characterEncoding) // set the character-to-class mapping
sequenceRelaxation.setInputCol(inputCol) // set the input column, which must be a string column
sequenceRelaxation.setOutputCol(outputCol) // set the output column that receives the relaxed sequence

val transformedDataset = sequenceRelaxation.transform(dataset)

```
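
The effect of a character-class mapping like the one above can be sketched on a single string in plain Scala, without Spark or this package. The object `RelaxationSketch` and its members are hypothetical names, and the sketch assumes that characters without a mapping (such as spaces) are kept unchanged. The README does not say how the package treats them.

```scala
// Plain-Scala sketch of sequence relaxation (hypothetical helper, not the package's API).
object RelaxationSketch {
  // Pair every character of `chars` with the symbol of its class.
  private def charClass(chars: String, symbol: String): List[(String, String)] =
    chars.iterator.map(c => c.toString -> symbol).toList

  // A reduced mapping with the same class symbols as the README example.
  val encoding: Map[String, String] = (
    charClass("BCDFGHJKLMNPQRSTVWXYZ", "C") ++
      charClass("bcdfghjklmnpqrstvwxyz", "c") ++
      charClass("AEIOU", "V") ++
      charClass("aeiou", "v") ++
      charClass("0123456789", "0") ++
      charClass("!,.", "+")
  ).toMap

  // Replace each character by the symbol of its class.
  // Assumption: characters without a mapping are kept unchanged.
  def relax(s: String, enc: scala.collection.Map[String, String]): String =
    s.iterator.map(c => enc.getOrElse(c.toString, c.toString)).mkString
}
```

With this mapping, `RelaxationSketch.relax("Hello, World 42!", RelaxationSketch.encoding)` yields `Cvccv+ Cvccc 00+`: strings that differ in their letters but share the same consonant/vowel/digit pattern relax to the same sequence.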
