https://github.com/nkottary/corpustools

Julia library for corpus linguistics
https://github.com/nkottary/corpustools

Last synced: 29 days ago
JSON representation

Julia library for corpus linguistics

Host: GitHub
URL: https://github.com/nkottary/corpustools
Owner: nkottary
License: other
Created: 2016-01-24T07:28:20.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2015-06-04T21:50:44.000Z (almost 10 years ago)
Last Synced: 2025-02-02T15:30:05.457Z (3 months ago)
Language: Julia
Size: 3.39 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README
- License: LICENSE.md

Awesome Lists containing this project

README

        
Library for corpus linguistics with Julia.

-----------------------------------------

Some functions:

THIS IS WORK IN PROGRESS. IT IS FAR FROM COMPLETE.

The type Files(path, filter=r"") creates a "file" type that will include either a single file if "path" referes to a single file, or a list of files if "path" is a directory. It will also look recursively. If specified, "filter" will filter all results to those containin the string.

The two main functions for exploring the corpora are grep_count(patter, texts, overlap=false) and grep_context(pattern, texts, context=5, sep="\t"), both functions can take regular expressions and strings as the patters and they can take single strings, arrays of strings, or File types as the texts to look in. In general, giving a string as the pattern will have better performance than a regular expression would. grep_count() returns the number of occurences of the particular pattern in the text(s), while grep_context() returns the contexts on which the patter occurs in the text(s).

By default, grep_context() and grep_count() if used with File type will read the files one by one. This can be slightly less efficient than having the files already loaded in memory, so you can instead use load_files(files::Files) to load them to a variable and have better performance. This is not a good idea if the corpus is too big though.

The function collocations(word::String, text; test = "deltap", span=1, lower=true) finds all the collocations of the node "word" in "text" (String or Array{String}, Files is not implemented yet).  "lower" converts to lower case all words in "text" by default. "span" specifies how many words away from the node it should look for collocations (the sapn). The function returns a matrix with the collocations, a Chi Squared value, direction of attraction and odds for each collocate of the word.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nkottary/corpustools

Awesome Lists containing this project

README