Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/nkottary/corpustools

Julia library for corpus linguistics
https://github.com/nkottary/corpustools

Last synced: 27 days ago
JSON representation

Julia library for corpus linguistics

Awesome Lists containing this project

README

        

Library for corpus linguistics with Julia.

-----------------------------------------

Some functions:

THIS IS WORK IN PROGRESS. IT IS FAR FROM COMPLETE.

The type Files(path, filter=r"") creates a "file" type that will include either a single file if "path" referes to a single file, or a list of files if "path" is a directory. It will also look recursively. If specified, "filter" will filter all results to those containin the string.

The two main functions for exploring the corpora are grep_count(patter, texts, overlap=false) and grep_context(pattern, texts, context=5, sep="\t"), both functions can take regular expressions and strings as the patters and they can take single strings, arrays of strings, or File types as the texts to look in. In general, giving a string as the pattern will have better performance than a regular expression would. grep_count() returns the number of occurences of the particular pattern in the text(s), while grep_context() returns the contexts on which the patter occurs in the text(s).

By default, grep_context() and grep_count() if used with File type will read the files one by one. This can be slightly less efficient than having the files already loaded in memory, so you can instead use load_files(files::Files) to load them to a variable and have better performance. This is not a good idea if the corpus is too big though.

The function collocations(word::String, text; test = "deltap", span=1, lower=true) finds all the collocations of the node "word" in "text" (String or Array{String}, Files is not implemented yet). "lower" converts to lower case all words in "text" by default. "span" specifies how many words away from the node it should look for collocations (the sapn). The function returns a matrix with the collocations, a Chi Squared value, direction of attraction and odds for each collocate of the word.