https://github.com/octachron/locc

Linguistic Ocaml Comment Classifier

Host: GitHub
URL: https://github.com/octachron/locc
Owner: Octachron
Created: 2015-09-23T19:44:05.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2015-09-23T19:46:06.000Z (over 10 years ago)
Last Synced: 2025-02-28T20:47:11.447Z (about 1 year ago)
Language: OCaml
Size: 121 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: Readme.md

README

Locc is a helper tool that extracts comments from OCaml source code and
classifies them according to their primary language. The classification relies
on aspell dictionaries coupled with a maximum-likelihood estimator based on an
utterly simplistic statistical model.
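
The README does not describe how comments are extracted. As a rough illustration only, the sketch below collects top-level `(* ... *)` comments while tracking nesting. The function name `extract_comments` is hypothetical, and the sketch assumes that string literals never contain comment delimiters, which a real lexer would have to handle.

```ocaml
(* Hypothetical sketch: return the text of every top-level comment in [src].
   Nested comments are kept inside the text of their enclosing comment.
   Assumption: string literals containing "(*" or "*)" are not handled. *)
let extract_comments src =
  let n = String.length src in
  let buf = Buffer.create 64 in
  (* [depth] is the current nesting level, [acc] the finished comments
     in reverse order. *)
  let rec scan i depth acc =
    if i >= n then List.rev acc (* an unterminated comment is dropped *)
    else if i + 1 < n && src.[i] = '(' && src.[i + 1] = '*' then begin
      (* Opening delimiter: keep it only when already inside a comment. *)
      if depth > 0 then Buffer.add_string buf "(*";
      scan (i + 2) (depth + 1) acc
    end
    else if depth = 1 && i + 1 < n && src.[i] = '*' && src.[i + 1] = ')' then begin
      (* Closing delimiter of a top-level comment: emit its text. *)
      let text = String.trim (Buffer.contents buf) in
      Buffer.clear buf;
      scan (i + 2) 0 (text :: acc)
    end
    else if depth > 1 && i + 1 < n && src.[i] = '*' && src.[i + 1] = ')' then begin
      (* Closing delimiter of a nested comment: keep it in the text. *)
      Buffer.add_string buf "*)";
      scan (i + 2) (depth - 1) acc
    end
    else begin
      (* Ordinary character: recorded only inside a comment. *)
      if depth > 0 then Buffer.add_char buf src.[i];
      scan (i + 1) depth acc
    end
  in
  scan 0 0 []
```

Each returned string can then be split into words and looked up in the dictionaries.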

##Model

The statistical model assumes that

* There is no correlation between words, i.e. sentences are i.i.d. sequences
of words

* For a given primary language, every word belonging to a given language has the same probability

For instance, the default model is

```
primary \ secondary   french   english   unknown

french                0.8      0.1       0.1

english               0.1      0.8       0.1
```

In this model, we assume that within a text whose primary language is French,
there is a 10% probability that an English word appears and a 10% probability
that a word of unknown origin appears. Conversely, for a text whose primary
language is English, the model considers that a French word and an unknown word
each appear with a 10% probability.
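
Under these assumptions, the maximum-likelihood choice can be sketched as follows. This is a minimal illustration, not locc's actual code: the `category` and `row` types and the `classify` function are hypothetical names, and each word is assumed to have already been assigned a single category by a dictionary lookup.

```ocaml
(* Category assigned to a word after dictionary lookup (hypothetical type). *)
type category = French | English | Unknown

(* One row of the model: a primary language and the probability of meeting
   a word of each category in a text of that language. *)
type row = {
  name : string;
  p_french : float;
  p_english : float;
  p_unknown : float;
}

(* The default model from the table above. *)
let default_model = [
  { name = "french";  p_french = 0.8; p_english = 0.1; p_unknown = 0.1 };
  { name = "english"; p_french = 0.1; p_english = 0.8; p_unknown = 0.1 };
]

let prob row = function
  | French -> row.p_french
  | English -> row.p_english
  | Unknown -> row.p_unknown

(* Words are assumed independent, so the log-likelihood of a comment is the
   sum of the per-word log-probabilities. *)
let log_likelihood row words =
  List.fold_left (fun acc w -> acc +. log (prob row w)) 0. words

(* Name of the primary language with the highest log-likelihood.
   Ties go to the earliest row of the model. *)
let classify model words =
  match model with
  | [] -> invalid_arg "classify: empty model"
  | first :: rest ->
    let best, _ =
      List.fold_left
        (fun (best, best_ll) row ->
           let ll = log_likelihood row words in
           if ll > best_ll then (row, ll) else (best, best_ll))
        (first, log_likelihood first words)
        rest
    in
    best.name
```

For example, `classify default_model [French; French; English; Unknown]` picks `"french"`, since two French words outweigh a single English one.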

##Usage

```sh
locc -m model -o logs target
```
With this invocation, locc will analyze all the OCaml source files
(i.e. ".ml{,i,y,l}") present in `target`. If `target` is a directory, all the
files and sub-directories contained in `target` will be analyzed.
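
The file selection described above could look like the following sketch. The names `is_ocaml_source` and `collect` are hypothetical and only illustrate the ".ml{,i,y,l}" pattern and the recursive traversal; locc's own traversal may differ.

```ocaml
(* Suffixes matching the ".ml{,i,y,l}" pattern. *)
let ocaml_suffixes = [ ".ml"; ".mli"; ".mly"; ".mll" ]

let is_ocaml_source path =
  List.exists (Filename.check_suffix path) ocaml_suffixes

(* Collect the OCaml source files under [target]. A plain file is returned
   on its own when it matches, a directory is explored recursively, and a
   missing path yields the empty list. *)
let rec collect target =
  if not (Sys.file_exists target) then []
  else if Sys.is_directory target then
    Sys.readdir target
    |> Array.to_list
    |> List.sort compare
    |> List.concat_map (fun entry -> collect (Filename.concat target entry))
  else if is_ocaml_source target then [ target ]
  else []
```

This sketch follows symbolic links without detecting cycles, and `List.concat_map` requires OCaml 4.10 or later.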

Locc will then print to standard output a report listing the number of comments
detected under each subclass of the `model`. The detailed log of the analysis
will be written to the `logs` directory.

If the `-m model` option is not provided, the default model is

```
fr 0.8 0.1 0.1
%en 0.1 0.8 0.1

```
A model is a '%'-separated list of entries, each made of a primary language
name followed by the list of language probabilities within a text of that
primary language. Note that the primary language name must be an aspell
dictionary name.
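
To make the format concrete, here is a hedged sketch of how such a model string could be parsed. The function names are hypothetical, and the sketch assumes that the probabilities are plain decimal numbers separated by blanks or newlines.

```ocaml
(* Split on blanks and newlines, dropping empty tokens. *)
let tokens s =
  String.map (function '\n' | '\t' | '\r' -> ' ' | c -> c) s
  |> String.split_on_char ' '
  |> List.filter (fun t -> t <> "")

(* One entry is a primary language name followed by its probabilities.
   [float_of_string] raises [Failure] on a malformed number. *)
let parse_entry entry =
  match tokens entry with
  | [] -> None
  | name :: probs -> Some (name, List.map float_of_string probs)

(* A model is a '%'-separated list of entries; empty entries are skipped. *)
let parse_model s =
  String.split_on_char '%' s |> List.filter_map parse_entry
```

Applied to the default model shown above, `parse_model` yields one `("fr", ...)` entry and one `("en", ...)` entry, each carrying three probabilities.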
