https://github.com/octachron/locc
Linguistic Ocaml Comment Classifier
https://github.com/octachron/locc
Last synced: 3 months ago
JSON representation
Linguistic Ocaml Comment Classifier
- Host: GitHub
- URL: https://github.com/octachron/locc
- Owner: Octachron
- Created: 2015-09-23T19:44:05.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-09-23T19:46:06.000Z (over 9 years ago)
- Last Synced: 2025-01-11T09:16:13.978Z (4 months ago)
- Language: OCaml
- Size: 121 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: Readme.md
Awesome Lists containing this project
README
Locc is a helper tool to extract comments from ocaml source code and classify
them in function of their primary language. The classification is done using
aspell dictionaries coupled to a max-likehood estimator based on utterly
simplistic statistic model.##Model
The statistical model used assume that
* There is no correlation between words, i.e. sentences are i.i.d sequences
of words* For a given primary language, every word existing in a language has the same probability
For instance, the default model is
```
\ french english unknown secondaryfrench 0.8 0.1 0.1
english 0.1 0.8 0.1
primary
````
In this model, we assume that within a text primary in french, there is a
10% probability that an English word or a word of unknown origin appears.
And reciprocally for a text primary in English, the model considers that
there is a 10% probability that a french or unknown word appears.##Usage
```sh
locc -m model -o logs target
```
With this invocation, locc will analyze all the Ocaml source files
(i.e ".ml{,i,y,l}") presents in `target`. If target is a directory, all the
files and sub-directories contained in `target` will be analyzed.Locc will then output on std a report listing the number of comments detected
under each subclasses of the `model`. The detailed log of the analysis will be
written in the `logs` directory.If the option `model` is not provided, the default model is
```
fr 0.8 0.1 0.1
%en 0.1 0.8 0.1```
The models themselve are a '%'-separated list of "primary language name" +
list of language probabilities within a text of primary language. Note that
the primary language name must be an aspell dictionary name.