https://github.com/dnouri/beistrich
Predict where to put commas in sentences.
- Host: GitHub
- URL: https://github.com/dnouri/beistrich
- Owner: dnouri
- Created: 2012-11-24T18:30:47.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2013-01-24T18:23:55.000Z (over 13 years ago)
- Last Synced: 2024-10-28T13:26:51.828Z (over 1 year ago)
- Language: Python
- Size: 137 KB
- Stars: 4
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.rst
Abstract
========
beistrich tries to predict where to put commas in sentences. I
personally make a lot of errors when putting commas in German
sentences. So the idea was born to try and create a machine learning
model that can tell me where to put commas.
The best result so far: trained on 225000 cases, with twice as many
cases without a comma as with one, the current model reaches an average
``f1-score`` of **0.89** on 75000 test cases.
::

                  precision    recall  f1-score   support

    training set       0.93      0.93      0.93    225000

               0       0.91      0.93      0.92     50000
               1       0.86      0.82      0.84     25000

     avg / total       0.89      0.89      0.89     75000

    Confusion matrix:
    [[46657  3343]
     [ 4545 20455]]

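The per-class rows of the report can be recomputed from the confusion
matrix. The sketch below assumes that rows are the true classes and
columns the predicted classes (the row sums match the ``support``
column); the function name ``class_metrics`` is made up for
illustration and is not part of beistrich.

.. code-block:: python

    # Confusion matrix from the report above: rows are the true class,
    # columns the predicted class (0 = no comma, 1 = comma).
    CONFUSION = [[46657, 3343],
                 [4545, 20455]]


    def class_metrics(matrix, cls):
        """Return (precision, recall, f1) for one class of a square matrix."""
        true_pos = matrix[cls][cls]
        predicted = sum(row[cls] for row in matrix)  # column sum
        actual = sum(matrix[cls])                    # row sum
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1


    for cls in (0, 1):
        p, r, f = class_metrics(CONFUSION, cls)
        print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

Rounded to two digits, the values match the rows for classes ``0`` and
``1`` in the report.
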
Installation
============
Install from source with ``pip``:

.. code-block:: bash

    $ pip install .

Install the latest released version from PyPI:

.. code-block:: bash

    $ pip install beistrich

beistrich does not declare ``numpy`` or ``scipy`` as dependencies, so
you may have to install these separately *before* installing beistrich:

.. code-block:: bash

    $ pip install numpy
    $ pip install scipy

beistrich also expects you to have the Stanford Tagger installed.
After installation, you'll have to adjust the ``classpath`` and
``stanford_models`` environment variables in ``beistrich.ini`` to
point to the location of ``stanford-postagger.jar`` and the
``models/`` directory in your Stanford Tagger installation.
Usage
=====
create
------
The first step is to download and create a dataset from Gutenberg
books online. To do this, run:

.. code-block:: bash

    $ beistrich-dataset create beistrich.ini

This will download books, process them, and create files
``data/X.npy`` and ``data/y.npy``.
stratify
--------
The dataset created through ``create`` has many more cases *without* a
comma than with one. The first number in the ``bincount`` here is the
number of training cases without a comma:

.. code-block:: bash

    $ beistrich-dataset introspect beistrich.ini
    data/y.npy : 1478815 (bincount: [1363410, 115405])

Let's stratify the dataset, so we'll get better results when doing
training later:

.. code-block:: bash

    $ beistrich-dataset stratify beistrich.ini

``introspect`` will now show us the stratified ``y`` matrix, which has
twice as many training cases without a comma as with one:

.. code-block:: bash

    $ beistrich-dataset introspect beistrich.ini
    data/y-strat-large.npy : 300000 (bincount: [200000, 100000])
    data/y.npy : 1478815 (bincount: [1363410, 115405])

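The idea behind this step can be sketched in a few lines. This is not
beistrich's actual implementation, just a minimal illustration of
downsampling to a 2:1 ratio of cases without a comma to cases with
one; the function ``stratify_indices`` and its parameters are
hypothetical.

.. code-block:: python

    import random


    def stratify_indices(labels, neg_per_pos=2, n_pos=None, seed=0):
        """Pick indices so negatives outnumber positives by ``neg_per_pos``."""
        rng = random.Random(seed)
        pos = [i for i, y in enumerate(labels) if y == 1]
        neg = [i for i, y in enumerate(labels) if y == 0]
        if n_pos is None:
            # Use as many positives as the available negatives allow.
            n_pos = min(len(pos), len(neg) // neg_per_pos)
        chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_pos * neg_per_pos)
        rng.shuffle(chosen)  # avoid all positives coming first
        return chosen


    # Toy labels with roughly the imbalance shown above (about 12 : 1).
    labels = [0] * 1200 + [1] * 100
    picked = stratify_indices(labels)
    counts = [sum(1 for i in picked if labels[i] == c) for c in (0, 1)]
    print(counts)  # [200, 100]
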
report
------
We're now ready to actually train a model. ``report`` will give us a
report on the result of our training:

.. code-block:: bash

    $ beistrich-learn report lr beistrich.ini

search, curve and analyze
-------------------------
The ``search`` command allows you to run a grid search to find the
best hyperparameters for the model.
The ``curve`` command will plot a learning curve, and thus help you
find out if the model is suffering from high bias or high variance.
The ``analyze`` command displays a list of test cases for which the
model made the best predictions (i.e. those cases where the estimated
probability was closest to the actual class), and the worst
predictions (where predictions were off).
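The ranking that ``analyze`` is described as producing can be sketched
as follows. This is an assumption about the general idea, not the
project's code: each case is scored by the distance between the
estimated comma probability and the actual class. The function
``rank_predictions`` and the sample numbers are made up for
illustration.

.. code-block:: python

    def rank_predictions(probabilities, labels):
        """Sort case indices from best to worst prediction.

        The error of a case is the distance between the predicted
        probability of a comma and the actual class (0 or 1).
        """
        errors = [abs(y - p) for p, y in zip(probabilities, labels)]
        return sorted(range(len(labels)), key=errors.__getitem__)


    # Hypothetical predicted comma probabilities and the true classes.
    probs = [0.97, 0.40, 0.08, 0.65, 0.10]
    truth = [1, 1, 0, 0, 1]

    ranking = rank_predictions(probs, truth)
    print("best:", ranking[:2])    # best: [0, 2]
    print("worst:", ranking[-2:])  # worst: [3, 4]
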
You can call these commands just like you call ``report``:

.. code-block:: bash

    $ beistrich-learn search lr beistrich.ini
    $ beistrich-learn curve lr beistrich.ini
    $ beistrich-learn analyze lr beistrich.ini

If you want to tune the models, take a look at the models and their
parameters (specifically ``default_params`` and
``grid_search_params``) in ``beistrich/model.py``.
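To give a feel for how a grid search walks through such parameters,
here is a small sketch. It assumes that ``default_params`` is a dict
of single values and ``grid_search_params`` a dict of candidate lists;
the parameter names ``C`` and ``penalty`` and all values are
hypothetical, so check ``beistrich/model.py`` for the real ones.

.. code-block:: python

    from itertools import product

    # Hypothetical values; the real ones live in beistrich/model.py.
    default_params = {"C": 1.0, "penalty": "l2"}
    grid_search_params = {"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}


    def expand_grid(defaults, grid):
        """Yield one full parameter dict per combination in ``grid``."""
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            params = dict(defaults)            # start from the defaults
            params.update(zip(names, values))  # override the searched ones
            yield params


    candidates = list(expand_grid(default_params, grid_search_params))
    print(len(candidates))  # 6
    print(candidates[0])    # {'C': 0.1, 'penalty': 'l1'}

Each candidate would be trained and scored in turn, so the number of
combinations grows multiplicatively with every parameter added to the
grid.
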
train and correct
-----------------
Once you're happy with your model, it's time to save it:

.. code-block:: bash

    $ bin/beistrich-learn train lr beistrich.ini
    Saved file to data/model.pickle

And finally, you can use it to correct sentences:

.. code-block:: bash

    $ bin/beistrich-learn correct beistrich.ini

The text to correct lives in the ``beistrich.ini`` configuration file.