{"id":15908895,"url":"https://github.com/dnouri/beistrich","last_synced_at":"2025-08-14T21:32:02.282Z","repository":{"id":5634640,"uuid":"6842890","full_name":"dnouri/beistrich","owner":"dnouri","description":"Predict where to put commas in sentences.","archived":false,"fork":false,"pushed_at":"2013-01-24T18:23:55.000Z","size":140,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-28T13:26:51.828Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dnouri.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-11-24T18:30:47.000Z","updated_at":"2019-03-30T18:10:02.000Z","dependencies_parsed_at":"2022-08-24T20:50:58.201Z","dependency_job_id":null,"html_url":"https://github.com/dnouri/beistrich","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnouri%2Fbeistrich","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnouri%2Fbeistrich/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnouri%2Fbeistrich/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dnouri%2Fbeistrich/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dnouri","download_url":"https://codeload.github.com/dnouri/beistrich/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229869561,"owners_count":18136928,"icon_u
rl":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T14:40:57.040Z","updated_at":"2024-12-15T19:43:44.373Z","avatar_url":"https://github.com/dnouri.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Abstract\n========\n\nbeistrich tries to predict where to put commas in sentences.  I\npersonally make a lot of errors when putting commas in German\nsentences.  So the idea was born to try and create a machine learning\nmodel that can tell me where to put commas.\n\nWith a training set of 225000 cases that has twice as many cases\nwithout a comma as with a comma, the current model's best\n``f1-score`` is **0.89**.\n\n::\n\n               precision    recall  f1-score   support\n\n training set       0.93      0.93      0.93    225000\n\n            0       0.91      0.93      0.92     50000\n            1       0.86      0.82      0.84     25000\n\n  avg / total       0.89      0.89      0.89     75000\n\n  Confusion matrix:\n  [[46657  3343]\n   [ 4545 20455]]\n\n\nInstallation\n============\n\nInstall from source with `pip \u003chttp://www.pip-installer.org\u003e`_:\n\n.. code-block:: bash\n\n  $ pip install .\n\nInstall the latest released version from PyPI:\n\n.. code-block:: bash\n\n  $ pip install beistrich\n\nbeistrich does not declare ``numpy`` or ``scipy`` as dependencies.  So\nyou may have to install these separately *before* installing beistrich:\n\n.. 
code-block:: bash\n\n  $ pip install numpy\n  $ pip install scipy\n\nbeistrich also expects you to have the Stanford Tagger installed.\nAfter installation, you'll have to adjust the ``classpath`` and\n``stanford_models`` environment variables in ``beistrich.ini`` to\npoint to the location of ``stanford-postagger.jar`` and the\n``models/`` directory in your Stanford Tagger installation.\n\n\nUsage\n=====\n\ncreate\n------\n\nThe first step is to download and create a dataset from Gutenberg\nbooks online.  To do this, run:\n\n.. code-block:: bash\n\n  $ beistrich-dataset create beistrich.ini\n\nThis will download books, process them, and create files\n``data/X.npy`` and ``data/y.npy``.\n\n\nstratify\n--------\n\nThe dataset created through ``create`` has many more cases *without* a\ncomma than with a comma.  The first number in the ``bincount`` here\nis the number of training cases without a comma:\n\n.. code-block:: bash\n\n  $ beistrich-dataset introspect beistrich.ini \n  data/y.npy                    :  1478815  (bincount: [1363410, 115405])\n\nLet's stratify the dataset, so we'll get better results when doing\ntraining later:\n\n.. code-block:: bash\n\n  $ beistrich-dataset stratify beistrich.ini \n\n``introspect`` will now show us the stratified ``y`` matrix, which has\ntwice as many training cases without a comma as with a comma:\n\n.. code-block:: bash\n\n  $ beistrich-dataset introspect beistrich.ini \n  data/y-strat-large.npy        :   300000  (bincount: [200000, 100000])\n  data/y.npy                    :  1478815  (bincount: [1363410, 115405])\n\n\nreport\n------\n\nWe're now ready to actually train a model.  ``report`` will give us a\nreport on the result of our training:\n\n.. 
code-block:: bash\n\n  $ beistrich-learn report lr beistrich.ini\n\n\nsearch, curve and analyze\n-------------------------\n\nThe ``search`` command allows you to run a grid search to find the\nbest hyperparameters for the model.\n\nThe ``curve`` command will plot a learning curve, and thus help you\nfind out if the model is suffering from high bias or high variance.\n\nThe ``analyze`` command displays a list of test cases for which the\nmodel made the best predictions (i.e. those cases where the estimated\nprobability was closest to the actual class), and the worst\npredictions (where predictions were off).\n\nYou can call these commands just like you call ``report``:\n\n.. code-block:: bash\n\n  $ beistrich-learn search lr beistrich.ini\n  $ beistrich-learn curve lr beistrich.ini\n  $ beistrich-learn analyze lr beistrich.ini\n\nIf you want to tune the models, take a look at the models and their\nparameters (specifically ``default_params`` and\n``grid_search_params``) in ``beistrich/model.py``.\n\n\ntrain and correct\n-----------------\n\nOnce you're happy with your model, it's time to save it:\n\n.. code-block:: bash\n\n  $ bin/beistrich-learn train lr beistrich.ini\n  Saved file to data/model.pickle\n\nAnd finally, you can use it to correct sentences:\n\n.. code-block:: bash\n\n  $ bin/beistrich-learn correct beistrich.ini \n\nThe text to correct lives in the ``beistrich.ini`` configuration file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnouri%2Fbeistrich","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdnouri%2Fbeistrich","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdnouri%2Fbeistrich/lists"}