{"id":13452366,"url":"https://github.com/mideind/GreynirServer","last_synced_at":"2025-03-23T19:34:13.690Z","repository":{"id":31622753,"uuid":"35187829","full_name":"mideind/GreynirServer","owner":"mideind","description":"The greynir.is Icelandic natural language processing API and website.","archived":false,"fork":false,"pushed_at":"2024-07-18T14:58:38.000Z","size":42011,"stargazers_count":65,"open_issues_count":16,"forks_count":17,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-07-31T07:18:46.593Z","etag":null,"topics":["earley","grammar","icelandic","icelandic-language","icelandic-news-sites","information-extraction","natural-language-processing","natural-language-queries","nlp","parse-forests","parse-trees","parser","python","tf-idf","tokenizer"],"latest_commit_sha":null,"homepage":"https://greynir.is","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mideind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-05-06T23:19:12.000Z","updated_at":"2024-07-31T07:18:51.155Z","dependencies_parsed_at":"2024-07-31T07:28:53.888Z","dependency_job_id":null,"html_url":"https://github.com/mideind/GreynirServer","commit_stats":null,"previous_names":["mideind/greynir"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirServer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirServer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirServer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirServer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mideind","download_url":"https://codeload.github.com/mideind/GreynirServer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221900895,"owners_count":16898991,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["earley","grammar","icelandic","icelandic-language","icelandic-news-sites","information-extraction","natural-language-processing","natural-language-queries","nlp","parse-forests","parse-trees","parser","python","tf-idf","tokenizer"],"created_at":"2024-07-31T07:01:22.015Z","updated_at":"2024-10-28T18:31:11.774Z","avatar_url":"https://github.com/mideind.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)\n[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)\n[![Build](https://github.com/mideind/Greynir/actions/workflows/python-package.yml/badge.svg)]()\n\n\u003cimg src=\"static/img/greynir-logo-large.png\" alt=\"Greynir\" width=\"200\" height=\"200\" align=\"right\" style=\"margin-left:20px; margin-bottom: 20px;\"\u003e\n\n# GreynirServer\n\n### Natural Language Processing for Icelandic\n\n*GreynirServer* is the web API and frontend for\n[GreynirEngine](https://github.com/mideind/GreynirEngine), a natural language processing\nengine that **extracts processable information from Icelandic text**, allows\n**natural language querying** of that information and facilitates\n**natural language understanding**. GreynirServer is also the backend of\n[*Embla*](https://embla.is), an Icelandic voice assistant app for\nsmartphones, tablets, and other devices.\n\nTry Greynir (in Icelandic) at [https://greynir.is](https://greynir.is)\n\nGreynir periodically scrapes chunks of text from Icelandic news sites on the web.\nIt employs the [Tokenizer](https://github.com/mideind/Tokenizer) and\n[GreynirEngine](https://github.com/mideind/GreynirEngine) modules (by the same authors)\nto tokenize the text and parse the token streams according to a\n**hand-written context-free grammar** for the Icelandic language.\nThe resulting parse forests are disambiguated using\nscoring heuristics to find the best parse trees. The trees are then stored in a\ndatabase and processed by grammatical pattern matching modules to obtain statements\nof fact and relations between stated facts.\n\nAn overview of the technology behind Greynir can be found in the paper\n[A Wide-Coverage Context-Free Grammar for Icelandic\nand an Accompanying Parsing System](https://acl-bg.org/proceedings/2019/RANLP%202019/pdf/RANLP160.pdf)\nby Vilhjálmur Þorsteinsson, Hulda Óladóttir and Hrafn Loftsson *(Proceedings of Recent Advances in Natural Language Processing, pages 1397–1404, Varna, Bulgaria, Sep 2–4, 2019).*\n\n\u003ca href=\"https://raw.githubusercontent.com/mideind/Greynir/master/static/img/tree-example.png\" title=\"Greynir parse tree\"\u003e\n\u003cimg src=\"static/img/tree-example-small.png\" width=\"400\" height=\"450\" alt=\"Greynir parse tree\"\u003e\n\u003c/a\u003e\n\n*A parse tree as displayed by Greynir. Nouns and noun phrases are blue, verbs and verb phrases are red,\nadjectives are green, prepositional and adverbial phrases are grey, etc.*\n\nGreynir is most effective for text that is objective and factual, i.e. has a relatively high\nratio of concrete concepts such as numbers, amounts, dates, person and entity names,\netc.\n\nGreynir is innovative in its ability to parse and disambiguate text written in a\n**morphologically rich language**, i.e. Icelandic, which does not lend itself easily\nto statistical parsing methods. Greynir uses grammatical feature agreement (cases, genders,\npersons, number (singular/plural), verb tenses, modes, etc.) to guide and disambiguate\nparses. Its highly optimized Earley-based parser, implemented in C++, is fast and compact\nenough to make real-time while-you-wait analysis of web pages, as well as bulk\nprocessing, feasible.\n\nGreynir's goal is to \"understand\" text to a usable extent by parsing it into\nstructured, recursive trees that directly correspond to the original grammar.\nThese trees can then be further processed and acted upon by sets of Python\nfunctions that are linked to grammar nonterminals.\n\n**Greynir is currently able to parse about *90%* of sentences** in a typical news article from the web,\nand many well-written articles can be parsed completely. It presently has well over a million parsed articles\nin its database, containing more than 16 million parsed sentences. A relatively recent (2021) version of this\ndatabase is available via the [GreynirCorpus](https://github.com/mideind/GreynirCorpus) project.\n\nGreynir supports natural language querying of its databases. Users can ask about person names, titles and\nentity definitions and get appropriate replies. The HTML5 Web Speech API is supported to allow\nqueries to be **recognized from speech** in enabled browsers, such as recent versions of Chrome.\nSimilarity queries are also supported, yielding articles that are similar in content to a given\nsearch phrase or sentence.\n\nGreynir may in due course be expanded, for instance:\n\n* to make logical inferences from statements in its database;\n* to find statements supporting or refuting a thesis; and/or\n* to discover contradictions between statements.\n\n## Implementation\n\nGreynir is written in [Python 3](https://www.python.org/) except for its core\nEarley-based parser module which is written in C++ and called\nvia [CFFI](https://cffi.readthedocs.org/en/latest/index.html).\nGreynir requires Python 3.9 or later, and runs on CPython and\n[PyPy](http://pypy.org/), with the latter being recommended for performance reasons.\n\nGreynir works in stages, roughly as follows:\n\n1. **Web scraper**, built on [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)\n  and [SQLAlchemy](http://www.sqlalchemy.org/) storing data\n  in [PostgreSQL](http://www.postgresql.org/).\n2. **Tokenizer** ([this one](https://github.com/mideind/Tokenizer)),\n  extended to use the [BÍN](http://bin.arnastofnun.is/DMII/) database of Icelandic word forms for lemmatization and\n  initial part-of-speech tagging.\n3. **Parser** (from [this module](https://github.com/mideind/GreynirEngine)),\n  using an improved version of the [Earley algorithm](http://en.wikipedia.org/wiki/Earley_parser)\n  to parse text according to an unconstrained hand-written context-free grammar for Icelandic\n  that may yield multiple parse trees (a parse forest) in case of ambiguity.\n4. **Parse forest reducer** with heuristics to find the best parse tree.\n5. **Information extractor** that maps a parse tree via its grammar constituents to plug-in\n  Python functions.\n6. **Article indexer** that transforms articles from bags-of-words to fixed-dimensional\n  topic vectors using [Tf-Idf](http://www.tfidf.com/) and\n  [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis).\n7. **Query processor** that supports a range of natural language queries\n  (including queries about entities in Greynir's database).\n\nGreynir has an embedded web server that displays news articles recently scraped into its\ndatabase, as well as names of people extracted from those articles along with their titles.\nThe web interface enables the user to type in any URL and have Greynir scrape it, tokenize it and\ndisplay the result as a web page. Queries can also be entered via the keyboard or using voice\ninput. The server runs on the [Flask](http://flask.pocoo.org/) framework, implements WSGI and\ncan for instance be plugged into [Gunicorn](http://gunicorn.org/) and\n[nginx](https://www.nginx.com/).\n\nThe [tokenizer](https://github.com/mideind/Tokenizer) divides text chunks into\nsentences and recognizes entities such as dates, numbers,\namounts and person names, as well as common abbreviations and punctuation.\n\nGrammar rules are laid out in a separate text file,\n[`Greynir.grammar`](https://github.com/mideind/GreynirEngine/blob/master/src/reynir/Greynir.grammar),\nwhich is a part of [GreynirEngine](https://github.com/mideind/GreynirEngine). The standard\n[Backus-Naur form](http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form) has been\naugmented with repeat specifiers for right-hand-side tokens (`*` for 0..n instances,\n`+` for 1..n instances, or `?` for 0..1 instances). Also, the grammar allows for\ncompact specification of rules with variants, for instance due to cases, numbers and genders.\nThus, a single rule (e.g. `NounPhrase/case/gender → Adjective/case noun/case/gender`)\nis automatically expanded into multiple rules (12 in this case, 4 cases x 3 genders) with\nappropriate substitutions for right-hand-side tokens depending on their local variants.\n\nThe parser is an optimized C++ implementation of an Earley parser as enhanced by\n[Scott and Johnstone](http://www.sciencedirect.com/science/article/pii/S0167642309000951),\nreferencing Tomita. It parses ambiguous grammars without restriction and\nreturns a compact Shared Packed Parse Forest (SPPF) of parse trees. If a parse\nfails, it identifies the token at which no parse was available.\n\nThe Greynir scraper is typically run in a `cron` job every 30 minutes to extract\narticles automatically from the web, parse them and store the resulting trees\nin a PostgreSQL database for further processing.\n\nScraper modules for new websites are plugged in by adding Python code to the\n[`scrapers/`](scrapers/) directory. Currently, the [`scrapers/default.py`](scrapers/default.py)\nmodule supports a wide range of popular Icelandic news sites.\n\nProcessor modules can be plugged into Greynir by adding Python code to the\n[`processors/`](processors/) directory. The module [`processors/persons.py`](processors/persons.py),\nfor example, extracts person names and titles from parse trees for storage in a database table.\n\nQuery (question answering) modules can be plugged into Greynir by adding Python code to\nthe [`queries/`](queries/) directory. Reference implementations for several query types\ncan be found in that directory, for instance [`queries/builtin.py`](queries/builtin.py)\nwhich supports questions about persons and titles. Query module examples can be viewed\nin [`queries/examples`](queries/examples).\n\n## File details\n\n* [`article.py`](article.py): Representation of an article through its life cycle\n* [`config/Greynir.conf`](config/Greynir.conf): Editable configuration file\n* [`db/*.py`](db/): Database models, queries and functions via SQLAlchemy\n* [`fetcher.py`](fetcher.py): Utility classes for fetching articles given their URLs\n* [`geo.py`](geo.py): Geography and location-related utility functions\n* [`main.py`](main.py): WSGI web server application and main module for command-line invocation\n* [`nertokenizer.py`](nertokenizer.py): A layer on top of the tokenizer for named entity recognition\n* [`postagger.py`](postagger.py): Part-of-speech tagging\n* [`processor.py`](processor.py): Information extraction from parse trees and token streams\n* [`queries/*.py`](queries/): Natural language query processor and query-answering modules\n* [`routes/*.py`](routes/): Routes for the web application\n* [`scraper.py`](scraper.py): Web scraper, collecting articles from a set of pre-selected websites\n* [`scrapers/*.py`](scrapers): Scraper code for various websites\n* [`settings.py`](settings.py): Management of global settings and configuration data\n* [`speak.py`](speak.py): Command line interface for speech synthesis\n* [`speech/*.py`](speech/): Speech synthesizer modules\n* [`tnttagger.py`](tnttagger.py): Statistical Part-of-speech tagging\n* [`tools/*.py`](tools/): Various command line utility tools\n* [`tree/*.py`](tree/): Representation of parse trees for processing and related utility functions\n* [`utility.py`](utility.py): Assorted utility functions used throughout the codebase\n* [`vectors/builder.py`](vectors/builder.py): Article indexer and LSA topic vector builder\n\n## Installation and setup\n\n* [Instructions for Ubuntu/Debian GNU/Linux](docs/setup_linux.md)\n* [Instructions for macOS](docs/setup_macos.md)\n* [Docker container](https://github.com/vthorsteinsson/greynir-docker)\n\n## Running Greynir\n\nOnce you have followed the installation and setup instructions above, change\nto the Greynir repository and activate the virtual environment:\n\n```bash\ncd Greynir\nsource venv/bin/activate\n```\n\nYou should now be able to run Greynir.\n\n### Web application\n\n```bash\npython main.py\n```\n\nDefaults to running on [`localhost:5000`](http://localhost:5000) but this can be\nchanged in [`config/Greynir.conf`](config/Greynir.conf).\n\n### Web scrapers\n\n```bash\npython scraper.py\n```\n\nIf you are running the scraper on macOS, you may run into problems with Python's `fork()`.\nThis can be fixed by setting the following environment variable in your shell:\n\n```bash\nexport OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES\n```\n\n### Processors\n\n```bash\npython processor.py\n```\n\nThis will run all processors in the `processors` directory on any unprocessed articles\nin the database.\n\n### Interactive shell\n\nYou can launch an [IPython](https://ipython.org) REPL shell with a database session (`s`), the Greynir\nparser (`r`) and all SQLAlchemy database models preloaded. See [Using the Greynir Shell](docs/shell.md)\nfor instructions.\n\n## Contributing\n\nSee [Contributing to Greynir](CONTRIBUTING.md).\n\n## License\n\nGreynir is Copyright \u0026copy; 2023 [Miðeind ehf.](https://mideind.is)  \nThe original author of this software is *Vilhjálmur Þorsteinsson*.\n\n\u003ca href=\"https://mideind.is\"\u003e\u003cimg src=\"static/img/mideind-horizontal-small.png\" alt=\"Miðeind ehf.\"\n    width=\"214\" height=\"66\" align=\"right\" style=\"margin-left:20px; margin-bottom: 20px;\"\u003e\u003c/a\u003e\n\nThis set of programs is free software: you can redistribute it and/or modify it\nunder the terms of the GNU General Public License as published by the Free\nSoftware Foundation, either version 3 of the License, or (at your option) any later\nversion.\n\nThis set of programs is distributed in the hope that it will be useful, but WITHOUT\nANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR\nA PARTICULAR PURPOSE. See the GNU General Public License for more details.\n\n\u003ca href=\"https://www.gnu.org/licenses/gpl-3.0.html\"\u003e\u003cimg src=\"static/img/GPLv3.png\"\nalign=\"right\" style=\"margin-left:15px;\" width=\"180\" height=\"60\"\u003e\u003c/a\u003e\n\nThe full text of the GNU General Public License v3 is\n[included here](https://github.com/mideind/Greynir/blob/master/LICENSE.txt)\nand also available here: [https://www.gnu.org/licenses/gpl-3.0.html](https://www.gnu.org/licenses/gpl-3.0.html).\n\nIf you wish to use this set of programs in ways that are not covered under the\nGNU GPLv3 license, please contact us at [mideind@mideind.is](mailto:mideind@mideind.is)\nto negotiate a custom license. This applies for instance if you want to include or use\nthis software, in part or in full, in other software that is not licensed under\nGNU GPLv3 or other compatible licenses.\n\n## Acknowledgements\n\nGreynir uses the BÍN ([Beygingarlýsing íslensks nútímamáls](https://bin.arnastofnun.is))\nlexicon and database of Icelandic word forms to identify words and find their\npotential meanings and lemmas. The database is included in\n[BinPackage](https://github.com/mideind/BinPackage) in compressed form.\nBÍN is licensed under CC-BY-4.0, and credit is hereby given as follows:\n\n*Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum. Höfundur og ritstjóri Kristín Bjarnadóttir.*\n\nThe Greynir web interface uses map data from [OpenStreetMap](https://www.openstreetmap.org),\n[Google](https://maps.google.com) and [Wikipedia](https://is.wikipedia.org).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2FGreynirServer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmideind%2FGreynirServer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2FGreynirServer/lists"}