{"id":13550449,"url":"https://github.com/opensemanticsearch/open-semantic-search","last_synced_at":"2025-05-16T14:06:41.117Z","repository":{"id":39649559,"uuid":"55076350","full_name":"opensemanticsearch/open-semantic-search","owner":"opensemanticsearch","description":"Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining \u0026 Text Analytics platform (Integrates ETL for document processing, OCR for images \u0026 PDF, named entity recognition for persons, organizations \u0026 locations, metadata management by thesaurus \u0026 ontologies, search user interface \u0026 search apps for fulltext search, faceted search \u0026 knowledge graph)","archived":false,"fork":false,"pushed_at":"2025-04-19T14:57:11.000Z","size":9347,"stargazers_count":1019,"open_issues_count":202,"forks_count":180,"subscribers_count":56,"default_branch":"master","last_synced_at":"2025-04-19T15:58:20.747Z","etag":null,"topics":["annotation","faceted-search","fulltext-search","investigative-journalism","journalism","named-entity-recognition","ocr","ontologies","osint","python","research-tool","search","search-engine","search-interface","semantic","skos","text-analysis","text-mining","thesaurus","ui"],"latest_commit_sha":null,"homepage":"https://opensemanticsearch.org","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/opensemanticsearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null},"funding":{"custom":["https://www.paypal.me/MMandalka"]}},"created_at":"2016-03-30T15:51:03.000Z","updated_at":"2025-04-19T14:57:15.000Z","depend
encies_parsed_at":"2022-08-28T18:31:04.673Z","dependency_job_id":"ee1fa0bc-11ce-4c5a-a997-a599f08016a1","html_url":"https://github.com/opensemanticsearch/open-semantic-search","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opensemanticsearch%2Fopen-semantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opensemanticsearch%2Fopen-semantic-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opensemanticsearch%2Fopen-semantic-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/opensemanticsearch%2Fopen-semantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/opensemanticsearch","download_url":"https://codeload.github.com/opensemanticsearch/open-semantic-search/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254544146,"owners_count":22088807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","faceted-search","fulltext-search","investigative-journalism","journalism","named-entity-recognition","ocr","ontologies","osint","python","research-tool","search","search-engine","search-interface","semantic","skos","text-analysis","text-mining","thesaurus","ui"],"created_at":"2024-08-01T12:01:33.202Z","updated_at":"2025-05-16T14:06:41.068Z","avatar_url":"https://github.com/opensemanticsearch.png","language":"Shell","funding_links":["https://www.pay
pal.me/MMandalka"],"categories":["Dockerfile","Shell","A01_文本生成_文本对话","ui","osint"],"sub_categories":["大语言对话模型及数据"],"readme":"# Open Semantic Search\nhttps://opensemanticsearch.org\n\nOpen Semantic Search is:\n- an integrated search server,\n- ETL framework for document processing (crawling, text extraction, text analysis, named entity recognition and OCR for images and embedded images in PDF),\n- search user interfaces, text mining, text analytics and search apps for fulltext search, faceted search, exploratory search and knowledge graph search\n\n# Documentation\n\nThis README.md is documentation for software developers.\n\n## Documentation for users and admins\n\nThe [documentation for users and admins](docs/doc/README.md) is included in the software packages/images and linked in the search user interface (Menu \"Help\").\n\n## Software architecture\n\nYou can find the [documentation of the search engine architecture in `docs/doc/modules/README.md`](docs/doc/modules/README.md).\n\n## Documentation format\n\nThis integrated HTML [documentation](https://opensemanticsearch.org/doc/search/) is generated by the static site generator [MkDocs](https://www.mkdocs.org/) with the config file [`mkdocs.yml`](mkdocs.yml).\n\nThe source of the documentation (Markdown format) and the charts ([mermaid](https://mermaid-js.github.io/mermaid/) format) is editable in the directory [`docs`](docs).\n\n# Build\n\nHow to build the deb package for installation on Debian or Ubuntu server or the docker images for running in Docker containers:\n\n## Clone git repositories\nClone the repository including the dependencies:\n\n```\ngit clone --recurse-submodules --remote-submodules https://github.com/opensemanticsearch/open-semantic-search.git\ncd open-semantic-search\n```\n\n## Build deb package\n\nTo build a `deb` package for *Debian GNU/Linux* or *Ubuntu Linux*, call the build script \u003ccode\u003e[build-deb](build-deb)\u003c/code\u003e as user root (change user by `su` or `sudo 
su`):\n\n```\n./build-deb\n```\n\n## Build Desktop Search VM\n\nHow to build an *Open Semantic Desktop Search* Appliance for *VirtualBox* is documented in\n[`src/open-semantic-desktop-search/README.md`](src/open-semantic-desktop-search/README.md).\n\n## Build docker images\n\nBuild the Docker images using the default docker-compose config \u003ccode\u003e[docker-compose.yml](docker-compose.yml)\u003c/code\u003e:\n\n```\ndocker-compose build\n```\n\n## Run docker containers\n\nAfter the build, all Docker images, dependencies and services can be started together by docker-compose with the config file \u003ccode\u003e[docker-compose.yml](docker-compose.yml)\u003c/code\u003e.\n\nYou can start the whole environment by running:\n\n```\ndocker-compose up\n```\n\nwhich will expose the web user interface on port `8080`.\n\nYou can browse the Open Semantic Search user interface in your favourite browser at this URL:\n\n`http://localhost:8080/search/`\n\n\n# Automated tests\n\nFor CI/CD there are several kinds of automated tests:\n\n\n## Integration tests\n\nSince the [submodule Open Semantic ETL](src/open-semantic-etl) depends on services like [Solr](src/solr.deb), [spaCy-services](src/spacy-services.deb) or [Tika-Server](src/tika-server.deb) via HTTP and REST API, many automated tests run as integration tests within the docker-compose environment configured in \u003ccode\u003e[docker-compose.etl.test.yml](docker-compose.etl.test.yml)\u003c/code\u003e, so these services are available while running the unit tests and integration tests.\n\n```\ndocker-compose -f docker-compose.etl.test.yml build\ndocker-compose -f docker-compose.etl.test.yml up\n```\n\n\n## End to end tests\n\nSome automated integration tests and end-to-end (E2E) tests run within a web browser controlled by the browser automation framework [Playwright](https://playwright.dev/) and the Node.js / JavaScript based test framework [Jest](https://jestjs.io/).\n\nYou can extend the automated tests in 
\u003ccode\u003e[test/test.js](test/test.js)\u003c/code\u003e\n\nThey run in the Docker image \u003ccode\u003e[Dockerfile-test](Dockerfile-test)\u003c/code\u003e and need the services of the docker-compose environment \u003ccode\u003e[docker-compose.test.yml](docker-compose.test.yml)\u003c/code\u003e:\n\n```\ndocker-compose -f docker-compose.test.yml build\ndocker-compose -f docker-compose.test.yml up\n```\n\n\n# Dependencies\n\nDependencies are resolved automatically by building or by installation of the Debian or Ubuntu packages or by building the Docker images.\n\nThe following documentation of these dependencies may help with debugging dependency hell issues or with installations in other environments:\n\n\n## Build dependencies on Source code (GIT)\n\nDependencies on other Git repositories / submodules of components like Open Semantic ETL are defined in the Git config file \u003ccode\u003e[.gitmodules](.gitmodules)\u003c/code\u003e\n\nThe submodules will be checked out automatically to the subdirectory \u003ccode\u003e[src](src)\u003c/code\u003e, if you check out this repository by git in recursive mode.\n\n\n## Packaging dependencies of Java archives (JAR)\n\nThe submodules \u003ccode\u003e[src/tika-server.deb](src/tika-server.deb)\u003c/code\u003e and \u003ccode\u003e[src/solr.deb](src/solr.deb)\u003c/code\u003e need the JARs of [Apache Tika-Server](https://tika.apache.org/) and [Apache Solr](https://solr.apache.org/).\n\nIf the JARs are not present, they will be downloaded from the Apache Software Foundation by wget in the \u003ccode\u003e[build-deb](build-deb)\u003c/code\u003e script or the submodules' \u003ccode\u003eDockerfile\u003c/code\u003e.\n\n\n## Installation dependencies on Debian/Ubuntu packages (DEB)\n\nDependencies on tools and libraries, which are available in the Debian or Ubuntu package repositories, are defined in the section \u003ccode\u003eDepends\u003c/code\u003e of the deb package config file \u003ccode\u003e[DEBIAN/control](DEBIAN/control)\u003c/code\u003e\n\n\n## 
Installation dependencies on Python packages (PIP)\n\nDependencies on Python libraries that are not available as packages of the Linux distribution but are in the Python Package Index (PyPI) are defined in\n\n\u003ccode\u003e[src/open-semantic-etl/src/opensemanticetl/requirements.txt](src/open-semantic-etl/src/opensemanticetl/requirements.txt)\u003c/code\u003e\n\nThese dependencies will be installed automatically on installation of the Debian/Ubuntu packages by the \u003ccode\u003eDEBIAN/postinst\u003c/code\u003e script of the Debian/Ubuntu packages or by docker build configured by \u003ccode\u003eDockerfile\u003c/code\u003e by\n\n```\npip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt\n```\n\n# Contributors\n\nMost contributors are not shown by the GitHub user interface as \"*Contributors*\" of this repository,\nsince this main repository is structured by [Git submodules](.gitmodules) like [*Open Semantic ETL*](https://github.com/opensemanticsearch/open-semantic-etl)\nand other modules, which are managed in separate Git(Hub) repositories.\n\nSo thanks to all (current and former) contributors:\n\n- Markus Mandalka (@mandalka)\n- @g-braeunlich\n- @maehr\n- @sdinten\n- @wsldankers\n- @rivimey\n- @rbussche\n- @mosea3\n- @bhelou\n- @hpiedcoq\n- @andreclinio\n- @agharbeia\n- @ciyer\n- @davidshq\n...\n\nFeel free to extend if you contributed/supported/sponsored in different forms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensemanticsearch%2Fopen-semantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopensemanticsearch%2Fopen-semantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopensemanticsearch%2Fopen-semantic-search/lists"}