https://github.com/nexus-stc/stc

Distributed free search engine and AI tools that grant access to knowledge

books database ipfs knowledge scholarly-articles summa

# Standard Template Construct

Welcome, developer!
You've arrived at the repository for [STC](https://libstc.cc), the library, search engine and AI tooling offering free access to academic knowledge and works of fictional literature.

![](/web/public/favicon.svg)

[STC](https://libstc.cc) | [Help Center](https://libstc.cc/#/help)

## Getting Started

- Explore our search features at [Web STC](https://libstc.cc), or through one of the Telegram bots listed in the bio of our [channel](https://t.me/nexus_search) (not an ad, just a safety measure)
- [Discover](https://libstc.cc/#/help/replicate) how to set up your own STC instance, enabling you to enjoy the same search capabilities in your local environment
- Learn [how to access a large corpus](/geck) of high-quality scholarly texts using Python and [use them in AI apps](/cybrex)

## Details

In essence, STC is the [Summa](https://github.com/izihawa/summa) search engine coupled with databanks.
These databanks reside on [IPFS](https://ipfs.tech/) in a format that allows searching without downloading the entire dataset.
The search engine library can function as a standalone server, an embeddable Python library (requiring no additional software!), or a WASM-compiled module that runs in a browser.
The last option allows the search engine to be embedded in a static site, which can itself be deployed over IPFS. This is how [Web STC](https://libstc.cc) is served.

Putting everything on IPFS allows you to open STC in your browser or on your server and avoid the use of centralized servers that may lose or censor data.
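
As a rough illustration of how searching without downloading the entire dataset can work, the sketch below builds an HTTP byte-range request, which is one common way to read only a small slice of a large remote file. That STC reads its index exactly this way is an assumption here, and the gateway address, `EXAMPLE_CID` and `index.bin` are hypothetical placeholders rather than real STC paths.

```python
from urllib.request import Request


def range_request(url: str, start: int, length: int) -> Request:
    """Build an HTTP request for `length` bytes starting at offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL: a local IPFS gateway serving a placeholder file.
req = range_request("http://127.0.0.1:8080/ipfs/EXAMPLE_CID/index.bin", 4096, 1024)
print(req.get_header("Range"))  # bytes=4096-5119
```

The request is only constructed, not sent. Passing it to `urllib.request.urlopen` against a server that supports range requests would return just the requested slice.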

## Components

- [Web STC](/web) is a browser-based interface with an embedded search engine that can be entirely deployed on IPFS and used in browsers
- [GECK](/geck) is a Python library and Bash tool for setting up and interacting with STC programmatically
- The [Cybrex AI](/cybrex) library pairs STC with AI tools such as OpenAI or free LLMs for processing stored data
- [STC Hub API](https://libstc.cc/#/help/stc-hub-api) is a plain API for accessing scholarly publications by their DOIs through `kubo` command-line tools or even over HTTP.
- [Telegram Nexus Bot](/tgbot) allows users to access STC via Telegram, one of the most popular messaging platforms.
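
To make the idea of DOI-based access over HTTP more concrete, here is a minimal sketch that turns a DOI into a URL on an IPFS HTTP gateway. The path layout, the `ipns/hub.example` name and the choice to percent-encode the slash in the DOI are all assumptions made for illustration. The actual scheme is described in the STC Hub API help page linked above.

```python
from urllib.parse import quote


def doi_to_gateway_url(doi: str, gateway: str, hub_path: str) -> str:
    """Join a gateway base URL, a hub path and a percent-encoded DOI."""
    doi = doi.strip()
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")  # all DOIs begin with "10."
    # safe="" also encodes "/", so the DOI stays a single path segment
    # (an assumption of this sketch, not a documented STC requirement).
    encoded = quote(doi, safe="")
    return f"{gateway.rstrip('/')}/{hub_path.strip('/')}/{encoded}"


# Hypothetical values: a local gateway address and a placeholder hub name.
url = doi_to_gateway_url("10.1000/xyz123", "http://127.0.0.1:8080", "ipns/hub.example")
print(url)  # http://127.0.0.1:8080/ipns/hub.example/10.1000%2Fxyz123
```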

## Roadmap

| Part | Task | Description |
|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Library Stewardship | | |
| | ✅ Assimilation of LibGen corpus | Transition of all items to `nexus_science` |
| | 🚧 Assimilation of SciMag corpus | Significant task of transferring scimag corpus to IPFS |
| | ✅ Structured content | Enhance GROBID extraction (headers + content) and store content in the `structured_content` JSON column. Extract entities for cross-linking in Web STC |
| | 🚧 Implementing classification ([articles](https://github.com/nexus-stc/stc/issues/12), [books](https://github.com/nexus-stc/stc/issues/13)) | |
| Web STC | | |
| | UX improvement | STC often requires loading of large data chunks, currently reflected only by a spinner. The UX needs improvement. Following structured content implementation, we can highlight headers and generate cross-links in abstracts/content |
| | Enhancing availability | Further testing needed on diverse devices and networks |
| | Bookshelf | STC has all the tools needed to generate bookshelves that could offer users high-quality suggestions on what to read. |
| Cybrex AI | | |
| | First-class support of local LLM | Extensive testing of prompts with documents is required to identify the smallest model capable of efficiently executing QA and summarization tasks. Most 13-15B models are currently failing (quantized, on CPU) |
| | Building an embeddings dataset | The goal is to build a comprehensive dataset with DOIs and document embeddings. Currently, the Instructor XL model appears most promising, but further testing is necessary |
| | Refining and fixing metadata ([cleaning `content`](https://github.com/nexus-stc/stc/issues/14)) | Areas for improvement include: detected language, tags, keywords, automated abstracts, Dewey classification |
| | Build QA on local LLM | Such a system should be independently operable and also accessible via Telegram. |
| | Fine-tuning LLMs on STC | |
| Distribution | | |
| | Building STC Box | Develop and maintain a definitive guide and scripts for replicating and launching STC on compact devices like PI computers or TV Boxes |
| | Global replication | The goal is to replicate STC (including the search database and papers) a minimum of 100 times across at least 30 countries |
| | Establishing Frontier Outposts | Investigate strategies to replicate STC on an orbiting satellite or another planet in the solar system (Mars or Europa preferred) |
| Communities | | |
| | ✅ [Forming Science Communities on Telegram](https://t.me/+CVQ4OIRoU85hZDc8) | Initiate the first version of Telegram-based forums focusing on specific scientific topics |
| | Addressing Copyright Issues | Organize more activities aimed at challenging the copyright laws for scholarly and educational writings |