Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/maisikoleni/javadoc-search

Fast Server-side Search Engine for Javadoc API
https://github.com/maisikoleni/javadoc-search

api java javadoc maven quarkus qute regular-expression search search-algorithm search-engine trie

Last synced: about 1 month ago
JSON representation

Fast Server-side Search Engine for Javadoc API

Awesome Lists containing this project

README

        

:encoding: utf-8
:lang: en
:title: Javadoc Search
:description: Fast Server-side Search Engine for Javadoc API
:keywords: search, java, api, search-engine, maven, regular-expression, trie, javadoc, search-algorithm, qute, quarkus
:author: Christian Femers
:showtitle:
:icons: font

= Javadoc Search

A fast server-side search engine for the Javadoc API
of selected JDK releases and other libraries.

*Available at https://javadoc-search.maisikoleni.net*

Can also be added to a bowser as custom search engine:

- Search URL Template: +
`https://javadoc-search.maisikoleni.net/api/v2/libraries/jdk-latest/search/redirect?query=%s`
- Search Suggestions URL Template: +
`https://javadoc-search.maisikoleni.net/api/v2/libraries/jdk-latest/search/suggestions?query=%s`

[WARNING]
====
The above is a fairly weak server and not suitable for benchmarking.
====

You can also replace `jdk-latest` which always points to the
latest full release with one of the following:

- `jdk-11` for the JDK 11 API
- `jdk-17` for the JDK 17 API
- `junit5-latest` for the current JUnit 5 API

The same works for the website in the browser as well.

== Details

Realized as https://quarkus.io[Quarkus] web application mainly with https://quarkus.io/guides/qute[Qute],
https://quarkus.io/guides/resteasy[RESTEasy] and https://microstream.one/platforms/microstream-for-java/[Microstream].
The client-side uses https://getbootstrap.com[Bootstrap] and https://htmx.org[htmx].

The search engine is implemented as a https://en.wikipedia.org/wiki/Radix_tree[sparse radix trie]
generated from the `xxx-search-index.js` files of a Javadoc site.
It is also mixed with a https://en.wikipedia.org/wiki/Suffix_tree[suffix tree]
for name segments of the Javadoc elements.

The following example shows the trie generated for `java.lang.Long`
and `java.util.OptionalLong`, both in the `java.base` module.
The trie will be compressed such that the equally colored nodes point to the same object.

.Example trie
image::docs/example-trie.svg[]

To search in that trie like one can in the official Javadoc implementation,
the nodes in the trie are being traversed recursively
based on a custom regular expression matcher implementation.

For example, `*j.u.OpL*` gets transformed into a regular expression
equivalent to
[source,regexp]
----
j\p{javaLowerCase}*+\s*+•*+\.\s*+u\p{javaLowerCase}*+\s*+•*+\.\s*+Op\p{javaLowerCase}*+L[^•]*+
----
and gets compiled into:
[source,java]
----
CompiledRegex [
0x0000006a, // matches 'j'
0xc0000000, // star referencing char class #0 -> "\p{javaLowerCase}"
0xc0000001, // star referencing char class #1 -> "\s"
0x80000378, // star referencing literal '•'
0x0000002e, // matches '.'
0xc0000001, // star referencing char class #1 -> "\s"
0x00000075, // matches 'u'
0xc0000000, // star referencing char class #0 -> "\p{javaLowerCase}"
0xc0000001, // star referencing char class #1 -> "\s"
0x80000378, // star referencing literal '•'
0x0000002e, // matches '.'
0xc0000001, // star referencing char class #1 -> "\s"
0x0000004f, // matches 'O'
0x00000070, // matches 'p'
0xc0000000, // star referencing char class #0 -> "\p{javaLowerCase}"
0x0000004c, // matches 'L'
0xc0000002 // star referencing char class #2 -> "[^•]"
]
----
These instructions are an array of `int` and
they are accompanied by an array of CharPredicates,
which are functions of type `char -> boolean`
and represent the character classes in the custom regex implementation.

The instructions are encoded as follows:
[options="header"]
|===
| Bit 31 | Bit 30 | Bits 29 ... 0
//-------------------------------------------------------------------------------------
| Wrapped by a star? | Represents a char class? | `char` value or `CharPredicate` index
|===

This compiled custom regex implementation has the
disadvantage that it does not support nested groups or
repetitions of more than a single character.
This is however not an issue for our purpose,
as we only need to star single characters.

The advantage, however, is
that all its state can be stored in a single `long`,
which can easily be passed around methods
and avoids the generation of objects.

This `long` value is composed as follows:
[options="header"]
|===
| Bit 63 | Bits 62 ... 32 | Bit 31 | Bits 30 ... 0
//-------------------------------
| Did the match fail?
| Number of stars that matched at least one character
| Did the current star already match one character?
| Next index in the instruction array
|===

As the state is only a `long` that is stored by the user
and this possessive matcher implementation can advance the state
step by step consuming a single character,
collecting all matches by traversing the trie
and branching at the transitions is fairly simple.

All of that allows for the following search syntax:

- Query `*ja.l.o*` will find:
* `java.base/java.lang.Object`
* `java.base/java.lang.Override`
* `java.base/java.lang.OutOfMemoryError`
- Query `*LDT*` will find:
* `java.base/java.time.LocalDateTime`
* `java.base/java.time.chrono.ChronoLocalDateTime`
* ...
- Query `*math max*` will find:
* `java.base/java.lang.Math.max(double, double)`
* `java.base/java.lang.Math.max(long, long)`
* ...
- Query `*j.u.s.Col.jo*` will find:
* `java.base/java.util.stream.Collectors.joining()`
* `java.base/java.util.stream.Collectors.joining(CharSequence, CharSequence, CharSequence)`
* `java.base/java.util.stream.Collectors.joining(CharSequence)`
- Query `*.C~rs*` will find:
* `java.base/java.util.stream.Collectors`
* ...

The search engine implementation is _mostly_ compliant with the
https://docs.oracle.com/en/java/javase/20/docs/specs/javadoc/javadoc-search-spec.html[Javadoc Search Specification]
but since major changes got introduced with JDK 19
(search terms can be separated by spaces),
I felt free to include custom behavior.

[NOTE]
====
One example for that is `~`, which allows skipping lower case characters
until the following part of the search term matches.
Like the custom regular expression implementation, `~` is possessive
and the engine will not backtrack to find a match.
====

As a result of the trie and simplified regex matching combination,
the search is very fast and rarely exceeds 20 ms.
Most of the time, search completes in about 1 ms.
The longer the search term, the faster the results are available.

The matching results are currently ranked by:

- Where the match of the query starts. +
A match starting at the beginning is the best case, a match starting
at or after a separator comes next and matches that start
within identifiers (upper-case letters, underscore) have the lowest rank.
`java.desktop/javax.print.attribute.SetOfIntegerSyntax`
is therefore a better match for `*Set*`
than `java.base/java.util.AbstractSet`.
- How good they match the query. +
`java.base/java.io.File.isHidden()` is a better match for `*File.isH*`
than `java.base/java.nio.file.Files.isHidden(Path)` because
it does not require further lower case characters after `File`.
- Natural order of the entries. +
Results with smaller char values at the same position come first.
The implementation is very much like `String.compareTo`
on the last segment of the qualified name.

Case insensitive matching is only performed
if case-sensitive did not yield _any_ results.

== Building

This project requires at least JDK 17.

Building should be straightforward since this is a regular Maven project.

- `mvn clean` to clean up generated artifacts (does not remove the `database` folder)
- `mvn test` runs all tests
- `mvn quarkus:dev` starts the
https://quarkus.io/guides/dev-mode-differences[Quarkus dev mode] (live reload)
- `mvn package` bundles the web application using the `prod` profile
into `target/quarkus-app`

It might be helpful to add the missing search-index files
in `src/main/resources/net/maisikoleni/javadoc/service/jdk-latest`
(see `jdk-index-files-go-here.txt` there for details), but this is not required
as they are then fetched from the official site and cached in `database/javadoc-indexes`.

To run outside Maven, `--add-exports java.base/jdk.internal.misc=ALL-UNNAMED` is required
as MicroStream requires Unsafe at the moment.

== License

Javadoc Search was created by Christian Femers in 2022
and the program and the accompanying materials are made available
under the terms of the
https://joinup.ec.europa.eu/collection/eupl/eupl-text-eupl-12[European Union Public License 1.2]
(see the https://github.com/MaisiKoleni/javadoc-search/blob/main/LICENSE["LICENSE" file],
`SPDX-License-Identifier: EUPL-1.2-or-later`)
with the following exceptions:

- `src/test/java/net/maisikoleni/javadoc/parser/JdkUrlGeneration.java` (`GPL-2.0-with-classpath-exception`)