https://github.com/chris-santiago/bookmarks-topics

Using unsupervised learning and language modeling to cluster and reorganize web bookmarks.

bert-embeddings bertopic bookmarks clustering generative-modeling hdbscan hydra llm openai taskfile umap unsupervised-learning

# bookmarks-topics

This project continues the stale [bookmarks_clustering](https://github.com/chris-santiago/bookmarks_clustering) project. It has been updated to use newer embedding and generative models, mostly via the [BERTopic](https://maartengr.github.io/BERTopic/index.html) library.

# Usage

## Prerequisites

1. This project uses [Task](https://taskfile.dev/) to run and manage tasks, so you'll need to first install that on your machine.
2. This project uses OpenAI's API. You'll need an API key from OpenAI; place it in a `.env` file within this project's root directory. The key should be `OPENAI_KEY` and the value is your API key. For example:

```toml
OPENAI_KEY=sk-proj-_mySuperSecretOpenAIkey
```

3. Export your bookmarks to an HTML file. *Note: this project used Google Chrome bookmarks.*
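
As a quick sanity check for the `.env` file from step 2, the sketch below reads it with only the Python standard library and confirms that `OPENAI_KEY` is defined. The `read_env` and `has_openai_key` helpers are hypothetical and handle only simple `KEY=value` lines; this is not necessarily how the project itself loads the key.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=value lines from a .env file (hypothetical helper)."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path=".env"):
    """Return True when the file exists and defines a non-empty OPENAI_KEY."""
    env_file = Path(path)
    return env_file.is_file() and bool(read_env(env_file).get("OPENAI_KEY"))
```

Calling `has_openai_key()` from the project's root directory should return `True` once the file is in place.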

## Setup

Clone this repo and install the project and dependencies:

```bash
git clone https://github.com/chris-santiago/bookmarks-topics.git
cd bookmarks-topics
conda env create -f environment.yaml
pip install .
```

## Quick Start

Once you've completed the prerequisites and set up the project environment, you can run the entire pipeline with:

```bash
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html"
```

This parses your bookmarks file and fetches content from every bookmarked URL before running the clustering algorithm. **You may not want to organize ALL of your bookmarks, but rather a subset.** In that case, you can pass a comma-separated list of specific folders:

```bash
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html" "bookmarks.folders=[My first folder,My second folder]"
```
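
To make the folder filter more concrete, here is a minimal sketch of how a Chrome-style (Netscape format) bookmarks export could be parsed and restricted to chosen folders using only the standard library. `BookmarkParser` and `bookmarks_in_folders` are hypothetical names, and this is an illustration of the idea rather than the project's own parser.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (folder_path, title, url) triples from a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.folders = []     # stack of enclosing folder names (None for the root list)
        self.pending = None   # folder name read from <H3>, waiting for its <DL>
        self.capture = None   # tag whose text is being collected: "h3" or "a"
        self.text = []
        self.href = None
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self.capture, self.text = tag, []
            if tag == "a":
                self.href = dict(attrs).get("href")
        elif tag == "dl":
            # A <DL> opens the contents of the most recent <H3> folder.
            self.folders.append(self.pending)
            self.pending = None

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl" and self.folders:
            self.folders.pop()
        elif tag == self.capture:
            label = "".join(self.text).strip()
            if tag == "h3":
                self.pending = label
            else:
                path = tuple(name for name in self.folders if name)
                self.bookmarks.append((path, label, self.href))
            self.capture = None


def bookmarks_in_folders(html, wanted):
    """Return (title, url) pairs for bookmarks inside any of the wanted folders."""
    parser = BookmarkParser()
    parser.feed(html)
    wanted = set(wanted)
    return [(title, url) for path, title, url in parser.bookmarks if wanted & set(path)]
```

In this sketch a bookmark is kept when any enclosing folder matches, so bookmarks in subfolders of a selected folder are included as well; the project may treat nesting differently.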

Once the run completes, your reorganized bookmarks are placed in a newly created `outputs/topics/` directory within this project's root directory. That directory is organized by date and time; find the folder that corresponds to your most recent run and import the `new_bookmarks.html` file back into your browser. You can also view a breakdown of bookmarks and topics in the `bookmarks_topics.json` file in that same folder.

**Note**: If you haven't added `task` to your PATH, you can replace that command with `./bin/task`.
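
If you have several runs and want to locate the newest result without browsing the dated folders by hand, something like the following sketch could help. It searches a directory tree for `new_bookmarks.html` and returns the most recently modified copy. The `latest_bookmarks_file` helper is hypothetical, and it deliberately makes no assumption about how the date and time folders are named.

```python
from pathlib import Path


def latest_bookmarks_file(root, name="new_bookmarks.html"):
    """Return the most recently modified `name` under `root`, or None if absent."""
    candidates = list(Path(root).rglob(name))
    # Compare by modification time so the folder naming scheme does not matter.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Pass the topics output directory as `root`; the returned path is the file to import into your browser.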

### Example Output

#### HTML

```html
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>JavaScript D3.js</H3>
    <DL><p>
        <DT><A HREF="...">javascript - Moving the axes in d3.js - Stack Overflow</A>
        <DT><A HREF="...">javascript - D3.csv accessor function for loop - Stack Overflow</A>
        <DT><A HREF="...">Javascript: take every nth Element of Array - Stack Overflow</A>
        <DT><A HREF="...">How to add in zero values into a time series in d3.js / JavaScript - Stack Overflow</A>
        <DT><A HREF="...">How to get the difference between two arrays in JavaScript? - Stack Overflow</A>
        <DT><A HREF="...">javascript - d3.js Specify text for x-axis - Stack Overflow</A>
        <DT><A HREF="...">javascript - D3 get attributes from element - Stack Overflow</A>
        <DT><A HREF="...">javascript - How to select unique values in d3.js from data - Stack Overflow</A>
        <DT><A HREF="...">html - Targeting nested elements with CSS - Stack Overflow</A>
        <DT><A HREF="...">MathJax basic tutorial and quick reference - Mathematics Meta Stack Exchange</A>
        <DT><A HREF="...">How to Debug JavaScript in Visual Studio Code with live-server Running - Stack Overflow</A>
        <DT><A HREF="...">jetbrains ide - IntelliJ: Error: java: release version 10 not supported - Stack Overflow</A>
        <DT><A HREF="...">javascript - reversed Y-axis D3 - Stack Overflow</A>
        <DT><A HREF="...">d3.js - Plot multiple lines in a for loop in d3 - Stack Overflow</A>
    </DL><p>
</DL><p>
```

#### JSON

```json
[
  {
    "url": "https://appliedcausalinference.github.io/aci_book",
    "title": "Applied Causal Inference",
    "topic": "Bayesian Causal Inference"
  },
  {
    "url": "https://astral.sh/blog/u",
    "title": "uv: Python packaging in Rust",
    "topic": "Python Development Tools"
  },
  {
    "url": "https://bayesiancomputationbook.com/markdown/chp_01.htm",
    "title": "1. Bayesian Inference \u2014 Bayesian Modeling and Computation in Python",
    "topic": "Bayesian Causal Inference"
  }
]
```
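
Because the JSON file is a flat list of records with `url`, `title`, and `topic` keys, it is easy to summarize. The sketch below groups bookmark titles by topic; `group_by_topic` and `load_topics` are hypothetical helpers written for this example, not part of the project.

```python
import json
from collections import defaultdict


def group_by_topic(records):
    """Map each topic to the titles of the bookmarks assigned to it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["topic"]].append(record["title"])
    return dict(groups)


def load_topics(path):
    """Read a bookmarks_topics.json file and group its records by topic."""
    with open(path, encoding="utf-8") as handle:
        return group_by_topic(json.load(handle))
```

For the three records shown above, this would yield two topics, with both Bayesian titles listed under "Bayesian Causal Inference".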

# Tinkering

This project is configured using [Hydra](https://hydra.cc/docs/intro/), and the current configs are found in the `conf` directory. You can modify behavior either by editing these configs directly or by overriding values on the command line.

| Config | Use                                                        | Path                      |
|--------|------------------------------------------------------------|---------------------------|
| Main   | Main configuration file. Use this to tune the topic model. | `conf/config.yaml`        |
| Prompt | Configure LLM prompts.                                     | `conf/prompt/*`           |
| Paths  | Configure your local paths.                                | `conf/paths/default.yaml` |
| Hydra  | Configure Hydra.                                           | `conf/hydra/default.yaml` |

## CLI Override

You can [override](https://hydra.cc/docs/advanced/override_grammar/basic/) much of the configuration directly from the command line by passing Hydra overrides after `--` in the command. For example:

```bash
task cluster-bookmarks -- "topics.topic_model.top_n_words=5"
```
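
Conceptually, a dotted override such as the one above walks down the nested configuration and replaces a single leaf value. The following sketch mimics that behavior on a plain dictionary. It is a deliberately simplified illustration, not Hydra's actual parser, which supports a much richer grammar (for example `+key=value` to add keys and `~key` to remove them). The `apply_override` helper and the starting value of `10` are assumptions made for the example.

```python
import json


def apply_override(config, override):
    """Set a nested key from a `dotted.key=value` string; returns the same dict."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate levels on demand so new branches can be set too.
        node = node.setdefault(name, {})
    try:
        node[leaf] = json.loads(raw_value)  # numbers, booleans, quoted strings
    except json.JSONDecodeError:
        node[leaf] = raw_value              # anything else stays a plain string
    return config


# Hypothetical starting config; the real defaults live in conf/config.yaml.
config = {"topics": {"topic_model": {"top_n_words": 10}}}
apply_override(config, "topics.topic_model.top_n_words=5")
```

After the call, `config["topics"]["topic_model"]["top_n_words"]` holds the integer `5`, while a value such as a file path would be stored as a string.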

## Tasks

You can, of course, also run individual tasks, each of which executes the corresponding Python module. This is useful when you are tuning the topic model (the `topics` task) and want to avoid fetching and parsing HTML from your bookmarked URLs again.

```bash
task: Available tasks for this project:
* bookmarks: Read bookmarks file
* check-config: Check Hydra configuration
* cluster-bookmarks: Run entire bookmarks clustering pipeline.
* fetch-html: Get bookmarks raw html
* lint: Check source code for errors (will run before tasks)
* parse-html: Parse bookmarks raw html
* topics: Get topics
```