
https://github.com/colindean/homebrew-size-analysis

Analyzing the size of Homebrew formulae bottles

data-science hacktoberfest homebrew



Host: GitHub
URL: https://github.com/colindean/homebrew-size-analysis
Owner: colindean
License: unlicense
Created: 2023-04-13T13:49:51.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-11-18T16:26:43.000Z (2 months ago)
Last Synced: 2024-12-16T22:45:16.374Z (about 1 month ago)
Topics: data-science, hacktoberfest, homebrew
Language: Makefile
Homepage:
Size: 24.4 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE


# Homebrew Bottle Size Analysis

Analyzing the size of Homebrew formulae bottles

## Intent

The [Homebrew formula JSON API](https://formulae.brew.sh/docs/api) does not provide package size information for bottles[^def_bottle].

I aim to retrieve package sizes regularly in order to build a database of `(package, version, bottle_arch) -> size` pairs for future analysis.
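Since the database format is still unspecified, here is a minimal sketch of one possible shape: an append-only tab-separated file with one line per observation. The helper name `record_size`, the column order, and the retrieval-date column are all assumptions made for illustration, not part of this project.

```sh
# Hypothetical helper: append one (package, version, bottle_arch) -> size
# observation to a tab-separated file, stamped with the UTC retrieval date.
# Usage: record_size DB_FILE PACKAGE VERSION BOTTLE_ARCH SIZE_BYTES
record_size() {
  db=$1 package=$2 version=$3 arch=$4 size=$5
  # Appending (>>) never rewrites earlier observations.
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%d)" "$package" "$version" "$arch" "$size" >> "$db"
}
```

Appending rather than rewriting keeps earlier observations intact, which suits a dataset meant to be analyzed over time.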

This analysis could capture:

- Package growth over time

- Total estimated archive size

- Spikes in package size indicating significant changes warranting further inspection

- Expired or broken package URLs for rarely-updated formulae with rarely-downloaded bottles

- Packages to target for size optimization, from individual relief to [humanity-scale savings](https://daniel.haxx.se/blog/2022/12/06/faster-base64-in-curl/).
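Two of the analyses above can be sketched with awk alone. This assumes a hypothetical input file of tab-separated lines in chronological order with the columns date, package, version, bottle_arch, and size in bytes. Neither that layout nor the function names `total_latest_size` and `size_spikes` come from this project.

```sh
# Total estimated archive size: sum the most recent size seen for each
# (package, bottle_arch) pair. Later lines overwrite earlier ones.
total_latest_size() {
  awk -F '\t' '
    { latest[$2 FS $4] = $5 + 0 }
    END { for (k in latest) total += latest[k]; printf "%.0f\n", total }
  ' "$1"
}

# Spikes: print "package bottle_arch old_size new_size" whenever a bottle
# grows by more than FACTOR (default 1.5) between consecutive observations.
size_spikes() {
  awk -F '\t' -v factor="${2:-1.5}" '
    {
      key = $2 FS $4
      size = $5 + 0
      if ((key in prev) && prev[key] > 0 && size > prev[key] * factor)
        print $2, $4, prev[key], size
      prev[key] = size
    }
  ' "$1"
}
```

For example, `size_spikes sizes.tsv 2` would list only bottles that more than doubled in size, where `sizes.tsv` is a placeholder file name.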

This is **currently** mostly an experiment in using simple CLI tools like Make and curl to do some data engineering and data science with the useful implications listed above.

### Current principles

- KISS, to the level of probably dumb.

- Use CLI tools as much as possible; keep code to a minimum.

- [Anything that can be installed via Homebrew](https://formulae.brew.sh) is fair to use.

- Prioritize concurrency using simple tools such as Make `-j`, xargs, parallel, fd, ripgrep, etc.

- Rebuilding the database from scratch means losing data, so avoid that.
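As an illustration of the concurrency principle, here is a sketch that fans a command out over files with `find` and `xargs -P`. It is not how this project's Makefile does it. The function name `for_each_url_file` and the `.url` file suffix are assumptions.

```sh
# Run COMMAND once per *.url file under DIR, at most JOBS at a time.
# Usage: for_each_url_file DIR JOBS COMMAND [ARGS...]
for_each_url_file() {
  dir=$1 jobs=$2
  shift 2
  # -print0 / -0 keep file names with spaces intact;
  # -n 1 passes one file per invocation; -P caps the parallelism.
  find "$dir" -type f -name '*.url' -print0 |
    xargs -0 -P "$jobs" -n 1 "$@"
}
```

A call such as `for_each_url_file data 2 cat` would print every URL file using two worker processes.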

## Usage

```sh
make formula.json  # download the Homebrew API data file
make urls          # extract the bottle URLs into one file per URL
make sizes         # retrieve each bottle's size
```

## Architecture

```mermaid

stateDiagram-v2

   Formulajson : Homebrew API \n formula.json

   Urls : One URL file per URL

   Database : Database (Unspecified Format)

    [*] --> Formulajson : retrieve latest database

    Formulajson --> Urls : extract triplets, write URL files

    state fork_state <<fork>>

    state join_state <<join>>

    Urls --> fork_state : list URL files

    fork_state --> HTTPRequest1 : retrieve package size

    fork_state --> HTTPRequest2 : through HEAD requests

    fork_state --> HTTPRequestN : to all URLs in files

    HTTPRequest1 --> Sizes1 : write size file

    HTTPRequest2 --> Sizes2 : write size file

    HTTPRequestN --> SizesN : write size file

    Sizes1 --> join_state

    Sizes2 --> join_state

    SizesN --> join_state

    join_state --> Database

    Database --> [*]

    note left of fork_state

      One size file per retrieved URL

    end note

```
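The two main steps in the diagram can be sketched as shell functions. This is an illustration under assumptions, not the project's actual Makefile rules. The function names are hypothetical. The jq filter assumes each entry in `formula.json` has `name`, `versions.stable`, and `bottle.stable.files` keyed by bottle architecture. The bottle host may require extra request headers that are not covered here, so `bottle_size` simply passes any additional arguments through to curl.

```sh
# Step 1: print one "name<TAB>version<TAB>bottle_arch<TAB>url" line per bottle.
# Formulae without a stable bottle produce no output.
extract_bottle_urls() {
  jq -r '
    .[]
    | .name as $name
    | .versions.stable as $version
    | (.bottle.stable.files // {})
    | to_entries[]
    | [$name, $version, .key, .value.url]
    | @tsv
  ' "$1"
}

# Step 2a: read HTTP response headers on stdin and print the last
# Content-Length value. With redirects, the last header block is the final one.
content_length_from_headers() {
  tr -d '\r' |
    awk -F ': *' '
      tolower($1) == "content-length" { len = $2 }
      END { if (len != "") print len }
    '
}

# Step 2b: issue a HEAD request, following redirects, and print the size.
# Usage: bottle_size URL [EXTRA_CURL_ARGS...]
bottle_size() {
  url=$1
  shift
  curl --silent --show-error --head --location "$@" "$url" |
    content_length_from_headers
}
```

Keeping the header parsing separate from the request makes it possible to check the parsing offline against saved headers.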

## Performance notes

A full run takes around 80 minutes for me at two requests at a time. I keep concurrency that low in order not to trigger some kind of rate limit at my ISP level[^not_ghcr].

You can check the counts of URL and size files by running something like this:

```sh
fd '\.url$' data | wc -l   # number of URL files
fd '\.size$' data | wc -l  # number of size files
```

If the numbers are the same, you've got the data for the current `formula.json`.
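When the counts differ, it can help to see which URLs still lack a size. Here is a minimal sketch, assuming each size file sits next to its URL file with the same base name (`X.url` and `X.size`). That pairing and the function name `missing_sizes` are assumptions.

```sh
# List every *.url file under DIR (default: data) with no sibling *.size file.
missing_sizes() {
  find "${1:-data}" -type f -name '*.url' |
    while IFS= read -r url_file; do
      size_file="${url_file%.url}.size"
      [ -f "$size_file" ] || printf '%s\n' "$url_file"
    done
}
```

Piping the result through `wc -l` gives the number of requests still outstanding.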

[^def_bottle]: A _bottle_ is a pre-packaged archive of a formula available in Homebrew.

    See  for more information.

[^not_ghcr]: It's not ghcr.io rate-limiting me.

    My gateway is working fine but my ISP drops the upstream connection.

    It's probably some kind of DDoS protection at the DNS level.

    See notes.txt for ways I might get around this since curl does

    a DNS lookup every time it launches.