https://github.com/felsenhower/top500-dataloader

TOP500 scraper / downloader / dataloader
Host: GitHub
URL: https://github.com/felsenhower/top500-dataloader
Owner: felsenhower
License: mit
Created: 2025-10-08T11:22:38.000Z
Default Branch: main
Last Pushed: 2025-10-15T15:28:33.000Z
Last Synced: 2025-10-16T12:17:56.542Z
Topics: excel, package, polars, python, scraper, top500, xml
Language: Python
Homepage:
Size: 121 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# top500-dataloader

This repository contains a scraper / downloader / dataloader for the [TOP500](https://top500.org/) website.

## Usage

⚠️ The module isn't on PyPI, so all examples use `uv` and `uvx`, which work well with packages installed from git.

### As an Executable

Once the package is installed, you can invoke the CLI via:

```shell
$ python -m top500 --help
usage: top500 [-h] [-d dir] {list-online,list-local,download,download-all,display} ...

Download or view TOP500 lists.

positional arguments:
  {list-online,list-local,download,download-all,display}
    list-online         List TOP500 list issues that are available online.
    list-local          List TOP500 list issues that are available locally.
    download            Download a TOP500 list issue (see "download --help" for more info).
    download-all        Download all TOP500 list issues that are available online.
    display             Display a TOP500 list on the console (see "display --help" for more info).

options:
  -h, --help            show this help message and exit
  -d, --download-dir dir
                        Set the download dir. Defaults to "/home/ruben/.local/share/top500".
```

You can run the same command without a prior install by using `uvx`:

```shell
$ uvx git+https://github.com/felsenhower/top500-dataloader.git --help
```

If you simply want to download all TOP500 issues to `~/.local/share/top500`, you can do this:

```shell
$ uvx git+https://github.com/felsenhower/top500-dataloader.git download-all
```

To get a nice tabular view of the available lists online:

```shell
$ uvx git+https://github.com/felsenhower/top500-dataloader.git list-online
Fetching https://top500.org/lists/top500/...
shape: (65, 6)
┌─────────┬───────────────┬────────┬──────────────┬───────────────────────────┬─────────────────────────────────────────┐
│ key     ┆ title         ┆ number ┆ published_on ┆ published_at              ┆ url                                     │
│ ---     ┆ ---           ┆ ---    ┆ ---          ┆ ---                       ┆ ---                                     │
│ str     ┆ str           ┆ i64    ┆ date         ┆ str                       ┆ object                                  │
╞═════════╪═══════════════╪════════╪══════════════╪═══════════════════════════╪═════════════════════════════════════════╡
│ 2025-06 ┆ June 2025     ┆ 65     ┆ 2025-06-14   ┆ Hamburg, Germany          ┆ https://top500.org/lists/top500/2025/06 │
│ 2024-11 ┆ November 2024 ┆ 64     ┆ 2024-11-19   ┆ Atlanta, GA, USA          ┆ https://top500.org/lists/top500/2024/11 │
│ 2024-06 ┆ June 2024     ┆ 63     ┆ 2024-06-01   ┆ Hamburg, Germany          ┆ https://top500.org/lists/top500/2024/06 │
│ 2023-11 ┆ November 2023 ┆ 62     ┆ 2023-11-14   ┆ Denver, CO, USA           ┆ https://top500.org/lists/top500/2023/11 │
│ 2023-06 ┆ June 2023     ┆ 61     ┆ 2023-06-01   ┆ Hamburg, Germany          ┆ https://top500.org/lists/top500/2023/06 │
│ 2022-11 ┆ November 2022 ┆ 60     ┆ 2022-11-15   ┆ Dallas, TX, USA           ┆ https://top500.org/lists/top500/2022/11 │
[...]
```
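Judging by the rows shown above, the `key`, `number`, and `url` columns are related in a regular way. The following sketch reconstructs `url` and `number` from a key. The helper names (`parse_key`, `key_to_url`, `key_to_number`) are hypothetical and not part of the package, and the URL pattern is an assumption inferred from the sample output. The numbering relies on the TOP500 list being published twice a year (June and November) since June 1993.

```python
# Base URL as it appears in the sample output above.
BASE_URL = "https://top500.org/lists/top500"


def parse_key(key: str) -> tuple[int, int]:
    """Split a key such as "2025-06" into (year, month)."""
    year_text, month_text = key.split("-")
    year, month = int(year_text), int(month_text)
    if month not in (6, 11):
        raise ValueError(f"expected a June or November list, got month {month}")
    return year, month


def key_to_url(key: str) -> str:
    """Rebuild the list URL from a key (assumed pattern, see lead-in)."""
    year, month = parse_key(key)
    return f"{BASE_URL}/{year}/{month:02d}"


def key_to_number(key: str) -> int:
    """Compute the running list number: June 1993 is list 1, November 1993 is list 2, and so on."""
    year, month = parse_key(key)
    return 2 * (year - 1993) + (1 if month == 6 else 2)


if __name__ == "__main__":
    for key in ("2025-06", "2024-11", "2022-11"):
        print(key, key_to_number(key), key_to_url(key))
```

In practice the `Top500ListInfo` objects returned by the package already carry this information, so a helper like this is mainly useful for sanity checks.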

The `key` may be used to get a glimpse of a list like this:

```shell
$ uvx git+https://github.com/felsenhower/top500-dataloader.git display 2025-06
shape: (500, 7)
┌──────┬─────────────────────────────────┬──────────────────────┬─────────────────────────────────┬────────────────┬─────────────────┬────────────┐
│ Rank ┆ System Name                     ┆ Country              ┆ Manufacturer                    ┆ Rmax [GFlop/s] ┆ Rpeak [GFlop/s] ┆ Power [kW] │
│ ---  ┆ ---                             ┆ ---                  ┆ ---                             ┆ ---            ┆ ---             ┆ ---        │
│ i64  ┆ str                             ┆ str                  ┆ str                             ┆ f64            ┆ f64             ┆ f64        │
╞══════╪═════════════════════════════════╪══════════════════════╪═════════════════════════════════╪════════════════╪═════════════════╪════════════╡
│ 1    ┆ El Capitan                      ┆ United States        ┆ HPE                             ┆ 1.7420e9       ┆ 2.7464e9        ┆ 29581.0    │
│ 2    ┆ Frontier                        ┆ United States        ┆ HPE                             ┆ 1.3530e9       ┆ 2.0557e9        ┆ 24607.0    │
│ 3    ┆ Aurora                          ┆ United States        ┆ Intel                           ┆ 1.0120e9       ┆ 1.9800e9        ┆ 38698.4    │
│ 4    ┆ JUPITER Booster                 ┆ Germany              ┆ EVIDEN                          ┆ 7.934e8        ┆ 9.3e8           ┆ 13088.2    │
│ 5    ┆ Eagle                           ┆ United States        ┆ Microsoft Azure                 ┆ 5.612e8        ┆ 8.468352e8      ┆ null       │
│ 6    ┆ HPC6                            ┆ Italy                ┆ HPE                             ┆ 4.779e8        ┆ 6.0696576e8     ┆ 8460.9     │
│ 7    ┆ Supercomputer Fugaku            ┆ Japan                ┆ Fujitsu                         ┆ 4.4201e8       ┆ 5.37212e8       ┆ 29899.2    │
│ 8    ┆ Alps                            ┆ Switzerland          ┆ HPE                             ┆ 4.349e8        ┆ 5.7484128e8     ┆ 7124.0     │
│ 9    ┆ LUMI                            ┆ Finland              ┆ HPE                             ┆ 3.797e8        ┆ 5.3150515e8     ┆ 7106.8     │
[...]
```
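The performance columns are reported in GFlop/s and the power column in kW, so the raw values are hard to read at a glance. Below is a minimal sketch of three derived metrics, using the El Capitan and Eagle rows from the output above. The function names are hypothetical helpers, not part of the package, and `None` stands in for the `null` power value.

```python
from typing import Optional


def gflops_to_eflops(gflops: float) -> float:
    """Convert GFlop/s to EFlop/s (1 EFlop/s = 1e9 GFlop/s)."""
    return gflops / 1e9


def hpl_efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """Fraction of the theoretical peak (Rpeak) reached in the benchmark (Rmax)."""
    return rmax_gflops / rpeak_gflops


def gflops_per_watt(rmax_gflops: float, power_kw: Optional[float]) -> Optional[float]:
    """Energy efficiency in GFlop/s per watt, or None if no power figure was reported."""
    if power_kw is None:
        return None
    return rmax_gflops / (power_kw * 1000.0)  # kW -> W


# (name, Rmax [GFlop/s], Rpeak [GFlop/s], Power [kW]) copied from the table above
rows = [
    ("El Capitan", 1.7420e9, 2.7464e9, 29581.0),
    ("Eagle", 5.612e8, 8.468352e8, None),
]

if __name__ == "__main__":
    for name, rmax, rpeak, power in rows:
        print(
            name,
            gflops_to_eflops(rmax),
            hpl_efficiency(rmax, rpeak),
            gflops_per_watt(rmax, power),
        )
```

Handling the missing power value explicitly avoids a `TypeError` for systems such as Eagle that do not report one.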

### As a Python Module

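```python
import top500

# Walk through every list that is available online.
for list_info in top500.iter_lists_online():
    df = top500.read_list(list_info)
    # The first row is the system ranked first on that list.
    fastest_computer = df["name"][0]
    # Skip lists whose top entry has no name.
    if fastest_computer is None:
        continue
    print(f"In {list_info.title}, the fastest computer was {fastest_computer}.")
```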

The module exports these functions (see [`__init__.py`](https://github.com/felsenhower/top500-dataloader/blob/main/src/top500/__init__.py) for their docstrings):

```python
def set_download_dir(download_dir: str | os.PathLike) -> None: ...
def get_download_dir() -> Path: ...
def iter_lists_online(newest_first: bool = True) -> Iterator[Top500ListInfo]: ...
def iter_lists_local(newest_first: bool = True) -> Iterator[Top500ListInfo]: ...
def download_list(list_info_or_key: str | Top500ListInfo) -> None: ...
def download_all_lists() -> None: ...
def read_list(list_info_or_key: str | Top500ListInfo, allow_download: bool = True, source: str = "normalized") -> pl.DataFrame: ...
```

Some Python examples are located in the [examples](examples) directory.

The `read_list` function returns a `polars.DataFrame` for the TOP500 list you request.

You can pass either the key as a `str` or a `Top500ListInfo` object. When you pass a key, the TOP500 overview page may be visited to look up the list.

If a list is not downloaded yet, it can be automatically downloaded, unless `allow_download` is set to `False`.
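For offline use, one option is to wrap `read_list` so that a missing list yields `None` instead of an error. This is only a sketch: the README does not say which exception `read_list` raises when a list is missing and `allow_download` is `False`, so the wrapper catches `Exception` broadly. `read_local_or_none` is a hypothetical helper, and the reader function is passed in as an argument so the wrapper can be tried without the package installed.

```python
from typing import Any, Callable, Optional


def read_local_or_none(read_list: Callable[..., Any], key: str) -> Optional[Any]:
    """Return the list for `key` from local data only, or None if reading fails.

    Assumption: read_list raises some exception when the list is not available
    locally and allow_download is False. The concrete type is not documented
    in the README, hence the broad `except`.
    """
    try:
        return read_list(key, allow_download=False)
    except Exception:
        return None
```

With the package installed, this would be called as `read_local_or_none(top500.read_list, "2025-06")`. Since a `str` key may still cause a visit to the overview page, passing a `Top500ListInfo` from `iter_lists_local()` is probably the safer choice when no network is available.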

The `source` argument can be `excel`, `xml`, `normalized` or `normalized-pretty`.

- `excel` returns the data as it appears in the Excel file (the columns are not stable).

- `xml` returns the data as it appears in the XML file (the columns are not stable).

- `normalized` returns a merge of `excel` and `xml` with stable and sane columns.

- `normalized-pretty` is like `normalized`, but with prettier column names (similar to `excel`).
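To see how the four sources differ for a given list, it can help to compare their column names side by side. The sketch below assumes only what is stated above: `read_list` accepts a `source` keyword and returns a `polars.DataFrame`, which exposes its column names via `.columns`. `columns_by_source` is a hypothetical helper, and the reader function is passed in so the code does not depend on the package being importable.

```python
from typing import Any, Callable, Dict, List

# The four values of `source` listed above.
SOURCES = ("excel", "xml", "normalized", "normalized-pretty")


def columns_by_source(read_list: Callable[..., Any], key: str) -> Dict[str, List[str]]:
    """Read the same list once per source and collect the column names."""
    return {
        source: list(read_list(key, source=source).columns)
        for source in SOURCES
    }
```

With the package installed, `columns_by_source(top500.read_list, "2025-06")` would return a dictionary with one list of column names per source.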