https://github.com/bugout-dev/mirror
Software project analysis
- Host: GitHub
- URL: https://github.com/bugout-dev/mirror
- Owner: bugout-dev
- License: apache-2.0
- Created: 2020-02-02T00:30:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-05-20T16:10:47.000Z (almost 5 years ago)
- Last Synced: 2024-11-05T18:46:58.957Z (over 1 year ago)
- Language: Python
- Size: 335 KB
- Stars: 21
- Watchers: 7
- Forks: 8
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# mirror - Tools for software project analysis
## Setup
- Prepare a Python environment and install the package
- For development, install the dev dependencies: `pip install -r requirements.dev.txt`
- Copy `sample.env` to `dev.env`, fill in the required variables, and source it
```bash
export GITHUB_TOKEN=""
export LANGUAGES_DIR=""
export MIRROR_CRAWL_INTERVAL_SECONDS=1
export MIRROR_CRAWL_MIN_RATE_LIMIT=500  # for search, 5 is a better setting
export MIRROR_CRAWL_BATCH_SIZE=""
export MIRROR_CRAWL_DIR=""
export MIRROR_LANGUAGES_FILE=""
export SNIPPETS_DIR=""
```
- To avoid being blocked by GitHub, set up a rate-limit watcher
```bash
watch -d -n 5 'curl https://api.github.com/rate_limit -s -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json"'
```
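The same check can be scripted instead of watched by eye. A minimal sketch in Python: the payload shape follows GitHub's documented `GET /rate_limit` response, but the `should_pause` helper and its threshold are illustrative assumptions, not part of mirror.

```python
import json


def should_pause(rate_limit_json: str, min_remaining: int = 500) -> bool:
    """Return True when the remaining core quota drops below the threshold.

    `rate_limit_json` is the raw body returned by GET /rate_limit.
    """
    payload = json.loads(rate_limit_json)
    remaining = payload["resources"]["core"]["remaining"]
    return remaining < min_remaining


# Example payload trimmed to the fields this check reads.
sample = '{"resources": {"core": {"limit": 5000, "remaining": 4999, "reset": 0}}}'
print(should_pause(sample))  # plenty of quota left, so False
```

A crawler loop could call this between batches and sleep until the `reset` timestamp when it returns `True`.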
### Module commands
```
python -m mirror.cli --help
clone              Clone repos from search api to output dir.
commits            Read repos json file and upload all commits for that...
crawl              Processes arguments as parsed from the command line and...
generate_snippets  Create snippets dataset from cloned repos
nextid             Prints ID of most recent repository crawled and written...
sample             Writes repositories sampled from a crawl directory to...
search             Crawl via search api.
validate           Prints ID of most recent repository crawled and written...
```
### Extract all repos metadata
Run the `crawl` command to extract metadata for all repositories and save it to a `.json` file.
```bash
python -m mirror.cli crawl \
--crawldir $MIRROR_CRAWL_DIR \
--interval $MIRROR_CRAWL_INTERVAL_SECONDS \
--min-rate-limit $MIRROR_CRAWL_MIN_RATE_LIMIT \
--batch-size $MIRROR_CRAWL_BATCH_SIZE
```
### Extract repos metadata via search api
If you only need to extract a small pool of repositories for analysis, you can set more precise criteria with the `search` command.
```bash
python -m mirror.cli search --crawldir "$MIRROR_CRAWL_DIR/search" -L "python" -s ">500" -l 5
```
### Clone repos to local machine for analysis
The `clone` command uses the standard `git clone` to clone the repositories from search results to the local machine.
Clone from search
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```
Clone from crawl
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR"
```
Structure of `$LANGUAGES_DIR` directory:
```
> $LANGUAGES_DIR
> language 1
> repo 1
> repo 2
...
> language 2
> repo 1
> repo 2
...
...
```
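Given that layout, a quick inventory of what was cloned can be built by listing one directory level per language. A hedged sketch (the `inventory` helper is mine for illustration, not part of mirror):

```python
import os
from pathlib import Path


def inventory(languages_dir: str) -> dict:
    """Map each language subdirectory of $LANGUAGES_DIR to its cloned repo names."""
    root = Path(languages_dir)
    return {
        lang.name: sorted(repo.name for repo in lang.iterdir() if repo.is_dir())
        for lang in sorted(root.iterdir())
        if lang.is_dir()
    }


# Usage: print a repo count per language.
# for lang, repos in inventory(os.environ["LANGUAGES_DIR"]).items():
#     print(f"{lang}: {len(repos)} repos")
```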
It is also possible to clone popular repositories containing Python code. See the example in [ex_clone.py](https://github.com/bugout-dev/mirror/examples/ex_clone.py)
### Create commits from repo search
The `commits` command extracts all commits from each repository and saves a `.json` file of commits per repository.
```bash
python -m mirror.cli commits -d "$MIRROR_CRAWL_DIR/commits" -l 5 -r "$MIRROR_CRAWL_DIR/search"
```
### Convert json data to csv for analysis
This creates a `.csv` file from the flattened JSON structure.
```bash
python -m mirror.github.utils --json-files-folder "$MIRROR_CRAWL_DIR" --output-csv "$MIRROR_CRAWL_DIR/output.csv" --command commits
```
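The flattening step can be illustrated with a small standalone function. This is a generic sketch of flat-JSON-to-CSV conversion under my own helper names, not mirror's actual implementation:

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-separated keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(records: list) -> str:
    """Render a list of (possibly nested) JSON records as CSV text."""
    rows = [flatten(r) for r in records]
    fieldnames = sorted({k for row in rows for k in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, `to_csv([{"sha": "abc", "author": {"name": "x"}}])` yields a header row `author.name,sha` followed by `x,abc`.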
### Generate snippets dataset from downloaded repo
```bash
python -m mirror.github.generate_snippets -r "$OUTPUT_DIR" -f "examples/languages.json" -L "$LANGUAGES_DIR"
```
### Workflow: generating a snippet dataset from a prepared file of languages and their extensions
1) Create search result
```bash
python -m mirror.cli search -d "$MIRROR_CRAWL_DIR/search" -f $MIRROR_LANGUAGES_FILE -s ">500" -l 5
```
2) Clone repos from the search result. This takes time, and it may be a good idea to suppress the **git clone** output in the terminal.
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```
3) Generate snippets
```bash
python -m mirror.cli generate_snippets -d $SNIPPETS_DIR -r $LANGUAGES_DIR
```
It returns a SQLite database with the snippets and their metadata.
To work across an **allrepos** result, the **clone** and **commits** commands accept the optional arguments
```bash
--start-id --end-id
```
Both parameters must be set together. These IDs make it possible to process only part of the repositories from an allrepos result.
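The effect of `--start-id`/`--end-id` can be pictured as a simple range filter over crawled repository records. A sketch assuming each record carries GitHub's numeric `id` field (the `select_range` helper is illustrative, not mirror's code):

```python
def select_range(repos: list, start_id: int, end_id: int) -> list:
    """Keep only repositories whose numeric id falls in [start_id, end_id]."""
    return [r for r in repos if start_id <= r["id"] <= end_id]


repos = [{"id": 10, "name": "a"}, {"id": 25, "name": "b"}, {"id": 40, "name": "c"}]
print(select_range(repos, 20, 30))  # only the record with id 25
```

Splitting an allrepos result into disjoint id ranges like this is what lets several workers process different parts of the crawl in parallel.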