https://github.com/bugout-dev/mirror
Software project analysis
- Host: GitHub
- URL: https://github.com/bugout-dev/mirror
- Owner: bugout-dev
- License: apache-2.0
- Created: 2020-02-02T00:30:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2021-05-20T16:10:47.000Z (almost 5 years ago)
- Last Synced: 2024-11-05T18:46:58.957Z (over 1 year ago)
- Language: Python
- Size: 335 KB
- Stars: 21
- Watchers: 7
- Forks: 8
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# mirror - Tools for software project analysis
## Setup
- Prepare a Python environment and install the package
- For development, install the dev dependencies: `pip install -r requirements.dev.txt`
- Copy `sample.env` to `dev.env`, fill in the required variables, and source it
```bash
export GITHUB_TOKEN=""
export LANGUAGES_DIR=""
export MIRROR_CRAWL_INTERVAL_SECONDS=1
export MIRROR_CRAWL_MIN_RATE_LIMIT=500  # for search, 5 is a better setting
export MIRROR_CRAWL_BATCH_SIZE=""
export MIRROR_CRAWL_DIR=""
export MIRROR_LANGUAGES_FILE=""
export SNIPPETS_DIR=""
```
- To avoid being blocked by GitHub, set up a rate-limit watcher
```bash
watch -d -n 5 'curl https://api.github.com/rate_limit -s -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json"'
```
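The same check can be scripted instead of watched by eye. A minimal sketch in Python: the payload shape follows GitHub's documented `GET /rate_limit` response, but the `should_pause` helper and its threshold are illustrative assumptions, not part of mirror.

```python
import json


def should_pause(rate_limit_json: str, min_remaining: int = 500) -> bool:
    """Return True when the remaining core quota drops below the threshold.

    `rate_limit_json` is the raw body returned by GET /rate_limit.
    """
    payload = json.loads(rate_limit_json)
    remaining = payload["resources"]["core"]["remaining"]
    return remaining < min_remaining


# Example payload trimmed to the fields this check reads.
sample = '{"resources": {"core": {"limit": 5000, "remaining": 4999, "reset": 0}}}'
print(should_pause(sample))  # plenty of quota left, so False
```

A crawler loop could call this between batches and sleep until the `reset` timestamp when it returns `True`.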
### Module commands
```
python -m mirror.cli --help
clone              Clone repos from search api to output dir.
commits            Read repos json file and upload all commits for that...
crawl              Processes arguments as parsed from the command line and...
generate_snippets  Create snippets dataset from cloned repos
nextid             Prints ID of most recent repository crawled and written...
sample             Writes repositories sampled from a crawl directory to...
search             Crawl via search api.
validate           Prints ID of most recent repository crawled and written...
```
### Extract all repos metadata
Run the `crawl` command to extract metadata for all repositories and save it to a `.json` file.
```bash
python -m mirror.cli crawl \
--crawldir $MIRROR_CRAWL_DIR \
--interval $MIRROR_CRAWL_INTERVAL_SECONDS \
--min-rate-limit $MIRROR_CRAWL_MIN_RATE_LIMIT \
--batch-size $MIRROR_CRAWL_BATCH_SIZE
```
### Extract repos metadata via search api
If you only need to extract a small pool of repositories for analysis, you can set more precise criteria with the `search` command.
```bash
python -m mirror.cli search --crawldir "$MIRROR_CRAWL_DIR/search" -L "python" -s ">500" -l 5
```
### Clone repos to local machine for analysis
The `clone` command uses the standard `git clone` to clone the repositories from search results to the local machine.
Clone from search
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```
Clone from crawl
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR"
```
Structure of `$LANGUAGES_DIR` directory:
```
> $LANGUAGES_DIR
> language 1
> repo 1
> repo 2
...
> language 2
> repo 1
> repo 2
...
...
```
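Given that layout, a quick inventory of what was cloned can be built by listing one directory level per language. A hedged sketch (the `inventory` helper is mine for illustration, not part of mirror):

```python
import os
from pathlib import Path


def inventory(languages_dir: str) -> dict:
    """Map each language subdirectory of $LANGUAGES_DIR to its cloned repo names."""
    root = Path(languages_dir)
    return {
        lang.name: sorted(repo.name for repo in lang.iterdir() if repo.is_dir())
        for lang in sorted(root.iterdir())
        if lang.is_dir()
    }


# Usage: print a repo count per language.
# for lang, repos in inventory(os.environ["LANGUAGES_DIR"]).items():
#     print(f"{lang}: {len(repos)} repos")
```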
It is also possible to clone popular repositories containing Python code. See the example in [ex_clone.py](https://github.com/bugout-dev/mirror/examples/ex_clone.py)
### Create commits from repo search
The `commits` command extracts all commits from each repository and saves a `.json` file of commits per repository.
```bash
python -m mirror.cli commits -d "$MIRROR_CRAWL_DIR/commits" -l 5 -r "$MIRROR_CRAWL_DIR/search"
```
### Convert json data to csv for analysis
This creates a `.csv` file from the flattened JSON structure.
```bash
python -m mirror.github.utils --json-files-folder "$MIRROR_CRAWL_DIR" --output-csv "$MIRROR_CRAWL_DIR/output.csv" --command commits
```
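The flattening step can be illustrated with a small standalone function. This is a generic sketch of flat-JSON-to-CSV conversion under my own helper names, not mirror's actual implementation:

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-separated keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_csv(records: list) -> str:
    """Render a list of (possibly nested) JSON records as CSV text."""
    rows = [flatten(r) for r in records]
    fieldnames = sorted({k for row in rows for k in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, `to_csv([{"sha": "abc", "author": {"name": "x"}}])` yields a header row `author.name,sha` followed by `x,abc`.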
### Generate snippets dataset from downloaded repo
```bash
python -m mirror.github.generate_snippets -r "$OUTPUT_DIR" -f "examples/languages.json" -L "$LANGUAGES_DIR"
```
### Workflow: generating a snippet dataset from a prepared file of languages and their extensions
1) Create search result
```bash
python -m mirror.cli search -d "$MIRROR_CRAWL_DIR/search" -f $MIRROR_LANGUAGES_FILE -s ">500" -l 5
```
2) Clone repos from the search result. This takes time, and it may be a good idea to suppress the **git clone** output in the terminal.
```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```
3) Generate snippets
```bash
python -m mirror.cli generate_snippets -d $SNIPPETS_DIR -r $LANGUAGES_DIR
```
It returns a SQLite database with the snippets and their metadata.
To work across an **allrepos** result, the **clone** and **commits** commands accept the optional arguments
```bash
--start-id --end-id
```
Both parameters must be set together. These IDs make it possible to process only part of the repositories from an allrepos result.
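The effect of `--start-id`/`--end-id` can be pictured as a simple range filter over crawled repository records. A sketch assuming each record carries GitHub's numeric `id` field (the `select_range` helper is illustrative, not mirror's code):

```python
def select_range(repos: list, start_id: int, end_id: int) -> list:
    """Keep only repositories whose numeric id falls in [start_id, end_id]."""
    return [r for r in repos if start_id <= r["id"] <= end_id]


repos = [{"id": 10, "name": "a"}, {"id": 25, "name": "b"}, {"id": 40, "name": "c"}]
print(select_range(repos, 20, 30))  # only the record with id 25
```

Splitting an allrepos result into disjoint id ranges like this is what lets several workers process different parts of the crawl in parallel.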