https://github.com/r-erd/github_scan

customizable tool that collects github repos and scans their code

analysis collector github scanner

Last synced: 22 days ago

Host: GitHub
URL: https://github.com/r-erd/github_scan
Owner: r-erd
License: gpl-3.0
Created: 2024-03-23T17:38:15.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-23T17:39:02.000Z (over 2 years ago)
Last Synced: 2026-03-22T18:22:30.009Z (4 months ago)
Topics: analysis, collector, github, scanner
Language: Python
Homepage:
Size: 21.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# github_scan

This is a tool designed to collect, download and scan the code of GitHub repositories for specific criteria. Originally developed for a personal university project, this tool provides a convenient way to collect and analyze repositories based on custom requirements.

## Features

Within 48 hours, this tool can easily collect over 10,000 repository URLs, allowing you to efficiently analyze them based on your custom criteria.

- **Search Filtering**: The search supports filtering based on language, min stars, max stars, and keywords
- **Extensible Scanner**: The scanner is easy to extend; grep-like functionality is already implemented
- **Blacklist**: The blacklist lists the URLs of repositories that have already been cloned and scanned, so they are not processed again
- **Hits File**: The hits file contains the names of files/repositories that meet all specified conditions
- **Efficient Processing**: Because processed URLs are skipped via the blacklist, repeated or resumed runs save time and compute
- **GitHub Access Token**: A GitHub access token is required only for the search API used by `repo_collector`
- **Independent Usage**: The `repo_collector` and `repo_scanner` can be used independently or asynchronously, providing flexibility
- **Rate Limit Compliance**: The tool respects rate limits to ensure compliance with GitHub's usage policies
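
As an illustration of the search filtering and rate-limit handling listed above, the following is a minimal sketch and not the tool's actual code. The function names `build_search_url` and `seconds_until_reset` are hypothetical. The sketch assumes the collector queries GitHub's repository search endpoint with the `language:` and `stars:` qualifiers and reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keyword, language, min_stars, max_stars, page=1):
    """Build a repository search URL from the four filters (hypothetical helper)."""
    # GitHub search qualifiers: language:<name> and stars:<min>..<max>
    query = f"{keyword} language:{language} stars:{min_stars}..{max_stars}"
    params = {"q": query, "per_page": 100, "page": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def seconds_until_reset(headers, now):
    """Return how long to wait before the next request (hypothetical helper).

    `headers` is a plain dict here, so the header names are case-sensitive;
    `now` is the current Unix time in seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0  # requests are still available in this window
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset_at - now)
```

A caller would sleep for the returned number of seconds after each response before sending the next request.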

## Getting Started

To use this tool, you need to run two scripts either simultaneously or sequentially. Here's how to get started:

1. Run the `repo_collector` script to collect URLs of repositories and save them into CSV files in a designated directory.
2. Run the `repo_scanner` script, which watches the directory and processes the CSV files and their corresponding URLs.
3. It is recommended to redirect the output of the scripts to a file for convenient logging.
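
The handoff between the two scripts described in the steps above can be pictured with the following sketch. It is an assumption about how such a handoff might work, not the tool's implementation: the file name pattern `repos_<index>.csv` and the helpers `write_batch` and `read_batch` are hypothetical.

```python
import csv
from pathlib import Path


def write_batch(directory, batch_index, urls):
    """Collector side: save one batch of repository URLs as a CSV file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"repos_{batch_index}.csv"  # hypothetical file name
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for url in urls:
            writer.writerow([url])
    return path


def read_batch(directory, batch_index):
    """Scanner side: return the URLs of a batch, or None if it does not exist yet."""
    path = Path(directory) / f"repos_{batch_index}.csv"
    if not path.exists():
        return None
    with path.open(newline="") as handle:
        return [row[0] for row in csv.reader(handle) if row]
```

A scanner that runs alongside the collector could call `read_batch` in a loop, sleeping between attempts while it returns `None`.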

> Note: The output directory of `repo_collector` and the input directory of `repo_scanner` must be the same.

> Note: This version of `repo_scanner` looks for Flask applications and occurrences of the `render_string_template` function.

> Note: tested with Python 3.12.1
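
A grep-like check of the kind mentioned in the notes above might look like the sketch below. The regular expressions and the function `scan_repository` are assumptions made for illustration; the function name being searched for is copied from the note, and the real scanner may apply different conditions.

```python
import re
from pathlib import Path

# Assumed patterns: a Flask import plus a call to the function named in the note.
FLASK_IMPORT = re.compile(r"^\s*(from\s+flask\s+import|import\s+flask)\b", re.MULTILINE)
TARGET_CALL = re.compile(r"\brender_string_template\s*\(")


def scan_repository(repo_dir):
    """Return the Python files that satisfy both conditions (hypothetical helper)."""
    hits = []
    for path in sorted(Path(repo_dir).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        # A file counts as a hit only if every condition matches.
        if FLASK_IMPORT.search(text) and TARGET_CALL.search(text):
            hits.append(str(path))
    return hits
```

Extending the scan would then amount to adding another pattern and including it in the condition.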

## Usage

For a one-time run, the information in the Getting Started section is sufficient.
The instructions below are only needed if you want to resume a search or scan after it stopped or crashed.

#### repo_collector
- Note the number of the last repo CSV file generated by `repo_collector` and pass it as `file_batch_index`
- Note the last keyword that was processed by `repo_collector` and pass it as `starting_point`
- With these two additional parameters, run `repo_collector` as before

The collector should then resume the search it was stopped at and create a new CSV file. The results for that one keyword can overlap with those already collected, depending on where the run was aborted; other keywords are unaffected.
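
The role of `starting_point` can be illustrated with a small sketch. It assumes the collector walks through an ordered keyword list, which the README does not state explicitly; `remaining_keywords` is a hypothetical helper.

```python
def remaining_keywords(keywords, starting_point=None):
    """Return the keywords still to be searched (hypothetical helper).

    The starting keyword itself is included, so its search is repeated from
    the beginning. That is why results for this one keyword may overlap.
    """
    if starting_point is None:
        return list(keywords)
    start = keywords.index(starting_point)  # raises ValueError for an unknown keyword
    return list(keywords[start:])
```

For example, with the keywords `["api", "blog", "shop"]` and `starting_point="blog"`, only `"blog"` and `"shop"` would be searched again.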

#### repo_scanner
- Note the number of the last repo CSV file that was processed by `repo_scanner` and pass it as `file_batch_index`
- With this additional parameter, run `repo_scanner` as before

The scanner should then continue cloning and grepping at the correct CSV file. Within that file, some URLs might already have been processed, but the blacklist ensures they are skipped.
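
How a blacklist makes a resumed scan safe can be sketched as follows. This is an assumed design, not the tool's code: it supposes the blacklist is a text file with one URL per line, and `load_blacklist`, `process_batch`, and the `handle_repo` callback are hypothetical names.

```python
from pathlib import Path


def load_blacklist(path):
    """Read already processed URLs; a missing file means nothing was processed."""
    path = Path(path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def process_batch(urls, blacklist_path, handle_repo):
    """Run handle_repo on every URL that is not yet blacklisted."""
    done = load_blacklist(blacklist_path)
    processed = []
    with Path(blacklist_path).open("a") as log:
        for url in urls:
            if url in done:
                continue  # already cloned and scanned in an earlier run
            handle_repo(url)
            log.write(url + "\n")
            log.flush()  # record progress immediately in case of a crash
            done.add(url)
            processed.append(url)
    return processed
```

Writing each URL to the blacklist right after it is handled means a crash loses at most the repository that was in progress.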
