https://github.com/src-d/gitcollector
https://github.com/src-d/gitcollector
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/src-d/gitcollector
- Owner: src-d
- License: gpl-3.0
- Created: 2019-05-30T08:49:51.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2019-10-31T12:03:07.000Z (over 5 years ago)
- Last Synced: 2025-05-05T05:05:41.100Z (about 2 months ago)
- Language: Go
- Size: 204 KB
- Stars: 18
- Watchers: 8
- Forks: 14
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# gitcollector [](https://github.com/src-d/gitcollector/releases) [](https://travis-ci.com/src-d/gitcollector) [](https://codecov.io/gh/src-d/gitcollector) [](https://godoc.org/gopkg.in/src-d/gitcollector.v0) [](https://goreportcard.com/report/github.com/src-d/gitcollector)
**gitcollector** collects and stores git repositories.
gitcollector is the source{d} tool to download and update git repositories at
large scale. To that end, it uses a custom repository storage
[file format](https://blog.sourced.tech/post/siva/) called [siva](https://github.com/src-d/go-siva) optimized for saving
storage space and keeping repositories up-to-date.## Status
The project is in a preliminary stable stage and under active development.
## Storing repositories using rooted repositories
A rooted repository is a [bare Git repository](http://www.saintsjd.com/2011/01/what-is-a-bare-git-repository/) that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the [Siva](https://github.com/src-d/go-siva) file format.

Rooted repositories have a few particularities that you should know to work with them effectively:
- They have no `HEAD` reference.
- All references are of the following form: `{REFERENCE_NAME}/{REMOTE_NAME}`. For example, the reference `refs/heads/master` of the remote `foo` would be `/refs/heads/master/foo`.
- Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
- A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
- The root commit for a repository is obtained following the first parent of each commit from HEAD.## Getting started
### Plain command
gitcollector entry point usage is done through the subcommand `download` (at this time is the only subcommand):
```txt
Usage:
gitcollector [OPTIONS] download [download-OPTIONS]Help Options:
-h, --help Show this help message[download command options]
--library= path where download to [$GITCOLLECTOR_LIBRARY]
--bucket= library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
--tmp= directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
--workers= number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
--half-cpu set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
--no-updates don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
--no-forks github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
--orgs= list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
--excluded-repos= list of repos to exclude separated by comma [$GITCOLLECTOR_EXCLUDED_REPOS]
--token= github token [$GITHUB_TOKEN]
--metrics-db= uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
--metrics-db-table= table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
--metrics-sync-timeout= timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]Log Options:
--log-level=[info|debug|warning|error] Logging level (default: info) [$LOG_LEVEL]
--log-format=[text|json] log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
--log-fields= default fields for the logger, specified in json [$LOG_FIELDS]
--log-force-format ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]
```Usage example, `--library` and `--orgs` are always required:
> gitcollector download --library=/path/to/repos/directoy --orgs=src-d
To collect repositories from several github organizations:
> gitcollector download --library=/path/to/repos/directoy --orgs=src-d,bblfsh
Note that all the download command options are also configurable with environment variables.
### Docker
gitcollector upload a new docker image to [docker hub](https://hub.docker.com/r/srcd/gitcollector/tags) on each new release. To use it:
``` sh
docker run --rm --name gitcollector_1 \
-e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
-e "GITHUB_TOKEN=foo" \
-v /path/to/repos/directory:/library \
srcd/gitcollector:latest
```Note that you must mount a local directory into the specific container path shown in `-v /path/to/repos/directory:/library`. This directory is where the repositories will be downloaded into rooted repositories in siva files format.
## License
GPL v3.0, see [LICENSE](LICENSE)