{"id":22561582,"url":"https://github.com/seart-group/ghs","last_synced_at":"2025-04-04T19:07:21.305Z","repository":{"id":40406519,"uuid":"333974111","full_name":"seart-group/ghs","owner":"seart-group","description":"GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them","archived":false,"fork":false,"pushed_at":"2025-04-03T00:44:00.000Z","size":43169,"stargazers_count":152,"open_issues_count":6,"forks_count":19,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-03T01:27:56.699Z","etag":null,"topics":["bootstrap","crawler","csv-export","dataset-generation","docker-compose","git","github","java-17","json-export","mining-software-repositories","msr","mysql","platform","repository","search-engine","spring-boot","spring-boot-application","spring-boot-server","sql-dump","xml-export"],"latest_commit_sha":null,"homepage":"https://seart-ghs.si.usi.ch","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seart-group.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-28T22:34:55.000Z","updated_at":"2025-04-03T00:43:58.000Z","dependencies_parsed_at":"2023-10-26T12:35:57.806Z","dependency_job_id":"666ff3ef-9cbf-412e-88f2-e47283fc9c17","html_url":"https://github.com/seart-group/ghs","commit_stats":null,"previous_names":[],"tags_count":45,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seart-group%2Fghs","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/seart-group%2Fghs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seart-group%2Fghs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seart-group%2Fghs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seart-group","download_url":"https://codeload.github.com/seart-group/ghs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247234921,"owners_count":20905854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bootstrap","crawler","csv-export","dataset-generation","docker-compose","git","github","java-17","json-export","mining-software-repositories","msr","mysql","platform","repository","search-engine","spring-boot","spring-boot-application","spring-boot-server","sql-dump","xml-export"],"created_at":"2024-12-07T22:08:21.405Z","updated_at":"2025-04-04T19:07:21.261Z","avatar_url":"https://github.com/seart-group.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GitHub Search \u0026middot; [![Status](https://badgen.net/https/dabico.npkn.net/ghs-status)](http://seart-ghs.si.usi.ch) [![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/seart-group/ghs/blob/master/LICENSE) [![Latest Dump](https://img.shields.io/badge/Latest_Dump-01.08.24-blue)](https://www.dropbox.com/scl/fi/yqgnrtfdasq518wr4tfpl/gse.sql.gz?rlkey=6u1gke9zwjdk26040fslg88vy\u0026st=zm71s900\u0026dl=1) 
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4588464.svg)](https://doi.org/10.5281/zenodo.4588464) \u003c!-- markdownlint-disable-line --\u003e\n\nThis project is made of two components:\n\n1. A Spring Boot powered back-end, responsible for:\n    1. Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;\n    2. Acting as an API for providing access to the stored data.\n2. A Bootstrap-styled and jQuery-powered [web user interface](http://seart-ghs.si.usi.ch), serving as an accessible\n    front for the API.\n\n## Running Locally\n\n### Prerequisites\n\n| Dependency                                   | Version Requirement |\n|----------------------------------------------|--------------------:|\n| Java                                         |                  17 |\n| Maven                                        |                 3.9 |\n| MySQL                                        |                 8.3 |\n| Flyway                                       |               10.13 |\n| [cloc](https://github.com/AlDanial/cloc)[^1] |                2.00 |\n| Git[^1]                                      |                2.43 |\n\n[^1]: Only required in versions prior to 1.7.0\n\n### Database\n\nBefore choosing whether to start with a clean slate or pre-populated database,\nmake sure the following requirements are met:\n\n1. The database timezone is set to `+00:00`. You can verify this via:\n\n    ```sql\n    SELECT @@global.time_zone, @@session.time_zone;\n    ```\n\n2. The event scheduler is turned `ON`. You can verify this via:\n\n    ```sql\n    SELECT @@global.event_scheduler;\n    ```\n\n3. The binary logging during the creation of stored functions is set to `1`. You can verify this via:\n\n    ```sql\n    SELECT @@global.log_bin_trust_function_creators;\n    ```\n\n4. The `gse` database exists. To create it:\n\n    ```sql\n    CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;\n    ```\n\n5. 
The `gseadmin` user exists. To create one, run:\n\n    ```sql\n    CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';\n    GRANT ALL ON gse.* TO 'gseadmin'@'%';\n    ```\n\nIf you prefer to begin with an empty database, there is nothing more for you to do.\nThe required tables will be generated through Flyway migrations during the initial startup of the server.\nHowever, if you would like your local database to be pre-populated with the data we've collected,\nyou can use the compressed SQL dump we offer. We host this dump, along with the four previous iterations,\non [Dropbox](https://www.dropbox.com/scl/fo/lqvp1mhsg0ezp2sgs0xdk/h?rlkey=j9joij3iqpy1zl5h061vdnlj6).\nAfter choosing and downloading a database dump, you can import the data by executing:\n\n```shell\ngzcat \u003c gse.sql.gz | mysql -u gseadmin -pLugano2020 gse\n```\n\n### Server\n\nBefore attempting to run the server, you should generate your own GitHub personal access token (PAT).\nThe crawler relies on the GraphQL API, which is inaccessible without authentication.\nTo access the information provided by the GitHub API, the token must include the `repo` scope.\n\nOnce that is done, you can run the server locally using Maven:\n\n```shell\nmvn spring-boot:run\n```\n\nIf you want to make use of the token when crawling, specify it in the run arguments:\n\n```shell\nmvn spring-boot:run -Dspring-boot.run.arguments=--ghs.github.tokens=\u003cyour_access_token\u003e\n```\n\nAlternatively, you can compile and run the JAR directly:\n\n```shell\nmvn clean package\nln target/ghs-application-*.jar target/ghs-application.jar\njava -Dghs.github.tokens=\u003cyour_access_token\u003e -jar target/ghs-application.jar\n```\n\nHere is a list of project-specific arguments supported by the application that you can find in the `application.properties`:\n\n| Variable Name                        | Type                     | Default Value                                                           | Description      
                                                                                                                                                                                                                                                  |\n|--------------------------------------|--------------------------|-------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `ghs.github.tokens`                  | List\u0026lt;String\u0026gt;       |                                                                         | List of [GitHub personal access tokens (PATs)](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) that will be used for mining the GitHub API. Must not contain blank strings.                   |\n| `ghs.github.api-version`             | String                   | 2022-11-28                                                              | [GitHub API version](https://docs.github.com/en/rest/overview/api-versions) used across various operations.                                                                                                                                                        |\n| `ghs.git.username`                   | String                   |                                                                         | Git account login used to interact with the version control system.                                                                                                                                                                                                
|\n| `ghs.git.password`                   | String                   |                                                                         | Password used to authenticate the specified Git account.                                                                                                                                                                                                           |\n| `ghs.git.config`                     | Map\u0026lt;String,String\u0026gt; | See [application.properties](src/main/resources/application.properties) | Git configurations specific to the application[^2].                                                                                                                                                                                                                |\n| `ghs.git.folder-prefix`              | String                   | ghs-clone-                                                              | Prefix used for the temporary directories into which analyzed repositories are cloned. Must not be blank.                                                                                                                                                          |\n| `ghs.git.ls-remote-timeout-duration` | Duration                 | 1m                                                                      | Maximum time allowed for listing remotes of Git repositories.                                                                                                                                                                                                      |\n| `ghs.git.clone-timeout-duration`     | Duration                 | 5m                                                                      | Maximum time allowed for cloning Git repositories.                                                                                                                                                                                   
                              |\n| `ghs.cloc.max-file-size`             | DataSize                 | 25MB                                                                    | Maximum file size threshold for analysis with `cloc`.                                                                                                                                                                                                              |\n| `ghs.cloc.timeout-duration`          | Duration                 | 5m                                                                      | Maximum time allowed for a `cloc` command to execute.                                                                                                                                                                                                              |\n| `ghs.crawler.enabled`                | Boolean                  | true                                                                    | Specifies if the repository crawling job is enabled.                                                                                                                                                                                                               |\n| `ghs.crawler.minimum-stars`          | int                      | 10                                                                      | Inclusive lower bound for the number of stars a project needs to have in order to be picked up by the crawler. Must not be negative.                                                                                                                               |\n| `ghs.crawler.languages`              | List\u0026lt;String\u0026gt;       | See [application.properties](src/main/resources/application.properties) | List of language names that will be targeted during crawling. Must not contain blank strings. 
To ensure proper operations, the names must match those specified in [linguist](https://github.com/github-linguist/linguist/blob/master/lib/linguist/languages.yml). |\n| `ghs.crawler.start-date`             | Date                     | 2008-01-01T00:00:00Z                                                    | Default crawler start date: the earliest date for repository crawling in the absence of prior crawl jobs. Value format: `yyyy-MM-ddTHH:mm:ssZ`.                                                                                                                    |\n| `ghs.crawler.delay-between-runs`     | Duration                 | PT6H                                                                    | Delay between successive crawler runs, expressed as a duration string.                                                                                                                                                                                             |\n| `ghs.analysis.enabled`               | Boolean                  | true                                                                    | Specifies if the analysis job is enabled.                                                                                                                                                                                                                          |\n| `ghs.analysis.delay-between-runs`    | Duration                 | PT6H                                                                    | Delay between successive analysis runs, expressed as a duration string.                                                                                                                                                                                            
|\n| `ghs.analysis.max-pool-threads`      | int                      | 3                                                                       | Maximum amount of live threads dedicated to concurrently analyzing repositories. Must be positive.                                                                                                                                                                 |\n| `ghs.clean-up.enabled`               | Boolean                  | true                                                                    | Specifies if the job responsible for removing unavailable repositories (clean-up) is enabled.                                                                                                                                                                      |\n| `ghs.clean-up.cron`                  | CronTrigger              | 0 0 0 \\* \\* 1                                                           | Delay between successive repository clean-up runs, expressed as a [Spring CRON expression](https://spring.io/blog/2020/11/10/new-in-spring-5-3-improved-cron-expressions).                                                                                         |\n\n[^2]:\n    We separate the application-level Git configurations from the ones used by the user to avoid any potential\n    conflicts or confusion. As such, an application-specific configuration file is created in the temporary directory on\n    startup. 
Settings added to the file depend on the `ghs.git.config` entries in the `application.properties`.\n    Note that configuration subsections are currently not supported.\n\n### Web UI\n\nThe easiest way to launch the front-end is through the provided NPM script:\n\n```shell\nnpm run dev\n```\n\nYou can also use the built-in web server of your IDE, or any other web server of your choice.\nRegardless of which method you choose for hosting, the back-end CORS restricts you to using ports `3030` and `7030`.\n\n## Dockerisation :whale:\n\nThe deployment stack consists of the following containers:\n\n| Service/Container name |                                  Image                                  | Description                              |      Enabled by Default       |\n|------------------------|:-----------------------------------------------------------------------:|------------------------------------------|:-----------------------------:|\n| `gse-database`         |            [mysql](https://registry.hub.docker.com/_/mysql)             | Platform database                        |      :white_check_mark:       |\n| `gse-migration`        |        [flyway](https://registry.hub.docker.com/r/flyway/flyway)        | Database schema migration executions     |      :white_check_mark:       |\n| `gse-backup`           |  [tiredofit/db-backup](https://hub.docker.com/r/tiredofit/db-backup/)   | Automated database backups               | :negative_squared_cross_mark: |\n| `gse-server`           |              [seart/ghs-server](docker/server/Dockerfile)               | Spring Boot server application           |      :white_check_mark:       |\n| `gse-website`          |             [seart/ghs-website](docker/website/Dockerfile)              | NGINX web server acting as HTML supplier |      :white_check_mark:       |\n| `gse-watchtower`       | [containrrr/watchtower](https://hub.docker.com/r/containrrr/watchtower) | Automatic Docker image updates           | 
:negative_squared_cross_mark: |\n\nThe service dependency chain can be represented as follows:\n\n```mermaid\ngraph RL\n    gse-migration --\u003e |service_healthy| gse-database\n    gse-backup --\u003e |service_completed_successfully| gse-migration\n    gse-server --\u003e |service_completed_successfully| gse-migration\n    gse-website --\u003e |service_healthy| gse-server\n    gse-watchtower --\u003e |service_healthy| gse-website\n```\n\nTo deploy, run the following from the [docker-compose](docker-compose) directory:\n\n```shell\ndocker-compose -f docker-compose.yml up -d\n```\n\nIt is important to note that the database setup steps explained in the preceding section aren't necessary when running\nwith Docker. This is because the environment properties passed to the service will automatically create the MySQL\nuser and database during the initial startup. However, this convenience doesn't extend to the database data, as the\ndefault deployment generates an empty database. If you wish to use existing data from the dumps, you will need to\noverride the `docker-compose` deployment to employ a custom database image that includes the dump. 
To achieve this,\ncreate your `docker-compose.override.yml` file with the following contents:\n\n```yaml\nversion: \"3.9\"\nname: \"gse\"\n\nservices:\n    gse-database:\n        image: seart/ghs-database:latest\n```\n\nThe above image will include the freshest database dump, at most 15 days behind the actual platform data.\nFor a more specific database version, refer to the [Docker Hub page](https://hub.docker.com/r/seart/ghs-database/tags).\nRemember to specify the override file during deployment:\n\n```shell\ndocker-compose -f docker-compose.yml -f docker-compose.override.yml up -d\n```\n\nThe database data itself is kept in the `gse-data` volume, while detailed back-end logs are kept in a local mount called [logs](docker-compose/logs).\nYou can also use this override file to change the configurations of other services.\nFor example, specifying your own PAT for the crawler:\n\n```yaml\nversion: \"3.9\"\nname: \"gse\"\n\nservices:\n    # other services omitted...\n\n    gse-server:\n        environment:\n            GHS_GITHUB_TOKENS: \"A single or comma-separated list of token(s)\"\n            GHS_CRAWLER_ENABLED: \"true\"\n```\n\nAny of the Spring Boot properties or aforementioned application-specific properties can be overridden.\nKeep in mind that a property such as `ghs.x.y` corresponds to the `GHS_X_Y` service environment setting.\n\nAnother example is the automated database backup service, which is disabled by default.\nIf you would like to re-enable it, you would have to add the following to the override file:\n\n```yaml\nversion: \"3.9\"\nname: \"gse\"\n\nservices:\n    # other services omitted...\n\n    gse-backup:\n        restart: always\n        entrypoint: \"/init\"\n```\n\n## FAQ\n\n### How can I request a feature or ask a question?\n\nIf you have ideas for a feature you would like to see implemented or if you have any questions, we encourage you to\ncreate a new [discussion](https://github.com/seart-group/ghs/discussions/). 
By initiating a discussion, you can engage\nwith the community and our team, and we will respond promptly to address your queries or consider your feature requests.\n\n### How can I report a bug?\n\nTo report any issues or bugs you encounter, create a [new issue](https://github.com/seart-group/ghs/issues/).\nProviding detailed information about the problem you're facing will help us understand and address it more effectively.\nRest assured, we're committed to promptly reviewing and responding to the issues you raise, working collaboratively\nto resolve any bugs and improve the overall user experience.\n\n### How do I contribute to the project?\n\nRefer to [CONTRIBUTING.md](CONTRIBUTING.md) for more information.\n\n### How do I extend/modify the existing database schema?\n\nTo do that, you should be familiar with database migration tools and practices.\nThis project uses [Flyway](https://flywaydb.org/) by Redgate.\nThe general rule for schema manipulation is: create new migrations, and _refrain from editing existing ones_.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseart-group%2Fghs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseart-group%2Fghs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseart-group%2Fghs/lists"}