{"id":23073548,"url":"https://github.com/src-d/jgit-spark-connector","last_synced_at":"2025-08-15T16:31:11.419Z","repository":{"id":57469446,"uuid":"99223053","full_name":"src-d/jgit-spark-connector","owner":"src-d","description":"jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.","archived":false,"fork":false,"pushed_at":"2019-02-13T13:43:52.000Z","size":8966,"stargazers_count":71,"open_issues_count":11,"forks_count":32,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-11-13T14:55:29.694Z","etag":null,"topics":["datasource","git","pyspark","python","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/src-d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-03T10:55:44.000Z","updated_at":"2022-06-02T05:17:29.000Z","dependencies_parsed_at":"2022-09-19T09:10:11.769Z","dependency_job_id":null,"html_url":"https://github.com/src-d/jgit-spark-connector","commit_stats":null,"previous_names":["src-d/spark-api"],"tags_count":53,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fjgit-spark-connector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fjgit-spark-connector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fjgit-spark-connector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/src-d%2Fjgit-spark-connector/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/src-d","download_url":"https://codeload.github.com/src-d/jgit-spark-connector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229880682,"owners_count":18138638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasource","git","pyspark","python","scala","spark"],"created_at":"2024-12-16T08:18:14.608Z","updated_at":"2024-12-16T08:18:15.273Z","avatar_url":"https://github.com/src-d.png","language":"Scala","readme":"# jgit-spark-connector [![Build Status](https://travis-ci.org/src-d/jgit-spark-connector.svg?branch=master)](https://travis-ci.org/src-d/jgit-spark-connector) [![codecov](https://codecov.io/gh/src-d/jgit-spark-connector/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/jgit-spark-connector) [![Maven Central](https://maven-badges.herokuapp.com/maven-central/tech.sourced/jgit-spark-connector/badge.svg)](https://maven-badges.herokuapp.com/maven-central/tech.sourced/jgit-spark-connector)\n\n**jgit-spark-connector** is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.\n\nIt is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and processing of a large number of Git repositories stored in HDFS in the [Siva file format](https://github.com/src-d/go-siva). 
It is accessible both via Scala and Python Spark APIs, and capable of running on large-scale distributed clusters.\n\nThe current implementation combines:\n - [src-d/enry](https://github.com/src-d/enry) to detect the programming language of every file\n - [bblfsh/client-scala](https://github.com/bblfsh/client-scala) to parse every file to a UAST\n - [src-d/siva-java](https://github.com/src-d/siva-java) for reading Siva files on the JVM\n - [apache/spark](https://github.com/apache/spark) to extend the DataFrame API\n - [eclipse/jgit](https://github.com/eclipse/jgit) for working with Git .pack files\n\n# Deprecated\n\njgit-spark-connector has been deprecated in favor of gitbase-spark-connector and there will be no further development of this tool.\n\n# Quick-start\n\nFirst, you need to download [Apache Spark](https://spark.apache.org/) somewhere on your machine:\n\n```bash\n$ cd /tmp \u0026\u0026 wget \"https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download\u0026filename=spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz\" -O spark-2.2.1-bin-hadoop2.7.tgz\n```\n\nThe Apache Software Foundation suggests a suitable mirror from which you can download `Spark`. If you wish to take a look and pick the best option for your case, you can [do it here](https://www.apache.org/dyn/closer.lua/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz).\n\nThen you must extract `Spark` from the downloaded tar file:\n\n```bash\n$ tar -C ~/ -xvzf spark-2.2.1-bin-hadoop2.7.tgz\n```\n\nBinaries and scripts to run `Spark` are located in spark-2.2.1-bin-hadoop2.7/bin, so you should set `PATH` and `SPARK_HOME` to point to this directory. 
It's advised to add this to your shell profile:\n\n```bash\n$ export SPARK_HOME=$HOME/spark-2.2.1-bin-hadoop2.7\n$ export PATH=$PATH:$SPARK_HOME/bin\n```\n\nLook for the latest [**jgit-spark-connector** version](http://search.maven.org/#search%7Cga%7C1%7Ctech.sourced), and then replace `[version]` in the command below:\n\n```bash\n$ spark-shell --packages \"tech.sourced:jgit-spark-connector:[version]\"\n\n# or\n\n$ pyspark --packages \"tech.sourced:jgit-spark-connector:[version]\"\n```\n\nRun the [bblfsh daemon](https://github.com/bblfsh/bblfshd). You can start it easily in a container following its [quick start guide](https://github.com/bblfsh/bblfshd#quick-start).\n\nIf you run **jgit-spark-connector** in a UNIX-like environment, you should set the `LANG` variable properly:\n\n    export LANG=\"en_US.UTF-8\"\n\nThe rationale behind this is that UNIX file systems don't keep the encoding for each file name; they are just plain bytes, so the Java filesystem API looks at the `LANG` environment variable to decide which encoding to apply.\n\nIf the `LANG` variable is not set to a UTF-8 encoding, or is not set at all (which results in encoding being handled with the C locale), you could get an exception during ***jgit-spark-connector*** execution similar to `java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters`.\n\n# Pre-requisites\n\n* Scala 2.11.x\n* [Apache Spark Installation](http://spark.apache.org/docs/2.2.1/) 2.2.x or 2.3.x\n* [bblfsh](https://github.com/bblfsh/bblfshd) \u003e= 2.5.0: used for UAST extraction\n\n## Python pre-requisites\n\n* Python \u003e= 3.4.x (jgit-spark-connector is tested with Python 3.4, 3.5 and 3.6, which are the supported versions, even if it might still work with earlier ones)\n* `libxml2-dev` installed\n* `python3-dev` installed\n* `g++` installed\n\n# Examples of jgit-spark-connector usage\n\n**jgit-spark-connector** is available on [maven 
central](https://search.maven.org/#search%7Cga%7C1%7Ctech.sourced.jgit-spark-connector). To add it to your project as a dependency:\n\nFor projects managed by [maven](https://maven.apache.org/), add the following to your `pom.xml`:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003etech.sourced\u003c/groupId\u003e\n    \u003cartifactId\u003ejgit-spark-connector\u003c/artifactId\u003e\n    \u003cversion\u003e[version]\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nFor [sbt](http://www.scala-sbt.org/)-managed projects, add the dependency:\n\n    libraryDependencies += \"tech.sourced\" % \"jgit-spark-connector\" % \"[version]\"\n\nIn both cases, replace `[version]` with the [latest jgit-spark-connector version](http://search.maven.org/#search%7Cga%7C1%7Ctech.sourced).\n\n### Usage in applications as a dependency\n\nThe default jar published is a fatjar containing all the dependencies required by the jgit-spark-connector. It's meant to be used directly as a jar or through `--packages` for Spark usage.\n\nIf you want to use it in an application and build your own fatjar from it, you need to follow these steps to use what we call the \"slim\" jar:\n\nWith maven:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003etech.sourced\u003c/groupId\u003e\n    \u003cartifactId\u003ejgit-spark-connector\u003c/artifactId\u003e\n    \u003cversion\u003e[version]\u003c/version\u003e\n    \u003cclassifier\u003eslim\u003c/classifier\u003e\n\u003c/dependency\u003e\n```\n\nOr (for sbt):\n\n```scala\nlibraryDependencies += \"tech.sourced\" % \"jgit-spark-connector\" % \"[version]\" % Compile classifier \"slim\"\n```\n\nIf you run into problems with `io.netty.versions.properties` on sbt, you can add the following snippet to your build to solve it:\n\n```scala\nassemblyMergeStrategy in assembly := {\n  case \"META-INF/io.netty.versions.properties\" =\u003e MergeStrategy.last\n  case x =\u003e\n    val oldStrategy = (assemblyMergeStrategy in assembly).value\n    
oldStrategy(x)\n}\n```\n\n## pyspark\n\n### Local mode\n\nInstalling the Python wrappers is necessary to use **jgit-spark-connector** from pyspark:\n\n```bash\n$ pip install sourced-jgit-spark-connector\n```\n\nThen you should provide the **jgit-spark-connector** maven coordinates to the pyspark shell:\n```bash\n$ $SPARK_HOME/bin/pyspark --packages \"tech.sourced:jgit-spark-connector:[version]\"\n```\nReplace `[version]` with the [latest jgit-spark-connector version](http://search.maven.org/#search%7Cga%7C1%7Ctech.sourced).\n\n### Cluster mode\n\nInstall the **jgit-spark-connector** wrappers as in local mode:\n```bash\n$ pip install -e sourced-jgit-spark-connector\n```\n\nThen you should package the Python wrappers and compress them with `zip` to provide them to pyspark; this is required to distribute the code among the nodes of the cluster.\n\n```bash\n$ zip -r ./sourced-jgit-spark-connector.zip \u003cpath-to-installed-package\u003e\n$ $SPARK_HOME/bin/pyspark \u003csame-args-as-local-plus\u003e --py-files ./sourced-jgit-spark-connector.zip\n```\n\n### pyspark API usage\n\nRun pyspark as explained before to start using the jgit-spark-connector, replacing `[version]` with the [latest jgit-spark-connector version](http://search.maven.org/#search%7Cga%7C1%7Ctech.sourced):\n\n```bash\n$ $SPARK_HOME/bin/pyspark --packages \"tech.sourced:jgit-spark-connector:[version]\"\nWelcome to\n\n   spark version 2.2.1\n\nUsing Python version 3.6.2 (default, Jul 20 2017 03:52:27)\nSparkSession available as 'spark'.\n\u003e\u003e\u003e from sourced.engine import Engine\n\u003e\u003e\u003e engine = Engine(spark, '/path/to/siva/files', 'siva')\n\u003e\u003e\u003e engine.repositories.filter('id = \"github.com/mingrammer/funmath.git\"').references.filter(\"name = 'refs/heads/HEAD'\").show()\n+--------------------+---------------+--------------------+\n|       repository_id|           name|                
hash|\n+--------------------+---------------+--------------------+\n|github.com/mingra...|refs/heads/HEAD|290440b64a73f5c7e...|\n+--------------------+---------------+--------------------+\n\n```\n\n## Scala API usage\n\nYou must provide **jgit-spark-connector** as a dependency in the following way, replacing `[version]` with the [latest jgit-spark-connector version](http://search.maven.org/#search%7Cga%7C1%7Ctech.sourced):\n\n```bash\n$ spark-shell --packages \"tech.sourced:jgit-spark-connector:[version]\"\n```\n\nTo start using **jgit-spark-connector** from the shell, you must import everything inside the `tech.sourced.engine` package (or, if you prefer, just import the `Engine` and `EngineDataFrame` classes):\n\n```bash\nscala\u003e import tech.sourced.engine._\nimport tech.sourced.engine._\n```\n\nNow, you need to create an instance of `Engine` and give it the Spark session and the path of the directory containing the siva files:\n\n```bash\nscala\u003e val engine = Engine(spark, \"/path/to/siva-files\", \"siva\")\n```\n\nThen, you will be able to perform queries over the repositories:\n\n```bash\nscala\u003e engine.getRepositories.filter('id === \"github.com/mawag/faq-xiyoulinux\").\n     | getReferences.filter('name === \"refs/heads/HEAD\").\n     | getAllReferenceCommits.filter('message.contains(\"Initial\")).\n     | select('repository_id, 'hash, 'message).\n     | show\n\n     +--------------------+--------------------+--------------+\n     |       repository_id|                hash|       message|\n     +--------------------+--------------------+--------------+\n     |github.com/mawag/...|fff7062de8474d10a...|Initial commit|\n     +--------------------+--------------------+--------------+\n\n```\n\n## Supported repository formats\n\nAs you might have seen, you need to provide the repository format you will be reading when you create the `Engine` 
instance. Although the documentation always uses the `siva` format, there are more repository formats available.\n\nThese are all the supported formats at the moment:\n\n- `siva`: rooted repositories packed in a single `.siva` file.\n- `standard`: regular git repositories with a `.git` folder. Each in a folder of their own under the given repository path.\n- `bare`: git bare repositories. Each in a folder of their own under the given repository path.\n\n### Processing local repositories with the jgit-spark-connector\n\nThere are some design decisions that may surprise the user when processing local repositories instead of siva files. These are the things you should take into account when doing so:\n\n- All local branches will belong to a repository whose id is `file://$REPOSITORY_PATH`. So, if you clone `https://github.com/foo/bar.git` at `/home/foo/bar`, you will see two repositories `file:///home/foo/bar` and `github.com/foo/bar`, even if you only have one.\n- Remote branches are transformed from `refs/remote/$REMOTE_NAME/$BRANCH_NAME` to `refs/heads/$BRANCH_NAME`, as they will only belong to the repository id of their corresponding remote. So `refs/remote/origin/HEAD` becomes `refs/heads/HEAD`.\n\n# Playing around with **jgit-spark-connector** on Jupyter\n\nYou can launch our docker container, which contains some example notebooks, just by running:\n\n    docker run --name jgit-spark-connector-jupyter --rm -it -p 8080:8080 -v $(pwd)/path/to/siva-files:/repositories --link bblfshd:bblfshd srcd/jgit-spark-connector-jupyter\n\nYou must have some siva files locally to mount them on the container, replacing the path `$(pwd)/path/to/siva-files`. 
You can get some siva-files from the project [here](https://github.com/src-d/jgit-spark-connector/tree/master/_examples/siva-files).\n\nYou should have a [bblfsh daemon](https://github.com/bblfsh/bblfshd) container running to link to the jupyter container (see Pre-requisites).\n\nWhen the `jgit-spark-connector-jupyter` container starts, it will show you a URL that you can open in your browser.\n\n# Using jgit-spark-connector directly from Python\n\nIf you are using the jgit-spark-connector directly from Python and are unable to modify `PYSPARK_SUBMIT_ARGS`, you can copy the jgit-spark-connector jar into the pyspark jars directory to make it available there.\n\n```bash\ncp jgit-spark-connector.jar \"$(python -c 'import pyspark; print(pyspark.__path__[0])')/jars\"\n```\n\nThen, you can use it as follows:\n\n```python\nimport sys\n\npyspark_path = \"/path/to/pyspark/python\"\nsys.path.append(pyspark_path)\n\nfrom pyspark.sql import SparkSession\nfrom sourced.engine import Engine\n\nsiva_folder = \"/path/to/siva-files\"\nspark = SparkSession.builder.appName(\"test\").master(\"local[*]\").getOrCreate()\nengine = Engine(spark, siva_folder, 'siva')\n```\n\n# Development\n\n## Build fatjar\n\nBuilding the fatjar is needed to build the docker image that contains the jupyter server, or to test changes in spark-shell by just passing the jar with the `--jars` flag:\n\n```bash\n$ make build\n```\n\nIt leaves the fatjar in `target/scala-2.11/jgit-spark-connector-uber.jar`.\n\n## Build and run docker to get a Jupyter server\n\nTo build an image with the latest build of the project:\n\n```bash\n$ make docker-build\n```\n\nNotebooks under the examples folder will be included in the image.\n\nTo run a container with the Jupyter server:\n\n```bash\n$ make docker-run\n```\n\nBefore running the jupyter container you must run a [bblfsh daemon](https://github.com/bblfsh/bblfshd):\n\n```bash\n$ make docker-bblfsh\n```\n\nIf it's the first time you run the [bblfsh daemon](https://github.com/bblfsh/bblfshd), you 
must install the drivers:\n\n```bash\n$ make docker-bblfsh-install-drivers\n```\n\nTo see installed drivers:\n\n```bash\n$ make docker-bblfsh-list-drivers\n```\n\nTo remove the generated development jupyter image:\n\n```bash\n$ make docker-clean\n```\n\n## Run tests\n\n**jgit-spark-connector** uses [bblfsh](https://github.com/bblfsh), so you need an instance of a bblfsh server running:\n\n```bash\n$ make docker-bblfsh\n```\n\nTo run tests:\n```bash\n$ make test\n```\n\nTo run tests for the Python wrapper:\n\n```bash\n$ cd python\n$ make test\n```\n\n### Windows support\n\nThere is no Windows support in enry-java or bblfsh's client-scala right now, so the language detection and UAST features are not available on the Windows platform.\n\n# Code of Conduct\n\nSee [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md)\n\n# License\n\nApache License Version 2.0, see [LICENSE](LICENSE)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fjgit-spark-connector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrc-d%2Fjgit-spark-connector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrc-d%2Fjgit-spark-connector/lists"}