{"id":13527742,"url":"https://github.com/henridf/apache-spark-node","last_synced_at":"2025-04-01T09:32:02.466Z","repository":{"id":57181210,"uuid":"45722869","full_name":"henridf/apache-spark-node","owner":"henridf","description":"Node.js bindings for Apache Spark DataFrame APIs","archived":false,"fork":false,"pushed_at":"2017-09-10T15:44:05.000Z","size":794,"stargazers_count":143,"open_issues_count":8,"forks_count":14,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-03-27T06:11:25.547Z","etag":null,"topics":["data-frame","node","spark"],"latest_commit_sha":null,"homepage":"https://henridf.github.io/apache-spark-node","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/henridf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-07T04:44:31.000Z","updated_at":"2024-02-28T02:24:22.000Z","dependencies_parsed_at":"2022-09-19T17:45:14.726Z","dependency_job_id":null,"html_url":"https://github.com/henridf/apache-spark-node","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/henridf%2Fapache-spark-node","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/henridf%2Fapache-spark-node/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/henridf%2Fapache-spark-node/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/henridf%2Fapache-spark-node/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/henridf","download_url":"https://codeload.github.com/henridf/a
pache-spark-node/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246616395,"owners_count":20806129,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-frame","node","spark"],"created_at":"2024-08-01T06:01:59.299Z","updated_at":"2025-04-01T09:32:02.004Z","avatar_url":"https://github.com/henridf.png","language":"JavaScript","funding_links":[],"categories":["JavaScript"],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/henridf/apache-spark-node.svg?branch=master)](https://travis-ci.org/henridf/apache-spark-node)\n\n\n__This repository is no longer under development. Everything here should continue to work (with appropriate Spark and node versions), but the repo will not be further developed or maintained. I will still try to review any PRs. If anyone has any interest in taking this over, please contact [@henridf](https://twitter.com/henridf).__\n\nApache Spark \u003c=\u003e Node.js\n========================\n\nNode.js bindings for Apache Spark DataFrame APIs.\n\n**Table of Contents**\n\n- [API Docs](#api-docs)\n- [Status](#status)\n- [Getting Started](#getting-started)\n- [Usage](#usage)\n- [Examples](#examples)\n- [Misc notes](#misc-notes)\n\nAPI Docs\n--------\nAPI documentation is [here](https://henridf.github.io/apache-spark-node).\n\n\nStatus\n------\n\nThis project is already usable in its present form, but it is still in early\nstages and under development. 
APIs may change.\n\nNotably not yet implemented are:\n\n- support for user-defined functions\n- jvm-side helpers for functions/methods which cannot currently be called\n  from node (for example because they take parameter types like `Seq`)\n\n\n\nGetting started\n---------------\n\n### Requirements\n\n- Linux or OS X (on Windows, there are currently problems building node add-ons)\n- Node.js, version 4+\n- Java 8\n- Spark \u003e= 1.5. You'll need the Spark Assembly jar, which\n  contains all of the Spark classes. If you don't have an existing\n  installation, the easiest option is to get the binaries from the\n  [Spark downloads page](http://spark.apache.org/downloads.html) (choose\n  \"pre-built for Hadoop 2.6 and later\"). Or you can download the Spark sources\n  and build it yourself. More information\n  [here](http://spark.apache.org/docs/latest/building-spark.html).\n\n\n### Installing\n\n#### From NPM\n\n    $ npm install apache-spark-node\n\n#### From source\n\nClone the git repo, then:\n\n    $ npm install\n    $ npm run compile\n\n### Running\n\nSet ASSEMBLY_JAR to the location of your assembly JAR and run `spark-node` from the directory where you issued `npm install apache-spark-node`:\n```shell\nASSEMBLY_JAR=/path/to/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar node_modules/apache-spark-node/bin/spark-node\n```\n\n### Docker\n\nIf you want to play with spark-node but don't want to download the\ndependencies or build, you can run it in Docker.\n\n    $ docker run -it henridf/spark-node\n\nThis will take you to the normal `spark-node` shell. Optionally, you can map\nhost volumes to use files on your host system with `spark-node`. For example\n\n    $ docker run -v /var/data:/data -it henridf/spark-node\n\nwill map the host's `/var/data` directory to `/data` within the Docker container. 
This means that you can use\n\n    $ var df = sqlContext.read().jsonSync(\"/data/people.json\")\n\nto load a file at `/var/data/people.json` on the host system.\n\n\n\nUsage\n-----\n\n(Note: This section is a quick overview of the available APIs in spark-node;\nit is not a general introduction to Spark or to DataFrames.)\n\nStart the spark-node shell (this assumes you've set `ASSEMBLY_JAR` as\nan environment variable):\n\n    $ ./bin/spark-node\n\nA `sqlContext` global object is available in the shell. Its functions are used to create\nDataFrames, register DataFrames as tables, execute SQL over tables, cache\ntables, and read parquet files.\n\nTo see available command-line options, do `./bin/spark-node --help`.\n\n\n### Creating a DataFrame\n\nLoad a dataframe from a JSON file:\n\n    $ var df = sqlContext.read().jsonSync(\"./data/people.json\")\n\nLoad a dataframe from a list of JavaScript objects:\n\n    $ var df = sqlContext.createDataFrame([{\"name\":\"Michael\"}, {\"name\":\"Andy\", \"age\":30}, {\"name\":\"Justin\", \"age\": 19}])\n\nPretty-print dataframe contents to stdout:\n\n    $ df.show()\n\n```\n+----+-------+\n| age|   name|\n+----+-------+\n|null|Michael|\n|  30|   Andy|\n|  19| Justin|\n+----+-------+\n```\n\n### DataFrame Operations\n\nPrint the dataframe's schema in a tree format:\n\n    $ df.printSchema()\n\nSelect only the \"name\" column:\n\n    $ df.select(df.col(\"name\")).show()\n\nor the shorter (equivalent) version:\n\n    $ df.select(\"name\").show()\n\nCollect the result (as an array of rows) and assign it to a JavaScript\nvariable:\n\n    $ var res = df.select(\"name\").collectSync()\n\nSelect everybody and increment age by 1:\n\n    $ df.select(df.col(\"name\"), df.col(\"age\").plus(1)).show()\n\nSelect people older than 21:\n\n    $ df.filter(df.col(\"age\").gt(21)).show()\n\nCount people by age:\n\n    $ df.groupBy(\"age\").count().show()\n\n\n### DataFrame Functions\n\nA `sqlFunctions` global object is available in the shell. 
It contains a\nvariety of built-in functions for operating on dataframes.\n\nFor example, to find the minimum and average of \"age\" across all rows:\n\n    $ var F = sqlFunctions;\n\n    $ df.agg(F.min(df.col(\"age\")), F.avg(df.col(\"age\"))).show()\n\n\n### Running SQL Queries Programmatically\n\nRegister `df` as a table named `people`:\n\n    $ df.registerTempTable(\"people\")\n\nRun a SQL query:\n\n    $ var teens = sqlContext.sql(\"SELECT name FROM people WHERE age \u003e= 13 AND age \u003c= 19\")\n    $ teens.show()\n\n\nExamples\n--------\n\n#### Word count (aka 'big data hello world')\n\nCreate a dataframe from a text file:\n\n    $ var lines = sqlContext.read().textSync(\"data/words.txt\");\n\n(_Note: support for the \"text\" format was added in Spark 1.6_).\n\n\nSplit strings into arrays:\n\n    $ var F = sqlFunctions;\n    $ var splits = lines.select(F.split(lines.col(\"value\"), \" \").as(\"words\"));\n\nExplode the arrays into individual rows:\n\n    $ var occurrences = splits.select(F.explode(splits.col(\"words\")).as(\"word\"));\n\nWe now have a dataframe with one row per word occurrence. So we group\nand count occurrences of the same word and we're done:\n\n    $ var counts = occurrences.groupBy(\"word\").count()\n\n    $ counts.where(\"count\u003e10\").sort(counts.col(\"count\")).show()\n\n\n\n#### Running `spark-node` against a standalone cluster\n\nWhen you run `bin/spark-node` without passing a `--master` argument, the\nspark-node process runs a Spark worker in the same process. To run the\nspark-node shell against a cluster, use the `--master` argument. Here's an\nexample.\n\nOn a worker node, do the following:\n\n    $ cd path/to/spark/distribution\n    $ ./sbin/start-master.sh\n\nNavigate to `http://hostname:8080` and get the Spark URL (top line), which\nwill be something like `spark://worker_hostname:7077`. 
Then start any number of\nslaves on your cluster hosts by running `./sbin/start-slave.sh --master \u003cspark_url\u003e`.\n\n\nThen on your client machine:\n\n    $ cd path/to/apache-spark-node\n    $ ./bin/spark-node --master \u003cspark_url\u003e\n\n\nIf you return to the master Web UI (`http://hostname:8080`), you should now\nsee an application with the name \"spark-node shell\" under \"Running\napplications\". Following that link gets you to the Web UI of the spark-node\nshell itself.\n\n\n\nMisc notes\n----------\n\nThis was done under the self-imposed constraint of not making modifications to\nthe Spark sources. This results in hacks like the `NodeSparkSubmit` Scala\nclass, which are workarounds for the fact that we can't add explicit awareness\nof this shell to SparkSubmit.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhenridf%2Fapache-spark-node","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhenridf%2Fapache-spark-node","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhenridf%2Fapache-spark-node/lists"}