# Commonly Used Pyspark Commands
The following is a list of commonly used [Pyspark](http://spark.apache.org/docs/latest/api/python/index.html) commands that I have found to be useful.
DISCLAIMER: These are not the only ways to use these commands; they are simply the ones I use often and have found useful.
## Table of Contents
* [`Viewing`](#viewing)
* [`Read and Write`](#read-and-write)
* [`Select`](#select)
* [`Apply Anonymous Function`](#apply-anonymous-function)
* [`Zip`](#zip)
* [`Count Distinct`](#count-distinct)
* [`Joins`](#joins)
* [`Groups and Aggregate`](#groups-and-aggregate)
* [`Working with datetime`](#working-with-datetime)
* [`Viewing Data`](#viewing-data)
* [`Machine Learning`](#machine-learning)

### Viewing
#### Viewing the first row of a DataFrame
```python
df.head()
```

#### Viewing the first 10 rows of a DataFrame
```python
df.head(10)
```

#### Viewing the total number of rows of a DataFrame
```python
df.count()
```

### Read and Write
#### Read CSV to DataFrame
```python
df = sqlContext.read.load("data/file.csv",
                          format="com.databricks.spark.csv",
                          header="true", inferSchema="true",
                          delimiter=',')
```
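The `com.databricks.spark.csv` format above targets older Spark versions; on Spark 2.0+ the CSV reader is built in. A minimal sketch of the equivalent read, assuming a `SparkSession` named `spark`:
```python
# Built-in CSV reader (Spark 2.0+); options mirror the ones above
df = spark.read.csv("data/file.csv", header=True, inferSchema=True, sep=",")
```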
#### Manually add columns to DataFrame
```python
df = df.toDF(*["col1", "col2"])
```

#### Write DataFrame to a single CSV file
```python
df.toPandas().to_csv("df.csv", index=False)
# If it says pandas not found, just pip install pandas
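# Note: toPandas() collects the entire DataFrame to the driver,
# so this only works when the data fits in driver memory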
```

#### Write to CSV as multiple files in a folder
```python
df.write.csv("df_folder")

# Or, to specify the number of output files:
no_of_files = 256  # setting this to 1 can cause memory errors, since all data ends up in one partition
df.coalesce(no_of_files).write.format("com.databricks.spark.csv").save("df_folder")
```

### Reading from JSON
#### Reading from JSON into a DataFrame
```python
df = spark.read.json('json_file.json')
```
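By default `spark.read.json` expects one JSON object per line (JSON Lines). For a pretty-printed file that spreads one document over many lines, there is a `multiLine` option (Spark 2.2+); a minimal sketch:
```python
# The file contains a single JSON document spread over multiple lines
df = spark.read.json('json_file.json', multiLine=True)
```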
#### Reading many JSON files into a DataFrame
```python
df = spark.read.json('json_folder/*.json')
```

### Select
#### Selecting one column from a DataFrame
```python
df_new = df.select("col1")
```

#### Selecting multiple columns from a DataFrame
```python
df_new = df.select("col1", "col2")
df_new = df.select(["col1", "col2"])
```
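`select` also accepts column expressions, which is handy for renaming or deriving columns in the same step. A small sketch (the derived column here is just an illustration):
```python
from pyspark.sql.functions import col

df_new = df.select(col("col1").alias("c1"), (col("col2") * 2).alias("col2_doubled"))
```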
### Apply Anonymous Function
#### Creating an anonymous function and using it to transform a column of data
```python
from pyspark.sql.functions import udf
df = df.withColumn("new_col", udf(lambda x: x + 2)("col1"))

# A more explicit example
from pyspark.sql.functions import udf
add_two_udf = udf(lambda x: x+2)
df = df.withColumn("new_col", add_two_udf("col1"))  # df keeps its original columns, plus new_col
```
#### Creating an anonymous function and making it return an integer
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = df.withColumn("new_col", udf(lambda x: x + 2, IntegerType())("col1"))
df = df.withColumn("new_col", udf(lambda x: x + 2)("col1").cast(IntegerType()))
```
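As an aside, for simple arithmetic like this a native column expression does the same thing without the Python UDF overhead; a minimal sketch of the same `+ 2` transform:
```python
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn("new_col", (col("col1") + 2).cast(IntegerType()))
```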
#### Have a global variable and reference it in an anonymous function
```python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

city_to_num = {"Toronto": 1, "Montreal": 2, "Vancouver": 3}
global_city_to_num = sc.broadcast(city_to_num)
city_to_num_udf = udf(lambda x: global_city_to_num.value[x], IntegerType())
df = df.withColumn("city_id", city_to_num_udf("city_name"))
```
### Zip
#### Get unique values from a column and zip each with a unique number, returning a dict
```python
# The result looks like {"Toronto": 0, "Montreal": 1, "Vancouver": 2, ...} (zipWithIndex starts at 0)
city_to_num = dict(df.select("city_name").rdd.distinct().map(lambda x: x[0]).zipWithIndex().collect())
```
### Count Distinct
#### Number of distinct/unique values from a column
```python
# CPU times: user 0 ns, sys: 4 ms, total: 4 ms
#Wall time: 426 ms
df.select("col1").distinct().count()# CPU times: user 16 ms, sys: 0 ns, total: 16 ms
# Wall time: 200 ms
df.select("col1").rdd.distinct().count()```
### Joins
#### Joining two DataFrames on different ids
```python
df_new = df1.join(df2, df1["some_id"] == df2["other_id"], "inner")
```

#### Joining two DataFrames on the same id, without repeating that id column
```python
df_new = df1.join(df2, "id", "inner")
```

#### Join on multiple conditions
```python
df_new = df1.join(
    df2,
    (df2["col1"] == df1["col2"]) & (df2["col3"] == df1["col4"]),
    "inner"
)
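# The last argument is the join type; besides "inner", Spark also accepts
# values like "left", "right", "outer", and "left_anti"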
```

### Groups and Aggregate
#### Group by an id, and sum up values within each group
```python
from pyspark.sql.functions import sum

df_new = df.groupBy("store_id").agg(sum("no_of_customers").alias("total_no_of_customers"))
```

#### Group by multiple columns, and compute multiple aggregates
```python
from pyspark.sql.functions import sum, avg

df_new = df.groupBy([
    "store_id",
    "date"
]).agg(
    sum("no_of_customers").alias("total_no_of_customers"),
    avg("no_of_customers").alias("avg_no_of_customers")
)
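# As an aside, agg() also accepts a dict, e.g. .agg({"no_of_customers": "sum"}),
# though the resulting column is then named "sum(no_of_customers)"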
```

#### Group by with custom sum aggregate
```python
from pyspark.sql.functions import udf, sum
from pyspark.sql.types import BooleanType

# weekday(): Monday=0 ... Saturday=5, Sunday=6
df_new = df.groupBy("store_id").agg(
    sum(udf(lambda t: t.weekday() >= 5, BooleanType())("visit_date").cast("int")).alias("no_of_visits_weekend")
)

# A more explicit version
is_weekend = udf(lambda t: t.weekday() >= 5, BooleanType())
df_new = df.groupBy("store_id").agg(
sum(
is_weekend("visit_date").cast("int")
).alias("no_of_visits_weekend")
)
```

### Working with datetime
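The examples below all use Python UDFs, which is the pattern I reach for. As an aside, `pyspark.sql.functions` also has built-ins such as `month`, `dayofmonth`, and (from Spark 2.3) `dayofweek` that cover some of these cases without UDF overhead; note that `dayofweek` numbers days 1 (Sunday) through 7 (Saturday), unlike Python's `weekday()`. A minimal sketch, assuming a `date` column:
```python
from pyspark.sql.functions import month, dayofmonth

df = df.withColumn("month", month("date")).withColumn("day", dayofmonth("date"))
```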
#### Get day of the week
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

get_dotw = udf(lambda x: x.weekday(), IntegerType())
df = df.withColumn("dotw", get_dotw("date"))
```

#### Get 1 if it is a weekend, 0 if it is not
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# weekday(): Saturday=5, Sunday=6
get_weekend = udf(lambda x: 1 if x.weekday() >= 5 else 0, IntegerType())
df = df.withColumn("is_weekend", get_weekend("date"))
```

#### Get day of the month
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

get_day = udf(lambda x: x.day, IntegerType())
df = df.withColumn("day", get_day("date"))
```

#### Get Month
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

get_month = udf(lambda x: x.month, IntegerType())
df = df.withColumn("month", get_month("date"))
```

#### Convert String to Timestamp/datetime
If the string is of the form `yyyy-MM-dd`, this creates a new column with the same data but in `timestamp` format. When you `.take` the value, it comes back as a `datetime.datetime`, which is useful for manipulation.

```python
from pyspark.sql.functions import col

df = df.select("*", col("time_string").cast("timestamp").alias("time_datetime"))
```

#### Subtract a constant datetime from a datetime column
The following example computes the number of days since New Year's Day of that year.

```python
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

days_since_newyear_udf = udf(lambda x: (x - datetime.strptime(str(x.year) + "-01-01", "%Y-%m-%d")).days, IntegerType())
df = df.withColumn("days_since_newyear", days_since_newyear_udf("calendar_date"))
```

#### Find time difference
```python
from pyspark.sql.functions import unix_timestamp

df = df.withColumn("timestamp_diff_in_s", unix_timestamp("max_timestamp") - unix_timestamp("min_timestamp"))
```

### Viewing Data
#### Taking a peek at 1 row of data in list form
I use this a lot, similar to `df.head()` in pandas.

```python
# CPU times: user 28 ms, sys: 4 ms, total: 32 ms
# Wall time: 4.18 s
df.take(1)

# CPU times: user 8 ms, sys: 28 ms, total: 36 ms
# Wall time: 4.19 s
df.head(1)  # defaults to 1 if not specified
```

#### Looking at the entire set in list form
**WARNING**: This takes a while if your dataset is large. I don't usually use this.

```python
df.collect()
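# Safer options when the data may be large: df.show(20) prints a sample,
# and df.limit(100).collect() caps how many rows reach the driver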
```

### Machine Learning
#### Transforming for Training
I'm just using `GBTRegressor` as an example.
```python
# Transforming
from pyspark.ml.linalg import DenseVector

df_train = df.select("output", "input1", "input2", ...)
df_train = df_train.rdd.map(lambda x: (x[0], DenseVector(x[1:])))
df_train = spark.createDataFrame(df_train, ["label", "features"])

# Training
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(maxIter=10)
model = gbt.fit(df_train)
```
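Once fitted, the model returned by `gbt.fit` can score new data with `transform`. A minimal sketch, where `df_test` is a hypothetical DataFrame prepared the same way as `df_train`:
```python
# model is the GBTRegressionModel returned by gbt.fit(df_train) above;
# df_test is assumed to have the same "features" column layout
predictions = model.transform(df_test)
predictions.select("label", "prediction").show(5)
```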