https://github.com/xiaodaigh/data-wrangling-puzzles
- Host: GitHub
- URL: https://github.com/xiaodaigh/data-wrangling-puzzles
- Owner: xiaodaigh
- Created: 2020-08-05T00:30:15.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-09-23T00:29:34.000Z (over 4 years ago)
- Last Synced: 2025-01-21T10:08:26.575Z (4 months ago)
- Size: 29.3 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# data-wrangling-puzzles
## Puzzle 1 - Messed up column names need distribution
[Orig link](https://discourse.julialang.org/t/how-would-i-remove-a-column-from-a-dataframe-by-distributing-its-values-among-existing-columns/44265).
From
```julia
using DataFrames

df = DataFrame(:person=>["bob","phil","nick"],:london=>[1,1,0],:spain=>[1,0,0],Symbol("london,spain")=>[1,1,1])
```
to

```julia
df = DataFrame(:person=>["bob","phil","nick"],:london=>[2,2,1],:spain=>[2,1,1])
```

### Julia solution
```julia
function row_spread(row)
    # running total per country for this row
    dict = Dict{String, Int}()
    for (colname, val) in zip(keys(row), values(row))
        if colname == :person
            continue
        end
        # a combined column such as "london,spain" counts towards every country it names
        for country in split(string(colname), ",")
            dict[country] = get(dict, country, 0) + val
        end
    end
    new_row = hcat(DataFrame(person = row.person), DataFrame(dict))
    new_row
end

new_df = reduce(vcat, row_spread(row) for row in eachrow(df))
```

## Puzzle 2 - Pivot a dataframe to wide format with values in multiple columns
See https://discourse.julialang.org/t/pivot-a-dataframe-to-wide-format-with-values-in-multiple-columns/45916
```
wide = DataFrame(x = 1:12,
                 a = 2:13,
                 b = 3:14,
                 val1 = randn(12),
                 val2 = randn(12),
                 cname = repeat(["c", "d"], inner = 6)
                 )

12×6 DataFrame
│ Row │ x     │ a     │ b     │ val1      │ val2      │ cname  │
│     │ Int64 │ Int64 │ Int64 │ Float64   │ Float64   │ String │
├─────┼───────┼───────┼───────┼───────────┼───────────┼────────┤
│ 1   │ 1     │ 2     │ 3     │ 1.51014   │ -1.18548  │ c      │
│ 2   │ 2     │ 3     │ 4     │ 0.0845411 │ -0.370083 │ c      │
│ 3   │ 3     │ 4     │ 5     │ 0.826283  │ -1.00423  │ c      │
│ 4   │ 4     │ 5     │ 6     │ -0.53175  │ -1.16659  │ c      │
│ 5   │ 5     │ 6     │ 7     │ -1.77975  │ 0.336333  │ c      │
│ 6   │ 6     │ 7     │ 8     │ 0.632577  │ 0.236621  │ c      │
│ 7   │ 7     │ 8     │ 9     │ -0.681532 │ 1.14869   │ d      │
│ 8   │ 8     │ 9     │ 10    │ -0.775619 │ 0.393475  │ d      │
│ 9   │ 9     │ 10    │ 11    │ -0.533034 │ 0.059624  │ d      │
│ 10  │ 10    │ 11    │ 12    │ 0.496152  │ -1.23507  │ d      │
│ 11  │ 11    │ 12    │ 13    │ 0.834099  │ 2.12115   │ d      │
│ 12  │ 12    │ 13    │ 14    │ 0.532357  │ -0.369267 │ d      │
```

I am trying to mimic the `pivot_wider` function in R:
`wide %>% pivot_wider(names_from = cname, values_from = c(val1,val2))`
```
=== === === ========== ========== ========== ==========
x   a   b   val1_c     val1_d     val2_c     val2_d
=== === === ========== ========== ========== ==========
1   2   3   1.0174232  NA         -0.6611959 NA
2   3   4   0.6590795  NA         -2.0954505 NA
3   4   5   1.2939581  NA         1.6350356  NA
4   5   6   -1.9395356 NA         0.7813238  NA
5   6   7   0.3558087  NA         0.9789414  NA
6   7   8   0.9859100  NA         -0.9803336 NA
7   8   9   NA         0.4949224  NA         -0.0659333
8   9   10  NA         0.5024755  NA         -0.2317832
9   10  11  NA         1.6926897  NA         -0.3840687
10  11  12  NA         -0.4324705 NA         -0.0901276
11  12  13  NA         -0.6415260 NA         0.0014151
12  13  14  NA         1.2406868  NA         -2.1959740
=== === === ========== ========== ========== ==========
```

## Puzzle 3 - Keep only certain rows based on data in group
I need to select groups of observations from a large dataframe (about 2.9 million rows) using a number of conditions that apply to different observations in each group (so I cannot select on individual rows only).
Using a small dataframe as a starting point, I wrote an algorithm which applies the conditions and generates the result dataframe with my desired groups within a loop. See the MWE below.
If I apply this loop to the large dataframe, performance becomes a serious problem.
I haven’t used the split/apply/combine approach before, so I am learning about it right now. I worked through the documentation, but I haven’t been able to write code for my problem yet.
For example, I am struggling to understand how to select different rows of a `GroupedDataFrame`. I figured out how to get the age of the status1 row with `select(combine(first, gdf), :status => :obs1_status)`, but not for the status2 row.
Any hints/guidance on how to implement selection conditions on a `GroupedDataFrame` using the `combine`/`select`/`transform` commands (i.e. how to generate `df_result` in a faster way)?
Thanks a lot!
```julia
using DataFrames

# generate sample dataframe
df = DataFrame(id = [1,1,1,2,2,3,4], age = [53,52,17,31,29,22,71], status = [1,2,3,1,2,1,1])

# initialize result dataframe
df_result = copy(df[1:2,:]; copycols=true);

for k = 1:maximum(df.id)
```

Sample Data
```julia
df = DataFrame(id = [1,1,1,2,2,3,4], age = [53,52,17,31,29,22,71], status = [1,2,3,1,2,1,1])
```

```
7×3 DataFrame
│ Row │ id    │ age   │ status │
│     │ Int64 │ Int64 │ Int64  │
├─────┼───────┼───────┼────────┤
│ 1   │ 1     │ 53    │ 1      │
│ 2   │ 1     │ 52    │ 2      │
│ 3   │ 1     │ 17    │ 3      │
│ 4   │ 2     │ 31    │ 1      │
│ 5   │ 2     │ 29    │ 2      │
│ 6   │ 3     │ 22    │ 1      │
│ 7   │ 4     │ 71    │ 1      │
```

Julia Solutions
```julia
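using Pipe, PairAsPipe, DataFramesMeta

df_result = @pipe df |>
    groupby(_, :id) |>
    combine(_,
        # number of rows in the group with status 1 or 2
        @pap(status1and2 = sum(in(1:2), :status)),
        # number of status-1 rows with age below 25 or above 61;
        # the parentheses are needed because .& and .| bind tighter than comparisons
        @pap(not_working_age = sum((:status .== 1) .& ((:age .< 25) .| (:age .> 61))))
    ) |>
    @where(_, :status1and2 .== 2, :not_working_age .== 0) |>
    @select(_, :id) |>
    innerjoin(_, df; on = :id)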
```

## Pivot wider two columns
```
julia> df = DataFrame(t = [:a, :b, :c, :a, :b, :c], x = 1:6, y = 11:16)
6×3 DataFrame
│ Row │ t      │ x     │ y     │
│     │ Symbol │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ a      │ 1     │ 11    │
│ 2   │ b      │ 2     │ 12    │
│ 3   │ c      │ 3     │ 13    │
│ 4   │ a      │ 4     │ 14    │
│ 5   │ b      │ 5     │ 15    │
│ 6   │ c      │ 6     │ 16    │
```

so that it becomes

```
│ Row │ x_a   │ x_b   │ x_c   │ y_a   │ y_b   │ y_c   │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 3     │ 11    │ 12    │ 13    │
│ 2   │ 4     │ 5     │ 6     │ 14    │ 15    │ 16    │
```
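
### Julia solution (sketch)

The README gives no solution for this reshape, so what follows is only a minimal sketch of one possible approach using plain DataFrames.jl: number the occurrences of each `t`, stack both value columns into long form, build the combined column name, then unstack. The helper columns `grp`, `variable`, `value` and `name` are assumptions introduced for illustration; they are not part of the original data.

```julia
using DataFrames

df = DataFrame(t = [:a, :b, :c, :a, :b, :c], x = 1:6, y = 11:16)

# Number the occurrences of each `t` so that every output row has a key.
# `grp` is a hypothetical helper column, not part of the original puzzle.
with_key = transform(groupby(df, :t), :t => (v -> 1:length(v)) => :grp)

# Long form: one row per (t, grp, variable), with the number in `value`.
long = stack(with_key, [:x, :y]; variable_name = :variable, value_name = :value)

# Build the target column names such as "x_a" or "y_c".
long.name = string.(long.variable, "_", long.t)

# Wide form: one column per name, one row per `grp`; then drop the helper key.
wide = unstack(long, :grp, :name, :value)
select!(wide, Not(:grp))
```

The resulting columns allow `missing` (element type `Union{Missing, Int64}`), because `unstack` cannot know in advance that every cell will be filled. The same stack/rename/unstack pattern should also carry over to Puzzle 2, assuming `[:x, :a, :b]` are used as the row keys, `[:val1, :val2]` as the stacked columns, and `cname` as the suffix.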