https://github.com/pdwaggoner/python-to-tidy-R

A Running List of Key Python Operations Translated to Tidy R
Host: GitHub
URL: https://github.com/pdwaggoner/python-to-tidy-R
Owner: pdwaggoner
Created: 2023-09-11T15:53:03.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-05-08T15:17:26.000Z (over 1 year ago)
Last Synced: 2024-10-15T09:33:48.577Z (about 1 year ago)
Size: 154 KB
Stars: 9
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# From Python to Tidy R (and Back)

**A Running List of Key Python Operations Translated to (Mostly) Tidy R**


I frequently write code in both Python and R, and my team relies heavily on [Tidyverse](https://www.tidyverse.org/) syntax. As a result, I am often translating key Python operations (pandas, matplotlib, etc.) to tidy R (dplyr, ggplot2, etc.). I created this repo to ease that translation and to crowdsource a running directory of these translations.

This is just a start. **Please feel free to share and also directly contribute or revise via pulls or issues**.

*Note:* I recommend using the native pipe operator (`|>`) when constructing piped operations in practice, instead of the `magrittr` pipe (`%>%`). However, I used the latter in this repo because the `|` in the native R pipe threw off formatting of the markdown tables. 

## Table of Contents

- [Key tasks](#Key-tasks)

- [Joining Data](#Joining-Data)

- [Iteration](#Iteration)

- [Iteration Over Lists](#Iteration-Over-Lists)

- [String Operations](#String-Operations)

- [Modeling and Machine Learning](#Modeling-and-Machine-Learning)

- [Network Modeling and Dynamics](#Network-Modeling-and-Dynamics)

- [Parallel Computing](https://github.com/pdwaggoner/python-to-tidy-R/blob/main/Parallel%20Computing.md)

----

## Key tasks

| Task / Operation         | Python (Pandas)                       | Tidyverse (dplyr, tidyr, ggplot2)  |
|-------------------------|--------------------------------------|-----------------------------------|
| **Data Loading**        | `import pandas as pd`                | `library(readr)`                  |
|                         | `df = pd.read_csv('file.csv')`       | `data <- read_csv('file.csv')`    |
| **Select Columns**      | `df[['col1', 'col2']]`              | `data %>% select(col1, col2)`    |
| **Filter Rows**         | `df[df['col'] > 5]`                 | `data %>% filter(col > 5)`        |
| **Arrange Rows**        | `df.sort_values(by='col')`           | `data %>% arrange(col)`           |
| **Mutate (Add Columns)**| `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = col1 + col2)` |
| **Group and Summarize** | `df.groupby('col').agg({'col2': 'mean'})` | `data %>% group_by(col) %>% summarize(mean_col2 = mean(col2))` |
| **Pivot/Wide to Long**  | `pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')` | `data %>% pivot_longer(-id, names_to = 'variable', values_to = 'value')` |
| **Long to Wide/Pivot**  | `df.pivot(index='id', columns='variable', values='value')` | `data %>% pivot_wider(names_from = variable, values_from = value)` |
| **Data Visualization**  | Matplotlib, Seaborn, Plotly, etc.   | ggplot2                           |
|                         | `import matplotlib.pyplot as plt`   | `library(ggplot2)`                 |
|                         | `plt.scatter(df['x'], df['y'])`    | `ggplot(data, aes(x=x, y=y)) + geom_point()` |
| **Data Reshaping**      | `pd.concat([df1, df2], axis=0)`     | `bind_rows(df1, df2)`             |
|                         | `pd.concat([df1, df2], axis=1)`     | `bind_cols(df1, df2)`             |
| **String Manipulation** | `df['col'].str.replace('a', 'b')`   | `data %>% mutate(col = str_replace_all(col, 'a', 'b'))` |
| **Date and Time**      | `pd.to_datetime(df['date_col'])`    | `data %>% mutate(date_col = as.Date(date_col))` |
| **Missing Data Handling**| `df.dropna()`                        | `data %>% drop_na()`              |
| **Rename Columns**      | `df.rename(columns={'old_col': 'new_col'})` | `data %>% rename(new_col = old_col)` |
| **Summary Statistics**  | `df.describe()`                      | `data %>% summary()` or `data %>% glimpse()`              |
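
To make the table a bit more concrete, here is a minimal sketch of how several of the pandas operations chain together in the same order a dplyr pipeline would. The toy data and the column names (`id`, `group`, `x`, `y`) are made up for illustration and are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "a", "b", "b"],
    "x": [1.0, 6.0, 3.0, 8.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})

# filter -> mutate -> group_by -> summarize, written as one method chain
summary = (
    df[df["x"] > 2]                             # filter(x > 2)
    .assign(total=lambda d: d["x"] + d["y"])    # mutate(total = x + y)
    .groupby("group", as_index=False)           # group_by(group)
    .agg(mean_total=("total", "mean"))          # summarize(mean_total = mean(total))
)

# Wide to long and back again (the tidyr side would use pivot_longer / pivot_wider)
long = pd.melt(df, id_vars=["id", "group"], var_name="variable", value_name="value")
wide = long.pivot(index="id", columns="variable", values="value").reset_index()

print(summary)
print(wide)
```

The parenthesized method chain is roughly the pandas counterpart of a `%>%` pipeline: each step returns a new data frame rather than modifying `df` in place.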

## Joining Data

This is the only table that includes SQL, since most of the `dplyr` join verbs are patterned on and named after SQL operations.

| Join Type       | SQL                                      | Python (Pandas)                         | R (dplyr)                              |
|-----------------|------------------------------------------|----------------------------------------|----------------------------------------|
| **Inner Join**  | `INNER JOIN`                             | `pd.merge(df1, df2, on='key')`         | `inner_join(df1, df2, by='key')`       |
| **Left Join**   | `LEFT JOIN`                              | `pd.merge(df1, df2, on='key', how='left')` | `left_join(df1, df2, by='key')`        |
| **Right Join**  | `RIGHT JOIN`                             | `pd.merge(df1, df2, on='key', how='right')` | `right_join(df1, df2, by='key')`       |
| **Full Outer Join** | `FULL OUTER JOIN`                      | `pd.merge(df1, df2, on='key', how='outer')` | `full_join(df1, df2, by='key')`         |
| **Cross Join**  | `CROSS JOIN`                             | `pd.merge(df1, df2, how='cross')`       | `cross_join(df1, df2)` (dplyr 1.1.0 and later) |
| **Anti Join**   | No dedicated keyword; use `WHERE NOT EXISTS (...)` | `df1[~df1['key'].isin(df2['key'])]` | `anti_join(df1, df2, by='key')` |
| **Semi Join**   | No dedicated keyword; use `WHERE EXISTS (...)` | `df1[df1['key'].isin(df2['key'])]` | `semi_join(df1, df2, by='key')` |
| **Self Join**   | `INNER JOIN` with the same table         | `pd.merge(df, df, on='key')`            | `inner_join(df, df, by='key')`          |
| **Multiple Key Join** | `INNER JOIN` with multiple keys     | `pd.merge(df1, df2, on=['key1', 'key2'])` | `inner_join(df1, df2, by=c('key1', 'key2'))` |
| **Join with Renamed Columns** | `INNER JOIN` with renamed columns | `pd.merge(df1.rename(columns={'col1': 'key'}), df2, on='key')` | `inner_join(rename(df1, key = col1), df2, by = 'key')` |
| **Join with Complex Condition** | `INNER JOIN` with complex conditions | Not directly supported, but can be achieved by merging and then filtering: `pd.merge(df1, df2, on='key').query('col1 > 10 and col2 == col3')` | Not directly supported in the same call, but can be achieved with `filter()`: `inner_join(df1, df2, by = 'key') %>% filter(col1 > 10, col2 == col3)` |
| **Join with Different Key Names** | `INNER JOIN` with different key names | `pd.merge(df1, df2, left_on='key1', right_on='key2')` | `inner_join(df1, df2, by = c('key1' = 'key2'))` |
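
Semi and anti joins are the rows most likely to trip people up, because pandas has no `how='semi'` or `how='anti'` option. Below is a small sketch of one common way to get filtering-join behavior with `isin`. The data frames and their contents are hypothetical, and the sketch assumes a single key column.

```python
import pandas as pd

# Hypothetical tables; note that df2 has a duplicated key (2).
df1 = pd.DataFrame({"key": [1, 2, 3, 4], "val": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"key": [2, 2, 4, 5], "other": [10, 20, 30, 40]})

in_df2 = df1["key"].isin(df2["key"])

semi = df1[in_df2]    # like semi_join(df1, df2, by = "key"): keep matching rows of df1
anti = df1[~in_df2]   # like anti_join(df1, df2, by = "key"): keep non-matching rows of df1

# For comparison: an inner join duplicates rows and pulls in df2's columns.
inner = pd.merge(df1, df2, on="key")

print(semi)
print(anti)
print(inner)
```

The comparison with `inner` shows why a filtering join is not the same as an inner merge: the semi join returns each matching row of `df1` once and never adds columns from `df2`.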

## Iteration

| Task / Operation            | Python (Pandas)                       | Tidyverse (dplyr and purrr)       |
|-----------------------------|--------------------------------------|-----------------------------------|
| **Iterate Over Rows**       | `for index, row in df.iterrows():`   | `data %>% rowwise() %>% mutate(new_col = your_function(col))` |
|                             | `    print(row['col1'], row['col2'])` |                                       |
| **Map Function to Column**  | `df['new_col'] = df['col'].apply(your_function)` | `data %>% mutate(new_col = map_dbl(col, your_function))` |
| **Apply Function to Column**| `df['new_col'] = your_function(df['col'])` | `data %>% mutate(new_col = your_function(col))` |
| **Group and Map**           | `for group, group_df in df.groupby('group_col'):` | `data %>% nest(data = -group_col) %>% mutate(new_col = map(data, your_function))` |
| **Map Over List Column**    | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| **Map with Anonymous Function** | `df['new_col'] = df['col'].apply(lambda x: your_function(x))` | `data %>% mutate(new_col = map_dbl(col, ~your_function(.x)))` |
| **Map Multiple Columns**    | `df['new_col'] = df.apply(lambda row: your_function(row['col1'], row['col2']), axis=1)` | `data %>% mutate(new_col = pmap_dbl(list(col1, col2), your_function))` |
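
As a rough illustration of the pandas column above, the sketch below runs the element-wise, row-wise, and group-wise patterns on a toy data frame. `your_function` and `combine` are hypothetical placeholders, as is the data.

```python
import pandas as pd

def your_function(value):
    """Hypothetical element-wise function (stands in for any scalar transform)."""
    return value * 2

def combine(a, b):
    """Hypothetical two-argument function used for the row-wise example."""
    return a + b

df = pd.DataFrame({
    "group_col": ["a", "a", "b"],
    "col1": [1, 2, 3],
    "col2": [10, 20, 30],
})

# Map a function over one column (purrr: map_dbl(col1, your_function))
df["doubled"] = df["col1"].apply(your_function)

# Map over several columns row by row (purrr: pmap_dbl(list(col1, col2), combine))
df["combined"] = df.apply(lambda row: combine(row["col1"], row["col2"]), axis=1)

# Iterate over groups (tidyr/purrr: nest() followed by map())
group_sums = {group: int(group_df["col1"].sum())
              for group, group_df in df.groupby("group_col")}

print(df)
print(group_sums)
```

When the function is already vectorized, calling it on the whole column (the "Apply Function to Column" row) is usually faster than `.apply` in pandas, just as a plain `mutate()` is usually faster than `map_*()` in R.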

## Iteration Over Lists

| Task / Operation                  | Python (Pandas)                          | Tidyverse (dplyr and purrr)               |
|-----------------------------------|-----------------------------------------|-------------------------------------------|
| **Map Function Across List Column**| `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| **Nested Map in List Column**     | `df['new_col'] = df['list_col'].apply(lambda x: [your_function(i) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| **Nested Map Across Columns**     | -                                       | `data %>% mutate(new_col = map2(col1, col2, ~map(list(.x, .y), your_function)))` |
| **Nested Map Within List Column** | -                                       | `data %>% mutate(new_col = map(list_col, ~map(.x, your_function)))` |
| **Map Across Rows with Nested Map**| -                                     | `data %>% mutate(new_col = pmap(list(col1, col2), ~list(your_function(.x), your_function(.y))))` |
| **Nested Map Within Nested List**   | -                                       | `data %>% mutate(new_col = map(list_col, ~map_depth(.x, 2, your_function)))` |
| **Nested Map Across List of Lists** | `df['new_col'] = df['list_col'].apply(lambda x: [list(map(your_function, i)) for i in x])` | `data %>% mutate(new_col = map(list_col, ~map(.x, ~map(.x, your_function))))` |
| **Nested Map Across Rows and Lists**| -                                     | `data %>% mutate(new_col = pmap(list(col1, col2, col3), ~list(your_function(..1), your_function(..2), your_function(..3))))` |
| **Map and Reduce Across List**      | `df['new_col'] = df['list_col'].apply(lambda x: reduce(your_function, x))` (with `from functools import reduce`) | `data %>% mutate(new_col = map(list_col, ~reduce(.x, your_function)))` |
| **Map and Reduce Across Rows**      | `df['new_col'] = df.apply(lambda row: reduce(your_function, row[['col1', 'col2']]), axis=1)` | `data %>% mutate(new_col = pmap(list(col1, col2), ~reduce(list(...), your_function)))` |
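
For the list-column rows, a short runnable sketch may help show what "map" and "reduce" actually produce per row. The data and `your_function` are hypothetical, and `operator.add` is just one example of a two-argument reducer.

```python
from functools import reduce
import operator

import pandas as pd

def your_function(i):
    """Hypothetical element-wise function applied inside each list."""
    return i + 1

# Hypothetical data frame with a list column
df = pd.DataFrame({"list_col": [[1, 2, 3], [4, 5], [6]]})

# Map within each list (purrr: map(list_col, ~map(.x, your_function)))
df["mapped"] = df["list_col"].apply(lambda xs: [your_function(i) for i in xs])

# Reduce each list to a single value (purrr: map(list_col, ~reduce(.x, `+`)))
df["reduced"] = df["list_col"].apply(lambda xs: reduce(operator.add, xs))

print(df)
```

Note that `reduce` lives in `functools` in Python 3, so it needs an explicit import, whereas `purrr::reduce()` takes the list first and the function second.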

## String Operations

| Task / Operation | Python (Pandas) | Tidyverse (dplyr and stringr) |
|------------------|-----------------|-------------------------------|
| **String Length** | `df['col'].str.len()` | `data %>% mutate(new_col = str_length(col))` |
| **Concatenate Strings** | `df['new_col'] = df['col1'] + df['col2']` | `data %>% mutate(new_col = str_c(col1, col2))` |
| **Split Strings** | `df['col'].str.split(', ')` | `data %>% mutate(new_col = str_split(col, ', '))` |
| **Substring** | `df['col'].str.slice(0, 5)` | `data %>% mutate(new_col = str_sub(col, 1, 5))` |
| **Replace Substring** | `df['col'].str.replace('old', 'new')` | `data %>% mutate(new_col = str_replace_all(col, 'old', 'new'))` |
| **Uppercase / Lowercase** | `df['col'].str.upper()` | `data %>% mutate(new_col = str_to_upper(col))` |
| | `df['col'].str.lower()` | `data %>% mutate(new_col = str_to_lower(col))` |
| **Strip Whitespace** | `df['col'].str.strip()` | `data %>% mutate(new_col = str_trim(col))` |
| **Check for Substring** | `df['col'].str.contains('pattern')` | `data %>% mutate(new_col = str_detect(col, 'pattern'))` |
| **Count Substring Occurrences** | `df['col'].str.count('pattern')` | `data %>% mutate(new_col = str_count(col, 'pattern'))` |
| **Find First Occurrence of Substring** | `df['col'].str.find('pattern')` | `data %>% mutate(new_col = str_locate(col, 'pattern')[, 1])` |
| **Extract Substring with Regex** | `df['col'].str.extract(r'(\d+)')` | `data %>% mutate(new_col = str_extract(col, '(\\d+)'))` |
| **Remove Duplicates in Strings** | `df['col'].unique()` | `str_unique(data$col)` |
| **Pad Strings** | `df['col'].str.pad(width=10, side='right', fillchar='0')` | `data %>% mutate(new_col = str_pad(col, width = 10, side = 'right', pad = '0'))` |
| **Truncate Strings** | `df['col'].str.slice(0, 10)` | `data %>% mutate(new_col = str_sub(col, 1, 10))` |
| **Title Case** | `df['col'].str.title()` | `data %>% mutate(new_col = str_to_title(col))` |
| **Join List of Strings** | `'separator'.join(df['col'])` | `data %>% summarize(new_col = str_flatten(col, collapse = 'separator'))` |
| **Remove Punctuation** | `df['col'].str.replace(r'[^\w\s]', '', regex=True)` | `data %>% mutate(new_col = str_remove_all(col, '[[:punct:]]'))` |
| **String Encoding/Decoding** | `df['col'].str.encode('utf-8')` | `data %>% mutate(new_col = iconv(col, to = 'UTF-8'))` |
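
A few of these string operations behave slightly differently across the two languages (0-based versus 1-based positions, replace-all versus replace-first), so here is a small sketch of the pandas side on made-up strings. The sample values are hypothetical and `regex=False` is passed explicitly so the example does not depend on a pandas version default.

```python
import pandas as pd

# Hypothetical strings chosen to exercise whitespace, digits, and repeats
df = pd.DataFrame({"col": ["  a-b  c ", "item 42", "banana"]})
s = df["col"]

stripped = s.str.strip()                          # trims the ends only, like str_trim()
replaced = s.str.replace("a", "o", regex=False)   # replaces every match, like str_replace_all()
found = s.str.find("an")                          # 0-based position, -1 when absent
counted = s.str.count("an")                       # number of matches per string
extracted = s.str.extract(r"(\d+)", expand=False) # first run of digits, NaN when absent
joined = ", ".join(stripped)                      # collapse a column into one string

print(stripped.tolist())
print(replaced.tolist())
print(found.tolist())
print(joined)
```

The position difference is the easiest to miss: `str.find` reports `1` for `"banana"`, while `str_locate()` in stringr would report `2` for the same match, and it returns `NA` rather than `-1` when nothing matches.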

## Modeling and Machine Learning

| Task / Operation              | Python (scikit-learn)                   | R (various packages)                    |
|-------------------------------|----------------------------------------|----------------------------------------|
| **Data Preprocessing**        | `from sklearn.preprocessing import ...`  | `library(caret)`                       |
|                               | `from sklearn.pipeline import Pipeline` | `library(glmnet)`                      |
|                               | `preprocessor = ...`                  | `preprocess <- preProcess(data, ...)`   |
| **Feature Scaling**           | `StandardScaler()`                     | `preProcess(data, method = c("center", "scale"))` |
| **Feature Selection**         | `SelectKBest()`                        | `caret::sbf()`, `caret::rfe()`          |
| **Data Splitting**            | `train_test_split()`                   | `createDataPartition()`                 |
| **Model Initialization**      | `model = ...()`                        | `model <- ...()`                       |
| **Model Training**            | `model.fit(X_train, y_train)`          | `model <- train(y ~ ., data = data)`   |
| **Model Prediction**          | `y_pred = model.predict(X_test)`        | `y_pred <- predict(model, newdata)`    |
| **Model Evaluation**          | `accuracy_score(y_test, y_pred)`       | `confusionMatrix(y_pred, y_true)`      |
| **Hyperparameter Tuning**     | `GridSearchCV()`                       | `train(..., tuneGrid = expand.grid(...))` |
| **Cross-Validation**          | `cross_val_score()`                    | `trainControl(method = "cv")`           |
| **Model Pipelining**          | `pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])` | `model <- train(y ~ ., data = data, method = model, trControl = trainControl(method = "cv"))` |
| **Feature Engineering**         | `from sklearn.preprocessing import ...` | `library(caret)`                     |
|                                 | Custom feature transformers          | Custom feature transformers           |
| **Handling Missing Data**       | `SimpleImputer()`                     | `preProcess(data, method = "medianImpute")` |
| **Encoding Categorical Data**   | `OneHotEncoder()`                     | `dummyVars()`                        |
| **Dimensionality Reduction**    | `PCA()`                               | `preProcess(data, method = "pca")`    |
| **Model Selection**             | `GridSearchCV()`                      | `caret::train()`                      |
| **Ensemble Learning**           | Various ensemble methods              | `caretEnsemble::caretStack()`         |
| **Regularization**              | Lasso, Ridge, Elastic Net, etc.       | `glmnet()`                            |
| **Model Interpretability**      | SHAP, LIME, etc.                      | DALEX, iml, etc.                      |
| **Model Export/Serialization**   | `joblib` or `pickle`                  | `saveRDS` or other formats            |
| **Deploying Models**            | Web frameworks (e.g., Flask, Django)  | Web frameworks (e.g., Shiny, Plumber) |
| **Batch Scoring**               | Scripting or automation tools         | R batch processing                    |
| **Feature Scaling/Normalization**| `StandardScaler()`, `MinMaxScaler()`, etc. | `scale()`, `preProcess(data, method = "range")`, etc. |
| **Feature Selection with L1 Regularization** | `SelectFromModel()`, `Lasso()`  | `glmnet()`, `cv.glmnet()`            |
| **Handling Imbalanced Data**    | `RandomUnderSampler()`, `SMOTE()`, etc. (from `imbalanced-learn`) | `trainControl(sampling = ...)` or `train(weights = ...)` |
| **Model Evaluation Metrics**    | `classification_report()`, `confusion_matrix()`, `mean_squared_error()`, etc. | `confusionMatrix()`, `postResample()`, `RMSE()`, etc. |
| **Feature Importance**          | `.feature_importances_` (Random Forest, etc.) | `varImp()`, `vip()`, etc.         |
| **Model Persistence**           | `joblib`, `pickle` | `saveRDS`, `save()`, `serialize()`, etc. |
| **Time Series Forecasting**     | `Prophet`, `ARIMA`, `ExponentialSmoothing`, etc. | `forecast`, `prophet`, `auto.arima`, etc. |
| **Natural Language Processing (NLP)** | `nltk`, `spaCy`, `textblob`, etc. | `tm`, `quanteda`, `udpipe`, `tm.plugin.webmining`, etc. |
| **Deep Learning**               | `Keras`, `TensorFlow`, `PyTorch`, etc. | `keras`, `tensorflow`, `torch`, `mxnet`, etc. |
| **Model Interpretation**        | `SHAP`, `LIME`, `ELI5`, etc.         | `DALEX`, `iml`, `iBreakDown`, `lime`, etc. |
| **Model Deployment in Production** | Containers, cloud platforms (e.g., Docker, Kubernetes, AWS SageMaker) | Containers, Shiny, Plumber, APIs, cloud platforms |
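
Since the table lists the scikit-learn pieces one at a time, the sketch below shows one way they might fit together end to end: split, scale, fit, tune, predict, and score. The synthetic data from `make_classification`, the choice of `LogisticRegression`, and the grid of `C` values are all assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (an assumption; swap in your own X and y)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Data splitting (caret: createDataPartition())
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pipelining: scaling followed by a model
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with 5-fold cross-validation
# (caret: train(..., tuneGrid = ..., trControl = trainControl(method = "cv")))
search = GridSearchCV(pipeline, param_grid={"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Prediction and evaluation on the held-out set
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(search.best_params_)
print(accuracy)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, which avoids leaking information from the validation folds into the preprocessing step.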

## Network Modeling and Dynamics

| Task / Operation                | Python (NetworkX)                    | R (various packages)                    |
|---------------------------------|--------------------------------------|----------------------------------------|
| **Network Creation**            | `G = nx.Graph()`, `G.add_node()`, `G.add_edge()` | `igraph::make_empty_graph()`, `add_vertices()`, `add_edges()` |
| **Node and Edge Attributes**    | `G.nodes[node]['attribute'] = value`, `G.edges[edge]['attribute'] = value` | `V(graph)$attribute <- value`, `E(graph)$attribute <- value` |
| **Network Visualization**       | `nx.draw(G)`, `matplotlib` for customization | `plot(graph)`, `igraph`, `ggplot2`, `visNetwork`, etc. |
| **Network Measures**            | `nx.degree_centrality(G)`, `nx.betweenness_centrality(G)`, `nx.clustering(G)`, etc. | `degree()`, `betweenness()`, `transitivity()`, etc. |
| **Community Detection**         | `nx.community.louvain_communities(G)`, `nx.community.girvan_newman(G)` | `cluster_walktrap()`, `cluster_fast_greedy()`, `cluster_leading_eigen()`, etc. |
| **Link Prediction**             | `nx.jaccard_coefficient(G)`, `nx.adamic_adar_index(G)`, `nx.preferential_attachment(G)` | `igraph::similarity(graph, method = "jaccard")`, `method = "invlogweighted"` (Adamic-Adar) |
| **Network Filtering/Selection** | `G.subgraph(nodes)`                | `induced_subgraph(graph, vertices)`    |
| **Network Embedding**           | `node2vec`, `GraphSAGE`, etc.        | `igraph::embed_adjacency_matrix()`, `igraph::embed_laplacian_matrix()`, etc. |
| **Network Simulation**          | `nx.erdos_renyi_graph()`, `nx.watts_strogatz_graph()`, etc. | `igraph::sample_gnp()`, `igraph::sample_smallworld()`, etc. |
| **Network Analysis Pipelines**  | Custom pipelines using NetworkX, Pandas, and other libraries | Custom pipelines using igraph, dplyr, and other packages |
| **Dynamic Network Analysis**    | `dynetx` for dynamic networks       | `tsna` for temporal networks, `dyngraph` for dynamic graphs, etc. |
| **Geospatial Network Analysis** | `osmnx` for urban network analysis  | `stplanr` for transport planning, `spatnet` for spatial network analysis, etc. |
| **Network Modeling for Machine Learning** | Integration with scikit-learn, PyTorch, etc. | Integration with caret, glmnet, keras, etc. |
| **Community Visualization**      | Visualization of detected communities using network layouts | `igraph::plot.igraph()` with community coloring |
| **Path Analysis**               | Shortest paths, k-shortest paths, and all simple paths | `shortest_paths()`, `all_simple_paths()` |
| **Centrality Analysis**         | Closeness centrality, eigenvector centrality, Katz centrality, etc. | `closeness()`, `eigen_centrality()`, `alpha_centrality()`, etc. |
| **Structural Role Analysis**    | Structural equivalence, equivalence-based roles | `sna::sedist()`, `sna::equiv.clust()`, etc. |
| **Network Robustness Analysis**  | Network attack simulations, robustness metrics | `delete_vertices()`, `components()`, etc. |
| **Temporal Network Analysis**   | Temporal networks, evolving networks | `dynnet` package for dynamic networks, temporal extensions of `igraph` functions |
| **Multiplex Network Analysis**  | Analyzing multiple layers of networks | `multiplex` package for multilayer networks, `mgm` package for multilayer graphical models |
| **Network Alignment**           | Aligning nodes in two or more networks | `netAlign` package for network alignment, `gmatch` package for graph matching |
| **Dynamic Community Detection**  | Detecting evolving communities over time | `dynCOMM` for dynamic community detection |
| **Network Generative Models**   | Generating networks from various models (e.g., ER, BA, etc.) | `igraph::sample_gnm()`, `igraph::sample_degseq()`, etc. |
| **Geospatial Network Analysis** | Geospatial network analysis and routing | `stplanr` for transport planning, `spatnet` for spatial network analysis, etc. |
| **Network Modeling for Machine Learning** | Integrating network data with machine learning libraries | Combining `igraph` or custom network features with caret, glmnet, keras, etc. |
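
As a quick sanity check on the NetworkX column, here is a minimal sketch that builds a tiny graph and runs a handful of the operations from the table. The node names, the edges, and the `role` attribute are hypothetical.

```python
import networkx as nx

# Hypothetical four-node graph: a triangle (a, b, c) plus a pendant node d
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

# Node attribute (igraph: V(graph)$role <- ...)
G.nodes["a"]["role"] = "hub"

# Network measures (igraph: degree(), shortest_paths())
degree = nx.degree_centrality(G)
path = nx.shortest_path(G, "a", "d")

# Filtering to an induced subgraph (igraph: induced_subgraph())
sub = G.subgraph(["a", "b", "c"])

# Link prediction score for one candidate edge (igraph: similarity(method = "jaccard"))
jaccard = {(u, v): score for u, v, score in nx.jaccard_coefficient(G, [("a", "d")])}

print(degree)
print(path)
print(sub.number_of_edges())
print(jaccard)
```

One difference worth keeping in mind when translating: `nx.degree_centrality` returns degrees normalized by `n - 1`, while `igraph::degree()` returns raw counts unless `normalized = TRUE` is set.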