Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kdcro101/pandas-cheatsheet
https://github.com/kdcro101/pandas-cheatsheet
Last synced: 17 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/kdcro101/pandas-cheatsheet
- Owner: kdcro101
- License: apache-2.0
- Created: 2019-08-29T07:28:54.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-10-09T09:05:04.000Z (about 5 years ago)
- Last Synced: 2024-10-23T05:16:28.898Z (2 months ago)
- Size: 71.3 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Pandas cheatsheet
https://pandas.pydata.org/### Good plotting guide
https://kanoki.org/2019/09/16/dataframe-visualization-with-pandas-plot/http://queirozf.com/entries/pandas-dataframe-plot-examples-with-matplotlib-pyplot
https://www.dataquest.io/blog/making-538-plots/
### Generate Date list
```py
pd.date_range(DATE_START, periods=DATE_DELTA.days, freq="D").tolist()
```### Configure Plotting
```py
# increase font size
matplotlib.rcParams.update({'font.size': 22})
# make bounding box tight
plt.savefig('report.png', bbox_inches='tight')
# rotate axis
df_m.plot(x='split', y=["total", "POSITIVE", "NEGATIVE"], figsize=(32, 18), grid=True, kind="bar", rot=0)
```### groupBy
```py
# Within each team we are grouping based on "Position"
gkk = daf.groupby(['Team', 'Position'])
# Print the first value in each group
gkk.first()
```### Create DataFrame
```py
df = df.DataFrame([], columns=["name", "P_accuracy", "N_accuracy"])
```### Plot multiple series with ma
```py
import matplotlib.pyplot as pltdf = pd.read_csv(csv_filepath)
df.plot(x='epoch', y=["series1", "series2"], figsize=(32, 18), grid=True, color=["red","green"])
plt.savefig(os.path.join(self.ROOT, 'report.pdf'))
plt.savefig(os.path.join(self.ROOT, 'report.png'))
```
### Get columns in DataFrame
```py
list = df.columns
```
### Add column with Moving Average
```py
df["ma"] = df.rolling(window=10)['column'].mean()
# then plot it
df.plot(x="date",y="ma",figsize=(16,9),linewidth=0.5)
```### Modify limit when printing series
```py
# set new limit
pd.set_option('display.max_rows', 1_000_000)
# reset limit
pd.reset_option('display.max_rows')
```### Check value in index
```py
10090723 not in self.df.index.values
```### Iterate through DataFrame rows
```py
for index, row in df.iterrows():
print(index, row)
```### Histograms
```py
#string value
df.label.value_counts().to_frame().style.bar()
# numerical
df["column"].hist(figsize=(16,9),bins=50)```
### Ploting
```py
df.plot(x="date",y="Change")
```### Sorting the column
```py
# accending
df.sort_values(by=['Brand'], inplace=True)
# descending
df.sort_values(by=['Brand'], inplace=True, ascending=False)```
### Adding column
```py
# with default value
df["new_column"] = None
# with data
df['capital city'] = ['Rome','Madrid','Athens','Paris','Lisbon']```
### Rename column
```py
dfnew = df.rename(columns={'old_name': 'new_name'})
```
### Removing column
```py
# Delete using del
del df['column_to_remove']
# Delete using drop()
df = df.drop(['column_to_remove'], axis=1)
```### Updating row
```py
# Update column 'name' at index 0
df.loc[0,'name'] = 123
```### Querying DataFrame
```py
# get all where 'val' > 55.00
df.loc[df['val'] > 55.00]
# get nth where 'val' > 55.00 - returns row only not array of length=1
df.loc[df['val'] > 55.00].iloc[0]# get top 1 row - returns array of rows of length=1
df.loc[df['val'] > 55.00].head(1)
# get top n rows - returns array of rows
df.loc[df['val'] > 55.00].head(5)# comparison operators
df.loc[df['val'] > 55.00].head(5)
df.loc[df['val'] == 55.00].head(5)
df.loc[df['val'] != 55.00].head(5)
df.loc[df['val'].isin(['a','b','c'])]#check for None
df.loc[df['change2'].isnull()].head(1)
#check for NOT None
df.loc[~df['change2'].isnull()].head(1)# logical - combine
df.loc[(df['val'] >= 0) & (df['column_name'] <= 101)]```
### torchtext.data.Dataset from Pandas DataFrame
```pyTEXT = data.Field(tokenize='spacy')
# example with lambda across LABEL values
LABEL = data.LabelField(preprocessing=data.Pipeline(lambda x: pro_proc(x)))
FIELDS = {'title': ('text', TEXT), 'change_prc': ('label', LABEL)}def create_dataset(dataframe,text_column,label_column, TEXT_FIELD, LABEL_FIELD):
# idData pair training is useless during training, use None to specify its corresponding field
fields = [(text_column, TEXT_FIELD), (label_column, LABEL_FIELD)]
examples = []
for text, label in tqdm(zip(dataframe[text_column], dataframe[label_column])):
examples.append(data.Example.fromlist([text, label], fields))
return examples, fields# use as
train_examples, train_fields = create_dataset(dataFrame,"title","change_prc", TEXT, LABEL)train = data.Dataset(train_examples, FIELDS)
```