https://github.com/markphamm/pandas-tutorial
🐼 A practical pandas cheatsheet with examples. Includes grouped syntax by category (inspection, manipulation, aggregation, joins, datetime, nulls) and a sample DataFrame to test all operations — perfect for interviews, studying, or daily reference.
- Host: GitHub
- URL: https://github.com/markphamm/pandas-tutorial
- Owner: MarkPhamm
- License: mit
- Created: 2025-04-12T19:46:05.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-04-17T18:27:44.000Z (8 months ago)
- Last Synced: 2025-10-10T11:35:13.051Z (2 months ago)
- Topics: pandas
- Language: Jupyter Notebook
- Homepage:
- Size: 34.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Pandas Syntax Cheatsheet
This guide covers essential pandas syntax across data inspection, basic manipulation, aggregation, joins/unions, datetime operations, and null handling.
**Getting Started**
To get started with this project, follow these steps:
1. **Clone the repository**:
   ```bash
   git clone https://github.com/MarkPhamm/pandas-tutorial.git
   cd pandas-tutorial
   ```
2. **Create a virtual environment**:
   ```bash
   python -m venv venv
   ```
3. **Activate the virtual environment**:
   - On Windows:
     ```bash
     venv\Scripts\activate
     ```
   - On macOS and Linux:
     ```bash
     source venv/bin/activate
     ```
4. **Install the required packages**:
   ```bash
   pip install -r requirements.txt
   ```
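With the environment ready, you need a DataFrame to try the snippets on. The repository's notebook defines its own sample data; the frame below is only a hypothetical stand-in, with column names chosen to match the placeholders (`col`, `group_col`, ...) used throughout this guide.
```python
import pandas as pd

# Hypothetical sample data -- a stand-in, not the notebook's actual DataFrame.
df = pd.DataFrame({
    'col': [120, 45, 120, 8, None],
    'group_col': ['a', 'b', 'a', 'b', 'a'],
    'value_col': [10, 20, 10, 30, 40],
    'date_col': pd.to_datetime(['2025-01-05', '2025-02-14', '2025-01-05',
                                '2025-03-01', '2025-04-12']),
})
```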
## 1. Data Inspection
### 1.1 View the first few rows
```python
df.head()
```
### 1.2 View the last few rows
```python
df.tail()
```
### 1.3 View column names
```python
df.columns
```
### 1.4 Access a single column
```python
df['col']
# or
df.col
```
### 1.5 Check the shape of the DataFrame
```python
df.shape
```
### 1.6 Check data types and non-null counts
```python
df.info()
```
### 1.7 Get summary statistics
```python
df.describe()
```
### 1.8 Check for duplicates
```python
df.duplicated().sum()
```
### 1.9 See value counts in a column
```python
df['col'].value_counts()
```
### 1.10 Check whether a DataFrame is empty using `.empty`
```python
if df.empty:
    df = pd.DataFrame({'SecondHighestSalary': [None]})
```
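Run against the hypothetical sample frame from the Getting Started section, these inspection calls fit together like this:
```python
# Assumes the hypothetical sample `df` sketched in Getting Started.
print(df.shape)                        # (rows, columns)
df.info()                              # dtypes and non-null counts
print(df['group_col'].value_counts())  # frequency of each group value
print(df.duplicated().sum())           # number of fully duplicated rows
```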
---
## 2. Basic Manipulation
### 2.1 Select specific columns
```python
df[['col1', 'col2']]
```
### 2.2 Filter rows using condition
```python
df[df['col'] > 100]
```
### 2.3 Filter rows using `.between()`
```python
df[df['col'].between(10, 50)]
```
### 2.4 Rename columns
```python
df.rename(columns={'old_name': 'new_name'})
```
### 2.5 Sort values
```python
df.sort_values(by='col', ascending=False)
```
### 2.6 Select rows using `.loc` (label-based)
```python
df.loc[5]
```
### 2.7 Select rows using `.iloc` (position-based)
```python
df.iloc[5]
```
### 2.8 Change data types using `astype`
```python
df = df.astype({'col_name': 'desired_dtype'})
# or
df['col_name'] = df['col_name'].astype('desired_dtype')
```
### 2.9 Filter rows NOT matching a condition using `~`
```python
df[~(df['col'] == 'target')]
# equivalent to: df[df['col'] != 'target']
```
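These manipulation steps chain naturally. A sketch on the hypothetical sample frame (column names are placeholders):
```python
# Filter, sort, rename, and project in one chain.
result = (
    df[df['col'].between(10, 150)]                   # keep rows in a value range
    .sort_values(by='value_col', ascending=False)    # largest values first
    .rename(columns={'value_col': 'value'})          # friendlier column name
    [['group_col', 'value']]                         # select only what we need
)
print(result)
```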
---
## 3. Aggregation Functions
### 3.1 Group by and sum
```python
df.groupby('group_col')['value_col'].sum().reset_index()
```
### 3.2 Group by and count
```python
df.groupby('group_col')['value_col'].count().reset_index()
```
### 3.3 Group by and count unique values
```python
df.groupby('group_col')['value_col'].nunique().reset_index()
```
### 3.4 Group by with multiple aggregations using `.agg()`
```python
df.groupby('group_col').agg({
'col1': 'sum',
'col2': 'mean',
'col3': 'nunique'
}).reset_index()
```
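The dict form above keeps the original column names. Pandas also supports named aggregation, where each keyword becomes an output column; a minimal sketch with hypothetical column names:
```python
# Named aggregation: output columns are labelled 'total', 'average', 'distinct'.
summary = (
    df.groupby('group_col')
      .agg(total=('value_col', 'sum'),
           average=('value_col', 'mean'),
           distinct=('value_col', 'nunique'))
      .reset_index()
)
print(summary)
```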
---
## 4. Join and Union
### 4.1 Merge two DataFrames on different keys
```python
df1.merge(df2, left_on='key1', right_on='key2', how='inner')
```
### 4.2 Merge using `how='left'`, `how='right'`, or `how='outer'`
```python
df1.merge(df2, on='key', how='left')
```
### 4.3 Union two DataFrames (like SQL `UNION ALL`)
```python
pd.concat([df1, df2], ignore_index=True)
```
### 4.4 Union with deduplication (like SQL `UNION`)
```python
pd.concat([df1, df2], ignore_index=True).drop_duplicates()
```
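A small worked example (two hypothetical frames) makes the difference between the `how=` options, and between `UNION ALL` and `UNION`, easier to see:
```python
import pandas as pd

# Hypothetical frames for illustration only.
orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [50, 75, 20]})
names = pd.DataFrame({'customer_id': [1, 2], 'name': ['Ann', 'Bob']})

print(orders.merge(names, on='customer_id', how='inner'))  # drops customer 3
print(orders.merge(names, on='customer_id', how='left'))   # keeps customer 3, name is NaN

stacked = pd.concat([names, names], ignore_index=True)     # UNION ALL: 4 rows
print(stacked.drop_duplicates())                           # UNION: back to 2 rows
```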
---
## 5. Datetime Functions
### 5.1 Create a timestamp
```python
pd.Timestamp('2025-04-12')
```
### 5.2 Create a time delta
```python
pd.Timedelta(days=30)
```
### 5.3 Convert a column to datetime
```python
df['date_col'] = pd.to_datetime(df['date_col'])
```
### 5.4 Extract day from datetime
```python
df['day'] = df['date_col'].dt.day
```
### 5.5 Convert datetime to monthly period
```python
df['month'] = df['date_col'].dt.to_period('M')
```
### 5.6 Calculate time difference in seconds
```python
(df['end_time'] - df['start_time']).dt.total_seconds()
```
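Putting the datetime pieces together on a hypothetical event log with string timestamps:
```python
import pandas as pd

# Hypothetical data; the column names are placeholders.
events = pd.DataFrame({
    'start_time': ['2025-04-12 09:00', '2025-04-13 14:30'],
    'end_time': ['2025-04-12 09:45', '2025-04-13 16:00'],
})
events['start_time'] = pd.to_datetime(events['start_time'])
events['end_time'] = pd.to_datetime(events['end_time'])

events['day'] = events['start_time'].dt.day                # day of month
events['month'] = events['start_time'].dt.to_period('M')   # e.g. 2025-04
events['duration_s'] = (events['end_time'] - events['start_time']).dt.total_seconds()
print(events)
```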
---
## 6. Null Handling
### 6.1 Check for null values
```python
df['col'].isnull()
```
### 6.2 Filter rows with null values
```python
df[df['col'].isnull()]
```
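On the hypothetical sample frame, the null mask can also be inverted with `.notnull()` to keep only the complete rows:
```python
# Assumes the hypothetical sample `df` sketched in Getting Started.
print(df['col'].isnull())       # boolean mask, True where 'col' is missing
print(df[df['col'].isnull()])   # rows where 'col' is missing
print(df[df['col'].notnull()])  # complement: rows where 'col' is present
```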
---
## 7. Duplicates Handling
### 7.1 Drop duplicate rows
```python
df.drop_duplicates()
```
### 7.2 Drop duplicates based on specific columns
```python
df.drop_duplicates(subset=['col1', 'col2'])
```
### 7.3 Keep the last occurrence of duplicates
```python
df.drop_duplicates(keep='last')
```
### 7.4 Drop all occurrences of duplicated rows
```python
df.drop_duplicates(keep=False)
```
### 7.5 Get only duplicate rows
```python
df[df.duplicated()]
```
### 7.6 Mark duplicates with a boolean
```python
df.duplicated()
```
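The `keep` argument decides which occurrences count as duplicates; a quick sketch on a small hypothetical frame:
```python
import pandas as pd

# Hypothetical frame with one repeated row.
dup = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

print(dup.duplicated())                  # [False, True, False]: first occurrence not flagged
print(dup.duplicated(keep=False))        # [True, True, False]: every occurrence flagged
print(dup[~dup.duplicated(keep=False)])  # only the row that was never duplicated
```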
---
## 8. CSV File Handling
### 8.1 Read a CSV file
```python
df = pd.read_csv('data.csv')
```
### 8.2 Read a CSV with index column
```python
df = pd.read_csv('data.csv', index_col=0)
```
### 8.3 Write a DataFrame to CSV
```python
df.to_csv('output.csv', index=False)
```
### 8.4 Read a CSV with specific columns
```python
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
```
### 8.5 Read only first N rows from a CSV
```python
df = pd.read_csv('data.csv', nrows=100)
```
### 8.6 Skip initial rows while reading
```python
df = pd.read_csv('data.csv', skiprows=1)
```
### 8.7 Handle missing values while reading
```python
df = pd.read_csv('data.csv', na_values=['NA', 'null', 'NaN'])
```
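A self-contained round trip (using `io.StringIO` in place of a real file, so nothing is written to disk) shows how these read options combine; the file contents and column names here are hypothetical:
```python
import io

import pandas as pd

# StringIO stands in for 'data.csv'; the same keywords work with a real path.
csv_text = "col1,col2,col3\n1,NA,x\n2,5,y\n3,6,null\n"

df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['col1', 'col2'],      # keep only two columns
                 na_values=['NA', 'null'])      # treat these strings as NaN
print(df)

buffer = io.StringIO()
df.to_csv(buffer, index=False)                  # write back without the index
print(buffer.getvalue())
```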
---