https://github.com/noodleslove/house-of-representative-analysis-i
This project uses public data about the stock trades made by members of the US House of Representatives.
https://github.com/noodleslove/house-of-representative-analysis-i
data-analysis data-science eda kaggle-dataset matplotlib-pyplot pandas python stocks-trading
Last synced: 3 months ago
JSON representation
This project uses public data about the stock trades made by members of the US House of Representatives.
- Host: GitHub
- URL: https://github.com/noodleslove/house-of-representative-analysis-i
- Owner: noodleslove
- Created: 2022-12-12T21:41:43.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-12T22:03:43.000Z (over 2 years ago)
- Last Synced: 2025-01-28T22:26:55.693Z (5 months ago)
- Topics: data-analysis, data-science, eda, kaggle-dataset, matplotlib-pyplot, pandas, python, stocks-trading
- Language: Jupyter Notebook
- Homepage:
- Size: 1.34 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Stock Trades by Members of the US House of Representatives
This project uses public data about the stock trades made by members of the US House of Representatives. This data is collected and maintained by Timothy Carambat as part of the [House Stock Watcher](https://housestockwatcher.com/) project. The project describes itself as follows:
> With recent and ongoing investigations of incumbent congressional members being investigated for potentially violating the STOCK act. This website compiles this publicly available information in a format that is easier to digest then the original PDF source.
>
> Members of Congress must report periodic reports of their asset transactions. This website is purely for an informative purpose and aid in transparency.
>
> This site does not manipluate or censor any of the information from the original source. All data is transcribed by our community of contributors, which you can join for free by going to our transcription tool. Our moderation team takes great care in ensuring the accuracy of the information.
>
> This site is built and maintained by Timothy Carambat and supported with our contributors.Some interesting questions to consider for this data set include:
- Is there a difference in stock trading behavior between political parties? For example:
- does one party trade more often?
- does one party make larger trades?
- do the two parties invest in different stocks or sectors? For instance, do Democrats invest in Tesla more than Republicans?
- What congresspeople have made the most trades?
- What companies are most traded by congresspeople?
- Is there evidence of insider trading? For example, Boeing stock dropped sharply in February 2020. Were there a suspiciously-high number of sales of Boeing before the drop?
- When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?### Getting the Data
The full data set of stock trade disclosures is available as a CSV or as JSON at https://housestockwatcher.com/api.
This data set does not, however, contain the political affiliation of the congresspeople. If you wish to investigate a question that relies on having this information, you'll need to find another dataset that contains it and perform a merge. *Hint*: Kaggle is a useful source of data sets.
### Cleaning and EDA
- Clean the data.
- Certain fields have "missing" data that isn't labeled as missing. For example, there are fields with the value "--." Do some exploration to find those values and convert them to null values.
- You may also want to clean up the date columns to enable time-series exploration.
- Understand the data in ways relevant to your question using univariate and bivariate analysis of the data as well as aggregations.### Assessment of Missingness
- Assess the missingness per the requirements in `project03.ipynb`
### Hypothesis Test / Permutation Test
Find a hypothesis test or permutation test to perform. You can use the questions at the top of the notebook for inspiration.# Summary of Findings
### Introduction
The following notebook analyzes publicly available information regarding stock trades made by US House of Representatives members to reach inferences. We first cleaned up the dataset and combined _The 116th U.S. House of Representatives_ at [https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116](https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116) which contains information regarding political affiliation of congresspeople. Next, we evaluate the dataset's `owner` column's missingness association.
After that, we began processing the dataset for insights, and try to explore answers to the following questions:
- Does one party trade more often?
- Does one party make larger trades?
- What congresspeople have made the most trades?
- What companies are most traded by congresspeople?
- When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?### Cleaning and EDA
After obtaining the complete dataset of stock trade disclosures from https://housestockwatcher.com/api, we discover that the data need to be cleaned up since they are fairly untidy. To clean it, we did what is shown below:
1. Change `disclosure_date` and `transaction_date` column to `datetime` type.
2. Replace '--' value in `ticker` column with `np.NaN`.
3. Replace '--' value in `owner` column with `np.NaN`.
4. Convert `amount` to a `pd.Categorical` series.The political affiliation of congressmen is missing from the dataset after it has been cleaned up, therefore we choose to utilize one from __Kaggle__ at [https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116](https://www.kaggle.com/datasets/aavigan/house-of-representatives-congress-116). Due of the distinctions in the names between the two datasets, we combined them using the first and last names of each participant. Then we examine a few rare occurrences and manually resolve them. We were able to successfully integrate stock trade activity with representative political allegiance as an outcome.
Moving on, we process to EDA and find out that:
- `owner`, `ticker`, `transaction_date`, and `asset_description` are 4 columns that contain missing data, some of the missingness in `transaction_date` is because of the incorrect value. For example, there are a few cell with value `0009-06-09` which is clearly not a valid date.
- `transaction_date` range between `2012-06-19` and `2022-10-21`.
- Most of the congresspeople are either from `Democrat` or `Republican`, there is only one house member who is listed as `Independent` regrading their political affiliation.#### What congresspeople have made the most trades?
- By plotting the value counts of `representative` column, we have discovered that the representative __Josh Gottheimer__ has made the most trades.#### What companies are most traded by congresspeople?
- By plotting the value counts of `ticker` column, we have discovered that the ticker __MSFT__, which is __Microsoft Corp.__, has the most trade transactions.#### When are stocks bought and sold? Is there a day of the week that is most common? Or a month of the year?
- By grouping the dateset by weekday of `transaction_date`, such that most of the transactions happened during weekdays, while only a tiny amount of transactions are done in weekend. Among weekdays, __Thurday__ seems to have a slightly higher transaction volume.
- By grouping the dataset by month of `transation_date`, we discover that __February__ is the month has largest volume of transactions.### Assessment of Missingness
In this section we decided to evaluate the missingness of `owner` column as it has the most missing values across all columns. It has values like `self`, `joint`, `dependent`, and `np.NaN`. We think the missingness of `owner` column could be associated with `type` column. This concept arises from the fact that `type` describes the sort of transaction that is performed; if the type of transaction is not a stock exchange, it is less likely to fall into the `self`, `joint`, or `dependant` categories and end up as an empty value.
In order to validate this assumption, we have to perform a permutation test. To begin with, we determine the test statistic to be __Total Variation Distance (TVD)__, as `type` is a categorical data. Then, we calculate the observed statistic for the original dataset, which is `0.07390`. Afterwards, we shuffle the `owner` column and calculate the simulate statistics. By repeating this prcoess for 5,000 times, we then calculate the p-value for this permutation. As a result, we get a p-value of `0.0` which indicates that none of the simulate statistics has a more extreme result than the observed statistics. In conclusion, we conclude that the missingness of `owner` is __Missing at Random (MAR)__, and it's dependent on `type` column the most.
### Hypothesis Test
#### Which party trade more often?
* **Null hypothesis**: the distribution of trading frequency among congresspeople from different party is the same. The difference between the two observed sample is due to chance.
* **Alternative hypothesis**: the distribution of trading frequency among congresspeople from different party are different.
For the test statistics, we calculate the average trading transactions per month of each party and take the absolute difference between them. The observed statistics is `58.5469`, and we shuffle the `party` column and run the permutation test for 5,000 times. At the end, we get a p-value of `0.8756`, which indicates that majority of the permutation test cases have more extreme result than the observed statistics. Therefore, we __fail to reject__ the null hypothesis, the distribution of trading frequency among congresspeople from various parties is probrably the same.
# Code
```python
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # Higher resolution figures
```### Load transaction dataset
```python
transactions = pd.read_csv('data/all_transactions.csv')
transactions.head()
```.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
disclosure_year
disclosure_date
transaction_date
owner
ticker
asset_description
type
amount
representative
district
ptr_link
cap_gains_over_200_usd
0
2021
10/04/2021
2021-09-27
joint
BP
BP plc
purchase
$1,001 - $15,000
Hon. Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
1
2021
10/04/2021
2021-09-13
joint
XOM
Exxon Mobil Corporation
purchase
$1,001 - $15,000
Hon. Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
2
2021
10/04/2021
2021-09-10
joint
ILPT
Industrial Logistics Properties Trust - Common...
purchase
$15,001 - $50,000
Hon. Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
3
2021
10/04/2021
2021-09-28
joint
PM
Phillip Morris International Inc
purchase
$15,001 - $50,000
Hon. Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
4
2021
10/04/2021
2021-09-17
self
BLK
BlackRock Inc
sale_partial
$1,001 - $15,000
Hon. Alan S. Lowenthal
CA47
https://disclosures-clerk.house.gov/public_dis...
False
### Cleaning and EDA
```python
cleaned = transactions.copy()# convert `disclosure_date`, `transaction_date` to datetime type
cleaned['disclosure_date'] = pd.to_datetime(cleaned['disclosure_date'])
cleaned['transaction_date'] = pd.to_datetime(cleaned['transaction_date'], errors='coerce')# change `ticker` null values
cleaned['ticker'] = cleaned['ticker'].replace('--', np.NaN)# cahnge `owner` null values
cleaned['owner'] = cleaned['owner'].replace('--', np.NaN)# convert `amount` to categorical type
cleaned['amount'] = pd.Categorical(cleaned['amount'])cleaned.info()
```
RangeIndex: 15674 entries, 0 to 15673
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 disclosure_year 15674 non-null int64
1 disclosure_date 15674 non-null datetime64[ns]
2 transaction_date 15667 non-null datetime64[ns]
3 owner 8346 non-null object
4 ticker 14378 non-null object
5 asset_description 15670 non-null object
6 type 15674 non-null object
7 amount 15674 non-null category
8 representative 15674 non-null object
9 district 15674 non-null object
10 ptr_link 15674 non-null object
11 cap_gains_over_200_usd 15674 non-null bool
dtypes: bool(1), category(1), datetime64[ns](2), int64(1), object(7)
memory usage: 1.2+ MB```python
cleaned.isna().sum()```
disclosure_year 0
disclosure_date 0
transaction_date 7
owner 7328
ticker 1296
asset_description 4
type 0
amount 0
representative 0
district 0
ptr_link 0
cap_gains_over_200_usd 0
dtype: int64### Combine with political affliation dataset
```python
# remove unwanted name suffixs
suffixs = ['Hon\\.', 'Mr\\.', 'Mrs\\.', 'None', 'Aston', 'S\\.', 'W\\.']
cleaned['representative'] = (cleaned['representative']
.str.replace('|'.join(suffixs), '', regex=True)
.str.strip())cleaned['representative'].head()
```
0 Virginia Foxx
1 Virginia Foxx
2 Virginia Foxx
3 Virginia Foxx
4 Alan Lowenthal
Name: representative, dtype: object```python
# split representative name into `first_name` and `last_name` for later merge
cleaned['first_name'] = cleaned['representative'].apply(lambda x: x.split()[0].lower())
cleaned['last_name'] = cleaned['representative'].apply(lambda x: x.split()[-1].lower())# fix special cases
cleaned.loc[cleaned['representative'] == 'Neal Patrick Dunn MD, FACS', 'last_name'] = 'dunn'cleaned['first_name'].head()
```
0 virginia
1 virginia
2 virginia
3 virginia
4 alan
Name: first_name, dtype: object```python
# import member table 1
members1 = pd.read_csv('data/us-house.csv')
members1 = members1[['party', 'first_name', 'last_name']]
members1['first_name'] = members1['first_name'].str.lower()
members1['last_name'] = members1['last_name'].str.lower()
members1['party'] = members1['party'].str.capitalize()members1.head(10)
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
party
first_name
last_name
0
Republican
don
young
1
Republican
jerry
carl
2
Republican
felix
moore
3
Republican
mike
rogers
4
Republican
robert
aderholt
5
Republican
mo
brooks
6
Republican
gary
palmer
7
Democrat
terri
sewell
8
Republican
rick
crawford
9
Republican
french
hill
```python
# import member table 2
members2 = pd.read_csv('data/house_members_116.csv')
members2['first_name'] = members2['name'].apply(
lambda x: x.split('-')[0].lower())
members2['last_name'] = members2['name'].apply(
lambda x: x.split('-')[-1].lower())
members2 = members2.rename(columns={'current_party': 'party'})[
['first_name', 'last_name', 'party']]# unify party values
members2.loc[members2['party'] == 'Democratic', 'party'] = 'Democrat'members2.head(10)
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
first_name
last_name
party
0
ralph
abraham
Republican
1
alma
adams
Democrat
2
robert
aderholt
Republican
3
pete
aguilar
Democrat
4
rick
allen
Republican
5
colin
allred
Democrat
6
justin
amash
Independent
7
mark
amodei
Republican
8
kelly
armstrong
Republican
9
jodey
arrington
Republican
```python
# combine 2 member tables
members = (pd.concat([members1, members2])
.sort_values(['first_name', 'last_name'])
.drop_duplicates(subset=['first_name', 'last_name'])
.reset_index(drop=True))# fix mismatch names
members.loc[members['first_name'] == 'k', 'first_name'] = 'k.'
members.loc[members['first_name'] == 'raul', 'first_name'] = 'raúl'
members.loc[members['first_name'] == 'wm', 'first_name'] = 'wm.'
members.loc[members['first_name'] == 'ro', 'first_name'] = 'rohit'
members.loc[members['first_name'] == 'cynthia', 'first_name'] = 'cindy'
members.loc[members['last_name'] == 'allen', 'first_name'] = 'richard'
members.loc[members['last_name'] == 'steube', 'first_name'] = 'greg'
members.loc[members['last_name'] == 'banks', 'first_name'] = 'james'
members.loc[(members['first_name'] == 'j') & (
members['last_name'] == 'hill'), 'first_name'] = 'james'
members.loc[(members['first_name'] == 'mike') & (
members['last_name'] == 'garcia'), 'first_name'] = 'michael'
members.loc[members['last_name'] == 'crenshaw', 'first_name'] = 'daniel'
members.loc[members['last_name'] == 'taylor', 'first_name'] = 'nicholas'
members.loc[members['last_name'] == 'gallagher', 'first_name'] = 'michael'
members.loc[(members['first_name'] == 'gregory') & (
members['last_name'] == 'murphy'), 'first_name'] = 'greg'
members.loc[members['first_name'] == 'ashley', 'last_name'] = 'arenholz'
members.loc[members['last_name'] == 'buck', 'first_name'] = 'kenneth'
members.loc[members['last_name'] == 'costa', 'first_name'] = 'james'
members.loc[members['last_name'] == 'hagedorn', 'first_name'] = 'james'# drop duplicate rows
members = members.drop_duplicates(subset=['first_name', 'last_name'])# output cleaned representative table
members.to_csv('data/cleaned_members.csv', index=False)members.shape
```
(547, 3)
```python
# transaction table with member info table
combined = cleaned.merge(members, how='left', on=['first_name', 'last_name'])combined.loc[combined['party'].isna(), 'representative'].unique()
```
array([], dtype=object)
```python
combined.shape
```(15674, 15)
```python
transactions.shape
```(15674, 12)
```python
combined.to_csv('data/combined_transactions.csv', index=False)
``````python
combined['disclosure_year'].plot(kind='hist', density=True, bins=range(2019,2025,1));
```

```python
combined['type'].value_counts().plot(kind='bar')
plt.xticks(rotation=0);
```

```python
combined['amount'].value_counts().plot(kind='barh');
```

```python
combined['representative'].value_counts().head(10).plot(kind='barh');
```

```python
combined['district'].value_counts().head(10).plot(kind='barh');
```

```python
combined['ticker'].value_counts().head(10).plot(kind='barh');
```

### Assessment of Missingness
```python
combined.isna().sum()
```disclosure_year 0
disclosure_date 0
transaction_date 7
owner 7328
ticker 1296
asset_description 4
type 0
amount 0
representative 0
district 0
ptr_link 0
cap_gains_over_200_usd 0
first_name 0
last_name 0
party 0
dtype: int64```python
combined['owner'].unique()
```array(['joint', 'self', nan, 'dependent'], dtype=object)
```python
def make_dist(df, missing_col, col):
dist = (
df
.assign(**{f'{missing_col}_null': df[missing_col].isna()})
.pivot_table(index=col, columns=f'{missing_col}_null', aggfunc='size', fill_value=0)
)
dist = dist / dist.sum()
dist.plot(kind='barh', legend=True, title=f"{col.capitalize()} by Missingness of {missing_col.capitalize()}")
plt.show()
return distdef calc_tvd(df, missing_col, col):
dist = (
df
.assign(**{f'{missing_col}_null': df[missing_col].isna()})
.pivot_table(index=col, columns=f'{missing_col}_null', aggfunc='size', fill_value=0)
)
dist = dist / dist.sum()
return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2def missingness_perm_test(df, missing_col, col):
shuffled = df.copy()
shuffled[f'{missing_col}_null'] = shuffled[missing_col].isna()
make_dist(df, missing_col, col)
obs_tvd = calc_tvd(df, missing_col, col)n_repetitions = 1000
tvds = []
for _ in range(n_repetitions):# Shuffling genders and assigning back to the DataFrame
shuffled[col] = np.random.permutation(shuffled[col])# Computing and storing TVD
tvd = calc_tvd(shuffled, missing_col, col)
tvds.append(tvd)tvds = np.array(tvds)
pval = np.mean(tvds >= obs_tvd)
# Draw the p-value graph
pd.Series(tvds).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated TVDs')
plt.axvline(x=obs_tvd, color='red', linewidth=4, label='Observed TVD')
plt.legend()
plt.show()
return obs_tvd, pvalobs_tvd, pval = missingness_perm_test(combined, 'owner', 'type')
```


```python
shuffled = combined.copy()
shuffled['owner_null'] = shuffled['owner'].isna()n_repetitions = 1000
tvds = []
for _ in range(n_repetitions):
# Shuffling genders and assigning back to the DataFrame
shuffled['type'] = np.random.permutation(shuffled['type'])
# Computing and storing TVD
pivoted = (
shuffled
.pivot_table(index='type', columns='owner_null', aggfunc='size')
.apply(lambda x: x / x.sum(), axis=0)
)
tvd = pivoted.diff(axis=1).iloc[:, -1].abs().sum() / 2
tvds.append(tvd)
tvds = np.array(tvds)
tvds[:10]
```array([0.00394066, 0.01337585, 0.01542276, 0.00559226, 0.01175366,
0.00862066, 0.00919086, 0.00751793, 0.00661405, 0.01520279])```python
pval = np.mean(tvds >= obs_tvd)pd.Series(tvds).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated TVDs')
plt.axvline(x=obs_tvd, color='red', linewidth=4, label='Observed TVD')
plt.legend();
```

So we conclude that the missingness of `owner` is __MAR__, and it's dependent on `type` column the most.
### Hypothesis Test / Permutation Test
#### Which party trade more often?
* **Null hypothesis**: the distribution of trading frequency among congresspeople from different party is the same. The difference between the two observed sample is due to chance.
* **Alternative hypothesis**: the distribution of trading frequency among congresspeople from different party are different.
```python
combined.head()
```.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
disclosure_year
disclosure_date
transaction_date
owner
ticker
asset_description
type
amount
representative
district
ptr_link
cap_gains_over_200_usd
first_name
last_name
party
0
2021
2021-10-04
2021-09-27
joint
BP
BP plc
purchase
$1,001 - $15,000
Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
virginia
foxx
Republican
1
2021
2021-10-04
2021-09-13
joint
XOM
Exxon Mobil Corporation
purchase
$1,001 - $15,000
Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
virginia
foxx
Republican
2
2021
2021-10-04
2021-09-10
joint
ILPT
Industrial Logistics Properties Trust - Common...
purchase
$15,001 - $50,000
Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
virginia
foxx
Republican
3
2021
2021-10-04
2021-09-28
joint
PM
Phillip Morris International Inc
purchase
$15,001 - $50,000
Virginia Foxx
NC05
https://disclosures-clerk.house.gov/public_dis...
False
virginia
foxx
Republican
4
2021
2021-10-04
2021-09-17
self
BLK
BlackRock Inc
sale_partial
$1,001 - $15,000
Alan Lowenthal
CA47
https://disclosures-clerk.house.gov/public_dis...
False
alan
lowenthal
Democrat
```python
df = combined.assign(transaction_year=combined['transaction_date'].dt.year,
transaction_month=combined['transaction_date'].dt.month)
df = (
df
.groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
.count()
.reset_index()
)
democrat_stats = df.loc[df['party'] == 'Democrat', 'representative'].sum() / (df['party'] == 'Democrat').sum()
republican_stats = df.loc[df['party'] == 'Republican', 'representative'].sum() / (df['party'] == 'Republican').sum()obs_stats = abs(democrat_stats - republican_stats)
shuffled = combined.assign(transaction_year=combined['transaction_date'].dt.year,
transaction_month=combined['transaction_date'].dt.month)n_repetitions = 5000
stats = []
for _ in range(n_repetitions):
# Shuffling genders and assigning back to the DataFrame
shuffled['party'] = np.random.permutation(shuffled['party'])
# Computing and storing TVD
pivoted = (
shuffled
.groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
.count()
.reset_index()
)
democrat_stats = pivoted.loc[pivoted['party'] == 'Democrat', 'representative'].sum() / (pivoted['party'] == 'Democrat').sum()
republican_stats = pivoted.loc[pivoted['party'] == 'Republican', 'representative'].sum() / (pivoted['party'] == 'Republican').sum()
stats.append(abs(democrat_stats - republican_stats))
stats = np.array(stats)
pval = np.mean(stats >= obs_stats)pd.Series(stats).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated Stats')
plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
plt.legend();
```

#### Conclusion
The p-value of the permutation test is `0.8722`, which is way larger the the `0.05`. Thus, we __fail to reject__ the null hypothesis, which means that distribution of trading frequency among congresspeople from different party may be the same.
```python
df = combined.assign(transaction_year=combined['transaction_date'].dt.year,
transaction_month=combined['transaction_date'].dt.month)
df = (
df
.groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
.count()
.reset_index()
)
democrat_stats = df.loc[df['party'] == 'Democrat', 'representative'].sum() / (df['party'] == 'Democrat').sum()
republican_stats = df.loc[df['party'] == 'Republican', 'representative'].sum() / (df['party'] == 'Republican').sum()democrat_stats, republican_stats
```(178.2, 119.65306122448979)
```python
obs_stats = abs(democrat_stats - republican_stats)
obs_stats
```58.5469387755102
```python
shuffled = combined.assign(transaction_year=combined['transaction_date'].dt.year,
transaction_month=combined['transaction_date'].dt.month)n_repetitions = 5000
stats = []
for _ in range(n_repetitions):
# Shuffling genders and assigning back to the DataFrame
shuffled['party'] = np.random.permutation(shuffled['party'])
# Computing and storing TVD
pivoted = (
shuffled
.groupby(['transaction_year', 'transaction_month', 'party'])[['representative']]
.count()
.reset_index()
)
democrat_stats = pivoted.loc[pivoted['party'] == 'Democrat', 'representative'].sum() / (pivoted['party'] == 'Democrat').sum()
republican_stats = pivoted.loc[pivoted['party'] == 'Republican', 'representative'].sum() / (pivoted['party'] == 'Republican').sum()
stats.append(abs(democrat_stats - republican_stats))
stats = np.array(stats)
stats[:10]
```array([59.98214286, 72.09833091, 70.87735849, 65.37517483, 74.30188679,
57.72 , 55.32653061, 68.71225071, 60.90181818, 65.33776224])```python
pval = np.mean(stats >= obs_stats)pd.Series(stats).plot(kind='hist', density=True, ec='w', bins=10, title=f'p-value: {pval}', label='Simulated Stats')
plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
plt.legend();
```

```python
ser1 = (
combined
.groupby('party')['transaction_date'].agg(['min', 'max'])
.diff(axis=1)
.iloc[:, -1]
.apply(lambda x: x.days)
)
ser2 = combined.groupby('party')['representative'].count()ser2 / ser1
```party
Democrat 2.596398
Independent inf
Republican 3.725079
dtype: float64```python
df2 = combined.groupby('party')['representative'].count()
``````python
def calc_test_statistics(df):
counts = df.groupby('party')['representative'].count()
ranges = (
df
.groupby('party')['transaction_date'].agg(['min', 'max'])
.diff(axis=1)
.iloc[:, -1]
.apply(lambda x: x.days)
)
result = counts / ranges
return abs(result['Democrat'] - result['Republican'])calc_test_statistics(combined)
```1.1286810599946193
```python
def permutation_test(df, n_repetitions=1000):
shuffled = df.copy()
obs_stats = calc_test_statistics(shuffled)sim_stats = []
for _ in range(n_repetitions):# Shuffling genders and assigning back to the DataFrame
shuffled['party'] = np.random.permutation(shuffled['party'])# Computing and storing TVD
stats = calc_test_statistics(shuffled)
sim_stats.append(stats)
sim_stats = np.array(sim_stats)
pval = np.mean(sim_stats >= obs_stats)
pd.Series(sim_stats).plot(kind='hist', density=True, ec='w', bins=20, title=f'p-value: {pval}', label='Simulated Stats')
plt.axvline(x=obs_stats, color='red', linewidth=4, label='Observed Stats')
plt.legend()
plt.show()
return pval, sim_statspval, sim_stats = permutation_test(combined)
sim_stats[:10]
```

array([3.49204486, 4.5007299 , 0.74758808, 1.03989987, 4.19301549,
0.85884316, 3.71209501, 0.81663253, 0.58560186, 0.96367451])```python
sim_stats.min(), sim_stats.max()
```(0.4204057417960114, 4.691782988465429)
```python
combined.sort_values('transaction_date').head(10)
```.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
disclosure_year
disclosure_date
transaction_date
owner
ticker
asset_description
type
amount
representative
district
ptr_link
cap_gains_over_200_usd
first_name
last_name
party
10586
2021
2021-08-26
2012-06-19
NaN
BLFSD
BioLife Solutions Inc
purchase
$1,001 - $15,000
Tom Malinowski
NJ07
https://disclosures-clerk.house.gov/public_dis...
False
tom
malinowski
Democrat
11456
2022
2022-03-03
2017-09-05
NaN
SUP
Superior Industries International Inc Common S...
purchase
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11437
2022
2022-03-03
2017-12-06
NaN
CAT
Caterpillar Inc
purchase
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11436
2022
2022-03-03
2018-04-17
NaN
BA
Boeing Company
purchase
$15,001 - $50,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11442
2022
2022-03-03
2018-04-30
NaN
CTRL
Control4 Corporation
purchase
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11453
2022
2022-03-03
2018-05-08
NaN
GE
General Electric Company
sale_full
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11443
2022
2022-03-03
2018-06-27
NaN
CRTL
Control4 Corporation
purchase
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11455
2022
2022-03-03
2018-06-27
NaN
IBM
International Business Machines Corporation
sale_full
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
11457
2022
2022-03-03
2018-06-27
NaN
VZ
Verizon Communications Inc
purchase
$1,001 - $15,000
Thomas Suozzi
NY03
https://disclosures-clerk.house.gov/public_dis...
False
thomas
suozzi
Democrat
4285
2021
2021-09-28
2018-09-08
dependent
NaN
iSelect Fund B St. Louis Inc
purchase
$1,001 - $15,000
Roger Marshall
KS01
https://disclosures-clerk.house.gov/public_dis...
False
roger
marshall
Republican
```python
def to_weekday(x):
day = x.weekday()
if day == 0:
return 'Monday'
elif day == 1:
return 'Tuesday'
elif day == 2:
return 'Wednesday'
elif day == 3:
return 'Thursday'
elif day == 4:
return 'Friday'
elif day == 5:
return 'Saturday'
else:
return 'Sunday'df = combined.assign(weekday=combined['transaction_date'].apply(to_weekday))
df.groupby('weekday').count().rename(columns={'transaction_date': 'count'})[['count']]
```.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
count
weekday
Friday
3079
Monday
3059
Saturday
48
Sunday
27
Thursday
3358
Tuesday
3021
Wednesday
3075
```python
df = combined.assign(month=combined['transaction_date'].dt.month)
df.groupby('month').count().rename(columns={'transaction_date': 'count'})[['count']]
```.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}.dataframe tbody tr th {
vertical-align: top;
}.dataframe thead th {
text-align: right;
}
count
month
1.0
1561
2.0
2110
3.0
2081
4.0
1438
5.0
1006
6.0
1485
7.0
971
8.0
972
9.0
1056
10.0
856
11.0
1131
12.0
1000