{"id":49454849,"url":"https://github.com/FiscalNote/BillSum","last_synced_at":"2026-05-16T17:00:53.934Z","repository":{"id":37231180,"uuid":"163438276","full_name":"FiscalNote/BillSum","owner":"FiscalNote","description":"US Bill Summarization Corpus","archived":false,"fork":false,"pushed_at":"2023-08-14T22:06:54.000Z","size":82013,"stargazers_count":70,"open_issues_count":19,"forks_count":17,"subscribers_count":16,"default_branch":"master","last_synced_at":"2026-04-21T04:46:11.573Z","etag":null,"topics":["dataset","law","natural-language-processing","nlp","summarization"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FiscalNote.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-12-28T18:21:19.000Z","updated_at":"2026-04-02T07:22:08.000Z","dependencies_parsed_at":"2023-01-24T16:45:16.915Z","dependency_job_id":null,"html_url":"https://github.com/FiscalNote/BillSum","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FiscalNote/BillSum","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FiscalNote%2FBillSum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FiscalNote%2FBillSum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FiscalNote%2FBillSum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FiscalNote%2FBillSum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FiscalNote","download_url":"https://codeload.github.com/FiscalNote/BillSum/tar.gz/refs/heads/master",
"sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FiscalNote%2FBillSum/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33111496,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","law","natural-language-processing","nlp","summarization"],"created_at":"2026-04-30T05:00:30.916Z","updated_at":"2026-05-16T17:00:53.901Z","avatar_url":"https://github.com/FiscalNote.png","language":"Python","funding_links":[],"categories":["Machine Learning Datasets \u0026 Corpora"],"sub_categories":["Legal Summarization"],"readme":"# BillSum\n\nCode for the paper: [BillSum: A Corpus for Automatic Summarization of US Legislation](https://arxiv.org/abs/1910.00523) (Kornilova and Eidelman, 2019)\n\nThis paper was be presented at [EMNLP 2019 Workshop on New Frontiers in Summarization](https://summarization2019.github.io/). 
[Link to slides from workshop](https://docs.google.com/presentation/d/1GEMSvUdS7lYo_WevKhSY0NuWzy6tm5IciCj0jq-r7Vc/edit?usp=sharing)\n\n**Accessing the Dataset**: \nThis dataset was updated on 12/3/2019. If you accessed the dataset prior to this date, please redownload it.\n\n[Link to Google Drive](https://drive.google.com/file/d/1SkwK-PfcHzznKUHy2S3jfdITR4D5MD5u/view?usp=sharing) \n\n[TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/billsum) - does not contain the \"clean\" versions of the texts\n\n\nIf you do something cool with the data, share it on our [Kaggle Page](https://www.kaggle.com/akornilo/billsum)!\n\nInformation on how the dataset was collected is in [BillSum_Data_Documentation.md](BillSum_Data_Documentation.md)\n\n\n\n**Data Structure**\nThe data is stored in jsonlines format, with one bill per line.\n\n- text: bill text\n\n- clean_text: a preprocessed version of the text that was used to train the models in the paper\n\n- summary: (human-written) bill summary \n\n- title: bill title (can be used for generating a summary)\n\n- bill_id: an identifier for the bill - in US data it is SESSION_BILL-ID, in CA data it is BILL-ID \n\n\n\n# Set-up\n\n1. Install the Python dependencies (if using conda, use env.lst; if using pip, use requirements.txt)\n2. Set the environment variable `BILLSUM_PREFIX` to the base directory for all the data (download it from the link above).\n3. Set `PYTHONPATH=.` to run code from this directory.\n4. Install packages from `environment.lst` (we used conda, but you should be able to use pip)\n\n---\n\n# Experiments\n\nThe results for the intermediate steps (explained below) can be found [here](https://drive.google.com/file/d/1uBCRSs_KFv7jD6nM4MKXZZ4nZAPI2Go4/view?usp=sharing)\n\nFor all the experiments described in the paper, the texts were first cleaned using the script `billsum/data_prep/clean_text.py`. Results will be saved into the `BILLSUM_PREFIX/clean_final` directory.\n\n## Sumy baselines\n\n1. 
Clone [sumy](git@github.com:akornilo/sumy.git) and check out the branch `ak_fork` (this is a minor modification of the original sumy library that allows it to work with my sentence selection logic).\n2. In that directory run `pip install -e .`\n3. From this directory, run `bill_sum/sumy_baselines.py`\n\n## Supervised Experiments\n\n### Preparing the data\n\n1. Run `billsum/data_prep/clean_text.py` to clean up the whitespace formatting in the dataset. It outputs new jsonlines files containing the 'clean_text' field plus the original fields to `BILLSUM_PREFIX/clean_data`\n\n2. Run `billsum/data_prep/label_sentences.py` to create the labeled dataset.\n\nThis script takes each document, splits it into sentences, processes them with spaCy to get useful syntactic features, and calculates each sentence's ROUGE score relative to the summary.\n\nThe output for each dataset part is a pickle file containing a dict of (bill_id, sentence data) pairs, stored under the `PREFIX/sent_data/` directory.\n\n```\nBill_id --\u003e [\n\t('The monthly limitation for each coverage month during the taxable year is an amount equal to the lesser of 50 percent of the amount paid for qualified health insurance for such month, or an amount equal to 112 of in the case of self-only coverage, $1,320, and in the case of family coverage, $3,480. ',\n\t  [('The ', 186, 'the', '', 'O', 'DET', 'det', 188),\n\t   ('monthly ', 187, 'monthly', 'DATE', 'B', 'ADJ', 'amod', 188),\n\t   ('limitation ', 188, 'limitation', '', 'O', 'NOUN', 'nsubj', 197),\n\t   ...],\n\t  {'rouge-1': {'f': 0.2545454500809918,\n\t    'p': 0.3783783783783784,\n\t    'r': 0.1917808219178082},\n\t   'rouge-2': {'f': 0.09459459021183367, 'p': 0.14583333333333334, 'r': 0.07},\n\t   'rouge-l': {'f': 0.16757568176139123,\n\t    'p': 0.2972972972972973,\n\t    'r': 0.1506849315068493}}),\n\t    ...]\n```\n\n## Running Bert Models\n\n0. Clone https://github.com/google-research/bert. 
Replace the `run_classifier.py` file with `billsum/bert_helpers/run_classifier.py` (adds custom code to read data in and out of files). Install dependencies as described in this repo.\n\n1. Create train.tsv / test.tsv files with `billsum/bert_helpers/prep_bert.py`. These will be stored under `PREFIX/bert_data` (set `$BERT_DATA_DIR` to point here)\n\n2. Download the [Bert-Large, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip) model. \n\n3. Set the `$BERT_BASE_DIR` environment variable to point to the directory where you downloaded the model\n\n3. Pretrain the Bert Model (run from the cloned bert repo)\n\n```\npython create_pretraining_data.py \\\n  --input_file=$BERT_DATA_DIR/all_texts_us_train.txt \\\n  --output_file=$BERT_DATA_DIR/all_texts_us_train.tfrecord \\\n  --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n  --do_lower_case=True \\\n  --max_seq_length=128 \\\n  --max_predictions_per_seq=20 \\\n  --masked_lm_prob=0.15 \\\n  --random_seed=12345 \\\n  --dupe_factor=5\n```\n\nSet `$BERT_MODEL_DIR` to the directory where you want to store your pretrained model.\n\n```\npython run_pretraining.py \\\n  --input_file=$BERT_DATA_DIR/all_texts_us_train.tfrecord \\\n  --output_dir=$BERT_MODEL_DIR \\\n  --do_train=True \\\n  --do_eval=True \\\n  --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \\\n  --train_batch_size=32 \\\n  --max_seq_length=128 \\\n  --max_predictions_per_seq=20 \\\n  --num_train_steps=20000 \\\n  --num_warmup_steps=10 \\\n  --learning_rate=2e-5\n```\n\n\nThis will take a while to run. \n\n4. 
To train the classifier model, run the following from the bert repo:\n\n```\npython run_classifier.py \\\n  --task_name=simple \\\n  --do_train=true \\\n  --do_predict=true \\\n  --do_predict_ca=true \\\n  --data_dir=$BERT_DATA_DIR \\\n  --vocab_file=$BERT_BASE_DIR/vocab.txt \\\n  --bert_config_file=$BERT_BASE_DIR/bert_config.json \\\n  --init_checkpoint=$BERT_MODEL_DIR/model.ckpt-40000 \\\n  --max_seq_length=128 \\\n  --train_batch_size=32 \\\n  --num_train_epochs=3.0 \\\n  --output_dir=$BERT_CLASSIFIER_DIR\n```\n\nSet `BERT_CLASSIFIER_DIR` to the directory where you want to store the classifier; it should be different from the pretraining directory. This script will create a model in `BERT_CLASSIFIER_DIR` and store the sentence predictions in the same directory.\n\nFor clarity:\n- BERT_BASE_DIR: directory of the original downloaded model (same as for step 3)\n- BERT_MODEL_DIR: directory where the output of the pretraining was stored\n- BERT_DATA_DIR: directory with all train/test examples\n- BERT_CLASSIFIER_DIR: directory where the new model should be stored\n\n\nAfter this procedure is run, two files will be generated in `BERT_CLASSIFIER_DIR`: `test_results.tsv` and `ca_test_results.tsv`. These contain sentence-level predictions for each test sentence. Rename the `test_results.tsv` file to `us_test_results.tsv`, then copy both files over to the `bert_data` folder.\n\n\n5. Evaluate results using `bill_sum/bert_helpers/evaluate_bert.py`. Change the prefix variable to point to `BERT_CLASSIFIER_DIR` from above.\n\nResults will be stored under `BILLSUM_PREFIX/score_data/`\n\n\n## Running feature classifier + ensemble\n\nRun `bill_sum/train_wrapper.py`. 
Results will be stored under `BILLSUM_PREFIX/score_data/`\n\nTo compute the results for the ensemble method, run `billsum/evaluate_ensemble.py` \n\n## Final Result aggregation\n\nThe `PrintFinalScores.ipynb` notebook computes the summary statistics for each method and generates the Oracle scores.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFiscalNote%2FBillSum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFiscalNote%2FBillSum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFiscalNote%2FBillSum/lists"}