{"id":15432538,"url":"https://github.com/asvyatkovskiy/scabillmatch","last_synced_at":"2025-08-02T04:35:54.155Z","repository":{"id":75542618,"uuid":"54519744","full_name":"ASvyatkovskiy/ScaBillMatch","owner":"ASvyatkovskiy","description":"Policy diffusion in the US legislature","archived":false,"fork":false,"pushed_at":"2019-01-12T18:28:54.000Z","size":17927,"stargazers_count":9,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-19T19:43:50.569Z","etag":null,"topics":["data-frame","graph","policy-diffusion","spark","tf-idf"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ASvyatkovskiy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2016-03-23T01:00:07.000Z","updated_at":"2023-12-05T12:06:29.000Z","dependencies_parsed_at":"2023-06-06T20:45:16.329Z","dependency_job_id":null,"html_url":"https://github.com/ASvyatkovskiy/ScaBillMatch","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/ASvyatkovskiy/ScaBillMatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ASvyatkovskiy%2FScaBillMatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ASvyatkovskiy%2FScaBillMatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ASvyatkovskiy%2FScaBillMatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ASvyatkovskiy%2FScaBillMat
ch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ASvyatkovskiy","download_url":"https://codeload.github.com/ASvyatkovskiy/ScaBillMatch/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ASvyatkovskiy%2FScaBillMatch/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268334642,"owners_count":24233795,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-02T02:00:12.353Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-frame","graph","policy-diffusion","spark","tf-idf"],"created_at":"2024-10-01T18:27:07.841Z","updated_at":"2025-08-02T04:35:54.143Z","avatar_url":"https://github.com/ASvyatkovskiy.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScaBillMatch [![Build Status](https://travis-ci.org/ASvyatkovskiy/ScaBillMatch.svg?branch=master)](https://travis-ci.org/ASvyatkovskiy/ScaBillMatch.svg?branch=master)\n\n[![DOI](https://zenodo.org/badge/54519744.svg)](https://zenodo.org/badge/latestdoi/54519744)\n\nPolicy diffusion occurs when government decisions in a given jurisdiction are systematically influenced by prior policy choices made in other jurisdictions [Gilardi]. 
While policy diffusion can manifest in a variety of forms, we focus on a\ntype of policy diffusion that can be detected by examining the similarity of legislative bill texts. We aim to identify groups of legislative bills from different states falling into the same diffusion topic, to perform an all-pairs comparison between the bills within each topic, and to identify paths connecting specific legislative proposals on a graph.\n\n\n## Data ingestion\n\nDuring the ingestion step, the raw unstructured data are converted into JSON and, subsequently, into Apache Avro format with the following schema:\n\n```json\n{\"namespace\" : \"bills.avro\" ,\n   \"type\": \"record\",\n   \"name\": \"Bills\",\n   \"fields\": [\n      {\"name\": \"primary_key\" , \"type\": \"string\"},\n      {\"name\": \"content\" , \"type\" : \"string\"},\n      {\"name\": \"year\" , \"type\" : \"int\"},\n      {\"name\": \"state\" , \"type\" : \"int\"},\n      {\"name\": \"docversion\" , \"type\" : \"string\"}\n      ]\n}\n```\n\nwhere the `primary_key` field is a unique identifier of the elements in the dataset constructed from year, state and\ndocument version. The `year`, `state` and `docversion` fields are used to construct predicates and filter the data before the all-pairs\nsimilarity join calculation. The `content` field stores the entire legislative proposal as a Unicode string. It is only used for the feature extraction step, and is not read into memory during the candidate selection and filtering steps, thanks to the Avro schema evolution property. \n\nThe Avro schema is stored in a file along with the data. 
Thus, if the program reading the data expects a different schema, this can be resolved by setting `avro.input.schema.key` in the Spark application, since both the writer and reader Avro schemas are present.\n\nThe data ingestion steps differ depending on the dataset structure/type.\n\n#FIXME code snippet for raw to JSON conversion\n#FIXME code snippet for JSON to Avro conversion\n\n## Pre-processing and feature extraction\n\nThe feature extraction step consists of a sequence of `Spark ML` transformers intended to produce numerical feature vectors\nas a dataframe column. The resulting dataframe is fed to the Spark ML k-means estimator, is later used to calculate the all-pairs join, and is subsequently used during the graph analysis step with `GraphFrames`.\n\n### Types of features\n\n 1. Bag-of-words and N-grams\n 1. Term frequency and inverse document frequency (TF-IDF)\n 1. MinHash features\n\nDifferent types of text features have been found to perform better with different similarity measures. For instance, TF-IDF (small-granularity N-gram) with truncated SVD is best suited for cosine similarity calculations. Jaccard similarity performs best with unweighted features (i.e. MinHash or TF), and a larger N-gram granularity is preferred for the latter.\n\n### Dimensionality reduction\n\nSingular value decomposition (SVD) is applied to the TF-IDF document-feature matrix to extract concepts which are most relevant for classification.\n\n## Candidate selection and clustering  \n\nFocusing on the document vectors which are likely to be highly similar is essential for all-pairs comparison at scale.\nModern studies employ variations of nearest-neighbor search, locality sensitive hashing, as well as sampling techniques to select a subset of rows of the TF-IDF matrix based on the sparsity [DIMSUM]. 
\nOur approach currently utilizes k-means clustering to identify groups of documents which are likely to belong to the same diffusion topic, reducing the number of comparisons in the all-pairs similarity join calculation. In addition, `LSH` and `BucketedRandomProjectionLSH` are being added based on the `Spark ML` implementation.\n\n#FIXME copy paste the submission command for this step\n#FIXME describe the configuration file parameters to show how to configure options described above\n\n\n## Document similarity calculation\n\nWe consider Jaccard, cosine, Manhattan and Hamming distances. We convert those to similarities assuming inverse proportionality and re-scale all similarities to a common range. An extra additive term in the denominator serves as a regularization\nparameter for the case of identical vectors.\n\n#FIXME describe the configuration file parameters to show how to configure options described above\n\n## Exploratory analysis: histogramming and plotting\n\nHistogrammar [http://histogrammar.org/docs/] is a suite of data aggregation primitives for making histograms, calculating descriptive statistics and plotting. A few composable functions can generate many different types of plots, and these functions are reimplemented in multiple languages and serialized to JSON for cross-platform compatibility. Histogrammar allows data to be aggregated using cross-platform, functional primitives, summarizing a large dataset with discretized distributions, using lambda functions and composition rather than a restrictive set of histogram types.\n\nTo use Histogrammar in the Spark shell, you don’t have to download anything. Just start Spark with\n\n```bash\nspark-shell --packages \"org.diana-hep:histogrammar_2.11:1.0.4\"\n```\nand call\n\n```scala\nimport org.dianahep.histogrammar._\n```\non the Spark prompt. 
For plotting with Bokeh, include `org.diana-hep:histogrammar-bokeh_2.11:1.0.4`, and for interaction with Spark-SQL, include `org.diana-hep:histogrammar-sparksql_2.11:1.0.4`.\n\n### Example of statistical analysis with Spark and Histogrammar\n\nGiven the cosine and Jaccard outputs keyed on the key-key pair, join them and convert the result to a dataframe:\n\n```scala\n// join on the key-key pair and keep only the (cosine, jaccard) values\nval data = cosineRDD.join(jaccardRDD).values.toDF(\"cosine\",\"jaccard\")\ndata.write.parquet(\"/user/alexeys/correlations_3state\")\n```\n\nLaunch a spark-shell session with Histogrammar pre-loaded:\n\n```bash\nspark-shell --master yarn --queue production --num-executors 20 --executor-cores 3 --executor-memory 10g --packages \"org.diana-hep:histogrammar-bokeh_2.10:1.0.3\" --jars target/scala-2.11/BillAnalysis-assembly-2.0.jar \n```\n\nGet basic descriptive statistics:\n\n```scala\nscala\u003e data.describe().show()\n+-------+--------------------+------------------+\n|summary|              cosine|           jaccard|\n+-------+--------------------+------------------+\n|  count|          2632811191|        2632811191|\n|   mean|   2.648025009784054|11.899197957421478|\n| stddev|  3.2252594746900303|3.5343401388251032|\n|    min|2.389704494502045...|1.4545454545454546|\n|    max|   98.63368807585896| 81.14406779661016|\n+-------+--------------------+------------------+\n```\n\nGet correlation coefficients and distributions in 10 bins:\n\n```scala\nscala\u003e val cosine_rdd = data.select(\"cosine\").rdd.map(x=\u003ex.getDouble(0))\ncosine_rdd: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[4] at map at \u003cconsole\u003e:27\n\nscala\u003e val jaccard_rdd = data.select(\"jaccard\").rdd.map(x=\u003ex.getDouble(0))\njaccard_rdd: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[7] at map at \u003cconsole\u003e:27\n\nscala\u003e import org.dianahep.histogrammar._\nimport org.dianahep.histogrammar._\n\nscala\u003e import org.dianahep.histogrammar.ascii._\nimport org.dianahep.histogrammar.ascii._\n\nscala\u003e val histo = 
Histogram(10,0,100,{x: Double =\u003e x})\nhisto: org.dianahep.histogrammar.Selecting[Double,org.dianahep.histogrammar.Binning[Double,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting]] = \u003cSelecting cut=Bin\u003e\n\nscala\u003e val jaccard_histo = jaccard_rdd.aggregate(histo)(new Increment, new Combine)\njaccard_histo: org.dianahep.histogrammar.Selecting[Double,org.dianahep.histogrammar.Binning[Double,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting,org.dianahep.histogrammar.Counting]] = \u003cSelecting cut=Bin\u003e\n\nscala\u003e jaccard_histo.println\n                       +----------------------------------------------------------+\nunderflow     0        |                                                          |\n[  0 ,  10 )  7.943E+8 |***********************                                   |\n[  10,  20 )  1.792E+9 |*****************************************************     |\n[  20,  30 )  4.668E+7 |*                                                         |\n[  30,  40 )  2.269E+5 |                                                          |\n[  40,  50 )  5661     |                                                          |\n[  50,  60 )  572      |                                                          |\n[  60,  70 )  125      |                                                          |\n[  70,  80 )  57       |                                                          |\n[  80,  90 )  4        |                                                          |\n[  90,  100)  0        |                                                          |\noverflow      0        |                                                          |\nnanflow       0        |                                                          |\n                       +----------------------------------------------------------+\n```          
            \n\nFor more details on how to use Histogrammar, refer to the website: http://histogrammar.org/docs/\n\n## Reformulating the problem as a network (graph) problem\n\nSome policy diffusion questions are easier to answer if the problem is formulated as a graph analysis problem. The dataframe output of the document similarity step is mapped onto a weighted undirected graph, with each unique legislative proposal as a node and each pair of documents with similarity above a certain threshold connected by an edge whose weight attribute equals the similarity. \n\nThe PageRank and Dijkstra minimum cost path algorithms are applied to detect events of policy diffusion and the most influential states. A GraphFrame is constructed from two dataframes (a dataframe of nodes and a dataframe of edges), which makes it easy to integrate the graph processing step into the pipeline along with Spark ML. There is no need to move the results of previous steps manually and feed them to the graph processing module from an intermediate sink, as with isolated graph analysis systems.\n\n#FIXME describe the configuration file parameters to show how to configure options described above\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasvyatkovskiy%2Fscabillmatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasvyatkovskiy%2Fscabillmatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasvyatkovskiy%2Fscabillmatch/lists"}