{"id":20000222,"url":"https://github.com/andstor/verified-smart-contracts","last_synced_at":"2025-05-04T15:32:16.548Z","repository":{"id":45702352,"uuid":"479578772","full_name":"andstor/verified-smart-contracts","owner":"andstor","description":":page_facing_up: Verified Ethereum Smart Contract dataset","archived":false,"fork":false,"pushed_at":"2023-11-09T22:59:33.000Z","size":43,"stargazers_count":29,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-08T07:42:56.379Z","etag":null,"topics":["dataset","ethereum","etherscan","huggingface","language-modeling","smart-contracts","text-generation"],"latest_commit_sha":null,"homepage":"https://huggingface.co/datasets/andstor/smart_contracts","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andstor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-09T00:45:35.000Z","updated_at":"2025-03-16T01:14:05.000Z","dependencies_parsed_at":"2022-08-28T19:30:45.752Z","dependency_job_id":null,"html_url":"https://github.com/andstor/verified-smart-contracts","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andstor%2Fverified-smart-contracts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andstor%2Fverified-smart-contracts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andstor%2Fverified-smart-contracts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andstor%2Fverified-smart-contracts/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andstor","download_url":"https://codeload.github.com/andstor/verified-smart-contracts/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252356083,"owners_count":21734876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","ethereum","etherscan","huggingface","language-modeling","smart-contracts","text-generation"],"created_at":"2024-11-13T05:14:08.878Z","updated_at":"2025-05-04T15:32:16.238Z","avatar_url":"https://github.com/andstor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# verified-smart-contracts\n\n\u003e :page_facing_up: Verified Ethereum Smart Contract dataset\n\nVerified Smart Contracts is a dataset of real Ethereum Smart Contracts, containing both Solidity and Vyper source code. 
It consists of every Ethereum Smart Contract deployed as of :black_joker: 1st of April 2022 that has been verified on Etherscan and has at least one transaction.\nThe dataset is available at 🤗 [Hugging Face](https://huggingface.co/datasets/andstor/smart_contracts).\n\n## Metrics\n\n| Component | Size | Num rows | LoC[^1] |\n| --------- |:----:| -------:| -------:|\n| [Raw](https://huggingface.co/datasets/andstor/smart_contracts/tree/main/data/raw)| 8.80 GiB | 2217692 | 839665295 |\n| [Flattened](https://huggingface.co/datasets/andstor/smart_contracts/tree/main/data/flattened) | 1.16 GiB | 136969 | 97529473 |\n| [Inflated](https://huggingface.co/datasets/andstor/smart_contracts/tree/main/data/inflated) | 0.76 GiB | 186397 | 53843305 |\n| [Parsed](https://huggingface.co/datasets/andstor/smart_contracts/tree/main/data/parsed) | 4.44 GiB | 4434014 | 29965185 |\n\n[^1]: LoC refers to the lines of **source_code**. The *Parsed* dataset counts lines of **func_code** + **func_documentation**.\n\n## Description\n\n### Raw\nThe raw dataset contains mostly the raw data from Etherscan, downloaded with the [smart-contract-downloader](https://github.com/andstor/smart-contract-downloader) tool. It normalizes all different contract formats (JSON, multi-file, etc.) to a flattened source code structure.\n\n```script\npython script/2parquet.py -s data -o parquet\n```\n\n### Flattened\nThe flattened dataset contains smart contracts, where every contract contains all required library code. Each \"file\" is marked in the source code with a comment stating the original file path: `//File: path/to/file.sol`. These are then filtered for uniqueness with a similarity threshold of 0.9. The low uniqueness requirement is due to the often large amount of embedded library code. 
If a more unique dataset is required, see the [inflated](#inflated) dataset instead.\n\n```script\npython script/filter_data.py -s parquet -o data/flattened --threshold 0.9\n```\n\n### Inflated\nThe inflated dataset splits every contract into its respective files. These are then filtered for uniqueness with a similarity threshold of 0.9.\n\n```script\npython script/filter_data.py -s parquet -o data/inflated --split-files --threshold 0.9\n```\n\n### Parsed\nThe parsed dataset contains a parsed extract of Solidity code from the [*Inflated*](#inflated) dataset. It consists of contract classes (contract definition) and functions (function definition), as well as accompanying documentation (code comments). The code is parsed with the [solidity-universal-parser](https://github.com/andstor/solidity-universal-parser.git).\n\n```script\npython script/parse_data.py -s data/inflated -o data/parsed\n```\n\n### Plain Text\nA subset of the datasets above can be created by using the `2plain_text.py` script. This will produce a plain text dataset with the columns `text` (source code) and `language`.\n\n```script\npython script/2plain_text.py -s data/inflated -o data/inflated_plain_text\n```\nThis will produce a plain text version of the inflated dataset, and save it to `data/inflated_plain_text`.\n\n## Filtering\nA large share of the Smart Contracts are, or contain, duplicated code. This is mostly due to frequent use of library code. Etherscan embeds the library code used in a contract in the source code. To mitigate this, some filtering is applied in order to produce a dataset with mostly unique contract source code. This filtering is done by calculating the string distance between the source code of different contracts. 
Due to the large number of contracts (~2 million), the comparison is only done in groups by `contract_name` for the flattened dataset, and by `file_name` for the inflated dataset.\n\nThe string comparison algorithm used is the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index).\n\n## Data format\nThe data is stored as parquet files, most containing 30,000 records.\n\n## License\n\nCopyright © [André Storhaug](https://github.com/andstor)\n\nThis repository is licensed under the [MIT License](https://github.com/andstor/verified-smart-contracts/blob/main/LICENSE).\n\nAll contracts in the dataset are publicly available, obtained by using [Etherscan APIs](https://etherscan.io/apis), and subject to their own original licenses.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandstor%2Fverified-smart-contracts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandstor%2Fverified-smart-contracts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandstor%2Fverified-smart-contracts/lists"}