{"id":38261919,"url":"https://github.com/mitmedialab/sherlock-project","last_synced_at":"2026-01-17T01:37:49.219Z","repository":{"id":39409863,"uuid":"200261817","full_name":"mitmedialab/sherlock-project","owner":"mitmedialab","description":"This repository provides data and scripts to use Sherlock, a DL-based model for semantic data type detection: https://sherlock.media.mit.edu.","archived":false,"fork":false,"pushed_at":"2024-05-06T15:11:37.000Z","size":76619,"stargazers_count":140,"open_issues_count":11,"forks_count":67,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-05-06T16:40:19.147Z","etag":null,"topics":["deep-learning","semantic-table-interpretation","semantic-type-detection","tables"],"latest_commit_sha":null,"homepage":"https://sherlock.media.mit.edu","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mitmedialab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-02T15:55:23.000Z","updated_at":"2024-04-29T12:40:40.000Z","dependencies_parsed_at":"2024-05-03T21:27:13.911Z","dependency_job_id":null,"html_url":"https://github.com/mitmedialab/sherlock-project","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/mitmedialab/sherlock-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitmedialab%2Fsherlock-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitmedialab%2Fsherlock-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/mitmedialab%2Fsherlock-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitmedialab%2Fsherlock-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mitmedialab","download_url":"https://codeload.github.com/mitmedialab/sherlock-project/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mitmedialab%2Fsherlock-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28491630,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T00:50:05.742Z","status":"ssl_error","status_checked_at":"2026-01-17T00:43:11.982Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","semantic-table-interpretation","semantic-type-detection","tables"],"created_at":"2026-01-17T01:37:48.641Z","updated_at":"2026-01-17T01:37:49.212Z","avatar_url":"https://github.com/mitmedialab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sherlock: code, data, and trained model.\n\nSherlock is a deep-learning approach to semantic data type detection, i.e. labeling tables with column types such as `name`, `address`, etc. This is helpful for, among others, data validation, processing and integration. 
This repository provides data and code to guide usage of Sherlock, retraining the model, and replication of results. Visit https://sherlock.media.mit.edu for more background on this project.\n\n## Installation of package\n1. Install Sherlock by cloning this repository and running `pip install .`.\n2. Install dependencies using `pip install -r requirements.txt` (or `requirements38.txt` depending on your Python version).\n\n## Demonstration of usage\nThe `00-use-sherlock-out-of-the-box.ipynb` notebook demonstrates usage of the pretrained model for a given table.\n\nThe notebooks `01-data-preprocessing.ipynb` and `02-1-train-and-test-sherlock.ipynb` in `notebooks/` can be used to reproduce the results and demonstrate the usage of Sherlock (from data preprocessing to model training and evaluation).\n\n## Data\nThe raw data (corresponding to annotated table columns) can be downloaded using the `download_data()` function in the `helpers` module.\nThis will download roughly 500MB of data into the `data` directory. Use the `01-data-preprocessing.ipynb` notebook to preprocess this data. Each column is then represented by a feature vector of dimensions 1x1588. The extracted features per column are based on \"paragraph\" embeddings (full column), word embeddings (aggregated from each column cell), character count statistics (e.g. average number of \".\" in a column's cells) and column-level statistics (e.g. column entropy).\n\n## The Sherlock model\nThe `SherlockModel` class is specified in the `sherlock.deploy.model` module. This model is a multi-input neural network with a separate subnetwork for each feature set (e.g. the word embedding features); the subnetwork outputs are concatenated and passed through a few shared layers. Interaction with the model follows the scikit-learn interface, with methods `fit`, `predict` and `predict_proba`.\n\n## Making predictions\nThe originally trained `SherlockModel` can be used for generating predictions for a dataset. 
First, extract features using the `features.preprocessing` module. The original weights of Sherlock are provided in the repository in the `model_files` directory and can be loaded using the `initialize_model_from_json` method of the model. The procedure for making predictions (on the data) is demonstrated in the `02-1-train-and-test-sherlock.ipynb` notebook.\n\n\n## Retraining Sherlock\nThe notebook `02-1-train-and-test-sherlock.ipynb` also illustrates how Sherlock can be retrained. The model infers the number of unique classes from the training labels, unless you load a model from a JSON file, in which case the number of classes is 78.\n\n\n## Citing this work\n\nTo cite this work, please use the BibTeX entry below:\n\n```\n@inproceedings{Hulsebos:2019:SDL:3292500.3330993,\n author = {Hulsebos, Madelon and Hu, Kevin and Bakker, Michiel and Zgraggen, Emanuel and Satyanarayan, Arvind and Kraska, Tim and Demiralp, {\\c{C}}a{\\u{g}}atay and Hidalgo, C{\\'e}sar},\n title = {Sherlock: A Deep Learning Approach to Semantic Data Type Detection},\n booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \\\u0026 Data Mining},\n year={2019},\n publisher = {ACM},\n}\n```\n\n## Project structure\n    ├── data   \u003c- Placeholder directory to download data into.\n\n    ├── docs   \u003c- Files for https://sherlock.media.mit.edu landing page.\n\n    ├── model_files  \u003c- Files with trained model weights and specification.\n        ├── sherlock_model.json\n        └── sherlock_weights.h5\n\n    ├── notebooks   \u003c- Notebooks demonstrating data preprocessing and train/test of Sherlock.\n        └── 00-use-sherlock-out-of-the-box.ipynb\n        └── 01-data-preprocessing.ipynb\n        └── 02-1-train-and-test-sherlock.ipynb\n        └── 02-2-train-and-test-sherlock-rf-ensemble.ipynb\n        └── 03-train-paragraph-vector-features-optional.ipynb\n\n    ├── sherlock  \u003c- Package.\n        ├── deploy  \u003c- Code for 
(re)training Sherlock, as well as model specification.\n            └── helpers.py\n            └── model.py\n        ├── features     \u003c- Files to turn raw data, storing raw data columns, into features.\n            ├── feature_column_identifiers   \u003c- Directory with feature names categorized by feature set.\n            └── bag_of_characters.py\n            └── bag_of_words.py\n            └── par_vec_trained_400.pkl\n            └── paragraph_vectors.py\n            └── preprocessing.py\n            └── word_embeddings.py\n        ├── helpers.py     \u003c- Supportive modules.\n\n---------\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitmedialab%2Fsherlock-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmitmedialab%2Fsherlock-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmitmedialab%2Fsherlock-project/lists"}