{"id":30288579,"url":"https://github.com/da03/wildchat","last_synced_at":"2025-08-16T22:37:58.913Z","repository":{"id":307764578,"uuid":"990290045","full_name":"da03/wildchat","owner":"da03","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-15T16:21:55.000Z","size":1402,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-15T18:35:12.380Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://wildvisualizer.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/da03.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-25T22:00:52.000Z","updated_at":"2025-08-15T16:21:58.000Z","dependencies_parsed_at":"2025-08-02T05:12:36.829Z","dependency_job_id":null,"html_url":"https://github.com/da03/wildchat","commit_stats":null,"previous_names":["da03/wildchat"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/da03/wildchat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fwildchat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fwildchat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fwildchat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fwildchat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/da03","download_url":"https://codeload.github.com/da03/wildchat/tar.gz/refs/heads/main","sbom_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/da03%2Fwildchat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270781214,"owners_count":24643808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-16T22:37:57.571Z","updated_at":"2025-08-16T22:37:58.903Z","avatar_url":"https://github.com/da03.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Data Preparation\n\nThe below code presumes access to the raw MySQL database. 
It runs the following functions in sequence:\n\n* `load_mysql`: dumps the MySQL database into a list of files.\n* `generate_hash`: hashes conversation content; the hashes are used later to link turns belonging to the same conversation.\n* `link_conversations`: links conversation turns into conversations, based on conversation content, timestamps, device fingerprints, and IPs.\n* `add_languages`: detects the language of each turn.\n* `push_dataset`: pushes a raw version of the dataset for internal use.\n* `remove_wildbench`: removes conversations reserved for WildBench.\n* `add_moderation`: adds OpenAI Moderation results.\n* `add_detoxify`: adds Detoxify results.\n* `hash_ips`: hashes IP addresses with a salt to make them non-reversible.\n\n```\npython process_database.py\n```\n\n## PII Removal\n\nThe PII removal code assumes access to a SLURM-managed GPU cluster and uses distributed computing to process data on multiple GPUs.\n\nFirst, run the analyzer on every chunk (in practice, run this on multiple GPUs in parallel, since spaCy's NER is slow):\n\n```\npython run_presidio_ner.py --save_name data/aug1_2025 --chunk_idx 0\npython run_presidio_ner.py --save_name data/aug1_2025 --chunk_idx 1\npython run_presidio_ner.py --save_name data/aug1_2025 --chunk_idx 2\n...\npython run_presidio_ner.py --save_name data/aug1_2025 --chunk_idx [N]\n```\n\nWe provide an example script for launching multiple jobs with SLURM (it is unlikely to work without adapting it to your own SLURM environment!):\n\n```\nsbatch run_presidio_ner.slurm\n```\n\nNext, count the number of occurrences of each entity. These statistics are used later to determine common entities (such as celebrities) that will not be removed.\n\n```\npython count_entity_freqs.py --save_name data/aug1_2025\n```\n\nNow, redact PII. The current PII removal code is developed iteratively: we check the identified named entities and add or remove rules to identify or de-identify PII. 
Similarly, we determine the thresholds for common entities iteratively.\n\n```\npython redact_PII.py --save_name data/aug1_2025\n```\n\nNext, use TruffleHog to remove API keys.\n\n```\npython redact_trufflehog.py --save_name data/aug1_2025\n```\n\nFinally, release the data.\n\n```\npython release.py --save_name data/aug1_2025\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fda03%2Fwildchat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fda03%2Fwildchat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fda03%2Fwildchat/lists"}