{"id":15038087,"url":"https://github.com/intellabs/control-flag","last_synced_at":"2025-04-08T16:00:23.649Z","repository":{"id":40417461,"uuid":"392565553","full_name":"IntelLabs/control-flag","owner":"IntelLabs","description":"A system to flag anomalous source code expressions by learning typical expressions from training data","archived":false,"fork":false,"pushed_at":"2024-05-30T22:37:23.000Z","size":2119,"stargazers_count":1242,"open_issues_count":19,"forks_count":114,"subscribers_count":45,"default_branch":"master","last_synced_at":"2025-04-01T14:01:53.086Z","etag":null,"topics":["algorithms","machine-learning","machine-programming"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IntelLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-04T05:48:07.000Z","updated_at":"2025-02-19T16:31:37.000Z","dependencies_parsed_at":"2023-01-19T11:18:39.381Z","dependency_job_id":"5f48a19c-590f-40fe-aa35-871f26f2780f","html_url":"https://github.com/IntelLabs/control-flag","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IntelLabs%2Fcontrol-flag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IntelLabs%2Fcontrol-flag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IntelLabs%2Fcontrol-flag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IntelLabs%2Fcontrol-flag/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IntelLabs","download_url":"https://codeload.github.com/IntelLabs/control-flag/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247878014,"owners_count":21011158,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","machine-learning","machine-programming"],"created_at":"2024-09-24T20:37:02.236Z","updated_at":"2025-04-08T16:00:23.625Z","avatar_url":"https://github.com/IntelLabs.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"**A friendly request: Thanks for visiting control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to niranjan.hasabnis@intel.com). 
And, of course, we love testimonials!**\n\n-- The ControlFlag Team \n\n[![linux_build_and_test](https://github.com/IntelLabs/control-flag/actions/workflows/linux_controlflag_cmake.yml/badge.svg)](https://github.com/IntelLabs/control-flag/actions/workflows/linux_controlflag_cmake.yml)\n[![linux_style_check](https://github.com/IntelLabs/control-flag/actions/workflows/linux_controlflag_cpplint.yml/badge.svg)](https://github.com/IntelLabs/control-flag/actions/workflows/linux_controlflag_cpplint.yml)\n[![macos_build_and_test](https://github.com/IntelLabs/control-flag/actions/workflows/macos_controlflag_cmake.yml/badge.svg)](https://github.com/IntelLabs/control-flag/actions/workflows/macos_controlflag_cmake.yml)\n[![macos_style_check](https://github.com/IntelLabs/control-flag/actions/workflows/macos_controlflag_cpplint.yml/badge.svg)](https://github.com/IntelLabs/control-flag/actions/workflows/macos_controlflag_cpplint.yml)\n[![GitHub license](https://img.shields.io/github/license/IntelLabs/control-flag)](https://github.com/IntelLabs/control-flag/blob/master/LICENSE)\n\n# ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures\n\nControlFlag is a self-supervised idiosyncratic pattern detection system that\nlearns typical patterns that occur in the control structures of high-level\nprogramming languages, such as C/C++, by mining these patterns from open-source\nrepositories (on GitHub and other version control systems). It then applies\nthe learned patterns to detect anomalous patterns in the user's code.\n\n## Brief technical description\n\nControlFlag's pattern anomaly detection system can be used for various problems,\nsuch as typographical error detection and flagging a missing NULL check, to\nname a few. *This PoC demonstrates ControlFlag's application to typographical\nerror detection.*\n\nThe figure below shows ControlFlag's two main phases: (1) the pattern\nmining phase, and (2) the phase that scans for anomalous patterns. 
The pattern mining\nphase is a \"training phase\" that mines typical patterns in the user-provided GitHub\nrepositories and then builds a decision tree from the mined patterns. The scanning\nphase, on the other hand, applies the mined patterns to flag anomalous\nexpressions in the user-specified target repositories.\n\n![ControlFlag design](/docs/controlflag_design.jpg)\n\nMore details can be found in our MAPS paper (https://arxiv.org/abs/2011.03616).\n\n## Directory structure (evolving)\n- `src`: Source code for ControlFlag's typographical error detection system\n- `scripts`: Scripts for pattern mining and scanning for anomalies\n- `quick_start`: Scripts to run quick start tests\n- `github`: Scripts and data for downloading GitHub repos\n- `tests`: Unit tests\n\n## Install\n\nControlFlag can be built on Linux and MacOS.\n\n#### Requirements\n\n- CMake 3.4.3 or above\n- C++17 compatible compiler\n- [Tree-sitter](https://github.com/tree-sitter/tree-sitter.git) parser (downloaded automatically as part of cmake)\n- [GNU parallel](https://www.gnu.org/software/parallel/) (optional, if you want\n  to generate your own training data)\n\n**Tested build configuration on Linux-based systems**\n- CentOS-7.6/Ubuntu-20.04 with g++-v10.2.0 for x86\\_64\n\n**Tested build configuration on MacOS**\n- MacOS Mojave v10.14.6 with clang-1001.0.46.4 (Apple LLVM version 10.0.1) for x86\\_64 (obtained from The Command Line Tools Package)\n\n#### Build\n\n```\n$ cd control-flag\n$ cmake .\n$ make -j\n$ make test\n```\nAll tests in `make test` should pass.\n\n## Using ControlFlag\n\n### Quick start\n\n#### Using patterns obtained from several GitHub repos to scan a repository of your choice\n\nDownload the training data for the language of interest, choosing a dataset size that fits the memory constraints of your device. 
Note, however, that using smaller datasets may lead to reduced accuracy in the results ControlFlag produces and possibly an increase in the number of false positives it generates.\n\nLanguage | Dataset name | Size on disk | Memory requirements | Direct link | MD5 checksum\n---------|--------------|--------------|---------------------|-------------|-------------\nC | Small        | ~100MB       | ~400MB              | [link](https://www.dropbox.com/s/88kb00r71t0lf94/c_lang_if_stmts_6000_gitrepos_small.ts.tgz?dl=0)| 2825f209aba0430993f7a21e74d99889\nC | Medium       |   ~450MB     | ~1.3GB           | [link](https://www.dropbox.com/s/zjdwmqvhgbdnuns/c_lang_if_stmts_6000_gitrepos_medium.ts.tgz?dl=0) | aab2427edebe9ed4acab75c3c6227f24\nC | Large        |   ~9GB       | ~13GB           | [link](https://www.dropbox.com/s/oledgd1jli55xps/c_lang_if_stmts_6000_gitrepos.large.ts.tgz?dl=0) | 1ba954d9716765d44917445d3abf8e85\nC++ | Small | ~200MB | ~500MB | [link](https://www.dropbox.com/s/jtys6pihknl329b/cpp_controlflag_if_stmts_small.ts.tgz?dl=0) | f954486e20961f0838ac08e5d4dbf312\nC++ | Medium | ~500MB | ~1.3GB | [link](https://www.dropbox.com/s/ea9nwa2ijv2zfxq/cpp_controlflag_if_stmts_medium.ts.tgz?dl=0) |  a5c18ea1cdbe354b93aabf9ecaa5b07a\nC++ | Large | ~1.2GB | ~3GB | [link](https://www.dropbox.com/s/4du59qq28r4qnbw/cpp_controlflag_if_stmts_large.ts.tgz?dl=0) | 4f5ffc1ab942eaba399cafd5be8bb45f\nPHP | Small      | ~120MB       |  ~1GB           | [link](https://www.dropbox.com/s/it0ql3d2e1viao8/php_controlflag_if_stmts.ts.tgz?dl=0) | 5a1cc4c24a20de7dad1b9f40661d517a\n\n```\n$ Download \u003ctgz_file\u003e from the link above.\n$ (optional) md5sum \u003ctgz_file\u003e\n$ tar -zxf \u003ctgz_file\u003e\n```\n\nTo scan C code of your choice, use below command:\n\n```\n$ scripts/scan_for_anomalies.sh -d \u003cdirectory_to_be_scanned_for_anomalies\u003e -t \u003ctraining_data\u003e.ts -o \u003coutput_directory_to_store_log_files\u003e -l 1\n```\n\nTo scan C++ code of your 
choice, use the command below:\n\n```\n$ scripts/scan_for_anomalies.sh -d \u003cdirectory_to_be_scanned_for_anomalies\u003e -t \u003ctraining_data\u003e.ts -o \u003coutput_directory_to_store_log_files\u003e -l 4\n```\n\nOnce the run is complete (which could take some time, depending on your system and the\nnumber of programs in your repository that ControlFlag can scan), refer to [the section below to\nunderstand the scan output](#understanding-scan-output).\n\n#### Mining patterns from a small repo and applying them to another small repo\n\nIn this test for C language programs, we will mine patterns from\nGitHub's [Glb-director](https://github.com/github/glb-director.git) project and\napply them to flag anomalies in GitHub's [brubeck](https://github.com/github/brubeck.git) project.\n\nRun the command below:\n```\ncd quick_start \u0026\u0026 ./test1_c.sh\n```\n\nIf everything goes well, you can see the output from the scanner in the `test1_scan_output`\ndirectory. Look for the \"Potential anomaly\" label in it with `grep \"Potential anomaly\" -C 5 *.log`; you should see output like the following:\n\n```\nthread_6.log-Level:TWO Expression:(parenthesized_expression (binary_expression (\"==\") (identifier) (non_terminal_expression))) found in training dataset:\nSource file: brubeck/src/server.c:266:5:(s == sizeof(fdsi))\nthread_6.log-Autocorrect search took 0.000 secs\nthread_6.log:Potential anomaly\nthread_6.log-Did you mean:(parenthesized_expression (binary_expression (\"==\") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 1\nthread_6.log-Did you mean:(parenthesized_expression (binary_expression (\"==\") (identifier) (null))) with editing cost:1 and occurrences: 25\nthread_6.log-Did you mean:(parenthesized_expression (binary_expression (\"==\") (identifier) (identifier))) with editing cost:1 and occurrences: 5\nthread_6.log-Did you mean:(parenthesized_expression (binary_expression (\"\u003e=\") (identifier) (non_terminal_expression))) with 
editing cost:1 and occurrences: 3\nthread_6.log-Did you mean:(parenthesized_expression (binary_expression (\"==\") (non_terminal_expression) (non_terminal_expression))) with editing cost:1 and occurrences: 2\n```\nThe anomaly is flagged for `brubeck/src/server.c` at line number `266`.\n\n### Detailed steps\n\n1. __Pattern Mining phase__ (if you want to generate training data yourself)\n\nIf you do not want to generate training data yourself, go to the [Evaluation step below](#evaluation-or-scanning-for-anomalies).\n\nIn this phase, we mine the idiosyncratic patterns that appear in the control\nstructures of a high-level language such as C. *This PoC mines patterns from `if`\nstatements that appear in C programs.*\n\nIf you want to use your own repository for mining patterns, jump to Step 1.2.\n\n1.1 __Downloading GitHub repos for C language having more than 100 stars__\n\nThe steps below show how to download C-language GitHub repos that have more than 100 stars\n(`c100.txt`) and generate training data. `training_repo_dir` is a directory\nwhere the command below will clone all the repos.\n\n```\n$ cd github\n$ python download_repos.py -f c100.txt -o \u003ctraining_repo_dir\u003e -m clone -p 5\n```\n\n1.2 __Mining patterns from downloaded repositories__\n\nYou can use your own repository to mine for expressions by passing it in\nplace of `\u003ctraining_repo_dir\u003e`.\n\nThe `mine_patterns.sh` script does this. 
Its usage is as follows:\n\n```\nUsage: ./mine_patterns.sh -d \u003cdirectory_to_mine_patterns_from\u003e -o \u003coutput_file_to_store_training_data\u003e\nOptional:\n[-n number_of_processes_to_use_for_mining]  (default: num_cpus_on_system)\n[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)\n[-g github_repo_id] (default: 0) A unique identifier for GitHub repository, if any\n```\n\nWe use it as:\n```\n$ scripts/mine_patterns.sh -d \u003ctraining_repo_dir\u003e -o \u003ctraining_data_file\u003e -l 1\n```\n\n`\u003ctraining_data_file\u003e` contains the conditional expressions in C language that are\nfound in the specified GitHub repos, along with their AST (abstract syntax tree) representations.\nThe file is plain text, so you can inspect it in any text viewer.\n\n## Evaluation (or scanning for anomalies)\n\nWe can run the `scan_for_anomalies.sh` script to scan a target directory of interest.\nIts usage is as follows:\n```\nUsage: ./scan_for_anomalies.sh -t \u003ctraining_data\u003e -d \u003cdirectory_to_scan_for_anomalous_patterns\u003e\nOptional:\n [-c max_cost_for_autocorrect]              (default: 2)\n [-n max_number_of_results_for_autocorrect] (default: 5)\n [-j number_of_scanning_threads]            (default: num_cpus_on_systems)\n [-o output_log_dir]                        (default: /tmp)\n [-l source_language_number]                (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++))\n [-a anomaly_threshold]                     (default: 3.0)\n```\n\nAs a part of scanning for anomalies, ControlFlag also suggests possible\ncorrections in case a conditional expression is flagged as an anomaly. `max_cost_for_autocorrect` (`-c`, default `2`) is the\nmaximum cost for a correction, that is, how close a suggested correction must be to the\npossibly mistyped expression. Increasing `max_cost` leads to suggesting more\ncorrections. 
___If you feel that the number of reported anomalies is\nhigh, consider reducing `anomaly_threshold` to `1.0` or less___.\n\n### Understanding scan output\n\nUnder `output_log_dir` you will find multiple log files corresponding to\nthe scan output from different scanner threads. Potential anomalies are reported\nwith \"Potential anomaly\" as a label. The command below reports the log files\ncontaining at least one anomaly.\n\n```\n$ grep \"Potential anomaly\" \u003coutput_log_dir\u003e/thread_*.log\n```\n\nA sample anomaly report looks like this:\n```\nLevel:\u003cONE or TWO\u003e Expression: \u003cAST_for_anomalous_expression\u003e\nSource file and line number: \u003cSource code expression with line number having the anomaly\u003e\nPotential anomaly\nDid you mean ...\n```\nThe text after \"Did you mean\" shows possible corrections to the anomalous expression.\n\n## Success stories\nIn the spirit of community service, we routinely scan open-source packages using ControlFlag. We have found several programming errors in various open-source projects. 
Some of the errors confirmed by the respective developers are listed below.\n\nIssue link | Language | Erroneous expression | Comment\n-----------|----------------------|----------------------|---------\nhttps://github.com/curl/curl/pull/6193 | C | `if (s-\u003ekeepon \u003e TRUE)` | Comparison between a variable and a boolean using `\u003e`\nhttps://github.com/vrpn/vrpn/issues/263 | C | `(l_inbuf[2] \\| 1)`, `if (l_inbuf[3] \\| 1)` | Incorrect use of `\\|` instead of `\u0026`\nhttps://github.com/vlm/asn1c/issues/443 | C | `if(!saved_aid \u0026\u0026 0)` | Dead code\nhttps://github.com/shoes/shoes3/issues/468 | C | `if ((attr == 39) \\|\\| (attr = 49))` | Incorrect use of `=` instead of `==`\nhttps://github.com/IoLanguage/io/issues/455 | C | `if (UArray_greaterThan_(self, other) \\| UArray_equals_(self, other))` | Inefficient use of `\\|` instead of `\\|\\|`\nhttps://github.com/IoLanguage/io/issues/455 | C | `if( ln = (SFG_Node *)node-\u003eNext )`, `if( ln = (SFG_Node *)node-\u003ePrev )` | Missing parenthesis\nhttps://github.com/elua/elua/issues/170 | C | `if (Protection_Level_1_Register \u0026= FMI_Sector_Mask)` | Missing parenthesis\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintellabs%2Fcontrol-flag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fintellabs%2Fcontrol-flag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintellabs%2Fcontrol-flag/lists"}