{"id":30637613,"url":"https://github.com/definetlynotai/logicytics_vulnscan","last_synced_at":"2025-08-30T23:07:12.348Z","repository":{"id":306605545,"uuid":"1026727559","full_name":"DefinetlyNotAI/Logicytics_VulnScan","owner":"DefinetlyNotAI","description":"The Logicytics port of VulnScan, all the code and training tools is here - port started for v3.5.0 with the latest version being v3.4.2 from Logicytics.","archived":false,"fork":false,"pushed_at":"2025-08-27T00:02:37.000Z","size":54,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-27T08:28:13.746Z","etag":null,"topics":["ai","forensics","ml","python","python3","vulnerability","vulnerability-scanners"],"latest_commit_sha":null,"homepage":"https://github.com/DefinetlyNotAI/Logicytics","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DefinetlyNotAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-26T13:40:47.000Z","updated_at":"2025-08-27T00:02:38.000Z","dependencies_parsed_at":"2025-07-26T19:38:38.851Z","dependency_job_id":"ff9d6d11-2e38-4040-afcb-d06405c3d1dd","html_url":"https://github.com/DefinetlyNotAI/Logicytics_VulnScan","commit_stats":null,"previous_names":["definetlynotai/logicytics_vulnscan"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DefinetlyNotAI/Logicytics_VulnScan","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FLogicytics_VulnScan","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FLogicytics_VulnScan/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FLogicytics_VulnScan/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FLogicytics_VulnScan/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DefinetlyNotAI","download_url":"https://codeload.github.com/DefinetlyNotAI/Logicytics_VulnScan/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DefinetlyNotAI%2FLogicytics_VulnScan/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272917735,"owners_count":25014935,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","forensics","ml","python","python3","vulnerability","vulnerability-scanners"],"created_at":"2025-08-30T23:06:48.123Z","updated_at":"2025-08-30T23:07:12.277Z","avatar_url":"https://github.com/DefinetlyNotAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# VulnScan Documentation\r\n\r\nVulnScan is designed to detect sensitive data across various file formats.\r\nIt offers a modular framework to train models using diverse algorithms,\r\nfrom traditional ML classifiers to 
advanced Neural Networks.\r\n\r\nThis document outlines the system's naming conventions, lifecycle, and model configuration.\r\n\r\n\u003e [!NOTE]\r\n\u003e Ported in update 3.5.0 of Logicytics - Latest update from there was 3.4.2\r\n\u003e\r\n\u003e You can find the main repo and generated files [here](https://github.com/DefinetlyNotAI/Logicytics/tree/main/CODE/vulnScan)\r\n\r\n\u003e [!IMPORTANT]\r\n\u003e Old documentation is available in the `Archived Models` directory of this [repository](https://github.com/DefinetlyNotAI/VulnScan_Data)\r\n\u003e\r\n\u003e This documentation covers test data, metrics and niche features.\r\n\r\n---\r\n\r\n## Naming Conventions\r\n\r\n### Model Naming Format\r\n`Model_{Type of model}.{Version}`\r\n\r\n- **Type of Model**: Describes the training data configuration.\r\n    - `SenseNano`: Test set with \u003c10k files or \u003c1k vals (PT), used for error-checking.\r\n    - `SenseMini`: Dataset with 10k to 50k files or 1k-5k vals (PT). `Balanced size for effective training and resource efficiency`.\r\n    - `Sense`: Sensitive data set with 50k to 100k files or 5k-10k vals (PT).\r\n    - `SenseMacro`: Large dataset with \u003e100k files or \u003e10k vals (PT).\r\n\r\n- **Version Format**: `{Version#}{c}{Repeat#}`\r\n    - **Version#**: Increment for major code updates.\r\n    - **c**: Model identifier (e.g., NeuralNetwork, BERT, etc.). 
See below for codes.\r\n    - **Repeat#**: Number of times the same model was trained without significant code changes, used to improve consistency.\r\n    - **-F**: Denotes a failed or corrupted model.\r\n\r\n### Model Identifiers\r\n\r\n| Code | Model Type                |\r\n|------|---------------------------|\r\n| `b`  | BERT                      |\r\n| `dt` | DecisionTree              |\r\n| `et` | ExtraTrees                |\r\n| `g`  | GBM                       |\r\n| `l`  | LSTM                      |\r\n| `n`  | NeuralNetwork (preferred) |\r\n| `nb` | NaiveBayes                |\r\n| `r`  | RandomForestClassifier    |\r\n| `lr` | Logistic Regression       |\r\n| `v`  | SupportVectorMachine      |\r\n| `x`  | XGBoost                   |\r\n\r\n### Example\r\n`Model_Sense.1n2`:\r\n- Dataset: `Sense` (50k files, 50KB each).\r\n- Version: 1 (first major version).\r\n- Model: `NeuralNetwork` (`n`).\r\n- Repeat Count: 2 (second training run with no major code changes).\r\n\r\n---\r\n\r\n## Life Cycle Phases\r\n\r\n### Version 1 (Deprecated)\r\n- **Removed**: Small and weak codebase, replaced by `v3`.\r\n\r\n1. Generate data.\r\n2. Index paths.\r\n3. Read paths.\r\n4. Train models and iterate through epochs.\r\n5. Produce outputs: data, graphs, and `.pkl` files.\r\n\r\n---\r\n\r\n### Version 2 (Deprecated)\r\n- **Deprecation Reason**: Outdated methods for splitting and vectorizing data.\r\n\r\n1. Load Data.\r\n2. Split Data.\r\n3. Vectorize Text.\r\n4. Initialize Model.\r\n5. Train Model.\r\n6. Evaluate Model.\r\n7. Save Model.\r\n8. Track Progress.\r\n\r\n---\r\n\r\n### Version 3 (Superseded)\r\n- **Superseded by Version 4**\r\n- Retained for reference and backward compatibility.\r\n\r\n1. **Read Config**: Load model and training parameters.\r\n2. **Load Data**: Collect and preprocess sensitive data.\r\n3. **Split Data**: Separate into training and validation sets.\r\n4. **Vectorize Text**: Transform textual data using `TfidfVectorizer`.\r\n5. 
**Initialize Model**: Define traditional ML or Neural Network models.\r\n6. **Train Model**: Perform iterative training using epochs.\r\n7. **Validate Model**: Evaluate with metrics and generate classification reports.\r\n8. **Save Model**: Persist trained models and vectorizers for reuse.\r\n9. **Track Progress**: Log and visualize accuracy and loss trends over epochs.\r\n\r\n---\r\n\r\n### Version 4 (Current)\r\n- **Current Release**: Major improvements in scalability, modularity, and embedding-based training.\r\n- **Key Features**:\r\n    - **Dynamic Dataset Generation**: Uses GPT-Neo for synthetic sensitive data generation, scaling from small to large datasets.\r\n    - **Embedding-Based Training**: Employs MiniLM sentence embeddings for all text samples, improving feature representation.\r\n    - **Multi-Round Training**: Supports multiple training rounds per dataset size for robust model evaluation.\r\n    - **Automated Caching**: Datasets and embeddings are cached for reuse, reducing redundant computation.\r\n    - **Configurable Model Naming**: Model names reflect dataset size, type, version, and training round.\r\n    - **Progress Tracking**: Training history and metrics are saved per round for analysis.\r\n    - **Extensible Framework**: Easily integrates new models, datasets, and training strategies.\r\n\r\n#### Version 4 Workflow\r\n1. **Initialize Resources**: Load GPT-Neo and MiniLM models for generation and embedding.\r\n2. **Dataset Generation**: Create or load datasets of varying sizes, using cached data when available.\r\n3. **Embedding Generation**: Compute sentence embeddings for train, validation, and test splits.\r\n4. **Split Data**: Partition data into train, validation, and test sets based on configurable ratios.\r\n5. **Model Training**: Train a neural network using embeddings, with support for early stopping and learning rate scheduling.\r\n6. 
**Multi-Round Evaluation**: Repeat training for each dataset size and round, saving metrics and model states.\r\n7. **Progress Logging**: Save training history, plots, and logs for each round and model.\r\n8. **Extensibility**: Easily add new dataset sizes, model types, or embedding strategies.\r\n\r\n#### Example Model Name\r\n`Model_Sense.4n1`:\r\n- Dataset: `Sense` (50k to 100k files).\r\n- Version: 4 (current major version).\r\n- Model: NeuralNetwork (`n`).\r\n- Training Round: 1.\r\n\r\n---\r\n\r\n## Preferred Model\r\n**NeuralNetwork (`n`)**\r\n- Proven to be the most effective for detecting sensitive data in the project.\r\n\r\n---\r\n\r\n## Notes\r\n- **Naming System**: Helps track model versions, datasets, and training iterations for transparency and reproducibility.\r\n- **Current Focus**: Version 4 for improved scalability, embedding-based training, and robust performance.\r\n\r\n---\r\n\r\n## Additional Features\r\n\r\n- **Progress Tracking**: Visualizes accuracy and loss per epoch with graphs.\r\n- **Error Handling**: Logs errors for missing files, attribute issues, or unexpected conditions.\r\n- **Extensibility**: Supports plug-and-play integration for new algorithms or datasets.\r\n\r\n\r\n# More files\r\n\r\nThere is a repository that archived all the data used to make the model,\r\nas well as previously trained models for you to test out\r\n(loading scripts and vectorizers are not included).\r\n\r\nThe repository is located [here](https://github.com/DefinetlyNotAI/VulnScan_Data).\r\n\r\nThe repository contains the following directories:\r\n- `cache`: Contains all training data generated by [`Generator.py`](Generator.py).\r\n- `NN features`: Contains information about the model `.3n3` and the vectorizer used. 
Information includes:\r\n    - `Documentation_Study_Network.md`: A Markdown file that contains more info.\r\n    - `Neural_Network_Nodes_Graph.gexf`: A Gephi file that contains the model nodes and edges.\r\n    - `Feature_Importance.svg`: An SVG file that contains the feature importance of the model.\r\n    - `Loss_Landscape_3D.html`: An HTML file that contains the 3D loss landscape of the model.\r\n    - `Model_State_Dict.txt`: A text file that contains the model state dictionary.\r\n    - `Model_Summary.txt`: A text file that contains the model summary.\r\n    - `Model_Visualization.png`: A PNG file that contains the model visualization.\r\n    - `Visualize_Activation.png`: A PNG file that contains the visualization of the model activation.\r\n    - `Visualize_tSNE.png`: A PNG file that contains the visualization of the model t-SNE with the default training test embeds.\r\n    - `Visualize_tSNE_custom.png`: A PNG file that contains the visualization of the model t-SNE with real-world training examples (only 100).\r\n    - `Weight_Distribution.png`: A PNG file that contains the weight distribution of the model.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinetlynotai%2Flogicytics_vulnscan","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefinetlynotai%2Flogicytics_vulnscan","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefinetlynotai%2Flogicytics_vulnscan/lists"}