{"id":25096312,"url":"https://github.com/yamil-serrano/language-processing-analyzer","last_synced_at":"2025-06-11T22:08:02.568Z","repository":{"id":276218827,"uuid":"928609763","full_name":"Yamil-Serrano/Language-Processing-Analyzer","owner":"Yamil-Serrano","description":"This repository contains the development of a Language Processing Analyzer, structured into three phases. It is part of the CIIC 4030 - ICOM 4036: Programming Languages course at my University, Department of Computer Science and Software Engineering.","archived":false,"fork":false,"pushed_at":"2025-04-30T21:34:37.000Z","size":33,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-30T22:33:31.352Z","etag":null,"topics":["lexer","parser","programming-languages"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Yamil-Serrano.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-06T23:12:32.000Z","updated_at":"2025-04-30T21:34:41.000Z","dependencies_parsed_at":"2025-03-04T20:19:27.118Z","dependency_job_id":"c71232c3-1217-443e-a708-028932d90b98","html_url":"https://github.com/Yamil-Serrano/Language-Processing-Analyzer","commit_stats":null,"previous_names":["yamil-serrano/language-processing-analyzer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Yamil-Serrano/Language-Processing-Analyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yamil-Serrano%2FLang
uage-Processing-Analyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yamil-Serrano%2FLanguage-Processing-Analyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yamil-Serrano%2FLanguage-Processing-Analyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yamil-Serrano%2FLanguage-Processing-Analyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Yamil-Serrano","download_url":"https://codeload.github.com/Yamil-Serrano/Language-Processing-Analyzer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yamil-Serrano%2FLanguage-Processing-Analyzer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259352666,"owners_count":22844738,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lexer","parser","programming-languages"],"created_at":"2025-02-07T16:33:04.269Z","updated_at":"2025-06-11T22:08:02.557Z","avatar_url":"https://github.com/Yamil-Serrano.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Language Processing Analyzer\n\nThis repository contains the development of a **Language Processing Analyzer**, structured into three phases of analysis (lexical, syntax, and semantic) followed by an interpreter. This document covers each of them in turn.\n\n## Phase 1: Lexical Analysis\n\n### What is a Lexical Analyzer?\nA **Lexical Analyzer (Lexer)** is the first stage of a compiler or interpreter. 
Its function is to **take the source code as input and break it down into the smallest meaningful units, called tokens**.\n\n**Example of source code:**\n```c\nint x = 10;\n```\n**Lexical Analyzer Output:**\n```\nTOKEN_KEYWORD(int)\nTOKEN_IDENTIFIER(x)\nTOKEN_OPERATOR(=)\nTOKEN_NUMBER(10)\nTOKEN_SYMBOL(;)\n```\nThese tokens will be used in the next phase (**syntax analysis**) to verify the structure of the code.\n\n### How does it work?\n1. **Reads the source code.**\n2. **Ignores whitespace and comments.**\n3. **Identifies tokens according to lexical rules (regular expressions).**\n4. **Generates a list of tokens.**\n5. **Passes the tokens to the parser.**\n\n### Understanding Greedy Regular Expressions\nIn lexical analysis, regular expressions are **greedy** by default, meaning they try to match the longest possible string that fits their pattern. Here's how it works:\n\n1. The lexer starts at a position in the input.\n2. It looks ahead character by character, trying to match the longest possible sequence.\n3. When it can't match any more characters, it creates a token with the matched sequence.\n\n**Example with numbers:**\n```python\n# For the regular expression '\\d+' (one or more digits)\nInput: \"123abc\"\n\nLexer process:\n1. Starts at '1' → matches\n2. Looks ahead to '2' → still matches\n3. Looks ahead to '3' → still matches\n4. Looks ahead to 'a' → doesn't match\n5. 
Creates NUMBER token with \"123\"\n```\n\nThis greedy behavior ensures that numbers like \"123\" are tokenized as a single NUMBER token (123) rather than three separate tokens (1, 2, 3).\n\n## Automata and Regular Expressions\nThe **lexer** can be implemented using **Deterministic Finite Automata (DFA)**, generated from **Regular Expressions**.\n\n**Example of a regular expression for identifiers:**\n```\n[a-zA-Z][a-zA-Z0-9_]*\n```\nThis represents a **word that starts with a letter, followed by any number of letters, digits, or underscores**.\n\n**Example of a DFA to recognize \"if\", \"int\", and \"else\":**\n```\n  (q0) -- 'i' --\u003e (q1) -- 'f' --\u003e (q2) [IF]\n    |               |\n    |               +-- 'n' --\u003e (q3) -- 't' --\u003e (q4) [INT]\n    |\n    +-- 'e' --\u003e (q5) -- 'l' --\u003e (q6) -- 's' --\u003e (q7) -- 'e' --\u003e (q8) [ELSE]\n```\nThis diagram shows how a DFA recognizes keywords by following transitions between states. Because `if` and `int` share their first letter, both follow the `i` transition out of q0 and only branch at q1.\n\n## Implementation with PLY\nIn this repository, a lexer is implemented using PLY, utilizing **regular expressions and functions in Python** to define tokens.\n\n**Example of lexer code:**\n```python\nimport ply.lex as lex\n\n# Reserved words, mapped to their token names\nreserved = {'if': 'IF', 'else': 'ELSE'}\n\n# Token list\ntokens = ['IDENTIFIER', 'NUMBER', 'PLUS', 'EQUAL'] + list(reserved.values())\n\n# Lexical rules\nt_NUMBER = r'\\d+'\nt_PLUS = r'\\+'\nt_EQUAL = r'='\nt_ignore = ' \\t'  # skip spaces and tabs\n\ndef t_IDENTIFIER(t):\n    r'[a-zA-Z_][a-zA-Z0-9_]*'\n    # Keywords also match the identifier pattern, so they are resolved\n    # by lookup here rather than by separate t_IF / t_ELSE rules\n    t.type = reserved.get(t.value, 'IDENTIFIER')\n    return t\n\ndef t_error(t):\n    print(f\"Illegal character: {t.value[0]}\")\n    t.lexer.skip(1)\n\nlexer = lex.lex()\n```\n\n## Phase 2: Syntax Analysis (Parser)\n\n### What is a Syntax Analyzer (Parser)?\nA **Syntax Analyzer (Parser)** is the second stage of a compiler or interpreter. Its function is to **verify the structure of the source code**, based on the grammar rules. 
The parser checks if the sequence of tokens (generated by the lexical analyzer) follows the syntax of the language.\n\nFor example, given the input:\n```c\nint x = 10;\n```\nThe parser would check if this code follows the correct syntax for variable declaration and assignment in the language.\n\n### How does it work?\n1. **Receives tokens from the lexical analyzer.**\n2. **Follows the grammar rules** to match the sequence of tokens.\n3. **Generates a syntax tree** (Abstract Syntax Tree - AST) that represents the hierarchical structure of the code.\n4. **Reports errors** if the code does not conform to the defined syntax.\n\n### Understanding Grammar Rules\nIn a parser, the language syntax is defined by **Context-Free Grammar (CFG)**, which is a set of production rules that describe how tokens can be combined into valid statements. \n\nHere is an example of a simple grammar rule for a mathematical expression:\n```\nexpression : term\n           | expression PLUS term\n           | expression MINUS term\n```\nThis means that an expression can be a single term, or an expression followed by a `PLUS` or `MINUS` operator and another term.\n\n### Implementing the Parser with PLY\nIn this repository, the parser is implemented using **PLY (Python Lex-Yacc)**, which is a library that helps implement lexers and parsers in Python. 
We write the grammar rules in **BNF (Backus-Naur Form)** inside the docstrings of rule functions, declare operator precedence in a `precedence` tuple, and then use **PLY's yacc** module to create the parser.\n\n#### Example of Parser Code:\n\n```python\nimport ply.yacc as yacc\n\n# Operator precedence, listed from lowest to highest\nprecedence = (\n    ('left', 'OR'),\n    ('left', 'AND'),\n    ('left', 'EQUAL', 'LESS', 'GREATER'),\n    ('left', 'PLUS', 'MINUS'),\n    ('left', 'TIMES', 'DIVIDE'),\n    ('left', 'DOT'),\n)\n\ndef p_global_facts(p):\n    '''global_facts : facts exec_line '''\n    pass\n\ndef p_facts(p):\n    '''facts : func_def facts \n             | assign facts \n             | '''\n    pass\n\ndef p_func_def(p):\n    '''func_def : FUNC ID_FUNC LBRACE params RBRACE ASSIGN stm END'''\n    pass\n\n# Additional grammar rules for parameters, statements, assignments, etc.\n```\n\n### Defining the Grammar\nThe syntax of the language is described through grammar rules that specify how different components of the language can be combined. Each rule is written as a function with the format `def p_rule_name(p):`, whose docstring holds the production. Here `p` is a sequence of the values of the symbols in the rule: `p[1]`, `p[2]`, and so on hold the values of the right-hand-side symbols, and the action assigns the rule's result to `p[0]`.\n\nFor example, the rule for a function definition is written as:\n```python\ndef p_func_def(p):\n    '''func_def : FUNC ID_FUNC LBRACE params RBRACE ASSIGN stm END'''\n    pass\n```\n\n### Operator Precedence\nIn the parser, we define **operator precedence** so that `*` and `/` bind more tightly than `+` and `-`, and so that the logical operators `AND` and `OR` bind more loosely than comparisons and arithmetic. PLY reads the `precedence` tuple from lowest to highest precedence, so entries listed later bind more tightly. This resolves ambiguities such as how to group `a + b * c`.\n\n```python\nprecedence = (\n    ('left', 'OR'),\n    ('left', 'AND'),\n    ('left', 'EQUAL', 'LESS', 'GREATER'),\n    ('left', 'PLUS', 'MINUS'),\n    ('left', 'TIMES', 'DIVIDE'),\n    ('left', 'DOT'),\n)\n```\n\n### Error Handling\nIn the parser, we define an error function to handle syntax errors. 
If the input doesn't match any of the grammar rules, it will print an error message indicating the problem:\n\n```python\ndef p_error(p):\n    if p:\n        print(f\"Syntax error in input: {p.value} at line {p.lineno}\")\n    else:\n        print(\"Syntax error in input: none.\")\n```\n\n## Phase 3: Semantic Analysis  \n\n### What is Semantic Analysis?  \n**Semantic Analysis** is the third stage of a compiler/interpreter. While syntax analysis ensures the code is grammatically correct, semantic analysis verifies that the code **makes logical sense** according to the language rules. It checks:  \n- **Type compatibility** (e.g., `5 + \"text\"` is invalid).  \n- **Variable/function existence** (e.g., using undeclared variables).  \n- **Scope rules** (e.g., accessing variables outside their scope).  \n- **Function argument validity** (e.g., incorrect number/type of arguments).  \n\n### How Does It Work?  \n1. **Traverses the AST** generated by the parser.  \n2. **Validates context-sensitive rules** using symbol tables and type-checking logic.  \n3. **Annotates the AST** with type information and scope details.  \n4. **Reports errors** for logical inconsistencies.  \n\n---\n\n### Example: Semantic Rules in Action  \n#### Code Snippet  \n```python  \nx = 5 + 3  \n```\n\n**Step-by-Step Analysis**\n1. **Variable Declaration Check**:\n   * If the language requires explicit declarations, ensure `x` is declared before use.\n2. **Type Checking**:\n   * Verify `5` (integer) and `3` (integer) are compatible with the `+` operator.\n3. 
**Result Assignment**:\n   * Assign the result type (integer) to `x`.\n\n**PLY Rule Explanation**\nFor the expression `5 + 3`, the parser rule might look like:\n```python\ndef p_stm_binary_op(p):  \n    '''stm : stm PLUS stm'''  \n    p[0] = {  \n        'type': 'binary_op',  \n        'operator': p[2],  # '+' (p[2] is the PLUS token)  \n        'left': p[1],      # Left operand (e.g., the AST node for 5)  \n        'right': p[3],     # Right operand (e.g., the AST node for 3)  \n        'data_type': None  # Placeholder for semantic analysis  \n    }  \n```\n\n* `p[0]`: The parent node combining the operation.\n* `p[1]` and `p[3]`: Child nodes representing the operands (`5` and `3`).\n* During semantic analysis, `data_type` is updated to `int` after validation.\n\n**Key Components**\n\n**1. Symbol Tables**\nTrack variables, functions, and their metadata (type, scope, etc.). Example:\n```python\nlet  \n    val x = 10  # Symbol table entry: {name: 'x', type: 'int', scope: 'local'}  \nin  \n    x + \"hello\"  # Error: x (int) + \"hello\" (string) is invalid  \nend  \n```\n\n**2. Type Checking**\n* Ensures operations are valid for their operand types.\n* Example error: `Type mismatch: int + string is not allowed`.\n\n**3. Scope Management**\n* Validates variable visibility.\n* Example error: `Variable 'y' not declared in this scope`.\n\n**Error Handling Examples**\n\n| **Error Case** | **Error Message** |\n|----------------|-------------------|\n| `5 + \"text\"` | Type mismatch: int and string |\n| `foo(1, 2)` (expects 1 arg) | Function 'foo' expects 1 argument |\n| `y = 10` (undeclared `y`) | Undeclared variable 'y' |\n\n**Integration with the Parser**\nSemantic actions are embedded in parser rules. For example:\n```python\ndef p_assign(p):  \n    '''assign : VAL ID ASSIGN stm END'''  \n    # Semantic check: Ensure ID is declared (if required)  \n    # and the RHS type matches the LHS type.  
\n    p[0] = {  \n        'type': 'assign',  \n        'name': p[2],  \n        'value': p[4],  \n        'data_type': infer_type(p[4])  # Semantic annotation  \n    }  \n```\n\n## Phase 4: Interpreter\n\n### What is an Interpreter?\n\nThe **Interpreter** is the component that **executes the program** by walking through the **Abstract Syntax Tree (AST)** and evaluating each node based on its type. Unlike a compiler, which translates code ahead of time, the interpreter works **dynamically**, processing instructions as it encounters them.\n\n### How Does It Work?\n\n1. **AST Traversal**: The interpreter starts at the root of the AST and recursively visits each node.\n2. **Environment Management**: It maintains environments (symbol tables) to track variable/function definitions and their scopes.\n3. **Evaluation**: For each AST node, it:\n   - Looks up identifiers (variables/functions)\n   - Evaluates expressions\n   - Applies operations\n   - Manages scope for `let`, `if`, and `function` blocks\n4. **Recursive Execution**: Evaluation of a node often involves evaluating its children first, then combining their results.\n\n---\n\n### Example: Interpreting a Function Call\n\n#### Source Code\n```text\nfunc SomeFunction[n] := \n  let\n    val r := 15 end\n  in\n    n * r\n  end \nend\n\nexec SomeFunction[3]\n```\n\n#### Step-by-Step Execution\n\n1. **Top-Level AST Traversal**\n   - The root contains:\n     - A function definition: `SomeFunction`\n     - A statement: `exec SomeFunction[3]`\n   - The function is stored in the **global environment**, but **not executed yet**.\n\n2. **Executing `exec SomeFunction[3]`**\n   - The interpreter:\n     - Looks up `SomeFunction` in the global environment.\n     - Creates a **new environment** for the function call.\n     - Assigns the argument `3` to the parameter `n`.\n\n3. 
**Evaluating the Function Body (`let` Block)**\n   - Enters the `let` block:\n     - Extends the current environment.\n     - Defines `r := 15`.\n   - Evaluates the body expression: `n * r`\n     - Looks up `n → 3`\n     - Looks up `r → 15`\n     - Performs the multiplication: `3 * 15 = 45`\n\n4. **Returning the Result**\n   - The value `45` is returned from the function.\n   - The environment is **restored** after the function ends.\n   - `exec` produces the final output: `45`\n\n---\n\n### Interpreter Function Breakdown\n\n| Node Type       | Action Taken                                          |\n|-----------------|--------------------------------------------------------|\n| `binary_op`     | Recursively evaluate left and right, then apply op     |\n| `function_call` | Create new environment, bind arguments, eval body      |\n| `let`           | Create local scope, define vars, evaluate expression   |\n| `if`            | Evaluate condition, then evaluate appropriate branch   |\n| `value` / `int` | Return literal value                                   |\n| `id`            | Look up variable in current environment                |\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyamil-serrano%2Flanguage-processing-analyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyamil-serrano%2Flanguage-processing-analyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyamil-serrano%2Flanguage-processing-analyzer/lists"}