{"id":50508895,"url":"https://github.com/denisecase/nlp-03-text-exploration","last_synced_at":"2026-06-02T18:31:18.689Z","repository":{"id":348523662,"uuid":"1176508491","full_name":"denisecase/nlp-03-text-exploration","owner":"denisecase","description":"Exploratory analysis of text corpora using tokenization, frequency, co-occurrence, and bigrams to reveal structure in text.","archived":false,"fork":false,"pushed_at":"2026-04-01T13:27:21.000Z","size":470,"stargazers_count":1,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-01T15:26:03.797Z","etag":null,"topics":["bigrams","co-occurence","corpus-analysis","data-analysis","nlp","python","text-analysis","text-exploration","tokenization"],"latest_commit_sha":null,"homepage":"https://denisecase.github.io/nlp-03-text-exploration/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/denisecase.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":".github/AGENTS.md","dco":null,"cla":null}},"created_at":"2026-03-09T04:54:37.000Z","updated_at":"2026-04-01T13:27:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/denisecase/nlp-03-text-exploration","commit_stats":null,"previous_names":["denisecase/nlp-03-text-exploration"],"tags_count":null,"template":true,"template_full_name":null,"purl":"pkg:github/denisecase/nlp-03-text-exploration","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/denisecase%2Fnlp-03-text-exploration","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/denisecase%2Fnlp-03-text-exploration/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/denisecase%2Fnlp-03-text-exploration/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/denisecase%2Fnlp-03-text-exploration/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/denisecase","download_url":"https://codeload.github.com/denisecase/nlp-03-text-exploration/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/denisecase%2Fnlp-03-text-exploration/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33833277,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigrams","co-occurence","corpus-analysis","data-analysis","nlp","python","text-analysis","text-exploration","tokenization"],"created_at":"2026-06-02T18:31:17.047Z","updated_at":"2026-06-02T18:31:18.681Z","avatar_url":"https://github.com/denisecase.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nlp-03-text-exploration\n\n[![Python 3.14+](https://img.shields.io/badge/python-3.14%2B-blue?logo=python)](#)\n[![MIT](https://img.shields.io/badge/license-see%20LICENSE-yellow.svg)](./LICENSE)\n\n\u003e Professional Python project for Web Mining and Applied NLP.\n\nWeb Mining and Applied NLP focus on retrieving, processing, and analyzing text from the web and other digital sources.\nThis course builds those capabilities through working projects.\n\nIn the age of generative AI, durable skills are grounded in real work:\nsetting up a professional environment,\nreading and running code,\nunderstanding the logic,\nand pushing work to a shared repository.\nEach project follows a similar structure based on professional Python projects.\nThese projects are **hands-on textbooks** for learning Web Mining and Applied NLP.\n\n## This Project\n\nThis project focuses on **exploratory analysis of text data**.\n\nThe goal is to analyze a small, structured corpus and observe how\npatterns emerge from token distributions, category comparisons,\nand contextual relationships.\n\nYou will:\n\n- tokenize and clean text data\n- build frequency distributions\n- compare token usage across categories\n- examine co-occurrence (context windows)\n- analyze bigrams (local structure)\n- visualize results and interpret patterns\n\nThis project illustrates how **structure appears in text before any machine learning is applied**.\nThese patterns support later pipelines, embeddings, and retrieval.\n\nYou'll work with just these files as you update authorship and experiment:\n\n- **notebooks/nlp_corpus_explore_case.ipynb** - notebook version\n- **src/nlp/nlp_corpus_explore_case.py** - Python script\n- **pyproject.toml** - project configuration and dependencies\n- **zensical.toml** - project metadata\n\n## First: Follow These Instructions\n\nFollow the [step-by-step workflow guide](https://denisecase.github.io/pro-analytics-02/workflow-b-apply-example-project/) to complete:\n\n1. Phase 1. **Start \u0026 Run**\n2. Phase 2. **Change Authorship**\n3. Phase 3. **Read \u0026 Understand**\n\n## What to Look For\n\nAs you run the script and notebook, focus on:\n\n- which tokens dominate each category\n- how categories differ in vocabulary\n- which words appear in similar contexts\n- how local structure (bigrams) appears in text\n\nThese observations are the foundation for later modules.\n\n## Success\n\nAfter running the script successfully, you will see:\n\n```shell\n========================\nPipeline executed successfully!\n========================\n```\n\nYou will also see:\n\n- frequency tables printed to the console\n- visualizations of token distributions\n- examples of co-occurrence and bigram patterns\n\nA file named `project.log` will appear in the project folder.\n\n## Command Reference\n\nThe commands below are used in the workflow guide above.\nThey are provided here for convenience.\n\nFollow the guide for the **full instructions**.\n\n\u003cdetails\u003e\n\u003csummary\u003eShow command reference\u003c/summary\u003e\n\n### In a machine terminal (open in your `Repos` folder)\n\nAfter you get a copy of this repo in your own GitHub account,\nopen a machine terminal in your `Repos` folder:\n\n```shell\n# Replace username with YOUR GitHub username.\ngit clone https://github.com/username/nlp-03-text-exploration\ncd nlp-03-text-exploration\ncode .\n```\n\n### In a VS Code terminal\n\n```shell\nuv self update\nuv python pin 3.14\nuv sync --extra dev --extra docs --upgrade\n\nuvx pre-commit install\ngit add -A\nuvx pre-commit run --all-files\n\n# Later, we install spacy data model and\n# en_core_web_sm = english, core, web, small\n# It's big: spacy+data ~200+ MB w/ model installed\n#           ~350–450 MB for .venv is normal for NLP\n# uv run python -m spacy download en_core_web_sm\n\n# First, run the module\n# IMPORTANT: Close each figure after viewing so execution continues\nuv run python -m nlp.nlp_corpus_explore_case\n\n# Then, open the notebook.\n# IMPORTANT: Select the kernel and Run All:\n# notebooks/nlp_corpus_explore_case.ipynb\n\nuv run ruff format .\nuv run ruff check . --fix\nuv run zensical build\n\ngit add -A\ngit commit -m \"update\"\ngit push -u origin main\n```\n\n\u003c/details\u003e\n\n## Notes\n\n- Use the **UP ARROW** and **DOWN ARROW** in the terminal to scroll through past commands.\n- Use `CTRL+f` to find (and replace) text within a file.\n\n## Terminology\n\nIn preparation for large language models (LLM) and related methods,\nour analysis does not begin with semantic interpretation.\nInstead, we focus on **proximity** and observable **patterns** in the text.\n\nWe evaluate **co-occurrence (context windows)**, that is, _which words tend to appear near each other_.\n\nThe full collection of text is called a **corpus** (a set of documents).\nFor this analysis, each document is represented as a single line of text.\n\n## Example Output\n\n```text\nCorpus contains 22 documents.\nTokenization complete.\nshape: (10, 2)\n┌──────────┬────────┐\n│ category ┆ token  │\n│ ---      ┆ ---    │\n│ str      ┆ str    │\n╞══════════╪════════╡\n│ dog      ┆ dog    │\n│ dog      ┆ barks  │\n│ dog      ┆ loudly │\n│ dog      ┆ the    │\n│ dog      ┆ puppy  │\n│ dog      ┆ runs   │\n│ dog      ┆ the    │\n│ dog      ┆ yard   │\n│ dog      ┆ canine │\n│ dog      ┆ wears  │\n└──────────┴────────┘\nTop global tokens:\nshape: (10, 2)\n┌────────┬─────┐\n│ token  ┆ len │\n│ ---    ┆ --- │\n│ str    ┆ u32 │\n╞════════╪═════╡\n│ the    ┆ 27  │\n│ near   ┆ 4   │\n│ truck  ┆ 3   │\n│ cat    ┆ 3   │\n│ yard   ┆ 3   │\n│ garage ┆ 3   │\n│ dog    ┆ 3   │\n│ car    ┆ 3   │\n│ kitten ┆ 2   │\n│ window ┆ 2   │\n└────────┴─────┘\nTop tokens by category:\nshape: (12, 3)\n┌──────────┬─────────┬─────┐\n│ category ┆ token   ┆ len │\n│ ---      ┆ ---     ┆ --- │\n│ str      ┆ str     ┆ u32 │\n╞══════════╪═════════╪═════╡\n│ truck    ┆ the     ┆ 4   │\n│ truck    ┆ truck   ┆ 3   │\n│ truck    ┆ pickup  ┆ 1   │\n│ truck    ┆ carries ┆ 1   │\n│ truck    ┆ trailer ┆ 1   │\n│ …        ┆ …       ┆ …   │\n│ truck    ┆ heavy   ┆ 1   │\n│ truck    ┆ loads   ┆ 1   │\n│ truck    ┆ powers  ┆ 1   │\n│ truck    ┆ cargo   ┆ 1   │\n│ truck    ┆ hauls   ┆ 1   │\n└──────────┴─────────┴─────┘\nCAT top tokens: ['the', 'cat', 'kitten', 'window', 'near']\nTRUCK top tokens: ['the', 'truck', 'pickup', 'carries', 'trailer']\nCAR top tokens: ['the', 'garage', 'car', 'sedan', 'near']\nDOG top tokens: ['the', 'yard', 'dog', 'across', 'ran']\n\nContext for 'dog':\n['barks', 'loudly', 'holds', 'the', 'the', 'ran', 'across']\n\nContext for 'cat':\n['sleeps', 'quietly', 'the', 'has', 'whiskers', 'the', 'slept', 'near']\n\nContext for 'car':\n['drives', 'the', 'the', 'moves', 'down', 'the', 'stopped', 'near']\n\nContext for 'truck':\n['carries', 'cargo', 'powers', 'the', 'the', 'hauls', 'heavy']\nTop bigrams:\nshape: (10, 2)\n┌────────────┬─────┐\n│ bigram     ┆ len │\n│ ---        ┆ --- │\n│ str        ┆ u32 │\n╞════════════╪═════╡\n│ near the   ┆ 4   │\n│ the yard   ┆ 3   │\n│ the garage ┆ 3   │\n│ the cat    ┆ 2   │\n│ ran across ┆ 2   │\n│ the window ┆ 2   │\n│ the kitten ┆ 2   │\n│ the sedan  ┆ 2   │\n│ slept near ┆ 2   │\n│ across the ┆ 2   │\n└────────────┴─────┘\n```\n\n## Text Categorization Analysis\n\n- Which words appear **most often in each category**, and why?\n- Which words tend to appear near **dog**, **cat**, or **truck**?\n- What **differences** do you observe between animal-related and vehicle-related text?\n- Which words seem **interchangeable** based on how they are used?\n- What **patterns** help infer meaning from the data?\n\n## General Insights\n\nThese categories are artificial and were chosen to illustrate the process.\nRelated approaches are used to prepare and analyze large text corpora for modern LLMs.\n\nBy examining token frequency, category differences, and co-occurrence\n(which words appear near each other),\nthe **measurable structure of text** begins to appear.\n\nWords used in similar contexts exhibit similar patterns,\nand groups of related terms emerge naturally from the data.\n\nEven before any modeling, we can begin to distinguish categories\nand see how meaning is reflected through **patterns of use**.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdenisecase%2Fnlp-03-text-exploration","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdenisecase%2Fnlp-03-text-exploration","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdenisecase%2Fnlp-03-text-exploration/lists"}