{"id":18859985,"url":"https://github.com/stanfordnlp/edu-convokit","last_synced_at":"2025-04-15T00:02:07.019Z","repository":{"id":214737195,"uuid":"736429271","full_name":"stanfordnlp/edu-convokit","owner":"stanfordnlp","description":"Edu-ConvoKit: An Open-Source Framework for Education Conversation Data","archived":false,"fork":false,"pushed_at":"2024-08-08T20:39:13.000Z","size":15709,"stargazers_count":88,"open_issues_count":1,"forks_count":13,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-28T12:03:12.534Z","etag":null,"topics":["data","data-analysis","data-science","education","language","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://edu-convokit.readthedocs.io/en/latest/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stanfordnlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-27T22:11:14.000Z","updated_at":"2025-03-27T11:48:47.000Z","dependencies_parsed_at":"2023-12-30T11:26:14.775Z","dependency_job_id":"f190d033-f562-4f34-b714-d38cdd9ec153","html_url":"https://github.com/stanfordnlp/edu-convokit","commit_stats":null,"previous_names":["rosewang2008/edu-toolkit","rosewang2008/edu-convokit"],"tags_count":0,"template":false,"template_full_name":"readthedocs/tutorial-template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Fedu-convokit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Fedu-convokit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/stanfordnlp%2Fedu-convokit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stanfordnlp%2Fedu-convokit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stanfordnlp","download_url":"https://codeload.github.com/stanfordnlp/edu-convokit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248981263,"owners_count":21193145,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","data-analysis","data-science","education","language","natural-language-processing"],"created_at":"2024-11-08T04:20:09.356Z","updated_at":"2025-04-15T00:02:06.893Z","avatar_url":"https://github.com/stanfordnlp.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/full_logo.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n\u003ch1\u003e\u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/logo.png\" height=\"30\" /\u003e Edu-ConvoKit: An Open-Source Framework for Education Conversation Data \u003c/h1\u003e\n\n**Accepted to NAACL 2024, Systems Track**\n\nThe **Edu-ConvoKit** is an open-source framework designed to facilitate the study of conversation language data in educational settings.\nIt provides a practical and efficient pipeline for essential tasks such as text pre-processing, annotation, and analysis, tailored to meet the needs of researchers and developers.\nThis 
toolkit aims to enhance the accessibility and reproducibility of educational language data analysis, as well as advance both natural language processing (NLP) and education research.\nBy simplifying these key operations, the Edu-ConvoKit supports the efficient exploration and interpretation of text data in education.\n\nOur publication on Edu-ConvoKit can be found here: https://arxiv.org/pdf/2402.05111.pdf\n\n## 📖 Table of Contents\n[**Installation**](#installation) | [**Tutorials**](#tutorials) | [**Example Usage**](#example-usage) | [**Documentation**](https://edu-convokit.readthedocs.io/en/latest/) | [**Papers with Edu-ConvoKit**](papers.md) | [**Citation**](#citation) | [**Future Extensions**](#future-extensions) | [**Contributing**]() | [**Contact**](#contact)\n\n## Installation\n\nYou can install `edu-convokit` with pip:\n\n```bash\n\npip install edu-convokit\n\n```\n\n## Overview of the `Edu-ConvoKit` Pipeline\n\nThe **Edu-ConvoKit** pipeline consists of three key modules: `preprocess`, `annotate`, and `analyze`.\nThe pipeline is designed to be modular, so you can use any combination of these modules to suit your needs.\nThis pipeline has also been used by prior work in the field of education, so you can easily reproduce their results. 
\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/main_figure.png\"/\u003e\n\u003c/p\u003e\n\n## Tutorials\n\nWe have provided a series of tutorials to help you get started with the `Edu-ConvoKit`.\n\n### Demo Video\n\nHere is a 2-minute demo of what `Edu-ConvoKit` can do.\n\n\n\u003cp\u003e\n  \u003ca href=\"https://youtu.be/zdcI839vAko?si=yOOgiBAR3wIdE5IV\"\u003e\n    \u003cimg src=\"assets/video.png\" width=\"200\"/\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\n### Basics of `Edu-ConvoKit`\n\nThere are three key modules of the `Edu-ConvoKit` pipeline: `preprocess`, `annotate`, and `analyze`.\n\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][textcolab] [Tutorial: Text Pre-processing][textcolab]\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][annotationcolab] [Tutorial: Annotation][annotationcolab]\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][analyzecolab] [Tutorial: Analysis][analyzecolab]\n...\n\n### Datasets with `Edu-ConvoKit`\n\nWe've applied the `Edu-ConvoKit` to a variety of datasets. 
Here are some examples:\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][nctecolab] [Tutorial: NCTE Dataset][nctecolab]\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][ambercolab] [Tutorial: Amber Dataset][ambercolab]\n* [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)][talkmovescolab] [Tutorial: Talk Moves Dataset][talkmovescolab]\n\n\n## Example Usage\n\nContents:\n- [Pre-Processing](#pre-processing)\n- [Annotation](#annotation)\n- [Analysis](#analysis)\n  - [Qualitative Analysis](#-qualitative-analysis)\n  - [Quantitative Analysis](#-quantitative-analysis)\n  - [Lexical Analysis](#-lexical-analysis)\n  - [Temporal Analysis](#-temporal-analysis)\n  - [GPT Analysis](#-gpt-analysis)\n\n\n### Pre-Processing\n\nThe `preprocess` module provides a set of tools for cleaning and formatting raw text data. Text pre-processing is a critical step in handling education language data. \n- It ensures the data is clean (education data is notoriously messy). \n- It ensures the data is standardized, ready for annotation and analysis. 
\n- It ensures that the students and educators are anonymized; this is important to protect the privacy of individuals involved and allow for safe secondary data analysis.\n\nHere's an example of using `preprocess` to anonymize the dataset with known names:\n\n```python \n\n\u003e\u003e from edu_convokit.preprocessors import TextPreprocessor\n# For helping us flexibly load data\n\u003e\u003e from edu_convokit import utils\n\n# First get the data\n\u003e\u003e !wget \"https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx\"\n\u003e\u003e data_fname = \"Boats and Fish 2_Grade 4.xlsx\"\n\u003e\u003e df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json\n\n# Show some lines that contain names in the speaker and text columns.\n\u003e\u003e df.iloc[25:35]\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/not_anonymized.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n💡 Note: We see that the names occur in the speaker and text column. \n- e.g., names like David and Meredith appear in the speaker and text column. \n- The teacher is always shortened to “T” in the speaker column.\n\nWe can use the `TextPreprocessor` to anonymize the data in both columns.\n\n```python\n\n# Creating variables for the columns we want to use\n\u003e\u003e TEXT_COLUMN = \"Sentence\"\n\u003e\u003e SPEAKER_COLUMN = \"Speaker\"\n\n# Show the names of the speakers. In your use case, you might load this from a file or database.\n\u003e\u003e print(df[SPEAKER_COLUMN].unique())\n['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']\n\n# Create list of names and replacement names. 
We will make the replacement names unique so that we can easily find them later.\n\u003e\u003e known_names = [\"David\", \"Meredith\", \"Beth\"]\n\u003e\u003e known_replacement_names = [f\"[STUDENT_{i}]\" for i in range(len(known_names))]\n\u003e\u003e print(known_replacement_names)\n['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']\n\n# Now let's anonymize the names in the text!\n\u003e\u003e processor = TextPreprocessor()\n\u003e\u003e df = processor.anonymize_known_names(\n    df=df,\n    text_column=TEXT_COLUMN,\n    names=known_names,\n    replacement_names=known_replacement_names,\n    # We will directly replace the names in the text column.\n    # If you want to keep the original text, you can set `target_text_column` to a new column name.\n    target_text_column=TEXT_COLUMN\n)\n\u003e\u003e df.iloc[25:35]\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/sentence_anonymized.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n💡 Note: Nice, we can see that the text has been anonymized (e.g., line 31)! Now let's anonymize the names in the speaker column.\n\n```python\n\ndf = processor.anonymize_known_names(\n    df=df,\n    text_column=SPEAKER_COLUMN,\n    names=known_names,\n    replacement_names=known_replacement_names,\n    target_text_column=SPEAKER_COLUMN\n)\n\ndf.iloc[25:35]\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/anonymized.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n🎉 Great, now we have anonymized the speaker names as well! Some other great things are that: \n- We have a record of the original names and the anonymized names. So if we want to go back to the original names, we can do that. 
\n- The anonymized names are consistent: So [STUDENT_0] in the SPEAKER_COLUMN will refer to the same [STUDENT_0] in the TEXT_COLUMN.\n\nWe can also use the `TextPreprocessor` to group the utterances from the same speaker together.\n\n```python\ndf = processor.merge_utterances_from_same_speaker(\n    df=df,\n    text_column=TEXT_COLUMN,\n    speaker_column=SPEAKER_COLUMN,\n    # We're going to directly replace the text in the text column.\n    target_text_column=TEXT_COLUMN\n)\n\ndf.iloc[25:35]\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/merged.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n### Annotation\n\nAnnotation is a critical step in understanding your data, and it is important to do it right and consistently across datasets. \nAnnotation is useful because: \n- It creates descriptive statistics about your data, which can help you understand the data. \n- It quantifies the language used by your students and educators, which can help you understand the language. \n- It measures the interaction between the student and the educator, which can help you understand the interaction.\n\nEdu-ConvoKit is designed to support these purposes with the `annotator` module. 
\nHere's an example of using `annotator` to annotate the dataset for talktime, student reasoning, and teacher conversational uptake.\nWe're going to use the same dataframe from the previous example; so `df` will be the dataframe with anonymized names and merged utterances.\n\n```python\n\n\u003e\u003e from edu_convokit.annotation import Annotator\n\u003e\u003e annotator = Annotator()\n\n# The talktime values will be populated in this column\n\u003e\u003e TALK_TIME_COLUMN = \"talktime\"\n\u003e\u003e df = annotator.get_talktime(\n    df=df,\n    text_column=TEXT_COLUMN,\n    analysis_unit=\"words\",\n    output_column=TALK_TIME_COLUMN\n)\n\ndf.head()\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/talktime.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n🎉 We can see with a single function call, we’ve added our first annotation – talktime – to our data!\n\nLet's do the same for student reasoning and teacher conversational uptake.\n\n```python\n\n# The reasoning annotations will be populated in this column\n\u003e\u003e STUDENT_REASONING_COLUMN = \"student_reasoning\"\n\u003e\u003e df = annotator.get_student_reasoning(\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    text_column=TEXT_COLUMN,\n    output_column=STUDENT_REASONING_COLUMN,\n    # Since this model is only trained on _student_ utterances,\n    # we can explicitly pass in the speaker names associated to students.\n    # It will only annotate utterances from these speakers.\n    speaker_value=known_replacement_names,\n)\n\n# The uptake annotations will be populated in this column.\n\u003e\u003e UPTAKE_COLUMN = \"uptake\"\n\u003e\u003e df = annotator.get_uptake(\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    text_column=TEXT_COLUMN,\n    output_column=UPTAKE_COLUMN,\n    # Conversation uptake is about how much the teacher builds on what the students say.\n    # So, we want to specify the first speaker to be the 
students.\n    speaker1=known_replacement_names,\n    # And the second speaker to be the teacher.\n    speaker2='T'\n)\n\u003e\u003e df.head(20)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/annotations.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\nWith these annotations, we can now do some analysis on our data. We can save the annotated data to a file for later use.\n\n```python\n\n\u003e\u003e df.to_csv(\"annotated_data.csv\", index=False)\n```\n\n### Analysis\n\nAnalyzing your data with `Edu-ConvoKit` can help you understand the language used by your students and educators.\nAnalysis of education data can happen in many ways. \n- 🔍 It can happen qualitatively where you look at the data; for example, we annotated the data for student_reasoning and you might be interested in looking at the specific instances of student reasoning. \n- 📊 It can also happen quantitatively where you look at the data in aggregate; for example, you might be interested in the average amount of time the student and educator talk. \n- 💬 You might be interested in lexically analyzing the data; for example, you might be interested in what words the student uses the most compared to the educator. \n- 📈 You might be interested in temporally analyzing the data; for example, you might be interested in the amount of time the student and educator talk over the course of their interaction.\n- 🤖 You might also be interested in using **GPT** to analyze the data; for example, you might want GPT-4 to summarize the transcript.\n\nHere's an example of using `analyzer` for each of these analysis types. We'll be using the same dataframe from the previous example; so `df` will be the dataframe with annotations.\n\n#### 🔍 Qualitative Analysis\n\nLet’s start by looking at the data qualitatively. 
We will use the `QualitativeAnalyzer` to look at the data.\nEver wondered whether there was a quick way to … \n- Look at examples of student reasoning? \n- Look at examples of high conversational uptake by the educator?\n\nThis is where the `QualitativeAnalyzer` comes in handy!\n\nLet's look at some examples of student reasoning.\n\n```python\n\n\u003e\u003e from edu_convokit.analyzers import QualitativeAnalyzer\n\u003e\u003e qual_analyzer = QualitativeAnalyzer()\n\n\u003e\u003e qual_analyzer.print_examples(\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    text_column=TEXT_COLUMN,\n    # We want to look at examples for the reasoning feature\n    feature_column=STUDENT_REASONING_COLUMN,\n    # We want to look at positive examples of reasoning\n    feature_value=1.0,\n    # Let's look at 3 examples\n    max_num_examples=3\n)\nstudent_reasoning: 1.0\n\u003e\u003e [STUDENT_1]: one half by one sixth. Cause if you put six ones up to a whole\n\nstudent_reasoning: 1.0\n\u003e\u003e [STUDENT_0]: This would be one third, and this is one half of  dark green, and then it would be bigger by one sixth, because\n\nstudent_reasoning: 1.0\n\u003e\u003e [STUDENT_1]: I mean it's bigger than one tenth, I mean twelfth, one twelfth, one  twelfth.\n```\n\n🎉 Great! 
This is a quick way to look at examples of student reasoning.\n\nIf you’re interested in seeing the preceding and succeeding utterances (for context), you can easily do that by specifying the number of lines:\n\n```python\n\n\u003e\u003e qual_analyzer.print_examples(\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    text_column=TEXT_COLUMN,\n    feature_column=STUDENT_REASONING_COLUMN,\n    feature_value=1.0,\n    max_num_examples=3,\n    # We want to look at the previous 2 lines of context\n    show_k_previous_lines=2,\n    # We also want to look at the next 1 line of context\n    show_k_next_lines=1\n)\n\nstudent_reasoning: 1.0\n[STUDENT_1] and [STUDENT_0]: is bigger than\nT: You both agree?\n\u003e\u003e [STUDENT_1]: one half by one sixth. Cause if you put six ones up to a whole\n[STUDENT_0]: dark green\n\nstudent_reasoning: 1.0\nT: Ok\n[STUDENT_1]: So there’s six sixths\n\u003e\u003e [STUDENT_0]: This would be one third, and this is one half of  dark green, and then it would be bigger by one sixth, because\nT: Do you both agree with that?\n\nstudent_reasoning: 1.0\n[STUDENT_1]: Yeah, and you take the two dark green rods, those are the halves... And you take two  thirds, and put it up to it, and you take... two sixths, it's bigger than  two sixths. And  in this one, it you take this ...\nT 2: Can we go back to that one again?\n\u003e\u003e [STUDENT_1]: I mean it's bigger than one tenth, I mean twelfth, one twelfth, one  twelfth.\nT 2: How does that work? I'm confused about that. I’m confused about  the little white rods, I am following you right up to that point.\n\n```\n\n🎉 Awesome! Another handy way to use `QualitativeAnalyzer` is to look at both positive and negative examples of student reasoning. 
You can do that by omitting the feature values specification:\n\n```python\n\n\u003e\u003e qual_analyzer.print_examples(\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    text_column=TEXT_COLUMN,\n    feature_column=STUDENT_REASONING_COLUMN,\n    max_num_examples=3,\n    show_k_previous_lines=2,\n    show_k_next_lines=1,\n    # feature_value=\"1.0\", Omitted!\n)\nstudent_reasoning: 0.0\nT: I'm wondering which is bigger, one half or two thirds. Now  before you model it you might think in your head, before you begin  to model it what you is bigger and if so, if one is bigger, by how  much. Why don’t you work with your partner and see what you can  do.\n\u003e\u003e [STUDENT_0]: Try the purples. Get three purples. It doesn’t work, try the greens\n[STUDENT_1]: What was it? Two thirds?\n\nstudent_reasoning: 0.0\n[STUDENT_0]: Try the purples. Get three purples. It doesn’t work, try the greens\n[STUDENT_1]: What was it? Two thirds?\n\u003e\u003e [STUDENT_0]: It would be like brown or something like that.\n[STUDENT_1]: Ok\n\nstudent_reasoning: 0.0\n[STUDENT_0]: It would be like brown or something like that.\n[STUDENT_1]: Ok\n\u003e\u003e [STUDENT_0]: We’re not doing the one third, we’re doing two thirds. That is one  third\n[STUDENT_1]: First we’ve got to find out what a third of it is. What’s a third of an  orange?\n\nstudent_reasoning: 1.0\n[STUDENT_1] and [STUDENT_0]: is bigger than\nT: You both agree?\n\u003e\u003e [STUDENT_1]: one half by one sixth. Cause if you put six ones up to a whole\n[STUDENT_0]: dark green\n\nstudent_reasoning: 1.0\nT: Ok\n[STUDENT_1]: So there’s six sixths\n\u003e\u003e [STUDENT_0]: This would be one third, and this is one half of  dark green, and then it would be bigger by one sixth, because\nT: Do you both agree with that?\n\nstudent_reasoning: 1.0\n[STUDENT_1]: Yeah, and you take the two dark green rods, those are the halves... And you take two  thirds, and put it up to it, and you take... two sixths, it's bigger than  two sixths. 
And  in this one, it you take this ...\nT 2: Can we go back to that one again?\n\u003e\u003e [STUDENT_1]: I mean it's bigger than one tenth, I mean twelfth, one twelfth, one  twelfth.\nT 2: How does that work? I'm confused about that. I’m confused about  the little white rods, I am following you right up to that point.\n\n```\n\nLooking at examples of other annotations is similar---just specify the feature column and feature value.\n\n#### 📊 Quantitative Analysis\n\nLet’s start by looking at the data quantitatively. We will use the `QuantitativeAnalyzer` to look at the data.\n\nEver wondered whether there was a quick way to report aggregate statistics on: \n- Talk time? \n- Student reasoning? \n- Conversational uptake?\n\nThis is where the QuantitativeAnalyzer comes in handy!\n\nLet’s say we want to understand the talk time percentage split between the student and educator. We might want to know the statistic and plot the data. Here’s how we can do that:\n\n```python   \n\u003e\u003e from edu_convokit.analyzers import QuantitativeAnalyzer\n\u003e\u003e analyzer = QuantitativeAnalyzer()\n# Create speaker mapping to A, B, C, D\n\u003e\u003e label_mapping = {\n    speaker: chr(ord('A') + i) for i, speaker in enumerate(df[SPEAKER_COLUMN].unique())\n}\n\u003e\u003e analyzer.plot_statistics(\n    # Everything else is the same\n    feature_column=TALK_TIME_COLUMN,\n    df=df,\n    speaker_column=SPEAKER_COLUMN,\n    value_as=\"prop\",\n    # We want to use the mapping we created\n    label_mapping=label_mapping\n)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/talktime_quantitative.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\nThe other statistics are similar to the example above for talk time. 
For conciseness, we will omit them here; but the only thing you need to do is change the feature_column argument to the feature you want to analyze.\n\n\n#### 💬 Lexical Analysis\nA lexical analysis is an analysis on the words used in the data. This is useful for understanding the low-level language (i.e., word usage) used by the student and educator. While a lexical analysis may be too low-level for capturing e.g., the meaning of the discourse, it can be a useful first step in capturing language trends in the data.\n\nWe will give a simple demonstration of two features of the `LexicalAnalyzer`: \n- Word Frequency: We will look at the most frequent words used by each speaker. \n- Log-Odds: We will look at the log-odds of words used by each speaker (i.e., which words are more likely to be used by the student vs. the educator).\n\n```python\n\n\u003e\u003e from edu_convokit.analyzers import LexicalAnalyzer\n\u003e\u003e analyzer = LexicalAnalyzer()\n\n\u003e\u003e analyzer.print_word_frequency(\n    df=df,\n    text_column=TEXT_COLUMN,\n    speaker_column=SPEAKER_COLUMN,\n    # We want to look at the top 5 words\n    topk=5,\n    # We want to format the text e.g., remove punctuation and stopwords (https://en.wikipedia.org/wiki/Stop_word)\n    run_text_formatting=True\n)\nTop Words By Speaker\nT\nworks: 6\none: 5\nbigger: 4\nmodels: 4\nwrite: 3\n\n\n[STUDENT_0]\none: 10\nwould: 5\nthird: 4\nyeah: 3\nbigger: 3\n\n\n[STUDENT_1]\none: 16\ntwo: 13\ntake: 9\nput: 8\nyeah: 8\n\n\n[STUDENT_2]\none: 1\nhalf: 1\n\n\n[STUDENT_1] and [STUDENT_0]\nwell: 1\nbigger: 1\none: 1\n\n\nT 2\none: 9\ntwo: 8\ninteresting: 5\nokay: 4\ntwelfths: 4\n```\n\n💡 We can see that there’s a lot of use of numbers and fractions (“one”, “third”), in addition to comparison language (“bigger”).\n\nIf you want to see the most frequent words overall, you can omit the speaker_column argument:\n\n```python\n\n\u003e\u003e analyzer.print_word_frequency(\n    df=df,\n    text_column=TEXT_COLUMN,\n    # Bye! 
We don't care about the speaker anymore\n    # speaker_column=SPEAKER_COLUMN,\n    topk=5,\n    run_text_formatting=True\n)\n\none: 42\ntwo: 25\nbigger: 15\nyeah: 12\nsixth: 11\n```\n\nLet’s now move onto the log-odds analysis.\n\n💡 Why a log-odds analysis?\n\nGoing beyond just counting frequent words in the student and teacher’s utterances, we might be interested in the chances of a word occurring in the student’s text over it occurring in the teacher’s text.\n\nThis gets us to a log-odds analysis on the words. For more information on log-odds analysis, please refer to this paper which applied the same analysis to study language use in political speeches.\n\nIn order to run a log-odds analysis, we need to specify two groups of texts we want to compare to each other. In this case, we are interested in comparing the student’s text to the teacher’s text. So let’s split our original dataframe into these two groups and pass them into the log-odds analysis.\n\n```python\n\n\u003e\u003e speakers = df[SPEAKER_COLUMN].unique()\n# Student speakers are ones that contain STUDENT in their name\n\u003e\u003e student_speakers = [speaker for speaker in speakers if \"STUDENT\" in speaker]\n# Teacher speakers are all the other speakers\n\u003e\u003e teacher_speakers = [speaker for speaker in speakers if speaker not in student_speakers]\n\n# Now let's split the data frame into two data frames: one for student speakers and one for teacher speakers\n\u003e\u003e student_df = df[df[SPEAKER_COLUMN].isin(student_speakers)]\n\u003e\u003e teacher_df = df[df[SPEAKER_COLUMN].isin(teacher_speakers)]\n\n# We can now run the analyzer:\n\u003e\u003e analyzer.print_log_odds(\n    df1=student_df,\n    df2=teacher_df,\n    text_column1=TEXT_COLUMN,\n    text_column2=TEXT_COLUMN,\n    # We want to look at the top 5 words\n    topk=5,\n    # We still want to format the text\n    run_text_formatting=True\n)\n\nTop words for Group 1\nput: 1.5082544501223418\nyeah: 1.4243179151183505\nthird: 
1.323889968245073\ntake: 1.2333974083596422\ngreen: 1.1289734037070032\n\n\nTop words for Group 2\nworks: -1.5890822958913073\nmodels: -1.5890822958913073\ninteresting: -1.447190205063502\nsee: -1.291347559145683\nokay: -1.291347559145683\n```\n\n🎉 Awesome, this shows us both the words and odds of the words used by the student (Group 1) and educator (Group 2).\n\nWe can also plot this with a similar function call:\n\n```python\n\nanalyzer.plot_log_odds(\n    df1=student_df,\n    df2=teacher_df,\n    text_column1=TEXT_COLUMN,\n    text_column2=TEXT_COLUMN,\n    topk=5,\n    run_text_formatting=True,\n    # We can pass plot labels for group 1 and 2\n    group1_name=\"Student\",\n    group2_name=\"Teacher\"\n)\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/logodds.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n#### 📈 Temporal Analysis\n\nA temporal analysis is an analysis on the features over time. We define time as the time over the course of the transcript.\n\nLet’s see how we can look at the talk time ratio (which we summarized before quantitatively) over time. 
This setup looks similar to the quantitative analysis; the key difference is that we pass a `num_bins` argument to set the number of bins the transcript is split into.\n\n```python\n\n\u003e\u003e from edu_convokit.analyzers import TemporalAnalyzer\n\u003e\u003e analyzer = TemporalAnalyzer()\n\n# We'll use the same label mapping as before to abbreviate the speaker names\n\u003e\u003e print(f\"Label mapping: {label_mapping}\")\nLabel mapping: {'T': 'A', '[STUDENT_0]': 'B', '[STUDENT_1]': 'C', '[STUDENT_2]': 'D', '[STUDENT_1] and [STUDENT_0]': 'E', 'T 2': 'F'}\n\n\u003e\u003e analyzer.plot_temporal_statistics(\n            feature_column=TALK_TIME_COLUMN,\n            dfs=df,\n            speaker_column=SPEAKER_COLUMN,\n            # We want to see the proportion of talk time\n            value_as=\"prop\",\n            # We will split the session into 5 bins\n            num_bins=5,\n            # We want to use the mapping we created\n            label_mapping=label_mapping\n        )\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/rosewang2008/edu-convokit/main/assets/talktime_temporal.png\" width=\"500\"/\u003e\n\u003c/p\u003e\n\n🎉 Plotting the temporal data was straightforward!\n\n💡 There are some interesting observations, for example: \n- Speaker B ([STUDENT_0]) decreases their talk time over time whereas \n- Speaker C (the other student, [STUDENT_1]) has fluctuating talk time over time.\n\nPlotting the other statistics is similar to the example above for talk time. For conciseness, we omit them here; the only change needed is to set the `feature_column` argument to the feature you want to analyze.\n\n\n#### 🤖 GPT Analysis\n\nA GPT analysis uses GPT (e.g., ChatGPT or GPT-4) to analyze the data.\nThis is useful for understanding the high-level language (i.e., meaning) used by the student and educator. 
\nWhile a GPT analysis may be too high-level for capturing e.g., the specific words used by the student and educator, it can be a useful first step in capturing meaning in the data.\n\nLet’s see how we can use GPT to summarize the transcript. We will use the `GPTConversationAnalyzer` to do this. There are other prompts you can use for GPT in [this directory](https://github.com/rosewang2008/edu-convokit/tree/main/edu_convokit/prompts/).\n\nYou will first need to set your OpenAI API key as an environment variable.\n\n```python\n\n\u003e\u003e import os\n# Remember to never share your API key with anyone!\n\u003e\u003e os.environ[\"OPENAI_API_KEY\"] = \"YOUR_API_KEY\"\n\n```\n\nWe can now try to summarize the transcript.\n\n```python\n\n\u003e\u003e from edu_convokit.analyzers import GPTConversationAnalyzer\n\u003e\u003e analyzer = GPTConversationAnalyzer()\n\u003e\u003e prompt = analyzer.preview_prompt(\n            df=df,\n            # Using the summarize prompt under prompts/conversation\n            prompt_name=\"summarize\",\n            text_column=TEXT_COLUMN,\n            speaker_column=SPEAKER_COLUMN,\n            model=\"gpt-4\",\n            format_template=\"{speaker}: {text}\",\n        )\n\u003e\u003e print(prompt)\n\nConsider the following transcript of a conversation between a teacher and a student.\n\nTranscript:\nT: I'm wondering which is bigger, one half or two thirds. Now  before you model it you might think in your head, before you begin  to model it what you is bigger and if so, if one is bigger, by how  much. Why don’t you work with your partner and see what you can  do.\n[STUDENT_0]: Try the purples. Get three purples. It doesn’t work, try the greens\n[STUDENT_1]: What was it? Two thirds?\n[STUDENT_0]: It would be like brown or something like that.\n[STUDENT_1]: Ok\n[STUDENT_0]: We’re not doing the one third, we’re doing two thirds. That is one  third\n[STUDENT_1]: First we’ve got to find out what a third of it is. 
What’s a third of an  orange?\n[STUDENT_0]: One third?\n[STUDENT_1]: What’s third of an orange? Let’s start a different model. The green. The green, half of it is the  light green\n[STUDENT_0]: Alright, yeah, I was thinking of that way before\n[STUDENT_1]: And you can take the take the red, and the light green, and put it up  to it , it’s, she asked, is one half bigger than,  what did she ask? What did she ask?\n[STUDENT_0]: She asked, which is bigger, one half or two thirds?\n...\n[STUDENT_1]: Uh hmm.\nT 2: And over here one of the whites you say is one sixth?\n[STUDENT_1]: Yeah.\nT 2: Oh, that’s interesting, two different models. Okay.\n\nPlease summarize the conversation in a few sentences.\n\nSummary:\n```\n\n💡 We can see that the prompt is set up to summarize the transcript. If you are comfortable with the prompt, you can now send it to the model with `run_prompt`.\n\n```python\n\n\u003e\u003e prompt, response = analyzer.run_prompt(\n            df=df,\n            prompt_name=\"summarize\",\n            text_column=TEXT_COLUMN,\n            speaker_column=SPEAKER_COLUMN,\n            model=\"gpt-4\",\n            format_template=\"{speaker}: {text}\",\n        )\n\u003e\u003e print(response)\n\nThe teacher asked the students to determine which is larger, one half or two thirds, and by how much. The students used different colored rods to represent fractions and concluded that two thirds is larger than one half by one sixth. They then tested their theory with different models, using different rods to represent different fractions, and found that their conclusion held true. The teacher encouraged them to write up their findings and explain them to the class.\n```\n\nThere are many other prompts you can use for a GPT analysis. 
For more information, please see our prompts database [here](https://github.com/rosewang2008/edu-convokit/tree/main/edu_convokit/prompts/).\n\n## Papers that have used the `Edu-ConvoKit`\n\nPlease find [here](papers.md) a list of papers that have used the `Edu-ConvoKit`.\n\n## Future Extensions\n\n- Print entire transcript with formatting option\n- Print text examples of top-k log-odds words\n- Enable saving of examples as CSV\n- Add GPT feature annotation\n- Support flexible ways for users to load their own prompts without adding them to the Edu-ConvoKit repo\n- Connect annotations/analysis to outcomes, e.g., with regression methods\n- Support transcription / diarization\n\n## Citation\n\nIf you use the `Edu-ConvoKit` in your research, please cite the following paper:\n\n```\n@inproceedings{wang2024educonvokit,\n      title={Edu-ConvoKit: An Open-Source Library for Education Conversation Data}, \n      author={Rose E. Wang and Dorottya Demszky},\n      booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstrations},\n      year={2024}\n}\n```\n\nIf you would like to be added to the list of papers that have used the `Edu-ConvoKit`, please make a pull request or contact Rose E. 
Wang at rewang@cs.stanford.edu.\n\n[textcolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_text_preprocessing.ipynb\n[annotationcolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_annotation.ipynb\n[analyzecolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_analyzers.ipynb\n[ambercolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_amber.ipynb\n[talkmovescolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_talkmoves.ipynb\n[nctecolab]: https://colab.research.google.com/github/rosewang2008/edu-convokit/blob/main/tutorials/tutorial_ncte.ipynb\n\n## Contributing\n\nWe welcome contributions to the `Edu-ConvoKit`! \nPlease familiarize yourself with our [documentation](https://edu-convokit.readthedocs.io/en/latest/) and [tutorials](#tutorials) before contributing. \nOnce you are familiar with the library, feel free to make a pull request.\n\n## Contact\n\nIf you have any questions, please contact Rose E. Wang at rewang@cs.stanford.edu.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstanfordnlp%2Fedu-convokit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstanfordnlp%2Fedu-convokit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstanfordnlp%2Fedu-convokit/lists"}