{"id":13579323,"url":"https://github.com/MelinaPl/speech-act-analysis","last_synced_at":"2025-04-05T20:33:49.308Z","repository":{"id":187877594,"uuid":"432835850","full_name":"MelinaPl/speech-act-analysis","owner":"MelinaPl","description":"A speech act analysis of offensive language in German Tweets - an annotated datatset.","archived":false,"fork":false,"pushed_at":"2024-12-12T12:15:05.000Z","size":4170,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-12T13:23:42.076Z","etag":null,"topics":["hate-speech-detection","offensive-language","speech-acts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MelinaPl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-28T22:04:34.000Z","updated_at":"2024-12-12T12:15:09.000Z","dependencies_parsed_at":"2024-08-01T15:30:30.818Z","dependency_job_id":"6a9e8ae4-54e5-4f84-a091-e2fc8cdf305b","html_url":"https://github.com/MelinaPl/speech-act-analysis","commit_stats":null,"previous_names":["melinapl/speech-act-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MelinaPl%2Fspeech-act-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MelinaPl%2Fspeech-act-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MelinaPl%2Fspeech-act-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/MelinaPl%2Fspeech-act-analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MelinaPl","download_url":"https://codeload.github.com/MelinaPl/speech-act-analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247399874,"owners_count":20932876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hate-speech-detection","offensive-language","speech-acts"],"created_at":"2024-08-01T15:01:38.379Z","updated_at":"2025-04-05T20:33:44.297Z","avatar_url":"https://github.com/MelinaPl.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# A Speech Act Analysis of Offensive Language in German Tweets\n\nThis repository provides an annotated dataset which constitutes a subset of 600 tweets taken from the dataset by Struß, Siegel, Ruppenhofer, Wiegand, and Klenner (2019) that consists of German offensive and non-offensive language tweets. The annotated dataset comprises three levels of annotation, namely coarse-grained speech acts, fine-grained speech acts and sentence types. \n\nFrom each of the six classes (implicit, explicit, profanity, insult, abuse, other), 100 tweets were randomly collected. For the two classes implicit and explicit, the 2019 gold standard files of the test data of subtask 3 were used and for the other four classes, the 2019 gold standard files of the test data from subtask 1 \u0026 2 were used. 
For both test datasets, the data was shuffled using Python's `random` package, and for each class the first 100 occurrences were selected. Every tweet was saved as a text file named according to the following scheme: [dataSource]“\\_Tweet\\_”[idNew]“\\_”[idOld]“\\_”[offensiveCategory]“.txt” (example: ‘s3\\_Tweet\\_99\\_731\\_implicit.txt’). Due to an error, eleven tweets had to be removed; they were replaced with tweets taken from the remaining set of randomly selected tweets. The open-source tool INCEpTION (Klie, Bugert, Boullosa, de Castilho, \u0026 Gurevych, 2018) was used for annotation.\n\n- [Link](https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt) to original data from subtask 1 \u0026 2\n- [Link](https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask3.txt) to original data from subtask 3 \n\n## Note: New Version (1.1) Available (Fixed Bugs)\n\n13.10.2023: Version 1.1 of the dataset, which contains bug fixes, was uploaded.\n\nThe dataset is available [here](https://github.com/MelinaPl/speech-act-analysis/blob/main/data/version_1-1.json).\n\nFor documentation of the changes, see [version_1-1_changes.md](https://github.com/MelinaPl/speech-act-analysis/blob/main/version_1-1_changes.md). Version 1.1 is only available in JSON format. Please do not use the old version, as it contains bugs that have been fixed in version 1.1.\n\n## Dataset \n\nThe dataset is available in different formats: JSON, XML and the original downloaded version from [INCEpTION](https://inception-project.github.io/). 
The data is located in the `data/` directory.\n\n### JSON\n\nThere are two types of JSON files available for download: \n\n- [annotations_without_text.json](https://github.com/MelinaPl/speech-act-analysis/blob/main/data/annotations_without_text.json)\n- [annotations_with_text.json](https://github.com/MelinaPl/speech-act-analysis/blob/main/data/annotations_with_text.json)\n\nThe files each contain all 600 annotated tweets in the following format:\n\n```\n{\n    \"s1-2_Tweet_100_1258_other.xml\": {\n        \"tweet\": {\n            \"scount\": 2,\n            \"sentences\": {\n                \"1\": {\n                    \"stype\": \"ment\",\n                    \"coarse\": \"DIRECTIVE\",\n                    \"fine\": \"ADDRESS\"\n                },\n                \"2\": {\n                    \"stype\": \"frag\",\n                    \"coarse\": \"UNSURE\",\n                    \"fine\": \"UNSURE\"\n                }\n            }\n        }\n    }, \n    ....\n\n```\n\n\n### XML\n\nThe compressed archive `data_annotated_complete.tar.gz` contains all 600 annotated XML files. \n\n## Code\n\nAll code files can be found in [src/](https://github.com/MelinaPl/speech-act-analysis/tree/main/src)\n\n- `process.py`: shows how the data from the GermEval shared task was processed\n- `statistics.py`: shows examples of how the data was retrieved for the statistical analysis\n- `process_annotations.py`: shows how the annotations were processed \n\n\n## Annotations\n\nThe following annotation scheme is mainly inspired by Searle (1979) and Compagno et al. (2018) with regard to the speech act level. Furthermore, the idea of combining speech acts with syntactic categories is influenced by Weisser (2018). The sentence types are based on the sentence types used in the Georgetown University Multilayer Corpus (Zeldes, 2017) and contain eleven types in total. 
These categories are based on Leech, McEnery, and Weisser (2003) which explains the great similarity between the sentence types used in the works by Zeldes (2017) and Weisser (2018). \n\nThe following examples are all taken from the data except for the example of the class *Accept* (marked with an asterisk)\n\n- [Link](https://corpling.uis.georgetown.edu/wiki/doku.php?id=gum:tokenization_segmentation) to the documentation of the Georgetown University Multilayer Corpus \n- [Link](https://github.com/amir-zeldes/gum) to the Github repository of the Georgetown University Multilayer Corpus (Zeldes, 2017)\n\n### Speech Acts\n\nCoarse-grained | Fine-grained | Examples |\n| ----------- | ----------- | ----------- | \nAssertive | Assert | “Genderstudies stehen in ihrer 20 jährigen Existenz stärker im Konflikt mit den existierenden Wissenschaften als alles davor.” | \n| | Sustain  | “Er geht mir ziemlich auf den Keks, aber wegen Vorstehendem habe ich ihn noch nicht einfach geblockt!” | \n |  | Guess  | “Möglicherweise bin ich der Einzige, der den heterosexuellen Mann vor dem Feminismus retten kann.” | \n | | Predict  | “es werden paar hundert wenn es hoch kommt” | \n |  | Agree |  “das ist ein punkt, stimmt.” | \n  | | Disagree  | (begin context) *@AcarLukas @allesevolution ...Dem Fehlschluss dass eine These bewiesen ist, wenn sie nicht zu 100% entkräftet werden kann.* (end context) “\\|LBR\\| Leider funktioniert das nicht so.”\n Expressive  | Rejoice  | “gut dass es #ORF gibt” | \n  | | Complain |  “Selten son Dreck im Fernsehen gesehen wie diese #krone18” | \n  | | Wish  | “Schönen Freitag.” | \n | |  Apologize  | “btw sorry ob meiner polemik im ausgangstweet” | \n |  | Thank  | “Danke für die Aufklärung”  | \n |  | expressEmoji  | “🙃” | \nDirective  | Require  | “Schämen Sie sich.” | \n | |  Request  | “Warum veröffentlicht ihr keine Bilder von linken Anarchisten?” | \n |  | Suggest  | “Die linke,deutsch/islamische #Bundesregierung kann den #korantreuen #Moslems #IS 
#Hamas doch gleich den Schlüssel zu Deutschland überreichen.” | \n |  | Greet  | “Hallo liebe Freunde des deutschen Handballsports” | \n |  | Address  | “@DanielDOrville2 @jogginghosenafa” | \nCommissive |  Engage |  “Ich gehe jetzt pennen.” | \n |  | Accept*  | “Ja, das kann ich für dich machen” | \n |  | Refuse  | (begin context) *Ein echter Mann und echte Kinder?* (end context) “Gott behüte.” | \n |  | Threat  |  “euch zeigen wir noch wo es lang geht.” | \nOther  | Other | “#Migrationsbericht” | \nUnsure  | Unsure |  “OK.....!” | \n\n\n### Sentence Types \n\nSentence Types | Description | Examples |\n| ----------- | ----------- | ----------- | \nDeclarative  |  Declarative sentence (only indicative) | “Ich hatte allerdings mit mehr Demonstranten gerechnet” | \nExclamative  | Exclamative sentence  | “Wir schaffen das!” | \nImperative  | Finite verb needs to be in imperative mood | “Glaubt nicht ihre Lügen” | \nConjunctive  |  Finite verb is in conjunctive mood | “Man könnte den Islam auch verlassen.” | \nYes-/No-Question | A question which can be answered with “yes” or “no” |  “Stimmt das?” | \nAlternative Question  | Questions asking the addressee to decide for one option | “Wie lange geht so ein Handballspiel? Zwei, drei Stunden?” | \nW-Question  | Questions formed with w-phrases | “Wie zerstört man am effektivsten die Zukunft eines Landes?” | \nInterjection  | Short exclamations, annotated if they form a sentence on their own | “Hahaha!”, “Pfui.” | \nMultiple  | Combination of two or more types due to the conjunction of two main clauses | “@poothverona sieht mit ihrem Schlauchboot im Gesicht nur noch richtig scheiße aus und sollte besser nur noch zu Halloween auftreten” | \nOther  | Sentence types not fitting in other categories (e.g. 
using English phrases/ sentences, constructions with “:”)  | “same bruder”, “Der Bauch sagt: hau raus,das passt schon.”, “Kalter Winter = Klimawandel” | \nFragment |  Containing an ellipsis/ no subject predicate structure/ finite verb  | “Echtes Kunststück, dass sich da nicht ausnahmslos jeder verarscht fühlt.” | \nNon-textual  | Non-textual units such as symbols and emojis  | “:-D”, “:pray:”, “:joy:” | \nMention |  Mentioning a person/ another Twitter account, only annotated if not part of a sentence | “@mariebreizh56 @QueeniePi” | \nHashtag |  Initial #, only annotated if not part of a sentence | “#CDUbpt18” | \n\n## References\n\nAustin, J. L. (1962). How to Do Things with Words. Oxford University Press.\n\nCompagno, D., Epure, E., Deneckere, R., \u0026 Salinesi, C. (2018). Exploring\nDigital Conversation Corpora with Process Mining. Corpus Pragmatics,\n2, 193–215. doi: 10.1007/s41701-018-0030-6\n\nKlie, J.-C., Bugert, M., Boullosa, B., de Castilho, R. E., \u0026 Gurevych, I. (2018,\nJune). The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented\nInteractive Annotation. In Proceedings of the 27th International\nConference on Computational Linguistics: System Demonstrations (pp.\n5–9). Santa Fe, New Mexico: Association for Computational Linguistics.\nRetrieved from https://aclanthology.org/C18-2002\n\nLeech, G., McEnery, T., \u0026 Weisser, M. (2003). SPAAC Speech-Act Annotation\nScheme. University of Lancaster, Technical Report.\n\nSearle, J. R. (1979). Expression and Meaning: Studies in the Theory of Speech Acts. Cambridge University\nPress, Cambridge.\n\nStruß, J. M., Siegel, M., Ruppenhofer, J., Wiegand, M., \u0026 Klenner, M. (2019).\nOverview of GermEval Task 2, 2019 Shared Task on the Identification\nof Offensive Language. In German Society for Computational Linguistics.\nProceedings of the 15th Conference on Natural Language Processing\n(KONVENS) 2019 (pp. 354–365). Nürnberg/Erlangen.\n\nWeisser, M. (2018). 
How to Do Corpus Pragmatics on Pragmatically Annotated\nData: Speech Acts and Beyond. Amsterdam/Philadelphia: John\nBenjamins Publishing Company. doi: 10.1075/scl.84\n\nZeldes, A. (2017). The GUM Corpus: Creating Multilayer Resources in the\nClassroom. Language Resources and Evaluation, 51(3), 581–612. doi:\n10.1007/s10579-016-9343-x\n\nShield: [![CC BY 4.0][cc-by-shield]][cc-by]\n\nThis work is licensed under a\n[Creative Commons Attribution 4.0 International License][cc-by].\n\n[![CC BY 4.0][cc-by-image]][cc-by]\n\n[cc-by]: http://creativecommons.org/licenses/by/4.0/\n[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png\n[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMelinaPl%2Fspeech-act-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMelinaPl%2Fspeech-act-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMelinaPl%2Fspeech-act-analysis/lists"}