https://github.com/rain1024/linguistic_tools
- Host: GitHub
- URL: https://github.com/rain1024/linguistic_tools
- Owner: rain1024
- Created: 2024-05-26T11:01:51.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-08T04:42:02.000Z (about 2 years ago)
- Last Synced: 2025-01-08T17:07:44.613Z (over 1 year ago)
- Language: Python
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Linguistic Tools
This project is a set of linguistic tools designed to assist my **lovely friend** with various language-related tasks.
* [Filter Words](#filter-words)
* [Count Sentences](#count-sentences)
## Filter Words
The Filter Words tool filters the paragraphs of an input document by the words listed in a text file, `query.txt`. It reads paragraphs from the input file in the `inputs` folder, applies the filters, and saves the results to separate files in the `outputs` folder.
### Usage
To use the Filter Words tool, follow these steps:
**Step 1:** Place your Microsoft Word file in the `inputs` folder.
Example:
```
inputs
└── song_mon__nam_cao.docx
```
**Step 2:** Add your query in the `query.txt` file.
Example content for `query.txt`:
```
chα»
các
không,đã
đã,rồi
```
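The example suggests that each line of `query.txt` holds one query: either a single word or a comma-separated pair. A minimal sketch of parsing the file under that assumption follows. `parse_queries` is a hypothetical helper and not necessarily how `filter_words.py` reads the file.

```python
def parse_queries(text):
    """Turn the contents of query.txt into a list of word tuples.

    Assumption: one query per line, words separated by commas.
    """
    queries = []
    for line in text.splitlines():
        # Split on commas and drop surrounding whitespace and empty pieces.
        words = tuple(w.strip() for w in line.split(",") if w.strip())
        if words:  # skip blank lines
            queries.append(words)
    return queries


sample = "các\nkhông,đã\n"
print(parse_queries(sample))
```

Keeping each query as a tuple means single words and word pairs can be handled by the same code later on.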
**Step 3:** Run the following command in your terminal:
```
python filter_words.py
```
This command will execute the script, process the input file, and generate the filtered outputs in the `outputs` folder.
### Output
The processed output files will be saved in the `outputs` folder, each corresponding to the words or word pairs specified in the `query.txt` file.
Example:
```
outputs
├── chα».txt
├── các.txt
├── không-đã.txt
└── đã-rồi.txt
```
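The file names above follow from the queries: the words of a query joined with a hyphen, plus `.txt`. The sketch below shows one way to derive the name and to select matching paragraphs. It assumes a paragraph matches when it contains every word of the query as a substring. The actual matching rule in `filter_words.py` may differ, and both function names are hypothetical.

```python
def output_name(query):
    """Build the output file name for a query, e.g. ("a", "b") -> "a-b.txt"."""
    return "-".join(query) + ".txt"


def filter_paragraphs(paragraphs, query):
    """Keep the paragraphs that contain every word in the query.

    Assumption: plain substring matching; the real script may tokenize first.
    """
    return [p for p in paragraphs if all(word in p for word in query)]


paragraphs = ["không ai đã đến", "các bạn", "đã xong rồi"]
query = ("không", "đã")
print(output_name(query), filter_paragraphs(paragraphs, query))
```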
## Count Sentences
The Count Sentences tool counts the sentences in Microsoft Word documents (.docx) in the `inputs` folder. It extracts the paragraphs of each document, drops empty ones, and saves the remaining text to a temporary file. It also prints the number of sentences found in each document, where each non-empty paragraph counts as one sentence.
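Because every non-empty paragraph is counted as one sentence, the core of the count can be sketched as below. `read_paragraphs` assumes the `python-docx` package is used to read the file, which this README does not state, and all three function names are hypothetical.

```python
def read_paragraphs(docx_path):
    """Return the text of every paragraph in a .docx file.

    Assumption: python-docx is installed (pip install python-docx).
    """
    from docx import Document  # imported lazily so the rest runs without it

    return [p.text for p in Document(docx_path).paragraphs]


def non_empty(paragraphs):
    """Drop paragraphs that are empty or contain only whitespace."""
    return [p.strip() for p in paragraphs if p.strip()]


def count_sentences(paragraphs):
    """Count sentences, treating each non-empty paragraph as one sentence."""
    return len(non_empty(paragraphs))


print(count_sentences(["First.", "", "   ", "Second."]))
```

Note that a paragraph containing several sentences still counts once under this rule.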
### Input
Place your Microsoft Word files (.docx) in the `inputs` folder. The script will automatically detect and process all .docx files within this directory.
Example:
```
inputs
├── document1.docx
└── document2.docx
```
### Command Line Usage
To execute the script, run the following command in your terminal:
```
python count_sentences.py
```
This command will initiate the script, which will process each .docx file in the `inputs` folder.
### Output
The script creates a temporary folder named `tmp` to store the output text files. Each output file corresponds to an input document and contains the extracted paragraphs. The script also prints the number of sentences found in each document to the console.
Example of the temporary folder structure:
```
tmp
├── document1.txt
└── document2.txt
```
Console output:
```
inputs/document1.docx: 10 sentences
inputs/document2.docx: 8 sentences
```
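A sketch of how the two outputs could be produced for one document follows: a text file in `tmp` named after the input, and one console line in the format shown above. It takes paragraphs that have already been extracted. `save_and_report` is a hypothetical helper, and writing one paragraph per line is an assumption about the file layout.

```python
from pathlib import Path


def save_and_report(docx_path, paragraphs, tmp_dir="tmp"):
    """Write non-empty paragraphs to tmp/<name>.txt and print the count."""
    docx_path = Path(docx_path)
    kept = [p for p in paragraphs if p.strip()]

    # Create the temporary folder if it does not exist yet.
    out_dir = Path(tmp_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: one paragraph per line in the output file.
    out_file = out_dir / (docx_path.stem + ".txt")
    out_file.write_text("\n".join(kept) + "\n", encoding="utf-8")

    line = f"{docx_path.as_posix()}: {len(kept)} sentences"
    print(line)
    return line
```

Calling `save_and_report("inputs/document1.docx", paragraphs)` for each file in `inputs` would reproduce the folder layout and console format described in this section.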