Pre-processing text for future cool stuff
- Host: GitHub
- URL: https://github.com/tallesl/text-kitchen-sink
- Owner: tallesl
- Created: 2024-08-16T06:05:50.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-01-28T18:23:04.000Z (3 months ago)
- Last Synced: 2025-01-28T19:30:07.729Z (3 months ago)
- Language: Python
- Homepage:
- Size: 2.24 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Text Kitchen Sink
Pre-processing text for future cool stuff.
## .xxx to .txt
`.pdf` to `.txt`:
```
$ sudo apt install poppler-utils
$ for f in *.pdf; do echo "Processing $f"; pdftotext "$f"; done
```

`.mobi` to `.txt`:
```
$ sudo apt install calibre
$ for f in *.mobi; do echo "Processing $f"; ebook-convert "$f" "${f%.mobi}.txt"; done
```

`.prc` to `.txt`:
```
$ sudo apt install calibre
$ for f in *.prc; do echo "Processing $f"; ebook-convert "$f" "${f%.prc}.txt"; done
```

## .txt
From utf-8 to latin-1:
```
$ cat utf8.txt | iconv -c -f UTF-8 -t ISO-8859-1//IGNORE > latin1.txt
```

Concatenation:
```
$ cat *.txt > all.txt
```

Pre-processing:
```
$ chmod +x no-extra-spaces.py lower.py en-only.py # or pt-only.py
$ cat all.txt | ./en-only.py | ./no-extra-spaces.py | ./lower.py > preprocessed.txt
```
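
The filter scripts above live in the repository and are not reproduced in this README. As an illustration only (the real scripts may differ), a stdin-to-stdout filter in the spirit of `no-extra-spaces.py` could look like this:

```
#!/usr/bin/env python3
# Illustrative sketch, not the repository's actual no-extra-spaces.py:
# reads text from stdin, collapses runs of whitespace, writes to stdout.
import re
import sys

for line in sys.stdin:
    cleaned = re.sub(r"\s+", " ", line).strip()  # squeeze spaces/tabs, trim edges
    if cleaned:  # drop lines that became empty
        print(cleaned)
```

A `lower.py` in the same style would simply print `line.lower()`, and `en-only.py`/`pt-only.py` presumably add language filtering on top of the same pattern.

Statistics: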
```
characters: 741
lines: 4
words: 140
unique words: 84
most frequent words:
• is: 8
• the: 8
• and: 6
• it: 6
• its: 5
• charmander: 4
• a: 4
• pokemon: 4
• when: 4
• in: 3
```
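
The README does not show the command that produces these statistics; assuming a small helper script (the name `stats.py` below is made up), they could be computed with `collections.Counter`:

```
#!/usr/bin/env python3
# Illustrative sketch of a statistics script (not necessarily the one used
# to produce the numbers above). Usage: ./stats.py preprocessed.txt
import sys
from collections import Counter

text = open(sys.argv[1], encoding="utf-8").read()
words = text.split()
counts = Counter(words)

print(f"characters: {len(text)}")
print(f"lines: {len(text.splitlines())}")
print(f"words: {len(words)}")
print(f"unique words: {len(counts)}")
print("most frequent words:")
for word, total in counts.most_common(10):
    print(f"• {word}: {total}")
```

## crawled .html to scraped .csv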
Directory search and flattening:
```
$ chmod +x find-and-flatten.py
$ ./find-and-flatten.py 'crawled_forum/' 'flattened_directory/' 'thread-*.html'
```
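
`find-and-flatten.py` ships with the repository; as a rough idea of what a script with that interface might do (copy every file matching the glob into a single flat directory), consider:

```
#!/usr/bin/env python3
# Illustrative sketch, not the repository's actual find-and-flatten.py:
# recursively find files matching a glob pattern and copy them into one flat
# directory, encoding the original path in the file name to avoid collisions.
import shutil
import sys
from pathlib import Path

source, destination, pattern = sys.argv[1], sys.argv[2], sys.argv[3]
Path(destination).mkdir(parents=True, exist_ok=True)

for path in Path(source).rglob(pattern):
    flat_name = "_".join(path.relative_to(source).parts)  # a/b/c.html -> a_b_c.html
    shutil.copy2(path, Path(destination) / flat_name)
```

File sampling: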
```
$ chmod +x sample-files.py
$ ./sample-files.py 'crawled_forum/' 'crawled_samples/' 15
```
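
Likewise, the real `sample-files.py` is in the repository; a sketch of a script taking the same arguments (source directory, destination directory, sample size) might be:

```
#!/usr/bin/env python3
# Illustrative sketch, not the repository's actual sample-files.py:
# copy a random sample of N files so they can be inspected before scraping.
import random
import shutil
import sys
from pathlib import Path

source, destination, count = sys.argv[1], sys.argv[2], int(sys.argv[3])
Path(destination).mkdir(parents=True, exist_ok=True)

files = [path for path in Path(source).rglob("*") if path.is_file()]
for path in random.sample(files, min(count, len(files))):
    shutil.copy2(path, Path(destination) / path.name)
```

Scraping to `.csv`: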
```
$ chmod +x scrap-to-csv.py
$ pip install beautifulsoup4 tqdm
$ ./scrap-to-csv.py 'crawled_forum/' '.postbody div'
$ head -n 1 scrap.csv
uuid,directory,file,content
```
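
The actual `scrap-to-csv.py` is part of the repository; a rough sketch of a scraper with the same inputs (crawled directory and CSS selector) and the same `uuid,directory,file,content` output, built on BeautifulSoup and tqdm, might be:

```
#!/usr/bin/env python3
# Illustrative sketch, not the repository's actual scrap-to-csv.py:
# walk the crawled directory, extract text from every element matching the
# CSS selector, and write one CSV row per element to scrap.csv.
import csv
import sys
import uuid
from pathlib import Path

from bs4 import BeautifulSoup
from tqdm import tqdm

root, selector = sys.argv[1], sys.argv[2]
files = sorted(Path(root).rglob("*.html"))

with open("scrap.csv", "w", newline="", encoding="utf-8") as output:
    writer = csv.writer(output)
    writer.writerow(["uuid", "directory", "file", "content"])
    for path in tqdm(files):
        html = path.read_text(encoding="utf-8", errors="ignore")
        soup = BeautifulSoup(html, "html.parser")
        for element in soup.select(selector):
            writer.writerow([uuid.uuid4(), str(path.parent), path.name,
                             element.get_text(" ", strip=True)])
```

Scraping recipe: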
1. Inspect the filepaths looking for a common pattern (`find crawled.com/ | vim -`).
1. [Find the files and flatten the directory](#directory-search-and-flattening).
1. [Take some samples](#file-sampling).
1. Inspect the HTML samples looking for what to extract, figuring out which CSS selector will do the job.
1. [Scrape the samples to .csv](#scraping-to-csv).
1. Inspect the .csv and check that it looks correct.
1. If it looks correct, scrape again, this time on all the crawled pages. If not, go back to adjusting the CSS selector.

## .csv cleanup
Counting rows:
```
$ sudo apt install csvkit
$ csvstat --count scrap.csv
```

Viewing "content" column only:
```
$ sudo apt install csvkit
$ cat scrap.csv | csvcut -c 4 --maxfieldsize 999999 | less
```

Use sed to remove any line containing strings such as "CRITEO TAG", "upload picture", ".jpg", etc.:
```
$ sed -i '/FISHY STRING GOES HERE/d' scrap.csv
```

Use sed to remove the `.html` extension:
```
$ sed -E -i '
/\.html/ { # match ".html"
s/\.html// # remove ".html"
}' scrap.csv
```

Use sed to remove the query string:
```
$ sed -E -i '
/viewtopic\.php\?/ { # match "viewtopic.php?"
s/\?[^,]*,/,/ # remove from "?" up to the next ","
}' scrap.csv
```

Checking if the file is well-formed:
```
$ sudo apt install csvkit
$ csvclean -n scrap.csv
```

Deduping rows by "content" column:
```
$ chmod +x dedupe-by-content.py
$ ./dedupe-by-content.py scrap.csv > deduped.csv
```
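
`dedupe-by-content.py` is also one of the repository's scripts; as an illustration, deduplicating by the "content" column amounts to keeping only the first row per distinct value:

```
#!/usr/bin/env python3
# Illustrative sketch, not the repository's actual dedupe-by-content.py:
# keep only the first row for each distinct "content" value and print the
# result to stdout.
import csv
import sys

csv.field_size_limit(10_000_000)  # allow very long "content" fields

seen = set()
reader = csv.DictReader(open(sys.argv[1], newline="", encoding="utf-8"))
writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    if row["content"] not in seen:
        seen.add(row["content"])
        writer.writerow(row)
```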