https://github.com/ejw-data/python-open-files
Collection of notes related to opening files and handling text strings in python
- Host: GitHub
- URL: https://github.com/ejw-data/python-open-files
- Owner: ejw-data
- Created: 2022-04-09T04:45:07.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-06-27T14:47:45.000Z (almost 4 years ago)
- Last Synced: 2025-03-15T16:44:30.754Z (over 1 year ago)
- Topics: pandas, python, sqlite
- Language: Python
- Homepage:
- Size: 133 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Python File Management
Author: Erin James Wills, ejw.data@gmail.com

## Overview
This repo is a collection of notes on opening files and handling text strings in Python. Topics include standard-library methods, pandas, SQLAlchemy, and SQL.
This is a work in progress, used for compiling and recording useful resources.
## Reading Large Files
*Refs:* https://www.kaggle.com/code/rohanrao/tutorial-on-reading-large-datasets/notebook
## Paths in Python
When a file path contains `\`, remember that Python string literals treat the backslash as an escape character.
For example, `'\r'`, `'\n'`, `'\b'`, and `'\t'` are each read as a single control character rather than a backslash followed by a letter.
The solution is either to write every backslash as `\\`, which produces a single backslash, or to use a raw string such as `print(r'.\path\file.csv')`. The `r` prefix tells the interpreter to treat backslashes as literal characters instead of escapes.
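
As a small illustration of the point above, the sketch below compares the three spellings of the same Windows-style path. The path `data\table.csv` is a made-up example, chosen because `\t` is a valid escape.

```python
# '\t' inside a normal string literal becomes a tab, so this path is silently corrupted.
broken = 'data\table.csv'

# Doubling the backslash yields one literal backslash.
escaped = 'data\\table.csv'

# A raw string leaves backslashes untouched.
raw = r'data\table.csv'

print(repr(broken))    # shows the embedded tab character
print(escaped == raw)  # both spellings produce the same string
```

Printing with `repr()` is a quick way to check whether a path string contains unintended control characters.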
### Quick notes:
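`print(u'string')` - prints a Unicode string. In Python 3 the `u` prefix is accepted but has no effect, because every `str` is already Unicode.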
### Remaining Questions
Are the outputs of BeautifulSoup Unicode strings?
* i.e., is a call such as `soup[0].encode("ascii")`, `soup[0].encode("latin-1")`, `soup[0].encode("utf-8")`, or `soup[0].encode(soup.originalEncoding)` needed to get the output?
### Escaping References
1. https://python-reference.readthedocs.io/en/latest/docs/str/escapes.html
1. https://www.w3schools.com/python/gloss_python_escape_characters.asp
# Python open() Modes
* `Read Only ('r')`: Opens a text file for reading. The handle is positioned at the beginning of the file. Raises `FileNotFoundError` if the file does not exist. This is the default mode.
* `Read and Write ('r+')`: Opens the file for reading and writing without truncating it. The handle is positioned at the beginning of the file. Raises `FileNotFoundError` if the file does not exist.
* `Write Only ('w')`: Opens the file for writing. An existing file is truncated, so its previous contents are lost. The handle is positioned at the beginning of the file. Creates the file if it does not exist.
* `Write and Read ('w+')`: Opens the file for reading and writing. An existing file is truncated, so its previous contents are lost. The handle is positioned at the beginning of the file. Creates the file if it does not exist.
* `Append Only ('a')`: Opens the file for writing. The file is created if it does not exist. The handle is positioned at the end of the file, so new data is written after the existing data.
* `Append and Read ('a+')`: Opens the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file, so new data is written after the existing data, and a `seek(0)` is needed before reading from the start.
*Ref:* https://www.geeksforgeeks.org/open-a-file-in-python/
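
The modes listed above can be tried out with a short sketch like the following. It writes to a scratch file in a temporary directory; the file name `notes.txt` and the variable names are only illustrative.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or truncates an existing one).
with open(path, "w") as f:
    f.write("first\n")

# 'a' writes after the existing data.
with open(path, "a") as f:
    f.write("second\n")

# 'r+' reads and writes without truncating; the handle starts at position 0.
with open(path, "r+") as f:
    start = f.tell()
    lines = f.readlines()

# 'a+' starts at the end of the file, so reading returns nothing until seek(0).
with open(path, "a+") as f:
    at_end = f.read()
    f.seek(0)
    after_seek = f.read()

# 'r' on a missing file raises FileNotFoundError (a subclass of OSError).
try:
    open(path + ".missing", "r")
    missing_raised = False
except FileNotFoundError:
    missing_raised = True
```

Using `with` closes each handle automatically, which keeps the later `open()` calls from seeing unflushed data.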
## Pandas
### Reading Files in Parts
* `pd.read_csv(..., nrows, skiprows, chunksize)`
  * `nrows`: int, default None. Number of rows of the file to read. Useful for reading pieces of large files.
  * `skiprows`: list-like or int. Line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.
  * `chunksize`: int, default None. Returns a `TextFileReader` object for iterating over the file in chunks.
* When automating reads of successive parts, this may be helpful: `skiprows = nend - nrows` (taking `nend` as the last row of the part to read).
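
A minimal sketch of the three parameters, assuming a small in-memory CSV as a stand-in for a large file (the column names and values are invented). Note that an integer `skiprows` also skips the header line, so this sketch passes a `range` starting at 1 to keep the header.

```python
import io
import pandas as pd

# Hypothetical CSV with a header and ten data rows (id 1..10).
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11))

# nrows: read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# skiprows + nrows: read rows 5-7 by skipping lines 1-4 (line 0 is the header).
nend, nrows = 7, 3
middle = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, nend - nrows + 1),
    nrows=nrows,
)

# chunksize: iterate over the file 4 rows at a time and keep a running total.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.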
## Pandas and Dask (Parallel Processing)
*Ref:* https://towardsdatascience.com/how-to-handle-large-datasets-in-python-with-pandas-and-dask-34f43a897d55
## Databases
For really large files, loading the data into a database and querying it (possibly combined with a map-reduce approach) is often the best route.
The general process for SQLite is:
1. Create database
```
import sqlite3
conn = sqlite3.connect('pts.db')  # creates pts.db if it does not exist
c = conn.cursor()
```
2. Create Table
```
c.execute('''CREATE TABLE ptsdata (filename, lineNumber, x, y, z)''')
```
3. Insert Data
```
c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, lineNumber, x, y, z))  # each ? binds one variable
conn.commit()  # save the inserted row
```
4. Query Data
```
c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=? ORDER BY lineNumber ASC", ('file.txt',))
```
5. Get n results
```
c.fetchmany(size=n)
```
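
The five steps above can be combined into one runnable sketch. It assumes an in-memory database instead of `pts.db` so that no file is left behind, and the sample points and column types are invented for illustration.

```python
import sqlite3

# Steps 1-2: connect (in memory here) and create the table.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "CREATE TABLE ptsdata (filename TEXT, lineNumber INTEGER, x REAL, y REAL, z REAL)"
)

# Step 3: insert hypothetical sample rows using ? placeholders.
points = [
    ("file.txt", 2, 1.0, 2.0, 3.0),
    ("file.txt", 1, 0.0, 0.5, 1.5),
    ("other.txt", 1, 9.0, 9.0, 9.0),
]
c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", points)
conn.commit()

# Step 4: query one file's points in line order.
c.execute(
    "SELECT lineNumber, x, y, z FROM ptsdata WHERE filename = ? ORDER BY lineNumber ASC",
    ("file.txt",),
)

# Step 5: fetch at most n results.
n = 2
rows = c.fetchmany(size=n)
conn.close()
```

Placeholders let `sqlite3` handle quoting, which avoids both syntax errors from unquoted values such as `file.txt` and SQL injection.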