https://github.com/ejw-data/python-open-files

Collection of notes related to opening files and handling text strings in python

Host: GitHub
URL: https://github.com/ejw-data/python-open-files
Owner: ejw-data
Created: 2022-04-09T04:45:07.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-06-27T14:47:45.000Z (about 4 years ago)
Last Synced: 2025-03-15T16:44:30.754Z (over 1 year ago)
Topics: pandas, python, sqlite
Language: Python
Homepage:
Size: 133 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Python File Management

Author: Erin James Wills, ejw.data@gmail.com

![File Management](./images/py-openfiles1.png)

## Overview

This repo is a collection of notes on opening files and handling text strings in Python. Topics include base library methods, pandas, SQLAlchemy, and SQL.

It is a work in progress, used for compiling and recording useful resources.

## Reading Large Files

*Refs:* https://www.kaggle.com/code/rohanrao/tutorial-on-reading-large-datasets/notebook

## Paths in Python

When a file path contains `\`, remember that Python string literals treat a backslash as the start of an escape sequence.

For example: `'\r'`, `'\n'`, `'\b'`, `'\t'`

The solution is to either write every backslash as `\\`, which inserts a single backslash, or use a raw string such as `print(r'.\path\file.csv')`. The `r` prefix tells the interpreter to treat backslashes as ordinary characters rather than as escapes.
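As a small illustration of the difference, the sketch below builds the same Windows-style path three ways. The path `.\new\table.csv` is a made-up example (not a file in this repo), chosen because `\n` and `\t` are both escape sequences.

```python
# Hypothetical path used only for illustration.
plain = '.\new\table.csv'      # "\n" becomes a newline, "\t" becomes a tab
escaped = '.\\new\\table.csv'  # each "\\" inserts one literal backslash
raw = r'.\new\table.csv'       # raw string: backslashes are kept as written

print(repr(plain))     # shows the newline and tab that replaced parts of the path
print(escaped)
print(raw)
print(escaped == raw)  # the escaped and raw forms are the same string
```

The escaped and raw forms are interchangeable. One caveat: a raw string cannot end in a single backslash.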

### Quick notes:
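`print(u'string')` - prints `string`. In Python 3 the `u` prefix is accepted for backward compatibility and has no effect, because every `str` is already Unicode.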

### Remaining Questions
In BeautifulSoup, are the outputs Unicode?
* i.e., is a call such as `soup[0].encode("ascii")`, `soup[0].encode("latin-1")`, `soup[0].encode("utf-8")`, or `soup[0].encode(soup.originalEncoding)` needed to get the output?

### Escaping References
1. https://python-reference.readthedocs.io/en/latest/docs/str/escapes.html
1. https://www.w3schools.com/python/gloss_python_escape_characters.asp

## Python `open()` Modes

* `Read Only ('r')`: Open a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, `open()` raises `FileNotFoundError` (a subclass of `OSError`). This is also the default mode.

* `Read and Write ('r+')`: Open the file for reading and writing. The handle is positioned at the beginning of the file. Raises `FileNotFoundError` if the file does not exist.

* `Write Only ('w')`: Open the file for writing. If the file exists, its data is truncated and overwritten. The handle is positioned at the beginning of the file. Creates the file if it does not exist.

* `Write and Read ('w+')`: Open the file for reading and writing. If the file exists, its data is truncated and overwritten. The handle is positioned at the beginning of the file. Creates the file if it does not exist.

* `Append Only ('a')`: Open the file for writing. The file is created if it does not exist. The handle is positioned at the end of the file, so written data is added after the existing data.

* `Append and Read ('a+')`: Open the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file, so written data is added after the existing data.

*Ref:* https://www.geeksforgeeks.org/open-a-file-in-python/
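The modes above can be tried out with a rough sketch like the following. It works in a temporary directory, and the file name `notes.txt` is a hypothetical placeholder.

```python
import os
import tempfile

# Work in a throwaway directory; "notes.txt" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")

    # 'r' on a missing file raises FileNotFoundError (a subclass of OSError).
    try:
        open(path, "r")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, "w") as f:
        f.write("first\n")

    # 'a' keeps the existing data and writes at the end.
    with open(path, "a") as f:
        f.write("second\n")

    # 'r+' reads and writes without truncating; the handle starts at offset 0.
    with open(path, "r+") as f:
        start = f.tell()
        after_append = f.read()

    # 'a+' also allows reading, but the handle starts at the end of the file,
    # so seek(0) is needed before reading the existing data.
    with open(path, "a+") as f:
        f.seek(0)
        via_a_plus = f.read()

    # 'w+' truncates first, so nothing is left to read.
    with open(path, "w+") as f:
        after_truncate = f.read()

print(missing_raised, start, repr(after_append), repr(after_truncate))
```

Using `with` closes each handle automatically, which matters here because the same file is reopened several times.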

## Pandas

Reading Files in Parts
* `pd.read_csv(..., nrows, skiprows, chunksize)`
* `nrows`: int, default `None`. Number of rows of the file to read. Useful for reading pieces of large files.
* `skiprows`: list-like or int. Row numbers to skip (0-indexed), or the number of rows to skip at the start of the file when given as an int. An int counts from the top of the file, so it also skips the header line.
* `chunksize`: int, default `None`. When set, `read_csv` returns a `TextFileReader` object for iteration instead of a single DataFrame.
* When automating, these can be combined: to read the `nrows` rows that end at row `nend`, set `skiprows = nend - nrows`.
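A minimal sketch of the three parameters, assuming pandas is installed. The CSV here is a small in-memory stand-in for a large file, and the column names `id` and `value` are made up for the example.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file (hypothetical data).
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(10))

# nrows: read only the first 3 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# skiprows + nrows: read the 3 data rows that end at data row nend.
# range(1, ...) starts at 1 so the header line (row 0) is kept.
nend, nrows = 8, 3
window = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, nend - nrows + 1),
    nrows=nrows,
)

# chunksize: iterate over the file 4 rows at a time instead of loading it all.
chunk_lengths = []
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_lengths.append(len(chunk))
    total += chunk["value"].sum()

print(head["id"].tolist())
print(window["id"].tolist())
print(chunk_lengths, total)
```

Passing a `range` that starts at 1 for `skiprows` is one way to keep the header while skipping data rows. An integer `skiprows` would drop the header line as well.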

## Pandas and Dask (Parallel Processing)

*Ref:* https://towardsdatascience.com/how-to-handle-large-datasets-in-python-with-pandas-and-dask-34f43a897d55

## Databases

For really large files, loading the contents into a database and retrieving only what is needed with queries (or a map-reduce style workflow) is often the best route.

The general process for SQLite is:
1. Create database
```
import sqlite3

conn = sqlite3.connect('pts.db')  # creates pts.db if it does not exist
c = conn.cursor()
```

2. Create Table
```
c.execute('''CREATE TABLE ptsdata (filename, line, x, y, z)''')
```

3. Insert Data
```
c.execute("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", (filename, lineNumber, x, y, z))  # ? placeholders bind the Python values
conn.commit()  # persist the inserted row
```

4. Query Data
```
c.execute("SELECT line, x, y, z FROM ptsdata WHERE filename = ? ORDER BY line ASC", ('file.txt',))
```

5. Get n results
```
c.fetchmany(size=n)
```
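The five steps can be put together in one runnable sketch. It uses an in-memory database instead of `pts.db` so nothing is written to disk. The column types and the sample rows are assumptions made for the example, not data from this repo.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Column types are an assumption; the notes above leave them untyped.
c.execute("CREATE TABLE ptsdata (filename TEXT, line INTEGER, x REAL, y REAL, z REAL)")

# Hypothetical point data: (filename, line, x, y, z).
rows = [
    ("file.txt", 2, 1.0, 1.5, 2.0),
    ("file.txt", 1, 0.0, 0.5, 1.0),
    ("other.txt", 1, 9.0, 9.0, 9.0),
    ("file.txt", 3, 2.0, 2.5, 3.0),
]
c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Placeholders keep the filename out of the SQL text itself.
c.execute(
    "SELECT line, x, y, z FROM ptsdata WHERE filename = ? ORDER BY line ASC",
    ("file.txt",),
)
first_two = c.fetchmany(size=2)  # up to 2 rows
remaining = c.fetchall()         # whatever is left in the result set
conn.close()

print(first_two)
print(remaining)
```

`fetchmany` returns at most `size` rows and leaves the cursor where it stopped, so repeated calls walk through a large result set in pieces.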
