https://github.com/tac0x2a/lake_weed
Lake Weed is an elastic converter that turns JSON, JSON Lines, and CSV strings into data for constructing RDB queries.
- Host: GitHub
- URL: https://github.com/tac0x2a/lake_weed
- Owner: tac0x2a
- License: mit
- Created: 2019-10-17T12:38:57.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2021-03-15T14:33:51.000Z (about 5 years ago)
- Last Synced: 2025-10-30T07:55:59.238Z (5 months ago)
- Topics: clickhouse, csv, json, json-lines, pypi
- Language: Python
- Homepage:
- Size: 116 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Lake Weed


Lake Weed is an elastic converter that turns JSON, JSON Lines, and CSV strings into data for constructing RDB queries.
You get the schema and the converted values just by passing in a source string.
# Usage
## Install package
```
pip install lakeweed
```
PyPI: https://pypi.org/project/lakeweed/
## Example (JSON to ClickHouse)
```py
from lakeweed import clickhouse
src_json = """
{
"array" : [1,2,3],
"array_in_array" : [[1.1, 2.2], [3.3, 4.4]],
"nested_map" : {"value" : [[1,2], [3,4]]},
"map_in_array" : [{"v":1}, {"v":2}],
"dates" : ["2019/09/15 14:50:03.101 +0900", "2019/09/15 14:50:03.202 +0900"],
"date" : {
"as_datetime": "2019/09/15 14:50:03.042042043 +0900",
"as_string" : "2019/09/15 14:50:03.042042043 +0900"
},
"str" : "Hello, LakeWeed"
}
"""
# Value types are guessed by lakeweed automatically.
# You can also specify a type per field if you want.
my_types = {
"date__as_string": "str"
}
(columns, types, values) = clickhouse.data_string2type_value(src_json, specified_types=my_types)
print(columns)
# (
# 'array',
# 'array_in_array',
# 'nested_map__value',
# 'map_in_array',
# 'dates',
# 'date__as_datetime',
# 'date__as_string',
# 'str'
# )
print(types)
# (
# 'Array(Float64)',
# 'Array(String)',
# 'Array(String)',
# 'Array(String)',
# 'Array(DateTime64(6))',
# 'DateTime64(6)',
# 'String',
# 'String'
# )
print(values)
# [(
# [1.0, 2.0, 3.0],
# ['[1.1, 2.2]', '[3.3, 4.4]'],
# ['[1, 2]', '[3, 4]'],
# ['{"v": 1}', '{"v": 2}'],
# [
# datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)),
# datetime.datetime(2019, 9, 15, 14, 50, 3, 202000, tzinfo=tzoffset(None, 32400))
# ],
# datetime.datetime(2019, 9, 15, 14, 50, 3, 42042, tzinfo=tzoffset(None, 32400)),
# '2019/09/15 14:50:03.042042043 +0900',
# 'Hello, LakeWeed'
# )]
```
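Because `columns` and `types` line up one-to-one, they can be zipped into a table definition. The following is a minimal sketch under stated assumptions: `build_create_table` is a hypothetical helper (not part of lakeweed), and the table name `my_table` and the `MergeTree` engine with `ORDER BY tuple()` are illustrative choices only.

```python
def build_create_table(table, columns, types):
    """Build a ClickHouse CREATE TABLE statement from parallel tuples.

    Hypothetical helper for illustration; not part of lakeweed.
    """
    if len(columns) != len(types):
        raise ValueError("columns and types must have the same length")
    # Backtick-quote column names so names like `date__as_string` stay intact.
    body = ", ".join(f"`{c}` {t}" for c, t in zip(columns, types))
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({body}) "
        "ENGINE = MergeTree() ORDER BY tuple()"
    )


# Literal values copied from part of the example output above.
columns = ("array", "date__as_datetime", "str")
types = ("Array(Float64)", "DateTime64(6)", "String")
print(build_create_table("my_table", columns, types))
```

The `values` list can then be passed to whichever ClickHouse client you use for the matching `INSERT`.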
## Example (CSV to ClickHouse)
```py
from lakeweed import clickhouse

src_csv = """
f,b,d
42,true,2019/09/15 14:50:03.101 +0900
"42","true",2019/12/15 14:50:03.101 +0900
"""
(columns, types, values) = clickhouse.data_string2type_value(src_csv)
print(columns)
# ('f', 'b', 'd')
print(types)
# ('Float64', 'UInt8', 'DateTime64(6)')
print(values)
# [
# (42.0, 1, datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400))),
# (42.0, 1, datetime.datetime(2019, 12, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)))
# ]
```
## Example (JSON Lines to ClickHouse)
Lake Weed converts each line of JSON Lines input the same way it converts a single JSON document.
It automatically selects a type that can store every value in a column. For example, if a column mixes numbers and strings, it selects String, which can hold both.
```py
src_json_lines = """
{"f": 42, "b": true, "d": "2019/09/15 14:50:03.101 +0900"}
{"f": "42", "b": "true", "d": "2019/12/15 14:50:03.101 +0900"}
"""
(columns, types, values) = clickhouse.data_string2type_value(src_json_lines)
print(columns)
# ('f', 'b', 'd')
print(types)
# ('String', 'String', 'DateTime64(6)')
print(values)
# [
# ('42', 'true', datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400))),
# ('42', 'true', datetime.datetime(2019, 12, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)))
# ]
```
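One way to picture the selection rule is to guess a type per value and fall back to `String` whenever the values in a column disagree. The sketch below illustrates only that idea: `guess_type` and `unify` are hypothetical names, and lakeweed's actual inference (including date detection) is more involved.

```python
def guess_type(value):
    """Guess a coarse type name for one already-parsed JSON value."""
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(value, bool):
        return "Bool"
    if isinstance(value, (int, float)):
        return "Float"  # numbers are treated as Float by default
    return "String"


def unify(values):
    """Return one type name that can hold every value in a column."""
    kinds = {guess_type(v) for v in values}
    # A single agreed type is kept; any disagreement falls back to String.
    return kinds.pop() if len(kinds) == 1 else "String"


print(unify([42, "42"]))      # String
print(unify([True, "true"]))  # String
print(unify([42, 3.5]))       # Float
```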
# Type
## Lake Weed types
- `Int`
- `Float`
- `Bool`
- `String`
- `DateTime` (nanosecond precision)
- `Array[Int]`
- `Array[Float]`
- `Array[Bool]`
- `Array[String]`
- `Array[DateTime]`
Python's built-in data types are used for the Int, Float, Bool, and String types. By default, numeric values (Int or Float) are always treated as Float.
DateTime extends `datetime.datetime` and also carries nanoseconds. See the `DateTimeWithNS` type.
`Array[]` supports the primitive types above.
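Python's `datetime.datetime` stops at microseconds, so a nanosecond timestamp has to keep the last three digits somewhere else. The sketch below shows one way to split them; `split_nanoseconds` is a hypothetical function written for this README's timestamp format and is not how `DateTimeWithNS` is implemented.

```python
from datetime import datetime


def split_nanoseconds(text):
    """Parse 'YYYY/MM/DD HH:MM:SS.fffffffff +ZZZZ' into (datetime, leftover_ns).

    Hypothetical helper: the datetime keeps the first six fractional digits
    (microseconds) and the remaining three digits are returned separately.
    """
    head, rest = text.split(".")
    frac, zone = rest.split(" ")
    frac = frac.ljust(9, "0")  # pad so '.101' means 101000000 ns
    dt = datetime.strptime(f"{head}.{frac[:6]} {zone}", "%Y/%m/%d %H:%M:%S.%f %z")
    return dt, int(frac[6:9])


dt, ns = split_nanoseconds("2019/09/15 14:50:03.042042043 +0900")
print(dt.microsecond, ns)  # 42042 43
```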
## Specified Types
By default, value types are guessed by lakeweed automatically.
To force a specific type for a field, pass it in the `specified_types` argument.
```python
my_types = {
"date__as_string": "str" # field name : specified type name
}
(columns, types, values) = clickhouse.data_string2type_value(src_json, specified_types=my_types)
```
You can use the following types.
| Specified Type String (case-insensitive) | Lake Weed Type |
| :---------------------------------: | :------------: |
| `INT` | `Int` |
| `INTEGER` | `Int` |
| `FLOAT` | `Float` |
| `DOUBLE` | `Float` |
| `BOOL` | `Bool` |
| `BOOLEAN` | `Bool` |
| `DATETIME` | `DateTime` |
| `STR` | `String` |
| `STRING` | `String` |
If a cast fails, the value will be NULL.
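The case-insensitive lookup and the NULL-on-failure rule can be sketched as follows. This is an illustration under stated assumptions, not lakeweed's code: `ALIASES`, `CASTERS`, and `cast_or_none` are hypothetical names, only `Int`, `Float`, and `String` are covered, and Python `None` stands in for NULL.

```python
# Alias table taken from the list above; keys are upper-cased for lookup.
ALIASES = {
    "INT": "Int", "INTEGER": "Int",
    "FLOAT": "Float", "DOUBLE": "Float",
    "STR": "String", "STRING": "String",
}

# Python constructors used to perform each cast in this sketch.
CASTERS = {"Int": int, "Float": float, "String": str}


def cast_or_none(value, specified):
    """Cast value to the specified type name, or return None if the cast fails."""
    target = ALIASES[specified.upper()]  # "str", "Str", and "STR" all match
    try:
        return CASTERS[target](value)
    except (TypeError, ValueError):
        return None  # stands in for NULL


print(cast_or_none("42", "int"))       # 42
print(cast_or_none("abc", "Integer"))  # None
print(cast_or_none(42, "str"))         # 42 (as a string)
```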
## Output Data Type
### ClickHouse
| Source Type | [ClickHouse Data Types](https://clickhouse.tech/docs/en/sql-reference/data-types/) |
| ---------------: | :--------------------------------------------------------------------------------- |
| `Int` | `Int64` |
| `Float` | `Float64` |
| `Bool` | `UInt8` (True: 1, False: 0) |
| `String` | `String` |
| `DateTime` | `DateTime64(6)` (nanoseconds are ignored) |
| `Array[Int]` | `Array(Int64)` |
| `Array[Float]` | `Array(Float64)` |
| `Array[Bool]` | `Array(UInt8)` |
| `Array[String]` | `Array(String)` |
| `Array[DateTime]` | `Array(DateTime64(6))` |
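The table translates directly into a lookup. Here is a sketch, assuming Lake Weed type names are available as plain strings written as in the Lake Weed types list (for example `Array[Int]`); `SCALARS` and `to_clickhouse_type` are hypothetical names, not lakeweed's API.

```python
# Scalar mappings copied from the table above.
SCALARS = {
    "Int": "Int64",
    "Float": "Float64",
    "Bool": "UInt8",
    "String": "String",
    "DateTime": "DateTime64(6)",
}


def to_clickhouse_type(name):
    """Map a Lake Weed type name to the ClickHouse type in the table."""
    if name.startswith("Array[") and name.endswith("]"):
        inner = name[len("Array["):-1]  # e.g. "Int" from "Array[Int]"
        return f"Array({SCALARS[inner]})"
    return SCALARS[name]


print(to_clickhouse_type("Bool"))             # UInt8
print(to_clickhouse_type("Array[DateTime]"))  # Array(DateTime64(6))
```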
# Release to PyPI
## Setup
### Create `~/.pypirc`
```ini
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
repository: https://upload.pypi.org/legacy/
username:
password:

[testpypi]
repository: https://test.pypi.org/legacy/
username:
password:
```
### Install packages for build and deploy
```sh
pip install wheel twine
```
## Build and Deploy
### Make Package
```sh
rm -f -r lakeweed.egg-info/* dist/*
python setup.py sdist bdist_wheel
```
### Local testing
```sh
python setup.py develop
```
### Deploy to PyPI
```sh
# for testing
twine upload --repository testpypi dist/*
# open https://test.pypi.org/project/lakeweed/
# for production
twine upload --repository pypi dist/*
# open https://pypi.org/project/lakeweed/
```
# Contributing
1. Fork it ( https://github.com/tac0x2a/lake_weed/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request