https://github.com/tac0x2a/lake_weed

Lake Weed is an elastic converter for JSON, JSON Lines, and CSV strings, for use in constructing RDB queries.

Host: GitHub
URL: https://github.com/tac0x2a/lake_weed
Owner: tac0x2a
License: mit
Created: 2019-10-17T12:38:57.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2021-03-15T14:33:51.000Z (about 5 years ago)
Last Synced: 2025-10-30T07:55:59.238Z (6 months ago)
Topics: clickhouse, csv, json, json-lines, pypi
Language: Python
Homepage:
Size: 116 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Lake Weed

![Python application](https://github.com/tac0x2a/lake_weed/workflows/Python%20application/badge.svg)

![Lake Weed](./doc/img/lakeweed_s.png)

Lake Weed is an elastic converter for JSON, JSON Lines, and CSV strings, for use in constructing RDB queries.

Pass in a source string and you get back the schema and the converted values.

# Usage

## Install package

```
pip install lakeweed
```

PyPI: https://pypi.org/project/lakeweed/

## Example (JSON test to ClickHouse)

```py
from lakeweed import clickhouse

src_json = """
{
  "array" : [1,2,3],
  "array_in_array" : [[1.1, 2.2], [3.3, 4.4]],
  "nested_map" : {"value" : [[1,2], [3,4]]},
  "map_in_array"  : [{"v":1}, {"v":2}],
  "dates" : ["2019/09/15 14:50:03.101 +0900", "2019/09/15 14:50:03.202 +0900"],
  "date"  : {
    "as_datetime": "2019/09/15 14:50:03.042042043 +0900",
    "as_string"  : "2019/09/15 14:50:03.042042043 +0900"
  },
  "str"   : "Hello, LakeWeed"
}
"""

# Value types are guessed by lakeweed automatically.
# You can specify a type explicitly if you want.
my_types = {
    "date__as_string": "str"
}

(columns, types, values) = clickhouse.data_string2type_value(src_json, specified_types=my_types)

print(columns)
# (
#   'array',
#   'array_in_array',
#   'nested_map__value',
#   'map_in_array',
#   'dates',
#   'date__as_datetime',
#   'date__as_string',
#   'str'
# )

print(types)
# (
#   'Array(Float64)',
#   'Array(String)',
#   'Array(String)',
#   'Array(String)',
#   'Array(DateTime64(6))',
#   'DateTime64(6)',
#   'String',
#   'String'
# )

print(values)
# [(
#   [1.0, 2.0, 3.0],
#   ['[1.1, 2.2]', '[3.3, 4.4]'],
#   ['[1, 2]', '[3, 4]'],
#   ['{"v": 1}', '{"v": 2}'],
#   [
#     datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)),
#     datetime.datetime(2019, 9, 15, 14, 50, 3, 202000, tzinfo=tzoffset(None, 32400))
#   ],
#   datetime.datetime(2019, 9, 15, 14, 50, 3, 42042, tzinfo=tzoffset(None, 32400)),
#   '2019/09/15 14:50:03.042042043 +0900',
#   'Hello, LakeWeed'
# )]
```
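
Since the returned `columns` and `types` are meant for building RDB queries, one possible next step is to turn them into a `CREATE TABLE` statement. The sketch below is not part of lakeweed: `build_create_table` is a hypothetical helper, and the table name, `ENGINE = MergeTree` and `ORDER BY tuple()` are assumptions chosen only to make the statement complete.

```python
def build_create_table(table, columns, types):
    """Return a ClickHouse CREATE TABLE statement (hypothetical helper)."""
    if len(columns) != len(types):
        raise ValueError("columns and types must have the same length")
    # Backtick-quote each column name and pair it with its ClickHouse type.
    column_defs = ",\n".join(
        f"    `{name}` {type_}" for name, type_ in zip(columns, types)
    )
    # ENGINE and ORDER BY are assumptions of this sketch, not lakeweed output.
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{column_defs}\n"
        f") ENGINE = MergeTree ORDER BY tuple()"
    )


# Columns and types copied from the example output above.
columns = ('array', 'array_in_array', 'nested_map__value', 'map_in_array',
           'dates', 'date__as_datetime', 'date__as_string', 'str')
types = ('Array(Float64)', 'Array(String)', 'Array(String)', 'Array(String)',
         'Array(DateTime64(6))', 'DateTime64(6)', 'String', 'String')

print(build_create_table("my_table", columns, types))
```

The helper does no escaping beyond backtick-quoting, so it is only suitable for trusted column names.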

## Example (CSV test to ClickHouse)

```py
from lakeweed import clickhouse

src_csv = """
f,b,d
42,true,2019/09/15 14:50:03.101 +0900
"42","true",2019/12/15 14:50:03.101 +0900
"""

(columns, types, values) = clickhouse.data_string2type_value(src_csv)

print(columns)
# ('f', 'b', 'd')

print(types)
# ('Float64', 'UInt8', 'DateTime64(6)')

print(values)
# [
#   (42.0, 1, datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400))),
#   (42.0, 1, datetime.datetime(2019, 12, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)))
# ]
```

## Example (JSON Lines test to ClickHouse)

Lake Weed converts each line of JSON Lines input the same way it converts a single JSON document.

It automatically selects a column type that can store every value in that column. For example, if a column mixes numbers and strings, Lake Weed selects `String`, which can hold both.

```py
from lakeweed import clickhouse

src_json_lines = """
{"f": 42,   "b": true,   "d": "2019/09/15 14:50:03.101 +0900"}
{"f": "42", "b": "true", "d": "2019/12/15 14:50:03.101 +0900"}
"""

(columns, types, values) = clickhouse.data_string2type_value(src_json_lines)

print(columns)
# ('f', 'b', 'd')

print(types)
# ('String', 'String', 'DateTime64(6)')

print(values)
# [
#   ('42', 'true', datetime.datetime(2019, 9, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400))),
#   ('42', 'true', datetime.datetime(2019, 12, 15, 14, 50, 3, 101000, tzinfo=tzoffset(None, 32400)))
# ]
```
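
The type selection described in this section can be illustrated with a small sketch. This is not lakeweed's implementation: `guess_type` and `unify_types` are hypothetical names, the rule "fall back to `String` when types disagree" is a simplification, and date-string detection is left out, so the sample data below has no date column.

```python
import json


def guess_type(value):
    """Guess a Lake Weed style type name for one decoded JSON value."""
    # bool must be checked before numbers because bool is a subclass of int.
    if isinstance(value, bool):
        return "Bool"
    # Numeric values are treated as Float, as the README states for the default.
    if isinstance(value, (int, float)):
        return "Float"
    return "String"


def unify_types(values):
    """Pick one type that can store every value in a column."""
    types = {guess_type(v) for v in values if v is not None}
    if len(types) == 1:
        return types.pop()
    # Mixed (or empty) columns fall back to String, which can hold anything.
    return "String"


src = """
{"f": 42,   "b": true,   "n": 1}
{"f": "42", "b": "true", "n": 2.5}
"""
rows = [json.loads(line) for line in src.strip().splitlines()]
schema = {key: unify_types([row[key] for row in rows]) for key in rows[0]}
print(schema)
# {'f': 'String', 'b': 'String', 'n': 'Float'}
```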

# Type

## Lake Weed types

- `Int`

- `Float`

- `Bool`

- `String`

- `DateTime` (nanosecond precision)

- `Array[Int]`

- `Array[Float]`

- `Array[Bool]`

- `Array[String]`

- `Array[DateTime]`

Python's built-in data types are used for the Int, Float, Bool and String types. By default, numeric values (Int or Float) are always treated as Float.

DateTime extends `datetime.datetime` and also holds nanoseconds. See the `DateTimeWithNS` type.

`Array[]` supports the primitive types above.
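
To show why a nanosecond-aware type is needed, here is a minimal sketch of keeping nanoseconds next to a standard `datetime`, which stops at microseconds. It is not lakeweed's `DateTimeWithNS`: `NanoTimestamp` and `parse_with_nanoseconds` are hypothetical names, and the fixed format string is an assumption that matches the timestamps used in the examples.

```python
import re
from datetime import datetime
from typing import NamedTuple


class NanoTimestamp(NamedTuple):
    """Hypothetical container: a datetime plus the full nanosecond part."""
    dt: datetime        # microsecond precision, as datetime allows
    nanosecond: int     # fractional second expressed in nanoseconds


_FRACTION = re.compile(r"\.(\d+)")


def parse_with_nanoseconds(text, fmt="%Y/%m/%d %H:%M:%S.%f %z"):
    match = _FRACTION.search(text)
    if match is None:
        raise ValueError(f"no fractional seconds in {text!r}")
    digits = match.group(1)
    # Pad or cut the fraction to exactly nine digits to get nanoseconds.
    nanosecond = int(digits[:9].ljust(9, "0"))
    # strptime's %f accepts at most six digits, so cut the fraction first.
    truncated = text[:match.start(1)] + digits[:6] + text[match.end(1):]
    return NanoTimestamp(datetime.strptime(truncated, fmt), nanosecond)


ts = parse_with_nanoseconds("2019/09/15 14:50:03.042042043 +0900")
print(ts.dt.microsecond)  # 42042
print(ts.nanosecond)      # 42042043
```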

## Specified Types

By default, value types are guessed by lakeweed automatically.

To enforce a particular type, pass it in the `specified_types` argument.

```python
my_types = {
    "date__as_string": "str" # field name : specified type name
}

(columns, types, values) = clickhouse.data_string2type_value(src_json, specified_types=my_types)
```

You can use the following types.

| Specified Type String (ignore case) | Lake Weed Type |
| :---------------------------------: | :------------: |
|                `INT`                |     `Int`      |
|              `INTEGER`              |     `Int`      |
|               `FLOAT`               |    `Float`     |
|              `DOUBLE`               |    `Float`     |
|               `BOOL`                |     `Bool`     |
|              `BOOLEAN`              |     `Bool`     |
|             `DATETIME`              |   `DateTime`   |
|                `STR`                |    `String`    |
|              `STRING`               |    `String`    |

If a cast fails, the value will be NULL.
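
The alias lookup and the cast-failure rule can be sketched as follows. This is an illustration under assumptions, not lakeweed's code: `TYPE_ALIASES`, `resolve_type` and `cast_value` are hypothetical names, NULL is represented as Python `None`, the accepted Bool spellings are a guess, and `DateTime` casting is left out.

```python
# Alias table copied from the README; keys are matched case-insensitively.
TYPE_ALIASES = {
    "INT": "Int", "INTEGER": "Int",
    "FLOAT": "Float", "DOUBLE": "Float",
    "BOOL": "Bool", "BOOLEAN": "Bool",
    "DATETIME": "DateTime",
    "STR": "String", "STRING": "String",
}


def resolve_type(name):
    """Map a specified type string to its Lake Weed type name."""
    return TYPE_ALIASES[name.upper()]


def cast_value(value, type_name):
    """Cast value to the specified type, or return None if the cast fails."""
    target = resolve_type(type_name)
    if target == "DateTime":
        raise NotImplementedError("DateTime casting is outside this sketch")
    try:
        if target == "Int":
            return int(value)
        if target == "Float":
            return float(value)
        if target == "Bool":
            if isinstance(value, bool):
                return value
            text = str(value).strip().lower()
            if text in ("true", "false"):
                return text == "true"
            raise ValueError(f"not a boolean: {value!r}")
        return str(value)  # target == "String"
    except (TypeError, ValueError):
        return None  # stands in for NULL


print(cast_value("42", "int"))       # 42
print(cast_value("abc", "INTEGER"))  # None
print(cast_value(42, "str"))         # 42 (as a string)
```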

## Output Data Type

### ClickHouse

|    Source Type    | [ClickHouse Data Types](https://clickhouse.tech/docs/en/sql-reference/data-types/) |
| ----------------: | :--------------------------------------------------------------------------------- |
|       `Int`       | `Int64`                                                                            |
|      `Float`      | `Float64`                                                                          |
|      `Bool`       | `UInt8` (True: 1, False: 0)                                                        |
|     `String`      | `String`                                                                           |
|    `DateTime`     | `DateTime64(6)` (Nanosecond-order digits are ignored.)                             |
|   `Array(Int)`    | `Array(Int64)`                                                                     |
|  `Array(Float)`   | `Array(Float64)`                                                                   |
|   `Array(Bool)`   | `Array(UInt8)`                                                                     |
|  `Array(String)`  | `Array(String)`                                                                    |
| `Array(DateTime)` | `Array(DateTime64(6))`                                                             |
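
The mapping in this table can be written down as a lookup, shown here as a rough sketch. The names `CLICKHOUSE_TYPES`, `to_clickhouse_type` and `to_clickhouse_value` are hypothetical and do not come from lakeweed; only the type pairs themselves are taken from the table.

```python
import re

# Scalar mapping copied from the table above.
CLICKHOUSE_TYPES = {
    "Int": "Int64",
    "Float": "Float64",
    "Bool": "UInt8",
    "String": "String",
    "DateTime": "DateTime64(6)",
}

_ARRAY = re.compile(r"Array\((\w+)\)")


def to_clickhouse_type(source_type):
    """Translate a source type name such as 'Array(Int)' to a ClickHouse type."""
    match = _ARRAY.fullmatch(source_type)
    if match:
        return f"Array({CLICKHOUSE_TYPES[match.group(1)]})"
    return CLICKHOUSE_TYPES[source_type]


def to_clickhouse_value(value):
    """Convert Bool values to 1/0 for UInt8 columns; leave other values as-is."""
    if isinstance(value, bool):
        return int(value)
    if isinstance(value, list):
        return [to_clickhouse_value(item) for item in value]
    return value


print(to_clickhouse_type("Array(DateTime)"))  # Array(DateTime64(6))
print(to_clickhouse_value([True, False]))     # [1, 0]
```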

# Release PyPI

## Setup

### Create `~/.pypirc`

```ini
[distutils]
index-servers =
  pypi
  testpypi

[pypi]
repository: https://upload.pypi.org/legacy/
username: 
password: 

[testpypi]
repository: https://test.pypi.org/legacy/
username: 
password: 
```

### Install packages for build and deploy

```sh
pip install wheel twine
```

## Build and Deploy

### Make Package

```sh
rm -f -r lakeweed.egg-info/* dist/*
python setup.py sdist bdist_wheel
```

### Local testing

```sh
python setup.py develop
```

### Deploy to PyPI

```sh
# for testing
twine upload --repository testpypi dist/*
# open https://test.pypi.org/project/lakeweed/

# for production
twine upload --repository pypi dist/*
# open https://pypi.org/project/lakeweed/
```

# Contributing

1. Fork it ( https://github.com/tac0x2a/lake_weed/fork )

2. Create your feature branch (`git checkout -b my-new-feature`)

3. Commit your changes (`git commit -am 'Add some feature'`)

4. Push to the branch (`git push origin my-new-feature`)

5. Create a new Pull Request
