https://github.com/ggeop/multiple-fields-management
Fields management from/to different data sources. :bulb:
https://github.com/ggeop/multiple-fields-management
data data-engineering data-organization data-retrieval data-science pandas python
Last synced: about 2 months ago
JSON representation
Fields management from/to different data sources. :bulb:
- Host: GitHub
- URL: https://github.com/ggeop/multiple-fields-management
- Owner: ggeop
- Created: 2022-09-23T11:04:39.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-09-23T11:08:01.000Z (over 3 years ago)
- Last Synced: 2025-02-08T17:33:46.802Z (over 1 year ago)
- Topics: data, data-engineering, data-organization, data-retrieval, data-science, pandas, python
- Language: Jupyter Notebook
- Homepage:
- Size: 3.91 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
This repo presents an idea of fields management from/to different data sources.
# Requirements :triangular_ruler:
* Python 3.8.x (Tested in Python 3.8.5)
* Python Pandas
# Target Audience :loudspeaker:
* Data Engineers
* Data Scientists
* Data Analysts
* Software Engineers
# Problem :confused:
A very common headache in data processing from multiple sources is the managment of column names.
In a nutshell, few problems are:
* **Lack of naming convention**: Column names are not following the same convention or any convention at all.
* **Unefficient refactoring**: Hardcoded strings are all over the project. In refactoring, its possible to forgot to update something.
* **Column dublications**: Column names are not in the same place, so its possible the user will use similar column names for the same thing.
# Solution :ok_hand:
Organize all the fields (column names) in a cetral place by using a Python object (class).
The benefits from this strategy is the elimination of the above problems and much more (e.g adds fields logic)!!
A Python class gives the ability to use methods and also bound other metadata with fields (e.g type).
## An example :yum:
Below there is a simple example, but you can find more code examples in `examples.ipynb` Notebook
Let's say that we have a pandas dataframe with one column.
```{python}
input_df = pd.DataFrame({'dummy_column_1': [1,2,3,4]})
```
Output:
```
+----+----------------+
| | dummy_column_1 |
|----+----------------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
+----+----------------+
```
Now, we want to create a new column `dummy_column_2`. The first step, is to create a new entity in `Field` class. For example:
```
dummy_field_2 = field(input_name='dummy_column_2', exported_name='Dummy Column 2')
```
Then we can use the new field in our dataframe.
```{python}
import Field
df[Fields.dummy_field_2.input_name] = 2
```
Output:
```
+----+----------------+------------------+
| | dummy_column_1 | dummy_column_2 |
|----+----------------+------------------|
| 0 | 1 | 2 |
| 1 | 2 | 2 |
| 2 | 3 | 2 |
| 3 | 4 | 2 |
+----+----------------+------------------+
```
Then at the end, after we have finished with the processing steps, we can rename the fields before we export the dataframe by leveraging field functions.
```{python}
field_renames = Field.get_renames()
df = df.rename(columns={field: field_renames[field] for field in input_df.columns})
```
Output:
```
+----+----------------+------------------+
| | Dummy Column | Dummy Column 2 | <---- New column names
|----+----------------+------------------|
| 0 | 1 | 2 |
| 1 | 2 | 2 |
| 2 | 3 | 2 |
| 3 | 4 | 2 |
+----+----------------+------------------+
```
----
# Contributing
* Pull Requests (PRs) are welcome ☺️!
# Thanks!
Thanks for your time! This repo is just an idea, the above logic can be expanded and change the way you handle multiple data sources! Cheers :beers: