{"id":20630635,"url":"https://github.com/brookisme/dfgen","last_synced_at":"2026-04-18T17:32:51.641Z","repository":{"id":151273991,"uuid":"96054185","full_name":"brookisme/dfgen","owner":"brookisme","description":"Keras Image Generator from Dataframes","archived":false,"fork":false,"pushed_at":"2017-07-18T16:03:53.000Z","size":70,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-07T18:38:51.228Z","etag":null,"topics":["deeplearning","keras","machine-learning","tensorflow","theano"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brookisme.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-07-02T23:24:02.000Z","updated_at":"2018-08-23T18:50:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"1c1eefa2-dbda-4acb-b076-7a61a94d1e83","html_url":"https://github.com/brookisme/dfgen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/brookisme/dfgen","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brookisme%2Fdfgen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brookisme%2Fdfgen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brookisme%2Fdfgen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brookisme%2Fdfgen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brookisme","download_url
":"https://codeload.github.com/brookisme/dfgen/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brookisme%2Fdfgen/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31977964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T17:30:12.329Z","status":"ssl_error","status_checked_at":"2026-04-18T17:29:59.069Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deeplearning","keras","machine-learning","tensorflow","theano"],"created_at":"2024-11-16T14:09:05.291Z","updated_at":"2026-04-18T17:32:51.627Z","avatar_url":"https://github.com/brookisme.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## DFGen \n\n**Keras Image Generator from Dataframes**\n\nCreates generator from csv or dataframe.\nOptional Features:\n\n1. convert \"tag\" list to binary valued label vector for predictions\n2. save to train/test split files\n3. 
easy configuration with [yaml](#yaml) file\n\n##### INSTALL\n\n```bash\ncd ~/\ngit clone https://github.com/brookisme/dfgen.git\ncd dfgen\npip install -e .\n```\n\n---\n\n##### USAGE\n\nIn the examples below we have used the `dfg_config.yaml` file located [here](https://github.com/brookisme/dfgen/blob/master/example.dfg_config.yaml).\n\n- [Init|Train|Test](#traintest)\n- [Reduce Columns](#reduce_columns)\n- [DFGen.require_label](#require_label)\n- [Generator and Lambda](#lambda)\n\n---\n\n\u003ca name='traintest'\u003e\u003c/a\u003e\n\n###### save (processed) data to train and test csvs\n\n```bash\n# bash\n$ head data.csv \nimage_name,tags\ntrain_0,haze primary\ntrain_1,agriculture clear primary water\ntrain_2,clear primary\n\n# python\n\u003e\u003e\u003e from dfgen import DFGen\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',')\n\u003e\u003e\u003e gen.dataframe.sample(2)\n        image_name                 tags  \\\n7901    train_7901  clear primary water   \n38214  train_38214        clear primary   \n\n                                                  labels  \\\n7901   [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   \n38214  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   
\n\n                        paths  \n7901    images/tif/train_7901.tif  \n38214  images/tif/train_38214.tif  \n\n# save as train/test split\n\u003e\u003e\u003e gen.save(path='train.csv',split_path='test.csv')\n# or save the processed data (with labels, paths, require's)\n\u003e\u003e\u003e gen.save(path='processed_data.csv')\n\n# if you want small datasets for development you can use limit\n\u003e\u003e\u003e gen=DFGen(csv_file='train.csv',csv_sep=',')\n\u003e\u003e\u003e gen.size\n40479\n\u003e\u003e\u003e gen.limit(400)\n\u003e\u003e\u003e gen.size\n400\n\u003e\u003e\u003e gen.save(path='dev_train.csv',split_path='dev_test.csv',split=100)\n\n\n\n\n# side note: dfg_config file specifies tif but we could have loaded JPGs\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',',image_ext='jpg')\n\u003e\u003e\u003e gen.dataframe.paths.sample(2)\n21628    images/jpg/train_21628.jpg\n7955      images/jpg/train_7955.jpg\nName: paths, dtype: object\n```\n\n###### load data to train and test generators\n\n```bash\n# bash (note we have the label and path columns)\n$ head train.csv \nimage_name,tags,labels,paths\ntrain_12485,agriculture clear primary,\"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\",images/tif/train_12485.tif\ntrain_3535,clear cultivation primary,\"[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\",images/tif/train_3535.tif\ntrain_4857,agriculture cultivation habitation partly_cloudy primary road,\"[1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]\",images/tif/train_4857.tif\n\n# python\n\u003e\u003e\u003e train_gen=DFGen(csv_file='train.csv',csv_sep=',')\n\u003e\u003e\u003e test_gen=DFGen(csv_file='test.csv',csv_sep=',')\n\u003e\u003e\u003e train_gen.size/gen.size\n0.8000197633340744\n\u003e\u003e\u003e test_gen.size/gen.size\n0.19998023666592554\n```\n\n\n---\n\n\u003ca name='reduce_columns'\u003e\u003c/a\u003e\n###### Reduce Columns\n\n```bash\n\u003e\u003e\u003e 
gen=DFGen(csv_file='train.csv',csv_sep=',',image_ext='tif')\n\u003e\u003e\u003e gen.size\n40479\n\u003e\u003e\u003e gen.tags\n['primary', 'clear', 'agriculture', 'road', 'water', 'partly_cloudy', 'cultivation', 'habitation', 'haze', 'cloudy', 'bare_ground', 'selective_logging', 'artisinal_mine', 'blooming', 'slash_burn', 'conventional_mine', 'blow_down']\n\u003e\u003e\u003e gen.dataframe_with_tags('blow_down','cultivation').size\n32\n\u003e\u003e\u003e gen.reduce_columns('blow_down','cultivation')\n\u003e\u003e\u003e gen.tags\n['blow_down', 'cultivation', 'others']\n\u003e\u003e\u003e gen.dataframe.sample(2)\n        image_name                                  tags     labels  \\\n6550    train_6550                         clear primary  [0, 0, 1]   \n30966  train_30966  agriculture clear primary road water  [0, 0, 1]   \n\n                            paths  \n6550    images/tif/train_6550.tif  \n30966  images/tif/train_30966.tif  \n\u003e\u003e\u003e gen.reduce_columns('blow_down','cultivation',others=False)\n\u003e\u003e\u003e gen.tags\n['blow_down', 'cultivation']\n\u003e\u003e\u003e gen.dataframe.sample(2)\n        image_name                   tags  labels                       paths\n31901  train_31901  partly_cloudy primary  [0, 1]  images/tif/train_31901.tif\n14158  train_14158  partly_cloudy primary  [0, 1]  images/tif/train_14158.tif\n```\n\n\n--- \n\n\u003ca name='require_label'\u003e\u003c/a\u003e\n\n###### using require_label to reduce dataset\n\n```bash\n\u003e\u003e\u003e from dfgen import DFGen\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',')\n\u003e\u003e\u003e gen.size\n40479\n\u003e\u003e\u003e gen.dataframe.head(2)\n        image_name                                       tags  \\\n16452  train_16452  agriculture clear habitation primary road   \n20043  train_20043                              clear primary   \n\n                                                  labels  \\\n16452  [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...   
\n20043  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   \n\n                        paths  \n16452  images/tif/train_16452.tif  \n20043  images/tif/train_20043.tif  \n\n#\n# REQUIRE_LABEL:\n#\n\u003e\u003e\u003e gen.require_label('blow_down',70)\n\u003e\u003e\u003e gen.size\n140\n\u003e\u003e\u003e gen.tags\n['primary', 'clear', 'agriculture', 'road', 'water', 'partly_cloudy', 'cultivation', 'habitation', 'haze', 'cloudy', 'bare_ground', 'selective_logging', 'artisinal_mine', 'blooming', 'slash_burn', 'conventional_mine', 'blow_down']\n\u003e\u003e\u003e gen.dataframe.sample(2)\n      image_name                                             tags  \\\n55   train_23025   blow_down clear cultivation habitation primary   \n101  train_20618                        clear cultivation primary   \n\n                                                labels  \\\n55   [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...   \n101  [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...   \n\n                      paths  \n55   images/tif/train_23025.tif  \n101  images/tif/train_20618.tif  \n\n#\n# REQUIRE_LABEL: reduce_to_others=True \n#   - this is the same as:\n#       gen.require_label('blow_down',70)\n#       gen.reduce_columns('blow_down')\n#\n\u003e\u003e\u003e gen.require_label('blow_down',70,reduce_to_others=True)\n\u003e\u003e\u003e gen.size\n140\n\u003e\u003e\u003e gen.tags\n['blow_down', 'others']\n\u003e\u003e\u003e gen.dataframe.sample(2)\n      image_name                                         tags  labels  \\\n12   train_38607  agriculture blow_down partly_cloudy primary  [1, 1]   \n24   train_31495            blow_down clear primary blow_down  [1, 1]   \n\n                      paths  \n12   images/tif/train_38607.tif  \n109  images/tif/train_10679.tif  \n\n\n#\n# COMBINING REQUIRE LABELs\n#\n\u003e\u003e\u003e from dfgen import DFGen\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',',image_ext='tif')\n\u003e\u003e\u003e gen.size\n183\n\n# You can fetch the 
rows with specific tags\n\u003e\u003e\u003e gen.dataframe_with_tags('blow_down','cultivation').size\n32\n\u003e\u003e\u003e gen.dataframe_with_tags('blow_down','cultivation').head(2)\n        image_name                                               tags  \\\n25950  train_25950  agriculture blooming blow_down clear cultivati...   \n9961    train_9961    agriculture blow_down clear cultivation primary   \n\n                                                  labels  \\\n25950  [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...   \n9961   [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...   \n\n                        paths  \n25950  images/tif/train_25950.tif  \n9961    images/tif/train_9961.tif  \n\n# require_label and check percentages\n\u003e\u003e\u003e gen.require_label('blow_down',10)\n\u003e\u003e\u003e gen.dataframe_with_tags('blow_down').shape[0]/gen.size\n0.1\n\u003e\u003e\u003e gen.require_label('cultivation',60)\n\u003e\u003e\u003e gen.dataframe_with_tags('cultivation').shape[0]/gen.size\n0.6010928961748634\n\n# NOTE: The second require_label call affects the first.  
\n#       We no longer have exactly 10% blow_down.\n\u003e\u003e\u003e gen.dataframe_with_tags('blow_down').shape[0]/gen.size\n0.07650273224043716\n```\n\n\n---\n\n\u003ca name='lambda'\u003e\u003c/a\u003e\n\n###### generator and lambda\n\n```bash\n\u003e\u003e\u003e from dfgen import DFGen\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',')\n# returns first batch tuple (images,labels)\n\u003e\u003e\u003e batch=next(gen)\n# so batch[0][0] is the np.array for the first image in the batch\n# in this case the image has 4 bands: [blue, green, red, nir]\n\n#\n# LET'S PREPROCESS THE IMAGES\n#\ndef ndvi(img):\n    r=img[:,:,2]\n    nir=img[:,:,3]\n    return (nir-r)/(nir+r)\n\ndef ndvi_img(img):\n    ndvi_band=ndvi(img)\n    img[:,:,3]=ndvi_band\n    return img\n\n\u003e\u003e\u003e gen=DFGen(csv_file='data.csv',csv_sep=',',lambda_func=ndvi_img)\n# returns first batch tuple (ndvi-images,labels)\n\u003e\u003e\u003e batch=next(gen)\n# now batch[0][0] is the np.array for the first image in the \n# preprocessed-batch. it's the same image as above but it has been \n# passed through the 'ndvi_img' function. 
\n# The 4 bands are now: [blue, green, red, ndvi]\n```\n\n\n---\n\n##### COMMENT-DOCS\n\n```\n    \"\"\" CREATES GENERATOR FROM DATAFRAME\n        \n        create generator from existing dataframe or from a csv\n        \n        Methods:\n            .require_label: ensure a min percentage of a particular label\n            .save: save processed csv to csv or as train/test-split csvs\n            .__next__: generator method, batchwise return tuple of (images,labels)\n\n        Args:\n            * image_column (column with image path or name) is required\n            * label_column (column with label \"vectors\") is required \n                - if the label_column already exists the dataframe will contain the labels\n                - if the label_column does not exist and both tags and tags_to_labels_column\n                  are specified the tags will be converted to binary valued vectors\n            * tags: optional list of tags corresponding to places in the label vectors\n            * tags_to_labels_column: name of a column that contains a space separated \n                string of tags. 
these strings will be converted to the binary label vectors\n            * image_dir: root path for image_paths given in \"image_column\"\n            * image_ext:\n                - append to image_column values when loading images\n                - if using dfg_config file image_ext can determine image_dir\n            * lambda_func: function that acts on image data before it is returned to the user\n            * batch_size: batch_size\n    \"\"\"\n    \n    ...\n\n    def require_label(self,label_index_or_tag,pct,exact=False,reduce_to_others=False):\n        \"\"\"\n            Warning: Ordering matters\n\n                .require_label(1,40)\n                .require_label(2,20)\n            \n            may not equal:\n\n                .require_label(2,20)\n                .require_label(1,40)\n\n            Args:\n                * label_index_or_tag: \u003cint|tag\u003e\n                    - \u003cint\u003e(label_index): index of the label of interest\n                    - \u003cstr\u003e(tag): if \"tags\": the name of the tag of interest\n                * pct: \u003cint:0-100\u003e percentage required for label\n                * exact:\n                    if False and the label already has \u003e= pct of dataset\n                    return full-dataset\n                    else: remove data so that label is pct of dataset\n                * reduce_to_others:  
\n                    return labels as 2 vectors [label,others]\n        \"\"\"\n        ...\n\n\n    def reduce_columns(self,*indices_or_tags,others=True):\n        \"\"\" Keep passed columns and optional \"others\"\n\n            Usage:\n                gen.reduce_columns('blow_down','cultivation')\n\n            Args:\n                * str or int arguments: label indices or tag names\n                * others: \n                    - if falsey: do not include \"others\" column\n                    - else:\n                        include \"others\"\n                        - if others arg is \u003cstr\u003e: use others arg as column name\n                        - else: use \"others\" as column name\n        \"\"\"\n        ...\n\n\n    def limit(self,nb_rows):\n        \"\"\" limit number of rows in dataframe\n\n            Use to create dev training sets\n        \"\"\"\n        ...\n\n\n    def dataframe_with_tags(self,*tags):\n        \"\"\" return dataframe rows containing certain tags\n            Args: strings of tag names\n                ie. 
gen.dataframe_with_tags('blow_down','clear')\n        \"\"\"\n        ...\n\n\n    def save(self,path,split_path=None,split=0.2,sep=None):\n        \"\"\" save dataframe to csv(s)\n\n            usually save after processing (ie: tags-\u003elabels and/or require_label),\n            so you won't need to process again.\n            \n            if split_path and split: \n                - split dataframe into 2 csvs (path and split_path)\n                - if split is int: split = number of lines in split_csv\n                  else: split = fraction of full dataframe\n        \"\"\"\n        ...\n```\n\n\n---\n\n\n##### EXAMPLE CONFIG (in directory with .py or .ipynb file)\n\u003ca name='yaml'\u003e\u003c/a\u003e\n\n[dfg_config.yaml](https://github.com/brookisme/dfgen/blob/master/example.dfg_config.yaml)\n\n```\n# COLUMN NAMES\nimage_column: image_name\nlabel_column: labels\ntags_column: tags\n\n# IMAGE EXT\nimage_ext: tif\n\n# IMAGE DIR BY EXT\nimage_dirs: \n  tif: images/tif\n  jpg: images/jpg\n\n# BACKUP IMAGE DIR\nimage_dir: images/other\n\n# TAGS\ntags:\n    - primary\n    - clear\n    - agriculture\n    - road\n    - water\n    - partly_cloudy\n    - cultivation\n    - habitation\n    - haze\n    - cloudy\n    - bare_ground\n    - selective_logging\n    - artisinal_mine\n    - blooming\n    - slash_burn\n    - conventional_mine\n    - blow_down\n\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrookisme%2Fdfgen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrookisme%2Fdfgen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrookisme%2Fdfgen/lists"}