{"id":21034416,"url":"https://github.com/zpoint/idataapi-transform","last_synced_at":"2025-09-01T01:11:50.974Z","repository":{"id":50133288,"uuid":"115174544","full_name":"zpoint/idataapi-transform","owner":"zpoint","description":"Full async support toolkit for IDataAPI for efficiency work, read data from API/ES/csv/xlsx/json/redis/mysql/mongo/kafka, write to ES/csv/xlsx/json/redis/mysql/mongo/kafka, provide CLI and python API","archived":false,"fork":false,"pushed_at":"2024-10-31T06:07:18.000Z","size":392,"stargazers_count":44,"open_issues_count":0,"forks_count":16,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-26T13:54:54.124Z","etag":null,"topics":["asyncio","cli","csv","elasticsearch","mongodb","mysql","python3","redis","xlsx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zpoint.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-12-23T06:02:39.000Z","updated_at":"2024-10-31T06:07:22.000Z","dependencies_parsed_at":"2025-03-14T18:16:05.028Z","dependency_job_id":"275031de-5c55-4a19-ad97-217cd8256efe","html_url":"https://github.com/zpoint/idataapi-transform","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/zpoint/idataapi-transform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zpoint%2Fidataapi-transform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zpoint%2Fidataapi-transform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zpoint%2Fidataapi-transform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zpoint%2Fidataapi-transform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zpoint","download_url":"https://codeload.github.com/zpoint/idataapi-transform/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zpoint%2Fidataapi-transform/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273061077,"owners_count":25038596,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-31T02:00:09.071Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","cli","csv","elasticsearch","mongodb","mysql","python3","redis","xlsx"],"created_at":"2024-11-19T13:04:08.733Z","updated_at":"2025-09-01T01:11:50.942Z","avatar_url":"https://github.com/zpoint.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# idataapi-transform\n\n**idataapi-transform** is a pure python, **full async** support **Toolkit**, to transform data from one location/format to another location/format, provide **command line interface** for easy use, and **python module** with powerful features\n\nIt's currently used in [IDataAPI](http://www.idataapi.cn/) team for efficiency work\n\n* [中文文档](https://github.com/zpoint/idataapi-transform/blob/master/README_CN.md)\n\nProvide\n\n* [Command line interface](#command-line-interface-example)\n* [Python API](#build-complex-routine-easily)\n\n![diagram](https://github.com/zpoint/idataapi-transform/blob/master/idataapi-transform.png)\n\nYou can read data from one of\n\n * **API(http)**\n * **ES(ElasticSearch)**\n * **CSV**\n * **XLSX**\n * **JSON**\n * **Redis**\n * **MySQL**\n * **MongoDB**\n\nand convert to\n\n * **CSV**\n * **XLSX**\n * **JSON**\n * **TXT**\n * **Redis**\n * **MySQL**\n * **MongoDB**\n * **Kafka**\n\nFeatures:\n\n* with asyncio support and share same API\n* Baesd on Template Method and Factory Method\n* Every failure network read operation will log to error log before retry 3(default) times\n* Every failure network write operation will log to error log before retry 3(default) times\n* Command line support for simple usage, python module provide more features\n* Every Getter and Writer support filter, you can alter or drop your data in filter\n* Auto header generation(csv/xlsx)/table generation(mysql) based on your data\n* APIGetter, will request next page automatically，each page will retry max_retry before fail\n* APIBulkGetter support bulk processing for APIGetter/url object, you can feed iterable object or async generator\n* easy to use concurrent control for APIBulkGetter\n* support normal call back and async call back\n* persistent task to disk\n-------------------\n\n### catalog\n\n* [Requirment](#requirment)\n* [Installation](#installation)\n* [Command line interface Example](#command-line-interface-example)\n\t* [Elasticsearch to CSV](#read-data-from-elasticsearch-convert-to-csv)\n\t* [API to xlsx](#read-data-from-api-convert-to-xlsx)\n\t* [json to csv](#read-data-from-json-convert-to-csv)\n\t* [csv to xlsx](#read-data-from-csv-convert-to-xlsx)\n\t* [Elasticsearch to csv with parameters](#read-data-from-elasticsearch-convert-to-csv-with-parameters)\n\t* [API to redis](#read-data-from-api-write-to-redis)\n\t* [redis to csv](#read-data-from-redis-write-to-csv)\n\t* [API to MySQL](#read-data-from-api-write-to-mysql)\n\t* [MySQL to redis](#read-data-from-mysql-write-to-redis)\n\t* [MongoDB to csv](#read-data-from-mongodb-write-to-csv)\n* [Python module support](#python-module-support)\n\t* [ES to csv](#es-to-csv)\n\t* [API to xlsx](#api-to-xlsx)\n\t* [CSV to xlsx](#csv-to-xlsx)\n\t* [API to redis](#api-to-redis)\n    * [redis to MySQL](#redis-to-mysql)\n    * [MongoDB to redis](#mongodb-to-redis)\n\t* [Bulk API to ES/MongoDB/Json](#bulk-api-to-es-or-mongodb-or-json)\n\t* [Extract error info from API](#extract-error-info-from-api)\n\t* [call_back](#call_back)\n\t* [filter](#redis-to-mysql)\n    * [done_if](#api-to-xlsx)\n    * [persistent to disk(resume from break point)](#persistent-to-disk)\n* [REDIS Usage](#redis-usage)\n* [ES Base Operation](#es-base-operation)\n\t* [Read data from ES](#read-data-from-es)\n\t* [Write data to ES](#write-data-to-es)\n\t* [DELETE data from ES](#delete-data-from-es)\n\t* [API to ES in detail](#api-to-es-in-detail)\n\t* [Get ES Client](#get-es-client)\n* [Config](#config)\n\t* [ini file](#ini-file)\n\t* [manual config in program](#manual-config-in-program)\n* [Doc String](#doc-string)\n* [Change Log](#changelog)\n* [License](#license)\n\n-------------------\n\n#### Requirment\n* python version \u003e= 3.5.2\n* If you need MySQL enable, your python version should be \u003e= 3.5.3\n* If you need MongoDB enable, your platform should not be **windows**\n-------------------\n\n#### Installation\n\n\tpython3 -m pip install idataapi-transform\n    # shell, run\n    transform --help # explanation of each parameter and create configure file\n    # edit ~/idataapi-transform.ini to config elasticsearch hosts, redis, mysql etc...\n\n    # Install MySQL module, if your python version \u003e= 3.5.3\n    python3 -m pip install 'PyMySQL\u003c=0.9.2,\u003e=0.9'\n    python3 -m pip install aiomysql\n\n    # Install MongoDB module, if your platform is not Windows\n    python3 -m pip install motor\n\n-------------------\n\n#### Command line interface Example\n\n* Read data from **[API, ES, CSV, XLSX, JSON, Redis, MySQL, MongoDB]**\n* Write data to **[CSV, XLSX, JSON, TXT, ES, Redis, MySQL, MongoDB]**\n\n##### read data from Elasticsearch convert to CSV\n\nwill read at most **500** items from given **index**: **knowledge20170517**, and write to ./result.csv\n\n\ttransform ES csv \"knowledge20170517\" --max_limit=500\n\n##### read data from API convert to XLSX\n\nwill read all items from given api url, until no more next page, and save to dest(/Users/zpoint/Desktop/result.xlsx), **dest is optional, default is ./result.xlsx**\n\n\ttransform API xlsx \"http://xxx/post/dengta?kw=中国石化\u0026apikey=xxx\" \"/Users/zpoint/Desktop/result\"\n\n##### read data from JSON convert to csv\n\nwill read items from json file, and save to **./result.csv**\n\n\ttransform JSON csv \"/Users/zpoint/Desktop/a.json\"\n\n##### read data from CSV convert to xlsx\n\nwill read items from csv file, and save to **./result.xlsx**\n\n\ttransform CSV xlsx \"./a.csv\"\n\n\n##### read data from Elasticsearch convert to CSV with parameters\n* save csv with file encoding \"gbk\" **(--w_encoding)**\n* specific index: knowledge20170517, **(knowledge20170517)**\n* when read from Elasticsearch, specific request body **(--query_body)**\n\n    \tbody = {\n        \t\"size\": 100,\n        \t\"_source\": {\n            \t\"includes\": [\"location\", \"title\", \"city\", \"id\"]\n                }\n              }\n\n* before write to csv, add timestamp to each item, and drop items with null city **(--filter)**\n\n        # create a file name my_filter.py (any filename will be accepted)\n        import time\n        def my_filter(item): # function name must be \"my_filter\"\n            item[\"createtime\"] = int(time.time())\n            if item[\"city\"]:\n                return item # item will be write to destination\n            # reach here, means return None, nothing will be write to destination\n\n* Shell:\n\n    \ttransform ES csv \"knowledge20170517\" --w_encoding gbk --query_body '{\"size\": 100, \"_source\": {\"includes\": [\"location\", \"title\", \"city\", \"id\"]}}' --filter ./my_filter.py\n\n##### Read data from API write to Redis\n\n* redis key name: my_key\n* redis store/read support LIST and HASH, default value is LIST, you can change it with  --key_type parameter\n\nwill read data from ./a.csv, and save to redis LIST data structure, KEY: my_key\n\n\ttransform API redis \"http://xxx/post/dengta?kw=中国石化\u0026apikey=xxx\" my_key\n\n##### Read data from Redis write to csv\n\nwill read data from redis key **my_key**, read at most 100 data， and save to **./result.csv**\n\n\ttransform Redis csv my_key --max_limit 100\n\n##### Read data from API write to MySQL\n\n* auto create table if not exist\n\nwill read data from **API**, read at most 50 data， and save to MySQL table: **my_table**\n\n\ttransform API MYSQL 'http://xxx' my_table --max_limit=50\n\n##### Read data from MySQL write to redis\n\nwill read data from MySQL table **my_table**, each read operation fetch 60 items， and save to a redis LIST name **result**, **result** is the default key name if you don't provide one\n\n\ttransform MYSQL redis my_table --per_limit=60\n\n##### Read data from MongoDB write to csv\n\n* you can provide --query_body\n\nwill read at most 50 data from \"my_coll\", and save to **./result.csv**\n\n\ttransform mongo csv my_coll --max_limit=50\n\n\n-------------------\n\n#### Python module support\n\n##### ES to csv\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n\tasync def example():\n        body = {\n            \"size\": 100,\n            \"_source\": {\n                \"includes\": [\"likeCount\", \"id\", \"title\"]\n                }\n        }\n        es_config = GetterConfig.RESConfig(\"post20170630\", \"news\", max_limit=1000, query_body=body)\n        es_getter = ProcessFactory.create_getter(es_config)\n        csv_config = WriterConfig.WCSVConfig(\"./result.csv\")\n        with ProcessFactory.create_writer(csv_config) as csv_writer:\n            async for items in es_getter:\n                # do whatever you want with items\n                csv_writer.write(items)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n##### API to xlsx\n\n    import time\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    yesterday_ts = int(time.time()) - 24 * 60 * 60\n\n    def my_done_if(items):\n        # RAPIConfig will fetch next page until\n        # 1. no more page or\n        # 2. reach max_limit or\n        # 3. error occurs\n        # if you want to terminate fetching in some condition, you can provide done_if function\n        if items[-1][\"publishDate\"] \u003c yesterday_ts:\n        \treturn True\n        return False\n\n\tasync def example():\n        api_config = GetterConfig.RAPIConfig(\"http://xxxx\")\n        # or you can use: api_config = GetterConfig.RAPIConfig(\"http://xxxx\", done_if=my_done_if)\n        getter = ProcessFactory.create_getter(api_config)\n        xlsx_config = WriterConfig.WXLSXConfig(\"./result.xlsx\")\n        with ProcessFactory.create_writer(xlsx_config) as xlsx_writer:\n        \tasync for items in getter:\n                # do whatever you want with items\n                xlsx_writer.write(items)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### CSV to xlsx\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    async def example():\n        csv_config = GetterConfig.RCSVConfig(\"./result.csv\")\n        getter = ProcessFactory.create_getter(csv_config)\n        xlsx_config = WriterConfig.WXLSXConfig(\"./result.xlsx\")\n        with ProcessFactory.create_writer(xlsx_config) as xlsx_writer:\n            for items in getter:\n                # do whatever you want with items\n                xlsx_writer.write(items)\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### API to redis\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    async def example():\n        api_config = GetterConfig.RAPIConfig(\"http://xxx\")\n        getter = ProcessFactory.create_getter(api_config)\n        redis_config = WriterConfig.WRedisConfig(\"key_a\")\n        with ProcessFactory.create_writer(redis_config) as redis_writer:\n            async for items in getter:\n                # do whatever you want with items\n                await redis_writer.write(items)\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### redis to MySQL\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    def my_filter(item):\n        # I am a filter\n        # Every getter or writer created by ProcessFactory.create can set up a filter\n        # every data will be pass to filter before return from getter, or before write to writer\n        # you can alter data here, or drop data here\n    if item[\"viewCount\"] \u003e 10:\n        return item\n        # if don't return anything(return None) means drop this data\n\n\tasync def example():\n        api_config = GetterConfig.RAPIConfig(\"http://xxxx\", filter_=my_filter)\n        getter = ProcessFactory.create_getter(api_config)\n        mysql_config = WriterConfig.WMySQLConfig(\"my_table\")\n        with ProcessFactory.create_writer(mysql_config) as mysql_writer:\n        \tasync for items in getter:\n                # do whatever you want with items\n                await mysql_writer.write(items)\n\n        # await mysql_config.get_mysql_pool_cli() # aiomysql connection pool\n        # mysql_config.connection # one of the connection in previous connection pool\n        # mysql_config.cursor # cursor of previous connection\n        # you should alaways call 'await mysql_config.get_mysql_pool_cli()' before use connection and cursor\n        # provided by GetterConfig.RMySQLConfig and WriterConfig.WMySQLConfig\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n##### MongoDB to redis\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n\tasync def example():\n        mongo_config = GetterConfig.RMongoConfig(\"coll_name\")\n        mongo_getter = ProcessFactory.create_getter(mongo_config)\n        redis_config = WriterConfig.WRedisConfig(\"my_key\")\n        with ProcessFactory.create_writer(redis_config) as redis_writer:\n        \tasync for items in mongo_getter:\n                # do whatever you want with items\n                await mysql_writer.write(items)\n\n        # print(mongo_config.get_mysql_pget_mongo_cli()) # motor's AsyncIOMotorClient instance\n        # provided by GetterConfig.RMongoConfig and WriterConfig.WMongoConfig\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n\n##### Bulk API to ES or MongoDB or Json\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    \"\"\"\n    RAPIConfig support parameter:\n    max_limit: get at most max_limit items, if not set, get all\n    max_retry: if request fail, retry max_retry times\n    filter_: run \"transform --help\" to see command line interface explanation for detail\n    \"\"\"\n\n    def url_generator():\n        for i in range(10000):\n            yield url % (i, ) # yield RAPIConfig(url  % (i, )) will be OK\n\n    async def example():\n        # urls can be any iterable object, each item can be api url or RAPIConfig\n        # RAPIBulkConfig accept a parameter: interval，means interval between each async generator return\n        # if you set interval to 2 seconds，the async for will wait for 2 seconds before return，every data you get between these 2 seconds will be returned\n\n        urls = [\"http://xxxx\", \"http://xxxx\", GetterConfig.RAPIConfig(\"http://xxxx\"), ...]\n        api_bulk_config = GetterConfig.RAPIBulkConfig(urls, concurrency=100)\n        api_bulk_getter = ProcessFactory.create_getter(api_bulk_config)\n        es_config = WriterConfig.WESConfig(\"profile201712\", \"user\")\n        with ProcessFactory.create_writer(es_config) as es_writer:\n            async for items in api_bulk_getter:\n                # do whatever you want with items\n                await es_writer.write(items)\n\n    async def example2mongo():\n        urls = url_generator()\n        api_bulk_config = GetterConfig.RAPIBulkConfig(urls, concurrency=50)\n        api_bulk_getter = ProcessFactory.create_getter(api_bulk_config)\n        # you can config host.port in configure file，or pass as parameters，parameters have higher priority than configure file\n        mongo_config = WriterConfig.WMongoConfig(\"my_coll\")\n        with ProcessFactory.create_writer(mongo_config) as mongo_writer:\n            async for items in api_bulk_getter:\n                # do whatever you want with items\n                await mongo_writer.write(items)\n\n    # ******************************************************\n    # Below is \"async generator\" example for RAPIBulkConfig\n    # keyword \"yield\" in \"async\" function only support for python3.6+，for python 3.5+ please refer below\n    # https://github.com/python-trio/async_generator\n    # Only idataapi-transform version \u003e= 1.4.4 support this feature\n    # ******************************************************\n\n    async def put_task2redis():\n        writer = ProcessFactory.create_writer(WriterConfig.WRedisConfig(\"test\"))\n        await writer.write([\n            {\"keyword\": \"1\"},\n            {\"keyword\": \"2\"},\n            {\"keyword\": \"3\"}\n        ])\n\n    async def async_generator():\n        \"\"\"\n        I am async generator\n        \"\"\"\n        getter = ProcessFactory.create_getter(GetterConfig.RRedisConfig(\"test\"))\n        async for items in getter:\n            for item in items:\n                r = GetterConfig.RAPIConfig(\"http://xxx%sxxx\" % (item[\"keyword\"], ), max_limit=100)\n                yield r\n\n    async def example2json():\n        # await put_task2redis()\n        urls = async_generator()\n        api_bulk_config = GetterConfig.RAPIBulkConfig(urls, concurrency=30)\n        api_bulk_getter = ProcessFactory.create_getter(api_bulk_config)\n        json_config = WriterConfig.WJsonConfig(\"./result.json\")\n        with ProcessFactory.create_writer(json_config) as json_writer:\n            async for items in api_bulk_getter:\n                # do whatever you want with items\n                json_writer.write(items)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### Extract error info from API\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig\n\n\tasync def example_simple():\n        # if return_fail set to true, after retry 3(default) times,\n        # still unable to get data error info will be returned in \"bad_items\"\n        url = \"xxx\"\n        config = GetterConfig.RAPIConfig(url, return_fail=True)\n        reader = ProcessFactory.create_getter(config)\n        async for good_items, bad_items in reader:\n        \tprint(good_items)\n            if len(bad_items) \u003e 0:\n            \terr_obj = bad_items[0]\n                print(err_obj.response) # http body, if network down，it will be None\n                print(err_obj.tag) # tag you pass to RAPIConfig, default None\n                print(err_obj.source) #  url you pass to RAPIConfig\n                print(err_obj.error_url) # the url that elicit error\n\n    async def example():\n        unfinished_id_set = {'246834800', '376796200', '339808400', ...}\n        config = GetterConfig.RAPIBulkConfig((RAPIConfig(base_url % (i,), return_fail=True, tag=i) for i in unfinished_id_set), return_fail=True, concurrency=100)\n        reader = ProcessFactory.create_getter(config)\n        async for good_items, bad_items in reader:\n            # A: When you set RAPIBulkConfig's return_fail to True,\n            # 1）normal url will retuen error info\n            # 2) RAPIConfig Object with return_fail set to True will retuen error info\n            # 3) RAPIConfig Object with return_fail set to False will not retuen error info\n            # B: Whrn you set RAPIBulkConfig's return_fail to False(default)\n            # None of the previous situitions will return error info, same as the API to ES example above\n        \tprint(bad_items)\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example_simple())\n\n\n##### call_back\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    \"\"\"\n    call_back can be normal function or async funcion\n     only RAPIConfig support call_back parameter, call_back will be used after filter，\n    whatever call_back return will return to user\n    \"\"\"\n\n    async def fetch_next_day(self, items):\n        \"\"\" for each item, fetch seven days in order, combine data and return to user \"\"\"\n        prev_item = items[0]\n        date_obj = # get a timestamp or datetime object from prev_item\n        if date_obj not in seven_days:\n            call_back = None\n        else:\n            call_back = self.fetch_next_day\n        getter_config = GetterConfig.RAPIConfig(url, call_back=call_back)\n        getter = ProcessFactory.create_getter(getter_config)\n        async for items in getter:\n            item = items[0]\n            item = combine(item, prev_item)\n            return [item]\n\n    async def fetch_next_day_with_return_fail(self, good_items, bad_items):\n        \"\"\" for each item, fetch seven days in order, combine data and return to user \"\"\"\n        if bad_items:\n            # error in current level\n            pass\n        prev_item = good_items[0]\n        date_obj = # get a timestamp or datetime object from prev_item\n        if date_obj not in seven_days:\n            call_back = None\n        else:\n            call_back = self.fetch_next_day_with_return_fail\n        getter_config = GetterConfig.RAPIConfig(url, call_back=call_back, return_fail=True)\n        getter = ProcessFactory.create_getter(getter_config)\n        async for good_items, bad_items in getter:\n            if bad_items:\n                # next level retrn error, how to handle it?\n                return None, bad_items\n\n            item = good_items[0]\n            item = combine(item, prev_item)\n            return [item], bad_items\n\n    async def start(self):\n        id_set = {...a set of id...}\n        url_generator = (GetterConfig.RAPIConfig(base_url % (id_, ), call_back=fetch_next_day) for id_ in self.id_set)\n        bulk_config = GetterConfig.RAPIBulkConfig(url_generator, concurrency=20, interval=1)\n        bulk_getter = ProcessFactory.create_getter(bulk_config)\n        with ProcessFactory.create_writer(...) as writer:\n            async for items in bulk_getter:\n                await writer.write(items)\n\n    async def start_with_return_fail(self):\n        id_set = {...a set of id...}\n        url_generator = (GetterConfig.RAPIConfig(base_url % (id_, ), call_back=fetch_next_day) for id_ in self.id_set, return_fail=True)\n        bulk_config = GetterConfig.RAPIBulkConfig(url_generator, concurrency=20, interval=1, return_fail=True)\n        bulk_getter = ProcessFactory.create_getter(bulk_config)\n        with ProcessFactory.create_writer(...) as writer:\n            async for items in bulk_getter:\n                await es_writer.write([i for i in good_items if i is not None])\n\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(start())\n\n##### persistent to disk\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    ### persistent -\u003e only RAPIBulkConfig support this parameter，it's a boolean value, means whether to persist finished job ID to disk, default value is False\n    ### if persistent set to True, and you provide a batch of jobs to RAPIBulkConfig, because of the non-blocking event driven architecture，we can't simply record an order to disk, we create a json file in the working directory，and persist every finished job ID to disk, when program restart，it will load this json file，and won't execute those job that has been executed before\n    ### persistent_key -\u003e the json file name, to identify record of which batch of jobs\n    ### persistent_start_fresh_if_done -\u003e if all jobs done, whether remove the persistent json file, if the persistent file hasn't been removed and all of the jobs finished, next time you run the program, there will be no job to schedule, default is True\n    ### persistent_to_disk_if_give_up -\u003e  if there's a job fail after retry max_retry times, whether regard this job as success and persistent to disk or not, default is True\n\n    async def exapmle():\n        urls = [\n            \"http://xxx\",\n            \"http://xxx\",\n            GetterConfig.RAPIConfig(\"http://xxx\", persistent_to_disk_if_give_up=True)\n        ]\n        getter = ProcessFactory.create_getter(GetterConfig.RAPIBulkConfig(urls, persistent=True, persistent_to_disk_if_give_up=False))\n        async for items in getter:\n            print(items)\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n\n### REDIS Usage\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig\n\n    async def example_simple():\n        # default key_type is LIST in redis\n        # you can pass parameter \"encoding\" to specify how to encode before write to redis, default utf8\n        json_lists = [...]\n        wredis_config = WriterConfig.WRedisConfig(\"my_key\")\n        writer = ProcessFactory.create_writer(wredis_config)\n        await writer.write(json_lists)\n\n        # get async redis client\n        client = await wredis_config.get_redis_pool_cli()\n        # if you instance a getter_config, you can get client by 'getter_config.get_redis_pool_cli()'\n        # then, you can do watever you want in redis\n        r = await client.hset(\"xxx\", \"k1\", \"v1\")\n        print(r)\n\n    async def example():\n        # specify redis's key_type to HASH, default is LIST\n        # compress means string object is compressed by zlib before write to redis,\n        # we need to decompress it before turn to json object\n        # you can pass parameter \"need_del\" to specify whether need to del the key after get object from redis, default false\n        # you can pass parameter \"direction\" to specify whether read data from left to right or right to left, default left to right(only work for LIST key type)\n        getter_config = GetterConfig.RRedisConfig(\"my_key_hash\", key_type=\"HASH\", compress=True)\n        async for items in reader:\n            print(items)\n\n\n    if __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n#### ES Base Operation\n\n##### Read data from ES\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, GetterConfig\n\n\tasync def example():\n        # max_limit: means get at most max_limit items, if you don't provide it, means read all items\n        # you can provide your query_body to es_config\n        es_config = GetterConfig.RESConfig(\"post20170630\", \"news\", max_limit=1000)\n        es_getter = ProcessFactory.create_getter(es_config)\n        async for items in es_getter:\n            print(items)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### Write data to ES\n\n    import asyncio\n\tfrom idataapi_transform import ProcessFactory, WriterConfig\n\n    def my_hash_func(item):\n        # generate ES_ID by my_hash_func\n        return hashlib.md5(item[\"id\"].encode(\"utf8\")).hexdigest()\n\n\tasync def example():\n        json_lists = [#lots of json object]\n        # actions support create, index, update default index\n        # you can ignore \"id_hash_func\" to use default function to create ES_ID, see \"API to ES in detail\" below\n        es_config = WriterConfig.WESConfig(\"post20170630\", \"news\", actions=\"create\")\n        es_writer = ProcessFactory.create_writer(es_config)\n        await es_writer.write(json_lists)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### DELETE data from ES\n\n    import asyncio\n    import json\n    from idataapi_transform import ProcessFactory, WriterConfig\n\n\tasync def example():\n        # wrapper of delete_by_query API\n        body = {\"size\": 100,  \"query\": {\"bool\": {\"must\": [{\"term\": {\"createDate\": \"1516111225\"}}]}}}\n        writer = ProcessFactory.create_writer(WriterConfig.WESConfig(\"post20170630\", \"news\"))\n        r = await writer.delete_all(body=body)\n        print(json.dumps(r))\n\n\tasync def example_no_body():\n        # same as above, without , delete all\n        writer = ProcessFactory.create_writer(WESConfig(\"post20170630\", \"news\"))\n        r = await writer.delete_all()\n        print(json.dumps(r))\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n##### API to ES in detail\n\n    import time\n    import asyncio\n    from idataapi_transform import ProcessFactory, WriterConfig, GetterConfig\n    \"\"\"\n    Every es document need an _id\n    There are two rules to generate _id for ES inside this tool\n    1) iF privide \"id_hash_func\" parameter when create WESConfig Object, _id will be id_hash_func(item)\n    2) if rule 1 fail to match and the data(dictionary object) has key \"id\" and key \"appCode\"，_id will be md5(appCode_id)\n    3) if rule 1 and rule 2 both fail to match, _id will be md5(str(item))\n    \"\"\"\n\n    # global variables\n    now_ts = int(time.time())\n\n\tdef my_filter(item):\n        # I am a filter\n        # Every getter or writer created by ProcessFactory.create can set up a filter\n        # every data will be pass to filter before return from getter, or before write to writer\n        # you can alter data here, or drop data here\n        if \"posterId\" in item:\n        \treturn item\n        # if don't return anything(return None) means drop this data\n\n    async def example():\n        # urls can be any iterable object, each item can be api url or RAPIConfig\n        urls = [\"http://xxxx\", \"http://xxxx\", \"http://xxxx\", RAPIConfig(\"http://xxxx\", max_limit=10)]\n        # set up filter，drop every item without \"posterId\"\n        api_bulk_config = GetterConfig.RAPIBulkConfig(urls, concurrency=100, filter_=my_filter)\n        api_bulk_getter = ProcessFactory.create_getter(api_bulk_config)\n        # you can also set up filter here\n        # createDate parameter set same \"createDate\" for every data written by this es_writer\n        # Of course，you can ignore \"createDate\", es_writer will set every data's \"createDate\" to the current system's timestamp when it performs write operation, you can disable it by parameter auto_insert_createDate=False\n        # add parameter \"appCode\" for every data so that it can generate _id by rule 1\n        es_config = WriterConfig.WESConfig(\"profile201712\", \"user\", createDate=now_ts)\n        with ProcessFactory.create_writer(es_config) as es_writer:\n            async for items in api_bulk_getter:\n                # do whatever you want with items\n                await es_writer.write(items)\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n##### Get ES Client\n\n    import asyncio\n    import json\n\tfrom idataapi_transform import ProcessFactory, WriterConfig\n\n\tasync def example():\n        writer = ProcessFactory.create_writer(WriterConfig.WESConfig(\"post20170630\", \"news\"))\n        client = writer.config.es_client\n        # a client based on elasticsearch-async, you can read offical document\n\n\tif __name__ == \"__main__\":\n        loop = asyncio.get_event_loop()\n        loop.run_until_complete(example())\n\n\n\n-------------------\n\n#### Config\n\n##### ini file\n\nBy default, program will load config file in the following order\n* the ini file your specific(see below)\n* ./idataapi-transform.ini\n* ~/idataapi-transform.ini\n\nif none of the configure file exists, program will create **~/idataapi-transform.ini** automatically and use it as default\n\n##### manual config in program\n\nBy default, program will log to file configured in **idataapi-transform.ini**, and also log to console, all of the log will be formatted\nIf you don't want any of it, you can disable it\n\n    from idataapi_transform import ManualConfig\n    ManualConfig.disable_log()\n\nIf you want to specific your own configure file\n\n    from idataapi_transform import ManualConfig\n    ManualConfig.set_config(\"/Users/zpoint/Desktop/idataapi-transform.ini\")\n\nIf you want to change log directory in the run time\n\n    from idataapi_transform import ManualConfig\n    # at most 5MB per log file\n    ManualConfig.set_log_path(\"/Users/zpoint/Desktop/logs/\", 5242880)\n\n-------------------\n\n#### doc string\n\n\tfrom idataapi_transform import GetterConfig, WriterConfig\n\n    # run help on config to see detail\n    help(GetterConfig.RAPIConfig)\n    \"\"\"\n    ...\n    will request until no more next_page to get, or get \"max_limit\" items\n\n    :param source: API to get, i.e. \"http://...\"\n    :param per_limit: how many items to get per time\n    :param max_limit: get at most max_limit items, if not set, get all\n    :param max_retry: if request fail, retry max_retry times\n    :param random_min_sleep: if request fail, random sleep at least random_min_sleep seconds before request again\n    :param random_max_sleep: if request fail, random sleep at most random_min_sleep seconds before request again\n    :param session: aiohttp session to perform request\n    :param args:\n    :param kwargs:\n\n    Example:\n        api_config = RAPIConfig(\"http://...\")\n        api_getter = ProcessFactory.create_getter(api_config)\n        async for items in api_getter:\n            print(items)\n    ...\n\t\"\"\"\n\n\n-------------------\n#### API to Kafka\n```python\nimport asyncio\n\nfrom idataapi_transform import ProcessFactory, GetterConfig, WriterConfig, ManualConfig\n\nManualConfig.set_config(\"conf/idataapi-transform.ini\")\n\n\nasync def example():\n    urls = [\n        \"http://api01.idataapi.cn:8000/article/idataapi?kw=%E9%9A%86%E5%9F%BA%E8%82%A1%E4%BB%BD\u0026KwPosition=3\u0026size=20\u0026catLabel2=%E8%82%A1%E7%A5%A8\u0026apikey=test\",\n    ]\n    api_bulk_config = GetterConfig.RAPIBulkConfig(urls, concurrency=1)\n    api_bulk_getter = ProcessFactory.create_getter(api_bulk_config)\n    kafka_config = WriterConfig.WKafkaConfig()\n    with ProcessFactory.create_writer(kafka_config, topic=\"news\") as kafka_writer:\n        async for items in api_bulk_getter:\n            # do whatever you want with items\n            await kafka_writer.write(items)\n\n\nif __name__ == \"__main__\":\n    loop = asyncio.get_event_loop()\n    loop.run_until_complete(example())\n\n```\n\n---------------------------\n#### ChangeLog\nv 1.6.6 - 1.6.9\n* redis manual db fix\n* keep_other_fields, keep_fields\n* mysql charset\n* self define http headers\n\nv 1.6.3 - 1.6.4\n* persistent to disk\n* debug mode support\n\nv 1.5.1 - 1.6.1\n* random sleep float seconds support\n* es specific host \u0026\u0026 headers\n* RAPIGetter HTTP POST support\n* xlsx/csv headers, append mode support\n\nv 1.4.7 - 1.5.1\n* done_if param support\n* manual success_ret_code config for user\n* xlsxWriter replace ilegal characters automatically\n\nv 1.4.4 - 1.4.6\n* RAPIBulkGetter support async generator\n* ini config relative path support, manual config support\n\nv 1.4.3\n* fix logging bug\n* max_limit limit number of data before filter\n* report_interval add for APIGetter\n\nv 1.4.1\n* call_back support\n* mongodb auth support and motor 2.0 support\n* mongodb support\n* fix APIBulkGetter incompleted data bug\n* 3.5 compatiable\n* ESGetter get all data instead of half\n* compatible with elasticsearch-async-6.1.0\n* ESClient singleton\n\nv 1.2.0\n* mysql support\n* redis support\n* retry 3 times for every write operation\n* ES create operation\n* shorter import directory\n\nv.1.0.1 - 1.1.1\n* fix es getter log error\n* unclose session error for elasticsearch\n* fix ES infinity scroll\n* fix bug (cli)\n* es_client msearch support\n* fix XLSX reader\n* return_fail for APIGetter\n* compatible for aiohttp 3.x\n\nv.1.0\n* fix ESWriter log bug\n* timeout add for ESWriter\n\nv.0.9\n* filter for every getter\n* createDate for ESWriter\n* APIGetter per_liimt bug fix\n* new session for all RAPIBulkConfig\n\nv.0.8\n* error logging when unable to insert to target for ESWriter\n* actions parameter add for WESConfig\n* id_hash func change for ESWriter\n\n\nv.0.7\n* remove APIGetter infinity loop for empty result\n\nv.0.6\n* No error when read empty item from ESGetter\n\nv.0.5\n* fetch more items for ESGetter in CLI per request\n* per_limit param fix for ESGetter in CLI\n\nv.0.4\n* appCode for ESWriter Config\n* ESGetter CLI bug fix\n\nv.0.3\n* doc string for each config\n* RAPIBulkConfig support\n\n-------------------\n\n#### License\n\nhttp://rem.mit-license.org\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzpoint%2Fidataapi-transform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzpoint%2Fidataapi-transform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzpoint%2Fidataapi-transform/lists"}