{"id":13446385,"url":"https://github.com/nerevu/riko","last_synced_at":"2025-05-15T04:05:58.420Z","repository":{"id":53336746,"uuid":"60261863","full_name":"nerevu/riko","owner":"nerevu","description":"A Python stream processing engine modeled after Yahoo! Pipes","archived":false,"fork":false,"pushed_at":"2021-12-28T23:01:39.000Z","size":2705,"stargazers_count":1601,"open_issues_count":22,"forks_count":75,"subscribers_count":50,"default_branch":"master","last_synced_at":"2025-05-14T14:04:28.171Z","etag":null,"topics":["asynchronous","cli","data","etl","featured","functional-programming","library","parallelism","rss","stream-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nerevu.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-06-02T12:22:51.000Z","updated_at":"2025-04-26T05:02:52.000Z","dependencies_parsed_at":"2022-08-28T02:00:25.118Z","dependency_job_id":null,"html_url":"https://github.com/nerevu/riko","commit_stats":null,"previous_names":[],"tags_count":103,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerevu%2Friko","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerevu%2Friko/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerevu%2Friko/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerevu%2Friko/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nerevu","download_url":"https://codeload.github.com/nerevu/riko/tar.gz/refs/heads/master","h
ost":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270646,"owners_count":22042859,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asynchronous","cli","data","etl","featured","functional-programming","library","parallelism","rss","stream-processing"],"created_at":"2024-07-31T05:00:52.434Z","updated_at":"2025-05-15T04:05:53.404Z","avatar_url":"https://github.com/nerevu.png","language":"Python","readme":"riko: A stream processing engine modeled after Yahoo! Pipes\n===========================================================\n\n|travis| |versions| |pypi|\n\nIndex\n-----\n\n`Introduction`_ | `Requirements`_ | `Word Count`_ | `Motivation`_ | `Usage`_ |\n`Installation`_ | `Design Principles`_ | `Scripts`_ | `Command-line Interface`_ |\n`Contributing`_ | `Credits`_ | `More Info`_ | `Project Structure`_ | `License`_\n\nIntroduction\n------------\n\n**riko** is a pure Python `library`_ for analyzing and processing ``streams`` of\nstructured data. ``riko`` has `synchronous`_ and `asynchronous`_ APIs, supports `parallel\nexecution`_, and is well suited for processing RSS feeds [#]_. ``riko`` also supplies\na `command-line interface`_ for executing ``flows``, i.e., stream processors aka ``workflows``.\n\nWith ``riko``, you can\n\n- Read csv/xml/json/html files\n- Create text and data based ``flows`` via modular `pipes`_\n- Parse, extract, and process RSS/Atom feeds\n- Create awesome mashups [#]_, APIs, and maps\n- Perform `parallel processing`_ via cpus/processors or threads\n- and much more...\n\nNotes\n^^^^^\n\n.. 
[#] `Really Simple Syndication`_\n.. [#] `Mashup (web application hybrid)`_\n\nRequirements\n------------\n\n``riko`` has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.\n\nOptional Dependencies\n^^^^^^^^^^^^^^^^^^^^^\n\n========================  ===================  ===========================\nFeature                   Dependency           Installation\n========================  ===================  ===========================\nAsync API                 `Twisted`_           ``pip install riko[async]``\nAccelerated xml parsing   `lxml`_ [#]_         ``pip install riko[xml]``\nAccelerated feed parsing  `speedparser`_ [#]_  ``pip install riko[xml]``\n========================  ===================  ===========================\n\nNotes\n^^^^^\n\n.. [#] If ``lxml`` isn't present, ``riko`` will default to the builtin Python xml parser\n.. [#] If ``speedparser`` isn't present, ``riko`` will default to ``feedparser``\n\nWord Count\n----------\n\nIn this example, we use several `pipes`_ to count the words on a webpage.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e ### Create a SyncPipe flow ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # `SyncPipe` is a convenience class that creates chainable flows\n    \u003e\u003e\u003e # and allows for parallel processing.\n    \u003e\u003e\u003e from riko.collections import SyncPipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Set the pipe configurations ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # Notes:\n    \u003e\u003e\u003e #   1. the `detag` option will strip all html tags from the result\n    \u003e\u003e\u003e #   2. fetch the text contained inside the 'body' tag of the hackernews\n    \u003e\u003e\u003e #      homepage\n    \u003e\u003e\u003e #   3. replace newlines with spaces and assign the result to 'content'\n    \u003e\u003e\u003e #   4. tokenize the resulting text using whitespace as the delimiter\n    \u003e\u003e\u003e #   5. 
count the number of times each token appears\n    \u003e\u003e\u003e #   6. obtain the raw stream\n    \u003e\u003e\u003e #   7. extract the first word and its count\n    \u003e\u003e\u003e #   8. extract the second word and its count\n    \u003e\u003e\u003e #   9. extract the third word and its count\n    \u003e\u003e\u003e url = 'https://news.ycombinator.com/'\n    \u003e\u003e\u003e fetch_conf = {\n    ...     'url': url, 'start': '\u003cbody\u003e', 'end': '\u003c/body\u003e', 'detag': True}  # 1\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e replace_conf = {\n    ...     'rule': [\n    ...         {'find': '\\r\\n', 'replace': ' '},\n    ...         {'find': '\\n', 'replace': ' '}]}\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e flow = (\n    ...     SyncPipe('fetchpage', conf=fetch_conf)                           # 2\n    ...         .strreplace(conf=replace_conf, assign='content')             # 3\n    ...         .tokenizer(conf={'delimiter': ' '}, emit=True)               # 4\n    ...         .count(conf={'count_key': 'content'}))                       # 5\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e stream = flow.output                                                 # 6\n    \u003e\u003e\u003e next(stream)                                                         # 7\n    {\"'sad\": 1}\n    \u003e\u003e\u003e next(stream)                                                         # 8\n    {'(': 28}\n    \u003e\u003e\u003e next(stream)                                                         # 9\n    {'(1999)': 1}\n\nMotivation\n----------\n\nWhy I built riko\n^^^^^^^^^^^^^^^^\n\nYahoo! Pipes [#]_ was a user-friendly web application used to\n\n  aggregate, manipulate, and mashup content from around the web\n\nWanting to create custom pipes, I came across `pipe2py`_, which translated a\nYahoo! Pipe into python code. 
``pipe2py`` suited my needs at the time\nbut was unmaintained and lacked asynchronous or parallel processing.\n\n``riko`` addresses the shortcomings of ``pipe2py`` but drops support for\nimporting Yahoo! Pipes json workflows. ``riko`` contains ~ `40 built-in`_\nmodules, aka ``pipes``, that allow you to programmatically perform most of the\ntasks Yahoo! Pipes allowed.\n\nWhy you should use riko\n^^^^^^^^^^^^^^^^^^^^^^^\n\n``riko`` provides a number of benefits / differences from other stream processing\napplications such as Huginn, Flink, Spark, and Storm [#]_. Namely:\n\n- a small footprint (CPU and memory usage)\n- native RSS/Atom support\n- simple installation and usage\n- a pure python library with `pypy`_ support\n- builtin modular ``pipes`` to filter, sort, and modify ``streams``\n\nThe subsequent tradeoffs ``riko`` makes are:\n\n- not distributed (i.e., unable to run on a cluster of servers)\n- no GUI for creating ``flows``\n- doesn't continually monitor ``streams`` for new data\n- can't react to specific events\n- iterator (pull) based so streams only support a single consumer [#]_\n\nThe following table summarizes these observations:\n\n=======  ===========  =========  =====  ===========  =====  ========  ========  ===========\nlibrary  Stream Type  Footprint  RSS    simple [#]_  async  parallel  CEP [#]_  distributed\n=======  ===========  =========  =====  ===========  =====  ========  ========  ===========\nriko     pull         small      √      √            √      √\npipe2py  pull         small      √      √\nHuginn   push         med        √                   [#]_   √         √\nOthers   push         large      [#]_   [#]_         [#]_   √         √         √\n=======  ===========  =========  =====  ===========  =====  ========  ========  ===========\n\nFor more detailed information, please check out the `FAQ`_.\n\nNotes\n^^^^^\n\n.. [#] Yahoo discontinued Yahoo! Pipes in 2015, but you can view what `remains`_\n.. 
[#] `Huginn`_, `Flink`_, `Spark`_, and `Storm`_\n.. [#] You can mitigate this via the `split`_ module\n.. [#] Doesn't depend on outside services like MySQL, Kafka, YARN, ZooKeeper, or Mesos\n.. [#] `Complex Event Processing`_\n.. [#] Huginn doesn't appear to make `async web requests`_\n.. [#] Many libraries can't parse RSS streams without the use of 3rd party libraries\n.. [#] While most libraries offer a local mode, many require integrating with a data ingestor (e.g., Flume/Kafka) to do anything useful\n.. [#] I can't find evidence that these libraries offer async APIs (and apparently `Spark doesn't`_)\n\nUsage\n-----\n\n``riko`` is intended to be used directly as a Python library.\n\nUsage Index\n^^^^^^^^^^^\n\n- `Fetching feeds`_\n- `Synchronous processing`_\n- `Parallel processing`_\n- `Asynchronous processing`_\n- `Cookbook`_\n\nFetching feeds\n^^^^^^^^^^^^^^\n\n``riko`` can fetch rss feeds from both local and remote filepaths via \"source\"\n``pipes``. Each \"source\" ``pipe`` returns a ``stream``, i.e., an iterator of\ndictionaries, aka ``items``.\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e from riko.modules import fetch, fetchsitefeed\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Fetch an RSS feed ###\n    \u003e\u003e\u003e stream = fetch.pipe(conf={'url': 'https://news.ycombinator.com/rss'})\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Fetch the first RSS feed found ###\n    \u003e\u003e\u003e stream = fetchsitefeed.pipe(conf={'url': 'http://arstechnica.com/rss-feeds/'})\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### View the fetched RSS feed(s) ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # Note: regardless of how you fetch an RSS feed, it will have the same\n    \u003e\u003e\u003e # structure\n    \u003e\u003e\u003e item = next(stream)\n    \u003e\u003e\u003e item.keys()\n    dict_keys(['title_detail', 'author.uri', 'tags', 'summary_detail', 'author_detail',\n               'author.name', 'y:published', 'y:title', 'content', 'title', 'pubDate',\n               'guidislink', 'id', 'summary', 'dc:creator', 'authors', 'published_parsed',\n               'links', 'y:id', 'author', 'link', 'published'])\n\n    \u003e\u003e\u003e item['title'], item['author'], item['id']\n    ('Gravity doesn’t care about quantum spin',\n     'Chris Lee',\n     'http://arstechnica.com/?p=924009')\n\nPlease see the `FAQ`_ for a complete list of supported `file types`_ and\n`protocols`_. Please see `Fetching data and feeds`_ for more examples.\n\nSynchronous processing\n^^^^^^^^^^^^^^^^^^^^^^\n\n``riko`` can modify ``streams`` via the `40 built-in`_ ``pipes``\n\n.. 
code-block:: python\n\n    \u003e\u003e\u003e from riko.collections import SyncPipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Set the pipe configurations ###\n    \u003e\u003e\u003e fetch_conf = {'url': 'https://news.ycombinator.com/rss'}\n    \u003e\u003e\u003e filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}\n    \u003e\u003e\u003e xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'\n    \u003e\u003e\u003e xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Create a SyncPipe flow ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # `SyncPipe` is a convenience class that creates chainable flows\n    \u003e\u003e\u003e # and allows for parallel processing.\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # The following flow will:\n    \u003e\u003e\u003e #   1. fetch the hackernews RSS feed\n    \u003e\u003e\u003e #   2. filter for items with '.com' in the link\n    \u003e\u003e\u003e #   3. sort the items ascending by title\n    \u003e\u003e\u003e #   4. fetch the first comment from each item\n    \u003e\u003e\u003e #   5. flatten the result into one raw stream\n    \u003e\u003e\u003e #   6. extract the first item's content\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # Note: sorting is not lazy so take caution when using this pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e flow = (\n    ...     SyncPipe('fetch', conf=fetch_conf)               # 1\n    ...         .filter(conf={'rule': filter_rule})          # 2\n    ...         .sort(conf={'rule': {'sort_key': 'title'}})  # 3\n    ...         
.xpathfetchpage(conf=xpath_conf))            # 4\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e stream = flow.output                                 # 5\n    \u003e\u003e\u003e next(stream)['content']                              # 6\n    'Open Artificial Pancreas home:'\n\nPlease see `alternate workflow creation`_ for an alternative (function based) method for\ncreating a ``stream``. Please see `pipes`_ for a complete list of available ``pipes``.\n\nParallel processing\n^^^^^^^^^^^^^^^^^^^\n\nAn example using ``riko``'s parallel API to spawn a ``ThreadPool`` [#]_\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.collections import SyncPipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Set the pipe configurations ###\n    \u003e\u003e\u003e fetch_conf = {'url': 'https://news.ycombinator.com/rss'}\n    \u003e\u003e\u003e filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}\n    \u003e\u003e\u003e xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'\n    \u003e\u003e\u003e xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Create a parallel SyncPipe flow ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # The following flow will:\n    \u003e\u003e\u003e #   1. fetch the hackernews RSS feed\n    \u003e\u003e\u003e #   2. filter for items with '.com' in the article link\n    \u003e\u003e\u003e #   3. fetch the first comment from all items in parallel (using 4 workers)\n    \u003e\u003e\u003e #   4. flatten the result into one raw stream\n    \u003e\u003e\u003e #   5. extract the first item's content\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # Note: no point in sorting after the filter since parallel fetching doesn't guarantee\n    \u003e\u003e\u003e # order\n    \u003e\u003e\u003e flow = (\n    ...     SyncPipe('fetch', conf=fetch_conf, parallel=True, workers=4)  # 1\n    ...         
.filter(conf={'rule': filter_rule})                       # 2\n    ...         .xpathfetchpage(conf=xpath_conf))                         # 3\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e stream = flow.output                                              # 4\n    \u003e\u003e\u003e next(stream)['content']                                           # 5\n    'He uses the following example for when to throw your own errors:'\n\nAsynchronous processing\n^^^^^^^^^^^^^^^^^^^^^^^\n\nTo enable asynchronous processing, you must install the ``async`` module.\n\n.. code-block:: bash\n\n    pip install riko[async]\n\nAn example using ``riko``'s asynchronous API.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.bado import coroutine, react\n    \u003e\u003e\u003e from riko.collections import AsyncPipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Set the pipe configurations ###\n    \u003e\u003e\u003e fetch_conf = {'url': 'https://news.ycombinator.com/rss'}\n    \u003e\u003e\u003e filter_rule = {'field': 'link', 'op': 'contains', 'value': '.com'}\n    \u003e\u003e\u003e xpath = '/html/body/center/table/tr[3]/td/table[2]/tr[1]/td/table/tr/td[3]/span/span'\n    \u003e\u003e\u003e xpath_conf = {'url': {'subkey': 'comments'}, 'xpath': xpath}\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e ### Create an AsyncPipe flow ###\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # The following flow will:\n    \u003e\u003e\u003e #   1. fetch the hackernews RSS feed\n    \u003e\u003e\u003e #   2. filter for items with '.com' in the article link\n    \u003e\u003e\u003e #   3. asynchronously fetch the first comment from each item (using 4 connections)\n    \u003e\u003e\u003e #   4. flatten the result into one raw stream\n    \u003e\u003e\u003e #   5. 
extract the first item's content\n    \u003e\u003e\u003e #\n    \u003e\u003e\u003e # Note: no point in sorting after the filter since async fetching doesn't guarantee\n    \u003e\u003e\u003e # order\n    \u003e\u003e\u003e @coroutine\n    ... def run(reactor):\n    ...     stream = yield (\n    ...         AsyncPipe('fetch', conf=fetch_conf, connections=4)  # 1\n    ...             .filter(conf={'rule': filter_rule})             # 2\n    ...             .xpathfetchpage(conf=xpath_conf)                # 3\n    ...             .output)                                        # 4\n    ...\n    ...     print(next(stream)['content'])                          # 5\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e try:\n    ...     react(run)\n    ... except SystemExit:\n    ...     pass\n    Here's how iteration works ():\n\nCookbook\n^^^^^^^^\n\nPlease see the `cookbook`_ or `ipython notebook`_ for more examples.\n\nNotes\n^^^^^\n\n.. [#] You can instead enable a ``ProcessPool`` by additionally passing ``threads=False`` to ``SyncPipe``, i.e., ``SyncPipe('fetch', conf={'url': url}, parallel=True, threads=False)``.\n\nInstallation\n------------\n\n(You are using a `virtualenv`_, right?)\n\nAt the command line, install ``riko`` using either ``pip`` (*recommended*)\n\n.. code-block:: bash\n\n    pip install riko\n\nor ``easy_install``\n\n.. code-block:: bash\n\n    easy_install riko\n\nPlease see the `installation doc`_ for more details.\n\nDesign Principles\n-----------------\n\nThe primary data structures in ``riko`` are the ``item`` and ``stream``. An ``item``\nis just a python dictionary, and a ``stream`` is an iterator of ``items``. You can\ncreate a ``stream`` manually with something as simple as\n``[{'content': 'hello world'}]``. You manipulate ``streams`` in\n``riko`` via ``pipes``. A ``pipe`` is simply a function that accepts either a\n``stream`` or ``item``, and returns a ``stream``. 
``pipes`` are composable: you\ncan use the output of one ``pipe`` as the input to another ``pipe``.\n\n``riko`` ``pipes`` come in two flavors: ``operators`` and ``processors``.\n``operators`` operate on an entire ``stream`` at once and are unable to handle\nindividual items. Example ``operators`` include ``count``, ``pipefilter``,\nand ``reverse``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules.reverse import pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]\n    \u003e\u003e\u003e next(pipe(stream))\n    {'title': 'riko pt. 2'}\n\n``processors`` process individual ``items`` and can be parallelized across\nthreads or processes. Example ``processors`` include ``fetchsitefeed``,\n``hash``, ``pipeitembuilder``, and ``piperegex``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules.hash import pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e item = {'title': 'riko pt. 1'}\n    \u003e\u003e\u003e stream = pipe(item, field='title')\n    \u003e\u003e\u003e next(stream)\n    {'title': 'riko pt. 1', 'hash': 2853617420}\n\nSome ``processors``, e.g., ``pipetokenizer``, return multiple results.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules.tokenizer import pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e item = {'title': 'riko pt. 1'}\n    \u003e\u003e\u003e tokenizer_conf = {'delimiter': ' '}\n    \u003e\u003e\u003e stream = pipe(item, conf=tokenizer_conf, field='title')\n    \u003e\u003e\u003e next(stream)\n    {'tokenizer': [{'content': 'riko'},\n       {'content': 'pt.'},\n       {'content': '1'}],\n     'title': 'riko pt. 
1'}\n\n    \u003e\u003e\u003e # In this case, if we just want the result, we can `emit` it instead\n    \u003e\u003e\u003e stream = pipe(item, conf=tokenizer_conf, field='title', emit=True)\n    \u003e\u003e\u003e next(stream)\n    {'content': 'riko'}\n\n``operators`` are split into sub-types of ``aggregators``\nand ``composers``. ``aggregators``, e.g., ``count``, combine\nall ``items`` of an input ``stream`` into a new ``stream`` with a single ``item``;\nwhile ``composers``, e.g., ``filter``, create a new ``stream`` containing\nsome or all ``items`` of an input ``stream``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules.count import pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e stream = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]\n    \u003e\u003e\u003e next(pipe(stream))\n    {'count': 2}\n\nIn case you were confused by the \"Word Count\" example up top, ``count`` can return\nmultiple items if you pass in the ``count_key`` config option.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e counted = pipe(stream, conf={'count_key': 'title'})\n    \u003e\u003e\u003e next(counted)\n    {'riko pt. 1': 1}\n    \u003e\u003e\u003e next(counted)\n    {'riko pt. 2': 1}\n\n``processors`` are split into sub-types of ``source`` and ``transformer``.\n``sources``, e.g., ``itembuilder``, can create a ``stream``, while\n``transformers``, e.g., ``hash``, can only transform items in a ``stream``.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules.itembuilder import pipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e attrs = {'key': 'title', 'value': 'riko pt. 1'}\n    \u003e\u003e\u003e next(pipe(conf={'attrs': attrs}))\n    {'title': 'riko pt. 1'}\n\nThe following table summarizes these observations:\n\n+-----------+-------------+--------+-------------+-----------------+------------------+\n| type      | sub-type    | input  | output      | parallelizable? | creates streams? 
|\n+-----------+-------------+--------+-------------+-----------------+------------------+\n| operator  | aggregator  | stream | stream [#]_ |                 |                  |\n|           +-------------+--------+-------------+-----------------+------------------+\n|           | composer    | stream | stream      |                 |                  |\n+-----------+-------------+--------+-------------+-----------------+------------------+\n| processor | source      | item   | stream      | √               | √                |\n|           +-------------+--------+-------------+-----------------+------------------+\n|           | transformer | item   | stream      | √               |                  |\n+-----------+-------------+--------+-------------+-----------------+------------------+\n\nIf you are unsure of the type of ``pipe`` you have, check its metadata.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.modules import fetchpage, count\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e fetchpage.async_pipe.__dict__\n    {'type': 'processor', 'name': 'fetchpage', 'sub_type': 'source'}\n    \u003e\u003e\u003e count.pipe.__dict__\n    {'type': 'operator', 'name': 'count', 'sub_type': 'aggregator'}\n\nThe ``SyncPipe`` and ``AsyncPipe`` classes (among other things) perform this\ncheck for you to allow for convenient method chaining and transparent\nparallelization.\n\n.. code-block:: python\n\n    \u003e\u003e\u003e from riko.collections import SyncPipe\n    \u003e\u003e\u003e\n    \u003e\u003e\u003e attrs = [\n    ...     {'key': 'title', 'value': 'riko pt. 1'},\n    ...     {'key': 'content', 'value': \"Let's talk about riko!\"}]\n    \u003e\u003e\u003e flow = SyncPipe('itembuilder', conf={'attrs': attrs}).hash()\n    \u003e\u003e\u003e flow.list[0]\n    {'title': 'riko pt. 
1',\n     'content': \"Let's talk about riko!\",\n     'hash': 1346301218}\n\nPlease see the `cookbook`_ for advanced examples, including how to wire in\nvalues from other pipes or accept user input.\n\nNotes\n^^^^^\n\n.. [#] the output ``stream`` of an ``aggregator`` is an iterator of only 1 ``item``.\n\nCommand-line Interface\n----------------------\n\n``riko`` provides a command, ``runpipe``, to execute ``workflows``. A\n``workflow`` is simply a file containing a function named ``pipe`` that creates\na ``flow`` and processes the resulting ``stream``.\n\nCLI Usage\n^^^^^^^^^\n\n  usage: runpipe [pipeid]\n\n  description: Runs a riko pipe\n\n  positional arguments:\n    pipeid       The pipe to run (default: reads from stdin).\n\n  optional arguments:\n    -h, --help   show this help message and exit\n    -a, --async  Load async pipe.\n\n    -t, --test   Run in test mode (uses default inputs).\n\nCLI Setup\n^^^^^^^^^\n\n``flow.py``\n\n.. code-block:: python\n\n    from __future__ import print_function\n    from riko.collections import SyncPipe\n\n    conf1 = {'attrs': [{'value': 'https://google.com', 'key': 'content'}]}\n    conf2 = {'rule': [{'find': 'com', 'replace': 'co.uk'}]}\n\n    def pipe(test=False):\n        kwargs = {'conf': conf1, 'test': test}\n        flow = SyncPipe('itembuilder', **kwargs).strreplace(conf=conf2)\n        stream = flow.output\n\n        for i in stream:\n            print(i)\n\nCLI Examples\n^^^^^^^^^^^^\n\nNow to execute ``flow.py``, type the command ``runpipe flow``. You should\nthen see the following output in your terminal:\n\n.. code-block:: bash\n\n    https://google.co.uk\n\n``runpipe`` will also search the ``examples`` directory for ``workflows``. Type\n``runpipe demo`` and you should see the following output:\n\n.. code-block:: bash\n\n    Deadline to clear up health law eligibility near 682\n\nScripts\n-------\n\n``riko`` comes with a built-in task manager ``manage``.\n\nSetup\n^^^^^\n\n.. 
code-block:: bash\n\n    pip install riko[develop]\n\nExamples\n^^^^^^^^\n\n*Run python linter and nose tests*\n\n.. code-block:: bash\n\n    manage lint\n    manage test\n\nContributing\n------------\n\nPlease mimic the coding style/conventions used in this repo.\nIf you add new classes or functions, please add the appropriate doc blocks with\nexamples. Also, make sure the python linter and nose tests pass.\n\nPlease see the `contributing doc`_ for more details.\n\nCredits\n-------\n\nShoutout to `pipe2py`_ for heavily inspiring ``riko``. ``riko`` started out as a fork\nof ``pipe2py``, but has since diverged so much that little (if any) of the original\ncode-base remains.\n\nMore Info\n---------\n\n- `FAQ`_\n- `Cookbook`_\n- `iPython Notebook`_\n- `Step-by-Step Intro. Tutorial`_\n\nProject Structure\n-----------------\n\n.. code-block:: bash\n\n    ┌── benchmarks\n    │   ├── __init__.py\n    │   └── parallel.py\n    ├── bin\n    │   └── run\n    ├── data/*\n    ├── docs\n    │   ├── AUTHORS.rst\n    │   ├── CHANGES.rst\n    │   ├── COOKBOOK.rst\n    │   ├── FAQ.rst\n    │   ├── INSTALLATION.rst\n    │   └── TODO.rst\n    ├── examples/*\n    ├── helpers/*\n    ├── riko\n    │   ├── __init__.py\n    │   ├── lib\n    │   │   ├── __init__.py\n    │   │   ├── autorss.py\n    │   │   ├── collections.py\n    │   │   ├── dotdict.py\n    │   │   ├── log.py\n    │   │   ├── tags.py\n    │   │   └── py\n    │   ├── modules/*\n    │   └── twisted\n    │       ├── __init__.py\n    │       ├── collections.py\n    │       └── py\n    ├── tests\n    │   ├── __init__.py\n    │   ├── standard.rc\n    │   └── test_examples.py\n    ├── CONTRIBUTING.rst\n    ├── dev-requirements.txt\n    ├── LICENSE\n    ├── Makefile\n    ├── manage.py\n    ├── MANIFEST.in\n    ├── optional-requirements.txt\n    ├── py2-requirements.txt\n    ├── README.rst\n    ├── requirements.txt\n    ├── setup.cfg\n    ├── setup.py\n    └── tox.ini\n\nLicense\n-------\n\n``riko`` is distributed under the `MIT 
License`_.\n\n.. |travis| image:: https://img.shields.io/travis/nerevu/riko/master.svg\n    :target: https://app.travis-ci.com/nerevu/riko\n\n.. |versions| image:: https://img.shields.io/pypi/pyversions/riko.svg\n    :target: https://pypi.python.org/pypi/riko\n\n.. |pypi| image:: https://img.shields.io/pypi/v/riko.svg\n    :target: https://pypi.python.org/pypi/riko\n\n.. _synchronous: #synchronous-processing\n.. _asynchronous: #asynchronous-processing\n.. _parallel execution: #parallel-processing\n.. _parallel processing: #parallel-processing\n.. _library: #usage\n\n.. _contributing doc: https://github.com/nerevu/riko/blob/master/CONTRIBUTING.rst\n.. _FAQ: https://github.com/nerevu/riko/blob/master/docs/FAQ.rst\n.. _pipes: https://github.com/nerevu/riko/blob/master/docs/FAQ.rst#what-pipes-are-available\n.. _40 built-in: https://github.com/nerevu/riko/blob/master/docs/FAQ.rst#what-pipes-are-available\n.. _file types: https://github.com/nerevu/riko/blob/master/docs/FAQ.rst#what-file-types-are-supported\n.. _protocols: https://github.com/nerevu/riko/blob/master/docs/FAQ.rst#what-protocols-are-supported\n.. _installation doc: https://github.com/nerevu/riko/blob/master/docs/INSTALLATION.rst\n.. _Cookbook: https://github.com/nerevu/riko/blob/master/docs/COOKBOOK.rst\n.. _split: https://github.com/nerevu/riko/blob/master/riko/modules/split.py#L15-L18\n.. _alternate workflow creation: https://github.com/nerevu/riko/blob/master/docs/COOKBOOK.rst#alternate-workflow-creation\n.. _Fetching data and feeds: https://github.com/nerevu/riko/blob/master/docs/COOKBOOK.rst#fetching-data-and-feeds\n\n.. _pypy: http://pypy.org\n.. _Really Simple Syndication: https://en.wikipedia.org/wiki/RSS\n.. _Mashup (web application hybrid): https://en.wikipedia.org/wiki/Mashup_%28web_application_hybrid%29\n.. _pipe2py: https://github.com/ggaughan/pipe2py/\n.. _Huginn: https://github.com/cantino/huginn/\n.. _Flink: http://flink.apache.org/\n.. _Spark: http://spark.apache.org/streaming/\n.. 
_Storm: http://storm.apache.org/\n.. _Complex Event Processing: https://en.wikipedia.org/wiki/Complex_event_processing\n.. _async web requests: https://github.com/cantino/huginn/blob/bf7c2feba4a7f27f39de96877c121d40282c0af9/app/models/agents/rss_agent.rb#L101\n.. _Spark doesn't: https://github.com/perwendel/spark/issues/208\n.. _remains: https://web.archive.org/web/20150930021241/http://pipes.yahoo.com/pipes/\n.. _lxml: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser\n.. _Twisted: http://twistedmatrix.com/\n.. _speedparser: https://github.com/jmoiron/speedparser\n.. _MIT License: http://opensource.org/licenses/MIT\n.. _virtualenv: http://www.virtualenv.org/en/latest/index.html\n.. _iPython Notebook: http://nbviewer.jupyter.org/github/nerevu/riko/blob/master/examples/usage.ipynb\n.. _Step-by-Step Intro. Tutorial: http://nbviewer.jupyter.org/github/aemreunal/riko-tutorial/blob/master/Tutorial.ipynb\n","funding_links":[],"categories":["Python","data","Data pipelines and stream processing"],"sub_categories":["Libraries"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerevu%2Friko","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnerevu%2Friko","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerevu%2Friko/lists"}