{"id":38695618,"url":"https://github.com/borea17/efficient_rl","last_synced_at":"2026-01-17T10:39:47.322Z","repository":{"id":57425773,"uuid":"253471589","full_name":"borea17/efficient_rl","owner":"borea17","description":"Reimplementation of \"An Object-Oriented Representation for Efficient RL\"","archived":false,"fork":false,"pushed_at":"2024-09-12T20:39:16.000Z","size":273,"stargazers_count":16,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-12-26T22:51:09.430Z","etag":null,"topics":["door-max","efficient-rl","model-based-rl","reinforcement-learning","reproducible-research"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/borea17.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-04-06T11:01:50.000Z","updated_at":"2025-03-01T21:23:15.000Z","dependencies_parsed_at":"2023-12-15T22:00:05.085Z","dependency_job_id":"f67a23e9-b361-4584-9594-f37268f9e5ae","html_url":"https://github.com/borea17/efficient_rl","commit_stats":{"total_commits":91,"total_committers":3,"mean_commits":"30.333333333333332","dds":0.2857142857142857,"last_synced_commit":"7becd0d96753d5d243511260caf5f309e394c058"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/borea17/efficient_rl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borea17%2Fefficient_rl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borea17%2Fefficient_rl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/borea17%2Fefficient_rl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borea17%2Fefficient_rl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/borea17","download_url":"https://codeload.github.com/borea17/efficient_rl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/borea17%2Fefficient_rl/sbom","scorecard":{"id":248264,"data":{"date":"2025-08-11","repo":{"name":"github.com/borea17/efficient_rl","commit":"4387eb4355a32c840487dd1c88c374b8973c7687"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security 
file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-17T07:59:07.217Z","repository_id":57425773,"created_at":"2025-08-17T07:59:07.217Z","updated_at":"2025-08-17T07:59:07.217Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28506593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T10:25:30.148Z","status":"ssl_error","status_checked_at":"2026-01-17T10:25:29.718Z","response_time":85,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["door-max","efficient-rl","model-based-rl","reinforcement-learning","reproducible-research"],"created_at":"2026-01-17T10:39:47.231Z","updated_at":"2026-01-17T10:39:47.295Z","avatar_url":"https://github.com/borea17.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Efficient Reinforcement Learning\n[![PyPI version](https://badge.fury.io/py/efficient-rl.svg)](https://badge.fury.io/py/efficient-rl)\n\n**[Motivation](https://github.com/borea17/efficient_rl#motivation)** | **[Summary](https://github.com/borea17/efficient_rl#summary)** | **[Results](https://github.com/borea17/efficient_rl#results)** | **[How to use this 
repository](https://github.com/borea17/efficient_rl#how-to-use-this-repository)**\n\n-------------------------------------------------------------------------------------\n\nThis is a Python *reimplementation* for the Taxi domain of \n\n* [An Object-Oriented Representation for Efficient Reinforcement Learning](http://carlosdiuk.github.io/papers/Thesis.pdf) (C. Diuk's Dissertation)\n* [An Object-Oriented Representation for Efficient Reinforcement Learning](http://carlosdiuk.github.io/papers/OORL.pdf) (Paper by C. Diuk et al.)\n\n\u003ctable\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n  \u003ctd\u003eIn the \u003ci\u003eTaxi domain\u003c/i\u003e the goal is to navigate the \u003ci\u003etaxi\u003c/i\u003e (initially yellow box) towards\u003cbr\u003e\n    the \u003ci\u003epassenger\u003c/i\u003e (blue letter), take a \u003ci\u003ePickup\u003c/i\u003e action and then deliver the \u003ci\u003etaxi with\u003cbr\u003epassenger inside\u003c/i\u003e (green box) towards the \u003ci\u003edestination\u003c/i\u003e (magenta letter) and perform \u003cbr\u003e a \u003ci\u003eDropoff\u003c/i\u003e action. A reward of -1 is obtained for every time step it takes until delivery. \u003cbr\u003eSuccessful \u003ci\u003eDropoff\u003c/i\u003e results in +20 reward, while non-successful \u003ci\u003eDropoff\u003c/i\u003e or \u003ci\u003ePickup\u003c/i\u003e is\u003cbr\u003e penalized with -10. 
\n    This task was introduced by \u003ca href=\"https://arxiv.org/abs/cs/9905014\"\u003eDietterich\u003c/a\u003e.\n \u003c/td\u003e\n  \u003ctd\u003e\u003cimg src='gifs/example.gif' width='120' height='185.25'\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\n-------------------------------------------------------------------------------------\n### Motivation\n\nIt is a well-known empirical fact in reinforcement learning that\nmodel-based approaches (e.g., \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e) are more sample-efficient than model-free\nalgorithms (e.g., \u003ci\u003eQ-learning\u003c/i\u003e). One of the main reasons may be\nthat model-based learning tackles the exploration-exploitation dilemma\nin a smarter way by using the accumulated experience to build an\napproximate model of the environment. Furthermore, it has been \nshown that rich state representations such as in a *factored MDP* can\nmake model-based learning even more sample-efficient. *Factored MDP*s\nenable an effective parametrization of transition and reward dynamics\nby using *dynamic Bayesian networks* (DBNs) to represent partial\ndependency relations between state variables, so that the environment dynamics\ncan be learned with fewer samples. A major downside of these approaches\nis that the DBNs need to be provided as prior knowledge, which is not\nalways possible.\n\nMotivated by human intelligence, Diuk et al. introduce a new\nframework, *propositional object-oriented MDPs* (OO-MDPs), to model\nenvironments and their dynamics. As it turns out, humans are far more\nsample-efficient than state-of-the-art algorithms when playing games \nsuch as Taxi (Diuk actually performed an experiment). 
Diuk argues that \nhumans must use some prior knowledge when playing this game; he\nfurther speculates that this knowledge might come in the form of object\nrepresentations, e.g., identifying horizontal lines as *walls* when \nobserving that the taxi cannot move through them. \n\nDiuk et al. provide a learning algorithm for deterministic\nOO-MDPs (\u003ci\u003eDOOR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e) which\noutperforms *factored* \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e.\nAs prior knowledge \u003ci\u003eDOOR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\nneeds the objects and relations to consider, which seems more natural\nas these may also be reused across different games. Furthermore,\nthis approach may also be used to inherit human biases. \n\n-------------------------------------------------------------------------------------\n\n### Summary\n\nThis section gives an overview of the reimplemented\nalgorithms. These can be divided into *model-free* and *model-based* approaches.\n\n#### Model-free Approaches\n\nIn model-free algorithms the agent learns the optimal action-value\nfunction (or value function or policy) directly from experience\nwithout having an actual model of the environment. Probably the most\nfamous model-free algorithm is *Q-learning*, which also forms the\nbasis for the (perhaps even more famous) [DQN paper](https://arxiv.org/abs/1312.5602).\n\n##### Q-learning\n\nQ-learning aims to approximate the optimal action-value function \nfrom which the optimal policy can be inferred. In the simplest case, \na table (*Q-table*) is used as a function approximator. \n\nThe basic idea is to start with a random action-value function and\nthen iteratively update this function towards the optimal action-value\nfunction. 
The update is applied after each action *a*, using the observed\nreward *r* and new state *s\u003csup\u003e'\u003c/sup\u003e*. The update rule is\nsimple and is derived from the Bellman optimality equation:\n\n$$\n\\displaystyle Q(s,a)\\leftarrow (1-\\alpha) Q(s,a) + \\alpha\\left\\[r + \\gamma \\max_{a^{'}} Q(s^{'}, a^{'})\\right\\]\n$$\n\n  \nwhere \u0026alpha; is the learning rate and \u0026gamma; is the discount factor. To allow for exploration,\nQ-learning commonly uses *\u0026epsi;-greedy exploration* or the *Greedy\nin the Limit with Infinite Exploration* approach (see [David Silver,\np.13\nff](https://www.davidsilver.uk/wp-content/uploads/2020/03/control.pdf)).\n\nDiuk uses two variants of Q-learning:\n* **Q-learning**: standard Q-learning approach with \u0026epsi;-greedy\n  exploration where parameters \u0026alpha;=0.1 and \u0026epsi;=0.6 have been\n  found via parameter search.\n* **Q-learning with optimistic initialization**: instead of some\n  random initialization of the Q-table a smart initialization to an\n  optimistic value (maximum possible value of any state-action pair \n  \u003cimg\n  src=\"https://render.githubusercontent.com/render/math?math=v_{max}\"\u003e)\n  is used. Thereby unvisited state-action pairs become more likely to be\n  visited. Here, \u0026alpha; was set to 1 (deterministic environment) and\n  \u0026epsi; to 0 (exploration ensured via initialization).\n\n#### Model-based Approaches \n\nIn model-based approaches the agent learns a model of the environment\nby accumulating experience. Then, an optimal action-value\nfunction (or value function or policy) is obtained through *planning*. Planning\ncan be done exactly or approximately. In the experiments, Diuk et al.\nuse exact planning, more precisely *value iteration*. 
The difference\nbetween the following three algorithms lies in the way they learn the\nenvironment dynamics.\n\n##### R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\n\nR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e is a provably efficient\nalgorithm that tackles the exploration-exploitation\ndilemma through an intuitive approach: R\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e divides state-action pairs into *known*\n(state-action pairs which have been visited often enough to build an\naccurate transition/reward function) and *unknown*. Whenever a state-action pair\nis *known*, the algorithm uses the empirical transition and reward\nfunction for planning. In case a state-action pair is *unknown*,  R\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e assumes a transition to a fictitious state\nfrom which maximum reward can be obtained consistently (hence the name) and it uses\nthat for planning. Therefore, actions which have not been tried out\n(often enough) in the current state will be preferred unless the\n*known* action also leads to maximal return. The parameter *M* defines the number of\nobservations the agent has to see until it considers a\ntransition/reward to be known; in a deterministic case such as the\nTaxi domain, it can be set to 1. 
R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\nis guaranteed to find a near-optimal action-value function in\npolynomial time.\n\n###### Learning transition and reward dynamics\n\nThe 5x5 Taxi domain has 500 different states:\n - 5 *x* positions for taxi\n - 5 *y* positions for taxi\n - 5 passenger locations (4 designated locations plus *in-taxi*)\n - 4 destinations\n\nIn the standard R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e approach without\nany domain knowledge (except for the maximum possible reward\n*R*\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e, the\nnumber of states *|S|*, the number of actions *|A|*), the states are\nsimply enumerated and the agent will\nnot be able to transfer knowledge throughout the domain. E.g.,\nassume the agent performs an action *North* at some location on the\ngrid and learns the state transition (more precisely it would learn\nsomething like *picking action 1 at state 200 results in ending up in\nstate 220*). Being at the same location but with a different *passenger location* or *destination\nlocation*, the agent will not be able to predict the outcome of action\n*North*. It will take the agent at least 3000 (*|S| \u0026middot; |A|* = 500 \u0026middot; 6)\nsteps until it has fully learned the 5x5 Taxi transition dynamics.\nFurthermore, the learned transition and reward dynamics are rather\ndifficult to interpret.\nTo address this shortcoming, the agent needs a different\nrepresentation and some prior knowledge.\n\n##### Factored R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\n\nFactored R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e is an R\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e adaptation that builds on a *factored MDP*\nenvironment representation. 
In a *factored MDP* a state is represented\nas a tuple (hence *factored state*), e.g., in the Taxi domain the state\ncan be represented as the 4-tuple \n\u003e [taxi x location, taxi y location, passenger location, passenger destination]\n\n(Note that *passenger location* actually enumerates the different *(x,\ny)* start passenger locations plus whether the passenger is *in\ntaxi*.) This representation allows partial dependency\nrelations between variables in the environment dynamics to be expressed using\n*Dynamic Bayesian Networks (DBNs)*. E.g., for action *North* we know\nthat each state variable at time *t+1* only depends on its own value at time *t*, i.e., the *x\nlocation* at time *t+1* under action *North* is independent of the *y\nlocation*, *passenger location* and *passenger destination* at time\n*t*. This knowledge is encoded in a *DBN* (each action may have a\ndifferent DBN) and it makes learning with Factored R\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e much more sample-efficient. \nThe downside of this approach is that this kind of prior knowledge may not\nbe available and that it lacks some generalization, e.g., although\nFactored R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e knows that the *x location* is independent of all other state\nvariables, Factored R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e still needs\nto perform action *North* at each *x location* to learn the outcome.\n\n\n##### DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\n\nDOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e is an R\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e adaptation that builds on a *deterministic (propositional)\nobject-oriented MDP (OO MDP)* environment representation. 
This representation\nis based on objects and their interactions; a state is represented as \nthe union of all (object) attribute values. Additionally, each state\nhas an associated Boolean vector describing which *relations* are\nenabled and which are not in that state. During a transition each\nattribute of the state may be subject to some kind of *effect*, which results in\nan attribute change. There are some limitations to the *effects* that can\noccur which are well explained in Diuk's dissertation. The basic idea\nof DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e is to recover the\ndeterministic OO MDP using *condition-effect learners* (in these\nlearners *conditions* are basically the relations that need to hold in\norder for an effect to occur).\nThe paper results show that in DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\nknowledge transfers much better throughout the domain than in the\nother algorithms, indicating that DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\noffers better generalization. Another feature is that the learned transition\ndynamics are easy to interpret, e.g., DOOR\u003csub\u003e\u003cfont\nsize=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e will learn that action *North* has the\neffect of incrementing *taxi.y* by 1 when the relation\n*touch_north(taxi, wall)* outputs *False* and there won't be any change\nin *taxi.y* if *touch_north(taxi, wall)* outputs *True*.\n\n-------------------------------------------------------------------------------------\n\n### Results\n\n#### Experimental Setup \n\nThe experimental setup is described on p.31 of Diuk's Dissertation or\np.7 of the paper. It consists of testing against six probe states and reporting the number\nof steps the agent had to take until the optimal policy for these 6\nstart states was reached. 
Since there is some randomness in the\ntrials, each algorithm runs 100 times and the results are then averaged. \n\n#### Differences between Reimplementation and Diuk\n\nThere are some differences between this reimplementation and Diuk's\napproach which are listed below:\n\n1) For educational purposes, the reward function is also learned in\n   this reimplementation (always in the simplest possible way). Note that Diuk mainly focused on learning the\n   transition model:\n   \u003e I will focus on learning dynamics and assume the reward function is available as a black box function.\n2) It is unknown whether in Diuk's setting during training the passenger start location\n   and destination could be the same. The original definition by\n   [Dietterich](https://arxiv.org/abs/cs/9905014) states:\n   \n   \u003e To keep things uniform, the taxi must pick up and drop off the passenger even if he/she is already at the destination.  \n   \n   Therefore, in this reimplementation this was also possible during\n   training. While the results for R\u003csub\u003e\u003cfont\n   size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e and its adaptations indicate that Diuk used\n   the same setting, there is a discrepancy for Q-learning. When the\n   setting was changed such that passenger start and destination could\n   not be the same (these are the results in brackets), similar results to Diuk could be obtained.\n3) Some implementation details are different, such as the update procedure\n   of the empirical transition and reward functions or the\n   condition-effect learners, which were not well enough documented or\n   which did not fit into the reimplementation structure.\n\n#### Dissertation Results (p.49)\n\nThe dissertation results align with the reimplementation results. 
Clearly, DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e outperforms the other algorithms in terms of sample-efficiency.\n\nFor the differences in *Q-Learning* and the values in brackets, refer to\n2) of [Differences between Reimplementation and Diuk](https://github.com/borea17/efficient_rl/#differences-between-reimplementation-and-diuk).\n\nThe results were obtained on a lenovo thinkpad yoga 260 (i7-6500 CPU\n@ 2.50 GHz x 4).\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eDomain knowledge\u003c/th\u003e\n    \u003cth\u003eAlgorithm\u003c/th\u003e\n    \u003cth colspan=\"2\"\u003eDiuk's Results\u003cbr\u003e\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eReimplementation Results\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cu\u003e\u003ci\u003e# Steps\u003c/i\u003e\u003c/u\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cu\u003e\u003ci\u003eTime/step\u003c/i\u003e\u003c/u\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cu\u003e\u003ci\u003e# Steps\u003c/i\u003e\u003c/u\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cu\u003e\u003ci\u003eTime/step\u003c/i\u003e\u003c/u\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cu\u003e\u003ci\u003eTotal Time\u003c/i\u003e\u003c/u\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e|\u003ci\u003eS\u003c/i\u003e|, |\u003ci\u003eA\u003c/i\u003e|\u003cbr\u003e\u003c/td\u003e\n    \u003ctd\u003eQ-learning\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e106859\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026lt;1ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e120148\u003c/b\u003e\u003cbr\u003e(119941)\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026lt;1ms\u003cbr\u003e(\u0026lt;1ms)\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e4.9s\u003cbr\u003e(4.9s)\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n 
   \u003ctd\u003e|\u003ci\u003eS\u003c/i\u003e|, |\u003ci\u003eA\u003c/i\u003e|, \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd\u003eQ-learning - optimistic \u003cbr\u003einitialization\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e29350\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026lt;1ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e75289\u003c/b\u003e\u003cbr\u003e(28989)\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026lt;1ms\u003cbr\u003e(\u0026lt;1ms)\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e4.2s\u003cbr\u003e(1.7s)\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e|\u003ci\u003eS\u003c/i\u003e|, |\u003ci\u003eA\u003c/i\u003e|, \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e4151\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e74ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e4080\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e2.9ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e11.8s\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e, DBN structure\u003c/td\u003e\n    \u003ctd\u003eFactored \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e1676\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e97.7ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e1686\u003c/b\u003e\u003c/td\u003e\n  
  \u003ctd align=\"center\"\u003e30.5ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e51.4s\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eObjects, relations to consider,\u003cbr\u003e\u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd\u003eDOO\u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e529\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e48.2ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e498\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e36ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e17.9s\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e|\u003ci\u003eA\u003c/i\u003e|, visualization of game\u003c/td\u003e\n    \u003ctd\u003eHumans (non-\u003cbr\u003evideogamers)\u003cbr\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e101\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e|\u003ci\u003eA\u003c/i\u003e|, visualization of game\u003c/td\u003e\n    \u003ctd\u003eHumans (videogamers)\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cb\u003e48.8\u003c/b\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n    \u003ctd align=\"center\"\u003eNA\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n#### Paper Results (p.7)\n\nThe paper results align with the reimplementation results. 
These results show that DOOR\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e not only outperforms Factored R\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e in terms of sample efficiency, but also scales much better to larger problems. Note that the number of states increases by a factor of more than 14. \n\nThe results were obtained on a cluster whose CPU specifications I do not know (this is not critical, since the focus is on the comparison). Note that Diuk et al. used a more powerful machine for the paper results: the average step times are notably smaller than in the dissertation results. \n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003e\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eDiuk's Results\u003c/th\u003e\n    \u003cth colspan=\"3\"\u003eReimplementation Results\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eTaxi 5x5 \u003c/i\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eTaxi 10x10\u003c/i\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eRatio\u003c/i\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eTaxi 5x5\u003c/i\u003e\u003cbr\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eTaxi 10x10\u003c/i\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003ci\u003eRatio\u003c/i\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eNumber of states\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e500\u003c/td\u003e\n    \u003ctd align=\"center\" \u003e7200\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e14.40\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e500\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e7200\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e14.40\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eFactored \u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003cbr\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;# steps\u003cbr\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;Time per step\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e1676\u003cbr\u003e43.59ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e19866\u003cbr\u003e306.71ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e11.85\u003cbr\u003e7.03\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e1687\u003cbr\u003e24.61ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e21868\u003cbr\u003e1.02s\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cbr\u003e12.96\u003cbr\u003e41.45\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eDOO\u003ci\u003eR\u003c/i\u003e\u003csub\u003e\u003cfont size=\"4\"\u003emax\u003c/font\u003e\u003c/sub\u003e\u003cbr\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;# steps\u003cbr\u003e\n      \u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;Time per step\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e529\u003cbr\u003e13.88ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e821\u003cbr\u003e293.72ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e\u003cb\u003e1.55\u003c/b\u003e\u003cbr\u003e21.16\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e502\u003cbr\u003e22.12ms\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u0026nbsp;\u003cbr\u003e1086\u003cbr\u003e1.22s\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cbr\u003e\u003cb\u003e2.16\u003c/b\u003e\u003cbr\u003e55.15\u003c/td\u003e\n  
\u003c/tr\u003e\n\u003c/table\u003e\n\n\n\n\n### How to use this repository\n\n#### Installation\n\n##### Building from Source\n\n```bash\ngit clone --depth 1 https://github.com/borea17/efficient_rl/\ncd efficient_rl\npython setup.py install\n```\n\n##### Via Pip\n\n```bash\npip install efficient_rl\n```\n\n#### Reproduce results\n\nAfter successful installation, download `dissertation_script.py` and `paper_script.py` (both are in the [efficient_rl](https://github.com/borea17/efficient_rl/tree/master/efficient_rl) folder), then run\n\n```bash\npython dissertation_script.py\npython paper_script.py\n```\n\nBy default, each agent runs only once. To increase the number of repetitions, change `n_repetitions` in the scripts. \n\nWARNING: It is not recommended to run `paper_script.py` on a standard computer, as it may take\nseveral hours.\n\n#### Contributions\n\nIf you want to use this repository for a different environment, you\nmay want to have a look at the `efficient_rl/environment` folder. There is\na self-written environment called `TaxiEnvironmentClass.py` and there\nare extensions to the `gym` Taxi environment in the corresponding folders. \n\nContributions are welcome, and if needed, I will provide more detailed documentation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborea17%2Fefficient_rl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fborea17%2Fefficient_rl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fborea17%2Fefficient_rl/lists"}