{"id":20103484,"url":"https://github.com/pythonlessons/cartpole_reinforcement_learning","last_synced_at":"2026-05-28T23:31:27.089Z","repository":{"id":105023692,"uuid":"209738476","full_name":"pythonlessons/CartPole_reinforcement_learning","owner":"pythonlessons","description":"Basics of reinforcement learning","archived":false,"fork":false,"pushed_at":"2019-11-21T20:51:07.000Z","size":1904,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-02T17:30:24.423Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pythonlessons.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-20T08:06:08.000Z","updated_at":"2023-06-07T17:40:27.000Z","dependencies_parsed_at":null,"dependency_job_id":"a74bf2ff-70ae-4d26-94be-379329581376","html_url":"https://github.com/pythonlessons/CartPole_reinforcement_learning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pythonlessons/CartPole_reinforcement_learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pythonlessons%2FCartPole_reinforcement_learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pythonlessons%2FCartPole_reinforcement_learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pythonlessons%2FCartPole_reinforcement_learning/releases","manifests_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/pythonlessons%2FCartPole_reinforcement_learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pythonlessons","download_url":"https://codeload.github.com/pythonlessons/CartPole_reinforcement_learning/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pythonlessons%2FCartPole_reinforcement_learning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33630999,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T17:36:09.666Z","updated_at":"2026-05-28T23:31:27.070Z","avatar_url":"https://github.com/pythonlessons.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Solving the CartPole balancing game\n\nThe idea of CartPole is that there is a pole standing up on top of a cart. 
The goal is to keep this pole balanced upright by moving the cart from side to side.\n\nAn episode counts as a success if we can balance the pole for 500 frames. It fails when the pole tilts more than 12 degrees from vertical or the cart moves more than 2.4 units from the center.\n\nFor every frame in which the pole stays \"balanced\" (less than 12 degrees from vertical), our \"score\" gets +1, and our target is a score of 500.\n\nNow, how do we do this? There are endless ways, some very complex, and some very specific. I chose to demonstrate how deep reinforcement learning (deep Q-learning) can be implemented and applied to play the CartPole game using Keras and Gym. I will try to explain everything without requiring any prior knowledge of reinforcement learning.\n\nBefore starting, take a look at this [YouTube video](https://youtu.be/XiigTGKZfks) with a real-life demonstration of the learning process on a cartpole problem. Looks amazing, right? Implementing such a self-learning system is easier than you may think. Let’s dive in!\n\n\n# Reinforcement Learning\nIn order to achieve the desired behavior of an agent that learns from its mistakes and improves its performance, we need to get more familiar with the concept of \u003cb\u003eReinforcement Learning (RL)\u003c/b\u003e.\n\nRL is a type of machine learning that allows us to create AI agents that learn from the environment by interacting with it in order to maximize their cumulative reward. In the same way that we learn to ride a bicycle, an AI agent learns by trial and error: agents in RL algorithms are incentivized with punishments for bad actions and rewards for good ones.\n\nAfter each action, the agent receives feedback. The feedback consists of the reward and the next state of the environment. The reward is usually defined by a human. 
If we use the analogy of the bicycle, we can define the reward as the distance from the original starting point.\n\n\n# Cartpole Game\nCartPole is one of the simplest environments in OpenAI Gym (a collection of environments for developing and testing RL algorithms). CartPole is modeled as a Markov decision process, illustrated below.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/pythonlessons/CartPole_reinforcement_learning/blob/master/IMAGES/image.png\"\n\u003c/p\u003e\n  \nIn each iteration, the agent takes the current state (S_t), picks the best action (A_t) according to the model's prediction, and executes it in the environment. The environment then returns a reward (R_t+1) for that action, a new state (S_t+1), and a flag indicating whether the new state is terminal. The process repeats until termination.\n\nThe goal of CartPole is to balance a pole connected by one joint to the top of a moving cart. To make it simpler for us, the state is not given as pixels but as 4 values, such as the angle of the pole and the position of the cart. The agent moves the cart by choosing action 0 or 1 at each step, pushing it left or right.\n\nGym makes interacting with the game environment really simple:\n```\nnext_state, reward, done, info = env.step(action)\n```\n\nHere, ```action``` can be either 0 or 1. When we pass one of those numbers, ```env```, which represents the game environment, returns the results. ```done``` is a boolean value telling whether the game ended or not. 
```next_state``` holds the four state values, which have the following ranges:\u003cbr\u003e\n(\u003cbr\u003e\n[Cart Position from -4.8 to 4.8],\u003cbr\u003e\n[Cart Velocity from -Inf to Inf],\u003cbr\u003e\n[Pole Angle from -24° to 24°],\u003cbr\u003e\n[Pole Velocity At Tip from -Inf to Inf]\u003cbr\u003e\n)\n\nThe old state paired with ```action```, ```next_state``` and ```reward``` is the information we need for training the agent.\n\nTo understand everything from the basics, let's first create a CartPole environment that our Python script plays randomly:\n\n```\nimport gym\nimport random\n\nenv = gym.make(\"CartPole-v1\")\nenv.reset()\n\ndef Random_games():\n    # Each episode is its own game.\n    for episode in range(10):\n        env.reset()\n        # This is each frame, up to 500, but we won't make it that far with random actions.\n        for t in range(500):\n            # This will display the environment.\n            # Only display it if you really want to see it,\n            # because rendering takes much longer.\n            env.render()\n\n            # This samples a random action from the environment's action space.\n            # In this environment, the action can be 0 or 1, which is left or right.\n            action = env.action_space.sample()\n\n            # This executes the action in the environment\n            # and returns the observation of the environment,\n            # the reward, whether the episode is over, and other info.\n            next_state, reward, done, info = env.step(action)\n\n            # Let's print everything in one line:\n            print(t, next_state, reward, done, info, action)\n            if done:\n                break\n\nRandom_games()\n```\n\n# Learn with Simple Neural Network using Keras\nThis tutorial is not about deep learning or neural networks, so I will not explain how they work in detail. I'll treat a neural network as a black-box algorithm that approximately maps inputs to outputs. 
Basically, an NN learns from pairs of example inputs and outputs, detects patterns in them, and predicts the output for unseen input data.\n\nNeural networks are not the focus of this tutorial, but we should understand how one is used for learning in the deep Q-learning algorithm.\n\nKeras makes it really simple to implement a basic neural network. With the code below we will create an untrained NN model. ```activation```, ```loss``` and ```optimizer``` are parameters that define the characteristics of the neural network, but we are not going to discuss them here.\n\n```\nfrom keras.models import Model\nfrom keras.layers import Input, Dense, Dropout\nfrom keras.optimizers import Adam\n\n# Neural Network model for Deep Q Learning\ndef OurModel(input_shape, action_space):\n    X_input = Input(input_shape)\n    X = X_input\n\n    # 'Dense' is the basic form of a neural network layer\n    # Input Layer of state size(4) and Hidden Layer with 512 nodes\n    X = Dense(512, input_shape=input_shape, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n    \n    # Hidden layer with 256 nodes\n    X = Dense(256, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n    \n    # Hidden layer with 64 nodes\n    X = Dense(64, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n    \n    # Output Layer with # of actions: 2 nodes (left, right)\n    X = Dense(action_space, activation=\"linear\")(X)\n\n    model = Model(inputs = X_input, outputs = X, name='CartPole model')\n    model.compile(loss='mse', optimizer=Adam())\n    \n    return model\n```\nFor an NN to understand the environment data and make predictions from it, we have to initialize the model (shown in the full code) and feed it the information. Later, in the full code, you will see that the ```fit()``` method feeds input and output pairs to the model. The model then trains on that data to approximate the output from the input.\n\nIn the model above, I used a neural network with 3 hidden layers of 512, 256 and 64 neurons. 
With every layer I added a dropout layer. Later, when we train the model, you will see that the DQN performs worse during training than in test mode; this is because of the dropout layers. But our goal is to get the best possible model in test mode, so everything is fine! Feel free to play with its structure and parameters.\n\nLater, in the training process, you will see what makes the NN predict the reward value from a given state. In the code this is a call of the form ```model.fit(state, target)```, the same as with a standard Keras NN model.\n\nAfter training, the model will be able to predict the output from unseen input. When we call the ```predict()``` function on the model, it predicts the reward for the given state based on the data it was trained on. Like so: ```prediction = model.predict(next_state)```\n\n\n# Implementing Deep Q Network (DQN)\nNormally in games, the reward directly relates to the score of the game. But imagine a situation where the pole in the CartPole game is tilted to the left. The expected future reward of pushing the left button will then be higher than that of pushing the right button, since it could yield a higher score as the pole survives longer.\n\nIn order to represent this intuition logically and train on it, we need to express it as a formula that we can optimize. The loss is just a value that indicates how far our prediction is from the actual target. For example, the model's prediction could indicate that it sees more value in pushing the left button when in fact it can gain more reward by pushing the right button. We want to decrease this gap between the prediction and the target (the loss). So, we will define our loss function as follows:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/pythonlessons/CartPole_reinforcement_learning/blob/master/IMAGES/math.PNG\"\n\u003c/p\u003e\n    \nWe first carry out an action a and observe the reward r and the resulting new state s'. 
Based on the result, we calculate the maximum target Q and then discount it so that the future reward is worth less than the immediate reward. Lastly, we add the current reward to the discounted future reward to get the target value. Subtracting our current prediction from the target gives the error, and squaring it gives the loss. Squaring punishes large errors more and treats negative values the same as positive ones.\n\nThis is not as difficult as it may look, because Keras takes care of most of the difficult tasks for us. We just need to define our target. We can express the target in one magical line of Python code:\n```target = reward + gamma * np.max(model.predict(next_state))```\n\nKeras does all the work of subtracting the NN output from the target and squaring the result. It also applies the learning rate, which we can define when creating the neural network model (otherwise the optimizer's default is used). This all happens inside the ```fit()``` function, which decreases the gap between our prediction and the target at a pace set by the learning rate. The approximation of the Q-value converges to the true Q-value as we repeat the update process. The loss will decrease, and the score will grow higher.\n\nThe most notable features of the DQN algorithm are the remember and replay methods. Both are simple concepts. The original DQN architecture contains several more tweaks for better training, but we are going to stick to a simpler version for better understanding.\n\n\n# Implementing Remember function\n\nOne characteristic of DQN is that the neural network used in the algorithm tends to forget previous experiences as it overwrites them with new ones. Experience replay is a biologically inspired process that uniformly samples experiences from the memory (to reduce correlation between subsequent actions) and updates the Q value for each sampled entry. So, we need a memory (list) of previous experiences and observations to re-train the model with. 
We will call this array of experiences memory and use the ```remember()``` function to append the state, action, reward, and next state to it.\n\nIn our example, the memory list will have the form:\n```\nmemory = [(state, action, reward, next_state, done)...]\n```\n\nThe remember function simply stores states, actions and resulting rewards in the memory:\n```\ndef remember(self, state, action, reward, next_state, done):\n    self.memory.append((state, action, reward, next_state, done))\n```\n\n```done``` is just a Boolean that indicates if the state is the final state (cartpole failed).\n\n\n# Implementing Replay function\nThe method that trains the NN with experiences from the memory is called ```replay()```. First, we sample some experiences from the memory and call them a minibatch. ```minibatch = random.sample(memory, min(len(memory), batch_size))```\n\nThe above code creates a minibatch, which is just ```batch_size``` randomly sampled elements from the full memory. The full code below sets the batch size to 128. If the memory holds fewer than 128 entries, we take everything that is in it.\n\nTo make the agent perform well in the long term, we need to consider not only the immediate rewards but also the future rewards we are going to get. To do this, we use a ```discount rate``` or ```gamma```: the future reward is multiplied by it and then added to the current reward. This way the agent learns to maximize the discounted future reward based on the given state. In other words, we are updating our Q value with the cumulative discounted future rewards.\n\nFor those of you who wonder how such a function can possibly converge, as it looks like it is trying to predict its own output (in some sense it is), don’t worry - it’s possible and in our simple case it does. However, convergence is not always that 'easy', and problems more complex than CartPole need more advanced techniques to stabilize training. 
These techniques include, for example, Double DQN and Dueling DQN, but that’s a topic for another article (stay tuned).\n\n```\ndef replay(self):\n    x_batch, y_batch = [], []\n    # Randomly sample minibatch from the memory\n    minibatch = random.sample(self.memory, min(len(self.memory), self.batch_size))\n    # Extract the information from each stored experience\n    for state, action, reward, next_state, done in minibatch:\n        # Make the agent approximately map the current state to the future discounted reward\n        # We'll call that y_target\n        y_target = self.model.predict(state)\n        # If done, the target is just the reward\n        if done:\n            y_target[0][action] = reward\n        else:\n            # Predict the future discounted reward\n            y_target[0][action] = reward + self.gamma * np.max(self.model.predict(next_state)[0])\n        # Append the results to the lists that will be used for training\n        x_batch.append(state[0])\n        y_batch.append(y_target[0])\n        \n    # Train the Neural Network with batches\n    self.model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)\n    if self.epsilon \u003e self.epsilon_min:\n        self.epsilon *= self.epsilon_decay\n```\n\n# Setting Hyper Parameters\nThere are some parameters that have to be passed to a reinforcement learning agent. 
You will see similar parameters in all DQN models:\n\n* EPISODES - the number of games we want the agent to play.\n* gamma - the decay or discount rate, used to calculate the future discounted reward.\n* epsilon - the exploration rate: the probability with which the agent picks a random action instead of the predicted one.\n* epsilon_decay - the factor by which we decrease exploration as the agent gets better at playing.\n* epsilon_min - the minimum exploration rate; we want the agent to explore at least this much.\n* learning_rate - determines how much the neural net learns in each iteration (if used).\n* batch_size - determines how many experiences from memory the DQN uses in each training step.\n\n# Putting It All Together: Coding The Deep Q-Learning Agent\nI tried to explain each part of the agent above. In the code below I'll implement everything we’ve talked about as a nice and clean class called DQNAgent.\n\n```\nimport random\nimport gym\nimport numpy as np\nfrom collections import deque\nfrom keras.models import Model, load_model\nfrom keras.layers import Input, Dense, Dropout\nfrom keras.optimizers import Adam\n\n\n# Neural Network model for Deep Q Learning\ndef OurModel(input_shape, action_space):\n    X_input = Input(input_shape)\n    X = X_input\n\n    # 'Dense' is the basic form of a neural network layer\n    # Input Layer of state size(4) and Hidden Layer with 512 nodes\n    X = Dense(512, input_shape=input_shape, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n\n    # Hidden layer with 256 nodes\n    X = Dense(256, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n    \n    # Hidden layer with 64 nodes\n    X = Dense(64, activation=\"relu\")(X)\n    X = Dropout(0.5)(X)\n    \n    # Output Layer with # of actions: 2 nodes (left, right)\n    X = Dense(action_space, activation=\"linear\")(X)\n\n    model = Model(inputs = X_input, outputs = X, name='CartPole model')\n    model.compile(loss='mse', optimizer=Adam())\n    \n    return model\n\nclass DQNAgent:\n    def __init__(self):\n        self.env = 
gym.make('CartPole-v1')\n        self.state_size = self.env.observation_space.shape[0]\n        self.action_size = self.env.action_space.n\n        self.EPISODES = 1000\n        self.memory = deque(maxlen=2000)\n        \n        self.gamma = 0.95    # discount rate\n        self.epsilon = 1.0  # exploration rate\n        self.epsilon_min = 0.0001\n        self.epsilon_decay = 0.999\n        self.batch_size = 128\n\n        self.model = OurModel(input_shape=(self.state_size,), action_space = self.action_size)\n\n    def remember(self, state, action, reward, next_state, done):\n        self.memory.append((state, action, reward, next_state, done))\n\n    def act(self, state):\n        if np.random.random() \u003c= self.epsilon:\n            return random.randrange(self.action_size)\n        else:\n            return np.argmax(self.model.predict(state))\n\n    def replay(self):\n        x_batch, y_batch = [], []\n        # Randomly sample minibatch from the memory\n        minibatch = random.sample(self.memory, min(len(self.memory), self.batch_size))\n        for state, action, reward, next_state, done in minibatch:\n            # make the agent to approximately map the current state to future discounted reward\n            # We'll call that y_target\n            y_target = self.model.predict(state)\n            # if done, make our target reward\n            if done:\n                y_target[0][action] = reward\n            else:\n                # predict the future discounted reward\n                y_target[0][action] = reward + self.gamma * np.max(self.model.predict(next_state)[0])\n            # append results to lists, that will be used for training\n            x_batch.append(state[0])\n            y_batch.append(y_target[0])\n\n        # Train the Neural Network with batches\n        self.model.fit(np.array(x_batch), np.array(y_batch), batch_size=len(x_batch), verbose=0)\n        if self.epsilon \u003e self.epsilon_min:\n            self.epsilon *= 
self.epsilon_decay\n            \n    def load(self, name):\n        self.model = load_model(name)\n\n    def save(self, name):\n        self.model.save(name)\n\n    def run(self):\n        for e in range(self.EPISODES):\n            state = self.env.reset()\n            state = np.reshape(state, [1, self.state_size])\n            done = False\n            i = 0\n            while not done:\n                self.env.render()\n                action = self.act(state)\n                next_state, reward, done, _ = self.env.step(action)\n                next_state = np.reshape(next_state, [1, self.state_size])\n                if not done:\n                    reward = reward\n                else:\n                    reward = -10\n                self.remember(state, action, reward, next_state, done)\n                state = next_state\n                i += 1\n                if done:\n                    print(\"episode: {}/{}, score: {}, e: {:.2}\".format(e, self.EPISODES, i, self.epsilon))\n                    if i == 500:\n                        print(\"Saving trained model as cartpole-dqn.h5\")\n                        self.save(\"cartpole-dqn.h5\")\n                    break\n                self.replay()\n\n    def test(self):\n        self.load(\"cartpole-dqn.h5\")\n        for e in range(self.EPISODES):\n            state = self.env.reset()\n            state = np.reshape(state, [1, self.state_size])\n            done = False\n            i = 0\n            while not done:\n                self.env.render()\n                action = np.argmax(self.model.predict(state))\n                next_state, reward, done, _ = self.env.step(action)\n                state = np.reshape(next_state, [1, self.state_size])\n                i += 1\n                if done:\n                    print(\"episode: {}/{}, score: {}\".format(e, self.EPISODES, i))\n                    break\n\nif __name__ == \"__main__\":\n    agent = DQNAgent()\n    agent.run()\n    
#agent.test()\n```\n\n# DQN CartPole training part\n\nBelow is the part of the code responsible for training our DQN model. I will not explain it line by line, because everything was covered above. In our code, we train for 1000 episodes of the game. If you don't want to watch the training, you can comment out the line ```self.env.render()```. Every step is rendered here, and while ```done``` is False, our model keeps training. We save the results from every step to memory, which we use for training on every step. When our model hits a score of 500, we save it and can already use it for testing. But I recommend not stopping training at the first save; give it more time to train before testing. It may take up to 100 episodes before it reaches a score of 500. You may ask why it takes so long. The answer is simple: it is because of the Dropout layers in our model. Without dropout it may reach 500 much faster, but then our testing results would be worse. So, here is the code for this short explanation:\n```\ndef run(self):\n    for e in range(self.EPISODES):\n        state = self.env.reset()\n        state = np.reshape(state, [1, self.state_size])\n        done = False\n        i = 0\n        while not done:\n            self.env.render()\n            action = self.act(state)\n            next_state, reward, done, _ = self.env.step(action)\n            next_state = np.reshape(next_state, [1, self.state_size])\n            if not done:\n                reward = reward\n            else:\n                reward = -10\n            self.remember(state, action, reward, next_state, done)\n            state = next_state\n            i += 1\n            if done:\n                print(\"episode: {}/{}, score: {}, e: {:.2}\".format(e, self.EPISODES, i, self.epsilon))\n                if i == 500:\n                    print(\"Saving trained model as cartpole-dqn.h5\")\n                    self.save(\"cartpole-dqn.h5\")\n                break\n            self.replay()\n```\n\nFor 
me, the model reached a score of 500 in the 73rd episode, and that is when it was saved:\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/pythonlessons/CartPole_reinforcement_learning/blob/master/IMAGES/training_model.PNG\"\n\u003c/p\u003e\n\n\n# DQN CartPole testing part\nNow that you have trained your model, it's time to test it! Comment out the ```agent.run()``` line and uncomment ```agent.test()```. Then check how your first DQN model works!\n```\nif __name__ == \"__main__\":\n    agent = DQNAgent()\n    #agent.run()\n    agent.test()\n```\n\nHere are 20 test episodes of our trained model. As you can see, it hit the maximum score 16 times. It would be interesting to see how high it could score, but sadly the limit is 500.\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/pythonlessons/CartPole_reinforcement_learning/blob/master/IMAGES/testing_model.PNG\"\n\u003c/p\u003e\n\nAnd here is a short gif that shows how our agent performs:\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/pythonlessons/CartPole_reinforcement_learning/blob/master/IMAGES/CartPole_test.gif\"\n\u003c/p\u003e\n    \nFor this task our goal was reached. A short recap of what we have done:\n\n* Learned how DQN works\n* Wrote a simple DQN model\n* Taught the model to play the CartPole game\n\nThis is the end of this tutorial. I challenge you to try creating your own RL agents! Let me know how they perform in solving the cartpole problem. Furthermore, stay tuned for more tutorials.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonlessons%2Fcartpole_reinforcement_learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythonlessons%2Fcartpole_reinforcement_learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythonlessons%2Fcartpole_reinforcement_learning/lists"}