{"id":20652427,"url":"https://github.com/thesofakillers/dlml-tutorial","last_synced_at":"2025-10-05T04:23:55.205Z","repository":{"id":169793430,"uuid":"645817969","full_name":"thesofakillers/dlml-tutorial","owner":"thesofakillers","description":"🤓 A tutorial on the Discretized Logistic Mixture Likelihood (DLML)","archived":false,"fork":false,"pushed_at":"2024-08-10T16:34:36.000Z","size":30,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T19:03:31.992Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.giuliostarace.com/posts/dlml-tutorial/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thesofakillers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-26T14:00:24.000Z","updated_at":"2024-10-12T20:30:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"0a48ff87-b36c-4420-b94b-f7218244e203","html_url":"https://github.com/thesofakillers/dlml-tutorial","commit_stats":null,"previous_names":["thesofakillers/dlml-tutorial"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/thesofakillers/dlml-tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Fdlml-tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Fdlml-tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Fdlml-tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/thesofakillers%2Fdlml-tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thesofakillers","download_url":"https://codeload.github.com/thesofakillers/dlml-tutorial/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thesofakillers%2Fdlml-tutorial/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265820261,"owners_count":23833563,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T17:34:56.887Z","updated_at":"2025-10-05T04:23:55.102Z","avatar_url":"https://github.com/thesofakillers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003e originally posted on\n\u003e [giuliostarace.com/posts/dlml-tutorial/](https://www.giuliostarace.com/posts/dlml-tutorial/)\n\u003e (recommended for better math rendering)\n\n\u003e UPDATE (2024-08-10): I have added a bit more detail in [The How](#the-how)\n\u003e section regarding how we implement the edge cases. From \"Let's go line by\n\u003e line\" to \"in lines 7 through 9\".\n\n# Discretized Logistic Mixture Likelihood - The Why, What, and How\n\nIn this post I will explain what the Discretized Logistic Mixture Likelihood\n(DLML)[^1] is. This is a modeling method particularly relevant to my MSc thesis,\nwhere I use it to model the continuous action my imitation learning agent should\nmake. 
While there are already some great posts explaining the concept, the\ninformation is scattered, which can make understanding the concept a bit\npainful. I will first start with motivating _why_ we need DLML. I will then\npresent _what_ DLML is. Finally I will outline _how_ we can implement DLML in\n[PyTorch](https://pytorch.org/).\n\n\u003c!-- vim-markdown-toc GFM --\u003e\n\n* [The Why](#the-why)\n* [The What (and some more why)](#the-what-and-some-more-why)\n   * [Some more why](#some-more-why)\n* [The How](#the-how)\n   * [Training](#training)\n   * [Sampling](#sampling)\n* [Closing words](#closing-words)\n* [More Resources](#more-resources)\n\n\u003c!-- vim-markdown-toc --\u003e\n\n## The Why\n\nSuppose you wish to predict some variable that happens to be continuous,\nconditional on some other quantity. For example, you are interested in\npredicting the value of a given pixel in an image, given the values of\nneighbouring pixels.\n\nWith a bit of domain knowledge, this class of problem can typically be\nreformulated by discretizing the target variable and modeling the resulting\n(conditional) probability distribution. The prediction task can then be posed as\na classification one over the discretized bins: apply a\n[softmax](https://en.wikipedia.org/wiki/Softmax_function) and train using\n[cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy).\n\nThis is (in part) what the authors of\n[PixelCNN](https://arxiv.org/abs/1606.05328) did: for the task of conditional\nimage generation, they model each (sub)pixel of an image with a softmax over a\n256-dimensional vector, where each dimension represents an 8-bit intensity value\nthat the pixel may take. There are more details, particularly around the\nconditioning, but that's all you need to know for now for the premise of this\ntutorial.\n\nImmediately, we face a number of limitations:\n\n1. 
Softmax can be\n   [computationally expensive](https://en.wikipedia.org/wiki/Softmax_function#Computational_complexity_and_remedies)\n   and [unstable](https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/).\n   This is particularly problematic for high-dimensional inputs, which are\n   usually the case when dealing with (discretized) continuous variables. It\n   becomes even more of an issue if you plan to repeat the computation on several\n   output variables (e.g. several pixels of an output image, several dimensions\n   of robotic arm rotation, etc.), which is usually the case when interested in\n   conditional generation of continuous variables.\n\n2. [Softmax can lead to sparse gradients](https://arxiv.org/abs/1811.10779),\n   especially at the beginning of training, which can slow down learning. This\n   is again especially the case with high-dimensional inputs.\n\n3. Softmax does not model any sort of ordinality in the random variable that is\n   being considered: every single dimension in the input vector is considered\n   independently. There is no notion that a value of 127 is close to 128. This\n   ordinality is typically present when dealing with (discretized) continuous\n   variables by virtue of their nature. Rather than relying on some inductive\n   bias, the model has to devote more training time to learn this aspect of the\n   data, leading to slower training.\n\n4. Softmax fails to properly model values that are never observed, assigning\n   probabilities of 0 to values that may otherwise be more likely to occur.\n\nThese issues are at least some of the motivations for using DLML, which I will\nintroduce in the next section.\n\n## The What (and some more why)\n\nIn DLML, for a given output variable $y$ we do the following:\n\n1. We assume that there is a latent value $v$ with a continuous distribution.\n2. We take $y$ to come from a discretization of this continuous distribution of\n   $v$. 
We do this discretization in some arbitrary way, but usually by rounding\n   to the nearest 8-bit representation. What this means is that if e.g. $v$ can\n   be any value between 0 and 255, then $y$ will be any _integer_ between those\n   two numbers.\n3. We model $v$ using a simple continuous distribution - e.g. the\n   [logistic distribution](https://en.wikipedia.org/wiki/Logistic_distribution).\n4. We then take a further step, choosing to model $v$ as a mixture of $K$\n   logistic distributions:\n\n   $$ v \\sim \\sum_{i=1}^K \\pi_i \\text{logistic}(\\mu_i, s_i), $$\n\n   (equation 1) where $\\pi_i$ is some coefficient weighing the likelihood of the\n   $i$th distribution while $\\mu_i$ and $s_i$ are the mean and scale\n   parametrizing it.\n\n5. To compute the likelihood of $y$, we sum its (weighted) probability masses\n   over the $K$ mixture components. We can obtain the probability masses by\n   computing the difference between consecutive cumulative distribution function\n   (CDF) values of equation (1). Note that the\n   [CDF of the logistic distribution is a sigmoid function](https://en.wikipedia.org/wiki/Logistic_distribution#Cumulative_distribution_function).\n   We therefore write:\n\n   $$\n   p(y | \\mathbf{\\pi}, \\mathbf{\\mu}, \\mathbf{s} )  = \\sum_{i=1}^K \\pi_i\n   \\left[\\sigma\\left(\\frac{y + 0.5 - \\mu_i}{s_i}\\right) -\n   \\sigma\\left(\\frac{y - 0.5 - \\mu_i}{s_i}\\right)\\right],\n   $$\n\n   (equation 2) where $\\sigma$ is the logistic sigmoid. The 0.5 value comes from\n   the fact that we have discretized $v$ into $y$ through rounding, and\n   therefore the boundaries between successive values of our discrete random\n   variable $y$ lie half a unit on either side of each integer.\n\n6. 
We can additionally model edge cases, replacing $y - 0.5$ with $-\\infty$ when\n   $y=0$ and $y + 0.5$ with $+\\infty$ when $y = 2^8 - 1 = 255$.\n\nThis is nothing more than a likelihood, so we can use it in a\n[maximum likelihood estimation (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)\nprocess to estimate our parameters. In the case of Deep Learning, we use the\nnegative log likelihood as our loss function.\n[This comment on GitHub](https://github.com/Rayhane-mamah/Tacotron-2/issues/155#issuecomment-413364857)\nprovides a different perspective on what's going on.\n\n### Some more why\n\nThis approach provides a number of advantages, many of which address the\nshortcomings of the softmax approach described in\n[the previous section](#the-why). In particular:\n\n1. It avoids assigning probability mass outside the valid range of [0, 255] by\n   explicitly modeling the rounding and edge cases.\n\n2. Edge values are naturally assigned higher probability values, which tends to\n   align with what is observed when dealing with this kind of data.\n\n3. We rely on the simple sigmoid function, which is less computationally\n   expensive than its multi-class cousin the softmax. This addresses limitation\n   1 from above.\n\n4. Because we are now making use of the logistic distribution to model the\n   (latent) value of $y$, we are implicitly also modelling ordinality when\n   discretizing, since the logistic distribution is continuous. This addresses\n   limitation 3 from above.\n\n5. Our reliance on a continuous distribution similarly addresses limitation 4\n   from above: every bin receives some probability mass, so we no longer\n   prematurely assign zero probability to values that were simply never\n   observed.\n\n6. Empirically it has been found that only a small number of mixture\n   components, $\\le 10$, is enough. What this means is that we can work with much lower\n   dimensionality network outputs (3 parameters: $\\mu$, $s$ and $\\pi$ for each\n   mixture element), leading to denser gradients. 
This addresses limitation 2\n   from above.\n\n7. Because we make use of a mixture, we can more easily model multi-modal data.\n   This can be desirable when learning skills from imitation, where the same\n   skill can be shown to be completed through different action sequences. It is\n   exactly for this reason that\n   [Lynch et al. 2020](https://proceedings.mlr.press/v100/lynch20a.html) and\n   [Mees et al. 2022](https://arxiv.org/abs/2204.06252) make use of DLML in\n   their action decoders.\n\n## The How\n\nSo how do we actually go about implementing this? This is one of those\ntechniques where we do slightly different things depending on whether we are\ntraining or whether we just want outputs from our model (sampling). For\ncompleteness, I provide a full reference implementation of the code below on my\n[GitHub](https://github.com/thesofakillers/dlml-tutorial).\n\n### Training\n\nEarlier, we defined a likelihood for $y$. We can use this to train our model to\noutput the appropriate parameters $\\mu$, $s$, and $\\pi$ for a given input using\nMLE. In practice we will calculate the likelihood and then minimize the negative\nlog likelihood.\n\nFor a given output variable $y$, using $K$ mixture elements, our model should\ntherefore output $K$ means, $K$ (log) scales and $K$ mixture logits:\n\n```python {linenos=table}\n# each of these has shape (B x K)\nmeans, log_scales, mixture_logits = model(**batch['inputs'])\ninv_scales = torch.exp(-log_scales)\n```\n\nWe treat the predicted scales as $\\log(s)$, which we can then follow with an\nexponential to recover $s$. This is to enforce positive values of $s$ and for\nnumerical stability. In practice we take an exponential of the negative\n`log_scales` to obtain `inv_scales` i.e. 
$\\frac{1}{s}$, since $s$ is always used\nin the denominator in our formulas.\n\nWe can then start computing the rest of the terms for our likelihood from\nequation (2).\n\n```python {linenos=table}\ny = batch['targets']\n# half a bin width in the rescaled space, explained in text\nepsilon = (0.5*y_range) / (num_classes - 1)\n# convenience variable: repeat y along a new trailing axis, one copy per mean\ncentered_y = y.unsqueeze(-1).repeat(1, 1, means.shape[-1]) - means\n# inputs to our sigmoid functions\nupper_bound_in = inv_scales * (centered_y + epsilon)\nlower_bound_in = inv_scales * (centered_y - epsilon)\n# remember: cdf of logistic distr is sigmoid of above input format\nupper_cdf = torch.sigmoid(upper_bound_in)\nlower_cdf = torch.sigmoid(lower_bound_in)\n# finally, the probability mass and equivalent log prob\nprob_mass = upper_cdf - lower_cdf\nvanilla_log_prob = torch.log(torch.clamp(prob_mass, min=1e-12))\n```\n\nBefore I go on, you may be asking - \"what is this epsilon? Weren't we\nadding/subtracting $0.5$ instead?\". Indeed, this is actually still the case.\nHowever, here we are operating on the assumption that we have scaled our $y$'s\nto be in the range [-1, 1]. For 8-bit data, that is equivalent to scaling by\n$\\frac{2}{2^8 - 1}$ (and then shifting by $-1$). We have to apply this same\nscaling to our $0.5$ boundaries for consistency. Note that `y_range` is simply\n$1 - (-1) = 2$, the $2$ in the numerator, and that `num_classes` for $y$ is\n$2^8 = 256$, so `epsilon` works out to $\\frac{1}{255}$, half a bin width. A\nfinal note: we need to play with `y`'s shape a little bit when computing\n`centered_y`: remember, for each target variable, we have multiple means (one\nfor each mixture component), while we have a single target value. We therefore\nneed to repeat the target value for each mixture component to complete the\nsubtraction. 
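As a quick sanity check of the scaling just described, here is a minimal sketch in plain Python (no PyTorch) that recomputes `epsilon` for 8-bit data and checks that the bin masses of a single logistic component add up to one once the two edge bins are extended to infinity. The helper names `sigmoid` and `bin_prob`, as well as the `mean` and `scale` values, are made up for illustration and are not taken from the accompanying repository.

```python
import math


def sigmoid(x):
    # logistic sigmoid for a Python float, written to avoid overflow in exp
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


num_classes = 256          # 8-bit data
y_min, y_max = -1.0, 1.0   # targets rescaled to [-1, 1]
y_range = y_max - y_min
# half a bin width in the rescaled space, i.e. 1/255
epsilon = (0.5 * y_range) / (num_classes - 1)


def bin_prob(k, mean, scale):
    # probability mass of the k-th bin under one logistic component
    y = y_min + k * y_range / (num_classes - 1)
    # the top bin extends to +infinity, the bottom bin to -infinity
    upper = 1.0 if k == num_classes - 1 else sigmoid((y + epsilon - mean) / scale)
    lower = 0.0 if k == 0 else sigmoid((y - epsilon - mean) / scale)
    return upper - lower


# illustrative (assumed) parameters for a single mixture component
total = sum(bin_prob(k, mean=0.3, scale=0.05) for k in range(num_classes))
print(epsilon, total)
```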
We now move\non to the edge cases described in step 6 of\n[the what](#the-what-and-some-more-why).\n\n```python {linenos=table}\n# edges\n# log probability for edge case of 0 (before scaling)\nlow_bound_log_prob = upper_bound_in - F.softplus(upper_bound_in)\n# log probability for edge case of 255 (before scaling)\nupp_bound_log_prob = - F.softplus(lower_bound_in)\n# middle\nmid_in = inv_scales * centered_y\nlog_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)\nlog_prob_mid = log_pdf_mid - np.log((num_classes - 1) / 2)\n```\n\n\"Woah, what is going on here?\" you may ask. Let's go line by line. (`F` here\nis the usual alias for `torch.nn.functional`, and `np` for NumPy.)\n\nIn line 3, we are defining the log probability for the edge case where we have\n$y$ values at or below 0, the lower bound. We can use the CDF of the\nlogistic distribution to compute this probability, since, for a random variable\n$X$, $\\text{CDF}(x) = P(X \\le x)$. Remember, we are working with discrete values,\nso we are not interested in the probability of $y$ being exactly 0, but rather\nthe probability of $y$ being in the \"bin\" of 0 or below. In this case, we\ntherefore want the CDF of $y + \\epsilon$, the upper bound of the bin, whose\ncentered and scaled value we have stored in the variable `upper_bound_in`.\n\nRecall that the CDF of the (standardized) logistic distribution is the sigmoid\nfunction:\n\n$$\n\\text{CDF}(x) = \\sigma(x) = \\frac{1}{1 + e^{-x}}\n$$\n\nRecall that we are interested in _log_ probabilities:\n\n$$\n\\log(\\text{CDF}(x)) = \\log(\\sigma(x)) = -\\log(1 + e^{-x})\n$$\n\nFinally, we can leverage the softplus function $\\zeta(x) = \\log(1 + e^x)$, and\nexpress the log probability in terms of $\\zeta(x)$ for numerical stability. 
To\ndo this, we note:\n\n$$\n\\begin{align*}\n\\log(\\text{CDF}(x)) \u0026= -\\log(1 + e^{-x}) \\\\\\\\\\\\\n\u0026= -\\log\\left(\\frac{(1 + e^{-x})(e^{x})}{e^{x}}\\right) \\\\\\\\\\\\\n\u0026= -\\log\\left(\\frac{e^{x} + 1}{e^{x}}\\right) \\\\\\\\\\\\\n\u0026= -(\\log(e^{x} + 1) - \\log(e^{x})) \\\\\\\\\\\\\n\u0026= -(\\log(e^{x} + 1) - x) \\\\\\\\\\\\\n\u0026= x - \\log(1 + e^{x}) \\\\\\\\\\\\\n\u0026= x - \\zeta(x),\n\\end{align*}\n$$\n\nwhich is the expression we use in the code, with `upper_bound_in` as $x$.\n\nLet's now move on to line 5, where we compute the log probability for the edge\ncase where we have $y$ values at or above 255, the upper bound. The\nreasoning here is identical, but we are now interested in values _above_ a\ncertain value rather than below, so we rely on the _complement_ of the CDF of\nthe _lower bound_ of the bin. This is why we use `lower_bound_in` in this case.\n\nIn other words, we are interested in\n\n$$\nP(X \u003e x) = 1 - P(X \\le x) = 1 - \\text{CDF}(x)\n$$\n\nwhich we can take the logarithm of and then manipulate in terms of $\\zeta(x)$ as\nwe did above:\n\n$$\n\\begin{align*}\n\\log(1 - \\text{CDF}(x)) \u0026= \\log\\left(\\frac{e^{-x}}{1 + e^{-x}}\\right) \\\\\\\\\\\\\n\u0026= \\log(e^{-x}) - \\log(1 + e^{-x}) \\\\\\\\\\\\\n\u0026= -x - \\log(1 + e^{-x}) \\\\\\\\\\\\\n\u0026= -x - \\zeta(-x) \\\\\\\\\\\\\n\u0026= -x - (\\zeta(x) - x) \\\\\\\\\\\\\n\u0026= -\\zeta(x),\n\\end{align*}\n$$\n\nwhere in the second-to-last step we used the\n[identity](https://github.com/openai/pixel-cnn/issues/23)\n$\\zeta(-x) = \\zeta(x) - x$. This leaves us with the expression we use in the\ncode, with `lower_bound_in` as $x$.\n\nFinally, in lines 7 through 9, we also approximate the log probability at the\ncenter of the bin, based on the assumption that the log-density is constant in\nthe bin of the observed value: the probability mass is then roughly the density\nat the bin center times the bin width $\\frac{2}{2^8 - 1}$, which is why we\nsubtract $\\log\\frac{2^8 - 1}{2}$ from the log-density. This is used as a backup\nin cases where calculated probabilities are below 1e-5, which could happen due\nto numerical instability. 
This case is extremely rare and I would not dedicate too much\nthought to it, it is just there as a (rarely-used) backup.\n\nWe can now put all these terms together into a single log likelihood tensor:\n\n```python {linenos=table}\n# Create a tensor with the same shape as 'y', filled with zeros\nlog_probs = torch.zeros_like(y)\n# conditions for filling in tensor\nis_near_min = y \u003c output_min_bound + 1e-3\nis_near_max = y \u003e output_max_bound - 1e-3\nis_prob_mass_sufficient = prob_mass \u003e 1e-5\n# And then fill it in accordingly\n# lower edge\nlog_probs[is_near_min] = low_bound_log_prob[is_near_min]\n# upper edge\nlog_probs[is_near_max] = upp_bound_log_prob[is_near_max]\n# vanilla case\nlog_probs[\n    ~is_near_min \u0026 ~is_near_max \u0026 is_prob_mass_sufficient\n] = vanilla_log_prob[\n    ~is_near_min \u0026 ~is_near_max \u0026 is_prob_mass_sufficient\n]\n# extreme case where prob mass is too small\nlog_probs[\n    ~is_near_min \u0026 ~is_near_max \u0026 ~is_prob_mass_sufficient\n] = log_prob_mid[\n    ~is_near_min \u0026 ~is_near_max \u0026 ~is_prob_mass_sufficient\n]\n```\n\nWe are almost done, but there is one last piece. So far we have computed the\nterms to minimize for learning the distribution(s), i.e. learning $\\mu$ and $s$.\nWe also need to learn which mixture distribution to sample from, i.e. we have to\nlearn $\\pi$. This is very simple, and consists in adding a log of the softmax\nover the logits (the $\\pi_i$) outputted by our model:\n\n```python {linenos=table}\n# modeling which mixture to sample from\nlog_probs = log_probs + F.log_softmax(mixture_logits, dim=-1)\n```\n\nWe add a log of the softmax because we are after\n$\\log(\\text{softmax}(\\pi) \\cdot \\text{likelihood})$ which, by applying log\nproperties, is equivalent to\n$\\log(\\text{softmax}(\\pi)) + \\log(\\text{likelihood})$, which is what we get.\n\nAll that's left to do now is summing over our mixtures. 
We do this in log\nspace, using the\n[Log-Sum-Exp trick for numerical stability](https://gregorygundersen.com/blog/2020/02/09/log-sum-exp/).\nHere `log_sum_exp` is a small helper assumed to reduce over the last (mixture)\naxis; PyTorch's built-in `torch.logsumexp(log_probs, dim=-1)` should serve the\nsame purpose.\n\n```python {linenos=table}\nlog_likelihood = torch.sum(log_sum_exp(log_probs), dim=-1)\n```\n\nOur loss is the negative log likelihood, which we can choose to reduce across\nthe batch or return unreduced:\n\n```python {linenos=table}\nloss = -log_likelihood\n\nif reduction == 'mean':\n    loss = torch.mean(loss)\nelif reduction == 'sum':\n    loss = torch.sum(loss)\n```\n\nAnd that's it for training. Once you have your loss, you can run\n`loss.backward()` and all the cool stuff torch provides with autodiff.\n\n### Sampling\n\nSampling is fortunately a bit easier, and some people start their explanation\nfrom here. Here, we first sample a distribution from our mixture, and then\nsample a value from the sampled distribution. We have logits for each\ndistribution in our mixture, so we can sample from a softmax over these\nlogits. In practice, we make use of the\n[Gumbel-Max trick](https://timvieira.github.io/blog/post/2014/07/31/gumbel-max-trick/),\nwhich lets us draw such a sample directly from the logits by adding Gumbel noise\nand taking an argmax.\n\n```python {linenos=table}\n# each of these has shape (B x K)\nmeans, log_scales, mixture_logits = model(**batch['inputs'])\n\n# gumbel-max sampling\nr1, r2 = 1e-5, 1.0 - 1e-5\ntemp = (r1 - r2) * torch.rand(means.shape, device=means.device) + r2\ntemp = mixture_logits - torch.log(-torch.log(temp))\nargmax = torch.argmax(temp, -1)\n```\n\n`argmax` is the index of our sampled distribution. We can use it to get the\ndistribution's mean and scale from our model's outputs:\n\n```python {linenos=table}\n# (K dimensional vector, e.g. 
[0 0 0 1 0 0 0 0] for k=8, argmax=3)\nk = means.shape[-1]\ndist_one_hot = torch.eye(k, device=means.device)[argmax]\n\n# use it to pick out the sampled component's parameters (sum over the mixture axis)\nsampled_log_scale = (dist_one_hot * log_scales).sum(dim=-1)\nsampled_mean = (dist_one_hot * means).sum(dim=-1)\n```\n\nWe can then sample from our logistic distribution using\n[inverse sampling](https://www.statisticshowto.com/inverse-sampling/). For a\nlogistic distribution, this consists in\n\n$$\nX = \\mu + s \\log \\left(\\frac{y}{1-y} \\right),\n$$\n\n(equation 4) where we draw $y$ from a uniform distribution on $(0, 1)$ (not to\nbe confused with the target $y$ used during training). In code:\n\n```python {linenos=table}\n# scale the (0,1) uniform distribution and re-center it\ny = (r1 - r2) * torch.rand(sampled_mean.shape, device=sampled_mean.device) + r2\n\nsampled_output = sampled_mean + torch.exp(sampled_log_scale) * (\n    torch.log(y) - torch.log(1 - y)\n)\n```\n\nAnd just like that, we have a way of sampling from our model.\n\n## Closing words\n\nI hope this post was helpful. I came across DLML during my MSc thesis on\nlanguage-enabled imitation learning and while there are several high-quality\nposts online, I couldn't find a single one that summarized the process in its\nentirety, from motivation to implementation, so I decided to write it myself,\nalso as a way to help me understand the concept. 
As a reminder, the complete\ncode accompanying this post is available on\n[my GitHub profile](https://github.com/thesofakillers/dlml-tutorial).\n\n## More Resources\n\n- [Great comment on Tacotron GitHub](https://github.com/Rayhane-mamah/Tacotron-2/issues/155#issuecomment-413364857)\n- [Somewhat outdated Google Colab](https://colab.research.google.com/github/tensorchiefs/dl_book/blob/master/chapter_06/nb_ch06_01.ipynb)\n- [_PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications_, Salimans et al., 2017.](https://openreview.net/forum?id=BJrFC6ceg)\n- [_Learning Latent Plans from Play_, Lynch et al., 2020.](https://proceedings.mlr.press/v100/lynch20a.html)\n- [_What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data_, Mees et al., 2022.](https://arxiv.org/abs/2204.06252)\n\n[^1]:\n    Also known as \"Discretized Mixture of Logistics (DMoL)\", \"Discretized\n    Logistic Mixture (DLM)\", \"Mixture of Discretized Logistics (MDL)\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthesofakillers%2Fdlml-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthesofakillers%2Fdlml-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthesofakillers%2Fdlml-tutorial/lists"}