# MGL Manual

###### \[in package MGL\]
## MGL ASDF System

- Version: 0.1.0
- Description: MGL is a machine learning library for backpropagation
  neural networks, Boltzmann machines, Gaussian processes and more.
- Licence: MIT, see COPYING.
- Author: Gábor Melis <mega@retes.hu>
- Mailto: [mega@retes.hu](mailto:mega@retes.hu)
- Homepage: [http://melisgl.github.io/mgl](http://melisgl.github.io/mgl)
- Bug tracker: [https://github.com/melisgl/mgl/issues](https://github.com/melisgl/mgl/issues)
- Source control: [GIT](https://github.com/melisgl/mgl.git)

## Introduction

### Overview

MGL is a Common Lisp machine learning library by [Gábor
Melis](http://quotenil.com) with some parts originally contributed
by Ravenpack International. It mainly concentrates on various forms
of neural networks (Boltzmann machines, feed-forward and recurrent
backprop nets). Most of MGL is built on top of MGL-MAT, so it has
BLAS and CUDA support.

In general, the focus is on power and performance, not on ease of
use.
Perhaps one day there will be a cookie-cutter interface with
restricted functionality if a reasonable compromise is found between
power and utility.

### Links

Here is the [official repository](https://github.com/melisgl/mgl)
and the [HTML
documentation](http://melisgl.github.io/mgl-pax-world/mgl-manual.html)
for the latest version.

### Dependencies

MGL used to rely on [LLA](https://github.com/tpapp/lla) to
interface to BLAS and LAPACK. That's mostly history by now, but
configuration of foreign libraries is still done via LLA. See the
README in LLA on how to set things up. Note that these days OpenBLAS
is easier to set up than ATLAS and just as fast.

[CL-CUDA](https://github.com/takagi/cl-cuda) and
[MGL-MAT](https://github.com/melisgl/mgl-mat) are the two main
dependencies and also the ones not yet in Quicklisp, so just drop
them into `quicklisp/local-projects/`. If there is no suitable GPU
on the system or the CUDA SDK is not installed, MGL will simply
fall back on using BLAS and Lisp code. Wrapping code in
MGL-MAT:WITH-CUDA\* is basically all that's needed to run on the GPU,
and with MGL-MAT:CUDA-AVAILABLE-P one can check whether the GPU is
really being used.

### Code Organization

MGL consists of several packages dedicated to different tasks.
For example, package `MGL-RESAMPLE` is about
MGL-RESAMPLE::@MGL-RESAMPLE and `MGL-GD` is about MGL-GD::@MGL-GD
and so on. On one hand, having many packages makes it easier to
cleanly separate API and implementation and also to explore a
specific task.
At other times, they can be a hassle, so the MGL
package itself reexports every external symbol found in all the
other packages that make up MGL and MGL-MAT (see
MGL-MAT::@MAT-MANUAL) on which it heavily relies.

One exception to this rule is the bundled but independent
MGL-GNUPLOT library.

The built-in tests can be run with:

    (ASDF:OOS 'ASDF:TEST-OP '#:MGL)

Note that most of the tests are rather stochastic and can fail once
in a while.

### Glossary

Ultimately machine learning is about creating **models** of some
domain. The observations in the modelled domain are called
**instances** (also known as examples or samples). Sets of instances
are called **datasets**. Datasets are used when fitting a model or
when making **predictions**. Sometimes the word predictions is too
specific, and the results obtained from applying a model to some
instances are simply called **results**.

## Datasets

###### \[in package MGL-DATASET\]
An instance can often be any kind of object of the user's choice.
It is typically represented by a set of numbers, which is called a
feature vector, or by a structure holding the feature vector, the
label, etc. A dataset is a SEQUENCE of such instances or a
@MGL-SAMPLER object that produces instances.

- [function] MAP-DATASET FN DATASET

    Call FN with each instance in DATASET. This is basically equivalent
    to iterating over the elements of a sequence or a sampler (see
    @MGL-SAMPLER).

- [function] MAP-DATASETS FN DATASETS &KEY (IMPUTE NIL IMPUTEP)

    Call FN with a list of instances, one from each dataset in
    DATASETS. Return nothing. If IMPUTE is specified, then iterate until
    the largest dataset is consumed, imputing IMPUTE for missing values.
    If IMPUTE is not specified, then iterate until the smallest dataset
    runs out.
    
    ```common-lisp
    (map-datasets #'prin1 '((0 1 2) (:a :b)))
    .. (0 :A)(1 :B)
    
    (map-datasets #'prin1 '((0 1 2) (:a :b)) :impute nil)
    .. (0 :A)(1 :B)(2 NIL)
    ```
    
    It is of course allowed to mix sequences with samplers:
    
    ```common-lisp
    (map-datasets #'prin1
                  (list '(0 1 2)
                        (make-sequence-sampler '(:a :b) :max-n-samples 2)))
    .. (0 :A)(1 :B)
    ```

### Samplers

Some algorithms do not need random access to the entire dataset and
can work with a stream of observations. Samplers are simple generators
providing two functions: SAMPLE and FINISHEDP.

- [generic-function] SAMPLE SAMPLER

    If SAMPLER has not run out of data (see FINISHEDP),
    SAMPLE returns an object that represents a sample from the world to
    be experienced or, in other words, simply something that can be used
    as input for training or prediction. It is not allowed to call
    SAMPLE if SAMPLER is FINISHEDP.

- [generic-function] FINISHEDP SAMPLER

    See if SAMPLER has run out of examples.

- [function] LIST-SAMPLES SAMPLER MAX-SIZE

    Return a list of at most MAX-SIZE samples, fewer if SAMPLER runs
    out.

- [function] MAKE-SEQUENCE-SAMPLER SEQ &KEY MAX-N-SAMPLES

    Create a sampler that returns elements of SEQ in their original
    order. If MAX-N-SAMPLES is non-nil, then at most MAX-N-SAMPLES are
    sampled.

- [function] MAKE-RANDOM-SAMPLER SEQ &KEY MAX-N-SAMPLES (REORDER #'MGL-RESAMPLE:SHUFFLE)

    Create a sampler that returns elements of SEQ in random order. If
    MAX-N-SAMPLES is non-nil, then at most MAX-N-SAMPLES are sampled.
    The first pass is over a shuffled copy of SEQ, and this copy is
    reshuffled whenever the sampler reaches the end of it. Shuffling is
    performed by calling the REORDER function.

- [variable] *INFINITELY-EMPTY-DATASET* #\<FUNCTION-SAMPLER "infinitely empty" \>

    This is the default dataset for MGL-OPT:MINIMIZE.
    It's an infinite
    stream of NILs.

#### Function Sampler

- [class] FUNCTION-SAMPLER

    A sampler with a function in its GENERATOR slot that
    produces a stream of samples which may or may not be finite
    depending on MAX-N-SAMPLES. FINISHEDP returns T iff MAX-N-SAMPLES is
    non-nil and not greater than the number of samples
    generated (N-SAMPLES).
    
        (list-samples (make-instance 'function-sampler
                                     :generator (lambda ()
                                                  (random 10))
                                     :max-n-samples 5)
                      10)
        => (3 5 2 3 3)

- [reader] GENERATOR FUNCTION-SAMPLER (:GENERATOR)

    A generator function of no arguments that returns
    the next sample.

- [accessor] MAX-N-SAMPLES FUNCTION-SAMPLER (:MAX-N-SAMPLES = NIL)

- [reader] NAME FUNCTION-SAMPLER (:NAME = NIL)

    An arbitrary object naming the sampler. Only used
    for printing the sampler object.

- [reader] N-SAMPLES FUNCTION-SAMPLER (:N-SAMPLES = 0)

## Resampling

###### \[in package MGL-RESAMPLE\]
The focus of this package is on resampling methods such as
cross-validation and bagging, which can be used for model evaluation,
model selection, and also as a simple form of ensembling. Data
partitioning and sampling functions are also provided because they
tend to be used together with resampling.

### Partitions

The following functions partition a dataset (currently only
SEQUENCEs are supported) into a number of partitions. For each
element in the original dataset there is exactly one partition that
contains it.

- [function] FRACTURE FRACTIONS SEQ &KEY WEIGHT

    Partition SEQ into a number of subsequences. FRACTIONS is either a
    positive integer or a list of non-negative real numbers. WEIGHT is
    NIL or a function that returns a non-negative real number when
    called with an element from SEQ.
    If FRACTIONS is a positive integer,
    then return a list of that many subsequences with equal sums of
    weights, bar rounding errors; else partition SEQ into subsequences
    where the sum of weights of subsequence I is proportional to element
    I of FRACTIONS. If WEIGHT is NIL, then every element is assumed to
    have the same weight.
    
    To split into 5 sequences:
    
    ```common-lisp
    (fracture 5 '(0 1 2 3 4 5 6 7 8 9))
    => ((0 1) (2 3) (4 5) (6 7) (8 9))
    ```
    
    To split into two sequences whose lengths are proportional to 2 and
    3:
    
    ```common-lisp
    (fracture '(2 3) '(0 1 2 3 4 5 6 7 8 9))
    => ((0 1 2 3) (4 5 6 7 8 9))
    ```

- [function] STRATIFY SEQ &KEY (KEY #'IDENTITY) (TEST #'EQL)

    Return the list of strata of SEQ. SEQ is a sequence of elements for
    which the function KEY returns the class they belong to. Such
    classes are opaque objects compared for equality with TEST. A
    stratum is a sequence of elements with the same (under TEST) KEY.
    
    ```common-lisp
    (stratify '(0 1 2 3 4 5 6 7 8 9) :key #'evenp)
    => ((0 2 4 6 8) (1 3 5 7 9))
    ```

- [function] FRACTURE-STRATIFIED FRACTIONS SEQ &KEY (KEY #'IDENTITY) (TEST #'EQL) WEIGHT

    Similar to FRACTURE, but also makes sure that keys are evenly
    distributed among the partitions (see STRATIFY). It can be useful
    for classification tasks to partition the data set while keeping the
    distribution of classes the same.
    
    Note that the sets returned are not in random order.
    In fact, they
    are sorted internally by KEY.
    
    For example, to make two splits with approximately the same number
    of even and odd numbers:
    
    ```common-lisp
    (fracture-stratified 2 '(0 1 2 3 4 5 6 7 8 9) :key #'evenp)
    => ((0 2 1 3) (4 6 8 5 7 9))
    ```

### Cross-validation

- [function] CROSS-VALIDATE DATA FN &KEY (N-FOLDS 5) (FOLDS (ALEXANDRIA:IOTA N-FOLDS)) (SPLIT-FN #'SPLIT-FOLD/MOD) PASS-FOLD

    Map FN over the FOLDS of DATA split with SPLIT-FN and collect the
    results in a list. The simplest demonstration is:
    
    ```common-lisp
    (cross-validate '(0 1 2 3 4)
                    (lambda (test training)
                      (list test training))
                    :n-folds 5)
    => (((0) (1 2 3 4))
        ((1) (0 2 3 4))
        ((2) (0 1 3 4))
        ((3) (0 1 2 4))
        ((4) (0 1 2 3)))
    ```
    
    Of course, in practice one would typically train a model and return
    the trained model and/or its score on TEST. Also, sometimes one may
    want to do only some of the folds and remember which ones they were:
    
    ```common-lisp
    (cross-validate '(0 1 2 3 4)
                    (lambda (fold test training)
                      (list :fold fold test training))
                    :folds '(2 3)
                    :pass-fold t)
    => ((:fold 2 (2) (0 1 3 4))
        (:fold 3 (3) (0 1 2 4)))
    ```
    
    Finally, the way the data is split can be customized. By default
    SPLIT-FOLD/MOD is called with the arguments DATA, the fold (from
    among FOLDS) and N-FOLDS. SPLIT-FOLD/MOD returns two values, which
    are then passed on to FN. One can use SPLIT-FOLD/CONT or
    SPLIT-STRATIFIED or any other function that works with these
    arguments.
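
    For instance, the two values returned by the default SPLIT-FN can
    be inspected directly. This example follows from SPLIT-FOLD/MOD's
    documented behavior (fold 0 of 3 picks the elements at indices 0
    and 3); the exact printed form may differ:
    
    ```common-lisp
    (multiple-value-list (split-fold/mod '(0 1 2 3 4 5) 0 3))
    => ((0 3) (1 2 4 5))
    ```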
    The only real constraint is that FN has to take as many
    arguments (plus the fold argument if PASS-FOLD) as SPLIT-FN
    returns.

- [function] SPLIT-FOLD/MOD SEQ FOLD N-FOLDS

    Partition SEQ into two sequences: one with the elements of SEQ
    whose indices have remainder FOLD when divided by N-FOLDS, and a
    second one with the rest. The second one is the larger set. The
    order of elements remains stable. This function is suitable as the
    SPLIT-FN argument of CROSS-VALIDATE.

- [function] SPLIT-FOLD/CONT SEQ FOLD N-FOLDS

    Imagine dividing SEQ into N-FOLDS subsequences of the same
    size (bar rounding). Return the subsequence of index FOLD as the
    first value and all the other subsequences concatenated into one
    as the second value. The order of elements remains stable. This
    function is suitable as the SPLIT-FN argument of CROSS-VALIDATE.

- [function] SPLIT-STRATIFIED SEQ FOLD N-FOLDS &KEY (KEY #'IDENTITY) (TEST #'EQL) WEIGHT

    Split SEQ into N-FOLDS partitions (as in FRACTURE-STRATIFIED).
    Return the partition of index FOLD as the first value, and the
    concatenation of the rest as the second value. This function is
    suitable as the SPLIT-FN argument of CROSS-VALIDATE (most likely
    as a closure with KEY, TEST, WEIGHT bound).

### Bagging

- [function] BAG SEQ FN &KEY (RATIO 1) N WEIGHT (REPLACEMENT T) KEY (TEST #'EQL) (RANDOM-STATE \*RANDOM-STATE\*)

    Sample from SEQ with SAMPLE-FROM (passing RATIO, WEIGHT,
    REPLACEMENT), or with SAMPLE-STRATIFIED if KEY is not NIL. Call FN
    with the sample. If N is NIL, then keep repeating this until FN
    performs a non-local exit. Else N must be a non-negative integer:
    N iterations are performed, and the primary values returned by FN
    are collected into a list and returned.
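
    For instance, to collect three samples of about half of the
    elements each, drawn without replacement (the output below is
    non-deterministic and only illustrative):
    
    ```common-lisp
    (bag '(0 1 2 3 4 5) #'identity :ratio 1/2 :replacement nil :n 3)
    => ((5 3 2) (0 4 1) (2 5 0))
    ```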
    See SAMPLE-FROM and SAMPLE-STRATIFIED for
    examples.

- [function] SAMPLE-FROM RATIO SEQ &KEY WEIGHT REPLACEMENT (RANDOM-STATE \*RANDOM-STATE\*)

    Return a sequence constructed by sampling with or without
    REPLACEMENT from SEQ. The sum of weights in the result sequence will
    approximately be the sum of weights of SEQ times RATIO. If WEIGHT is
    NIL, then elements are assumed to have equal weights, else WEIGHT
    should return a non-negative real number when called with an element
    of SEQ.
    
    To randomly select half of the elements:
    
    ```common-lisp
    (sample-from 1/2 '(0 1 2 3 4 5))
    => (5 3 2)
    ```
    
    To randomly select some elements such that the sum of their weights
    constitutes about half of the sum of weights across the whole
    sequence:
    
    ```common-lisp
    (sample-from 1/2 '(0 1 2 3 4 5 6 7 8 9) :weight #'identity)
    => ;; sums to 28, which is near 45/2
       (9 4 1 6 8)
    ```
    
    To sample with replacement (that is, allowing an element to be
    sampled multiple times):
    
    ```common-lisp
    (sample-from 1 '(0 1 2 3 4 5) :replacement t)
    => (1 1 5 1 4 4)
    ```

- [function] SAMPLE-STRATIFIED RATIO SEQ &KEY WEIGHT REPLACEMENT (KEY #'IDENTITY) (TEST #'EQL) (RANDOM-STATE \*RANDOM-STATE\*)

    Like SAMPLE-FROM but makes sure that the weighted proportion of
    classes in the result is approximately the same as the proportion
    in SEQ. See STRATIFY for the description of KEY and TEST.

### CV Bagging

- [function] BAG-CV DATA FN &KEY N (N-FOLDS 5) (FOLDS (ALEXANDRIA:IOTA N-FOLDS)) (SPLIT-FN #'SPLIT-FOLD/MOD) PASS-FOLD (RANDOM-STATE \*RANDOM-STATE\*)

    Perform cross-validation on different shuffles of DATA N times and
    collect the results. Since CROSS-VALIDATE collects the return values
    of FN, the return value of this function is a list of lists of FN
    results.
    If N is NIL, don't collect anything; just keep doing
    repeated CVs until FN performs a non-local exit.
    
    The following example simply collects the test and training sets for
    2-fold CV repeated 3 times with shuffled data:
    
    ```common-lisp
    ;;; This is non-deterministic.
    (bag-cv '(0 1 2 3 4) #'list :n 3 :n-folds 2)
    => ((((2 3 4) (1 0))
         ((1 0) (2 3 4)))
        (((2 1 0) (4 3))
         ((4 3) (2 1 0)))
        (((1 0 3) (2 4))
         ((2 4) (1 0 3))))
    ```
    
    CV bagging is useful when a single CV is not producing stable
    results. As an ensemble method, CV bagging has the advantage over
    bagging that each example will occur the same number of times, and
    after the first CV is complete there is a complete but less reliable
    estimate for each example, which gets refined by further CVs.

### Miscellaneous Operations

- [function] SPREAD-STRATA SEQ &KEY (KEY #'IDENTITY) (TEST #'EQL)

    Return a sequence that's a reordering of SEQ such that elements
    belonging to different strata (under KEY and TEST, see STRATIFY) are
    distributed evenly.
    The order of elements belonging to the same
    stratum is unchanged.
    
    For example, to make sure that even and odd numbers are distributed
    evenly:
    
    ```common-lisp
    (spread-strata '(0 2 4 6 8 1 3 5 7 9) :key #'evenp)
    => (0 1 2 3 4 5 6 7 8 9)
    ```
    
    Same thing with unbalanced classes:
    
    ```common-lisp
    (spread-strata (vector 0 2 3 5 6 1 4)
                   :key (lambda (x)
                          (if (member x '(1 4))
                              t
                              nil)))
    => #(0 1 2 3 4 5 6)
    ```

- [function] ZIP-EVENLY SEQS &KEY RESULT-TYPE

    Make a single sequence out of the sequences in SEQS so that in the
    returned sequence indices of elements belonging to the same source
    sequence are spread evenly across the whole range. The result is a
    list if RESULT-TYPE is LIST, and a vector if RESULT-TYPE is VECTOR.
    If RESULT-TYPE is NIL, then it's determined by the type of the first
    sequence in SEQS.
    
    ```common-lisp
    (zip-evenly '((0 2 4) (1 3)))
    => (0 1 2 3 4)
    ```

## Core

###### \[in package MGL-CORE\]
### Persistence

- [function] LOAD-STATE FILENAME OBJECT

    Load weights of OBJECT from FILENAME. Return OBJECT.

- [function] SAVE-STATE FILENAME OBJECT &KEY (IF-EXISTS :ERROR) (ENSURE T)

    Save weights of OBJECT to FILENAME. If ENSURE, then
    ENSURE-DIRECTORIES-EXIST is called on FILENAME. IF-EXISTS is passed
    on to OPEN. Return OBJECT.

- [function] READ-STATE OBJECT STREAM

    Read the weights of OBJECT from the bivalent STREAM, where weights
    mean the learnt parameters. There is currently no sanity checking
    of the data, which will most certainly change in the future together
    with the serialization format. Return OBJECT.

- [function] WRITE-STATE OBJECT STREAM

    Write the weights of OBJECT to the bivalent STREAM.
    Return OBJECT.

- [generic-function] READ-STATE* OBJECT STREAM CONTEXT

    This is the extension point for READ-STATE. It is
    guaranteed that primary READ-STATE\* methods will be called only once
    for each OBJECT (under EQ). CONTEXT is an opaque object and must be
    passed on to any recursive READ-STATE\* calls.

- [generic-function] WRITE-STATE* OBJECT STREAM CONTEXT

    This is the extension point for WRITE-STATE. It is
    guaranteed that primary WRITE-STATE\* methods will be called only
    once for each OBJECT (under EQ). CONTEXT is an opaque object and must
    be passed on to any recursive WRITE-STATE\* calls.

### Batch Processing

Processing instances one by one during training or prediction can
be slow. The models that support batch processing for greater
efficiency are said to be *striped*.

Typically, during or after creating a model, one sets MAX-N-STRIPES
on it to a positive integer. When a batch of instances is to be fed
to the model, it is first broken into subbatches of length at most
MAX-N-STRIPES. For each subbatch, SET-INPUT is called, and a :BEFORE
method takes care of setting N-STRIPES to the actual number of
instances in the subbatch. When MAX-N-STRIPES is set, internal data
structures may be resized, which is an expensive operation. Setting
N-STRIPES is a comparatively cheap operation, often implemented as
matrix reshaping.

Note that for models made of different parts (for example,
MGL-BP:BPN consists of MGL-BP:LUMPs), setting these
values affects the constituent parts, but one should never change
the number of stripes of the parts directly because that would lead
to an internal inconsistency in the model.

- [generic-function] MAX-N-STRIPES OBJECT

    The number of stripes with which the OBJECT is
    capable of dealing simultaneously.
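
    The batch-processing lifecycle described above can be sketched as
    follows (MODEL and DATASET stand for some striped model and
    dataset; the stripe count of 128 is an arbitrary illustration):
    
    ```common-lisp
    ;; Resizes internal data structures: expensive, done once.
    (setf (max-n-stripes model) 128)
    ;; DO-BATCHES-FOR-MODEL feeds subbatches of at most MAX-N-STRIPES
    ;; instances; SET-INPUT's :BEFORE method sets N-STRIPES to the
    ;; subbatch length, which is cheap.
    (do-batches-for-model (batch (dataset model))
      (set-input batch model))
    ```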

- [generic-function] SET-MAX-N-STRIPES MAX-N-STRIPES OBJECT

    Allocate the necessary stuff to allow for
    MAX-N-STRIPES number of stripes to be worked with simultaneously in
    OBJECT. This is called when MAX-N-STRIPES is SETF'ed.

- [generic-function] N-STRIPES OBJECT

    The number of stripes currently present in OBJECT.
    This is at most MAX-N-STRIPES.

- [generic-function] SET-N-STRIPES N-STRIPES OBJECT

    Set the number of stripes (out of MAX-N-STRIPES)
    that are in use in OBJECT. This is called when N-STRIPES is
    SETF'ed.

- [macro] WITH-STRIPES SPECS &BODY BODY

    Bind start and optionally end indices belonging to stripes in
    striped objects.
    
        (WITH-STRIPES ((STRIPE1 OBJECT1 START1 END1)
                       (STRIPE2 OBJECT2 START2)
                       ...)
          ...)
    
    This is how one is supposed to find the index range corresponding
    to the Nth input in an input lump of a bpn:
    
        (with-stripes ((n input-lump start end))
          (loop for i upfrom start below end
                do (setf (mref (nodes input-lump) i) 0d0)))
    
    Note how the input lump is striped, but the matrix into which we are
    indexing (NODES) is not known to WITH-STRIPES. In fact, for lumps
    the same stripe indices work with NODES and MGL-BP:DERIVATIVES.

- [generic-function] STRIPE-START STRIPE OBJECT

    Return the start index of STRIPE in some array or
    matrix of OBJECT.

- [generic-function] STRIPE-END STRIPE OBJECT

    Return the end index (exclusive) of STRIPE in some
    array or matrix of OBJECT.

- [generic-function] SET-INPUT INSTANCES MODEL

    Set INSTANCES as inputs in MODEL. INSTANCES is
    always a SEQUENCE of instances, even for models not capable of batch
    operation.
    It sets N-STRIPES to (LENGTH INSTANCES) in a :BEFORE
    method.

- [function] MAP-BATCHES-FOR-MODEL FN DATASET MODEL

    Call FN with batches of instances from DATASET suitable for MODEL.
    The number of instances in a batch is MAX-N-STRIPES of MODEL or less
    if there are no more instances left.

- [macro] DO-BATCHES-FOR-MODEL (BATCH (DATASET MODEL)) &BODY BODY

    Convenience macro over MAP-BATCHES-FOR-MODEL.

### Executors

- [generic-function] MAP-OVER-EXECUTORS FN INSTANCES PROTOTYPE-EXECUTOR

    Divide INSTANCES between executors that perform the
    same function as PROTOTYPE-EXECUTOR and call FN with the instances
    and the executor intended for those instances.
    
    Some objects conflate function and call: the forward pass of a
    MGL-BP:BPN computes output from inputs, so it is like a
    function, but it also doubles as a function call in the sense that
    the bpn (function) object changes state during the computation of
    the output. Hence not even the forward pass of a bpn is thread safe.
    There is also the restriction that all inputs must be of the same
    size.
    
    For example, if we have a function that builds a bpn for an input of
    a certain size, then we can create a factory that creates bpns for a
    particular call. The factory probably wants to keep the weights the
    same though.
    In @MGL-PARAMETERIZED-EXECUTOR-CACHE,
    MAKE-EXECUTOR-WITH-PARAMETERS is this factory.
    
    Parallelization of execution is another possibility
    MAP-OVER-EXECUTORS allows, but there is no prebuilt solution for it
    yet.
    
    The default implementation simply calls FN with INSTANCES and
    PROTOTYPE-EXECUTOR.

- [macro] DO-EXECUTORS (INSTANCES OBJECT) &BODY BODY

    Convenience macro on top of MAP-OVER-EXECUTORS.

#### Parameterized Executor Cache

- [class] PARAMETERIZED-EXECUTOR-CACHE-MIXIN

    Mix this into a model, implement
    INSTANCE-TO-EXECUTOR-PARAMETERS and MAKE-EXECUTOR-WITH-PARAMETERS,
    and DO-EXECUTORS will be able to build executors suitable for
    different instances. The canonical example is using a BPN to compute
    the means and covariances of a Gaussian process. Since each
    instance is made of a variable number of observations, the size of
    the input is not constant, thus we have a bpn (an executor) for each
    input dimension (the parameters).

- [generic-function] MAKE-EXECUTOR-WITH-PARAMETERS PARAMETERS CACHE

    Create a new executor for PARAMETERS. CACHE is a
    PARAMETERIZED-EXECUTOR-CACHE-MIXIN. In the BPN Gaussian process
    example, PARAMETERS would be a list of input dimensions.

- [generic-function] INSTANCE-TO-EXECUTOR-PARAMETERS INSTANCE CACHE

    Return the parameters for an executor able to
    handle INSTANCE. Called by MAP-OVER-EXECUTORS on CACHE (that's a
    PARAMETERIZED-EXECUTOR-CACHE-MIXIN). The returned parameters are
    keys in an EQUAL parameters->executor hash table.

## Monitoring

###### \[in package MGL-CORE\]
When training or applying a model, one often wants to track various
statistics.
For example, in the case of training a neural network
with cross-entropy loss, these statistics could be the average
cross-entropy loss itself, classification accuracy, or even the
entire confusion matrix and sparsity levels in hidden layers. Also,
there is the question of what to do with the measured values (log
and forget, add to some counter or a list).

So there may be several phases of operation that we want to keep an
eye on. Let's call these **events**. There can also be many fairly
independent things to do in response to an event. Let's call these
**monitors**. Some monitors are a composition of two operations: one
that extracts some measurements and another that aggregates those
measurements. Let's call these two **measurers** and **counters**,
respectively.

For example, consider training a backpropagation neural network. We
want to look at the state of the network just after the backward
pass. MGL-BP:BP-LEARNER has a MONITORS event hook corresponding to the moment after
backpropagating the gradients. Suppose we are interested in how the
training cost evolves:

    (push (make-instance 'monitor
                         :measurer (lambda (instances bpn)
                                     (declare (ignore instances))
                                     (mgl-bp:cost bpn))
                         :counter (make-instance 'basic-counter))
          (monitors learner))

During training, this monitor will track the cost of training
examples behind the scenes.
If we want to print and reset this
monitor periodically, we can put another monitor on
MGL-OPT:ITERATIVE-OPTIMIZER's MGL-OPT:ON-N-INSTANCES-CHANGED
accessor:

    (push (lambda (optimizer gradient-source n-instances)
            (declare (ignore optimizer))
            (when (zerop (mod n-instances 1000))
              (format t "n-instances: ~S~%" n-instances)
              (dolist (monitor (monitors gradient-source))
                (when (counter monitor)
                  (format t "~A~%" (counter monitor))
                  (reset-counter (counter monitor)))))
          (mgl-opt:on-n-instances-changed optimizer))

Note that the monitor we push can be anything as long as
APPLY-MONITOR is implemented on it with the appropriate signature.
Also note that the ZEROP + MOD logic is fragile, so you will likely
want to use MGL-OPT:MONITOR-OPTIMIZATION-PERIODICALLY instead of
doing the above.

So that's the general idea. Concrete events are documented where
they are signalled. Often there are task-specific utilities that
create a reasonable set of default monitors (see
@MGL-CLASSIFICATION-MONITOR).

- [function] APPLY-MONITORS MONITORS &REST ARGUMENTS

    Call APPLY-MONITOR on each monitor in MONITORS and ARGUMENTS. This
    is how an event is fired.

- [generic-function] APPLY-MONITOR MONITOR &REST ARGUMENTS

    Apply MONITOR to ARGUMENTS. This sounds fairly
    generic, because it is. MONITOR can be anything, even a simple
    function or symbol, in which case this is just CL:APPLY. See
    @MGL-MONITOR for more.

- [generic-function] COUNTER MONITOR

    Return an object representing the state of MONITOR,
    or NIL if it doesn't have any (say because it's a simple logging
    function). Most monitors have counters into which they accumulate
    results until they are printed and reset.
    See @MGL-COUNTER for
    more.

- [function] MONITOR-MODEL-RESULTS FN DATASET MODEL MONITORS

    Call FN with batches of instances from DATASET until it runs
    out (as in DO-BATCHES-FOR-MODEL). FN is supposed to apply MODEL to
    the batch and return some kind of result (for neural networks, the
    result is the model state itself). Apply MONITORS to each batch and
    the result returned by FN for that batch. Finally, return the list
    of counters of MONITORS.
    
    The purpose of this function is to collect various results and
    statistics (such as error measures) efficiently by applying the
    model only once, leaving extraction of quantities of interest from
    the model's results to MONITORS.
    
    See the model-specific versions of this function such as
    MGL-BP:MONITOR-BPN-RESULTS.

- [generic-function] MONITORS OBJECT

    Return the monitors associated with OBJECT. See the
    specialized methods for more documentation.

### Monitors

- [class] MONITOR

    A monitor that has another monitor called MEASURER
    embedded in it. When this monitor is applied, it applies the
    measurer and passes the returned values to ADD-TO-COUNTER called on
    its COUNTER slot. One may further specialize APPLY-MONITOR to change
    that.
    
    This class is useful when the same event monitor is applied
    repeatedly over a period and its results must be aggregated, such as
    when training statistics are being tracked or when predictions are
    being made. Note that the monitor must be compatible with the event
    it handles. That is, the embedded MEASURER must be prepared to take
    the arguments that are documented to come with the event.

- [reader] MEASURER MONITOR (:MEASURER)

    This must be a monitor itself, which only means
    that APPLY-MONITOR is defined on it (but see @MGL-MONITORING). The
    returned values are aggregated by COUNTER.
See
    @MGL-MEASURER for a library of measurers.

- [reader] COUNTER MONITOR (:COUNTER)

    The COUNTER of a monitor carries out the
    aggregation of results returned by MEASURER. See @MGL-COUNTER
    for a library of counters.

### Measurers

MEASURER is a part of MONITOR objects, an embedded monitor that
computes a specific quantity (e.g. classification accuracy) from the
arguments of the event it is applied to (e.g. the model results).
Measurers are often implemented by combining some kind of
model-specific extractor with a generic measurer function.

All generic measurer functions return their results as multiple
values matching the arguments of ADD-TO-COUNTER for a counter of a
certain type (see @MGL-COUNTER) so that they can easily be used in a
MONITOR:

    (multiple-value-call #'add-to-counter <some-counter>
                         <call-to-some-measurer>)

The counter class compatible with the measurer in this way is noted
for each function.

For a list of measurer functions see @MGL-CLASSIFICATION-MEASURER.

### Counters

- [generic-function] ADD-TO-COUNTER COUNTER &REST ARGS

    Add ARGS to COUNTER in some way. See specialized
    methods for type-specific documentation. The kinds of arguments to
    be supported are what the measurer functions (see @MGL-MEASURER)
    intended to be paired with the counter return as multiple values.

- [generic-function] COUNTER-VALUES COUNTER

    Return any number of values representing the state
    of COUNTER.
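    For instance, for the BASIC-COUNTER described below, which tracks a
    running numerator and denominator, a sketch of the protocol in
    action:

        (let ((counter (make-instance 'basic-counter)))
          (add-to-counter counter 3.0 2)
          (counter-values counter))
        => 1.5
           2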
See specialized methods for type-specific
    documentation.

- [generic-function] COUNTER-RAW-VALUES COUNTER

    Return any number of values representing the state
    of COUNTER in such a way that passing the returned values as
    arguments to ADD-TO-COUNTER on a fresh instance of the same type
    recreates the original state.

- [generic-function] RESET-COUNTER COUNTER

    Restore the state of COUNTER to what it was just after
    creation.

#### Attributes

- [class] ATTRIBUTED

    This is a utility class that all counters subclass.
    The ATTRIBUTES plist can hold basically anything. Currently the
    attributes are only used when printing, and they can be specified by
    the user. The monitor maker functions such as those in
    @MGL-CLASSIFICATION-MONITOR also add attributes of their own to the
    counters they create.

    With the :PREPEND-ATTRIBUTES initarg, one can easily add new
    attributes without clobbering those in the :INITFORM, (:TYPE
    "rmse") in this case.

        (princ (make-instance 'rmse-counter
                              :prepend-attributes '(:event "pred."
                                                    :dataset "test")))
        ;; pred. test rmse: 0.000e+0 (0)
        => #<RMSE-COUNTER pred. test rmse: 0.000e+0 (0)>

- [accessor] ATTRIBUTES ATTRIBUTED (:ATTRIBUTES = NIL)

    A plist of attribute keys and values.

- [method] NAME (ATTRIBUTED ATTRIBUTED)

    Return a string assembled from the values of the ATTRIBUTES of
    ATTRIBUTED. If there are multiple entries with the same key, then
    they are printed next to each other.

    Values may be padded according to an enclosing
    WITH-PADDED-ATTRIBUTE-PRINTING.

- [macro] WITH-PADDED-ATTRIBUTE-PRINTING (ATTRIBUTEDS) &BODY BODY

    Note the width of values for each attribute key, which is the number
    of characters in the value's PRINC-TO-STRING'ed representation.
In
    BODY, if attributes with the same key are printed, they are forced
    to be at least this wide. This allows for nice, table-like output:

        (let ((attributeds
                (list (make-instance 'basic-counter
                                     :attributes '(:a 1 :b 23 :c 456))
                      (make-instance 'basic-counter
                                     :attributes '(:a 123 :b 45 :c 6)))))
          (with-padded-attribute-printing (attributeds)
            (map nil (lambda (attributed)
                       (format t "~A~%" attributed))
                 attributeds)))
        ;; 1   23 456: 0.000e+0 (0)
        ;; 123 45 6  : 0.000e+0 (0)

- [function] LOG-PADDED ATTRIBUTEDS

    Log (see LOG-MSG) ATTRIBUTEDS non-escaped (as in PRINC or ~A) with
    the output being as table-like as possible.

#### Counter classes

In addition to the really basic ones here, also see
@MGL-CLASSIFICATION-COUNTER.

- [class] BASIC-COUNTER ATTRIBUTED

    A simple counter whose ADD-TO-COUNTER takes two
    additional parameters: increments to the internal sums called the
    NUMERATOR and the DENOMINATOR. COUNTER-VALUES returns two
    values:

    - NUMERATOR divided by DENOMINATOR (or 0 if DENOMINATOR is 0) and

    - DENOMINATOR

    Here is an example that computes the mean of 5 things received in
    two batches:

        (let ((counter (make-instance 'basic-counter)))
          (add-to-counter counter 6.5 3)
          (add-to-counter counter 3.5 2)
          counter)
        => #<BASIC-COUNTER 2.00000e+0 (5)>

- [class] RMSE-COUNTER BASIC-COUNTER

    A BASIC-COUNTER whose numerator accumulates
    the squares of some statistics.
It has the attribute :TYPE "rmse".
    COUNTER-VALUES returns the square root of what BASIC-COUNTER's
    COUNTER-VALUES would return.

        (let ((counter (make-instance 'rmse-counter)))
          (add-to-counter counter (+ (* 3 3) (* 4 4)) 2)
          counter)
        => #<RMSE-COUNTER rmse: 3.53553e+0 (2)>

- [class] CONCAT-COUNTER ATTRIBUTED

    A counter that simply concatenates
    sequences.

    ```common-lisp
    (let ((counter (make-instance 'concat-counter)))
      (add-to-counter counter '(1 2 3) #(4 5))
      (add-to-counter counter '(6 7))
      (counter-values counter))
    => (1 2 3 4 5 6 7)
    ```

- [reader] CONCATENATION-TYPE CONCAT-COUNTER (:CONCATENATION-TYPE = 'LIST)

    A type designator suitable as the RESULT-TYPE
    argument to CONCATENATE.

## Classification

###### \[in package MGL-CORE\]
To be able to measure classification-related quantities, we need to
define what the label of an instance is. Customization is possible
by implementing a method for a specific type of instance, but these
functions only ever appear as defaults that can be overridden.

- [generic-function] LABEL-INDEX INSTANCE

    Return the label of INSTANCE as a non-negative
    integer.

- [generic-function] LABEL-INDEX-DISTRIBUTION INSTANCE

    Return a one-dimensional array of probabilities
    representing the distribution of labels. The probability of the
    label with LABEL-INDEX `I` is the element at index `I` of the
    returned array.

The following two functions are basically the same as the previous
two, but in batch mode: they return a sequence of label indices or
distributions. These are called on results produced by models.
Implement these for a model and the monitor maker functions below
will automatically work.
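For example, if an instance is represented as a cons of an input and a
label index, the instance-level method might be specialized like
this (a sketch; the cons representation is just an assumption for
illustration):

```common-lisp
(defmethod label-index ((instance cons))
  ;; The label index is assumed to be stored in the cdr of the
  ;; instance.
  (cdr instance))
```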
See FIXDOC: for bpn and boltzmann.

- [generic-function] LABEL-INDICES RESULTS

    Return a sequence of label indices for RESULTS
    produced by some model for a batch of instances. This is akin to
    LABEL-INDEX.

- [generic-function] LABEL-INDEX-DISTRIBUTIONS RESULTS

    Return a sequence of label index distributions for
    RESULTS produced by some model for a batch of instances. This is
    akin to LABEL-INDEX-DISTRIBUTION.

### Classification Monitors

The following functions return a list of monitors. The monitors are
for events of signature (INSTANCES MODEL) such as those produced by
MONITOR-MODEL-RESULTS and its various model-specific variations.
They are model-agnostic functions, extensible to new classifier
types.

- [function] MAKE-CLASSIFICATION-ACCURACY-MONITORS MODEL &KEY OPERATION-MODE ATTRIBUTES (LABEL-INDEX-FN #'LABEL-INDEX)

    Return a list of MONITOR objects associated with
    CLASSIFICATION-ACCURACY-COUNTERs. LABEL-INDEX-FN is a function
    like LABEL-INDEX. See that function for more.

    Implemented in terms of MAKE-CLASSIFICATION-ACCURACY-MONITORS\*.

- [function] MAKE-CROSS-ENTROPY-MONITORS MODEL &KEY OPERATION-MODE ATTRIBUTES (LABEL-INDEX-DISTRIBUTION-FN #'LABEL-INDEX-DISTRIBUTION)

    Return a list of MONITOR objects associated with
    CROSS-ENTROPY-COUNTERs. LABEL-INDEX-DISTRIBUTION-FN is a
    function like LABEL-INDEX-DISTRIBUTION. See that function for more.

    Implemented in terms of MAKE-CROSS-ENTROPY-MONITORS\*.

- [function] MAKE-LABEL-MONITORS MODEL &KEY OPERATION-MODE ATTRIBUTES (LABEL-INDEX-FN #'LABEL-INDEX) (LABEL-INDEX-DISTRIBUTION-FN #'LABEL-INDEX-DISTRIBUTION)

    Return classification accuracy and cross-entropy monitors.
See
    MAKE-CLASSIFICATION-ACCURACY-MONITORS and
    MAKE-CROSS-ENTROPY-MONITORS for a description of the parameters.

The monitor makers above can be extended to support new classifier
types via the following generic functions.

- [generic-function] MAKE-CLASSIFICATION-ACCURACY-MONITORS* MODEL OPERATION-MODE LABEL-INDEX-FN ATTRIBUTES

    Identical to MAKE-CLASSIFICATION-ACCURACY-MONITORS
    bar the keyword arguments. Specialize this to add support for
    new model types. The default implementation also allows for some
    extensibility: if LABEL-INDICES is defined on MODEL, then it will be
    used to extract label indices from model results.

- [generic-function] MAKE-CROSS-ENTROPY-MONITORS* MODEL OPERATION-MODE LABEL-INDEX-DISTRIBUTION-FN ATTRIBUTES

    Identical to MAKE-CROSS-ENTROPY-MONITORS bar the
    keyword arguments. Specialize this to add support for new model
    types. The default implementation also allows for some
    extensibility: if LABEL-INDEX-DISTRIBUTIONS is defined on MODEL,
    then it will be used to extract label distributions from model
    results.

### Classification Measurers

The functions here compare some known good solution (also known as
*ground truth* or *target*) to a prediction or approximation and
return some measure of their \[dis\]similarity. They are
model-independent, hence one has to extract the ground truths and
predictions first. Rarely used directly, they are mostly hidden
behind @MGL-CLASSIFICATION-MONITOR.

- [function] MEASURE-CLASSIFICATION-ACCURACY TRUTHS PREDICTIONS &KEY (TEST #'EQL) TRUTH-KEY PREDICTION-KEY WEIGHT

    Return the number of correct classifications and, as the second
    value, the number of instances (equal to the length of TRUTHS in the
    non-weighted case). TRUTHS (keyed by TRUTH-KEY) is a sequence of
    opaque class labels compared with TEST to another sequence of
    class labels in PREDICTIONS (keyed by PREDICTION-KEY).
If WEIGHT
    is non-nil, then it is a function that returns the weight of an
    element of TRUTHS. Weighted cases add their weight to both
    counts (returned as the first and second values) instead of 1 as in
    the non-weighted case.

    Note how the returned values are suitable for MULTIPLE-VALUE-CALL
    with #'ADD-TO-COUNTER and a CLASSIFICATION-ACCURACY-COUNTER.

- [function] MEASURE-CROSS-ENTROPY TRUTHS PREDICTIONS &KEY TRUTH-KEY PREDICTION-KEY (MIN-PREDICTION-PR 1.0d-15)

    Return the sum of the cross-entropy between pairs of elements with
    the same index of TRUTHS and PREDICTIONS. TRUTH-KEY is a function
    that, when applied to an element of TRUTHS, returns a sequence
    representing some kind of discrete target distribution (P in the
    definition below). TRUTH-KEY may be NIL, which is equivalent to the
    IDENTITY function. PREDICTION-KEY is the same kind of key for
    PREDICTIONS, but the sequence it returns represents a distribution
    that approximates (Q below) the true one.

    The cross-entropy of the true and approximating distributions is
    defined as:

        cross-entropy(p,q) = - sum_i p(i) * log(q(i))

    of which this function returns the sum over the pairs of elements of
    TRUTHS and PREDICTIONS keyed by TRUTH-KEY and PREDICTION-KEY.

    Due to the logarithm, if q(i) is close to zero, we run into
    numerical problems. To prevent this, all q(i) that are less than
    MIN-PREDICTION-PR are treated as if they were MIN-PREDICTION-PR.

    The second value returned is the sum of p(i) over all TRUTHS and all
    `I`.
This is normally equal to `(LENGTH TRUTHS)`, since elements of
    TRUTHS represent a probability distribution, but this is not
    enforced, which allows the relative importance of elements to be
    controlled.

    The third value returned is a plist that maps each index occurring
    in the distribution sequences to a list of two elements:

        - sum_j p_j(i) * log(q_j(i))

    and

        sum_j p_j(i)

    where `J` indexes into TRUTHS and PREDICTIONS.

        (measure-cross-entropy '((0 1 0)) '((0.1 0.7 0.2)))
        => 0.35667497
           1
           (2 (0.0 0)
            1 (0.35667497 1)
            0 (0.0 0))

    Note how the returned values are suitable for MULTIPLE-VALUE-CALL
    with #'ADD-TO-COUNTER and a CROSS-ENTROPY-COUNTER.

- [function] MEASURE-ROC-AUC PREDICTIONS PRED &KEY (KEY #'IDENTITY) WEIGHT

    Return the area under the ROC curve for PREDICTIONS representing
    predictions for a binary classification problem. PRED is a predicate
    function for deciding whether a prediction belongs to the so-called
    positive class. KEY returns a number for each element, which is the
    predictor's idea of how likely that element is to belong to the
    class, although it's not necessarily a probability.

    If WEIGHT is NIL, then all elements of PREDICTIONS count as 1
    towards the unnormalized sum within AUC. Else WEIGHT must be a
    function like KEY, but it should return the importance (a positive
    real number) of elements.
If the weight of a prediction is 2, then
    it's as if there were another identical copy of that prediction in
    PREDICTIONS.

    The algorithm is based on algorithm 2 in the paper 'An introduction
    to ROC analysis' by Tom Fawcett.

    ROC AUC is equal to the probability of a randomly chosen positive
    element having a higher KEY (score) than a randomly chosen negative
    element. With ties in scores in mind, a more precise version is: AUC
    is the expectation of the above probability over all possible
    sequences sorted by scores.

- [function] MEASURE-CONFUSION TRUTHS PREDICTIONS &KEY (TEST #'EQL) TRUTH-KEY PREDICTION-KEY WEIGHT

    Create a CONFUSION-MATRIX from TRUTHS and PREDICTIONS.
    TRUTHS (keyed by TRUTH-KEY) is a sequence of class labels compared
    with TEST to another sequence of class labels in PREDICTIONS (keyed
    by PREDICTION-KEY). If WEIGHT is non-nil, then it is a function that
    returns the weight of an element of TRUTHS. Weighted cases add their
    weight to both counts (returned as the first and second values).

    Note how the returned confusion matrix can be added to another with
    ADD-TO-COUNTER.

### Classification Counters

- [class] CLASSIFICATION-ACCURACY-COUNTER BASIC-COUNTER

    A BASIC-COUNTER with "acc." as its :TYPE
    attribute and a PRINT-OBJECT method that prints percentages.

- [class] CROSS-ENTROPY-COUNTER BASIC-COUNTER

    A BASIC-COUNTER with "xent" as its :TYPE
    attribute.

#### Confusion Matrices

- [class] CONFUSION-MATRIX

    A confusion matrix keeps count of classification
    results.
The correct class is called the `target' and the output of the
    classifier is called the `prediction'.

- [function] MAKE-CONFUSION-MATRIX &KEY (TEST #'EQL)

    Classes are compared with TEST.

- [generic-function] SORT-CONFUSION-CLASSES MATRIX CLASSES

    Return a list of CLASSES sorted for presentation
    purposes.

- [generic-function] CONFUSION-CLASS-NAME MATRIX CLASS

    Name of CLASS for presentation purposes.

- [generic-function] CONFUSION-COUNT MATRIX TARGET PREDICTION

- [generic-function] MAP-CONFUSION-MATRIX FN MATRIX

    Call FN with `TARGET`, PREDICTION and
    COUNT parameters for each cell in the confusion matrix. Cells with a
    zero count may be omitted.

- [generic-function] CONFUSION-MATRIX-CLASSES MATRIX

    A list of all classes. The default is to collect
    classes from the counts. This can be overridden if, for instance,
    some classes are not present in the results.

- [function] CONFUSION-MATRIX-ACCURACY MATRIX &KEY FILTER

    Return the overall accuracy of the results in MATRIX. It's computed
    as the number of correctly classified cases (hits) divided by the
    number of cases. Return the number of hits and the number of cases
    as the second and third values. If the FILTER function is given,
    then call it with the target and the prediction of the cell.
Disregard cells
    for which FILTER returns NIL.

    Precision and recall can be easily computed by giving the right
    filter, although those are provided in separate convenience
    functions.

- [function] CONFUSION-MATRIX-PRECISION MATRIX PREDICTION

    Return the accuracy over the cases when the classifier said
    PREDICTION.

- [function] CONFUSION-MATRIX-RECALL MATRIX TARGET

    Return the accuracy over the cases when the correct class is
    TARGET.

- [function] ADD-CONFUSION-MATRIX MATRIX RESULT-MATRIX

    Add MATRIX into RESULT-MATRIX.

## Features

###### \[in package MGL-CORE\]
### Feature Selection

The following *scoring functions* all return an EQUAL hash table
that maps features to scores.

- [function] COUNT-FEATURES DOCUMENTS MAPPER &KEY (KEY #'IDENTITY)

    Return scored features as an EQUAL hash table whose keys are
    features of DOCUMENTS and whose values are counts of occurrences of
    those features. MAPPER takes a function and a document and calls the
    function with the features of the document.

    ```common-lisp
    (sort (alexandria:hash-table-alist
           (count-features '(("hello" "world")
                             ("this" "is" "our" "world"))
                           (lambda (fn document)
                             (map nil fn document))))
          #'string< :key #'car)
    => (("hello" . 1) ("is" . 1) ("our" . 1) ("this" . 1) ("world" .
2))
    ```

- [function] FEATURE-LLRS DOCUMENTS MAPPER CLASS-FN &KEY (CLASSES (ALL-DOCUMENT-CLASSES DOCUMENTS CLASS-FN))

    Return scored features as an EQUAL hash table whose keys are
    features of DOCUMENTS and whose values are their log-likelihood
    ratios. MAPPER takes a function and a document and calls the
    function with the features of the document.

    ```common-lisp
    (sort (alexandria:hash-table-alist
           (feature-llrs '((:a "hello" "world")
                           (:b "this" "is" "our" "world"))
                         (lambda (fn document)
                           (map nil fn (rest document)))
                         #'first))
          #'string< :key #'car)
    => (("hello" . 2.6032386) ("is" . 2.6032386) ("our" . 2.6032386)
        ("this" . 2.6032386) ("world" . 4.8428774e-8))
    ```

- [function] FEATURE-DISAMBIGUITIES DOCUMENTS MAPPER CLASS-FN &KEY (CLASSES (ALL-DOCUMENT-CLASSES DOCUMENTS CLASS-FN))

    Return scored features as an EQUAL hash table whose keys are
    features of DOCUMENTS and whose values are their *disambiguities*.
    MAPPER takes a function and a document and calls the function with
    the features of the document.

    From the paper 'Using Ambiguity Measure Feature Selection Algorithm
    for Support Vector Machine Classifier'.

### Feature Encoding

Features can rarely be fed directly to algorithms as is; they need
to be transformed in some way. Suppose we have a simple language
model that takes a single word as input and predicts the next word.
However, both the input and the output are to be encoded as float
vectors of length 1000. What we do is find the top 1000 words by some
measure (see @MGL-FEATURE-SELECTION) and associate these words with
the integers in \[0..999\] (this is `ENCODE`ing).
By using, for
example, [one-hot](http://en.wikipedia.org/wiki/One-hot) encoding, we
translate a word into a float vector when passing in the input. When
the model outputs the probability distribution of the next word, we
find the index of the maximum and the word associated with it (this
is `DECODE`ing).

- [generic-function] ENCODE ENCODER DECODED

    Encode DECODED with ENCODER. This interface is
    generic enough to be almost meaningless. See ENCODER/DECODER for a
    simple example and MGL-NLP:BAG-OF-WORDS-ENCODER for a slightly more
    involved one.

    If ENCODER is a function designator, then it's simply `FUNCALL`ed
    with DECODED.

- [generic-function] DECODE DECODER ENCODED

    Decode ENCODED with DECODER. For a DECODER /
    ENCODER pair, `(DECODE DECODER (ENCODE ENCODER OBJECT))` must be
    equal in some sense to `OBJECT`.

    If DECODER is a function designator, then it's simply `FUNCALL`ed
    with ENCODED.

- [class] ENCODER/DECODER

    Implements O(1) ENCODE and DECODE by having an
    internal decoded-to-encoded and an encoded-to-decoded EQUAL hash
    table. ENCODER/DECODER objects can be saved and loaded (see
    @MGL-PERSISTENCE) as long as the elements in the hash tables have
    read/write consistency.

    ```common-lisp
    (let ((indexer
            (make-indexer
             (alexandria:alist-hash-table '(("I" . 3) ("me" . 2) ("mine" .
1)))
             2)))
      (values (encode indexer "I")
              (encode indexer "me")
              (encode indexer "mine")
              (decode indexer 0)
              (decode indexer 1)
              (decode indexer 2)))
    => 0
    => 1
    => NIL
    => "I"
    => "me"
    => NIL
    ```

- [function] MAKE-INDEXER SCORED-FEATURES N &KEY (START 0) (CLASS 'ENCODER/DECODER)

    Take the top N features from SCORED-FEATURES (see
    @MGL-FEATURE-SELECTION) and assign indices to them starting from
    START. Return an ENCODER/DECODER (or another CLASS) that converts
    between objects and indices.

Also see MGL-NLP::@MGL-NLP-BAG-OF-WORDS.

## Gradient Based Optimization

###### \[in package MGL-OPT\]
We have a real-valued, differentiable function F and the task is to
find the parameters that minimize its value. Optimization starts
from a single point in the parameter space of F, and this single
point is updated iteratively based on the gradient and value of F at
or around the current point.

Note that while the stated problem is that of global optimization,
for non-convex functions, most algorithms will tend to converge to a
local optimum.

Currently, there are two optimization algorithms:
MGL-GD::@MGL-GD (with several variants) and MGL-CG::@MGL-CG, both of
which are first-order methods (they do not need second-order
gradients), but more can be added with the @MGL-OPT-EXTENSION-API.

- [function] MINIMIZE OPTIMIZER GRADIENT-SOURCE &KEY (WEIGHTS (LIST-SEGMENTS GRADIENT-SOURCE)) (DATASET \*INFINITELY-EMPTY-DATASET\*)

    Minimize the value of the real-valued function represented by
    GRADIENT-SOURCE by updating some of its parameters in WEIGHTS (a MAT
    or a sequence of MATs). Return WEIGHTS. DATASET (see
    MGL-DATASET::@MGL-DATASET) is a set of unoptimized parameters of the
    same function.
For example, WEIGHTS may be the weights of a neural
    network while DATASET is the training set consisting of inputs
    suitable for SET-INPUT. The default
    DATASET, \*INFINITELY-EMPTY-DATASET\*, is suitable for when all
    parameters are optimized, so there is nothing left to come from the
    environment.

    Optimization terminates if DATASET is a sampler and it runs out, or
    when some other condition is met (see TERMINATION, for example). If
    DATASET is a SEQUENCE, then it is reused over and over again.

    Examples for various optimizers are provided in MGL-GD::@MGL-GD and
    MGL-CG::@MGL-CG.

### Iterative Optimizer

- [class] ITERATIVE-OPTIMIZER

    An abstract base class of MGL-GD::@MGL-GD and
    MGL-CG::@MGL-CG based optimizers that iterate over instances until a
    termination condition is met.

- [reader] N-INSTANCES ITERATIVE-OPTIMIZER (:N-INSTANCES = 0)

    The number of instances this optimizer has seen so
    far. Incremented automatically during optimization.

- [accessor] TERMINATION ITERATIVE-OPTIMIZER (:TERMINATION = NIL)

    If a number, it's the number of instances to train
    on in the sense of N-INSTANCES. If N-INSTANCES is equal to or
    greater than this value, optimization stops. If TERMINATION is NIL,
    then optimization will continue. If it is T, then optimization will
    stop. If it is a function of no arguments, then its return value
    is processed as if it was returned by TERMINATION.

- [accessor] ON-OPTIMIZATION-STARTED ITERATIVE-OPTIMIZER (:ON-OPTIMIZATION-STARTED = NIL)

    An event hook with parameters `(OPTIMIZER
    GRADIENT-SOURCE N-INSTANCES)`. Called after initializations are
    performed (INITIALIZE-OPTIMIZER*, INITIALIZE-GRADIENT-SOURCE*) but
    before optimization is started.

- [accessor] ON-OPTIMIZATION-FINISHED ITERATIVE-OPTIMIZER (:ON-OPTIMIZATION-FINISHED = NIL)

    An event hook with parameters `(OPTIMIZER
    GRADIENT-SOURCE N-INSTANCES)`.
Called when optimization has
    finished.

- [accessor] ON-N-INSTANCES-CHANGED ITERATIVE-OPTIMIZER (:ON-N-INSTANCES-CHANGED = NIL)

    An event hook with parameters `(OPTIMIZER
    GRADIENT-SOURCE N-INSTANCES)`. Called when optimization of a batch
    of instances is done and N-INSTANCES is incremented.

Now let's discuss a few handy utilities.

- [function] MONITOR-OPTIMIZATION-PERIODICALLY OPTIMIZER PERIODIC-FNS

    For each periodic function in the list of PERIODIC-FNS, add a
    monitor to OPTIMIZER's ON-OPTIMIZATION-STARTED,
    ON-OPTIMIZATION-FINISHED and ON-N-INSTANCES-CHANGED hooks. The
    monitors are simple functions that just call each periodic function
    with the event parameters (OPTIMIZER GRADIENT-SOURCE N-INSTANCES).
    Return OPTIMIZER.

    To log and reset the monitors of the gradient source after every
    1000 instances seen by OPTIMIZER:

        (monitor-optimization-periodically optimizer
                                           '((:fn log-my-test-error
                                              :period 2000)
                                             (:fn reset-optimization-monitors
                                              :period 1000
                                              :last-eval 0)))

    Note that it's allowed to pass just the initargs for a PERIODIC-FN
    instead of a PERIODIC-FN object itself. The :LAST-EVAL 0 bit
    prevents RESET-OPTIMIZATION-MONITORS from being called at the start
    of the optimization when the monitors are empty anyway.

- [generic-function] RESET-OPTIMIZATION-MONITORS OPTIMIZER GRADIENT-SOURCE

    Report the state of the MONITORS of
    OPTIMIZER and GRADIENT-SOURCE and reset their counters.
See
    MONITOR-OPTIMIZATION-PERIODICALLY for an example of how this is
    used.

- [method] RESET-OPTIMIZATION-MONITORS (OPTIMIZER ITERATIVE-OPTIMIZER) GRADIENT-SOURCE

    Log the counters of the monitors of OPTIMIZER and GRADIENT-SOURCE
    and reset them.

- [generic-function] REPORT-OPTIMIZATION-PARAMETERS OPTIMIZER GRADIENT-SOURCE

    A utility that's often called at the start of
    optimization (from ON-OPTIMIZATION-STARTED). The default
    implementation logs the description of GRADIENT-SOURCE (as in
    DESCRIBE) and OPTIMIZER and calls LOG-MAT-ROOM.

### Cost Function

The function being minimized is often called the *cost* or the
*loss* function.

- [generic-function] COST MODEL

    Return the value of the cost function being
    minimized. Calling this only makes sense in the context of an
    ongoing optimization (see MINIMIZE). The cost is that of a batch of
    instances.

- [function] MAKE-COST-MONITORS MODEL &KEY OPERATION-MODE ATTRIBUTES

    Return a list of MONITOR objects, each associated with one
    BASIC-COUNTER with attribute :TYPE "cost". Implemented in terms of
    MAKE-COST-MONITORS\*.

- [generic-function] MAKE-COST-MONITORS* MODEL OPERATION-MODE ATTRIBUTES

    Identical to MAKE-COST-MONITORS bar the keyword
    arguments. Specialize this to add support for new model types.

### Gradient Descent

###### \[in package MGL-GD\]
Gradient descent is a first-order optimization algorithm. Relying
completely on first derivatives, it does not even evaluate the
function to be minimized.
Let's see how to minimize a numerical lisp
function with respect to some of its parameters.

```common-lisp
(cl:defpackage :mgl-example-sgd
  (:use #:common-lisp #:mgl))

(in-package :mgl-example-sgd)

;;; Create an object representing the sine function.
(defparameter *diff-fn-1*
  (make-instance 'mgl-diffun:diffun
                 :fn #'sin
                 ;; We are going to optimize its only parameter.
                 :weight-indices '(0)))

;;; Minimize SIN. Note that there is no dataset involved because all
;;; parameters are being optimized.
(minimize (make-instance 'sgd-optimizer :termination 1000)
          *diff-fn-1*
          :weights (make-mat 1))
;;; => A MAT with a single value of about -pi/2.

;;; Create a differentiable function for f(x,y)=(x-y)^2. X is a
;;; parameter whose values come from the DATASET argument passed to
;;; MINIMIZE. Y is a parameter to be optimized (a 'weight').
(defparameter *diff-fn-2*
  (make-instance 'mgl-diffun:diffun
                 :fn (lambda (x y)
                       (expt (- x y) 2))
                 :parameter-indices '(0)
                 :weight-indices '(1)))

;;; Find the Y that minimizes the distance from the instances
;;; generated by the sampler.
(minimize (make-instance 'sgd-optimizer :batch-size 10)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset (make-instance 'function-sampler
                                  :generator (lambda ()
                                               (list (+ 10
                                                        (gaussian-random-1))))
                                  :max-n-samples 1000))
;;; => A MAT with a single value of about 10, the expected value of
;;; the instances in the dataset.

;;; The dataset can be a SEQUENCE in which case we'd better set
;;; TERMINATION else optimization would never finish.
(minimize (make-instance 'sgd-optimizer :termination 1000)
          *diff-fn-2*
          :weights (make-mat 1)
          :dataset '((0) (1) (2) (3) (4) (5)))
;;; => A MAT with a single value of about 2.5.
```

We are going to see a number of accessors for optimizer parameters.
In general, it's allowed to SETF real slot accessors (as opposed to
readers and writers) at any time during optimization, and it's also
possible to define a method on an optimizer subclass that computes
the value in any way. For example, to decay the learning rate on a
per mini-batch basis:

```common-lisp
(defmethod learning-rate ((optimizer my-sgd-optimizer))
  (* (slot-value optimizer 'learning-rate)
     (expt 0.998
           (/ (n-instances optimizer) 60000))))
```

#### Batch Based Optimizers

First let's see everything common to all batch based optimizers,
then discuss @MGL-GD-SGD-OPTIMIZER, @MGL-GD-ADAM-OPTIMIZER and
@MGL-GD-NORMALIZED-BATCH-GD-OPTIMIZER. All batch based optimizers
are `ITERATIVE-OPTIMIZER`s, so see
MGL-OPT::@MGL-OPT-ITERATIVE-OPTIMIZER too.

- [class] BATCH-GD-OPTIMIZER

    Another abstract base class for gradient based
    optimizers that update all weights simultaneously after chewing
    through BATCH-SIZE inputs. See subclasses SGD-OPTIMIZER,
    ADAM-OPTIMIZER and NORMALIZED-BATCH-GD-OPTIMIZER.

    PER-WEIGHT-BATCH-GD-OPTIMIZER may be a better choice when some
    weights can go unused, for instance due to missing input values.

- [accessor] BATCH-SIZE GD-OPTIMIZER (:BATCH-SIZE = 1)

    After having gone through BATCH-SIZE number of
    inputs, weights are updated. With BATCH-SIZE 1, one gets
    Stochastic Gradient Descent. With BATCH-SIZE equal to the number
    of instances in the dataset, one gets standard, 'batch' gradient
    descent. With BATCH-SIZE between these two extremes, one gets the
    most practical 'mini-batch' compromise.

- [accessor] LEARNING-RATE GD-OPTIMIZER (:LEARNING-RATE = 0.1)

    This is the step size along the gradient.
Decrease\n    it if optimization diverges, increase it if it doesn't make\n    progress.\n\n- [accessor] MOMENTUM GD-OPTIMIZER (:MOMENTUM = 0)\n\n    A value in the \[0, 1) interval. MOMENTUM times the\n    previous weight change is added to the gradient. 0 means no\n    momentum.\n\n- [reader] MOMENTUM-TYPE GD-OPTIMIZER (:MOMENTUM-TYPE = :NORMAL)\n\n    One of :NORMAL, :NESTEROV or :NONE. For pure\n    optimization, Nesterov's momentum may be better, but it may also\n    increase the chances of overfitting. Using :NONE is equivalent to 0\n    momentum, but it also uses less memory. Note that with :NONE,\n    MOMENTUM is ignored even if it is non-zero.\n\n- [accessor] WEIGHT-DECAY GD-OPTIMIZER (:WEIGHT-DECAY = 0)\n\n    An L2 penalty. It discourages large weights, much\n    like a zero mean Gaussian prior. WEIGHT-DECAY \* WEIGHT is added to\n    the gradient to penalize large weights. It's as if the function\n    whose minimum is sought had WEIGHT-DECAY\*sum\_i\{0.5 \* WEIGHT\_i^2\}\n    added to it.\n\n- [accessor] WEIGHT-PENALTY GD-OPTIMIZER (:WEIGHT-PENALTY = 0)\n\n    An L1 penalty. It encourages sparsity.\n    SIGN(WEIGHT) \* WEIGHT-PENALTY is added to the gradient, pushing the\n    weight towards zero. It's as if the function whose\n    minimum is sought had WEIGHT-PENALTY\*sum\_i\{abs(WEIGHT\_i)\} added to\n    it. Putting it on feature biases constitutes a sparsity constraint\n    on the features.\n\n- [reader] USE-SEGMENT-DERIVATIVES-P GD-OPTIMIZER (:USE-SEGMENT-DERIVATIVES-P = NIL)\n\n    Save memory if both the gradient source (the model\n    being optimized) and the optimizer support this feature. It works\n    like this: the accumulator into which the gradient source is asked\n    to place the derivatives of a segment will be SEGMENT-DERIVATIVES\n    of the segment. 
This allows the optimizer not to allocate an\n    accumulator matrix into which the derivatives are summed.\n\n- [accessor] AFTER-UPDATE-HOOK GD-OPTIMIZER (:AFTER-UPDATE-HOOK = NIL)\n\n    A list of functions with no arguments called after\n    each weight update.\n\n- [accessor] BEFORE-UPDATE-HOOK BATCH-GD-OPTIMIZER (:BEFORE-UPDATE-HOOK = NIL)\n\n    A list of functions of no parameters. Each\n    function is called just before a weight update takes place (after\n    accumulated gradients have been divided by the length of the batch).\n    Convenient to hang some additional gradient accumulating code\n    on.\n\n##### SGD Optimizer\n\n- [class] SGD-OPTIMIZER BATCH-GD-OPTIMIZER\n\n    With BATCH-SIZE 1 this is Stochastic Gradient\n    Descent. With higher batch sizes, one gets mini-batch and Batch\n    Gradient Descent.\n    \n    Assuming that ACCUMULATOR has the sum of gradients for a mini-batch,\n    the weight update looks like this:\n    \n    $$\n    \Delta\_w^\{t+1\} = momentum \* \Delta\_w^t\n      + \frac\{accumulator\}\{batchsize\}\n      + l\_2 w + l\_1 sign(w)\n    $$\n    \n    $$\n    w^\{t+1\} = w^\{t\} - learningrate \* \Delta\_w,\n    $$\n    \n    which is the same as the more traditional formulation:\n    \n    $$\n    \Delta\_w^\{t+1\} = momentum \* \Delta\_w^\{t\}\n      + learningrate \* \left(\frac\{\frac\{df\}\{dw\}\}\{batchsize\}\n                           + l\_2 w + l\_1 sign(w)\right)\n    $$\n    \n    $$\n    w^\{t+1\} = w^\{t\} - \Delta\_w,\n    $$\n    \n    but the former works better when batch size, momentum or learning\n    rate change during the course of optimization. 
The above is with\n    normal momentum; Nesterov's momentum (see MOMENTUM-TYPE) is\n    also available.\n    \n    See @MGL-GD-BATCH-GD-OPTIMIZER for the description of the various\n    options common to all batch based optimizers.\n\n##### Adam Optimizer\n\n- [class] ADAM-OPTIMIZER BATCH-GD-OPTIMIZER\n\n    Adam is a first-order stochastic gradient descent\n    optimizer. It maintains internal estimates of the mean and raw\n    variance of each derivative as exponential moving averages. The step\n    it takes is basically `M/(sqrt(V)+E)` where `M` is the estimated\n    mean, `V` is the estimated variance, and `E` is a small adjustment\n    factor to prevent the gradient from blowing up. See version 5 of the\n    [paper](http://arxiv.org/abs/1412.6980) for more.\n    \n    Note that using momentum is not supported with Adam. In fact, an\n    error is signalled if MOMENTUM-TYPE is not :NONE.\n    \n    See @MGL-GD-BATCH-GD-OPTIMIZER for the description of the various\n    options common to all batch based optimizers.\n\n- [accessor] LEARNING-RATE ADAM-OPTIMIZER (= 2.0e-4)\n\n    Same thing as LEARNING-RATE but with the default suggested by the Adam paper.\n\n- [accessor] MEAN-DECAY ADAM-OPTIMIZER (:MEAN-DECAY = 0.9)\n\n    A number between 0 and 1 that determines how fast\n    the estimated mean of derivatives is updated. 0 basically gives\n    you RMSPROP (if VARIANCE-DECAY is not too large) or AdaGrad (if\n    VARIANCE-DECAY is close to 1 and the learning rate is annealed).\n    This is $\beta\_1$ in the paper.\n\n- [accessor] MEAN-DECAY-DECAY ADAM-OPTIMIZER (:MEAN-DECAY-DECAY = (- 1 1.0d-7))\n\n    A value that should be close to 1. MEAN-DECAY is\n    multiplied by this value after each update. This is $\lambda$ in\n    the paper.\n\n- [accessor] VARIANCE-DECAY ADAM-OPTIMIZER (:VARIANCE-DECAY = 0.999)\n\n    A number between 0 and 1 that determines how fast\n    the estimated variance of derivatives is updated. 
This is\n    $\beta\_2$ in the paper.\n\n- [accessor] VARIANCE-ADJUSTMENT ADAM-OPTIMIZER (:VARIANCE-ADJUSTMENT = 1.0d-7)\n\n    Within the bowels of Adam, the estimated mean is\n    divided by the square root of the estimated variance (per weight),\n    which can lead to numerical problems if the denominator is near\n    zero. To avoid this, VARIANCE-ADJUSTMENT, which should be a small\n    positive number, is added to the denominator. This is `epsilon` in\n    the paper.\n\n##### Normalized Batch Optimizer\n\n- [class] NORMALIZED-BATCH-GD-OPTIMIZER BATCH-GD-OPTIMIZER\n\n    Like BATCH-GD-OPTIMIZER but keeps count of how many\n    times each weight was used in the batch and divides the accumulated\n    gradient by this count instead of dividing by N-INSTANCES-IN-BATCH.\n    This only makes a difference if there are missing values in the\n    learner that's being trained. The main feature that distinguishes\n    this class from PER-WEIGHT-BATCH-GD-OPTIMIZER is that batches end at\n    the same time for all weights.\n\n- [accessor] N-WEIGHT-USES-IN-BATCH NORMALIZED-BATCH-GD-OPTIMIZER\n\n    Number of uses of the weight in its current batch.\n\n#### Segmented GD Optimizer\n\n- [class] SEGMENTED-GD-OPTIMIZER\n\n    An optimizer that delegates training of segments to\n    other optimizers. Useful to delegate training of different segments\n    to different optimizers (capable of working with segmentables) or\n    simply to not train all segments.\n\n- [reader] SEGMENTER SEGMENTED-GD-OPTIMIZER (:SEGMENTER)\n\n    When this optimizer is initialized, it loops over\n    the segments of the learner with MAP-SEGMENTS. SEGMENTER is a\n    function that is called with each segment and returns an optimizer\n    or NIL. 
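\n    \n    For example, the following sketch maps every segment to a single\n    shared SGD-OPTIMIZER (the learning rate here is illustrative, not a\n    recommendation):\n    \n    ```commonlisp\n    (make-instance 'segmented-gd-optimizer\n                   ;; CONSTANTLY makes the segmenter return the same\n                   ;; optimizer for every segment.\n                   :segmenter (constantly\n                               (make-instance 'sgd-optimizer\n                                              :learning-rate 0.01)))\n    ```\n    \n    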
Several segments may be mapped to the same optimizer.\n    After the segment-\u003eoptimizer mappings are collected, each\n    optimizer is initialized by INITIALIZE-OPTIMIZER with the list of\n    segments mapped to it.\n\n- [reader] SEGMENTS SEGMENTED-GD-OPTIMIZER\n\nSEGMENTED-GD-OPTIMIZER inherits from `ITERATIVE-OPTIMIZER`, so see\nMGL-OPT::@MGL-OPT-ITERATIVE-OPTIMIZER too.\n\n#### Per-weight Optimization\n\n- [class] PER-WEIGHT-BATCH-GD-OPTIMIZER\n\n    This is much like @MGL-GD-BATCH-GD-OPTIMIZER but it\n    is more clever about when to update weights. Basically every weight\n    has its own batch independent from the batches of others. This has\n    desirable properties. One can, for example, put two neural networks\n    together without adding any connections between them, and the\n    learning will produce results equivalent to the separated case.\n    Also, adding inputs with only missing values does not change\n    anything.\n    \n    Due to its very non-batch nature, there is no CUDA implementation of\n    this optimizer.\n\n- [accessor] N-WEIGHT-USES-IN-BATCH PER-WEIGHT-BATCH-GD-OPTIMIZER\n\n    Number of uses of the weight in its current batch.\n\n#### Utilities\n\n- [function] CLIP-L2-NORM MATS L2-UPPER-BOUND \u0026KEY CALLBACK\n\n    Scale MATS so that their $L\_2$ norm does not exceed L2-UPPER-BOUND.\n    \n    Compute the norm of MATS as if they were a single vector. If the\n    norm is greater than L2-UPPER-BOUND, then scale each matrix\n    destructively by L2-UPPER-BOUND divided by the norm and, if\n    CALLBACK is non-NIL, call it with the scaling factor.\n\n- [function] ARRANGE-FOR-CLIPPING-GRADIENTS BATCH-GD-OPTIMIZER L2-UPPER-BOUND \u0026KEY CALLBACK\n\n    Make it so that the norm of the batch normalized gradients\n    accumulated by BATCH-GD-OPTIMIZER is clipped to L2-UPPER-BOUND\n    before every update. 
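\n    \n    For example, to cap the L2 norm of the batch normalized gradients\n    at 5 (a sketch using only the functions documented here; the batch\n    size and the bound are illustrative):\n    \n    ```commonlisp\n    (let ((optimizer (make-instance 'sgd-optimizer :batch-size 100)))\n      ;; Whenever the norm of the accumulated gradients would exceed 5,\n      ;; they are scaled down and CALLBACK is called with the scaling\n      ;; factor.\n      (arrange-for-clipping-gradients\n       optimizer 5\n       :callback (lambda (scale)\n                   (format t \"Gradients clipped by factor ~,4F~%\" scale)))\n      optimizer)\n    ```\n    \n    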
See CLIP-L2-NORM.\n\n### Conjugate Gradient\n\n###### \[in package MGL-CG\]\nConjugate gradient is a first-order optimization algorithm. It's\nmore advanced than gradient descent as it does line searches, which\nunfortunately also make it unsuitable for non-deterministic\nfunctions. Let's see how to minimize a numerical lisp function with\nrespect to some of its parameters.\n\n```commonlisp\n;;; Create an object representing the sine function.\n(defparameter *diff-fn-1*\n  (make-instance 'mgl-diffun:diffun\n                 :fn #'sin\n                 ;; We are going to optimize its only parameter.\n                 :weight-indices '(0)))\n\n;;; Minimize SIN. Note that there is no dataset involved because all\n;;; parameters are being optimized.\n(minimize (make-instance 'cg-optimizer\n                         :batch-size 1\n                         :termination 1)\n          *diff-fn-1*\n          :weights (make-mat 1))\n;;; =\u003e A MAT with a single value of about -pi/2.\n\n;;; Create a differentiable function for f(x,y)=(x-y)^2. X is a\n;;; parameter whose values come from the DATASET argument passed to\n;;; MINIMIZE. 
Y is a parameter to be optimized (a 'weight').\n(defparameter *diff-fn-2*\n  (make-instance 'mgl-diffun:diffun\n                 :fn (lambda (x y)\n                       (expt (- x y) 2))\n                 :parameter-indices '(0)\n                 :weight-indices '(1)))\n\n;;; Find the Y that minimizes the distance from the instances\n;;; generated by the sampler.\n(minimize (make-instance 'cg-optimizer :batch-size 10)\n          *diff-fn-2*\n          :weights (make-mat 1)\n          :dataset (make-instance 'function-sampler\n                                  :generator (lambda ()\n                                               (list (+ 10\n                                                        (gaussian-random-1))))\n                                  :max-n-samples 1000))\n;;; =\u003e A MAT with a single value of about 10, the expected value of\n;;; the instances in the dataset.\n\n;;; The dataset can be a SEQUENCE in which case we'd better set\n;;; TERMINATION else optimization would never finish. Note how a\n;;; single epoch suffices.\n(minimize (make-instance 'cg-optimizer :termination 6)\n          *diff-fn-2*\n          :weights (make-mat 1)\n          :dataset '((0) (1) (2) (3) (4) (5)))\n;;; =\u003e A MAT with a single value of about 2.5.\n```\n\n\n- [function] CG FN W \u0026KEY (MAX-N-LINE-SEARCHES \\*DEFAULT-MAX-N-LINE-SEARCHES\\*) (MAX-N-EVALUATIONS-PER-LINE-SEARCH \\*DEFAULT-MAX-N-EVALUATIONS-PER-LINE-SEARCH\\*) (MAX-N-EVALUATIONS \\*DEFAULT-MAX-N-EVALUATIONS\\*) (SIG \\*DEFAULT-SIG\\*) (RHO \\*DEFAULT-RHO\\*) (INT \\*DEFAULT-INT\\*) (EXT \\*DEFAULT-EXT\\*) (RATIO \\*DEFAULT-RATIO\\*) SPARE-VECTORS\n\n    CG-OPTIMIZER passes each batch of data to this function with its\n    CG-ARGS passed on.\n    \n    Minimize a differentiable multivariate function with conjugate\n    gradient. 
The Polak-Ribiere flavour of conjugate gradients is used\n    to compute search directions, and a line search using quadratic and\n    cubic polynomial approximations and the Wolfe-Powell stopping\n    criteria is used together with the slope ratio method for guessing\n    initial step sizes. Additionally, a bunch of checks are made to make\n    sure that exploration is taking place and that extrapolation will\n    not be unboundedly large.\n    \n    FN is a function of two parameters: WEIGHTS and DERIVATIVES. WEIGHTS\n    is a MAT of the same size as W where the search starts from.\n    DERIVATIVES is also a MAT of that size and it is where FN shall\n    place the partial derivatives. FN returns the value of the function\n    that is being minimized.\n    \n    CG performs a number of line searches and invokes FN at each step. A\n    line search invokes FN at most MAX-N-EVALUATIONS-PER-LINE-SEARCH\n    number of times and can succeed in improving the minimum by the\n    sufficient margin or it can fail. Note that even a failed line\n    search may improve the minimum and hence change the weights; it's just\n    that the improvement was deemed too small. 
CG stops when any of the following happens:\n    \n    - two line searches fail in a row\n    \n    - MAX-N-LINE-SEARCHES is reached\n    \n    - MAX-N-EVALUATIONS is reached\n    \n    CG returns a MAT that contains the best weights, the minimum, the\n    number of line searches performed, the number of successful line\n    searches and the number of evaluations.\n    \n    When using MAX-N-EVALUATIONS, remember that there is an extra\n    evaluation of FN before the first line search.\n    \n    SPARE-VECTORS is a list of preallocated MATs of the same size as W.\n    Passing 6 of them covers the current need of the algorithm and it\n    will not cons up vectors of size W at all.\n    \n    NOTE: If the function terminates within a few iterations, it could\n    be an indication that the function values and derivatives are not\n    consistent (i.e. there may be a bug in the implementation of\n    FN).\n    \n    SIG and RHO are the constants controlling the Wolfe-Powell\n    conditions. SIG is the maximum allowed absolute ratio between\n    previous and new slopes (derivatives in the search direction), thus\n    setting SIG to low (positive) values forces higher precision in the\n    line-searches. RHO is the minimum allowed fraction of the expected\n    improvement (estimated from the slope at the initial point in the\n    line search). Constants must satisfy 0 \u003c RHO \u003c SIG \u003c 1. Tuning of SIG (depending\n    on the nature of the function to be optimized) may speed up the\n    minimization; it is probably not worth playing much with RHO.\n\n- [variable] *DEFAULT-INT* 0.1\n\n    Don't reevaluate within INT of the limit of the current bracket.\n\n- [variable] *DEFAULT-EXT* 3\n\n    Extrapolate at most EXT times the current step-size.\n\n- [variable] *DEFAULT-SIG* 0.1\n\n    SIG and RHO are the constants controlling the Wolfe-Powell\n    conditions. 
SIG is the maximum allowed absolute ratio between\n    previous and new slopes (derivatives in the search direction), thus\n    setting SIG to low (positive) values forces higher precision in the\n    line-searches.\n\n- [variable] *DEFAULT-RHO* 0.05\n\n    RHO is the minimum allowed fraction of the expected\n    improvement (estimated from the slope at the initial point in the\n    line search). Constants must satisfy 0 \u003c RHO \u003c SIG \u003c 1.\n\n- [variable] *DEFAULT-RATIO* 10\n\n    Maximum allowed slope ratio.\n\n- [variable] *DEFAULT-MAX-N-LINE-SEARCHES* NIL\n\n- [variable] *DEFAULT-MAX-N-EVALUATIONS-PER-LINE-SEARCH* 20\n\n- [variable] *DEFAULT-MAX-N-EVALUATIONS* NIL\n\n- [class] CG-OPTIMIZER ITERATIVE-OPTIMIZER\n\n    Updates all weights simultaneously after chewing\n    through BATCH-SIZE inputs.\n\n- [accessor] BATCH-SIZE CG-OPTIMIZER (:BATCH-SIZE)\n\n    After having gone through BATCH-SIZE number of\n    instances, weights are updated. Normally, CG operates on all\n    available data, but it may be useful to introduce some noise into\n    the optimization to reduce overfitting by using smaller batch\n    sizes. If BATCH-SIZE is not set, it is initialized to the size of\n    the dataset at the start of optimization.\n\n- [accessor] CG-ARGS CG-OPTIMIZER (:CG-ARGS = 'NIL)\n\n- [accessor] ON-CG-BATCH-DONE CG-OPTIMIZER (:ON-CG-BATCH-DONE = NIL)\n\n    An event hook called when processing a conjugate\n    gradient batch is done. The handlers on the hook are called with 8\n    arguments:\n    \n        (optimizer gradient-source instances\n         best-w best-f n-line-searches\n         n-succesful-line-searches n-evaluations)\n    \n    The last 5 of these are the return values of the CG function.\n\n- [generic-function] LOG-CG-BATCH-DONE OPTIMIZER GRADIENT-SOURCE INSTANCES BEST-W BEST-F N-LINE-SEARCHES N-SUCCESFUL-LINE-SEARCHES N-EVALUATIONS\n\n    This is a function that can be added to\n    ON-CG-BATCH-DONE. 
The default implementation simply logs the event\n    arguments.\n\n- [reader] SEGMENT-FILTER CG-OPTIMIZER (:SEGMENT-FILTER = (CONSTANTLY T))\n\n    A predicate function on segments that filters out\n    uninteresting segments. Called from INITIALIZE-OPTIMIZER\*.\n\n### Extension API\n\n#### Implementing Optimizers\n\nThe following generic functions must be specialized for new\noptimizer types.\n\n- [generic-function] MINIMIZE* OPTIMIZER GRADIENT-SOURCE WEIGHTS DATASET\n\n    Called by MINIMIZE after INITIALIZE-OPTIMIZER\* and\n    INITIALIZE-GRADIENT-SOURCE\*, this generic function is the main\n    extension point for writing optimizers.\n\n- [generic-function] INITIALIZE-OPTIMIZER* OPTIMIZER GRADIENT-SOURCE WEIGHTS DATASET\n\n    Called automatically before training starts, this\n    function sets up OPTIMIZER to be suitable for optimizing\n    GRADIENT-SOURCE. It typically creates appropriately sized\n    accumulators for the gradients.\n\n- [generic-function] SEGMENTS OPTIMIZER\n\n    Several weight matrices known as *segments* can be\n    optimized by a single optimizer. This function returns them as a\n    list.\n\nThe rest are just utilities useful for implementing\noptimizers.\n\n- [function] TERMINATE-OPTIMIZATION-P N-INSTANCES TERMINATION\n\n    Utility function for subclasses of ITERATIVE-OPTIMIZER. It returns\n    whether optimization is to be terminated based on N-INSTANCES and\n    TERMINATION, which are the values of the respective accessors of\n    ITERATIVE-OPTIMIZER.\n\n- [function] SET-N-INSTANCES OPTIMIZER GRADIENT-SOURCE N-INSTANCES\n\n    Set N-INSTANCES of OPTIMIZER and\n    fire ON-N-INSTANCES-CHANGED. 
ITERATIVE-OPTIMIZER subclasses must\n    call this to increment N-INSTANCES.\n\n- [class] SEGMENT-SET\n\n    This is a utility class for optimizers that have a\n    list of SEGMENTS (the weights being optimized) and need to copy\n    back and forth between those segments and a single MAT (the\n    accumulator).\n\n- [reader] SEGMENTS SEGMENT-SET (:SEGMENTS)\n\n    A list of weight matrices.\n\n- [reader] SIZE SEGMENT-SET\n\n    The sum of the sizes of the weight matrices of\n    SEGMENTS.\n\n- [macro] DO-SEGMENT-SET (SEGMENT \u0026OPTIONAL START) SEGMENT-SET \u0026BODY BODY\n\n    Iterate over SEGMENTS in SEGMENT-SET. If START is specified, then it\n    is bound to the start index of SEGMENT within SEGMENT-SET. The start\n    index is the sum of the sizes of previous segments.\n\n- [function] SEGMENT-SET\u003c-MAT SEGMENT-SET MAT\n\n    Copy the values of MAT to the weight matrices of SEGMENT-SET as if\n    they were concatenated into a single MAT.\n\n- [function] SEGMENT-SET-\u003eMAT SEGMENT-SET MAT\n\n    Copy the values of SEGMENT-SET to MAT as if they were concatenated\n    into a single MAT.\n\n#### Implementing Gradient Sources\n\nWeights can be stored in a multitude of ways. Optimizers need to\nupdate weights, so it is assumed that weights are stored in any\nnumber of MAT objects called segments.\n\nThe generic functions in this section must all be specialized for\nnew gradient sources except where noted.\n\n- [generic-function] MAP-SEGMENTS FN GRADIENT-SOURCE\n\n    Apply FN to each segment of GRADIENT-SOURCE.\n\n- [generic-function] MAP-SEGMENT-RUNS FN SEGMENT\n\n    Call FN with start and end of intervals of\n    consecutive indices that are not missing in SEGMENT. Called by\n    optimizers that support partial updates. The default implementation\n    assumes that all weights are present. 
This only needs to be\n    specialized if one plans to use an optimizer that knows how to deal\n    with unused/missing weights, such as MGL-GD:NORMALIZED-BATCH-GD-OPTIMIZER\n    and MGL-GD:PER-WEIGHT-BATCH-GD-OPTIMIZER.\n\n- [generic-function] SEGMENT-WEIGHTS SEGMENT\n\n    Return the weight matrix of SEGMENT. A segment\n    doesn't need to be a MAT object itself. For example, it may be a\n    MGL-BM:CHUNK of a MGL-BM:BM or a MGL-BP:LUMP of a\n    MGL-BP:BPN whose NODES slot holds the weights.\n\n- [method] SEGMENT-WEIGHTS (MAT MAT)\n\n    When the segment is really a MAT, just return it.\n\n- [generic-function] SEGMENT-DERIVATIVES SEGMENT\n\n    Return the derivatives matrix of SEGMENT. A segment\n    doesn't need to be a MAT object itself. For example, it may be a\n    MGL-BM:CHUNK of a MGL-BM:BM or a MGL-BP:LUMP of a\n    MGL-BP:BPN whose DERIVATIVES slot holds the gradient.\n\n- [function] LIST-SEGMENTS GRADIENT-SOURCE\n\n    A utility function that returns the list of segments from\n    MAP-SEGMENTS on GRADIENT-SOURCE.\n\n- [generic-function] INITIALIZE-GRADIENT-SOURCE* OPTIMIZER GRADIENT-SOURCE WEIGHTS DATASET\n\n    Called automatically before MINIMIZE\* is called,\n    this function may be specialized if GRADIENT-SOURCE needs some kind\n    of setup.\n\n- [method] INITIALIZE-GRADIENT-SOURCE* OPTIMIZER GRADIENT-SOURCE WEIGHTS DATASET\n\n    The default method does nothing.\n\n- [generic-function] ACCUMULATE-GRADIENTS* GRADIENT-SOURCE SINK BATCH MULTIPLIER VALUEP\n\n    Add MULTIPLIER times the sum of first-order\n    gradients to accumulators of SINK (normally accessed with\n    DO-GRADIENT-SINK) and if VALUEP, return the sum of values of the\n    function being optimized for a BATCH of instances. 
GRADIENT-SOURCE\n    is the object representing the function being optimized, and SINK is\n    the gradient sink.\n    \n    Note that the number of instances in BATCH may be larger than what\n    GRADIENT-SOURCE can process in one go (in the sense of, say,\n    MAX-N-STRIPES), so DO-BATCHES-FOR-MODEL or something like (GROUP\n    BATCH MAX-N-STRIPES) can be handy.\n\n#### Implementing Gradient Sinks\n\nOptimizers call ACCUMULATE-GRADIENTS\* on gradient sources. One\nparameter of ACCUMULATE-GRADIENTS\* is the SINK. A gradient sink\nknows what accumulator matrix (if any) belongs to a segment. Sinks\nare defined entirely by MAP-GRADIENT-SINK.\n\n- [generic-function] MAP-GRADIENT-SINK FN SINK\n\n    Call FN of lambda list (SEGMENT ACCUMULATOR) on\n    each segment and its corresponding accumulator MAT in SINK.\n\n- [macro] DO-GRADIENT-SINK ((SEGMENT ACCUMULATOR) SINK) \u0026BODY BODY\n\n    A convenience macro on top of MAP-GRADIENT-SINK.\n\n## Differentiable Functions\n\n###### \[in package MGL-DIFFUN\]\n- [class] DIFFUN\n\n    DIFFUN dresses a lisp function (in its FN slot) as\n    a gradient source (see MGL-OPT::@MGL-OPT-GRADIENT-SOURCE), which\n    allows it to be used in MINIMIZE. See the examples in\n    MGL-GD::@MGL-GD and MGL-CG::@MGL-CG.\n\n- [reader] FN DIFFUN (:FN)\n\n    A real valued lisp function. It may have any\n    number of parameters.\n\n- [reader] PARAMETER-INDICES DIFFUN (:PARAMETER-INDICES = NIL)\n\n    The list of indices of parameters that we don't\n    optimize. 
Values for these will come from the DATASET argument of\n    MINIMIZE.\n\n- [reader] WEIGHT-INDICES DIFFUN (:WEIGHT-INDICES = NIL)\n\n    The list of indices of parameters to be optimized,\n    the values of which will come from the WEIGHTS\n    argument of MINIMIZE.\n\n## Backpropagation Neural Networks\n\n###### \[in package MGL-BP\]\n### Backprop Overview\n\nBackpropagation Neural Networks are just functions with lots of\nparameters called *weights* and a layered structure when presented\nas a [computational\ngraph](http://en.wikipedia.org/wiki/Automatic_differentiation). The\nnetwork is trained to MINIMIZE some kind of *loss function* whose\nvalue the network computes.\n\nIn this implementation, a BPN is assembled from several\n`LUMP`s (roughly corresponding to layers). Both feed-forward and\nrecurrent neural nets are supported (FNN and RNN, respectively).\n`BPN`s can contain not only `LUMP`s but other `BPN`s, too. As we\nsee, networks are composite objects and the abstract base class for\ncomposite and simple parts is called CLUMP.\n\n- [class] CLUMP\n\n    A CLUMP is a LUMP or a BPN. It represents\n    a differentiable function. Arguments of clumps are given during\n    instantiation. Some arguments are clumps themselves, so they get\n    permanently wired together like this:\n    \n    ```commonlisp\n    (-\u003ev*m (-\u003einput :size 10 :name 'input)\n           (-\u003eweight :dimensions '(10 20) :name 'weight)\n           :name 'activation)\n    ```\n    \n    The above creates three clumps: the vector-matrix multiplication\n    clump called `ACTIVATION`, which holds references to its operands,\n    INPUT and WEIGHT. Note that the example just defines a function; no\n    actual computation has taken place yet.\n    \n    This wiring of `CLUMP`s is how one builds feed-forward nets (FNN) or\n    recurrent neural networks (RNN) that are `CLUMP`s themselves, so one\n    can build nets in a hierarchical style if desired. 
Non-composite\n    `CLUMP`s are called LUMP (note the loss of `C` that stands for\n    composite). The various LUMP subtypes correspond to different layer\n    types (-\u003eSIGMOID, -\u003eDROPOUT, -\u003eRELU, -\u003eTANH, etc).\n\nAt this point, you may want to jump ahead to get a feel for how\nthings work by reading the @MGL-FNN-TUTORIAL.\n\n### Clump API\n\nThese are mostly for extension purposes. About the only thing\nneeded from here for normal operation is NODES when clamping inputs\nor extracting predictions.\n\n- [generic-function] STRIPEDP CLUMP\n\n    For efficiency, forward and backprop phases do\n    their stuff in batch mode: passing a number of instances through the\n    network in batches. Thus clumps must be able to store values of and\n    gradients for each of these instances. However, some clumps produce\n    the same result for each instance in a batch. These clumps are the\n    weights, the parameters of the network. STRIPEDP returns true iff\n    CLUMP does not represent weights (i.e. it's not a -\u003eWEIGHT).\n    \n    For striped clumps, their NODES and DERIVATIVES are MAT objects with\n    a leading dimension (number of rows in the 2d case) equal to the\n    number of instances in the batch. Non-striped clumps have no\n    restriction on their shape apart from what their usage dictates.\n\n- [generic-function] NODES OBJECT\n\n    Returns a MAT object representing the state or\n    result of OBJECT. The first dimension of the returned matrix is\n    equal to the number of stripes.\n\n`CLUMP`s' `NODES` holds the result computed by the most recent\nFORWARD. For -\u003eINPUT lumps, this is where input values shall be\nplaced (see SET-INPUT). Currently, the matrix is always two\ndimensional but this restriction may go away in the future.\n\n- [generic-function] DERIVATIVES CLUMP\n\n    Return the MAT object representing the partial\n    derivatives of the function CLUMP computes. 
The returned partial\n    derivatives were accumulated by previous BACKWARD calls.\n    \n    This matrix is shaped like the matrix returned by NODES.\n\n- [generic-function] FORWARD CLUMP\n\n    Compute the values of the function represented by\n    CLUMP for all stripes and place the results into NODES of CLUMP.\n\n- [generic-function] BACKWARD CLUMP\n\n    Compute the partial derivatives of the function\n    represented by CLUMP and add them to DERIVATIVES of the\n    corresponding argument clumps. The DERIVATIVES of CLUMP contains the\n    sum of partial derivatives of all clumps by the corresponding\n    output. This function is intended to be called after a FORWARD pass.\n    \n    Take the -\u003eSIGMOID clump for example when the network is being\n    applied to a batch of two instances `x1` and `x2`. `x1` and `x2` are\n    set in the -\u003eINPUT lump X. The sigmoid computes `1/(1+exp(-x))`\n    where `X` is its only argument clump.\n    \n        f(x) = 1/(1+exp(-x))\n    \n    When BACKWARD is called on the sigmoid lump, its DERIVATIVES is a\n    2x1 MAT object that contains the partial derivatives of the loss\n    function:\n    \n        dL(x1)/df\n        dL(x2)/df\n    \n    Now the BACKWARD method of the sigmoid needs to add `dL(x1)/dx1` and\n    `dL(x2)/dx2` to DERIVATIVES of `X`. 
Now, `dL(x1)/dx1 = dL(x1)/df *\n    df(x1)/dx1` and the first term is what we have in DERIVATIVES of the\n    sigmoid, so it only needs to calculate the second term.\n\nIn addition to the above, clumps also have to support SIZE,\nN-STRIPES, MAX-N-STRIPES (and the SETF methods of the latter two),\nwhich can be accomplished just by inheriting from BPN, FNN, RNN, or\na LUMP.\n\n### BPNs\n\n- [class] BPN CLUMP\n\n    Abstract base class for FNN and RNN.\n\n- [reader] N-STRIPES BPN (:N-STRIPES = 1)\n\n    The current number of instances the network has.\n    This is automatically set to the number of instances passed to\n    SET-INPUT, so it rarely has to be manipulated directly although it\n    can be set. When set, the N-STRIPES of all CLUMPS get set to the same\n    value.\n\n- [reader] MAX-N-STRIPES BPN (:MAX-N-STRIPES = NIL)\n\n    The maximum number of instances the network can\n    operate on in parallel. Within BUILD-FNN or BUILD-RNN, it defaults\n    to MAX-N-STRIPES of that parent network, else it defaults to 1.\n    When set, the MAX-N-STRIPES of all CLUMPS get set to the same value.\n\n- [reader] CLUMPS BPN (:CLUMPS = (MAKE-ARRAY 0 :ELEMENT-TYPE 'CLUMP :ADJUSTABLE T :FILL-POINTER T))\n\n    A topologically sorted adjustable array with a fill\n    pointer that holds the clumps that make up the network. Clumps are\n    added to it by ADD-CLUMP or, more often, automatically when within\n    a BUILD-FNN or BUILD-RNN. Rarely needed, FIND-CLUMP takes care of\n    most uses.\n\n- [function] FIND-CLUMP NAME BPN \u0026KEY (ERRORP T)\n\n    Find the clump with NAME among CLUMPS of BPN. As always, names are\n    compared with EQUAL. If not found, then return NIL or signal an\n    error depending on ERRORP.\n\n- [function] ADD-CLUMP CLUMP BPN\n\n    Add CLUMP to BPN. 
MAX-N-STRIPES of CLUMP gets set to that of BPN.\n    It is an error to add a clump with a name already used by one of the\n    CLUMPS of BPN.\n\n#### Training\n\n`BPN`s are trained to minimize the loss function they compute.\nBefore a BPN is passed to MINIMIZE (as its `GRADIENT-SOURCE`\nargument), it must be wrapped in a BP-LEARNER object. BP-LEARNER has\na MONITORS slot, which is used, for example, by\nRESET-OPTIMIZATION-MONITORS.\n\nWithout the bells and whistles, the basic shape of training is this:\n\n```commonlisp\n(minimize optimizer (make-instance 'bp-learner :bpn bpn)\n          :dataset dataset)\n```\n\n\n- [class] BP-LEARNER\n\n- [reader] BPN BP-LEARNER (:BPN)\n\n    The BPN for which this BP-LEARNER provides the\n    gradients.\n\n- [accessor] MONITORS BP-LEARNER (:MONITORS = NIL)\n\n    A list of `MONITOR`s.\n\n#### Monitoring\n\n- [function] MONITOR-BPN-RESULTS DATASET BPN MONITORS\n\n    For every batch (of size MAX-N-STRIPES of BPN) of instances in\n    DATASET, set the batch as the next input with SET-INPUT, perform a\n    FORWARD pass and apply MONITORS to the BPN (with APPLY-MONITORS).\n    Finally, return the counters of MONITORS. This is built on top of\n    MONITOR-MODEL-RESULTS.\n\n- [function] MAKE-STEP-MONITOR-MONITORS RNN \u0026KEY (COUNTER-VALUES-FN #'COUNTER-RAW-VALUES) (MAKE-COUNTER #'MAKE-STEP-MONITOR-MONITOR-COUNTER)\n\n    Return a list of monitors, one for every monitor in STEP-MONITORS\n    of RNN. These monitors extract the results from their warp\n    counterparts with COUNTER-VALUES-FN and add them to their own\n    counter that's created by MAKE-COUNTER. Wow. Ew. 
The idea is that
    one does something like this to monitor warped prediction:
    
    ```commonlisp
    (let ((*warp-time* t))
      (setf (step-monitors rnn)
            (make-cost-monitors rnn :attributes '(:event "warped pred.")))
      (monitor-bpn-results dataset rnn
                           ;; Just collect and reset the warp
                           ;; monitors after each batch of
                           ;; instances.
                           (make-step-monitor-monitors rnn)))
    ```


- [generic-function] MAKE-STEP-MONITOR-MONITOR-COUNTER STEP-COUNTER

    In an RNN, STEP-COUNTER aggregates results of all
    the time steps during the processing of instances in the current
    batch. Return a new counter into which results from STEP-COUNTER can
    be accumulated when the processing of the batch is finished. The
    default implementation creates a copy of STEP-COUNTER.

#### Feed-Forward Nets

FNN and RNN have a lot in common (see their common superclass, BPN).
There is very limited functionality that's specific to FNNs, so let's
get it out of the way before we study a full example.

- [class] FNN BPN

    A feed-forward neural net (as opposed to a
    recurrent one, see RNN).

- [macro] BUILD-FNN (&KEY FNN (CLASS ''FNN) INITARGS MAX-N-STRIPES NAME) &BODY CLUMPS

    Syntactic sugar to assemble FNNs from CLUMPs. Like LET\*, it is a
    sequence of bindings (of symbols to CLUMPs). The names of the clumps
    created default to the symbol of the binding. In case a clump is not
    bound to a symbol (because it was created in a nested expression),
    the local function CLUMP can be used to find the clump with the
    given name in the fnn being built. 
Example:\n    \n        (build-fnn ()\n          (features (-\u003einput :size n-features))\n          (biases (-\u003eweight :size n-features))\n          (weights (-\u003eweight :size (* n-hiddens n-features)))\n          (activations0 (-\u003ev*m :weights weights :x (clump 'features)))\n          (activations (-\u003e+ :args (list biases activations0)))\n          (output (-\u003esigmoid :x activations)))\n\n\n##### FNN Tutorial\n\nHopefully this example from `example/digit-fnn.lisp` illustrates\nthe concepts involved. If it's too dense despite the comments, then\nread up on MGL-DATASET::@MGL-DATASET, MGL-OPT::@MGL-OPT and come back.\n\n```commonlisp\n(cl:defpackage :mgl-example-digit-fnn\n  (:use #:common-lisp #:mgl))\n\n(in-package :mgl-example-digit-fnn)\n\n;;; There are 10 possible digits used as inputs ...\n(defparameter *n-inputs* 10)\n;;; and we want to learn the rule that maps the input digit D to (MOD\n;;; (1+ D) 3).\n(defparameter *n-outputs* 3)\n\n;;; We define a feed-forward net to be able to specialize how inputs\n;;; are translated by adding a SET-INPUT method later.\n(defclass digit-fnn (fnn)\n  ())\n\n;;; Build a DIGIT-FNN with a single hidden layer of rectified linear\n;;; units and a softmax output.\n(defun make-digit-fnn (\u0026key (n-hiddens 5))\n  (build-fnn (:class 'digit-fnn)\n    (input (-\u003einput :size *n-inputs*))\n    (hidden-activation (-\u003eactivation input :size n-hiddens))\n    (hidden (-\u003erelu hidden-activation))\n    (output-activation (-\u003eactivation hidden :size *n-outputs*))\n    (output (-\u003esoftmax-xe-loss output-activation))))\n\n;;; This method is called with batches of 'instances' (input digits in\n;;; this case) by MINIMIZE and also by MONITOR-BPN-RESULTS before\n;;; performing a forward pass (i.e. computing the value of the\n;;; function represented by the network). 
Its job is to encode the
;;; inputs by populating rows of the NODES matrix of the INPUT clump.
;;;
;;; Each input is encoded as a row of zeros with a single 1 at the index
;;; determined by the input digit. This is called one-hot encoding.
;;; The TARGET could be encoded the same way, but instead we use the
;;; sparse option supported by TARGET of ->SOFTMAX-XE-LOSS.
(defmethod set-input (digits (fnn digit-fnn))
  (let* ((input (nodes (find-clump 'input fnn)))
         (output-lump (find-clump 'output fnn)))
    (fill! 0 input)
    (loop for i upfrom 0
          for digit in digits
          do (setf (mref input i digit) 1))
    (setf (target output-lump)
          (mapcar (lambda (digit)
                    (mod (1+ digit) *n-outputs*))
                  digits))))

;;; Train the network by minimizing the loss (cross-entropy here) with
;;; stochastic gradient descent.
(defun train-digit-fnn ()
  (let ((optimizer
          ;; First create the optimizer for MINIMIZE.
          (make-instance 'segmented-gd-optimizer
                         :segmenter
                         ;; We train each weight lump with the same
                         ;; parameters and, in fact, the same
                         ;; optimizer. But it need not be so, in
                         ;; general.
                         (constantly
                          (make-instance 'sgd-optimizer
                                         :learning-rate 1
                                         :momentum 0.9
                                         :batch-size 100))))
        (fnn (make-digit-fnn)))
    ;; The number of instances the FNN can work with in parallel. It's
    ;; usually equal to the batch size or a divisor of it.
    (setf (max-n-stripes fnn) 50)
    ;; Initialize all weights randomly.
    (map-segments (lambda (weights)
                    (gaussian-random! 
(nodes weights) :stddev 0.01))
                  fnn)
    ;; Arrange for training and test error to be logged.
    (monitor-optimization-periodically
     optimizer '((:fn log-test-error :period 10000)
                 (:fn reset-optimization-monitors :period 1000)))
    ;; Finally, start the optimization.
    (minimize optimizer
              ;; Dress FNN in a BP-LEARNER and attach monitors for the
              ;; cost to it. These monitors are going to be logged and
              ;; reset after every 1000 training instances by
              ;; RESET-OPTIMIZATION-MONITORS above.
              (make-instance 'bp-learner
                             :bpn fnn
                             :monitors (make-cost-monitors
                                        fnn :attributes `(:event "train")))
              ;; Training stops when the sampler runs out (after 10000
              ;; instances).
              :dataset (make-sampler 10000))))

;;; Return a sampler object that produces MAX-N-SAMPLES number of
;;; random inputs (numbers between 0 and 9).
(defun make-sampler (max-n-samples)
  (make-instance 'function-sampler :max-n-samples max-n-samples
                 :generator (lambda () (random *n-inputs*))))

;;; Log the test error. Also, describe the optimizer and the bpn at
;;; the beginning of training. Called periodically during training
;;; (see above).
(defun log-test-error (optimizer learner)
  (when (zerop (n-instances optimizer))
    (describe optimizer)
    (describe (bpn learner)))
  (log-padded
   (monitor-bpn-results (make-sampler 1000) (bpn learner)
                        (make-cost-monitors
                         (bpn learner) :attributes `(:event "pred.")))))

#|

;;; Transcript follows:
(repeatably ()
  (let ((*log-time* nil))
    (train-digit-fnn)))
.. training at n-instances: 0
.. train cost: 0.000e+0 (0)
.. #<SEGMENTED-GD-OPTIMIZER {100E112E93}>
..  SEGMENTED-GD-OPTIMIZER description:
..  
  N-INSTANCES = 0\n..    OPTIMIZERS = (#\u003cSGD-OPTIMIZER\n..                    #\u003cSEGMENT-SET\n..                      (#\u003c-\u003eWEIGHT # :SIZE 15 1/1 :NORM 0.04473\u003e\n..                       #\u003c-\u003eWEIGHT # :SIZE 3 1/1 :NORM 0.01850\u003e\n..                       #\u003c-\u003eWEIGHT # :SIZE 50 1/1 :NORM 0.07159\u003e\n..                       #\u003c-\u003eWEIGHT # :SIZE 5 1/1 :NORM 0.03056\u003e)\n..                      {100E335B73}\u003e\n..                    {100E06DF83}\u003e)\n..    SEGMENTS = (#\u003c-\u003eWEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE\n..                  15 1/1 :NORM 0.04473\u003e\n..                #\u003c-\u003eWEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE\n..                  3 1/1 :NORM 0.01850\u003e\n..                #\u003c-\u003eWEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE\n..                  50 1/1 :NORM 0.07159\u003e\n..                #\u003c-\u003eWEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE\n..                  5 1/1 :NORM 0.03056\u003e)\n..  \n.. #\u003cSGD-OPTIMIZER {100E06DF83}\u003e\n..  GD-OPTIMIZER description:\n..    N-INSTANCES = 0\n..    SEGMENT-SET = #\u003cSEGMENT-SET\n..                    (#\u003c-\u003eWEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE\n..                       15 1/1 :NORM 0.04473\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE\n..                       3 1/1 :NORM 0.01850\u003e\n..                     #\u003c-\u003eWEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE\n..                       50 1/1 :NORM 0.07159\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE\n..                       5 1/1 :NORM 0.03056\u003e)\n..                    {100E335B73}\u003e\n..    LEARNING-RATE = 1.00000e+0\n..    MOMENTUM = 9.00000e-1\n..    MOMENTUM-TYPE = :NORMAL\n..    WEIGHT-DECAY = 0.00000e+0\n..    WEIGHT-PENALTY = 0.00000e+0\n..    N-AFTER-UPATE-HOOK = 0\n..    BATCH-SIZE = 100\n..  \n..  BATCH-GD-OPTIMIZER description:\n..    N-BEFORE-UPATE-HOOK = 0\n..  
#<DIGIT-FNN {100E11A423}>
..   BPN description:
..     CLUMPS = #(#<->INPUT INPUT :SIZE 10 1/50 :NORM 0.00000>
..                #<->ACTIVATION
..                  (HIDDEN-ACTIVATION :ACTIVATION) :STRIPES 1/50
..                  :CLUMPS 4>
..                #<->RELU HIDDEN :SIZE 5 1/50 :NORM 0.00000>
..                #<->ACTIVATION
..                  (OUTPUT-ACTIVATION :ACTIVATION) :STRIPES 1/50
..                  :CLUMPS 4>
..                #<->SOFTMAX-XE-LOSS OUTPUT :SIZE 3 1/50 :NORM 0.00000>)
..     N-STRIPES = 1
..     MAX-N-STRIPES = 50
..   pred. cost: 1.100d+0 (1000.00)
.. training at n-instances: 1000
.. train cost: 1.093d+0 (1000.00)
.. training at n-instances: 2000
.. train cost: 5.886d-1 (1000.00)
.. training at n-instances: 3000
.. train cost: 3.574d-3 (1000.00)
.. training at n-instances: 4000
.. train cost: 1.601d-7 (1000.00)
.. training at n-instances: 5000
.. train cost: 1.973d-9 (1000.00)
.. training at n-instances: 6000
.. train cost: 4.882d-10 (1000.00)
.. training at n-instances: 7000
.. train cost: 2.771d-10 (1000.00)
.. training at n-instances: 8000
.. train cost: 2.283d-10 (1000.00)
.. training at n-instances: 9000
.. train cost: 2.123d-10 (1000.00)
.. training at n-instances: 10000
.. train cost: 2.263d-10 (1000.00)
.. pred. cost: 2.210d-10 (1000.00)
..
==> (#<->WEIGHT (:BIAS HIDDEN-ACTIVATION) :SIZE 5 1/1 :NORM 2.94294>
-->  #<->WEIGHT (INPUT HIDDEN-ACTIVATION) :SIZE 50 1/1 :NORM 11.48995>
-->  #<->WEIGHT (:BIAS OUTPUT-ACTIVATION) :SIZE 3 1/1 :NORM 3.39103>
-->  #<->WEIGHT (HIDDEN OUTPUT-ACTIVATION) :SIZE 15 1/1 :NORM 11.39339>)

|#
```

#### Recurrent Neural Nets

##### RNN Tutorial

Hopefully this example from `example/sum-sign-rnn.lisp` illustrates
the concepts involved. 
Make sure you are comfortable with\n@MGL-FNN-TUTORIAL before reading this.\n\n```commonlisp\n(cl:defpackage :mgl-example-sum-sign-rnn\n  (:use #:common-lisp #:mgl))\n\n(in-package :mgl-example-sum-sign-rnn)\n\n;;; There is a single input at each time step...\n(defparameter *n-inputs* 1)\n;;; and we want to learn the rule that outputs the sign of the sum of\n;;; inputs so far in the sequence.\n(defparameter *n-outputs* 3)\n\n;;; Generate a training example that's a sequence of random length\n;;; between 1 and LENGTH. Elements of the sequence are lists of two\n;;; elements:\n;;;\n;;; 1. The input for the network (a single random number).\n;;;\n;;; 2. The sign of the sum of inputs so far encoded as 0, 1, 2 (for\n;;;    negative, zero and positive values). To add a twist, the sum is\n;;;    reset whenever a negative input is seen.\n(defun make-sum-sign-instance (\u0026key (length 10))\n  (let ((length (max 1 (random length)))\n        (sum 0))\n    (loop for i below length\n          collect (let ((x (1- (* 2 (random 2)))))\n                    (incf sum x)\n                    (when (\u003c x 0)\n                      (setq sum x))\n                    (list x (cond ((minusp sum) 0)\n                                  ((zerop sum) 1)\n                                  (t 2)))))))\n\n;;; Build an RNN with a single lstm hidden layer and softmax output.\n;;; For each time step, a SUM-SIGN-FNN will be instantiated.\n(defun make-sum-sign-rnn (\u0026key (n-hiddens 1))\n  (build-rnn ()\n    (build-fnn (:class 'sum-sign-fnn)\n      (input (-\u003einput :size 1))\n      (h (-\u003elstm input :name 'h :size n-hiddens))\n      (prediction (-\u003esoftmax-xe-loss (-\u003eactivation h :name 'prediction\n                                                   :size *n-outputs*))))))\n\n;;; We define this class to be able to specialize how inputs are\n;;; translated by adding a SET-INPUT method later.\n(defclass sum-sign-fnn (fnn)\n  ())\n\n;;; We have a batch of instances from 
MAKE-SUM-SIGN-INSTANCE for the\n;;; RNN. This function is invoked with elements of these instances\n;;; belonging to the same time step (i.e. at the same index) and sets\n;;; the input and target up.\n(defmethod set-input (instances (fnn sum-sign-fnn))\n  (let ((input-nodes (nodes (find-clump 'input fnn))))\n    (setf (target (find-clump 'prediction fnn))\n          (loop for stripe upfrom 0\n                for instance in instances\n                collect\n                ;; Sequences in the batch are not of equal length. The\n                ;; RNN sends a NIL our way if a sequence has run out.\n                (when instance\n                  (destructuring-bind (input target) instance\n                    (setf (mref input-nodes stripe 0) input)\n                    target))))))\n\n;;; Train the network by minimizing the loss (cross-entropy here) with\n;;; the Adam optimizer.\n(defun train-sum-sign-rnn ()\n  (let ((rnn (make-sum-sign-rnn)))\n    (setf (max-n-stripes rnn) 50)\n    ;; Initialize the weights in the usual sqrt(1 / fan-in) style.\n    (map-segments (lambda (weights)\n                    (let* ((fan-in (mat-dimension (nodes weights) 0))\n                           (limit (sqrt (/ 6 fan-in))))\n                      (uniform-random! (nodes weights)\n                                       :limit (* 2 limit))\n                      (.+! 
(- limit) (nodes weights))))\n                  rnn)\n    (minimize (monitor-optimization-periodically\n               (make-instance 'adam-optimizer\n                              :learning-rate 0.2\n                              :mean-decay 0.9\n                              :mean-decay-decay 0.9\n                              :variance-decay 0.9\n                              :batch-size 100)\n               '((:fn log-test-error :period 30000)\n                 (:fn reset-optimization-monitors :period 3000)))\n              (make-instance 'bp-learner\n                             :bpn rnn\n                             :monitors (make-cost-monitors rnn))\n              :dataset (make-sampler 30000))))\n\n;;; Return a sampler object that produces MAX-N-SAMPLES number of\n;;; random inputs.\n(defun make-sampler (max-n-samples \u0026key (length 10))\n  (make-instance 'function-sampler :max-n-samples max-n-samples\n                 :generator (lambda ()\n                              (make-sum-sign-instance :length length))))\n\n;;; Log the test error. Also, describe the optimizer and the bpn at\n;;; the beginning of training. 
Called periodically during training\n;;; (see above).\n(defun log-test-error (optimizer learner)\n  (when (zerop (n-instances optimizer))\n    (describe optimizer)\n    (describe (bpn learner)))\n  (let ((rnn (bpn learner)))\n    (log-padded\n     (append\n      (monitor-bpn-results (make-sampler 1000) rnn\n                           (make-cost-monitors\n                            rnn :attributes '(:event \"pred.\")))\n      ;; Same result in a different way: monitor predictions for\n      ;; sequences up to length 20, but don't unfold the RNN\n      ;; unnecessarily to save memory.\n      (let ((*warp-time* t))\n        (monitor-bpn-results (make-sampler 1000 :length 20) rnn\n                             ;; Just collect and reset the warp\n                             ;; monitors after each batch of\n                             ;; instances.\n                             (make-cost-monitors\n                              rnn :attributes '(:event \"warped pred.\"))))))\n    ;; Verify that no further unfoldings took place.\n    (assert (\u003c= (length (clumps rnn)) 10)))\n  (log-mat-room))\n\n#|\n\n;;; Transcript follows:\n(let (;; Backprop nets do not need double float. Using single floats\n      ;; is faster and needs less memory.\n      (*default-mat-ctype* :float)\n      ;; Enable moving data in and out of GPU memory so that the RNN\n      ;; can work with sequences so long that the unfolded network\n      ;; wouldn't otherwise fit in the GPU.\n      (*cuda-window-start-time* 1)\n      (*log-time* nil))\n  ;; Seed the random number generators.\n  (repeatably ()\n    ;; Enable CUDA if available.\n    (with-cuda* ()\n      (train-sum-sign-rnn))))\n.. training at n-instances: 0\n.. cost: 0.000e+0 (0)\n.. #\u003cADAM-OPTIMIZER {1006CD5663}\u003e\n..  GD-OPTIMIZER description:\n..    N-INSTANCES = 0\n..    SEGMENT-SET = #\u003cSEGMENT-SET\n..                    (#\u003c-\u003eWEIGHT (H #) :SIZE 1 1/1 :NORM 1.73685\u003e\n..                     
#\u003c-\u003eWEIGHT (H #) :SIZE 1 1/1 :NORM 0.31893\u003e\n..                     #\u003c-\u003eWEIGHT (#1=# #2=# :PEEPHOLE) :SIZE\n..                       1 1/1 :NORM 1.81610\u003e\n..                     #\u003c-\u003eWEIGHT (H #2#) :SIZE 1 1/1 :NORM 0.21965\u003e\n..                     #\u003c-\u003eWEIGHT (#1# #3=# :PEEPHOLE) :SIZE\n..                       1 1/1 :NORM 1.74939\u003e\n..                     #\u003c-\u003eWEIGHT (H #3#) :SIZE 1 1/1 :NORM 0.40377\u003e\n..                     #\u003c-\u003eWEIGHT (H PREDICTION) :SIZE\n..                       3 1/1 :NORM 2.15898\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS PREDICTION) :SIZE\n..                       3 1/1 :NORM 2.94470\u003e\n..                     #\u003c-\u003eWEIGHT (#1# #4=# :PEEPHOLE) :SIZE\n..                       1 1/1 :NORM 0.97601\u003e\n..                     #\u003c-\u003eWEIGHT (INPUT #4#) :SIZE 1 1/1 :NORM 0.65261\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS #4#) :SIZE 1 1/1 :NORM 0.37653\u003e\n..                     #\u003c-\u003eWEIGHT (INPUT #1#) :SIZE 1 1/1 :NORM 0.92334\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS #1#) :SIZE 1 1/1 :NORM 0.01609\u003e\n..                     #\u003c-\u003eWEIGHT (INPUT #5=#) :SIZE 1 1/1 :NORM 1.09995\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS #5#) :SIZE 1 1/1 :NORM 1.41244\u003e\n..                     #\u003c-\u003eWEIGHT (INPUT #6=#) :SIZE 1 1/1 :NORM 0.40475\u003e\n..                     #\u003c-\u003eWEIGHT (:BIAS #6#) :SIZE 1 1/1 :NORM 1.75358\u003e)\n..                    {1006CD8753}\u003e\n..    LEARNING-RATE = 2.00000e-1\n..    MOMENTUM = NONE\n..    MOMENTUM-TYPE = :NONE\n..    WEIGHT-DECAY = 0.00000e+0\n..    WEIGHT-PENALTY = 0.00000e+0\n..    N-AFTER-UPATE-HOOK = 0\n..    BATCH-SIZE = 100\n..  \n..  BATCH-GD-OPTIMIZER description:\n..    N-BEFORE-UPATE-HOOK = 0\n..  \n..  ADAM-OPTIMIZER description:\n..    MEAN-DECAY-RATE = 1.00000e-1\n..    
MEAN-DECAY-RATE-DECAY = 9.00000e-1\n..    VARIANCE-DECAY-RATE = 1.00000e-1\n..    VARIANCE-ADJUSTMENT = 1.00000d-7\n..  #\u003cRNN {10047C77E3}\u003e\n..   BPN description:\n..     CLUMPS = #(#\u003cSUM-SIGN-FNN :STRIPES 1/50 :CLUMPS 4\u003e\n..                #\u003cSUM-SIGN-FNN :STRIPES 1/50 :CLUMPS 4\u003e)\n..     N-STRIPES = 1\n..     MAX-N-STRIPES = 50\n..   \n..   RNN description:\n..     MAX-LAG = 1\n..   pred.        cost: 1.223e+0 (4455.00)\n.. warped pred. cost: 1.228e+0 (9476.00)\n.. Foreign memory usage:\n.. foreign arrays: 162 (used bytes: 39,600)\n.. CUDA memory usage:\n.. device arrays: 114 (used bytes: 220,892, pooled bytes: 19,200)\n.. host arrays: 162 (used bytes: 39,600)\n.. host-\u003edevice copies: 6,164, device-\u003ehost copies: 4,490\n.. training at n-instances: 3000\n.. cost: 3.323e-1 (13726.00)\n.. training at n-instances: 6000\n.. cost: 3.735e-2 (13890.00)\n.. training at n-instances: 9000\n.. cost: 1.012e-2 (13872.00)\n.. training at n-instances: 12000\n.. cost: 3.026e-3 (13953.00)\n.. training at n-instances: 15000\n.. cost: 9.267e-4 (13948.00)\n.. training at n-instances: 18000\n.. cost: 2.865e-4 (13849.00)\n.. training at n-instances: 21000\n.. cost: 8.893e-5 (13758.00)\n.. training at n-instances: 24000\n.. cost: 2.770e-5 (13908.00)\n.. training at n-instances: 27000\n.. cost: 8.514e-6 (13570.00)\n.. training at n-instances: 30000\n.. cost: 2.705e-6 (13721.00)\n.. pred.        cost: 1.426e-6 (4593.00)\n.. warped pred. cost: 1.406e-6 (9717.00)\n.. Foreign memory usage:\n.. foreign arrays: 216 (used bytes: 52,800)\n.. CUDA memory usage:\n.. device arrays: 148 (used bytes: 224,428, pooled bytes: 19,200)\n.. host arrays: 216 (used bytes: 52,800)\n.. 
host->device copies: 465,818, device->host copies: 371,990
..
==> (#<->WEIGHT (H (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.10624>
-->  #<->WEIGHT (H (H :CELL)) :SIZE 1 1/1 :NORM 0.94460>
-->  #<->WEIGHT ((H :CELL) (H :FORGET) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.61312>
-->  #<->WEIGHT (H (H :FORGET)) :SIZE 1 1/1 :NORM 0.38093>
-->  #<->WEIGHT ((H :CELL) (H :INPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 1.17956>
-->  #<->WEIGHT (H (H :INPUT)) :SIZE 1 1/1 :NORM 0.88011>
-->  #<->WEIGHT (H PREDICTION) :SIZE 3 1/1 :NORM 49.93808>
-->  #<->WEIGHT (:BIAS PREDICTION) :SIZE 3 1/1 :NORM 10.98112>
-->  #<->WEIGHT ((H :CELL) (H :OUTPUT) :PEEPHOLE) :SIZE 1 1/1 :NORM 0.67996>
-->  #<->WEIGHT (INPUT (H :OUTPUT)) :SIZE 1 1/1 :NORM 0.65251>
-->  #<->WEIGHT (:BIAS (H :OUTPUT)) :SIZE 1 1/1 :NORM 10.23003>
-->  #<->WEIGHT (INPUT (H :CELL)) :SIZE 1 1/1 :NORM 5.98116>
-->  #<->WEIGHT (:BIAS (H :CELL)) :SIZE 1 1/1 :NORM 0.10681>
-->  #<->WEIGHT (INPUT (H :FORGET)) :SIZE 1 1/1 :NORM 4.46301>
-->  #<->WEIGHT (:BIAS (H :FORGET)) :SIZE 1 1/1 :NORM 1.57195>
-->  #<->WEIGHT (INPUT (H :INPUT)) :SIZE 1 1/1 :NORM 0.36401>
-->  #<->WEIGHT (:BIAS (H :INPUT)) :SIZE 1 1/1 :NORM 8.63833>)

|#
```

- [class] RNN BPN

    A recurrent neural net (as opposed to a
    feed-forward one). It is typically built with BUILD-RNN, which is no
    more than a shallow convenience macro.
    
    An RNN takes instances as inputs that are sequences of variable
    length. At each time step, the next unprocessed elements of these
    sequences are set as input until all input sequences in the batch
    run out. 
To be able to perform backpropagation, all intermediate
    `LUMP`s must be kept around, so the recursive connections are
    transformed out by
    [unfolding](http://en.wikipedia.org/wiki/Backpropagation_through_time)
    the network. Just how many lumps this means depends on the length of
    the sequences.
    
    When an RNN is created, `MAX-LAG + 1` BPNs are instantiated so
    that all weights are present and one can start training it.

- [reader] UNFOLDER RNN (:UNFOLDER)

    The UNFOLDER of an RNN is a function of no arguments
    that builds and returns a BPN. The unfolder is allowed to create
    networks with arbitrary topology, even different ones for different
    TIME-STEPs with the help of LAG, or nested RNNs. Weights of
    the same name are shared between the folds. That is, if a ->WEIGHT
    lump is to be created and a weight lump of the same name already
    exists, then the existing lump will be added to the BPN created by
    UNFOLDER.

- [reader] MAX-LAG RNN (:MAX-LAG = 1)

    The networks built by UNFOLDER may contain new
    weights up to time step MAX-LAG. Beyond that point, all weight
    lumps must be reappearances of weight lumps with the same name at
    previous time steps. Most recurrent networks reference only the
    state of lumps at the previous time step (with the function LAG),
    hence the default of 1. But it is possible to have connections to
    arbitrary time steps. The maximum connection lag must be specified
    when creating the RNN.

- [accessor] CUDA-WINDOW-START-TIME RNN (:CUDA-WINDOW-START-TIME = \*CUDA-WINDOW-START-TIME\*)

    Due to unfolding, the memory footprint of an RNN
    is almost linear in the number of time steps (i.e. the max
    sequence length). For prediction, this is addressed by
    @MGL-RNN-TIME-WARP. 
For training, we cannot discard the results of
    previous time steps because they are needed for backpropagation,
    but we can at least move them out of GPU memory if they are not
    going to be used for a while and copy them back before they are
    needed. Obviously, this is only relevant if CUDA is being used.
    
    If CUDA-WINDOW-START-TIME is NIL, then this feature is turned off.
    Else, during training, at CUDA-WINDOW-START-TIME or later time
    steps, matrices belonging to non-weight lumps may be forced out of
    GPU memory and later brought back as needed.
    
    This feature is implemented in terms of
    MGL-MAT:WITH-SYNCING-CUDA-FACETS, which uses CUDA host memory (also
    known as *page-locked* or *pinned* memory) to do asynchronous
    copies concurrently with normal computation. The consequence of
    this is that it is now main memory usage that's unbounded, which
    together with page-locking makes it a potent weapon to bring a
    machine to a halt. You were warned.

- [variable] *CUDA-WINDOW-START-TIME* NIL

    The default for CUDA-WINDOW-START-TIME.

- [macro] BUILD-RNN (&KEY RNN (CLASS ''RNN) NAME INITARGS MAX-N-STRIPES (MAX-LAG 1)) &BODY BODY

    Create an RNN with MAX-N-STRIPES and MAX-LAG whose UNFOLDER is BODY
    wrapped in a lambda. Bind the symbol given as the RNN argument to the
    RNN object so that BODY can see it.

- [function] LAG NAME &KEY (LAG 1) RNN PATH

    In RNN, or if it's NIL, in the RNN being extended with another
    BPN (called *unfolding*), look up the CLUMP with NAME in the BPN
    that's LAG number of time steps before the BPN being added. 
If this
    function is called from UNFOLDER of an RNN (which is what happens
    behind the scenes in the body of BUILD-RNN), then it returns an
    opaque object representing a lagged connection to a clump, else it
    returns the CLUMP itself.
    
    FIXDOC: PATH

- [function] TIME-STEP &KEY (RNN \*RNN\*)

    Return the time step RNN is currently executing or being unfolded for.
    It is 0 when the RNN is being unfolded for the first time.

- [method] SET-INPUT INSTANCES (RNN RNN)

    RNNs operate on batches of instances just like FNNs. But the
    instances here are like datasets: sequences or samplers, and they are
    turned into sequences of batches of instances with
    MAP-DATASETS :IMPUTE NIL. The batch of instances at index 2 is
    clamped onto the BPN at time step 2 with SET-INPUT.
    
    When the input sequences in the batch are not of the same length,
    already exhausted sequences will produce NIL (due to :IMPUTE NIL)
    above. When such a NIL is clamped with SET-INPUT on a BPN of the
    RNN, SET-INPUT must set the IMPORTANCE of the ->ERROR lumps to 0,
    else training would operate on the noise left there by previous
    invocations.

##### Time Warp

The unbounded memory usage of `RNN`s with one BPN allocated per
time step can become a problem. For training, where the gradients
often have to be backpropagated from the last time step to the very
beginning, this is hard to solve, but with CUDA-WINDOW-START-TIME the
limit is no longer GPU memory.

For prediction, on the other hand, one doesn't need to keep old steps
around indefinitely: they can be discarded when future time steps
will never reference them again.

- [variable] *WARP-TIME* NIL

    Controls whether warping is enabled (see @MGL-RNN-TIME-WARP). 
Don't
    enable it for training, as it would make backprop impossible.

- [function] WARPED-TIME &KEY (RNN \*RNN\*) (TIME (TIME-STEP :RNN RNN)) (LAG 0)

    Return the index of the BPN in CLUMPS of RNN whose task it is to
    execute computation at `(- (TIME-STEP RNN) LAG)`. This is normally
    the same as TIME-STEP (disregarding LAG). That is, CLUMPS can be
    indexed by TIME-STEP to get the BPN. However, when *WARP-TIME* is
    true, execution proceeds in a cycle as the structure of the network
    allows.
    
    Suppose we have a typical RNN that only ever references the previous
    time step, so its MAX-LAG is 1. Its UNFOLDER returns `BPN`s of
    identical structure bar a shift in their time lagged connections
    except for the very first, so WARP-START and WARP-LENGTH are both 1.
    If *WARP-TIME* is NIL, then the mapping from TIME-STEP to the BPN in
    CLUMPS is straightforward:
    
        time:   |  0 |  1 |  2 |  3 |  4 |  5
        --------+----+----+----+----+----+----
        warped: |  0 |  1 |  2 |  3 |  4 |  5
        --------+----+----+----+----+----+----
        bpn:    | b0 | b1 | b2 | b3 | b4 | b5
    
    When *WARP-TIME* is true, we reuse the `B1` - `B2` bpns in a loop:
    
        time:   |  0 |  1 |  2 |  3 |  4 |  5
        --------+----+----+----+----+----+----
        warped: |  0 |  1 |  2 |  1 |  2 |  1
        --------+----+----+----+----+----+----
        bpn:    | b0 | b1 | b2 | b1*| b2 | b1*
    
    `B1*` is the same BPN as `B1`, but its connections created by LAG go
    through warped time and end up referencing `B2`. This way, memory
    consumption is independent of the number of time steps needed to
    process a sequence or make predictions.
    
    To be able to pull this trick off, WARP-START and WARP-LENGTH must be
    specified when the RNN is instantiated. 
In general, with
    *WARP-TIME*, `(+ WARP-START (MAX 2 WARP-LENGTH))` bpns are needed.
    The 2 comes from the fact that with cycle length 1 a bpn would need
    to take its input from itself, which is problematic because it has
    NODES for only one set of values.

- [reader] WARP-START RNN (:WARP-START = 1)

    The TIME-STEP from which UNFOLDER will create
    `BPN`s that essentially repeat every WARP-LENGTH steps.

- [reader] WARP-LENGTH RNN (:WARP-LENGTH = 1)

    An integer such that the BPN UNFOLDER creates at
    time step `I` (where `(<= WARP-START I)`) is identical to the BPN
    created at time step `(+ WARP-START (MOD (- I WARP-START)
    WARP-LENGTH))` except for a shift in its time lagged
    connections.

- [accessor] STEP-MONITORS RNN (:STEP-MONITORS = NIL)

    During training, unfolded `BPN`s corresponding to
    previous time steps may be expensive to get at because they are no
    longer in GPU memory. This consideration also applies to making
    predictions, with the additional caveat that with *WARP-TIME* true,
    previous states are discarded so it's not possible to gather
    statistics after FORWARD has finished.
    
    Add monitor objects to this slot and they will be automatically
    applied to the RNN after each step when `FORWARD`ing the RNN
    during training or prediction. To be able to easily switch between
    sets of monitors, in addition to a list of monitors this can be a
    symbol or a function, too. If it's a symbol, then it's a designator
    for its SYMBOL-VALUE. If it's a function, then it must have no
    arguments and it's a designator for its return value.

### Lumps

#### Lump Base Class

- [class] LUMP CLUMP

    A LUMP is a simple, layerlike component of a neural
    network. There are many kinds of lumps, each of which performs a
    specific operation or just stores inputs and weights. By convention,
    the names of lumps start with the prefix `->`. 
Defined as classes,\n    they also have a function of the same name as the class to create\n    them easily. These maker functions typically have keyword arguments\n    corresponding to initargs of the class, with some (mainly the input\n    lumps) turned into normal positional arguments. So instead of having\n    to do\n    \n        (make-instance '-\u003etanh :x some-input :name 'my-tanh)\n    \n    one can simply write\n    \n        (-\u003etanh some-input :name 'my-tanh)\n    \n    Lumps instantiated in any way within a BUILD-FNN or BUILD-RNN are\n    automatically added to the network being built.\n    \n    A lump has its own NODES and DERIVATIVES matrices allocated for it\n    in which the results of the forward and backward passes are stored.\n    This is in contrast to a BPN whose NODES and DERIVATIVES\n    are those of its last constituent CLUMP.\n    \n    Since lumps almost always live within a BPN, their\n    N-STRIPES and MAX-N-STRIPES are\n    handled automagically behind the scenes.\n\n- [reader] SIZE LUMP (:SIZE)\n\n    The number of values in a single stripe.\n\n- [reader] DEFAULT-VALUE LUMP (:DEFAULT-VALUE = 0)\n\n    Upon creation or resize the lump's nodes get\n    filled with this value.\n\n- [generic-function] DEFAULT-SIZE LUMP\n\n    Return a default for the SIZE of\n    LUMP if one is not supplied at instantiation. The value is often\n    computed based on the sizes of the inputs. This function is for\n    implementing new lump types.\n\n- [reader] NODES LUMP (= NIL)\n\n    The values computed by the lump in the forward\n    pass are stored here. It is an `N-STRIPES * SIZE` matrix that has\n    storage allocated for `MAX-N-STRIPES * SIZE` elements for\n    non-weight lumps. -\u003eWEIGHT lumps have no stripes nor restrictions\n    on their shape.\n\n- [reader] DERIVATIVES LUMP\n\n    The derivatives computed in the backward pass are\n    stored here. 
This matrix is very much like NODES\n    in shape and size.\n\n#### Inputs\n\n##### Input Lump\n\n- [class] -\u003eINPUT -\\\u003eDROPOUT\n\n    A lump that has no input lumps, does not change its\n    values in the forward pass (except when DROPOUT is non-zero), and does not compute derivatives. *Clamp*\n    inputs on NODES of input lumps in SET-INPUT.\n    \n    For convenience, -\u003eINPUT can perform dropout itself although it\n    defaults to no dropout.\n    \n    ```common-lisp\n    (-\u003einput :size 10 :name 'some-input)\n    ==\u003e #\u003c-\u003eINPUT SOME-INPUT :SIZE 10 1/1 :NORM 0.00000\u003e\n    ```\n\n\n- [accessor] DROPOUT -\\\u003eINPUT (= NIL)\n\n    See DROPOUT.\n\n##### Embedding Lump\n\nThis lump is like an input and a simple activation molded together\nin the name of efficiency.\n\n- [class] -\u003eEMBEDDING LUMP\n\n    Select rows of WEIGHTS, one row for each index in\n    INPUT-ROW-INDICES. This lump is equivalent to adding an -\u003eINPUT lump\n    with a one hot encoding scheme and a -\u003eV\\*M lump on top of it, but it\n    is more efficient in execution and in memory usage, because it works\n    with a sparse representation of the input.\n    \n    The SIZE of this lump is the number of columns of WEIGHTS which is\n    determined automatically.\n    \n    ```common-lisp\n    (-\u003eembedding :weights (-\u003eweight :name 'embedding-weights\n                                    :dimensions '(3 5))\n                 :name 'embeddings)\n    ==\u003e #\u003c-\u003eEMBEDDING EMBEDDINGS :SIZE 5 1/1 :NORM 0.00000\u003e\n    ```\n\n\n- [reader] WEIGHTS -\\\u003eEMBEDDING (:WEIGHTS)\n\n    A weight lump whose rows indexed by\n    INPUT-ROW-INDICES are copied to the output of this lump.\n\n- [reader] INPUT-ROW-INDICES -\\\u003eEMBEDDING (:INPUT-ROW-INDICES)\n\n    A sequence of batch size length of row indices. 
To
    be set in SET-INPUT.

#### Weight Lump

- [class] ->WEIGHT LUMP

    A set of optimizable parameters of some kind. When
    a BPN is trained (see @MGL-BP-TRAINING) the NODES of weight lumps
    will be changed. Weight lumps perform no computation.

    Weights can be created by specifying the total size or the
    dimensions:

    ```common-lisp
    (dimensions (->weight :size 10 :name 'w))
    => (1 10)
    (dimensions (->weight :dimensions '(5 10) :name 'w))
    => (5 10)
    ```


- [reader] DIMENSIONS -\>WEIGHT (:DIMENSIONS)

    NODES and DERIVATIVES of this lump will be
    allocated with these dimensions.

- [macro] WITH-WEIGHTS-COPIED (FROM-BPN) &BODY BODY

    In BODY ->WEIGHT will first look up whether a weight lump of the same
    name exists in FROM-BPN and return that, or else create a weight
    lump normally. If FROM-BPN is NIL, then no weights are copied.

#### Activations

##### Activation Subnet

So we have some inputs. Usually the next step is to multiply the
input vector with a weight matrix and add biases.
This can be done
directly with ->+, ->V\*M and ->WEIGHT, but it's more convenient to
use activation subnets to reduce the clutter.

- [class] ->ACTIVATION BPN

    Activation subnetworks are built by the function
    ->ACTIVATION and they have a number of lumps hidden inside them.
    Ultimately, this subnetwork computes a sum like `sum_i x_i * W_i +
    sum_j y_j .* V_j + biases` where `x_i` are input lumps, `W_i` are
    dense matrices representing connections, while `V_j` are peephole
    connection vectors that are multiplied in an elementwise manner with
    their corresponding input `y_j`.

- [function] ->ACTIVATION INPUTS &KEY (NAME (GENSYM)) SIZE PEEPHOLES (ADD-BIAS-P T)

    Create a subnetwork of class ->ACTIVATION that computes the
    activation from dense connections from lumps in INPUTS, and
    elementwise connections from lumps in PEEPHOLES. Create new ->WEIGHT
    lumps as necessary. INPUTS and PEEPHOLES can be a single lump or a
    list of lumps. Finally, if ADD-BIAS-P, then add an elementwise bias
    too. SIZE must be specified explicitly, because it is not possible
    to determine it unless there are peephole connections.

    ```common-lisp
    (->activation (->input :size 10 :name 'input) :name 'h1 :size 4)
    ==> #<->ACTIVATION (H1 :ACTIVATION) :STRIPES 1/1 :CLUMPS 4>
    ```

    This is the basic workhorse of neural networks, which takes care of
    the linear transformation whose results are then fed to some
    non-linearity (->SIGMOID, ->TANH, etc).

    The name of the subnetwork clump is `(,NAME :ACTIVATION)`. The bias
    weight lump (if any) is named `(:BIAS ,NAME)`. Dense connection
    weight lumps are named after the input and NAME: `(,(NAME
    INPUT) ,NAME)`, while peephole weight lumps are named `(,(NAME
    INPUT) ,NAME :PEEPHOLE)`.
This is useful to know if, for example,
    they are to be initialized differently.

##### Batch-Normalization

- [class] ->BATCH-NORMALIZED LUMP

    This is an implementation of v3 of the [Batch
    Normalization paper](http://arxiv.org/abs/1502.03167). The output of
    ->BATCH-NORMALIZED is its input normalized so that for all elements
    the mean across stripes is zero and the variance is 1. That is, the
    mean of the batch is subtracted from the inputs and they are
    rescaled by their sample stddev. Actually, after the normalization
    step the values are rescaled and shifted (but this time with learnt
    parameters) in order to keep the representational power of the model
    the same. The primary purpose of this lump is to speed up learning,
    but it also acts as a regularizer. See the paper for the details.

    To normalize the output of LUMP with no additional
    regularizer effect:

    ```commonlisp
    (->batch-normalized lump :batch-size :use-population)
    ```

    The above uses an exponential moving average to estimate the mean
    and variance of batches and these estimations are used at both
    training and test time. In contrast to this, the published version
    uses the sample mean and variance of the current batch at training
    time which injects noise into the process. The noise is higher for
    lower batch sizes and has a regularizing effect. This is the default
    behavior (equivalent to `:BATCH-SIZE NIL`):

    ```commonlisp
    (->batch-normalized lump)
    ```

    For performance reasons one may wish to process a higher number of
    instances in a batch (in the sense of N-STRIPES) and get the
    regularization effect associated with a lower batch size. This is
    possible by setting :BATCH-SIZE to a divisor of the number of
    stripes.
Say, the number of stripes is 128, but we want as much\n    regularization as we would get with 32:\n    \n    ```commonlisp\n    (-\u003ebatch-normalized lump :batch-size 32)\n    ```\n    \n    The primary input of -\u003eBATCH-NORMALIZED is often an -\u003eACTIVATION and\n    its output is fed into an activation function (see\n    @MGL-BP-ACTIVATION-FUNCTIONS).\n\n- [reader] BATCH-NORMALIZATION -\\\u003eBATCH-NORMALIZED (:NORMALIZATION)\n\n    The -\u003eBATCH-NORMALIZATION of this lump. May be\n    shared between multiple -\u003eBATCH-NORMALIZED lumps.\n    \n    Batch normalization is special in that it has state apart from the\n    computed results (NODES) and its derivatives (DERIVATIVES). This\n    state is the estimated mean and variance of its inputs and they\n    are encapsulated by -\u003eBATCH-NORMALIZATION.\n    \n    If NORMALIZATION is not given at instantiation, then a new\n    -\u003eBATCH-NORMALIZATION object will be created automatically,\n    passing :BATCH-SIZE, :VARIANCE-ADJUSTMENT, and :POPULATION-DECAY\n    arguments on to -\u003eBATCH-NORMALIZATION. See BATCH-SIZE, VARIANCE-ADJUSTMENT and POPULATION-DECAY. New scale and shift weight lumps will be\n    created with names:\n    \n        `(,name :scale)\n        `(,name :shift)\n    \n    where `NAME` is the NAME of this lump.\n    \n    This default behavior covers the use-case where the statistics\n    kept by -\u003eBATCH-NORMALIZATION are to be shared only between time\n    steps of an RNN.\n\n- [class] -\u003eBATCH-NORMALIZATION -\\\u003eWEIGHT\n\n    The primary purpose of this class is to hold the\n    estimated mean and variance of the inputs to be normalized and allow\n    them to be shared between multiple -\u003eBATCH-NORMALIZED lumps that\n    carry out the computation. 
These estimations are saved and loaded by
    SAVE-STATE and LOAD-STATE.

    ```commonlisp
    (->batch-normalization (->weight :name '(h1 :scale) :size 10)
                           (->weight :name '(h1 :shift) :size 10)
                           :name '(h1 :batch-normalization))
    ```


- [reader] SCALE -\>BATCH-NORMALIZATION (:SCALE)

    A weight lump of the same size as SHIFT. This is
    $\gamma$ in the paper.

- [reader] SHIFT -\>BATCH-NORMALIZATION (:SHIFT)

    A weight lump of the same size as SCALE. This is
    $\beta$ in the paper.

- [reader] BATCH-SIZE -\>BATCH-NORMALIZATION (:BATCH-SIZE = NIL)

    Normally all stripes participate in the batch.
    Lowering the number of stripes may increase the regularization
    effect, but it also makes the computation less efficient. By
    setting BATCH-SIZE to a divisor of N-STRIPES one can decouple the
    concern of efficiency from that of regularization. The default
    value, NIL, is equivalent to N-STRIPES. BATCH-SIZE only affects
    training.

    With the special value :USE-POPULATION, instead of the mean and
    the variance of the current batch, use the population statistics
    for normalization. This effectively cancels the regularization
    effect, leaving only the faster learning.

- [reader] VARIANCE-ADJUSTMENT -\>BATCH-NORMALIZATION (:VARIANCE-ADJUSTMENT = 1.0e-4)

    A small positive real number that's added to the
    sample variance. This is $\epsilon$ in the paper.

- [reader] POPULATION-DECAY -\>BATCH-NORMALIZATION (:POPULATION-DECAY = 0.99)

    While training, an exponential moving average of
    batch means and standard deviations (termed *population
    statistics*) is updated. When making predictions, normalization is
    performed using these statistics.
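As a rough sketch of how such population statistics evolve (a generic exponential moving average for illustration, not MGL's actual implementation), with `decay` playing the role of POPULATION-DECAY:

```python
# Hypothetical illustration: exponential moving average of batch mean
# and variance, as used for batch normalization at prediction time.
def update_population_stats(pop_mean, pop_var, batch, decay=0.99):
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
    # Old statistics are kept with weight `decay`; the current batch
    # is mixed in with weight (1 - decay).
    new_mean = decay * pop_mean + (1 - decay) * batch_mean
    new_var = decay * pop_var + (1 - decay) * batch_var
    return new_mean, new_var

m, v = update_population_stats(0.0, 1.0, [1.0, 3.0])
```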
These population statistics are
    persisted by SAVE-STATE.

- [function] ->BATCH-NORMALIZED-ACTIVATION INPUTS &KEY (NAME (GENSYM)) SIZE PEEPHOLES BATCH-SIZE VARIANCE-ADJUSTMENT POPULATION-DECAY

    A utility function that creates an ->ACTIVATION, wraps it in
    ->BATCH-NORMALIZED, and also creates, together with its
    BATCH-NORMALIZATION, the two weight lumps for the scale and shift
    parameters. `(->BATCH-NORMALIZED-ACTIVATION INPUTS :NAME 'H1 :SIZE
    10)` is equivalent to:

    ```commonlisp
    (->batch-normalized (->activation inputs :name 'h1 :size 10 :add-bias-p nil)
                        :name '(h1 :batch-normalized-activation))
    ```

    Note how biases are turned off since normalization will cancel them
    anyway (but a shift is added which amounts to the same effect).

#### Activation Functions

Now we are moving on to the most important non-linearities to which
activations are fed.

##### Sigmoid Lump

- [class] ->SIGMOID -\>DROPOUT LUMP

    Applies the `1/(1 + e^{-x})` function elementwise
    to its inputs. This is one of the classic non-linearities for neural
    networks.

    For convenience, ->SIGMOID can perform dropout itself although it
    defaults to no dropout.

    ```common-lisp
    (->sigmoid (->activation (->input :size 10) :size 5) :name 'this)
    ==> #<->SIGMOID THIS :SIZE 5 1/1 :NORM 0.00000>
    ```

    The SIZE of this lump is the size of its input which is determined
    automatically.

- [accessor] DROPOUT -\>SIGMOID (= NIL)

    See DROPOUT.

##### Tanh Lump

- [class] ->TANH LUMP

    Applies the TANH function to its input in an
    elementwise manner.
The SIZE of this lump is the size of its input
    which is determined automatically.

##### Scaled Tanh Lump

- [class] ->SCALED-TANH LUMP

    Pretty much like TANH, but its input and output are
    scaled in such a way that the variance of its output is close to 1
    if the variance of its input is close to 1, which is a nice property
    to combat vanishing gradients. The actual function is `1.7159 *
    tanh(2/3 * x)`. The SIZE of this lump is the size of its input which
    is determined automatically.

##### Relu Lump

We are somewhere around year 2007 by now.

- [class] ->RELU LUMP

    `max(0,x)` activation function. Be careful, relu
    units can get stuck in the off state: if they move too far into
    negative territory, it can be very difficult for them to get out of
    it. The SIZE of this lump is the size of its input which is
    determined automatically.

##### Max Lump

We are in about year 2011.

- [class] ->MAX LUMP

    This is basically maxout without dropout (see
    http://arxiv.org/abs/1302.4389).
It groups its inputs by
    GROUP-SIZE, and outputs the maximum of each group.
    The SIZE of the output is automatically calculated; it is the size
    of the input divided by GROUP-SIZE.

    ```common-lisp
    (->max (->input :size 120) :group-size 3 :name 'my-max)
    ==> #<->MAX MY-MAX :SIZE 40 1/1 :NORM 0.00000 :GROUP-SIZE 3>
    ```

    The advantage of ->MAX over ->RELU is that gradient flow is never
    stopped, so there is no problem of units getting stuck in the off
    state.

- [reader] GROUP-SIZE -\>MAX (:GROUP-SIZE)

    The number of inputs in each group.

##### Min Lump

- [class] ->MIN LUMP

    Same as ->MAX, but it computes the MIN of groups.
    Rarely useful.

- [reader] GROUP-SIZE -\>MIN (:GROUP-SIZE)

    The number of inputs in each group.

##### Max-Channel Lump

- [class] ->MAX-CHANNEL LUMP

    Called LWTA (Local Winner Take All) or
    Channel-Out (see http://arxiv.org/abs/1312.1909) in the literature,
    it is basically ->MAX, but instead of producing one output per
    group, it just produces zeros for all units but the one with the
    maximum value in the group. This allows the next layer to get some
    information about the path along which information flowed. The SIZE
    of this lump is the size of its input which is determined
    automatically.

- [reader] GROUP-SIZE -\>MAX-CHANNEL (:GROUP-SIZE)

    The number of inputs in each group.

#### Losses

Ultimately, we need to tell the network what to learn which means
that the loss function to be minimized needs to be constructed as
part of the network.

##### Loss Lump

- [class] ->LOSS -\>SUM

    Calculate the loss for the instances in the batch.
    The main purpose of this lump is to provide a training signal.

    An error lump is usually a leaf in the graph of lumps (i.e.
there
    are no other lumps whose input is this one). The special thing about
    error lumps is that 1 (but see IMPORTANCE) is added automatically to
    their derivatives. Error lumps have exactly one node (per stripe)
    whose value is computed as the sum of nodes in their input lump.

- [accessor] IMPORTANCE -\>LOSS (:IMPORTANCE = NIL)

    This is to support weighted instances, that is,
    when not all training instances are equally important. If non-NIL,
    a 1d MAT with the importances of stripes of the batch. When
    IMPORTANCE is given (typically in SET-INPUT), then instead of
    adding 1 to the derivatives of all stripes, IMPORTANCE is added
    elementwise.

##### Squared Difference Lump

In regression, the squared error loss is most common. The squared
error loss can be constructed by combining ->SQUARED-DIFFERENCE with
a ->LOSS.

- [class] ->SQUARED-DIFFERENCE LUMP

    This lump takes two input lumps and calculates
    their squared difference `(x - y)^2` in an elementwise manner. The
    SIZE of this lump is automatically determined from the size of its
    inputs.
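In scalar terms, the combination of ->SQUARED-DIFFERENCE and ->LOSS amounts to the following (an illustrative Python sketch, not MGL code):

```python
# Elementwise squared difference of two equal-sized inputs, followed
# by the sum that a ->LOSS lump on top would compute per stripe.
def squared_difference(xs, ys):
    return [(x - y) ** 2 for x, y in zip(xs, ys)]

def summed_loss(values):
    return sum(values)

# E.g. predictions (1, 2) against targets (0, 0) give a loss of 5.
loss = summed_loss(squared_difference([1.0, 2.0], [0.0, 0.0]))
```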
This lump is often fed into ->LOSS that sums the squared
    differences and makes it part of the function to be minimized.

    ```common-lisp
    (->loss (->squared-difference (->activation (->input :size 100)
                                                :size 10)
                                  (->input :name 'target :size 10))
            :name 'squared-error)
    ==> #<->LOSS SQUARED-ERROR :SIZE 1 1/1 :NORM 0.00000>
    ```

    Currently this lump is not CUDAized, but it will copy data from the
    GPU if it needs to.

##### Softmax Cross-Entropy Loss Lump

- [class] ->SOFTMAX-XE-LOSS LUMP

    A specialized lump that computes the softmax of its
    input in the forward pass and backpropagates a cross-entropy loss.
    The advantage of doing these together is numerical stability. The
    total cross-entropy is the sum of cross-entropies per group of
    GROUP-SIZE elements:

    $$
    XE(x) = - \sum_{i=1,g} t_i \ln(s_i),
    $$

    where `g` is the number of classes (GROUP-SIZE), `t_i` are the
    targets (i.e. the true probabilities of the class, often all zero
    but one), and `s_i` is the output of softmax calculated from input
    `X`:

    $$
    s_i = \mathrm{softmax}(x_1, x_2, \ldots, x_g) =
      \frac{e^{x_i}}{\sum_{j=1,g} e^{x_j}}
    $$

    In other words, in the forward phase this lump takes input `X`,
    computes its elementwise EXP, and normalizes each group of
    GROUP-SIZE elements to sum to 1 to get
    the softmax, which is the result that goes into NODES.
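The forward computation just described can be sketched for a single group as follows (plain Python for illustration; MGL does this on matrices, one group of GROUP-SIZE elements at a time):

```python
import math

# Softmax over one group, then the cross-entropy against the target
# probabilities t_i, i.e. XE(x) = - sum_i t_i * ln(s_i).
def softmax(xs):
    # Subtracting the maximum is the usual numerical-stability trick;
    # it does not change the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, probs):
    return -sum(t * math.log(s) for t, s in zip(targets, probs) if t != 0)

s = softmax([1.0, 2.0, 3.0])
# With a one-hot target, the loss reduces to -ln of the probability
# assigned to the true class.
xe = cross_entropy([0, 0, 1], s)
```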
In the\n    backward phase, there are two sources of gradients: the lumps that\n    use the output of this lump as their input (currently not\n    implemented and would result in an error) and an implicit\n    cross-entropy loss.\n    \n    One can get the cross-entropy calculated in the most recent forward\n    pass by calling COST on this lump.\n    \n    This is the most common loss function for classification. In fact,\n    it is nearly ubiquitous. See the @MGL-FNN-TUTORIAL and the\n    @MGL-RNN-TUTORIAL for how this loss and SET-INPUT work together.\n\n- [reader] GROUP-SIZE -\\\u003eSOFTMAX-XE-LOSS (:GROUP-SIZE)\n\n    The number of elements in a softmax group. This is\n    the number of classes for classification. Often GROUP-SIZE is\n    equal to SIZE (it is the default), but in general the only\n    constraint is that SIZE is a multiple of GROUP-SIZE.\n\n- [accessor] TARGET -\\\u003eSOFTMAX-XE-LOSS (:TARGET = NIL)\n\n    Set in SET-INPUT, this is either a MAT of the same\n    size as the input lump `X` or if the target is very sparse, this\n    can also be a sequence of batch size length that contains the\n    index value pairs of non-zero entries:\n    \n        (;; first instance in batch has two non-zero targets\n         (;; class 10 has 30% expected probability\n          (10 . 0.3)\n          ;; class 2 has 70% expected probability\n          (2 .  0.7))\n         ;; second instance in batch puts 100% on class 7\n         7\n         ;; more instances in the batch follow\n         ...)\n    \n    Actually, in the rare case where GROUP-SIZE is not SIZE (i.e. there are several softmax\n    normalization groups for every example), the length of the above\n    target sequence is BATCH-SIZE \\* N-GROUPS. 
Indices are always\n    relative to the start of the group.\n    \n    If GROUP-SIZE is large (for example,\n    in neural language models with a huge number of words), using\n    sparse targets can make things go much faster, because calculation\n    of the derivative is no longer quadratic.\n    \n    Giving different weights to training instances is implicitly\n    supported. While target values in a group should sum to 1,\n    multiplying all target values with a weight `W` is equivalent to\n    training that `W` times on the same example.\n\n- [function] ENSURE-SOFTMAX-TARGET-MATRIX SOFTMAX-XE-LOSS N\n\n    Set TARGET of SOFTMAX-XE-LOSS to a MAT capable of holding the dense\n    target values for N stripes.\n\n#### Stochasticity\n\n##### Dropout Lump\n\n- [class] -\u003eDROPOUT LUMP\n\n    The output of this lump is identical to its input,\n    except it randomly zeroes out some of them during training which act\n    as a very strong regularizer. See Geoffrey Hinton's 'Improving\n    neural networks by preventing co-adaptation of feature\n    detectors'.\n    \n    The SIZE of this lump is the size of its input which is determined\n    automatically.\n\n- [accessor] DROPOUT -\\\u003eDROPOUT (:DROPOUT = 0.5)\n\n    If non-NIL, then in the forward pass zero out each\n    node in this chunk with DROPOUT probability.\n\n##### Gaussian Random Lump\n\n- [class] -\u003eGAUSSIAN-RANDOM LUMP\n\n    This lump has no input, it produces normally\n    distributed independent random numbers with MEAN and VARIANCE (or\n    VARIANCE-FOR-PREDICTION). 
This is a useful building block for noise-based
    regularization methods.

    ```common-lisp
    (->gaussian-random :size 10 :name 'normal :mean 1 :variance 2)
    ==> #<->GAUSSIAN-RANDOM NORMAL :SIZE 10 1/1 :NORM 0.00000>
    ```


- [accessor] MEAN -\>GAUSSIAN-RANDOM (:MEAN = 0)

    The mean of the normal distribution.

- [accessor] VARIANCE -\>GAUSSIAN-RANDOM (:VARIANCE = 1)

    The variance of the normal distribution.

- [accessor] VARIANCE-FOR-PREDICTION -\>GAUSSIAN-RANDOM (:VARIANCE-FOR-PREDICTION = 0)

    If not NIL, then this value overrides VARIANCE
    when not in training (i.e. when making predictions).

##### Binary Sampling Lump

- [class] ->SAMPLE-BINARY LUMP

    Treating values of its input as probabilities,
    sample independent binomials. Turn true into 1 and false into 0. The
    SIZE of this lump is determined automatically from the size of its
    input.

    ```common-lisp
    (->sample-binary (->input :size 10) :name 'binarized-input)
    ==> #<->SAMPLE-BINARY BINARIZED-INPUT :SIZE 10 1/1 :NORM 0.00000>
    ```


#### Arithmetic

##### Sum Lump

- [class] ->SUM LUMP

    Computes the sum of all nodes of its input per
    stripe. The SIZE of this lump is always 1.

##### Vector-Matrix Multiplication Lump

- [class] ->V*M LUMP

    Perform `X * WEIGHTS` where `X` (the input) is of
    size `M` and WEIGHTS is a ->WEIGHT whose single stripe is taken to
    be of dimensions `M x N` stored in row major order. `N` is the size
    of this lump.
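In index terms, and ignoring TRANSPOSE-WEIGHTS-P, the forward pass computes the following (an illustrative Python sketch of the row-major arithmetic, not MGL code):

```python
# x has M elements; weights is an M x N matrix stored row-major as a
# flat list, so element (i, j) lives at index i * n + j.
def v_times_m(x, weights, n):
    m = len(x)
    return [sum(x[i] * weights[i * n + j] for i in range(m))
            for j in range(n)]

# A 2-vector times a 2x3 matrix yields a 3-vector.
y = v_times_m([1.0, 2.0], [1.0, 2.0, 3.0,
                           4.0, 5.0, 6.0], n=3)
```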
If TRANSPOSE-WEIGHTS-P then WEIGHTS is `N x M` and `X\n    * WEIGHTS'` is computed.\n\n- [reader] WEIGHTS -\\\u003eV\\*M (:WEIGHTS)\n\n    A -\u003eWEIGHT lump.\n\n- [reader] TRANSPOSE-WEIGHTS-P -\\\u003eV\\*M (:TRANSPOSE-WEIGHTS-P = NIL)\n\n    Determines whether the input is multiplied by\n    WEIGHTS or its transpose.\n\n##### Elementwise Addition Lump\n\n- [class] -\u003e+ LUMP\n\n    Performs elementwise addition on its input lumps.\n    The SIZE of this lump is automatically determined from the size of\n    its inputs if there is at least one. If one of the inputs is a\n    -\u003eWEIGHT lump, then it is added to every stripe.\n    \n    ```common-lisp\n    (-\u003e+ (list (-\u003einput :size 10) (-\u003eweight :size 10 :name 'bias))\n         :name 'plus)\n    ==\u003e #\u003c-\u003e+ PLUS :SIZE 10 1/1 :NORM 0.00000\u003e\n    ```\n\n\n##### Elementwise Multiplication Lump\n\n- [class] -\u003e* LUMP\n\n    Performs elementwise multiplication on its two\n    input lumps. The SIZE of this lump is automatically determined from\n    the size of its inputs. Either input can be a -\u003eWEIGHT lump.\n    \n    ```common-lisp\n    (-\u003e* (-\u003einput :size 10) (-\u003eweight :size 10 :name 'scale)\n         :name 'mult)\n    ==\u003e #\u003c-\u003e* MULT :SIZE 10 1/1 :NORM 0.00000\u003e\n    ```\n\n\n##### Abs Lump\n\n- [class] -\u003eABS LUMP\n\n##### Exp Lump\n\n- [class] -\u003eEXP LUMP\n\n##### Normalized Lump\n\n- [class] -\u003eNORMALIZED LUMP\n\n#### Operations for RNNs\n\n##### LSTM Subnet\n\n- [class] -\u003eLSTM BPN\n\n    Long-Short Term Memory subnetworks are built by the\n    function -\u003eLSTM and they have many lumps hidden inside them. 
These
    lumps are packaged into a subnetwork to reduce clutter.

- [function] ->LSTM INPUTS &KEY NAME CELL-INIT OUTPUT-INIT SIZE (ACTIVATION-FN '-\>ACTIVATION) (GATE-FN '-\>SIGMOID) (INPUT-FN '-\>TANH) (OUTPUT-FN '-\>TANH) (PEEPHOLES T)

    Create an LSTM layer consisting of input, forget, and output gates with
    which input, cell state and output are scaled. Lots of lumps are
    created; the final one, representing the output of the LSTM, has NAME.
    The rest of the lumps are named automatically based on NAME. This
    function returns only the output lump (`m`), but all created lumps
    are added automatically to the BPN being built.

    There are many papers and tutorials on LSTMs. This version is well
    described in "Long Short-Term Memory Recurrent Neural Network
    Architectures for Large Scale Acoustic Modeling" (2014, Hasim Sak,
    Andrew Senior, Francoise Beaufays). Using the notation from that
    paper:

    $$
    i_t = s(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} \odot
    c_{t-1} + b_i)
    $$

    $$
    f_t = s(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} \odot
    c_{t-1} + b_f)
    $$

    $$
    c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t +
    W_{cm} m_{t-1} + b_c)
    $$

    $$
    o_t = s(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} \odot
    c_t + b_o)
    $$

    $$
    m_t = o_t \odot h(c_t),
    $$

    where `i`, `f`, and `o` are the input, forget and output gates.
`c`
    is the cell state and `m` is the actual output.

    Weight matrices for connections from `c` (`W_ic`, `W_fc` and `W_oc`)
    are diagonal and represented by just the vector of diagonal values.
    These connections are only added if PEEPHOLES is true.

    A notable difference from the paper is that in addition to being a
    single lump, `x_t` (INPUTS) can also be a list of lumps. Whenever
    some activation is to be calculated based on `x_t`, it is going to
    be the sum of individual activations. For example, `W_ix * x_t` is
    really `sum_j W_ijx * inputs_j`.

    If CELL-INIT is non-NIL, then it must be a CLUMP of size SIZE which
    stands for the initial state of the value cell (`c_{-1}`). CELL-INIT
    being NIL is equivalent to the state of all zeros.

    ACTIVATION-FN defaults to ->ACTIVATION, but it can be, for example,
    ->BATCH-NORMALIZED-ACTIVATION. In general, functions like the
    aforementioned two with a signature like (INPUTS &KEY NAME SIZE
    PEEPHOLES) can be passed as ACTIVATION-FN.

##### Sequence Barrier Lump

- [class] ->SEQ-BARRIER LUMP

    In an RNN, processing of stripes (instances in the
    batch) may require different numbers of time steps, so the final state
    for stripe 0 is in stripe 0 of some lump L at time step 7, while for
    stripe 1 it is in stripe 1 of some lump L at time step 42.

    This lump copies the per-stripe states from different lumps into a
    single lump so that further processing can take place (typically
    when the RNN is embedded in another network).

    The SIZE of this lump is automatically set to the size of the lump
    returned by `(FUNCALL SEQ-ELT-FN 0)`.

- [reader] SEQ-ELT-FN -\>SEQ-BARRIER (:SEQ-ELT-FN)

    A function of an INDEX argument that returns the
    lump with that index in some sequence.

- [accessor] SEQ-INDICES -\>SEQ-BARRIER

    A sequence of batch-size length of
indices. The
    element at index `I` is the index to be passed to SEQ-ELT-FN to
    find the lump whose stripe `I` is copied to stripe `I` of this
    lump.

### Utilities

- [function] RENORMALIZE-ACTIVATIONS -\>V\*M-LUMPS L2-UPPER-BOUND

    If the l2 norm of the incoming weight vector of a unit is
    larger than L2-UPPER-BOUND, then renormalize it to L2-UPPER-BOUND.
    The list of ->V\*M-LUMPS is assumed to be eventually fed to the same
    lump.

    To use it, group the activation clumps into the same GD-OPTIMIZER
    and hang this function on AFTER-UPDATE-HOOK; the latter
    ARRANGE-FOR-RENORMALIZING-ACTIVATIONS does for you.

    See "Improving neural networks by preventing co-adaptation of
    feature detectors (Hinton, 2012)",
    <http://arxiv.org/pdf/1207.0580.pdf>.

- [function] ARRANGE-FOR-RENORMALIZING-ACTIVATIONS BPN OPTIMIZER L2-UPPER-BOUND

    By pushing a lambda onto AFTER-UPDATE-HOOK of OPTIMIZER, arrange for
    all weights being trained by OPTIMIZER to be renormalized (as in
    RENORMALIZE-ACTIVATIONS with L2-UPPER-BOUND).

    It is assumed that the weights either belong to an activation
    lump or are simply added to the activations (i.e. they are biases).

## Boltzmann Machines


## Gaussian Processes


## Natural Language Processing

###### \[in package MGL-NLP\]
This is nothing more than a couple of utilities for now, which may
grow into a more serious toolset for NLP eventually.

- [function] MAKE-N-GRAM-MAPPEE FUNCTION N

    Make a function of a single argument that's suitable as the
    function argument to a mapper function. It calls FUNCTION with every
    N consecutive elements.

    ```common-lisp
    (map nil (make-n-gram-mappee #'print 3) '(a b c d e))
    ..
    .. (A B C) 
    .. (B C D) 
    .. 
    .. (C D E) 
    ```


- [function] BLEU CANDIDATES REFERENCES &KEY CANDIDATE-KEY REFERENCE-KEY (N 4)

    Compute the [BLEU score](http://en.wikipedia.org/wiki/BLEU) for a
    bilingual corpus given by CANDIDATES and REFERENCES. BLEU measures
    how good a translation is compared to human reference translations.
    
    CANDIDATES (keyed by CANDIDATE-KEY) and REFERENCES (keyed by
    REFERENCE-KEY) are sequences of sentences. A sentence is a sequence
    of words. Words are compared with EQUAL, and may be any kind of
    object (not necessarily strings).
    
    Currently there is no support for multiple reference translations.
    N determines the largest n-grams to consider.
    
    The first return value is the BLEU score (between 0 and 1, not as
    a percentage). The second value is the sum of the lengths of
    CANDIDATES divided by the sum of the lengths of REFERENCES (or
    NIL, if the denominator is 0). The third is a list of n-gram
    precisions (also between 0 and 1 or NIL), one for each element in
    \[1..`N`\].
    
    This is basically a reimplementation of
    [multi-bleu.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl).
    
    ```common-lisp
    (bleu '((1 2 3 4) (a b))
          '((1 2 3 4) (1 2)))
    => 0.8408964
    => 1
    => (;; 1-gram precision: 4/6
        2/3
        ;; 2-gram precision: 3/4
        3/4
        ;; 3-gram precision: 2/2
        1
        ;; 4-gram precision: 1/1
        1)
    ```


### Bag of Words

- [class] BAG-OF-WORDS-ENCODER

    ENCODE all features of a document with a sparse vector. Get the
    features of the document from FEATURE-MAPPER, encode each feature
    with FEATURE-ENCODER. FEATURE-ENCODER may return NIL if the
    feature is not used. The result is a vector of
    encoded-feature/value conses.
    Encoded features are unique (under ENCODED-FEATURE-TEST) within
    the vector but are in no particular order.
    
    Depending on KIND, the value is calculated in various ways:
    
    - For :FREQUENCY it is the number of times the corresponding
      feature was found in DOCUMENT.
    
    - For :BINARY it is always 1.
    
    - :NORMALIZED-FREQUENCY and :NORMALIZED-BINARY are like the
      unnormalized counterparts except that, as the final step, values
      in the assembled sparse vector are normalized to sum to 1.
    
    - Finally, :COMPACTED-BINARY is like :BINARY, but the return value
      is not a vector of conses but a vector with element-type
      ENCODED-FEATURE-TYPE.
    
    ```common-lisp
    (let* ((feature-indexer
             (make-indexer
              (alexandria:alist-hash-table '(("I" . 3) ("me" . 2) ("mine" . 1)))
              2))
           (bag-of-words-encoder
             (make-instance 'bag-of-words-encoder
                            :feature-encoder feature-indexer
                            :feature-mapper (lambda (fn document)
                                              (map nil fn document))
                            :kind :frequency)))
      (encode bag-of-words-encoder '("All" "through" "day" "I" "me" "mine"
                                     "I" "me" "mine" "I" "me" "mine")))
    => #((0 . 3.0d0) (1 . 3.0d0))
    ```


- [reader] FEATURE-ENCODER BAG-OF-WORDS-ENCODER (:FEATURE-ENCODER)

- [reader] FEATURE-MAPPER BAG-OF-WORDS-ENCODER (:FEATURE-MAPPER)

- [reader] ENCODED-FEATURE-TEST BAG-OF-WORDS-ENCODER (:ENCODED-FEATURE-TEST = #'EQL)

- [reader] ENCODED-FEATURE-TYPE BAG-OF-WORDS-ENCODER (:ENCODED-FEATURE-TYPE = T)

- [reader] BAG-OF-WORDS-KIND BAG-OF-WORDS-ENCODER (:KIND = :BINARY)

* * *
###### \[generated by [MGL-PAX](https://github.com/melisgl/mgl-pax)\]
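
The relationship between BLEU's three return values can be made concrete with a short sketch. This is a hedged illustration, not part of MGL: `bleu-from-precisions` is a hypothetical helper that recombines the n-gram precisions (third return value) and the length ratio (second return value) into the final score using the standard BLEU formula, i.e. the brevity penalty times the geometric mean of the precisions.

```common-lisp
;; Hypothetical helper, not part of MGL: reconstruct the BLEU score
;; from the second and third return values of BLEU.
(defun bleu-from-precisions (precisions length-ratio)
  ;; Brevity penalty: 1 when the candidates are at least as long as
  ;; the references, else exp(1 - 1/length-ratio).
  (let ((brevity-penalty (if (< length-ratio 1)
                             (exp (- 1 (/ 1 length-ratio)))
                             1)))
    ;; Geometric mean of the n-gram precisions, computed as the
    ;; exponential of the arithmetic mean of their logarithms.
    (* brevity-penalty
       (exp (/ (reduce #'+ precisions :key #'log)
               (length precisions))))))

;; With the precisions from the BLEU example above:
;; (2/3 * 3/4 * 1 * 1)^(1/4) ~ 0.8408964, matching the first return value.
(bleu-from-precisions '(2/3 3/4 1 1) 1)
```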