{"id":13737730,"url":"https://github.com/rom1mouret/mpcl","last_synced_at":"2026-04-04T14:53:57.795Z","repository":{"id":215882133,"uuid":"330183831","full_name":"rom1mouret/mpcl","owner":"rom1mouret","description":"theoretical framework for continual learning / incremental learning","archived":false,"fork":false,"pushed_at":"2021-08-05T19:09:07.000Z","size":737,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-15T06:32:20.611Z","etag":null,"topics":["catastrophic-forgetting","continual-learning","incremental-learning","mnist"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rom1mouret.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-16T14:42:56.000Z","updated_at":"2022-07-17T09:02:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"91429af7-2f69-4cad-90e9-de4efebdf8f4","html_url":"https://github.com/rom1mouret/mpcl","commit_stats":null,"previous_names":["rom1mouret/mpcl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fmpcl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fmpcl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fmpcl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rom1mouret%2Fmpcl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/rom1mouret","download_url":"https://codeload.github.com/rom1mouret/mpcl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253095911,"owners_count":21853508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["catastrophic-forgetting","continual-learning","incremental-learning","mnist"],"created_at":"2024-08-03T03:01:58.667Z","updated_at":"2026-04-04T14:53:57.778Z","avatar_url":"https://github.com/rom1mouret.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Meaning-preserving Continual Learning (MPCL)\n\nThis is a follow-up to [domain_IL](https://github.com/rom1mouret/domain_IL).\nThe core idea remains the same but it is framed differently.\n\n```diff\n+ 2021 Feb update: MPCL rules are now explained in the slides.\n+ 2021 March update: the slides now introduce the problem with intuition pumps.\n```\nLink to [MPCL_v1_slides.pdf](MPCL_v1_slides.pdf)\n\nMPCL posits that latent representations acquire meaning by acting on the\noutside world.\n\nFor continual learning to be manageable in complex environments and avoid\n[catastrophic forgetting](https://github.com/rom1mouret/forgetful-networks),\nmeaning must remain stable over time. This is the core idea behind MPCL.\n\n### Situated meaning\n\nAs the inputs and outputs of algorithms have no intrinsic meaning,\nit is often the prerogative of the programmer to attach meaning to variables.\n\nThere are two kinds of meaning at play here.\n\n1. meaning that emerges from the interplay with the environment. 
For instance,\nfrogs might view insects as mere calorie dispensers. Needless to say,\nhumans don't see insects the same way.\n2. meaning from the programmer's perspective, which is roughly the same for\n[everyone](https://en.wikipedia.org/wiki/Intersubjectivity).\n\nSince the programmer's perspective (e.g. labels) is a byproduct of her environment,\nit is not too much of a leap to treat her perspective as a proxy for her environment.\nThis is how I want to get away without explicitly modeling the environment in MPCL v1.\n\nTake for instance a model categorizing X-ray images of tumors into malignant or benign.\nIf the model is deployed in a hospital, those labels have a tangible impact on the\nenvironment. Labels bridge the gap between models and their deployment environment.\nIf you swap \"malignant\" with \"benign\" without changing the model's computation,\nyou end up with a very different situation.\n\nSo MPCL has two jobs:\n\n- making sure the two kinds of meaning align. As we provide more and more training examples,\nI expect the first kind of meaning to converge towards the second kind.\n- making sure meaning remains stable over time.\n\nIn what follows, I will use \"latent units\" and \"abstraction layer\" interchangeably.\nBoth refer to the last NN layer of the processing modules. 
\n\n### Continual learning (example)\n\nYour system's training journey might start like that:\n- module A: time T0 to T1: training the model to recognize human faces.\n- module B: time T2 to T3: training the model to tell cats and dogs apart.\n- module A: time T4 to T5: training the model to get better at recognizing human\nfaces, maybe in a novel context.\n\nAt time T1, your model and its abstraction layer are perfectly fit for recognizing\nfaces within domains it was trained on.\n\nAt time T2, your model moves on to learning to distinguish between cats and dogs.\nIt can build on module A's abstraction layer but it should not interfere with it\nbecause the function of module A's abstraction layer is to recognize human faces, not\nto recognize human faces AND pets. In programming terms, module A's network is\nentirely frozen between T2 and T3.\n\nAt T4, we are back to face recognition.\nModule A's internals can be safely refined so long as it keeps on delivering\nabstractions that fulfill the same invariant function.\nThis might interfere with module B, but not in a\n[destructive manner](https://en.wikipedia.org/wiki/Catastrophic_interference).\n\n### Theoretical framework\n\nIn broad terms, we anchor latent representations by forcing them\nto remain good at realizing the function they were originally trained for.\nHow the system realizes the\nfunction-that-latent-representations-were-originally-trained-for *is* what\ndefines latent representations.\n\n(By \"function\" I do not mean the mapping from sensory inputs to labels.\nI mean the functions that map the abstraction layer to labels, or other kinds of outputs.\nRecognizing faces in photographs fulfills the same function as recognizing faces\nin real life.\nYour friend's face is an abstraction in virtue of the realizability of the same\nrecognition function across many domains/contexts.)\n\nNow, my goal is to formalize this idea and characterize the \"meaning of latent\nunits\" in more rigorous terms.\n\n[link 
to MPCL Framework v1 pdf](MPCL-Framework-v1.pdf)\n\n\n### Domain generalization\n\nBoth continual learning and [domain generalization](https://arxiv.org/pdf/2103.02503.pdf)\ntechniques try to create rich representations in a non-i.i.d. setting.\nDomain generalization can be seen as a special case of\n[domain-IL](https://arxiv.org/pdf/1904.07734v1.pdf) wherein most of\nthe hard work is done *before* observing out-of-distribution data.\n\nOne of the most promising approaches to domain generalization is causal\nrepresentation learning. The idea of uncovering the causal structure of the\nproblem is similar to MPCL's meaning alignment.\n\nMPCL is akin to inverse domain generalization.\nInstead of building high-quality abstraction hypotheses that are meant to\ngeneralize well to unknown domains from the get go, the MPCL way would be to\ntry out hypotheses over many domains to assess their generalization power.\nIf they pass the quality test, they are entrusted with stronger, longer-lasting\nconnectivity with other modules.\n\n# Proto-MPCL\n\nProto-MPCL is a solution to catastrophic forgetting wherein each input domain\nis routed to its own dedicated processor. We rely on finding inconsistencies and\ndiscrepancies to detect domain boundaries.\n\nMPCL is not strongly committed to that way of overcoming forgetting,\nbut this multi-processor approach makes it easier to make sense of generalization,\nabstraction and transferability from one domain to another.\n\nI will attempt to frame bare-bones [Domain-IL and Class-IL](https://arxiv.org/pdf/1904.07734v1.pdf)\nwithin this framework on MNIST and EMNIST datasets.\nAdmittedly, it takes a bit of shoehorning because simple systems don't create\nmany opportunities for Proto-MPCL to take advantage of the complexity to find\ninconsistencies, for example by cross-checking predictions from multiple modules\nof the system.  
\n\nMoreover, complex systems are generally upheld by high-dimensional latent\nrepresentations, with a [lot of empty space](https://en.wikipedia.org/wiki/Curse_of_dimensionality#Blessing_of_dimensionality)\nwherein you would typically spot inconsistent configurations of latent values.\n\nYou will notice that the implemented system doesn't showcase any of the\ninter-module [MPCL rules](MPCL_v1_slides.pdf).\nThis is because MNIST and EMNIST are not readily modularizable problems, so\nwe are essentially stuck with the one-module scenario.\n\nSince we are not dealing with how modules are connected with one another,\nI am dubbing the one-module case \"Proto-MPCL\". For MPCL to work at a larger scale,\nwe need Proto-MPCL to do quite well on isolated continual-learning tasks. On paper,\nit doesn't need to be perfect though. As I said above, the more\nmodules you combine to satisfy various goals, the easier it gets to find\nways of detecting inconsistencies between the modules' outputs.\n\n## Domain-IL classification on Permuted MNIST\n\nIn this scenario, the most straightforward approach is to tether the\nabstraction layer directly to the output classes.\nTo do so, we train a classifier to map the abstraction layer to classes on an arbitrary\ndomain and freeze the classifier right away.\n\nFollowing the naming conventions in [MPCL-Framework-v1](MPCL-Framework-v1.pdf),\n*C* is a softmax classifier and *e* is negative cross-entropy.\n\nNext, we train a processor for each domain under the constraint that each processor must\nbe good at predicting the target labels.\n\nThe tricky part is to detect domain boundaries when making predictions. We have\nto resort to using a surrogate function.\n\nHere are the results on [Permuted MNIST](https://arxiv.org/pdf/1312.6211.pdf),\nusing the confidence of *C* as a surrogate for the degree to which the output\nof a processor \"conveys meaning\".\n\n![Permuted MNIST](images/mnist_results.png)\n\nMultiple rounds are plotted. 
The average is represented with the most opaque red.\n\ncode: [mnist_domain_il.py](mnist_domain_il.py)\n\n## Class-IL on EMNIST\n\nFor Class-IL, I chose EMNIST because it includes more classes than MNIST.\nThe anchoring step is performed on EMNIST digits while the continual learning\nand the evaluation are done on EMNIST letters.\n\nProto-MPCL doesn't lend itself well to Class-IL, but with a bit of trickery,\nEMNIST Class-IL can be expressed as a Domain-IL problem.\n\n- instead of training a classifier to recognize characters, have it predict how\ncharacters are rotated or flipped. This will anchor the latent units just as\nwell as letter recognition.\n- interpret each letter as a distinct domain.\n\nThe training algorithm is essentially a copy-paste of MNIST Domain-IL except\nthat we randomly rotate/flip the training digits on the fly.\n\n#### predicting\n\nAll 7 transformations (90-degree rotation, horizontal flip, etc.) are controlled by\nus. Therefore, the expected output of the classifier is always known, even on\nthe test set.\n\nTo make predictions, we apply every transformation to each letter and choose the\ndomain for which the classifier most accurately predicts how the letter has\nbeen transformed.\n\n![EMNIST](images/emnist_results.png)\n\nAs with Domain-IL, the model was trained with `increment=1`, the most challenging scenario.\n\ncode: [emnist_class_il.py](emnist_class_il.py), [plot_emnist.py](plot_emnist.py)\n\nThe green curve shows how a collection of one-class classifiers would perform,\nunder some simplifying assumptions.\n\n- false positive rate is 5%, in the same ballpark as my classifier.\n- false negative rate is 0%.\n- if multiple classifiers report a positive match, we randomly choose between them.\n\nIn simple settings such as MNIST and EMNIST, MPCL boils down to self-supervised\noutlier detection. 
I believe it will prove more fruitful in more complex settings.\n\n## Triangular activation\n\nThe layers of the processors are activated by an unusual function.\n\n```python3\nimport torch\n\nclass Triangle(torch.nn.Module):\n    def forward(self, x):\n        # fold the input around zero; nothing is clipped\n        return x.abs()\n```\n\nWith ReLU and the like, out-of-distribution inputs saturate the activation\nfunctions almost as much as in-distribution inputs, which makes ReLU networks\nill-equipped for measuring confidence levels on OOD data.\n\nIt is not a matter of calibration.\nIt's fine for a classifier to be overconfident / underconfident in a consistent\nfashion, but we can't afford for the network to throw in the towel when it\nencounters something it doesn't know.  \n\nI prefer `Triangle` because it doesn't throw anything away, not even noise.\nIt merely folds the input.\n\nDisclaimer: I haven't studied in depth the dynamics of Triangle-activated\nnetworks so it's possible that Triangle doesn't do what I think it does.\nI have noticed improvements on Domain-IL though.\n\n## Softmax\n\nSoftmaxing the last layer of the classifier was a mistake and I will try without\nsoftmax in subsequent experiments.\nSoftmax is not a good choice for MPCL because there are infinitely many solutions to\n`softmax(x) = y` (adding the same constant to every component of `x` leaves the\noutput unchanged), thus it doesn't constrain `x` enough for `x` (or anything\nupstream of `x`) to be transferable to other modules.\n\nIt doesn't affect Permuted MNIST and EMNIST results that much because latent\nrepresentations are not transferred to other modules in our experimental setup.\n\n\n## Terminology\n\nA *group* of latent units has a *function*.\nWithin this group, individual latent units have a *meaning*.\n\nIt doesn't have to be called \"meaning\", but the word \"meaning\" is intuitive\nfor a number of reasons. 
First, it has an obvious connection with\nintelligibility, a prerequisite to qualifying behavior patterns as rational\nor intelligent.\nFurthermore, the typical way of detecting inconsistencies is to look for configurations\nof latent values that are *meaningless*, in the sense that they do not realize\nany function.\n\nI plan on extending MPCL to function-free intrinsic meaning (latent units that get\ntheir meaning from adjacent units), in contrast to extrinsic meaning (latent\nunits that get their meaning from external labels/feedback).\n\n## FAQ\n\n##### \u003e Must domain labels be known at training time?\n\nYes, this is how I have evaluated my models.\nIt is not a hard constraint from the framework though.\n\n##### \u003e Must domain labels be known at runtime?\n\nNo.\n\n##### \u003e Must module labels be known at runtime?\n\nYes. It is not a hard constraint from the framework either.\n\n##### \u003e Must task labels be known at runtime?\n\nI find the idea of task confusing so it is no longer part of the framework.\n\nThere is a single task in Permuted MNIST, that of classifying digits.\n\nThe so-called Permuted MNIST tasks are treated as domains/contexts by MPCL,\nso they don't need to be known at runtime.\n\n##### \u003e Can it be used for regression?\n\nI haven't tried regression yet.\nIt comes with a few challenges.\n1. If the regression model has only one numerical output,\nit won't be enough to constrain multi-dimensional latent layers,\nunless latent values are sparse or binary.\n2. 
I am not sure what would be the best way to detect inconsistencies from numerical outputs.\nPerhaps an ensemble of regressors could reveal discrepancies, or a Bayesian NN.\n\nA workaround is to train the processors with a surrogate classification loss and have the classifier predict\nboth labels and the desired numerical targets at the same time.\n\n##### \u003e Can an MPCL system run at fixed capacity?\n\nNo, the system keeps growing as it learns new domains,\nbut infrequently used domains can be safely removed to free up some space.\nAlternatively, they can be distilled down to smaller models.\n\n##### \u003e Can models be revisited later on when new training examples are made available?\n\nYes.\n\n\n##### \u003e If domains A and B are similar, does learning A help with learning B?\n\nNot in vanilla MPCL.\nBut nothing stops you from implementing multi-task learning techniques orthogonally to MPCL.\nFor instance, you can [implement soft parameter sharing](https://ruder.io/multi-task/index.html#softparametersharing) between processors.\n\n##### \u003e Can trained abstraction layers be safely connected to other modules without interference risks?\n\nYes, that's the whole point of MPCL.\nWhen new domains are learned, it is beneficial to downstream modules,\nrather than destructive, as in this [zero-shot learning experiment](https://github.com/rom1mouret/domain_IL).\n\n##### \u003e Is there any limitation to the expressive power of MPCL models?\n\nFeature processors can be arbitrarily complex,\nthough it is better if they don't aggressively filter noise out.\nThey don't have to be differentiable.\nHowever, the [models mapping latent representations to external labels/actions are highly constrained](MPCL-Framework-v1.pdf).\nIn practice, a linear mapping should do fine.\n\n##### \u003e Does MPCL operate on the same level as common continual learning algorithms such as [EWC](https://arxiv.org/pdf/1612.00796.pdf)?\n\nNot exactly.\nThe starting point of MPCL is a principle 
as abstract as [Hebb's rule](https://en.wikipedia.org/wiki/Hebbian_theory) or the [free energy principle](https://en.wikipedia.org/wiki/Free_energy_principle).\nSimply put, this principle states that situated meaning must remain stable across time.\nMPCL is an attempt to derive an actionable framework from this (somewhat vague) principle.\n\nMoreover, MPCL deals with modules.  \n\n##### \u003e Must processors be trained one class at a time in the Class-IL setting?\n\nYes. If you get your data with `increment\u003e1`, split the data into 1-increments.\n\n##### \u003e Isn't it just glorified [one-class classification](https://en.wikipedia.org/wiki/One-class_classification) with a surrogate loss?\n\nMaybe.\nI haven't pushed MPCL far enough to see if it can bring something truly new to ML.\nIt would certainly look less like one-class classification if you were to apply it to highly modular architectures.\n\n##### \u003e Does AI have to be modular? Why not a single neural network?\n\nI am agnostic on this question, but MPCL does rely on the assumption that\nthe system can be broken down into modules.\n\n##### \u003e Is MPCL biologically plausible?\n\nHopefully it is at some abstract level, but it is definitely not in any general sense\nof biological plausibility.\nFor one thing, brains cannot allocate new feature processors out of thin air.\nAlso, the outside world is not labeled.\n\n##### \u003e How to train non-differentiable models within this framework?\n\nProcessors (inputs -\u003e abstraction) and classifiers/regressors (abstraction -\u003e targets) are typically trained jointly.\nThis is where gradient descent shines, provided all the models involved are differentiable.\n\nYou may be able to get good results\nwith [coordinate descent](https://en.wikipedia.org/wiki/Coordinate_descent) on non-differentiable models.\n\nAlternatively, if the classifiers/regressors are designed to be analytically invertible,\nthen you can calculate the latent values that correspond 
to the targets, and train processors to\npredict the latent values as if they were the ground-truth.\n\n\n##### \u003e Isn't stabilizing meaning the same as stabilizing representations?\n\nI mean \"representation\" in the ML sense, as in \"[representation learning](https://en.wikipedia.org/wiki/Feature_learning)\".\nI do not mean \"mental representation\".\n\nIn that sense, if Representation-Preserving Continual Learning (RPCL) were a thing,\nit would not be the same thing as MPCL. It would be more limited and more limiting.\n\nNot every representation vector needs stability, whereas meaning always needs stability.\nAlso, representation vectors can be more fine-grained than meaning.\n\nIn a classification setting, each class' representation vectors need stability.\nMeaning is not a particularly useful concept in such a setting because it is\nconceivable to enumerate all the representation vectors that need stability\nwithout resorting to any other concept.\n\nIn regression and motor settings, however, it becomes harder to identify the representations\nthat need stability`*` and it is useful to look at the problem from a meaning angle,\ni.e. the link between representations and goals.\nGoal-realizing functions need to remain *accurate*.\nThey need not remain *stable* at all times.\n\n`*` I’m still debating whether it would be practical or not to require stability\nfor every single representation vector of the training set,\nthereby doing away with meaning.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2From1mouret%2Fmpcl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2From1mouret%2Fmpcl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2From1mouret%2Fmpcl/lists"}