{"id":30294473,"url":"https://github.com/linkedin/dagli","last_synced_at":"2025-08-17T01:35:11.960Z","repository":{"id":57726228,"uuid":"181747297","full_name":"linkedin/dagli","owner":"linkedin","description":"Framework for defining machine learning models, including feature generation and transformations, as directed acyclic graphs (DAGs).","archived":false,"fork":false,"pushed_at":"2023-10-23T18:24:07.000Z","size":112302,"stargazers_count":354,"open_issues_count":7,"forks_count":39,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-15T09:12:53.271Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linkedin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-16T18:43:56.000Z","updated_at":"2024-09-23T20:46:40.000Z","dependencies_parsed_at":"2022-09-17T13:41:35.544Z","dependency_job_id":null,"html_url":"https://github.com/linkedin/dagli","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/linkedin/dagli","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fdagli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fdagli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fdagli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fdagli/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linkedin","download_url":"https://codeload.github.com/linkedin/dagli/tar.
gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linkedin%2Fdagli/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270796216,"owners_count":24647319,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-17T01:35:09.933Z","updated_at":"2025-08-17T01:35:11.924Z","avatar_url":"https://github.com/linkedin.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":[],"readme":"# Dagli\n[![Maven badge](https://maven-badges.herokuapp.com/maven-central/com.linkedin.dagli/core/badge.svg)](https://search.maven.org/search?q=g:com.linkedin.dagli)\n[![javadoc](https://javadoc.io/badge2/com.linkedin.dagli/all/javadoc.svg)](https://javadoc.io/doc/com.linkedin.dagli/all)\n\nDagli is a machine learning framework that makes it easy to write bug-resistant, readable, efficient, maintainable and \ntrivially deployable models in [Java 9+](documentation/java.md) (and other JVM languages).\n\nHere's an introductory example of a text classifier implemented as a pipeline that uses the active leaves of a \nGradient Boosted Decision Tree model (XGBoost) as well as a high-dimensional set of ngrams as features in a logistic \nregression classifier:\n\n    Placeholder\u003cString\u003e text = new 
Placeholder\u003c\u003e();\n    Placeholder\u003cLabelType\u003e label = new Placeholder\u003c\u003e(); \n    Tokens tokens = new Tokens().withInput(text);\n    \n    NgramVector unigramFeatures = new NgramVector().withMaxSize(1).withInput(tokens);\n    Producer\u003cVector\u003e leafFeatures = new XGBoostClassification\u003c\u003e()\n        .withFeaturesInput(unigramFeatures)\n        .withLabelInput(label)\n        .asLeafFeatures();\n\n    NgramVector ngramFeatures = new NgramVector().withMaxSize(3).withInput(tokens);\n    LiblinearClassification\u003cLabelType\u003e prediction = new LiblinearClassification\u003cLabelType\u003e()\n        .withFeaturesInput().fromVectors(ngramFeatures, leafFeatures)\n        .withLabelInput(label);\n\n    DAG2x1.Prepared\u003cString, LabelType, DiscreteDistribution\u003cLabelType\u003e\u003e trainedModel = \n        DAG.withPlaceholders(text, label).withOutput(prediction).prepare(textList, labelList);\n    \n    // the prepared DAG outputs a distribution over labels, matching its declared result type\n    DiscreteDistribution\u003cLabelType\u003e predictedLabels = trainedModel.apply(\"Some text for which to predict a label\", null);\n    // trainedModel can now be serialized and later loaded on a server, in a CLI app, in a Hive UDF...\n\nThis code is fairly minimal; Dagli also provides mechanisms to more elegantly encapsulate example data \n([@Structs](documentation/structs.md)), read in data (e.g. from delimiter-separated value or Avro files), evaluate model \nperformance, and much more.  You can find demonstrations of these among the \n[many code examples provided with Dagli](documentation/examples.md).    \n\n# Maven Coordinates\nDagli is [split into a number of modules](documentation/modules.md) that are published to \n[Maven Central](https://search.maven.org/search?q=g:com.linkedin.dagli); just add dependencies on those you need in your \nproject.  
For example, the dependencies for our above introductory example might look like this in Gradle:\n\n    implementation 'com.linkedin.dagli:common:15.0.0-beta9'            // commonly used transformers: bucketization, model selection, ngram featurization, etc.\n    implementation 'com.linkedin.dagli:text-tokenization:15.0.0-beta9' // the text tokenization transformer (\"Tokens\")\n    implementation 'com.linkedin.dagli:liblinear:15.0.0-beta9'         // the Dagli Liblinear classification model\n    implementation 'com.linkedin.dagli:xgboost:15.0.0-beta9'           // the Dagli XGBoost classification and regression models\n    \nIf you're in a hurry, you can instead add a dependency on `all`:\n\n    implementation 'com.linkedin.dagli:all:15.0.0-beta9'  // not recommended for production due to classpath bloat \n\nTo train neural networks, you'll also need to add a\n[dependency for either CPU- or GPU-backed linear algebra](examples/neural-network/build.gradle):\n\n    implementation \"org.nd4j:nd4j-native-platform:1.0.0-beta7\" // CPU-only computation\n    // implementation \"org.nd4j:nd4j-cuda-10.2-platform:1.0.0-beta7\" // alternatively, we can use CUDA 10.2 (GPU)\n    // implementation \"org.deeplearning4j:deeplearning4j-cuda-10.2:1.0.0-beta7\" // along with cuDNN 7.6 (optional)\n    \n    \n# Benefits\n- Write your machine learning pipeline as a directed acyclic graph (DAG) **once** for both training and inference.  No \nneed to specify a pipeline for training and a separate pipeline for inference.  You define it, train it, and predict \nwith a single pipeline definition.\n- Bug-resiliency: easy-to-read ML pipeline definitions, ubiquitous static typing, and most things in Dagli are \n**immutable**.\n- Portability: works on your server, in a Hadoop mapper, a CLI program, in your IDE, etc. 
on any platform\n- Deployability: an entire pipeline is serialized and deserialized as a single object\n- Abstraction: creating new transformations and models is straightforward and these can be reused in any Dagli pipeline\n- Speed: highly parallel multithreaded execution, graph (pipeline) optimizations, minibatching\n- Inventory: many, many useful pipeline components ready to use, right out of the box.  Neural networks, logistic \nregression, gradient boosted decision trees, FastText, cross-validation, cross-training, feature selection, data \nreaders, evaluation, feature transformations...\n- Java: easily use from any JVM language with the support of your IDE's code completion, type hints, inline \ndocumentation, etc.\n\n# Overview\nAs might be surmised from the name, \n[Dagli represents machine learning pipelines as directed acyclic graphs](documentation/dag.md) (DAGs).\n\n- The \"roots\" of the graph \n    - `Placeholder`s (which represent the training and inference example data)\n    - `Generator`s (which automatically generate a value for each example, such as a `Constant`, `ExampleIndex`, \n    `RandomDouble`, etc.)\n- Transformers, the \"child nodes\" of the graph\n    - Data transformations (e.g. `Tokens`, `BucketIndex`, `Rank`, `Index`, etc.)\n    - Learned models (e.g. `XGBoostRegression`, `LiblinearClassification`, `NeuralNetwork`, etc.)\n\nTransformers may be *preparable* or *prepared*.  Dagli uses the word \"preparation\" rather than \"training\" because many \n`PreparableTransformer`s are not statistical models; e.g. 
`BucketIndex` examines all the preparation examples to find \nthe optimal bucket boundaries with the most even distribution of values amongst the buckets.\n\nWhen a DAG is prepared with training/preparation data, the `PreparableTransformer`s (like `BucketIndex` or \n`XGBoostRegression`) become `PreparedTransformer`s (like `BucketIndex.Prepared` or `XGBoostRegression.Prepared`) which \nare then subsequently used to actually transform the input values (both during DAG preparation so the results may be fed\nto downstream transformers and later, during inference in the prepared DAG).\n\nOf course, many transformers are already \"prepared\" and don't require preparation; a prepared DAG containing no \npreparable transformers may be created directly (e.g. `DAG.Prepared.withPlaceholders(...).withOutputs(...)`) and used to\ntransform data without any preparation/training step. \n\nDAGs are encapsulated by a `DAG` class corresponding to their input and output arities, e.g. `DAG2x1\u003cString, Integer, \nDouble\u003e` is a pipeline that accepts examples with a `String` and `Integer` feature and outputs a `Double` result.\nGenerally, it's better design to provide all the example data together as a single [@Struct](documentation/structs.md) \nor other type rather than as multiple inputs.  DAGs are also themselves transformers and can thus be embedded within \nother, larger DAGs.\n\n# Examples\nProbably the easiest way to get a feel for how Dagli models are written and used is from the \n[numerous code examples](documentation/examples.md).  The example code is more verbose than would be seen in practice, \nbut--combined with explanatory comments for almost every step--these can be an excellent pedagogic tool.\n\n# Finding the Right Transformer\nDagli includes a large and growing library of transformers.  The [examples](documentation/examples.md) illustrate the\nuse of a number of transformers, and the [Javadoc](https://javadoc.io/doc/com.linkedin.dagli/all) is searchable.  
You\nmay also want to check the [module summary](documentation/modules.md) for a broader overview of what is available.\n\n# Adding New Transformers\nIf an existing transformer doesn't do what you want, you can often wrap an existing function/method with a \n`FunctionResultX` transformer (where `X` is the function's arity, e.g. 1 or 4).  Otherwise, it's \n[easy to create your own transformers](documentation/transformers.md).  \n\n# Documentation\n- [Overview of Dagli Examples](documentation/examples.md)\n- [Overview of Dagli Modules](documentation/modules.md)\n- [How Dagli Represents ML Pipelines as DAGs](documentation/dag.md)\n- [Usage and Creation of Transformers](documentation/transformers.md)\n- [@Structs: Autogenerated, immutable convenience classes for storing fields](documentation/structs.md)\n- [Using Avro Data with Dagli](documentation/avro.md)\n\n# Alternative ML Solutions\n\nDagli lets Java (and JVM) developers easily define readable, reusable, bug-resistant models and train them efficiently on \nmodern multicore, GPU-equipped machines.\n\nOf course, there is no \"one size fits all\" ML framework.  Dagli provides a layer-oriented API for defining novel neural\nnetworks, but for unusual architectures or cutting-edge research, TensorFlow, PyTorch, DeepLearning4J and others may be \nbetter options (Dagli supports the integration of arbitrary DeepLearning4J architectures into the model pipeline \nout-of-the-box, and, for example, pre-trained TensorFlow models can also be incorporated with a custom wrapper.)\n\nSimilarly, while Dagli models have been trained with *billions* of examples, extremely large scale training across \nmultiple machines may be better served by platforms such as Hadoop, Spark, and Kubeflow.  Hadoop/Hive/Spark/Presto/etc. \nare of course commonly used to pull data to train and evaluate Dagli models, but it is also very feasible to, e.g. create\ncustom UDFs that train, evaluate or apply Dagli models.  
\n\n[Further discussion comparing extant pipelined and joint modeling with Dagli](documentation/comparison.md).\n\n\n# Version History\n- `15.0.0-beta9`: *10/4/21*:\n    - `BinaryConfusionMatrix` now calculates F1-scores as 0 (rather than NaN) when precision and recall are both 0\n    - Fixed corner case where neural networks with multiple logically equivalent layers were improperly considered \n      invalid.\n    - Fixed vector sequence input bug in DL4J neural networks\n- `15.0.0-beta8`: *8/21/21*: Added default constructors to Dagli's implementation of DL4J vertices where needed to \n    ensure their serializability \n- `15.0.0-beta7`: *4/12/21*: Loosened erroneously-strict generic constraint on argument to \n   `NNClassification::withMultilabelLabelsInput(...)` \n- `15.0.0-beta6`: *1/26/21*: Added workaround for \n    [DL4J bug](https://community.konduit.ai/t/bertiterator-produces-npe-while-training-on-gpu/580) that caused a null \n    pointer exception when using CUDA (GPU) to train neural networks.  
Thanks to @cyberbeat for reporting this.\n- `15.0.0-beta5`: *11/15/20*: [aggregated Javadoc](https://javadoc.io/doc/com.linkedin.dagli/all) now available\n- `15.0.0-beta4`: *11/11/20*: `xgboost` now bundles in [support for Windows](xgboost/README.md)\n- `15.0.0-beta3`: *11/9/20*: Input Configurators and `MermaidVisualization`\n    - This is a major version increment and may not be compatible with models from 14.*\n    - [Input configurators](documentation/transformers.md#input-configurators) for more convenient, readable \n      configuration of transformer inputs; e.g., \n      `new LiblinearClassification\u003cLabelType\u003e().withFeaturesInput().fromNumbers(numberInput1, numberInput2...)...`\n    - New graph visualizer for rendering Dagli graphs as Mermaid markup\n    - [Full list of improvements](documentation/v15.0.0-beta3.md)\n- `14.0.0-beta2` *9/27/20*: update dependency metadata to prevent the annotation processors' dependencies from \n  transitively leaking into the client's classpath  \n- `14.0.0-beta1`: initial public release\n\n## Versioning Policy\nDagli's current public release is designated as \"beta\" due to extensive changes relative to previous \n(LinkedIn-internal) releases and the greater diversity of applications entailed by a public release. \n\nWhile in beta, releases with potentially breaking API or serialization changes will be accompanied by a major version \nincrement (e.g. `14.0.0-beta2` to `15.0.0-beta3`).  After the beta period concludes, subsequent revisions will be backward\ncompatible to allow large projects to depend on multiple versions of Dagli without dependency shading.\n\n# License\n[Licensed under the BSD 2-Clause license](LICENSE).\n\nCopyright 2020 LinkedIn Corporation.  
All Rights Reserved.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fdagli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinkedin%2Fdagli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinkedin%2Fdagli/lists"}