Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/darkeyes/mrreg
The framework for finding multiresolution partitions that have homogeneous linear models from multiresolution dataset.
https://github.com/darkeyes/mrreg
data-science minimum-description-length multiresolution r regression-analysis
Last synced: 7 days ago
JSON representation
The framework for finding multiresolution partitions that have homogeneous linear models from multiresolution dataset.
- Host: GitHub
- URL: https://github.com/darkeyes/mrreg
- Owner: DarkEyes
- License: other
- Created: 2020-05-03T04:13:13.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-08-19T15:12:23.000Z (about 2 years ago)
- Last Synced: 2024-10-29T08:31:58.870Z (8 days ago)
- Topics: data-science, minimum-description-length, multiresolution, r, regression-analysis
- Language: R
- Homepage:
- Size: 1.82 MB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
MRReg: MDL Multiresolution Linear Regression Framework
==========================================================
[![Travis CI build status](https://api.travis-ci.com/DarkEyes/MRReg.svg)](https://app.travis-ci.com/github/DarkEyes/MRReg/)
[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.5.0-6666ff.svg)](https://cran.r-project.org/)
[![CRAN Status Badge](https://www.r-pkg.org/badges/version-last-release/MRReg)](https://cran.r-project.org/package=MRReg)
[![Download](https://cranlogs.r-pkg.org/badges/grand-total/MRReg)](https://cran.r-project.org/package=MRReg)
[![arXiv](https://img.shields.io/badge/cs.LG-arXiv%3A1907.05234-B31B1B.svg)](https://arxiv.org/abs/1907.05234)
[![](https://img.shields.io/badge/doi-10.1145%2F3424670-yellow)](https://doi.org/10.1145/3424670)
[![License](https://img.shields.io/badge/License-MIT-orange.svg)](https://spdx.org/licenses/MIT.html)In this work, we provide the framework to analyze multiresolution partitions (e.g. country, provinces, subdistrict) where each individual data point belongs to only one partition in each layer (e.g. i belongs to subdistrict A, province P, and country Q).
We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation is at the 1st layer subsumes all provinces at the 2nd layer).
Given N individuals that have a pair of real values (x,y) that generated from independent variable X and dependent variable Y.
Each individual i belongs to one partition per layer.Our goal is to find which partitions at which highest level that all individuals in the these partitions share the same linear model Y=f(X) where f is a linear function.
The framework deploys the Minimum Description Length principle (MDL) to infer solutions.
Installation
------------For the newest version on github, please call the following command in R terminal.
``` r
remotes::install_github("DarkEyes/MRReg")
```
This requires a user to install the "remotes" package before installing MRReg.Example: Inferred optimal homogeneous partitions
----------------------------------------------------------------------------------In the first step, we generate a simulation dataset.
All simulation types have three layers except the type 4 has four layers.
The type-1 simulation has all individuals belong to the same homogeneous partition in the first layer.
The type-2 simulation has four homogeneous partitions in a second layer. Each partition has its own models.
The type-3 simulation has eight homogeneous partitions in a third layer. Each partition has its own models
The type-4 simulation has one homogeneous partition in a second layer, four homogeneous partitions in a third layer, and eight homogeneous partitions in a fourth layer. Each partition has its own model.
In this example, we use type-4 simulation.
```{r}
library(MRReg)# Generate simulation data type 4 by having 100 individuals per homogeneous partition.
DataT<-SimpleSimulation(100,type=4)gamma <- 0.05 # Gamma parameter
out<-FindMaxHomoOptimalPartitions(DataT,gamma)
```
Then we plot the optimal homogeneous tree.```{r}
plotOptimalClustersTree(out)
```The red nodes are homogeneous partitions.
All children of a homogeneous partition node share the same linear model.Lastly, we can print the result in text form.
```{r}
PrintOptimalClustersResult(out, selFeature = TRUE)
```
The result is below.
```{r}
[1] "========== List of Optimal Clusters =========="
[1] "Layer2,ClS-C1:clustInfoRecRatio=0.08,modelInfoRecRatio=0.72, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C11:clustInfoRecRatio=0.10,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C12:clustInfoRecRatio=0.10,modelInfoRecRatio=0.70, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer3,ClS-C13:clustInfoRecRatio=0.10,modelInfoRecRatio=0.68, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer3,ClS-C14:clustInfoRecRatio=0.09,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C21:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer4,ClS-C22:clustInfoRecRatio=NA,modelInfoRecRatio=0.58, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer4,ClS-C23:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer4,ClS-C24:clustInfoRecRatio=NA,modelInfoRecRatio=0.46, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C25:clustInfoRecRatio=NA,modelInfoRecRatio=0.55, eta(C)cv=1.00"
[1] "Selected features"
[1] 6
[1] "Layer4,ClS-C26:clustInfoRecRatio=NA,modelInfoRecRatio=0.60, eta(C)cv=1.00"
[1] "Selected features"
[1] 7
[1] "Layer4,ClS-C27:clustInfoRecRatio=NA,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 8
[1] "Layer4,ClS-C28:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 9
[1] "min eta(C)cv:1.000000"
```
Note for selected features: 1 is reserved for an intercept, and d is a selected feature if Y[i] ~ X[i,d-1] in linear model.
Note that the clustInfoRecRatio values are always NA for last-layer partitions.Explanation: FindMaxHomoOptimalPartitions(DataT,gamma)
----------------------------------------------------------------------------------- INPUT: DataT$X[i,j] is the value of jth independent variable of ith individual.
- INPUT: DataT$Y[i] is the value of dependent variable of ith individual.
- INPUT: DataT$clsLayer[i,k] is the cluster label of ith individual in kth cluster layer.- OUTPUT: out$Copt[p,1] is equal to k implies that a cluster that is a pth member of the maximal homogeneous partition is at kth layer and the cluster name in kth layer is Copt[p,2]
- OUTPUT: out$Copt[p,3] is "Model Information Reduction Ratio" of pth member of the maximal homogeneous partition: positive means the linear model is better than the null model.
- OUTPUT: out$Copt[p,4] is $\eta( {C} )_{\text{cv}}$ of pth member of the maximal homogeneous partition. The greater Copt[p,4], the higher homogeneous degree of this cluster.
- OUTPUT: out$models[[k]][[j]] is the linear regression model of jth cluster in kth layer.
- OUTPUT: out$models[[k]][[j]]$clustInfoRecRatio is the "Cluster Information Reduction Ratio" between the jth cluster in kth layer and its children clusters in (k+1)th layer: positive means current cluster is better than its children clusters. Hence, we should keep this cluster at the member of maximal homogeneous partition instead of its children.Citation
----------------------------------------------------------------------------------
Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021). Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2), 1-30. https://doi.org/10.1145/3424670Contact
----------------------------------------------------------------------------------
- Developer: C. Amornbunchornvej
- Strategic Analytics Networks with Machine Learning and AI (SAI), NECTEC, Thailand
- Homepage: Link