
BigML PHP Bindings
==================

In this repository you'll find an open source PHP library that gives
you a simple way to interact with BigML using its API.

This module is licensed under the Apache License, Version 2.0.

**Notice: the BigML PHP bindings are not entirely backwards compatible
with versions prior to 2.0. In particular, all static methods have been
removed from the BigML class. So, if you ever used the static syntax, e.g.:**

.. code-block:: php

    BigML::create_source(...);

**you will get an error. On the other hand, if you followed the syntax
documented in this README, i.e.,**

.. code-block:: php

    $api->create_source(...);

**you will be fine.**

**Additionally, notice that the old constructor, which accepted all of
its parameters as individual arguments, has been deprecated in favour
of a new one that takes named parameters. Use the new syntax instead,
as described below. The old constructor syntax will be maintained
until version 3.0, then removed.**

.. contents:: Table of Contents

Requirements
------------

PHP 8.0 or higher is currently supported by these bindings.

You will also need to have the non-default extensions mbstring, cURL,
and OpenSSL installed. Depending on how you installed PHP, you may
already have one or more of these extensions.

To check which modules you have currently installed, run

.. code-block:: bash

php -m
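
The same check can be done from PHP itself, which is handy when the web server uses a different PHP build than the command line. The sketch below relies only on the built-in ``extension_loaded`` function; ``missing_extensions`` is a hypothetical helper name, not part of the bindings.

```php
<?php
// Report which of the required extensions are missing from this PHP build.
// The list mirrors the extensions named above; nothing here is BigML-specific.
function missing_extensions(array $required): array
{
    $missing = [];
    foreach ($required as $name) {
        if (!extension_loaded($name)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

$missing = missing_extensions(['mbstring', 'curl', 'openssl']);
if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
}
```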

To install on Linux:

At the command line, run

.. code-block:: bash

sudo apt-get install phpXY-mbstring
sudo apt-get install phpXY-curl

where XY is the PHP version currently installed on your system (e.g.,
php72-curl).

To install on macOS:

At the command line, run

.. code-block:: bash

sudo port install phpXY-mbstring
sudo port install phpXY-curl
sudo port install phpXY-openssl

where XY is the PHP version currently installed on your system (e.g.,
php72-curl).

If you installed PHP by tapping homebrew-php, mbstring should already
be installed. You will still need to install curl and openssl using

.. code-block:: bash

brew install --with-openssl curl

To install on Windows:

If you have access to the php.ini, remove the semicolon in front of
these lines in the php.ini

.. code-block:: ini

    extension = php_mbstring.dll
    extension = php_curl.dll
    extension = php_openssl.dll

You will have to be sure you have these dll files, and they are
available on your PATH. You may also need to check that `libeay32.dll`
and `ssleay32.dll` are in your php directory.

Once you have made the changes, don't forget to restart your server
for them to take effect.

Importing the module
--------------------

Using Composer
""""""""""""""

If you are currently using Composer to manage your project's
libraries, simply add the following to your current `composer.json`

.. code-block:: json

{
"repositories": [
{
"type": "vcs",
"url": "https://github.com/bigmlcom/bigml-php/"
}
],
"require": {
"bigml/bigml-php": "dev-master",
"wamania/php-stemmer": "@dev"
},
"autoload":{
"classmap": ["vendor/bigml/bigml-php/bigml/"]
}
}

At the command line, run the command

.. code-block:: bash

php composer.phar install

This will install this library and all required library dependencies
(but not extensions such as mbstring).

In your code:

At the beginning of your file include the line

.. code-block:: php

    <?php
    require 'vendor/autoload.php';

Cloning from GitHub
"""""""""""""""""""

If you would prefer, you can manually clone this repo from GitHub. You
will still need to use Composer to install some third-party libraries.

If you haven't already done so, you will need to install Composer.

Linux/OSX:

Follow the instructions in the Composer download section to get the
`composer.phar` file, and run

.. code-block:: bash

php composer.phar install

This will install all necessary dependencies.

Windows:

Follow the instructions on the Composer website for downloading Composer, and run

.. code-block:: bash

php composer.phar install

This will install all necessary dependencies.

In your code:

At the beginning of your file you will need to include the various
files you will be using. If you will be making any remote calls, you
will need ``bigml.php``. If you will be using any local models, you will
need their specific files.

Authentication
--------------

All requests to BigML.io must be authenticated using your username and
API key, and are always transmitted over HTTPS.

This module will look for your username and API key in the environment
variables BIGML_USERNAME and BIGML_API_KEY respectively. You can add
the following lines to your .bashrc or .bash_profile to set those
variables automatically when you log in

.. code-block:: bash

export BIGML_USERNAME=myusername
export BIGML_API_KEY=a11e579e7e53fb9abd646a6ff8aa99d4afe83ac2
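
Before relying on the environment, it can help to confirm that PHP actually sees both variables (a web server often runs with a different environment than your shell). This is a minimal sketch using the built-in ``getenv``; ``bigml_env_credentials`` is a hypothetical helper, and the ``username`` and ``apiKey`` keys simply mirror the constructor options used in this README.

```php
<?php
// Read the credentials from the environment and fail early, naming the
// missing variable, if either one is absent.
// bigml_env_credentials() is a hypothetical helper, not part of the bindings.
function bigml_env_credentials(): array
{
    $names = ['username' => 'BIGML_USERNAME', 'apiKey' => 'BIGML_API_KEY'];
    $credentials = [];
    foreach ($names as $key => $variable) {
        $value = getenv($variable);
        if ($value === false || $value === '') {
            throw new RuntimeException("Environment variable $variable is not set");
        }
        $credentials[$key] = $value;
    }
    return $credentials;
}

try {
    $credentials = bigml_env_credentials();
    echo "Credentials found for user {$credentials['username']}\n";
} catch (RuntimeException $e) {
    echo $e->getMessage() . "\n";
}
```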

With that environment and your aliases set up, connecting to BigML is
a breeze

.. code-block:: php

$api = new BigML\BigML();

Otherwise, you can supply your credentials directly when instantiating
the BigML class:

.. code-block:: php

$api = new BigML\BigML([ "username" => "myusername",
"apiKey" => "my_api_key"]);

Caching
-------

An important feature provided by the API constructor is the
specification of a local cache to speed up the retrieval of
resources. If you supply a storage for your BigML instance, the PHP
bindings will hit the network only once for each resource. On
subsequent accesses, the resource will be retrieved from the local
cache.

This is how you can set the storage argument when you instantiate the
BigML class:

.. code-block:: php

$api = new BigML\BigML([ "username" => "myusername",
"apiKey" => "my_api_key",
"storage" => "storage/data"]);

Or, more succinctly:

.. code-block:: php

$api = new BigML\BigML(["storage" => "storage/data"]);

if you have your environment set.

All resources will be created, updated, or retrieved in/from the chosen directory.
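
Whether the bindings create the storage directory for you is not stated here, so a defensive option is to create it up front. The sketch below uses only built-in filesystem functions; ``ensure_storage_dir`` is a hypothetical helper and the temporary path is just an example location.

```php
<?php
// Make sure the cache directory exists and is writable before handing it
// to the constructor. ensure_storage_dir() is a hypothetical helper.
function ensure_storage_dir(string $path): string
{
    if (!is_dir($path) && !mkdir($path, 0775, true) && !is_dir($path)) {
        throw new RuntimeException("Cannot create storage directory: $path");
    }
    if (!is_writable($path)) {
        throw new RuntimeException("Storage directory is not writable: $path");
    }
    return $path;
}

$storage = ensure_storage_dir(sys_get_temp_dir() . '/bigml-storage/data');
// With the bindings loaded, the path would then be passed as shown above:
// $api = new BigML\BigML(["storage" => $storage]);
```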

Virtual Private Clouds
----------------------

For Virtual Private Cloud setups, you can change the remote server domain:

.. code-block:: php

$api = new BigML\BigML([ "username" => "myusername",
"apiKey" => "my_api_key",
"domain" => "my_VPC.bigml.io",
"storage" => "storage/data"]);

NOTICE: The BigML API used to provide a sandbox mode, also known as
development mode. This has been deprecated and is no longer supported in
the PHP bindings. To guarantee backward compatibility, the BigML
class constructor still accepts a ``dev_mode`` argument, but it is now
ignored.

Projects and Organizations
--------------------------

When you instantiate the BigML class you can specify a project or
organization that the instance shall default to:

.. code-block:: php

$api = new BigML\BigML(["username" => "myusername",
"apiKey" => "my_api_key",
"project" => $projectID]);

$api = new BigML\BigML(["username" => "myusername",
"apiKey" => "my_api_key",
"organization" => $organization]);

When ``project`` is set to a project ID and that project belongs to an
organization, the user is considered to be working in an organization
project. The scope of the API requests will be limited to this project,
and permissions must have been granted beforehand by the organization
administrator.

If the specified project does not belong to an organization but is a
project of the user's, then the scope of all API requests will be
limited to that project.

When ``organization`` is set to an organization ID, the user is considered
to be working for an organization. The scope of the API requests will
be limited to the projects of the organization, and permissions must
have been granted beforehand by the organization administrator.

Quick Start
-----------

Imagine that you want to use a CSV file containing the Iris flower
dataset to predict the species of a flower whose ``sepal length`` is
``5`` and whose ``sepal width`` is ``2.5``. A preview of the dataset is
shown below. It has 4 numeric fields (``sepal length``, ``sepal width``,
``petal length``, ``petal width``) and a categorical field, ``species``.
By default, BigML considers the last field in the dataset as the
objective field (i.e., the field that you want to generate predictions
for).

.. code-block:: text

sepal length,sepal width,petal length,petal width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
...
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica

If your credentials are stored in the environment as mentioned above,
you can easily generate a prediction following these steps

.. code-block:: php

$api = new BigML\BigML();

$source = $api->create_source('./tests/data/iris.csv');
$dataset = $api->create_dataset($source);
$model = $api->create_model($dataset);
$prediction = $api->create_prediction($model, array('sepal length'=> 5, 'sepal width'=> 2.5));

then:

.. code-block:: php

    $objective_field_name = $prediction->object->fields->{$prediction->object->objective_fields[0]}->name;
    // "petal width"

    $value = $prediction->object->prediction->{$prediction->object->objective_fields[0]};
    // 0.30455

    $api->pprint($prediction);
    // petal width for {"sepal length":5,"sepal width":2.5} is 0.30455
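
The property chain used above can be tried without a network call against a hand-built object. The mock below is an assumption: its shape is inferred from the expressions in the snippet, and a real prediction resource contains many more attributes.

```php
<?php
// Mock of the parts of a prediction resource that the expressions above
// read. The shape is inferred from those expressions, not from the API.
$prediction = json_decode('{
    "object": {
        "objective_fields": ["000003"],
        "fields": {"000003": {"name": "petal width"}},
        "prediction": {"000003": 0.30455}
    }
}');

// Field ids are not valid PHP identifiers, hence the ->{...} syntax.
$objective_id = $prediction->object->objective_fields[0];
$name = $prediction->object->fields->{$objective_id}->name;
$value = $prediction->object->prediction->{$objective_id};

echo "$name: $value\n";
```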

You can also generate an evaluation for the model by using

.. code-block:: php

$test_source = $api->create_source('./tests/data/iris.csv');
$test_dataset = $api->create_dataset($test_source);
$evaluation = $api->create_evaluation($model, $test_dataset);

Dataset
-------

If you want to get some basic statistics for each field you can retrieve
the fields from the dataset as follows to get a dictionary keyed by field id

.. code-block:: php

    $dataset = $api->get_dataset($dataset);
    print_r($api->get_fields($dataset));

The field filtering options are also available using a query string expression, for instance

.. code-block:: php

    $dataset = $api->get_dataset($dataset, "limit=20");

limits the number of fields that will be included in the dataset to 20.
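
Once you have a dictionary keyed by field id, pulling out a statistic per field is a short loop. The sketch below works on a mock whose values are copied from the iris summaries shown elsewhere in this README; decoding to associative arrays is an assumption made to keep the loop simple, and the structure actually returned by ``get_fields`` may differ.

```php
<?php
// Mock of a fields dictionary keyed by field id (values taken from the
// iris summaries in this README). Decoding to arrays is an assumption.
$fields = json_decode('{
    "000000": {"name": "sepal length", "optype": "numeric",
               "summary": {"mean": 5.84333, "minimum": 4.3, "maximum": 7.9}},
    "000001": {"name": "sepal width", "optype": "numeric",
               "summary": {"mean": 3.05733, "minimum": 2, "maximum": 4.4}},
    "000004": {"name": "species", "optype": "categorical",
               "summary": {"missing_count": 0}}
}', true);

// Collect the mean of every numeric field, keyed by field name.
$means = [];
foreach ($fields as $id => $field) {
    if ($field['optype'] === 'numeric') {
        $means[$field['name']] = $field['summary']['mean'];
    }
}
print_r($means);
```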

Model
-----

One of the greatest things about BigML is that the models that it generates for you are fully white-boxed.
To get the explicit tree-like predictive model for the example above

.. code-block:: php

$model = $api->get_model($model_id);

print_r($model->object->model->root);

stdClass Object
(
[children] => Array
(
[0] => stdClass Object
(
[children] => Array
(
[0] => stdClass Object...

Again, filtering options are also available using a query string expression, for instance

.. code-block:: php

$model = $api->get_model($model_id, "limit=5");

limits the number of fields that will be included in the model to 5.
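
Because the model is a plain nested structure, it can be explored with an ordinary recursive function. The sketch below counts nodes and measures the depth of a tree whose nodes carry an optional ``children`` array, as in the ``print_r`` output above. The tiny tree is made up for illustration and ``tree_stats`` is a hypothetical helper.

```php
<?php
// Count the nodes and the maximum depth of a tree whose nodes expose an
// optional "children" array. Returns [node count, maximum depth].
function tree_stats(stdClass $node, int $depth = 1): array
{
    $nodes = 1;
    $maxDepth = $depth;
    foreach ($node->children ?? [] as $child) {
        [$childNodes, $childDepth] = tree_stats($child, $depth + 1);
        $nodes += $childNodes;
        $maxDepth = max($maxDepth, $childDepth);
    }
    return [$nodes, $maxDepth];
}

// Made-up tree: a root with two children, one of which has two leaves.
$root = json_decode('{"children": [{"children": [{}, {}]}, {}]}');
[$nodes, $depth] = tree_stats($root);
echo "nodes: $nodes, depth: $depth\n";
```

With a real model you would pass ``$model->object->model->root`` instead of the mock.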

Evaluation
----------

The predictive performance of a model can be measured in many different ways.
In BigML these measures are obtained by creating evaluations.
To create an evaluation you need the id of the model you are evaluating and the id of
the dataset that contains the test data. Once created, the evaluation can be retrieved with

.. code-block:: php

$evaluation = $api->get_evaluation($evaluation_id);

Cluster
-------

For unsupervised learning problems, a cluster groups your training data into a limited number of groups.
The cluster structure is defined by the center of each group, called a centroid, and the data enclosed in the group.
As with models, the cluster is a white-box resource and can be retrieved as JSON

.. code-block:: php

    $cluster = $api->get_cluster($cluster_id);

Anomaly detector
----------------

For anomaly detection problems, the BigML anomaly detector uses an isolation forest (``iforest``), an unsupervised kind of model that detects anomalous data in a dataset. The information it returns includes a ``top_anomalies`` block that contains a list of the most anomalous points. For each point, we capture a ``score`` from 0 to 1; the closer to 1, the more anomalous. We also capture the ``row``, which gives the values for each field in the order defined by ``input_fields``. Similarly, we give a list of ``importance`` values that match the row values. These importances tell us which values contributed most to the anomaly score. Thus, the structure of an anomaly detector is similar to

.. code-block:: json

{"category": 0,
"code": 200,
"columns": 14,
"constraints": false,
"created": "2014-09-08T18:51:11.893000",
"credits": 0.11653518676757812,
"credits_per_prediction": 0.0,
"dataset": "dataset/540dfa9d9841fa5c88000765",
"dataset_field_types": { "categorical": 21,
"datetime": 0,
"numeric": 21,
"preferred": 14,
"text": 0,
"total": 42},
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [],
"fields_meta": { "count": 14,
"limit": 1000,
"offset": 0,
"query_total": 14,
"total": 14},
"forest_size": 128,
"input_fields": [ "000004",
"000005",
"000009",
"000016",
"000017",
"000018",
"000019",
"00001e",
"00001f",
"000020",
"000023",
"000024",
"000025",
"000026"],
"locale": "en_US",
"max_columns": 42,
"max_rows": 200,
"model": { "fields": { "000004": { "column_number": 4,
"datatype": "int16",
"name": "src_bytes",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": { "bins": [ [ 143,
2],
...
[ 370,
2]],
"maximum": 370,
"mean": 248.235,
"median": 234.57157,
"minimum": 141,
"missing_count": 0,
"population": 200,
"splits": [ 159.92462,
173.73312,
188,
...
339.55228],
"standard_deviation": 49.39869,
"sum": 49647,
"sum_squares": 12809729,
"variance": 2440.23093}},
"000005": { "column_number": 5,
"datatype": "int32",
"name": "dst_bytes",
"optype": "numeric",
"order": 1,
"preferred": true,
...
"sum": 1030851,
"sum_squares": 22764504759,
"variance": 87694652.45224}},
"000009": { "column_number": 9,
"datatype": "string",
"name": "hot",
"optype": "categorical",
"order": 2,
"preferred": true,
"summary": { "categories": [ [ "0",
199],
[ "1",
1]],
"missing_count": 0},
"term_analysis": { "enabled": true}},
"000016": { "column_number": 22,
"datatype": "int8",
"name": "count",
"optype": "numeric",
"order": 3,
"preferred": true,
...
"population": 200,
"standard_deviation": 5.42421,
"sum": 1351,
"sum_squares": 14981,
"variance": 29.42209}},
"000017": { ... }}},
"kind": "iforest",
"mean_depth": 12.314174107142858,
"top_anomalies": [ { "importance": [ 0.06768,
0.01667,
0.00081,
0.02437,
0.04773,
0.22197,
0.18208,
0.01868,
0.11855,
0.01983,
0.01898,
0.05306,
0.20398,
0.00562],
"row": [ 183.0,
8654.0,
"0",
4.0,
4.0,
0.25,
0.25,
0.0,
123.0,
255.0,
0.01,
0.04,
0.01,
0.0],
"score": 0.68782},
{ "importance": [ 0.05645,
0.02285,
0.0015,
0.05196,
0.04435,
0.0005,
0.00056,
0.18979,
0.12402,
0.23671,
0.20723,
0.05651,
0.00144,
0.00612],
"row": [ 212.0,
1940.0,
"0",
1.0,
2.0,
0.0,
0.0,
1.0,
1.0,
69.0,
1.0,
0.04,
0.0,
0.0],
"score": 0.6239},
...],
"trees": [ { "root": { "children": [ { "children": [ { "children": [ { "children": [ { "children":[ { "population": 1,
"predicates": [ { "field": "00001f",
"op": ">",
"value": 35.54357}]},

{ "population": 1,
"predicates": [ { "field": "00001f",
"op": "<=",
"value": 35.54357}]}],
"population": 2,
"predicates": [ { "field": "000005",
"op": "<=",
"value": 1385.5166}]}],
"population": 3,
"predicates": [ { "field": "000020",
"op": "<=",
"value": 65.14308},
{ "field": "000019",
"op": "=",
"value": 0}]}],
"population": 105,
"predicates": [ { "field": "000017",
"op": "<=",
"value": 13.21754},
{ "field": "000009",
"op": "in",
"value": [ "0"]}]}],
"population": 126,
"predicates": [ true,
{ "field": "000018",
"op": "=",
"value": 0}]},
"training_mean_depth": 11.071428571428571}]},
"name": "tiny_kdd's dataset anomaly detector",
"number_of_batchscores": 0,
"number_of_public_predictions": 0,
"number_of_scores": 0,
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [1, 200],
"replacement": false,
"resource": "anomaly/540dfa9f9841fa5c8800076a",
"rows": 200,
"sample_rate": 1.0,
"sample_size": 126,
"seed": "BigML",
"shared": false,
"size": 30549,
"source": "source/540dfa979841fa5c7f000363",
"source_status": true,
"status": { "code": 5,
"elapsed": 32397,
"message": "The anomaly detector has been created",
"progress": 1.0},
"subscription": false,
"tags": [],
"updated": "2014-09-08T23:54:28.647000",
"white_box": false}

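Since ``importance`` and ``row`` follow the order of ``input_fields``, the three lists can be lined up by position. The sketch below does this for the first entry of ``top_anomalies``, with the values copied from the JSON above. It is only an illustration of how to read the structure, not part of the bindings.

```php
<?php
// input_fields and the first top_anomalies entry, copied from the JSON above.
$input_fields = ['000004', '000005', '000009', '000016', '000017', '000018',
                 '000019', '00001e', '00001f', '000020', '000023', '000024',
                 '000025', '000026'];
$anomaly = [
    'score' => 0.68782,
    'importance' => [0.06768, 0.01667, 0.00081, 0.02437, 0.04773, 0.22197,
                     0.18208, 0.01868, 0.11855, 0.01983, 0.01898, 0.05306,
                     0.20398, 0.00562],
];

// Pair each importance with its field id, then sort by importance,
// highest first (arsort keeps the field ids as keys).
$by_field = array_combine($input_fields, $anomaly['importance']);
arsort($by_field);

// The three fields that contributed most to this anomaly score.
$top = array_slice($by_field, 0, 3, true);
print_r($top);
```
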
Samples
-------

To provide quick access to your row data you can create a ``sample``. Samples
are in-memory objects that can be queried for subsets of data by limiting
their size, the fields or the rows returned. The structure of a sample would
be::

Samples are not permanent objects. Once they are created, they will be
available as long as GETs are requested within periods smaller than
a pre-established TTL (Time to Live). The expiration timer of a sample is
reset every time a new GET is received.

If requested, a sample can also perform linear regression and compute
Pearson's and Spearman's correlations for either one numeric field
against all other numeric fields or between two specific numeric fields.

Correlations
------------

A ``correlation`` resource contains a series of computations that reflect the
degree of dependence between the field set as objective for your predictions
and the rest of the fields in your dataset. The dependence degree is obtained by
comparing the distributions in every objective and non-objective field pair,
as independent fields should have probabilistically
independent distributions. Depending on the types of the fields to compare,
the metrics used to compute the correlation degree will be:

- for numeric to numeric pairs:
  Pearson's and Spearman's correlation coefficients.
- for numeric to categorical pairs:
  One-way Analysis of Variance, with the
  categorical field as the predictor variable.
- for categorical to categorical pairs:
  contingency table (or two-way table),
  Chi-square test of independence,
  and Cramer's V and Tschuprow's T coefficients.

An example of the correlation resource JSON structure is

.. code-block:: json

{"category": 0,
"clones": 0,
"code": 200,
"columns": 5,
"correlations": { "correlations": [ { "name": "one_way_anova",
"result": { "000000": { "eta_square": 0.61871,
"f_ratio": 119.2645,
"p_value": 0,
"significant": [ true,
true,
true]},
"000001": { "eta_square": 0.40078,
"f_ratio": 49.16004,
"p_value": 0,
"significant": [ true,
true,
true]},
"000002": { "eta_square": 0.94137,
"f_ratio": 1180.16118,
"p_value": 0,
"significant": [ true,
true,
true]},
"000003": { "eta_square": 0.92888,
"f_ratio": 960.00715,
"p_value": 0,
"significant": [ true,
true,
true]}}}],
"fields": { "000000": { "column_number": 0,
"datatype": "double",
"idx": 0,
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": { "bins": [ [ 4.3,
1],
[ 4.425,
4],
...
[ 7.9,
1]],
"kurtosis": -0.57357,
"maximum": 7.9,
"mean": 5.84333,
"median": 5.8,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"skewness": 0.31175,
"splits": [ 4.51526,
4.67252,
4.81113,
4.89582,
4.96139,
5.01131,
...
6.92597,
7.20423,
7.64746],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569}},
"000001": { "column_number": 1,
"datatype": "double",
"idx": 1,
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": { "counts": [ [ 2,
1],
[ 2.2,
...
]]}},
"000004": { "column_number": 4,
"datatype": "string",
"idx": 4,
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": { "categories": [ [ "Iris-setosa",
50],
[ "Iris-versicolor",
50],
[ "Iris-virginica",
50]],
"missing_count": 0},
"term_analysis": { "enabled": true}}},
"significance_levels": [0.01, 0.05, 0.1]},
"created": "2015-07-28T18:07:37.010000",
"credits": 0.017581939697265625,
"dataset": "dataset/55b7a6749841fa2500000d41",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [],
"fields_meta": { "count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5},
"input_fields": ["000000", "000001", "000002", "000003"],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset correlation",
"objective_field_details": { "column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4},
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [1, 150],
"replacement": false,
"resource": "correlation/55b7c4e99841fa24f20009bf",
"rows": 150,
"sample_rate": 1.0,
"shared": false,
"size": 4609,
"source": "source/55b7a6729841fa24f100036a",
"source_status": true,
"status": { "code": 5,
"elapsed": 274,
"message": "The correlation has been created",
"progress": 1.0},
"subscription": true,
"tags": [],
"updated": "2015-07-28T18:07:49.057000",
"white_box": false}

Note that the output in the snippet above has been abbreviated. As you see, the
``correlations`` attribute contains the information about each field
correlation to the objective field.
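
As an illustration of how these results can be consumed, the sketch below ranks the input fields by the ``eta_square`` value of the ``one_way_anova`` entry, using the numbers from the JSON above. Eta squared is commonly read as the share of variance in a numeric field explained by the categorical one, so a higher value suggests a stronger association; the ranking itself is just one possible use of the data.

```php
<?php
// one_way_anova results copied from the correlation JSON above
// (only eta_square and f_ratio are kept).
$anova = json_decode('{
    "000000": {"eta_square": 0.61871, "f_ratio": 119.2645},
    "000001": {"eta_square": 0.40078, "f_ratio": 49.16004},
    "000002": {"eta_square": 0.94137, "f_ratio": 1180.16118},
    "000003": {"eta_square": 0.92888, "f_ratio": 960.00715}
}', true);

// Rank the input fields by eta_square, strongest association first.
// array_map keeps the field ids as keys when given a single array.
$eta = array_map(fn(array $r): float => $r['eta_square'], $anova);
arsort($eta);

foreach ($eta as $field_id => $value) {
    echo "$field_id: $value\n";
}
```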

Statistical Tests
-----------------

A ``statisticaltest`` resource contains a series of tests that compare
the distribution of data in each numeric field of a dataset to certain
canonical distributions, such as the normal distribution or the
Benford's law distribution. Statistical tests are useful in tasks such
as fraud, normality, or outlier detection.

- Fraud detection tests:

  - Benford: This statistical test compares the distribution of first
    significant digits (FSDs) of each value of the field to the Benford's
    law distribution. Benford's law applies to numerical distributions
    spanning several orders of magnitude, such as the values found on
    financial balance sheets. It states that the frequency distribution of
    leading, or first significant, digits in such distributions is not
    uniform. On the contrary, lower digits like 1 and 2 occur
    disproportionately often as leading significant digits. The test
    compares the distribution in the field to Benford's distribution using
    a Chi-square goodness-of-fit test and the Cho-Gaines d test. If a field
    has a dissimilar distribution, it may contain anomalous or fraudulent
    values.

- Normality tests:
  These tests can be used to confirm the assumption that the data in each field
  of a dataset is distributed according to a normal distribution. The results
  are relevant because many statistical and machine learning techniques rely on
  this assumption.

  - Anderson-Darling: The Anderson-Darling test computes a test statistic
    based on the difference between the observed cumulative distribution
    function (CDF) and that of a normal distribution. A significant result
    indicates that the assumption of normality is rejected.
  - Jarque-Bera: The Jarque-Bera test computes a test statistic based on the
    third and fourth central moments (skewness and kurtosis) of the data.
    Again, a significant result indicates that the normality assumption is
    rejected.
  - Z-score: For a given sample size, the maximum deviation from the mean
    that would be expected in a sampling of a normal distribution can be
    computed based on the 68-95-99.7 rule. This test simply reports this
    expected deviation and the actual deviation observed in the data, as a
    sort of sanity check.

- Outlier tests:

  - Grubbs: When the values of a field are normally distributed, a few
    values may still deviate from the mean. The outlier test reports
    whether at least one value in each numeric field differs significantly
    from the mean using Grubbs' test for outliers. If an outlier is found,
    then its value will be returned.

The JSON structure for ``statisticaltest`` resources is similar to this one

.. code-block:: json

{ "category": 0,
"clones": 0,
"code": 200,
"columns": 5,
"created": "2015-07-28T18:16:40.582000",
"credits": 0.017581939697265625,
"dataset": "dataset/55b7a6749841fa2500000d41",
"dataset_status": true,
"dataset_type": 0,
"description": "",
"excluded_fields": [],
"fields_meta": { "count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5},
"input_fields": ["000000", "000001", "000002", "000003"],
"locale": "en_US",
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset test",
"out_of_bag": false,
"price": 0.0,
"private": true,
"project": null,
"range": [1, 150],
"replacement": false,
"resource": "statisticaltest/55b7c7089841fa25000010ad",
"rows": 150,
"sample_rate": 1.0,
"shared": false,
"size": 4609,
"source": "source/55b7a6729841fa24f100036a",
"source_status": true,
"status": { "code": 5,
"elapsed": 302,
"message": "The test has been created",
"progress": 1.0},
"subscription": true,
"tags": [],
"statistical_tests": { "ad_sample_size": 1024,
"fields": { "000000": { "column_number": 0,
"datatype": "double",
"idx": 0,
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": { "bins": [ [ 4.3,
1],
[ 4.425,
4],
[ 7.9,
1]],
"kurtosis": -0.57357,
"maximum": 7.9,
"mean": 5.84333,
"median": 5.8,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"skewness": 0.31175,
"splits": [ 4.51526,
4.67252,
4.81113,
4.89582,
...
7.20423,
7.64746],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569}},
...
"000004": { "column_number": 4,
"datatype": "string",
"idx": 4,
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": { "categories": [ [ "Iris-setosa",
50],
[ "Iris-versicolor",
50],
[ "Iris-virginica",
50]],
"missing_count": 0},
"term_analysis": { "enabled": true}}},
"fraud": [ { "name": "benford",
"result": { "000000": { "chi_square": { "chi_square_value": 506.39302,
"p_value": 0,
"significant": [ true,
true,
true]},
"cho_gaines": { "d_statistic": 7.124311073683573,
"significant": [ true,
true,
true]},
"distribution": [ 0,
0,
0,
22,
61,
54,
13,
0,
0],
"negatives": 0,
"zeros": 0},
"000001": { "chi_square": { "chi_square_value": 396.76556,
"p_value": 0,
"significant": [ true,
true,
true]},
"cho_gaines": { "d_statistic": 7.503503138331123,
"significant": [ true,
true,
true]},
"distribution": [ 0,
57,
89,
4,
0,
0,
0,
0,
0],
"negatives": 0,
"zeros": 0},
"000002": { "chi_square": { "chi_square_value": 154.20728,
"p_value": 0,
"significant": [ true,
true,
true]},
"cho_gaines": { "d_statistic": 3.9229974017266054,
"significant": [ true,
true,
true]},
"distribution": [ 50,
0,
11,
43,
35,
11,
0,
0,
0],
"negatives": 0,
"zeros": 0},
"000003": { "chi_square": { "chi_square_value": 111.4438,
"p_value": 0,
"significant": [ true,
true,
true]},
"cho_gaines": { "d_statistic": 4.103257341299901,
"significant": [ true,
true,
true]},
"distribution": [ 76,
58,
7,
7,
1,
1,
0,
0,
0],
"negatives": 0,
"zeros": 0}}}],
"normality": [ { "name": "anderson_darling",
"result": { "000000": { "p_value": 0.02252,
"significant": [ false,
true,
true]},
"000001": { "p_value": 0.02023,
"significant": [ false,
true,
true]},
"000002": { "p_value": 0,
"significant": [ true,
true,
true]},
"000003": { "p_value": 0,
"significant": [ true,
true,
true]}}},
{ "name": "jarque_bera",
"result": { "000000": { "p_value": 0.10615,
"significant": [ false,
false,
false]},
"000001": { "p_value": 0.25957,
"significant": [ false,
false,
false]},
"000002": { "p_value": 0.0009,
"significant": [ true,
true,
true]},
"000003": { "p_value": 0.00332,
"significant": [ true,
true,
true]}}},
{ "name": "z_score",
"result": { "000000": { "expected_max_z": 2.71305,
"max_z": 2.48369},
"000001": { "expected_max_z": 2.71305,
"max_z": 3.08044},
"000002": { "expected_max_z": 2.71305,
"max_z": 1.77987},
"000003": { "expected_max_z": 2.71305,
"max_z": 1.70638}}}],
"outliers": [ { "name": "grubbs",
"result": { "000000": { "p_value": 1,
"significant": [ false,
false,
false]},
"000001": { "p_value": 0.26555,
"significant": [ false,
false,
false]},
"000002": { "p_value": 1,
"significant": [ false,
false,
false]},
"000003": { "p_value": 1,
"significant": [ false,
false,
false]}}}],
"significance_levels": [0.01, 0.05, 0.1]},
"updated": "2015-07-28T18:17:11.829000",
"white_box": false}

Note that the output in the snippet above has been abbreviated. As you see, the
``statistical_tests`` attribute contains the ``fraud``, ``normality``,
and ``outliers`` sections, where the information for each field's
distribution is stored.
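
To make the Benford test more concrete, the sketch below recomputes a Chi-square goodness-of-fit statistic for field ``000000`` from its ``distribution`` array in the JSON above, using the Benford probabilities P(d) = log10(1 + 1/d). It assumes the array lists the counts for leading digits 1 to 9 in order. Under that assumption the result should come out at about 506.39, in line with the ``chi_square_value`` reported for that field.

```php
<?php
// Observed first-significant-digit counts for field 000000 (digits 1..9),
// copied from the "distribution" array in the JSON above.
$observed = [0, 0, 0, 22, 61, 54, 13, 0, 0];
$total = array_sum($observed);

// Benford's law: P(d) = log10(1 + 1/d) for leading digit d.
$chi_square = 0.0;
foreach ($observed as $i => $count) {
    $digit = $i + 1;
    $expected = $total * log10(1 + 1 / $digit);
    $chi_square += ($count - $expected) ** 2 / $expected;
}

printf("chi-square statistic: %.5f\n", $chi_square);
```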

Logistic Regressions
--------------------

A logistic regression is a supervised machine learning method for
solving classification problems. Each of the classes in the field
you want to predict, the objective field, is assigned a probability depending
on the values of the input fields. The probability is computed
as the value of a logistic function,
whose argument is a linear combination of the predictors' values.
You can create a logistic regression selecting which fields from your
dataset you want to use as input fields (or predictors) and which
categorical field you want to predict, the objective field. Then the
created logistic regression is defined by the set of coefficients in the
linear combination of the values. Categorical
and text fields need some prior work to be modelled using this method. They
are expanded as a set of new fields, one per category or term (respectively),
where the number of occurrences of the category or term is stored. Thus,
the linear combination is made on the frequency of the categories or terms.

The JSON structure for a logistic regression is

.. code-block:: json

{ "balance_objective": false,
"category": 0,
"code": 200,
"columns": 5,
"created": "2015-10-09T16:11:08.444000",
"credits": 0.017581939697265625,
"credits_per_prediction": 0.0,
"dataset": "dataset/561304f537203f4c930001ca",
"dataset_field_types": { "categorical": 1,
"datetime": 0,
"effective_fields": 5,
"numeric": 4,
"preferred": 5,
"text": 0,
"total": 5},
"dataset_status": true,
"description": "",
"excluded_fields": [],
"fields_meta": { "count": 5,
"limit": 1000,
"offset": 0,
"query_total": 5,
"total": 5},
"input_fields": ["000000", "000001", "000002", "000003"],
"locale": "en_US",
"logistic_regression": { "bias": 1,
"c": 1,
"coefficients": [ [ "Iris-virginica",
[ -1.7074433493289376,
-1.533662474502423,
2.47026986670851,
2.5567582221085563,
-1.2158200612711925]],
[ "Iris-setosa",
[ 0.41021712519841674,
1.464162165246765,
-2.26003266131107,
-1.0210350909174153,
0.26421852991732514]],
[ "Iris-versicolor",
[ 0.42702327817072505,
-1.611817241669904,
0.5763832839459982,
-1.4069842681625884,
1.0946877732663143]]],
"eps": 1e-05,
"fields": { "000000": { "column_number": 0,
"datatype": "double",
"name": "sepal length",
"optype": "numeric",
"order": 0,
"preferred": true,
"summary": { "bins": [ [ 4.3,
1],
[ 4.425,
4],
[ 4.6,
4],
...
[ 7.9,
1]],
"kurtosis": -0.57357,
"maximum": 7.9,
"mean": 5.84333,
"median": 5.8,
"minimum": 4.3,
"missing_count": 0,
"population": 150,
"skewness": 0.31175,
"splits": [ 4.51526,
4.67252,
4.81113,
...
6.92597,
7.20423,
7.64746],
"standard_deviation": 0.82807,
"sum": 876.5,
"sum_squares": 5223.85,
"variance": 0.68569}},
"000001": { "column_number": 1,
"datatype": "double",
"name": "sepal width",
"optype": "numeric",
"order": 1,
"preferred": true,
"summary": { "counts": [ [ 2,
1],
[ 2.2,
3],
...
[ 4.2,
1],
[ 4.4,
1]],
"kurtosis": 0.18098,
"maximum": 4.4,
"mean": 3.05733,
"median": 3,
"minimum": 2,
"missing_count": 0,
"population": 150,
"skewness": -0.27213,
"splits": [ 1.25138,
1.32426,
1.37171,
...
6.02913,
6.38125],
"standard_deviation": 1.7653,
"sum": 563.7,
"sum_squares": 2582.71,
"variance": 3.11628}},
"000003": { "column_number": 3,
"datatype": "double",
"name": "petal width",
"optype": "numeric",
"order": 3,
"preferred": true,
"summary": { "counts": [ [ 0.1,
5],
[ 0.2,
29],
...
[ 2.4,
3],
[ 2.5,
3]],
"kurtosis": -1.33607,
"maximum": 2.5,
"mean": 1.19933,
"median": 1.3,
"minimum": 0.1,
"missing_count": 0,
"population": 150,
"skewness": -0.10193,
"standard_deviation": 0.76224,
"sum": 179.9,
"sum_squares": 302.33,
"variance": 0.58101}},
"000004": { "column_number": 4,
"datatype": "string",
"name": "species",
"optype": "categorical",
"order": 4,
"preferred": true,
"summary": { "categories": [ [ "Iris-setosa",
50],
[ "Iris-versicolor",
50],
[ "Iris-virginica",
50]],
"missing_count": 0},
"term_analysis": { "enabled": true}}},
"normalize": false,
"regularization": "l2"},
"max_columns": 5,
"max_rows": 150,
"name": "iris' dataset's logistic regression",
"number_of_batchpredictions": 0,
"number_of_evaluations": 0,
"number_of_predictions": 1,
"objective_field": "000004",
"objective_field_name": "species",
"objective_field_type": "categorical",
"objective_fields": ["000004"],
"out_of_bag": false,
"private": true,
"project": "project/561304c137203f4c9300016c",
"range": [1, 150],
"replacement": false,
"resource": "logisticregression/5617e71c37203f506a000001",
"rows": 150,
"sample_rate": 1.0,
"shared": false,
"size": 4609,
"source": "source/561304f437203f4c930001c3",
"source_status": true,
"status": { "code": 5,
"elapsed": 86,
"message": "The logistic regression has been created",
"progress": 1.0},
"subscription": false,
"tags": ["species"],
"updated": "2015-10-09T16:14:02.336000",
"white_box": false}

Note that the output in the snippet above has been abbreviated. As you see,
the ``logistic_regression`` attribute stores the coefficients used in the
logistic function as well as the configuration parameters described in
the `developers section `_ .

Associations
------------

Association Discovery is a popular method to find out relations among values
in high-dimensional datasets.

A common use case for association discovery is
market basket analysis, which looks for customer shopping
patterns across large transactional
datasets. For instance, do customers who buy hamburgers and ketchup also
consume bread?

Businesses use those insights to make decisions on promotions and product
placements.
Association Discovery can also be used for other purposes such as early
incident detection, web usage analysis, or software intrusion detection.

In BigML, the Association resource object can be built from any dataset, and
its results are a list of association rules between the items in the dataset.
In the example above, the corresponding
association rule would have hamburgers and ketchup as the items on the
left-hand side of the rule (the antecedent) and bread as the item on the
right-hand side (the consequent). Both sides of the rule are related,
in the sense that observing
the items on the left-hand side suggests observing the items on the
right-hand side. Several metrics help assess the quality of these
association rules:

- Support: the proportion of instances which contain an itemset.

For an association rule, it means the number of instances in the dataset which
contain the rule's antecedent and rule's consequent together
over the total number of instances (N) in the dataset.

It gives a measure of the importance of the rule. Association rules have
to satisfy a minimum support constraint (i.e., min_support).

- Coverage: the support of the antecedent of an association rule.
It measures how often a rule can be applied.

- Confidence (or strength): the probability of seeing the rule's consequent
under the condition that the instances also contain the rule's antecedent.
Confidence is computed as the support of the association rule divided by the
coverage. That is, the number of instances which contain the consequent
and antecedent together over the number of instances which contain
the antecedent.

Confidence is directed and gives different values for the association
rules Antecedent → Consequent and Consequent → Antecedent. Association
rules also need to satisfy a minimum confidence constraint
(i.e., min_confidence).

- Leverage: the difference between the support of the association
rule (i.e., the antecedent and consequent appearing together) and the support
that would be expected if antecedent and consequent were statistically
independent.
This is a value between -1 and 1. A positive value suggests a positive
relationship and a negative value suggests a negative relationship.
0 indicates independence.

- Lift: how many times more often antecedent and consequent occur together
than expected if they were statistically independent.
A value of 1 suggests that there is no relationship between the antecedent
and the consequent. Higher values suggest stronger positive relationships.
Lower values suggest stronger negative relationships (the presence of the
antecedent reduces the likelihood of the consequent).
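
The definitions above can be checked with a few lines of arithmetic. The
following is only a sketch: ``rule_metrics`` and its parameter names are
hypothetical helpers written for this illustration and are not part of the
bindings. The counts are taken from rule ``000029`` in the sample association
shown later in this section (150 rows, antecedent in 50, consequent in 16,
both together in 13).

.. code-block:: php

    <?php
    /**
     * Compute the rule metrics described above from raw instance counts.
     * Hypothetical helper, not part of the BigML bindings.
     *
     * $n     total number of instances in the dataset
     * $nLhs  instances containing the antecedent (must be > 0)
     * $nRhs  instances containing the consequent (must be > 0)
     * $nBoth instances containing antecedent and consequent together
     */
    function rule_metrics(int $n, int $nLhs, int $nRhs, int $nBoth): array
    {
        $support    = $nBoth / $n;            // rule support
        $coverage   = $nLhs / $n;             // support of the antecedent
        $rhsSupport = $nRhs / $n;             // support of the consequent
        $expected   = $coverage * $rhsSupport; // support under independence

        return [
            'support'    => $support,
            'coverage'   => $coverage,
            'confidence' => $support / $coverage,
            'leverage'   => $support - $expected,
            'lift'       => $support / $expected,
        ];
    }

    $m = rule_metrics(150, 50, 16, 13);
    printf(
        "confidence=%.2f leverage=%.5f lift=%.4f\n",
        $m['confidence'],
        $m['leverage'],
        $m['lift']
    );
    // prints: confidence=0.26 leverage=0.05111 lift=2.4375

These values match the ``confidence``, ``leverage`` and ``lift`` reported for
that rule in the sample output.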

As to the items used in association rules, each type of field is parsed to
extract items for the rules as follows:

- Categorical: each different value (class) will be considered a separate item.
- Text: each unique term will be considered a separate item.
- Items: each different item in the items summary will be considered.
- Numeric: values will be converted into categorical ones by splitting
their range into segments.
For example, a numeric field with values ranging from 0 to 600 split
into 3 segments:
segment 1 → [0, 200), segment 2 → [200, 400), segment 3 → [400, 600].
You can refine the behavior of the transformation using
`discretization `_
and `field_discretizations `_.
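
As a rough illustration of the numeric case, the sketch below assigns a value
to one of several equal-width segments, reproducing the 0 to 600 example.
``segment_for`` is a hypothetical helper, not a function of the bindings, and
it does not attempt to model the discretization options BigML actually
applies on the server side.

.. code-block:: php

    <?php
    /**
     * Map a numeric value to a 1-based equal-width segment.
     * Intervals are half-open [start, end) except the last one,
     * which also includes its upper bound.
     * Hypothetical helper, not part of the BigML bindings.
     */
    function segment_for(float $value, float $min, float $max, int $segments): int
    {
        if ($segments < 1 || $max <= $min) {
            throw new InvalidArgumentException('Need at least one segment and max > min.');
        }
        if ($value < $min || $value > $max) {
            throw new OutOfRangeException('Value lies outside [min, max].');
        }
        $width = ($max - $min) / $segments;
        $index = (int) floor(($value - $min) / $width) + 1;

        // The maximum itself would land in segment N + 1, so clamp it.
        return min($index, $segments);
    }

    echo segment_for(199.9, 0, 600, 3), "\n"; // 1
    echo segment_for(200, 0, 600, 3), "\n";   // 2
    echo segment_for(600, 0, 600, 3), "\n";   // 3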

The JSON structure for an association resource is:

.. code-block:: json

{
"associations":{
"complement":false,
"discretization":{
"pretty":true,
"size":5,
"trim":0,
"type":"width"
},
"items":[
{
"complement":false,
"count":32,
"field_id":"000000",
"name":"Segment 1",
"bin_end":5,
"bin_start":null
},
{
"complement":false,
"count":49,
"field_id":"000000",
"name":"Segment 3",
"bin_end":7,
"bin_start":6
},
{
"complement":false,
"count":12,
"field_id":"000000",
"name":"Segment 4",
"bin_end":null,
"bin_start":7
},
{
"complement":false,
"count":19,
"field_id":"000001",
"name":"Segment 1",
"bin_end":2.5,
"bin_start":null
},
...
{
"complement":false,
"count":50,
"field_id":"000004",
"name":"Iris-versicolor"
},
{
"complement":false,
"count":50,
"field_id":"000004",
"name":"Iris-virginica"
}
],
"max_k": 100,
"min_confidence":0,
"min_leverage":0,
"min_lift":1,
"min_support":0,
"rules":[
{
"confidence":1,
"id":"000000",
"leverage":0.22222,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.000000000,
"rhs":[
6
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
{
"confidence":1,
"id":"000001",
"leverage":0.22222,
"lhs":[
6
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.000000000,
"rhs":[
13
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.33333,
50
]
},
...
{
"confidence":0.26,
"id":"000029",
"leverage":0.05111,
"lhs":[
13
],
"lhs_cover":[
0.33333,
50
],
"lift":2.4375,
"p_value":0.0000454342,
"rhs":[
5
],
"rhs_cover":[
0.10667,
16
],
"support":[
0.08667,
13
]
},
{
"confidence":0.18,
"id":"00002a",
"leverage":0.04,
"lhs":[
15
],
"lhs_cover":[
0.33333,
50
],
"lift":3,
"p_value":0.0000302052,
"rhs":[
9
],
"rhs_cover":[
0.06,
9
],
"support":[
0.06,
9
]
},
{
"confidence":1,
"id":"00002b",
"leverage":0.04,
"lhs":[
9
],
"lhs_cover":[
0.06,
9
],
"lift":3,
"p_value":0.0000302052,
"rhs":[
15
],
"rhs_cover":[
0.33333,
50
],
"support":[
0.06,
9
]
}
],
"rules_summary":{
"confidence":{
"counts":[
[
0.18,
1
],
[
0.24,
1
],
[
0.26,
2
],
...
[
0.97959,
1
],
[
1,
9
]
],
"maximum":1,
"mean":0.70986,
"median":0.72864,
"minimum":0.18,
"population":44,
"standard_deviation":0.24324,
"sum":31.23367,
"sum_squares":24.71548,
"variance":0.05916
},
"k":44,
"leverage":{
"counts":[
[
0.04,
2
],
[
0.05111,
4
],
[
0.05316,
2
],
...
[
0.22222,
2
]
],
"maximum":0.22222,
"mean":0.10603,
"median":0.10156,
"minimum":0.04,
"population":44,
"standard_deviation":0.0536,
"sum":4.6651,
"sum_squares":0.61815,
"variance":0.00287
},
"lhs_cover":{
"counts":[
[
0.06,
2
],
[
0.08,
2
],
[
0.10667,
4
],
[
0.12667,
1
],
...
[
0.5,
4
]
],
"maximum":0.5,
"mean":0.29894,
"median":0.33213,
"minimum":0.06,
"population":44,
"standard_deviation":0.13386,
"sum":13.15331,
"sum_squares":4.70252,
"variance":0.01792
},
"lift":{
"counts":[
[
1.40625,
2
],
[
1.5067,
2
],
...
[
2.63158,
4
],
[
3,
10
],
[
4.93421,
2
],
[
12.5,
2
]
],
"maximum":12.5,
"mean":2.91963,
"median":2.58068,
"minimum":1.40625,
"population":44,
"standard_deviation":2.24641,
"sum":128.46352,
"sum_squares":592.05855,
"variance":5.04635
},
"p_value":{
"counts":[
[
0.000000000,
2
],
[
0.000000000,
4
],
[
0.000000000,
2
],
...
[
0.0000910873,
2
]
],
"maximum":0.0000910873,
"mean":0.0000106114,
"median":0.00000000,
"minimum":0.000000000,
"population":44,
"standard_deviation":0.0000227364,
"sum":0.000466903,
"sum_squares":0.0000000,
"variance":0.000000001
},
"rhs_cover":{
"counts":[
[
0.06,
2
],
[
0.08,
2
],
...
[
0.42667,
2
],
[
0.46667,
3
],
[
0.5,
4
]
],
"maximum":0.5,
"mean":0.29894,
"median":0.33213,
"minimum":0.06,
"population":44,
"standard_deviation":0.13386,
"sum":13.15331,
"sum_squares":4.70252,
"variance":0.01792
},
"support":{
"counts":[
[
0.06,
4
],
[
0.06667,
2
],
[
0.08,
2
],
[
0.08667,
4
],
[
0.10667,
4
],
[
0.15333,
2
],
[
0.18667,
4
],
[
0.19333,
2
],
[
0.20667,
2
],
[
0.27333,
2
],
[
0.28667,
2
],
[
0.3,
4
],
[
0.32,
2
],
[
0.33333,
6
],
[
0.37333,
2
]
],
"maximum":0.37333,
"mean":0.20152,
"median":0.19057,
"minimum":0.06,
"population":44,
"standard_deviation":0.10734,
"sum":8.86668,
"sum_squares":2.28221,
"variance":0.01152
}
},
"search_strategy":"leverage",
"significance_level":0.05
},
"category":0,
"clones":0,
"code":200,
"columns":5,
"created":"2015-11-05T08:06:08.184000",
"credits":0.017581939697265625,
"dataset":"dataset/562fae3f4e1727141d00004e",
"dataset_status":true,
"dataset_type":0,
"description":"",
"excluded_fields":[ ],
"fields_meta":{
"count":5,
"limit":1000,
"offset":0,
"query_total":5,
"total":5
},
"input_fields":[
"000000",
"000001",
"000002",
"000003",
"000004"
],
"locale":"en_US",
"max_columns":5,
"max_rows":150,
"name":"iris' dataset's association",
"out_of_bag":false,
"price":0,
"private":true,
"project":null,
"range":[
1,
150
],
"replacement":false,
"resource":"association/5621b70910cb86ae4c000000",
"rows":150,
"sample_rate":1,
"shared":false,
"size":4609,
"source":"source/562fae3a4e1727141d000048",
"source_status":true,
"status":{
"code":5,
"elapsed":1072,
"message":"The association has been created",
"progress":1
},
"subscription":false,
"tags":[ ],
"updated":"2015-11-05T08:06:20.403000",
"white_box":false
}

Note that the output in the snippet above has been abbreviated. As you see,
the ``associations`` attribute stores items, rules and metrics extracted
from the datasets as well as the configuration parameters described in
the `developers section `_ .
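
To make the rules easier to read, the item positions in ``lhs`` and ``rhs``
can be replaced by item names. The sketch below assumes that those numbers
are zero-based positions in the ``items`` list, which the sample above
suggests but does not state. ``describe_rules`` is a hypothetical helper and
the input is a small hand-made array, not a real API response.

.. code-block:: php

    <?php
    /**
     * Render each rule as "A & B -> C (confidence x, lift y)".
     * Assumption: lhs/rhs hold zero-based positions in the "items" list.
     * Hypothetical helper, not part of the BigML bindings.
     */
    function describe_rules(array $associations): array
    {
        $names = array_column($associations['items'], 'name');
        $lines = [];
        foreach ($associations['rules'] as $rule) {
            $lhs = array_map(fn (int $i): string => $names[$i], $rule['lhs']);
            $rhs = array_map(fn (int $i): string => $names[$i], $rule['rhs']);
            $lines[] = sprintf(
                '%s -> %s (confidence %s, lift %s)',
                implode(' & ', $lhs),
                implode(' & ', $rhs),
                $rule['confidence'],
                $rule['lift']
            );
        }
        return $lines;
    }

    // Hand-made stand-in for the "associations" attribute of a resource.
    $associations = [
        'items' => [
            ['name' => 'hamburgers'],
            ['name' => 'ketchup'],
            ['name' => 'bread'],
        ],
        'rules' => [
            ['lhs' => [0, 1], 'rhs' => [2], 'confidence' => 0.8, 'lift' => 2.5],
        ],
    ];

    foreach (describe_rules($associations) as $line) {
        echo $line, "\n";
    }
    // prints: hamburgers & ketchup -> bread (confidence 0.8, lift 2.5)

Item names alone can be ambiguous in real output (the sample contains a
``Segment 1`` item for two different fields), so a fuller version would
probably also show each item's ``field_id``.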

Topic Models
------------

A topic model is an unsupervised machine learning method for unveiling
all the different topics underlying a collection of documents. BigML
uses Latent Dirichlet Allocation (LDA), one of the most popular
probabilistic methods for topic modeling. In BigML, each instance
(i.e. each row in your dataset) will be considered a document and the
contents of all the text fields given as inputs will be automatically
concatenated and considered the document bag of words.

A topic model is based on the assumption that any document exhibits a
mixture of topics. Each topic is composed of a set of words which are
thematically related. The words from a given topic have different
probabilities for that topic. At the same time, each word can be
attributable to one or several topics. For example, the word “sea”
may be found in a topic related to sea transport but also in a topic
related to holidays. Topic models automatically discard stop words and
high-frequency words.

The main applications of topic models include browsing, organizing and
understanding large archives of documents. They can be applied to
information retrieval, collaborative filtering, and assessing document
similarity, among others. The topics found in the dataset can also be
very useful new features before applying other models like
classification, clustering, or anomaly detection.

The JSON structure for a topic mod