Getting Started with MADlib for Greenplum

Introduction

Not many data scientists realize that it is possible, and often far more efficient, to build data science models directly on top of a data warehouse. Apache MADlib is an open-source library for scalable in-database analytics. The Greenplum MADlib extension provides the ability to run machine learning and deep learning workloads directly in a Greenplum Database.

Outline

1. Pre-Processing

2. Sampling

   a. Train-Test Split

   b. Stratified Sampling

3. Neural Networks for Supervised Learning (Multilayer Perceptrons)

4. Technical Background

Pre-Processing

Pre-processing is an important step in most machine learning and data science projects. Here, we show how to apply pre-processing to data that already resides in the data warehouse.

Train-Test Split

Examples

Please note that due to the random nature of sampling, your results may look different from those below.

  1. Create an input table:
    DROP TABLE IF EXISTS test;
    CREATE TABLE test(
        id1 INTEGER,
        id2 INTEGER,
        gr1 INTEGER,
        gr2 INTEGER
    );
    INSERT INTO test VALUES
    (1,0,1,1),
    (2,0,1,1),
    (3,0,1,1),
    (4,0,1,1),
    (5,0,1,1),
    (6,0,1,1),
    (7,0,1,1),
    (8,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (9,0,1,1),
    (0,1,1,2),
    (0,2,1,2),
    (0,3,1,2),
    (0,4,1,2),
    (0,5,1,2),
    (0,6,1,2),
    (10,10,2,2),
    (20,20,2,2),
    (30,30,2,2),
    (40,40,2,2),
    (50,50,2,2),
    (60,60,2,2),
    (70,70,2,2);
  2. Sample without replacement:
    DROP TABLE IF EXISTS out;
    SELECT madlib.train_test_split(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- Train proportion
                                    0.5,       -- Test proportion
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    FALSE,     -- Sample without replacement
                                    FALSE);    -- Do not separate output tables
    SELECT * FROM out ORDER BY split,gr1,gr2,id1,id2;

     gr1 | gr2 | id1 | id2 | split
    -----+-----+-----+-----+-------
       1 |   1 |   1 |   0 |     0
       1 |   1 |   4 |   0 |     0
       1 |   1 |   6 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   1 |   9 |   0 |     0
       1 |   2 |   0 |   3 |     0
       1 |   2 |   0 |   4 |     0
       1 |   2 |   0 |   5 |     0
       2 |   2 |  10 |  10 |     0
       2 |   2 |  30 |  30 |     0
       2 |   2 |  40 |  40 |     0
       2 |   2 |  60 |  60 |     0
       1 |   1 |   2 |   0 |     1
       1 |   1 |   3 |   0 |     1
       1 |   1 |   5 |   0 |     1
       1 |   1 |   7 |   0 |     1
       1 |   1 |   8 |   0 |     1
       1 |   1 |   9 |   0 |     1
       1 |   2 |   0 |   1 |     1
       1 |   2 |   0 |   2 |     1
       1 |   2 |   0 |   6 |     1
       2 |   2 |  20 |  20 |     1
       2 |   2 |  50 |  50 |     1
       2 |   2 |  70 |  70 |     1
    (25 rows)
  3. Sample with replacement and create separate train and test tables:
    DROP TABLE IF EXISTS out_train, out_test;
    SELECT madlib.train_test_split(
                                    'test',    -- Source table
                                    'out',     -- Output table
                                    0.5,       -- train_proportion
                                    NULL,      -- Default = 1 - train_proportion = 0.5
                                    'gr1,gr2', -- Strata definition
                                    'id1,id2', -- Columns to output
                                    TRUE,      -- Sample with replacement
                                    TRUE);     -- Separate output tables
    SELECT * FROM out_train ORDER BY gr1,gr2,id1,id2;

     gr1 | gr2 | id1 | id2
    -----+-----+-----+-----
       1 |   1 |   1 |   0
       1 |   1 |   2 |   0
       1 |   1 |   4 |   0
       1 |   1 |   7 |   0
       1 |   1 |   8 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   4
       1 |   2 |   0 |   5
       1 |   2 |   0 |   6
       2 |   2 |  40 |  40
       2 |   2 |  50 |  50
       2 |   2 |  50 |  50
    (12 rows)

    SELECT * FROM out_test ORDER BY gr1,gr2,id1,id2;

     gr1 | gr2 | id1 | id2
    -----+-----+-----+-----
       1 |   1 |   1 |   0
       1 |   1 |   1 |   0
       1 |   1 |   3 |   0
       1 |   1 |   4 |   0
       1 |   1 |   5 |   0
       1 |   1 |   9 |   0
       1 |   2 |   0 |   1
       1 |   2 |   0 |   5
       1 |   2 |   0 |   6
       2 |   2 |  20 |  20
       2 |   2 |  20 |  20
       2 |   2 |  20 |  20
       2 |   2 |  70 |  70
    (13 rows)
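
As a quick sanity check on the split sizes, you can count the rows that landed in each output table. This is a hedged sketch against the tables created above; exact counts vary from run to run because the sampling is random:

-- Compare train/test split sizes (illustrative; counts depend on the random draw)
SELECT 'train' AS split, COUNT(*) AS n FROM out_train
UNION ALL
SELECT 'test', COUNT(*) FROM out_test;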

Stratified Sampling

DROP TABLE IF EXISTS out;

SELECT madlib.stratified_sample(
                                'test',    -- Source table
                                'out',     -- Output table
                                0.5,       -- Sample proportion
                                'gr1,gr2', -- Strata definition
                                'id1,id2', -- Columns to output
                                FALSE);    -- Sample without replacement

SELECT * FROM out ORDER BY gr1,gr2,id1,id2;
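
To verify that the sampling respected the strata, you can count how many rows were drawn per stratum. A hedged sketch against the tables above; with a 0.5 proportion, each (gr1,gr2) group should contribute roughly half of its rows:

-- Rows sampled per stratum (illustrative; exact counts depend on the random draw)
SELECT gr1, gr2, COUNT(*) AS sampled_rows
FROM out
GROUP BY gr1, gr2
ORDER BY gr1, gr2;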

Neural Networks on Greenplum with Apache MADlib

Supervised Learning

Examples

Classification without Mini-Batching

  1. Create an input data set.

DROP TABLE IF EXISTS iris_data;

CREATE TABLE iris_data(
    id serial,
    attributes numeric[],
    class_text varchar,
    class integer,
    state varchar
);
INSERT INTO iris_data(id, attributes, class_text, class, state) VALUES
(1,ARRAY[5.0,3.2,1.2,0.2],'Iris_setosa',1,'Alaska'),
(2,ARRAY[5.5,3.5,1.3,0.2],'Iris_setosa',1,'Alaska'),
(3,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Alaska'),
(4,ARRAY[4.4,3.0,1.3,0.2],'Iris_setosa',1,'Alaska'),
(5,ARRAY[5.1,3.4,1.5,0.2],'Iris_setosa',1,'Alaska'),
(6,ARRAY[5.0,3.5,1.3,0.3],'Iris_setosa',1,'Alaska'),
(7,ARRAY[4.5,2.3,1.3,0.3],'Iris_setosa',1,'Alaska'),
(8,ARRAY[4.4,3.2,1.3,0.2],'Iris_setosa',1,'Alaska'),
(9,ARRAY[5.0,3.5,1.6,0.6],'Iris_setosa',1,'Alaska'),
(10,ARRAY[5.1,3.8,1.9,0.4],'Iris_setosa',1,'Alaska'),
(11,ARRAY[4.8,3.0,1.4,0.3],'Iris_setosa',1,'Alaska'),
(12,ARRAY[5.1,3.8,1.6,0.2],'Iris_setosa',1,'Alaska'),
(13,ARRAY[5.7,2.8,4.5,1.3],'Iris_versicolor',2,'Alaska'),
(14,ARRAY[6.3,3.3,4.7,1.6],'Iris_versicolor',2,'Alaska'),
(15,ARRAY[4.9,2.4,3.3,1.0],'Iris_versicolor',2,'Alaska'),
(16,ARRAY[6.6,2.9,4.6,1.3],'Iris_versicolor',2,'Alaska'),
(17,ARRAY[5.2,2.7,3.9,1.4],'Iris_versicolor',2,'Alaska'),
(18,ARRAY[5.0,2.0,3.5,1.0],'Iris_versicolor',2,'Alaska'),
(19,ARRAY[5.9,3.0,4.2,1.5],'Iris_versicolor',2,'Alaska'),
(20,ARRAY[6.0,2.2,4.0,1.0],'Iris_versicolor',2,'Alaska'),
(21,ARRAY[6.1,2.9,4.7,1.4],'Iris_versicolor',2,'Alaska'),
(22,ARRAY[5.6,2.9,3.6,1.3],'Iris_versicolor',2,'Alaska'),
(23,ARRAY[6.7,3.1,4.4,1.4],'Iris_versicolor',2,'Alaska'),
(24,ARRAY[5.6,3.0,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(25,ARRAY[5.8,2.7,4.1,1.0],'Iris_versicolor',2,'Alaska'),
(26,ARRAY[6.2,2.2,4.5,1.5],'Iris_versicolor',2,'Alaska'),
(27,ARRAY[5.6,2.5,3.9,1.1],'Iris_versicolor',2,'Alaska'),
(28,ARRAY[5.0,3.4,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(29,ARRAY[4.4,2.9,1.4,0.2],'Iris_setosa',1,'Tennessee'),
(30,ARRAY[4.9,3.1,1.5,0.1],'Iris_setosa',1,'Tennessee'),
(31,ARRAY[5.4,3.7,1.5,0.2],'Iris_setosa',1,'Tennessee'),
(32,ARRAY[4.8,3.4,1.6,0.2],'Iris_setosa',1,'Tennessee'),
(33,ARRAY[4.8,3.0,1.4,0.1],'Iris_setosa',1,'Tennessee'),
(34,ARRAY[4.3,3.0,1.1,0.1],'Iris_setosa',1,'Tennessee'),
(35,ARRAY[5.8,4.0,1.2,0.2],'Iris_setosa',1,'Tennessee'),
(36,ARRAY[5.7,4.4,1.5,0.4],'Iris_setosa',1,'Tennessee'),
(37,ARRAY[5.4,3.9,1.3,0.4],'Iris_setosa',1,'Tennessee'),
(38,ARRAY[6.0,2.9,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(39,ARRAY[5.7,2.6,3.5,1.0],'Iris_versicolor',2,'Tennessee'),
(40,ARRAY[5.5,2.4,3.8,1.1],'Iris_versicolor',2,'Tennessee'),
(41,ARRAY[5.5,2.4,3.7,1.0],'Iris_versicolor',2,'Tennessee'),
(42,ARRAY[5.8,2.7,3.9,1.2],'Iris_versicolor',2,'Tennessee'),
(43,ARRAY[6.0,2.7,5.1,1.6],'Iris_versicolor',2,'Tennessee'),
(44,ARRAY[5.4,3.0,4.5,1.5],'Iris_versicolor',2,'Tennessee'),
(45,ARRAY[6.0,3.4,4.5,1.6],'Iris_versicolor',2,'Tennessee'),
(46,ARRAY[6.7,3.1,4.7,1.5],'Iris_versicolor',2,'Tennessee'),
(47,ARRAY[6.3,2.3,4.4,1.3],'Iris_versicolor',2,'Tennessee'),
(48,ARRAY[5.6,3.0,4.1,1.3],'Iris_versicolor',2,'Tennessee'),
(49,ARRAY[5.5,2.5,4.0,1.3],'Iris_versicolor',2,'Tennessee'),
(50,ARRAY[5.5,2.6,4.4,1.2],'Iris_versicolor',2,'Tennessee'),
(51,ARRAY[6.1,3.0,4.6,1.4],'Iris_versicolor',2,'Tennessee'),
(52,ARRAY[5.8,2.6,4.0,1.2],'Iris_versicolor',2,'Tennessee');

  2. Generate a multilayer perceptron with a single hidden layer of 5 units. Use the attributes column as the independent variables, and use the class column as the classification. Set the tolerance to 0 so that all 500 iterations will be run. Use a hyperbolic tangent activation function. The model will be written to mlp_model.
    DROP TABLE IF EXISTS mlp_model, mlp_model_summary, mlp_model_standardization;
    -- Set seed so results are reproducible
    SELECT setseed(0);
    SELECT madlib.mlp_classification(
        'iris_data',      -- Source table
        'mlp_model',      -- Destination table
        'attributes',     -- Input features
        'class_text',     -- Label
        ARRAY[5],         -- Number of units per layer
        'learning_rate_init=0.003,
        n_iterations=500,
        tolerance=0',     -- Optimizer params
        'tanh',           -- Activation function
        NULL,             -- Default weight (1)
        FALSE,            -- No warm start
        FALSE             -- Not verbose
    );

    View the model:
    \x on
    SELECT * FROM mlp_model;

    -[ RECORD 1 ]--+------------------------------------------------------------------------------------
    coeff          | {-0.40378996718,0.0157490328855,-0.298904053444,-0.984152185093,-0.657684089715 …
    loss           | 0.0103518565103
    num_iterations | 500

    View the model summary table:
    SELECT * FROM mlp_model_summary;

    -[ RECORD 1 ]--------+------------------------------
    source_table         | iris_data
    independent_varname  | attributes
    dependent_varname    | class_text
    dependent_vartype    | character varying
    tolerance            | 0
    learning_rate_init   | 0.003
    learning_rate_policy | constant
    momentum             | 0.9
    nesterov             | t
    n_iterations         | 500
    n_tries              | 1
    layer_sizes          | {4,5,2}
    activation           | tanh
    is_classification    | t
    classes              | {Iris_setosa,Iris_versicolor}
    weights              | 1
    grouping_col         | NULL

    View the model standardization table:
    SELECT * FROM mlp_model_standardization;

    -[ RECORD 1 ]------------------------------------------------------------------
    mean | {5.45961538461539,2.99807692307692,3.025,0.851923076923077}
    std  | {0.598799958694505,0.498262513685689,1.41840579525043,0.550346179381454}
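
    As an aside, the standardization table records the per-feature mean and standard deviation that MADlib uses to z-score the inputs internally. The sketch below reproduces that transformation by hand for one row, assuming the array layouts shown above; it is illustrative only:

    -- Hedged sketch: z = (x - mean) / std, computed manually for the first
    -- two features of row id = 1. MADlib applies this standardization
    -- internally during training and prediction.
    SELECT (d.attributes[1] - s.mean[1]) / s.std[1] AS z1,
           (d.attributes[2] - s.mean[2]) / s.std[2] AS z2
    FROM iris_data d, mlp_model_standardization s
    WHERE d.id = 1;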
  3. Now let's use the model to predict. In the following example we will use the training data set for prediction as well, which is not usual but serves to show the syntax. The prediction is in the estimated_class_text column with the actual value in the class_text column.

DROP TABLE IF EXISTS mlp_prediction;

\x off

SELECT madlib.mlp_predict(
         'mlp_model',         -- Model table
         'iris_data',         -- Test data table
         'id',                -- Id column in test table
         'mlp_prediction',    -- Output table for predictions
         'response'           -- Output classes, not probabilities
     );

SELECT * FROM mlp_prediction JOIN iris_data USING (id) ORDER BY id;

 id | estimated_class_text |    attributes     |   class_text    | class |   state
----+----------------------+-------------------+-----------------+-------+-----------
  1 | Iris_setosa          | {5.0,3.2,1.2,0.2} | Iris_setosa     |     1 | Alaska
  2 | Iris_setosa          | {5.5,3.5,1.3,0.2} | Iris_setosa     |     1 | Alaska
  3 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1 | Alaska
  4 | Iris_setosa          | {4.4,3.0,1.3,0.2} | Iris_setosa     |     1 | Alaska
  5 | Iris_setosa          | {5.1,3.4,1.5,0.2} | Iris_setosa     |     1 | Alaska
  6 | Iris_setosa          | {5.0,3.5,1.3,0.3} | Iris_setosa     |     1 | Alaska
  7 | Iris_setosa          | {4.5,2.3,1.3,0.3} | Iris_setosa     |     1 | Alaska
  8 | Iris_setosa          | {4.4,3.2,1.3,0.2} | Iris_setosa     |     1 | Alaska
  9 | Iris_setosa          | {5.0,3.5,1.6,0.6} | Iris_setosa     |     1 | Alaska
 10 | Iris_setosa          | {5.1,3.8,1.9,0.4} | Iris_setosa     |     1 | Alaska
 11 | Iris_setosa          | {4.8,3.0,1.4,0.3} | Iris_setosa     |     1 | Alaska
 12 | Iris_setosa          | {5.1,3.8,1.6,0.2} | Iris_setosa     |     1 | Alaska
 13 | Iris_versicolor      | {5.7,2.8,4.5,1.3} | Iris_versicolor |     2 | Alaska
 14 | Iris_versicolor      | {6.3,3.3,4.7,1.6} | Iris_versicolor |     2 | Alaska
 15 | Iris_versicolor      | {4.9,2.4,3.3,1.0} | Iris_versicolor |     2 | Alaska
 16 | Iris_versicolor      | {6.6,2.9,4.6,1.3} | Iris_versicolor |     2 | Alaska
 17 | Iris_versicolor      | {5.2,2.7,3.9,1.4} | Iris_versicolor |     2 | Alaska
 18 | Iris_versicolor      | {5.0,2.0,3.5,1.0} | Iris_versicolor |     2 | Alaska
 19 | Iris_versicolor      | {5.9,3.0,4.2,1.5} | Iris_versicolor |     2 | Alaska
 20 | Iris_versicolor      | {6.0,2.2,4.0,1.0} | Iris_versicolor |     2 | Alaska
 21 | Iris_versicolor      | {6.1,2.9,4.7,1.4} | Iris_versicolor |     2 | Alaska
 22 | Iris_versicolor      | {5.6,2.9,3.6,1.3} | Iris_versicolor |     2 | Alaska
 23 | Iris_versicolor      | {6.7,3.1,4.4,1.4} | Iris_versicolor |     2 | Alaska
 24 | Iris_versicolor      | {5.6,3.0,4.5,1.5} | Iris_versicolor |     2 | Alaska
 25 | Iris_versicolor      | {5.8,2.7,4.1,1.0} | Iris_versicolor |     2 | Alaska
 26 | Iris_versicolor      | {6.2,2.2,4.5,1.5} | Iris_versicolor |     2 | Alaska
 27 | Iris_versicolor      | {5.6,2.5,3.9,1.1} | Iris_versicolor |     2 | Alaska
 28 | Iris_setosa          | {5.0,3.4,1.5,0.2} | Iris_setosa     |     1 | Tennessee
 29 | Iris_setosa          | {4.4,2.9,1.4,0.2} | Iris_setosa     |     1 | Tennessee
 30 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1 | Tennessee
 31 | Iris_setosa          | {5.4,3.7,1.5,0.2} | Iris_setosa     |     1 | Tennessee
 32 | Iris_setosa          | {4.8,3.4,1.6,0.2} | Iris_setosa     |     1 | Tennessee
 33 | Iris_setosa          | {4.8,3.0,1.4,0.1} | Iris_setosa     |     1 | Tennessee
 34 | Iris_setosa          | {4.3,3.0,1.1,0.1} | Iris_setosa     |     1 | Tennessee
 35 | Iris_setosa          | {5.8,4.0,1.2,0.2} | Iris_setosa     |     1 | Tennessee
 36 | Iris_setosa          | {5.7,4.4,1.5,0.4} | Iris_setosa     |     1 | Tennessee
 37 | Iris_setosa          | {5.4,3.9,1.3,0.4} | Iris_setosa     |     1 | Tennessee
 38 | Iris_versicolor      | {6.0,2.9,4.5,1.5} | Iris_versicolor |     2 | Tennessee
 39 | Iris_versicolor      | {5.7,2.6,3.5,1.0} | Iris_versicolor |     2 | Tennessee
 40 | Iris_versicolor      | {5.5,2.4,3.8,1.1} | Iris_versicolor |     2 | Tennessee
 41 | Iris_versicolor      | {5.5,2.4,3.7,1.0} | Iris_versicolor |     2 | Tennessee
 42 | Iris_versicolor      | {5.8,2.7,3.9,1.2} | Iris_versicolor |     2 | Tennessee
 43 | Iris_versicolor      | {6.0,2.7,5.1,1.6} | Iris_versicolor |     2 | Tennessee
 44 | Iris_versicolor      | {5.4,3.0,4.5,1.5} | Iris_versicolor |     2 | Tennessee
 45 | Iris_versicolor      | {6.0,3.4,4.5,1.6} | Iris_versicolor |     2 | Tennessee
 46 | Iris_versicolor      | {6.7,3.1,4.7,1.5} | Iris_versicolor |     2 | Tennessee
 47 | Iris_versicolor      | {6.3,2.3,4.4,1.3} | Iris_versicolor |     2 | Tennessee
 48 | Iris_versicolor      | {5.6,3.0,4.1,1.3} | Iris_versicolor |     2 | Tennessee
 49 | Iris_versicolor      | {5.5,2.5,4.0,1.3} | Iris_versicolor |     2 | Tennessee
 50 | Iris_versicolor      | {5.5,2.6,4.4,1.2} | Iris_versicolor |     2 | Tennessee
 51 | Iris_versicolor      | {6.1,3.0,4.6,1.4} | Iris_versicolor |     2 | Tennessee
 52 | Iris_versicolor      | {5.8,2.6,4.0,1.2} | Iris_versicolor |     2 | Tennessee
(52 rows)

Count the misclassifications:

SELECT COUNT(*) FROM mlp_prediction JOIN iris_data USING (id)
WHERE mlp_prediction.estimated_class_text != iris_data.class_text;

 count
-------
     0
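
Equivalently, you can report accuracy as a percentage rather than a raw error count. A small hedged sketch against the same join:

-- Classification accuracy over the training set (100% here, since 0 rows were misclassified)
SELECT 100.0 * AVG(CASE WHEN estimated_class_text = class_text THEN 1 ELSE 0 END) AS accuracy_pct
FROM mlp_prediction
JOIN iris_data USING (id);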

Classification with Mini-Batching

  1. Use the same data set as above. Call mini-batch preprocessor:
    DROP TABLE IF EXISTS iris_data_packed, iris_data_packed_summary, iris_data_packed_standardization;
    SELECT madlib.minibatch_preprocessor('iris_data',         -- Source table
                                         'iris_data_packed',  -- Output table
                                         'class_text',        -- Dependent variable
                                         'attributes'         -- Independent variables
                                        );
  2. Train the classification model using similar parameters as before:
    DROP TABLE IF EXISTS mlp_model, mlp_model_summary, mlp_model_standardization;
    — Set seed so results are reproducible
    SELECT setseed(0);
    SELECT madlib.mlp_classification(
        'iris_data_packed',      -- Output table from mini-batch preprocessor
        'mlp_model',             -- Destination table
        'independent_varname',   -- Hardcode to this, from table iris_data_packed
        'dependent_varname',     -- Hardcode to this, from table iris_data_packed
        ARRAY[5],                -- Number of units per layer
        'learning_rate_init=0.1,
        n_iterations=500,
        tolerance=0',            -- Optimizer params
        'tanh',                  -- Activation function
        NULL,                    -- Default weight (1)
        FALSE,                   -- No warm start
        FALSE                    -- Not verbose
    );

    View the model:
    \x on
    SELECT * FROM mlp_model;

    -[ RECORD 1 ]--+------------------------------------------------------------
    coeff          | {-0.0780564661828377,-0.0781452670639994,0.3083605989842 …
    loss           | 0.00563534904146765
    num_iterations | 500
  3. Now let’s use the model to predict. As before we will use the training data set for prediction as well, which is not usual but serves to show the syntax. The prediction is in the estimated_class_text column with the actual value in the class_text column.
    DROP TABLE IF EXISTS mlp_prediction;
    \x off
    SELECT madlib.mlp_predict(
             'mlp_model',         -- Model table
             'iris_data',         -- Test data table
             'id',                -- Id column in test table
             'mlp_prediction',    -- Output table for predictions
             'response'           -- Output classes, not probabilities
         );
    SELECT * FROM mlp_prediction JOIN iris_data USING (id) ORDER BY id;

     id | estimated_class_text |    attributes     |   class_text    | class |   state
    ----+----------------------+-------------------+-----------------+-------+-----------
      1 | Iris_setosa          | {5.0,3.2,1.2,0.2} | Iris_setosa     |     1 | Alaska
      2 | Iris_setosa          | {5.5,3.5,1.3,0.2} | Iris_setosa     |     1 | Alaska
      3 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1 | Alaska
      4 | Iris_setosa          | {4.4,3.0,1.3,0.2} | Iris_setosa     |     1 | Alaska
      5 | Iris_setosa          | {5.1,3.4,1.5,0.2} | Iris_setosa     |     1 | Alaska
      6 | Iris_setosa          | {5.0,3.5,1.3,0.3} | Iris_setosa     |     1 | Alaska
      7 | Iris_setosa          | {4.5,2.3,1.3,0.3} | Iris_setosa     |     1 | Alaska
      8 | Iris_setosa          | {4.4,3.2,1.3,0.2} | Iris_setosa     |     1 | Alaska
      9 | Iris_setosa          | {5.0,3.5,1.6,0.6} | Iris_setosa     |     1 | Alaska
     10 | Iris_setosa          | {5.1,3.8,1.9,0.4} | Iris_setosa     |     1 | Alaska
     11 | Iris_setosa          | {4.8,3.0,1.4,0.3} | Iris_setosa     |     1 | Alaska
     12 | Iris_setosa          | {5.1,3.8,1.6,0.2} | Iris_setosa     |     1 | Alaska
     13 | Iris_versicolor      | {5.7,2.8,4.5,1.3} | Iris_versicolor |     2 | Alaska
     14 | Iris_versicolor      | {6.3,3.3,4.7,1.6} | Iris_versicolor |     2 | Alaska
     15 | Iris_versicolor      | {4.9,2.4,3.3,1.0} | Iris_versicolor |     2 | Alaska
     16 | Iris_versicolor      | {6.6,2.9,4.6,1.3} | Iris_versicolor |     2 | Alaska
     17 | Iris_versicolor      | {5.2,2.7,3.9,1.4} | Iris_versicolor |     2 | Alaska
     18 | Iris_versicolor      | {5.0,2.0,3.5,1.0} | Iris_versicolor |     2 | Alaska
     19 | Iris_versicolor      | {5.9,3.0,4.2,1.5} | Iris_versicolor |     2 | Alaska
     20 | Iris_versicolor      | {6.0,2.2,4.0,1.0} | Iris_versicolor |     2 | Alaska
     21 | Iris_versicolor      | {6.1,2.9,4.7,1.4} | Iris_versicolor |     2 | Alaska
     22 | Iris_versicolor      | {5.6,2.9,3.6,1.3} | Iris_versicolor |     2 | Alaska
     23 | Iris_versicolor      | {6.7,3.1,4.4,1.4} | Iris_versicolor |     2 | Alaska
     24 | Iris_versicolor      | {5.6,3.0,4.5,1.5} | Iris_versicolor |     2 | Alaska
     25 | Iris_versicolor      | {5.8,2.7,4.1,1.0} | Iris_versicolor |     2 | Alaska
     26 | Iris_versicolor      | {6.2,2.2,4.5,1.5} | Iris_versicolor |     2 | Alaska
     27 | Iris_versicolor      | {5.6,2.5,3.9,1.1} | Iris_versicolor |     2 | Alaska
     28 | Iris_setosa          | {5.0,3.4,1.5,0.2} | Iris_setosa     |     1 | Tennessee
     29 | Iris_setosa          | {4.4,2.9,1.4,0.2} | Iris_setosa     |     1 | Tennessee
     30 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1 | Tennessee
     31 | Iris_setosa          | {5.4,3.7,1.5,0.2} | Iris_setosa     |     1 | Tennessee
     32 | Iris_setosa          | {4.8,3.4,1.6,0.2} | Iris_setosa     |     1 | Tennessee
     33 | Iris_setosa          | {4.8,3.0,1.4,0.1} | Iris_setosa     |     1 | Tennessee
     34 | Iris_setosa          | {4.3,3.0,1.1,0.1} | Iris_setosa     |     1 | Tennessee
     35 | Iris_setosa          | {5.8,4.0,1.2,0.2} | Iris_setosa     |     1 | Tennessee
     36 | Iris_setosa          | {5.7,4.4,1.5,0.4} | Iris_setosa     |     1 | Tennessee
     37 | Iris_setosa          | {5.4,3.9,1.3,0.4} | Iris_setosa     |     1 | Tennessee
     38 | Iris_versicolor      | {6.0,2.9,4.5,1.5} | Iris_versicolor |     2 | Tennessee
     39 | Iris_versicolor      | {5.7,2.6,3.5,1.0} | Iris_versicolor |     2 | Tennessee
     40 | Iris_versicolor      | {5.5,2.4,3.8,1.1} | Iris_versicolor |     2 | Tennessee
     41 | Iris_versicolor      | {5.5,2.4,3.7,1.0} | Iris_versicolor |     2 | Tennessee
     42 | Iris_versicolor      | {5.8,2.7,3.9,1.2} | Iris_versicolor |     2 | Tennessee
     43 | Iris_versicolor      | {6.0,2.7,5.1,1.6} | Iris_versicolor |     2 | Tennessee
     44 | Iris_versicolor      | {5.4,3.0,4.5,1.5} | Iris_versicolor |     2 | Tennessee
     45 | Iris_versicolor      | {6.0,3.4,4.5,1.6} | Iris_versicolor |     2 | Tennessee
     46 | Iris_versicolor      | {6.7,3.1,4.7,1.5} | Iris_versicolor |     2 | Tennessee
     47 | Iris_versicolor      | {6.3,2.3,4.4,1.3} | Iris_versicolor |     2 | Tennessee
     48 | Iris_versicolor      | {5.6,3.0,4.1,1.3} | Iris_versicolor |     2 | Tennessee
     49 | Iris_versicolor      | {5.5,2.5,4.0,1.3} | Iris_versicolor |     2 | Tennessee
     50 | Iris_versicolor      | {5.5,2.6,4.4,1.2} | Iris_versicolor |     2 | Tennessee
     51 | Iris_versicolor      | {6.1,3.0,4.6,1.4} | Iris_versicolor |     2 | Tennessee
     52 | Iris_versicolor      | {5.8,2.6,4.0,1.2} | Iris_versicolor |     2 | Tennessee
    (52 rows)

    Count the misclassifications:
    SELECT COUNT(*) FROM mlp_prediction JOIN iris_data USING (id)
    WHERE mlp_prediction.estimated_class_text != iris_data.class_text;

     count
    -------
         0
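
    As an aside, you can inspect the packed table produced by the mini-batch preprocessor in step 1. Per the MADlib convention, it should contain __id__, dependent_varname, and independent_varname columns, with many source rows packed into each buffer row; a hedged peek (exact buffer sizes depend on your data and the preprocessor defaults):

    -- How many buffer rows were created, and how many source rows each one holds
    SELECT __id__, array_upper(dependent_varname, 1) AS rows_in_buffer
    FROM iris_data_packed
    ORDER BY __id__;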

Classification with Other Parameters

  1. Now, use the n_tries optimizer parameter to train several models from different random initializations and keep the best one. Run only for 50 iterations per model and choose the best of the three tries. Note we are not using mini-batching here.

DROP TABLE IF EXISTS mlp_model, mlp_model_summary, mlp_model_standardization;

-- Set seed so results are reproducible
SELECT setseed(0);

SELECT madlib.mlp_classification(
    'iris_data',      -- Source table
    'mlp_model',      -- Destination table
    'attributes',     -- Input features
    'class_text',     -- Label
    ARRAY[5],         -- Number of units per layer
    'learning_rate_init=0.003,
    n_iterations=50,
    tolerance=0,
    n_tries=3',       -- Optimizer params, with n_tries
    'tanh',           -- Activation function
    NULL,             -- Default weight (1)
    FALSE,            -- No warm start
    FALSE             -- Not verbose
);

View the model:

\x on

SELECT * FROM mlp_model;

-[ RECORD 1 ]--+------------------------------------------------------------
coeff          | {0.000156316559088915,0.131131017223563,-0.293990512682215 ...
loss           | 0.142238768280717
num_iterations | 50

  2. Next, use the warm_start parameter to start learning a new model, using the coefficients already present in mlp_model. Note that we must not drop the mlp_model table, and cannot use the n_tries parameter if warm_start is used.
    SELECT madlib.mlp_classification(
        'iris_data',      -- Source table
        'mlp_model',      -- Destination table
        'attributes',     -- Input features
        'class_text',     -- Label
        ARRAY[5],         -- Number of units per layer
        'learning_rate_init=0.003,
        n_iterations=450,
        tolerance=0',     -- Optimizer params
        'tanh',           -- Activation function
        NULL,             -- Default weight (1)
        TRUE,             -- Warm start
        FALSE             -- Not verbose
    );

    View the model:
    \x on
    SELECT * FROM mlp_model;

    -[ RECORD 1 ]--+------------------------------------------------------------
    coeff          | {0.0883013960215441,0.235944854050211,-0.398126039487036 …
    loss           | 0.00818899646775659
    num_iterations | 450

    Notice that the loss is lower compared to the previous example, despite having the same values for every other parameter. This is because with n_tries the algorithm trained three different models from different random initial weights and kept the best one, and that best model then supplied the initial weights when training resumed with warm_start.
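
    You can double-check how the run was configured by querying the model summary table, which records the optimizer settings for the training call (columns as shown in the summary output earlier):

    -- Check the recorded optimizer settings for the warm-started run
    SELECT n_iterations, n_tries, learning_rate_init
    FROM mlp_model_summary;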

Classification with Grouping

  1. Next, group the training data by state, and learn a different model for each state. Note we are not using mini-batching in this example.

DROP TABLE IF EXISTS mlp_model_group, mlp_model_group_summary, mlp_model_group_standardization;

-- Set seed so results are reproducible
SELECT setseed(0);

SELECT madlib.mlp_classification(
    'iris_data',        -- Source table
    'mlp_model_group',  -- Destination table
    'attributes',       -- Input features
    'class_text',       -- Label
    ARRAY[5],           -- Number of units per layer
    'learning_rate_init=0.003,
    n_iterations=500,
    tolerance=0',       -- Optimizer params
    'tanh',             -- Activation function
    NULL,               -- Default weight (1)
    FALSE,              -- No warm start
    FALSE,              -- Not verbose
    'state'             -- Grouping column
);

View the model:

\x on

SELECT * FROM mlp_model_group ORDER BY state;

-[ RECORD 1 ]--+------------------------------------------------------------------------------------
state          | Alaska
coeff          | {-0.51246602223,-0.78952457411,0.454192045225,0.223214894458,0.188804700547 ...
loss           | 0.0225081995679
num_iterations | 500
-[ RECORD 2 ]--+------------------------------------------------------------------------------------
state          | Tennessee
coeff          | {-0.215009937565,0.116581594162,-0.397643279185,0.919193295184,-0.0811341736111 ...
loss           | 0.0182854983946
num_iterations | 500

A separate model is learnt for each state, and the result table displays the name of the state (grouping column) associated with the model.
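
Since the model table holds one row per group, the groups can be compared directly, for example by final training loss:

-- Final training loss per state (columns as shown in the model output above)
SELECT state, loss, num_iterations
FROM mlp_model_group
ORDER BY loss;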

  2. Prediction based on grouping using the state column:

DROP TABLE IF EXISTS mlp_prediction;

SELECT madlib.mlp_predict(
         'mlp_model_group',   -- Model table
         'iris_data',         -- Test data table
         'id',                -- Id column in test table
         'mlp_prediction',    -- Output table for predictions
         'response'           -- Output classes, not probabilities
     );

SELECT * FROM mlp_prediction JOIN iris_data USING (state,id) ORDER BY state, id;

Result for the classification model:

   state   | id | estimated_class_text |    attributes     |   class_text    | class
-----------+----+----------------------+-------------------+-----------------+-------
 Alaska    |  1 | Iris_setosa          | {5.0,3.2,1.2,0.2} | Iris_setosa     |     1
 Alaska    |  2 | Iris_setosa          | {5.5,3.5,1.3,0.2} | Iris_setosa     |     1
 Alaska    |  3 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1
 Alaska    |  4 | Iris_setosa          | {4.4,3.0,1.3,0.2} | Iris_setosa     |     1
 Alaska    |  5 | Iris_setosa          | {5.1,3.4,1.5,0.2} | Iris_setosa     |     1
 Alaska    |  6 | Iris_setosa          | {5.0,3.5,1.3,0.3} | Iris_setosa     |     1
 Alaska    |  7 | Iris_setosa          | {4.5,2.3,1.3,0.3} | Iris_setosa     |     1
 Alaska    |  8 | Iris_setosa          | {4.4,3.2,1.3,0.2} | Iris_setosa     |     1
 Alaska    |  9 | Iris_setosa          | {5.0,3.5,1.6,0.6} | Iris_setosa     |     1
 Alaska    | 10 | Iris_setosa          | {5.1,3.8,1.9,0.4} | Iris_setosa     |     1
 Alaska    | 11 | Iris_setosa          | {4.8,3.0,1.4,0.3} | Iris_setosa     |     1
 Alaska    | 12 | Iris_setosa          | {5.1,3.8,1.6,0.2} | Iris_setosa     |     1
 Alaska    | 13 | Iris_versicolor      | {5.7,2.8,4.5,1.3} | Iris_versicolor |     2
 Alaska    | 14 | Iris_versicolor      | {6.3,3.3,4.7,1.6} | Iris_versicolor |     2
 Alaska    | 15 | Iris_versicolor      | {4.9,2.4,3.3,1.0} | Iris_versicolor |     2
 Alaska    | 16 | Iris_versicolor      | {6.6,2.9,4.6,1.3} | Iris_versicolor |     2
 Alaska    | 17 | Iris_versicolor      | {5.2,2.7,3.9,1.4} | Iris_versicolor |     2
 Alaska    | 18 | Iris_versicolor      | {5.0,2.0,3.5,1.0} | Iris_versicolor |     2
 Alaska    | 19 | Iris_versicolor      | {5.9,3.0,4.2,1.5} | Iris_versicolor |     2
 Alaska    | 20 | Iris_versicolor      | {6.0,2.2,4.0,1.0} | Iris_versicolor |     2
 Alaska    | 21 | Iris_versicolor      | {6.1,2.9,4.7,1.4} | Iris_versicolor |     2
 Alaska    | 22 | Iris_versicolor      | {5.6,2.9,3.6,1.3} | Iris_versicolor |     2
 Alaska    | 23 | Iris_versicolor      | {6.7,3.1,4.4,1.4} | Iris_versicolor |     2
 Alaska    | 24 | Iris_versicolor      | {5.6,3.0,4.5,1.5} | Iris_versicolor |     2
 Alaska    | 25 | Iris_versicolor      | {5.8,2.7,4.1,1.0} | Iris_versicolor |     2
 Alaska    | 26 | Iris_versicolor      | {6.2,2.2,4.5,1.5} | Iris_versicolor |     2
 Alaska    | 27 | Iris_versicolor      | {5.6,2.5,3.9,1.1} | Iris_versicolor |     2
 Tennessee | 28 | Iris_setosa          | {5.0,3.4,1.5,0.2} | Iris_setosa     |     1
 Tennessee | 29 | Iris_setosa          | {4.4,2.9,1.4,0.2} | Iris_setosa     |     1
 Tennessee | 30 | Iris_setosa          | {4.9,3.1,1.5,0.1} | Iris_setosa     |     1
 Tennessee | 31 | Iris_setosa          | {5.4,3.7,1.5,0.2} | Iris_setosa     |     1
 Tennessee | 32 | Iris_setosa          | {4.8,3.4,1.6,0.2} | Iris_setosa     |     1
 Tennessee | 33 | Iris_setosa          | {4.8,3.0,1.4,0.1} | Iris_setosa     |     1
 Tennessee | 34 | Iris_setosa          | {4.3,3.0,1.1,0.1} | Iris_setosa     |     1
 Tennessee | 35 | Iris_setosa          | {5.8,4.0,1.2,0.2} | Iris_setosa     |     1
 Tennessee | 36 | Iris_setosa          | {5.7,4.4,1.5,0.4} | Iris_setosa     |     1
 Tennessee | 37 | Iris_setosa          | {5.4,3.9,1.3,0.4} | Iris_setosa     |     1
 Tennessee | 38 | Iris_versicolor      | {6.0,2.9,4.5,1.5} | Iris_versicolor |     2
 Tennessee | 39 | Iris_versicolor      | {5.7,2.6,3.5,1.0} | Iris_versicolor |     2
 Tennessee | 40 | Iris_versicolor      | {5.5,2.4,3.8,1.1} | Iris_versicolor |     2
 Tennessee | 41 | Iris_versicolor      | {5.5,2.4,3.7,1.0} | Iris_versicolor |     2
 Tennessee | 42 | Iris_versicolor      | {5.8,2.7,3.9,1.2} | Iris_versicolor |     2
 Tennessee | 43 | Iris_versicolor      | {6.0,2.7,5.1,1.6} | Iris_versicolor |     2
 Tennessee | 44 | Iris_versicolor      | {5.4,3.0,4.5,1.5} | Iris_versicolor |     2
 Tennessee | 45 | Iris_versicolor      | {6.0,3.4,4.5,1.6} | Iris_versicolor |     2
 Tennessee | 46 | Iris_versicolor      | {6.7,3.1,4.7,1.5} | Iris_versicolor |     2
 Tennessee | 47 | Iris_versicolor      | {6.3,2.3,4.4,1.3} | Iris_versicolor |     2
 Tennessee | 48 | Iris_versicolor      | {5.6,3.0,4.1,1.3} | Iris_versicolor |     2
 Tennessee | 49 | Iris_versicolor      | {5.5,2.5,4.0,1.3} | Iris_versicolor |     2
 Tennessee | 50 | Iris_versicolor      | {5.5,2.6,4.4,1.2} | Iris_versicolor |     2
 Tennessee | 51 | Iris_versicolor      | {6.1,3.0,4.6,1.4} | Iris_versicolor |     2
 Tennessee | 52 | Iris_versicolor      | {5.8,2.6,4.0,1.2} | Iris_versicolor |     2
(52 rows)

Regression without Mini-Batching

  1. Create a dataset with housing prices data.

DROP TABLE IF EXISTS lin_housing;

CREATE TABLE lin_housing (id serial, x numeric[], zipcode int, y float8);

INSERT INTO lin_housing(id, x, zipcode, y) VALUES
(1,ARRAY[1,0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98],94016,24.00),
(2,ARRAY[1,0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14],94016,21.60),
(3,ARRAY[1,0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03],94016,34.70),
(4,ARRAY[1,0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94],94016,33.40),
(5,ARRAY[1,0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33],94016,36.20),
(6,ARRAY[1,0.02985,0.00,2.180,0,0.4580,6.4300,58.70,6.0622,3,222.0,18.70,394.12,5.21],94016,28.70),
(7,ARRAY[1,0.08829,12.50,7.870,0,0.5240,6.0120,66.60,5.5605,5,311.0,15.20,395.60,12.43],94016,22.90),
(8,ARRAY[1,0.14455,12.50,7.870,0,0.5240,6.1720,96.10,5.9505,5,311.0,15.20,396.90,19.15],94016,27.10),
(9,ARRAY[1,0.21124,12.50,7.870,0,0.5240,5.6310,100.00,6.0821,5,311.0,15.20,386.63,29.93],94016,16.50),
(10,ARRAY[1,0.17004,12.50,7.870,0,0.5240,6.0040,85.90,6.5921,5,311.0,15.20,386.71,17.10],94016,18.90),
(11,ARRAY[1,0.22489,12.50,7.870,0,0.5240,6.3770,94.30,6.3467,5,311.0,15.20,392.52,20.45],94016,15.00),
(12,ARRAY[1,0.11747,12.50,7.870,0,0.5240,6.0090,82.90,6.2267,5,311.0,15.20,396.90,13.27],20001,18.90),
(13,ARRAY[1,0.09378,12.50,7.870,0,0.5240,5.8890,39.00,5.4509,5,311.0,15.20,390.50,15.71],20001,21.70),
(14,ARRAY[1,0.62976,0.00,8.140,0,0.5380,5.9490,61.80,4.7075,4,307.0,21.00,396.90,8.26],20001,20.40),
(15,ARRAY[1,0.63796,0.00,8.140,0,0.5380,6.0960,84.50,4.4619,4,307.0,21.00,380.02,10.26],20001,18.20),
(16,ARRAY[1,0.62739,0.00,8.140,0,0.5380,5.8340,56.50,4.4986,4,307.0,21.00,395.62,8.47],20001,19.90),
(17,ARRAY[1,1.05393,0.00,8.140,0,0.5380,5.9350,29.30,4.4986,4,307.0,21.00,386.85,6.58],20001,23.10),
(18,ARRAY[1,0.78420,0.00,8.140,0,0.5380,5.9900,81.70,4.2579,4,307.0,21.00,386.75,14.67],20001,17.50),
(19,ARRAY[1,0.80271,0.00,8.140,0,0.5380,5.4560,36.60,3.7965,4,307.0,21.00,288.99,11.69],20001,20.20),
(20,ARRAY[1,0.72580,0.00,8.140,0,0.5380,5.7270,69.50,3.7965,4,307.0,21.00,390.95,11.28],20001,18.20);

  2. Now train a regression model using a multilayer perceptron with two hidden layers of twenty-five units each:

DROP TABLE IF EXISTS mlp_regress, mlp_regress_summary, mlp_regress_standardization;

SELECT setseed(0);

SELECT madlib.mlp_regression(
    'lin_housing',    -- Source table
    'mlp_regress',    -- Destination table
    'x',              -- Input features
    'y',              -- Dependent variable
    ARRAY[25,25],     -- Number of units per layer
    'learning_rate_init=0.001,
    n_iterations=500,
    lambda=0.001,
    tolerance=0',     -- Optimizer params
    'relu',           -- Activation function
    NULL,             -- Default weight (1)
    FALSE,            -- No warm start
    FALSE             -- Not verbose
);

View the model:

\x on

SELECT * FROM mlp_regress;

-[ RECORD 1 ]--+-------------------------------------------------------------------------------------
coeff          | {-0.250057620174,0.0630805938982,-0.290635490112,-0.382966162592,-0.212206338909...
loss           | 1.07042781236
num_iterations | 500

  3. Prediction using the regression model:

DROP TABLE IF EXISTS mlp_regress_prediction;

SELECT madlib.mlp_predict(
         'mlp_regress',               -- Model table
         'lin_housing',               -- Test data table
         'id',                        -- Id column in test table
         'mlp_regress_prediction',    -- Output table for predictions
         'response'                   -- Output values, not probabilities
     );

View results:

SELECT * FROM lin_housing JOIN mlp_regress_prediction USING (id) ORDER BY id;

 id |                                        x                                         | zipcode |  y   |   estimated_y
----+----------------------------------------------------------------------------------+---------+------+------------------
  1 | {1,0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98}   |   94016 |   24 | 23.9989087488259
  2 | {1,0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14}    |   94016 | 21.6 | 21.5983177932005
  3 | {1,0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03}    |   94016 | 34.7 | 34.7102398021623
  4 | {1,0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94}    |   94016 | 33.4 | 33.4221257351015
  5 | {1,0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33}    |   94016 | 36.2 | 36.1523886001663
  6 | {1,0.02985,0.00,2.180,0,0.4580,6.4300,58.70,6.0622,3,222.0,18.70,394.12,5.21}    |   94016 | 28.7 |  28.723894783928
  7 | {1,0.08829,12.50,7.870,0,0.5240,6.0120,66.60,5.5605,5,311.0,15.20,395.60,12.43}  |   94016 | 22.9 | 22.6515242795835
  8 | {1,0.14455,12.50,7.870,0,0.5240,6.1720,96.10,5.9505,5,311.0,15.20,396.90,19.15}  |   94016 | 27.1 | 25.7615314879354
  9 | {1,0.21124,12.50,7.870,0,0.5240,5.6310,100.00,6.0821,5,311.0,15.20,386.63,29.93} |   94016 | 16.5 | 15.7368298351732
 10 | {1,0.17004,12.50,7.870,0,0.5240,6.0040,85.90,6.5921,5,311.0,15.20,386.71,17.10}  |   94016 | 18.9 | 16.8850496141437
 11 | {1,0.22489,12.50,7.870,0,0.5240,6.3770,94.30,6.3467,5,311.0,15.20,392.52,20.45}  |   94016 |   15 | 14.9150416339458
 12 | {1,0.11747,12.50,7.870,0,0.5240,6.0090,82.90,6.2267,5,311.0,15.20,396.90,13.27}  |   20001 | 18.9 | 19.4541629864106
 13 | {1,0.09378,12.50,7.870,0,0.5240,5.8890,39.00,5.4509,5,311.0,15.20,390.50,15.71}  |   20001 | 21.7 |  21.715554997762
 14 | {1,0.62976,0.00,8.140,0,0.5380,5.9490,61.80,4.7075,4,307.0,21.00,396.90,8.26}    |   20001 | 20.4 | 20.3181247234996
 15 | {1,0.63796,0.00,8.140,0,0.5380,6.0960,84.50,4.4619,4,307.0,21.00,380.02,10.26}   |   20001 | 18.2 | 18.5026399122209
 16 | {1,0.62739,0.00,8.140,0,0.5380,5.8340,56.50,4.4986,4,307.0,21.00,395.62,8.47}    |   20001 | 19.9 | 19.9131696333521
 17 | {1,1.05393,0.00,8.140,0,0.5380,5.9350,29.30,4.4986,4,307.0,21.00,386.85,6.58}    |   20001 | 23.1 | 23.1757650468106
 18 | {1,0.78420,0.00,8.140,0,0.5380,5.9900,81.70,4.2579,4,307.0,21.00,386.75,14.67}   |   20001 | 17.5 | 17.2671872543377
 19 | {1,0.80271,0.00,8.140,0,0.5380,5.4560,36.60,3.7965,4,307.0,21.00,288.99,11.69}   |   20001 | 20.2 | 20.1073474558796
 20 | {1,0.72580,0.00,8.140,0,0.5380,5.7270,69.50,3.7965,4,307.0,21.00,390.95,11.28}   |   20001 | 18.2 | 18.2143446340975
(20 rows)

RMS error:

SELECT SQRT(AVG((y-estimated_y)*(y-estimated_y))) AS rms_error FROM lin_housing
JOIN mlp_regress_prediction USING (id);

     rms_error
-------------------
 0.544960829104004
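
Beyond RMS error, a coefficient of determination gives a scale-free view of fit. A hedged sketch over the same join (R^2 = 1 - SS_res / SS_tot):

-- R^2 for the regression predictions
SELECT 1.0 - SUM(POWER(y - estimated_y, 2)) / SUM(POWER(y - avg_y, 2)) AS r_squared
FROM (
    SELECT y, estimated_y, AVG(y) OVER () AS avg_y
    FROM lin_housing
    JOIN mlp_regress_prediction USING (id)
) t;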

Regression with Mini-Batching

  1. Call the mini-batch preprocessor using the same data set as above:

DROP TABLE IF EXISTS lin_housing_packed, lin_housing_packed_summary, lin_housing_packed_standardization;

SELECT madlib.minibatch_preprocessor('lin_housing',         -- Source table
                                     'lin_housing_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x'                    -- Independent variables
                                     );

  2. Train the regression model with mini-batching:

DROP TABLE IF EXISTS mlp_regress, mlp_regress_summary, mlp_regress_standardization;

SELECT setseed(0);

SELECT madlib.mlp_regression(
    'lin_housing_packed',    -- Source table
    'mlp_regress',           -- Destination table
    'independent_varname',   -- Hardcode to this, from table lin_housing_packed
    'dependent_varname',     -- Hardcode to this, from table lin_housing_packed
    ARRAY[25,25],            -- Number of units per layer
    'learning_rate_init=0.01,
    n_iterations=500,
    lambda=0.001,
    tolerance=0',            -- Optimizer params
    'tanh',                  -- Activation function
    NULL,                    -- Default weight (1)
    FALSE,                   -- No warm start
    FALSE                    -- Not verbose
);

View model:

\x on

SELECT * FROM mlp_regress;

-[ RECORD 1 ]--+-------------------------------------------------------------
coeff          | {0.0395865908810001,-0.164860448878703,-0.132787863194324...
loss           | 0.0442383714892138
num_iterations | 500

  3. Prediction for regression:

DROP TABLE IF EXISTS mlp_regress_prediction;

SELECT madlib.mlp_predict(
         'mlp_regress',               -- Model table
         'lin_housing',               -- Test data table
         'id',                        -- Id column in test table
         'mlp_regress_prediction',    -- Output table for predictions
         'response'                   -- Output values, not probabilities
     );

\x off

SELECT *, ABS(y-estimated_y) AS abs_diff FROM lin_housing JOIN mlp_regress_prediction USING (id) ORDER BY id;

 id |                                        x                                         | zipcode |  y   | zipcode |   estimated_y    |      abs_diff
----+----------------------------------------------------------------------------------+---------+------+---------+------------------+--------------------
  1 | {1,0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98}   |   94016 |   24 |   94016 | 23.9714991250013 | 0.0285008749987092
  2 | {1,0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14}    |   94016 | 21.6 |   94016 | 22.3655180133895 |  0.765518013389535
  3 | {1,0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03}    |   94016 | 34.7 |   94016 | 33.8620767428645 |  0.837923257135465
  4 | {1,0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94}    |   94016 | 33.4 |   94016 | 35.3094157686524 |   1.90941576865244
  5 | {1,0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33}    |   94016 | 36.2 |   94016 | 35.0379122731818 |   1.16208772681817
  6 | {1,0.02985,0.00,2.180,0,0.4580,6.4300,58.70,6.0622,3,222.0,18.70,394.12,5.21}    |   94016 | 28.7 |   94016 | 27.5207943492151 |   1.17920565078487
  7 | {1,0.08829,12.50,7.870,0,0.5240,6.0120,66.60,5.5605,5,311.0,15.20,395.60,12.43}  |   94016 | 22.9 |   94016 | 24.9841422781166 |    2.0841422781166
  8 | {1,0.14455,12.50,7.870,0,0.5240,6.1720,96.10,5.9505,5,311.0,15.20,396.90,19.15}  |   94016 | 27.1 |   94016 | 24.5403994064793 |   2.55960059352067
  9 | {1,0.21124,12.50,7.870,0,0.5240,5.6310,100.00,6.0821,5,311.0,15.20,386.63,29.93} |   94016 | 16.5 |   94016 | 17.2588278443879 |   0.75882784438787
 10 | {1,0.17004,12.50,7.870,0,0.5240,6.0040,85.90,6.5921,5,311.0,15.20,386.71,17.10}  |   94016 | 18.9 |   94016 | 17.0600407532569 |    1.8399592467431
 11 | {1,0.22489,12.50,7.870,0,0.5240,6.3770,94.30,6.3467,5,311.0,15.20,392.52,20.45}  |   94016 |   15 |   94016 | 15.2284207930287 |  0.228420793028732
 12 | {1,0.11747,12.50,7.870,0,0.5240,6.0090,82.90,6.2267,5,311.0,15.20,396.90,13.27}  |   20001 | 18.9 |   20001 | 19.2272848285357 |  0.327284828535671
 13 | {1,0.09378,12.50,7.870,0,0.5240,5.8890,39.00,5.4509,5,311.0,15.20,390.50,15.71}  |   20001 | 21.7 |   20001 | 21.3979318641202 |  0.302068135879811
 14 | {1,0.62976,0.00,8.140,0,0.5380,5.9490,61.80,4.7075,4,307.0,21.00,396.90,8.26}    |   20001 | 20.4 |   20001 | 19.7743403979155 |  0.625659602084532
 15 | {1,0.63796,0.00,8.140,0,0.5380,6.0960,84.50,4.4619,4,307.0,21.00,380.02,10.26}   |   20001 | 18.2 |   20001 | 18.7400800902121 |  0.540080090212125
 16 | {1,0.62739,0.00,8.140,0,0.5380,5.8340,56.50,4.4986,4,307.0,21.00,395.62,8.47}    |   20001 | 19.9 |   20001 | 19.6187933144569 |  0.281206685543061
 17 | {1,1.05393,0.00,8.140,0,0.5380,5.9350,29.30,4.4986,4,307.0,21.00,386.85,6.58}    |   20001 | 23.1 |   20001 | 23.3492239648177 |  0.249223964817737
 18 | {1,0.78420,0.00,8.140,0,0.5380,5.9900,81.70,4.2579,4,307.0,21.00,386.75,14.67}   |   20001 | 17.5 |   20001 | 17.0806608347814 |  0.419339165218577
 19 | {1,0.80271,0.00,8.140,0,0.5380,5.4560,36.60,3.7965,4,307.0,21.00,288.99,11.69}   |   20001 | 20.2 |   20001 | 20.1559086626409 |  0.044091337359113
 20 | {1,0.72580,0.00,8.140,0,0.5380,5.7270,69.50,3.7965,4,307.0,21.00,390.95,11.28}   |   20001 | 18.2 |   20001 | 18.6980897920022 |  0.498089792002183
(20 rows)

RMS error:

SELECT SQRT(AVG((y-estimated_y)*(y-estimated_y))) AS rms_error FROM lin_housing
JOIN mlp_regress_prediction USING (id);

     rms_error
-------------------
 0.912158035902468
(1 row)

Regression with Grouping and Mini-Batching

  1. To use grouping and mini-batching, we must first re-run the preprocessor and specify grouping:

DROP TABLE IF EXISTS lin_housing_packed, lin_housing_packed_summary, lin_housing_packed_standardization;

SELECT madlib.minibatch_preprocessor('lin_housing',         -- Source table
                                     'lin_housing_packed',  -- Output table
                                     'y',                   -- Dependent variable
                                     'x',                   -- Independent variables
                                     'zipcode'              -- Group by zipcode
                                     );

  2. Train the regression model, grouping the training data by zipcode to learn a different model for each zipcode.

DROP TABLE IF EXISTS mlp_regress_group, mlp_regress_group_summary, mlp_regress_group_standardization;

-- Set seed so results are reproducible
SELECT setseed(0);

SELECT madlib.mlp_regression(
    'lin_housing_packed',    -- Source table
    'mlp_regress_group',     -- Destination table
    'independent_varname',   -- Input features
    'dependent_varname',     -- Dependent variable
    ARRAY[25,25],            -- Number of units per layer
    'learning_rate_init=0.001,
    n_iterations=500,
    lambda=0.001,
    tolerance=0',            -- Optimizer params
    'relu',                  -- Activation function
    NULL,                    -- Default weight (1)
    FALSE,                   -- No warm start
    FALSE,                   -- Not verbose
    'zipcode'                -- Grouping column
);

View the regression model with grouping:

\x on

SELECT * FROM mlp_regress_group;

-[ RECORD 1 ]--+------------------------------------------------------------------------------------
zipcode        | 20001
coeff          | {-0.193588485849,0.063428493184,-0.30440608833,-0.355695802004,-0.175942716164 ...
loss           | 0.0904009145541
num_iterations | 500
-[ RECORD 2 ]--+------------------------------------------------------------------------------------
zipcode        | 94016
coeff          | {-0.18965351506,0.0633650963628,-0.302423579808,-0.334367637252,-0.230043593847 ...
loss           | 1.04772100552
num_iterations | 500

  3. Prediction using the regression model for each group based on the zipcode:

DROP TABLE IF EXISTS mlp_regress_prediction;

SELECT madlib.mlp_predict(
         'mlp_regress_group',         -- Model table
         'lin_housing',               -- Test data table
         'id',                        -- Id column in test table
         'mlp_regress_prediction',    -- Output table for predictions
         'response'                   -- Output values, not probabilities
     );

\x off

SELECT * FROM lin_housing JOIN mlp_regress_prediction USING (zipcode, id) ORDER BY zipcode, id;

 zipcode | id |                                        x                                         |  y   |   estimated_y
---------+----+----------------------------------------------------------------------------------+------+------------------
   20001 | 12 | {1,0.11747,12.50,7.870,0,0.5240,6.0090,82.90,6.2267,5,311.0,15.20,396.90,13.27}  | 18.9 | 19.2272848285357
   20001 | 13 | {1,0.09378,12.50,7.870,0,0.5240,5.8890,39.00,5.4509,5,311.0,15.20,390.50,15.71}  | 21.7 | 21.3979318641202
   20001 | 14 | {1,0.62976,0.00,8.140,0,0.5380,5.9490,61.80,4.7075,4,307.0,21.00,396.90,8.26}    | 20.4 | 19.7743403979155
   20001 | 15 | {1,0.63796,0.00,8.140,0,0.5380,6.0960,84.50,4.4619,4,307.0,21.00,380.02,10.26}   | 18.2 | 18.7400800902121
   20001 | 16 | {1,0.62739,0.00,8.140,0,0.5380,5.8340,56.50,4.4986,4,307.0,21.00,395.62,8.47}    | 19.9 | 19.6187933144569
   20001 | 17 | {1,1.05393,0.00,8.140,0,0.5380,5.9350,29.30,4.4986,4,307.0,21.00,386.85,6.58}    | 23.1 | 23.3492239648177
   20001 | 18 | {1,0.78420,0.00,8.140,0,0.5380,5.9900,81.70,4.2579,4,307.0,21.00,386.75,14.67}   | 17.5 | 17.0806608347814
   20001 | 19 | {1,0.80271,0.00,8.140,0,0.5380,5.4560,36.60,3.7965,4,307.0,21.00,288.99,11.69}   | 20.2 | 20.1559086626409
   20001 | 20 | {1,0.72580,0.00,8.140,0,0.5380,5.7270,69.50,3.7965,4,307.0,21.00,390.95,11.28}   | 18.2 | 18.6980897920022
   94016 |  1 | {1,0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98}   |   24 | 23.9714991250013
   94016 |  2 | {1,0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14}    | 21.6 | 22.3655180133895
   94016 |  3 | {1,0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03}    | 34.7 | 33.8620767428645
   94016 |  4 | {1,0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94}    | 33.4 | 35.3094157686524
   94016 |  5 | {1,0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33}    | 36.2 | 35.0379122731818
   94016 |  6 | {1,0.02985,0.00,2.180,0,0.4580,6.4300,58.70,6.0622,3,222.0,18.70,394.12,5.21}    | 28.7 | 27.5207943492151
   94016 |  7 | {1,0.08829,12.50,7.870,0,0.5240,6.0120,66.60,5.5605,5,311.0,15.20,395.60,12.43}  | 22.9 | 24.9841422781166
   94016 |  8 | {1,0.14455,12.50,7.870,0,0.5240,6.1720,96.10,5.9505,5,311.0,15.20,396.90,19.15}  | 27.1 | 24.5403994064793
   94016 |  9 | {1,0.21124,12.50,7.870,0,0.5240,5.6310,100.00,6.0821,5,311.0,15.20,386.63,29.93} | 16.5 | 17.2588278443879
   94016 | 10 | {1,0.17004,12.50,7.870,0,0.5240,6.0040,85.90,6.5921,5,311.0,15.20,386.71,17.10}  | 18.9 | 17.0600407532569
   94016 | 11 | {1,0.22489,12.50,7.870,0,0.5240,6.3770,94.30,6.3467,5,311.0,15.20,392.52,20.45}  |   15 | 15.2284207930287
(20 rows)
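
Since a separate model was trained per zipcode, it is natural to evaluate the error per group as well. A hedged sketch over the same join:

-- RMS error per zipcode group
SELECT zipcode,
       SQRT(AVG((y - estimated_y) * (y - estimated_y))) AS rms_error
FROM lin_housing
JOIN mlp_regress_prediction USING (zipcode, id)
GROUP BY zipcode
ORDER BY zipcode;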

Note that the results you get for all examples may vary with the database you are using.

Technical Background

To train a neural net, the loss function is minimized using stochastic gradient descent. For classification, the loss function is cross entropy; for regression, mean squared error is used. Weights in the neural net are updated via back-propagation, which uses dynamic programming to compute the partial derivative of the overall loss with respect to each weight. Because this computation incorporates the activation function, the activation function must be differentiable.
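
For reference, these are the standard definitions (not specific to MADlib) of the two losses and the SGD weight update that back-propagation makes possible:

\[
L_{\mathrm{CE}} = -\sum_{i}\sum_{k} y_{ik}\,\log \hat{y}_{ik},
\qquad
L_{\mathrm{MSE}} = \frac{1}{n}\sum_{i}\left(y_i - \hat{y}_i\right)^2,
\qquad
w \leftarrow w - \eta\,\frac{\partial L}{\partial w},
\]

where \(\hat{y}\) is the network output, \(\eta\) is the learning rate (learning_rate_init in the calls above), and back-propagation supplies the gradients \(\partial L / \partial w\).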


By Hassan Amin

Dr. Syed Hassan Amin holds a Ph.D. in Computer Science from Imperial College London, United Kingdom, and an MS in Computer System Engineering from GIKI, Pakistan. During his PhD he worked on image processing, computer vision, and machine learning. He has done research and development in many areas, including Urdu and local-language optical character recognition, retail analysis, affiliate marketing, fraud prediction, 3D reconstruction of faces from 2D images, and retinal image analysis.