Spread the love

Introduction

Greenplum follows a master/slave architecture, where data is distributed on segment nodes. Master node is responsible for executing user queries on segment nodes, and getting the results back to end users.

In order for this architecture to deliver optimal performance, architects need to ensure couple of things :-

1). System has adequate network bandwidth

2). Segment nodes have adequate memory, and processing power.

3). Data is optimally distributed on segment nodes for the fastest processing and loading

In this article, we are concerned with the optimization of query performance through optimal partitioning and distribution of data on segment nodes.

High Level Architecture for Greenplum

Data Distribution

Our goal is to make sure that data is equally distributed on segment nodes, so that execution of queries can be divided on different segment nodes.

It must be noted, that in many cases we may very a large number of segment nodes so it may not be necessary to distribute data on all the nodes.

Let’s now learn, how to distribute the data, and check data distribution on segment nodes.

 

Create Table with Distributed by Clause

CREATE TABLE products

(name varchar(40), prod_id integer, supplier_id integer)

DISTRIBUTED BY (prod_id);

 

Checking Data Distribution on Segments

SELECT gp_segment_id, COUNT(*) FROM faa.otp_c

GROUP BY gp_segment_id

ORDER BY gp_segment_id;

 

Data Partitioning

With large tables having millions or even billions of records, it is suicidal for database performance to go through the whole of the table while doing SELECT operations. To avoid doing that, you need to select optimal partitioning to eliminate un-needed partitions for a given select query. Ideally, the partitioning strategy will choose criteria that will use a column being used in most of the SELECT queries, where clause.

There are two main partitioning strategies:-

1). Range Partitioning

2). List Partitioning

Let’s go through a few examples showing how it’s done.

Create Table with Range Partition

CREATE TABLE sales (id int, date date, amt decimal(10,2))

DISTRIBUTED BY (id)

PARTITION BY RANGE (date)

( START (date ‘2008-01-01’) INCLUSIVE

END (date ‘2009-01-01’) EXCLUSIVE

EVERY (INTERVAL ‘1 day’) );

Create Table with Range Partition(Manually Specified Partitions)

CREATE TABLE sales (id int, date date, amt decimal(10,2))

DISTRIBUTED BY (id)

PARTITION BY RANGE (date)

( PARTITION Jan08 START (date ‘2008-01-01’) INCLUSIVE ,

PARTITION Feb08 START (date ‘2008-02-01’) INCLUSIVE ,

PARTITION Mar08 START (date ‘2008-03-01’) INCLUSIVE ,

PARTITION Apr08 START (date ‘2008-04-01’) INCLUSIVE ,

PARTITION May08 START (date ‘2008-05-01’) INCLUSIVE ,

PARTITION Jun08 START (date ‘2008-06-01’) INCLUSIVE ,

PARTITION Jul08 START (date ‘2008-07-01’) INCLUSIVE ,

PARTITION Aug08 START (date ‘2008-08-01’) INCLUSIVE ,

PARTITION Sep08 START (date ‘2008-09-01’) INCLUSIVE ,

PARTITION Oct08 START (date ‘2008-10-01’) INCLUSIVE ,

PARTITION Nov08 START (date ‘2008-11-01’) INCLUSIVE ,

PARTITION Dec08 START (date ‘2008-12-01’) INCLUSIVE

END (date ‘2009-01-01’) EXCLUSIVE );

Defining Numeric Range Table Partitions

CREATE TABLE rank(id int, rank int, year int, gender char(1), count int)

DISTRIBUTED BY (id)

PARTITION BY RANGE (year)

( START (2001) END (2008) EVERY (1),

DEFAULT PARTITION extra );

Defining List Table Partitions

CREATE TABLE rank (id int, rank int, year int, gender char(1), count int )

DISTRIBUTED BY (id)

PARTITION BY LIST (gender)

( PARTITION girls VALUES (‘F’),

PARTITION boys VALUES (‘M’),

DEFAULT PARTITION other );

CREATE TABLE bar (bid integer, bloodtype text, bdate date)

DISTRIBUTED by (bid)

PARTITION BY LIST(bloodtype)

(PARTITION a values(‘A+’, ‘A’, ‘A-‘),

PARTITION b values (‘B+’, ‘B-‘, ‘B’),

PARTITION ab values (‘AB+’, ‘AB-‘, ‘AB’),

PARTITION o values (‘O+’, ‘O-‘, ‘O’)

DEFAULT PARTITION unknown);

 

Range Partitioning
CREATE TABLE sales (id int, date date, amt decimal(10,2))

DISTRIBUTED BY (id)

PARTITION BY RANGE (date)

( START (date ‘2008-01-01’) INCLUSIVE

END (date ‘2009-01-01’) EXCLUSIVE

EVERY (INTERVAL ‘1 day’) );

Partitioning an Existing Table

It is not possible to partition a table that has already been created. Tables can only be partitioned at CREATE TABLE time. If you have an existing table that you want to partition, you must recreate the table as a partitioned table, reload the data into the newly partitioned table, drop the original table and rename the partitioned table to the original name. You must also regrant any table permissions.

For example:

CREATE TABLE sales2 (LIKE sales)

PARTITION BY RANGE (date)

( START (date ‘2008-01-01’) INCLUSIVE

END (date ‘2009-01-01’) EXCLUSIVE

EVERY (INTERVAL ‘1 month’) );

INSERT INTO sales2 SELECT * FROM sales;

DROP TABLE sales;

ALTER TABLE sales2 RENAME TO sales;

GRANT ALL PRIVILEGES ON sales TO admin;

GRANT SELECT ON sales TO guest;

Viewing Your Partition Design
You can look up information about your partition design using the pg_partitions view. For example, to see the partition design of the sales table:

SELECT partitionboundary, partitiontablename, partitionname,

partitionlevel, partitionrank

FROM pg_partitions

WHERE tablename=’sales’;

The following table and views show information about partitioned tables.

pg_partition – Tracks partitioned tables and their inheritance level relationships.
pg_partition_templates – Shows the subpartitions created using a subpartition template.
pg_partition_columns – Shows the partition key columns used in a partition design.

Compression Example

CREATE TABLE foobar (fid INT, ft TEXT) WITH (compresstype = quicklz, orientation=column, appendonly=true) DISTRIBUTED by (fid);