Connecting External Tables to Greenplum
How to easily create a connection between tables located in different DBMS and speed up work with them.
Surely each of us has come across the fact that database queries are executed for an extremely long time, and no one wants to wait for the result for 15-20 minutes. Yes, of course, there are many determining factors that affect the request time: the amount of data requested, the structure of the request, server load, etc. For example, in Apache Hive, the average waiting time for the result of a query to the database is measured in tens of seconds, and sometimes minutes.
From the very beginning of working with the Greenplum DBMS, the high speed of working with data was noted; accordingly, the question arose of connecting external Hive tables for more productive work.
Next, I will tell you how I managed to solve the problem of working with tables from Apache Hive in the Greenplum DBMS. This method has been successfully tested and is used in DBeaver.
First of all, in order to clearly show the difference in the speed of query execution, let’s compare the speed of the two systems on the same queries created simultaneously. Queries are generated according to a table of 176 million rows and a selection condition is added for a specific value of one of the fields:
Request execution speed, sec
As you can see, the difference is more than significant.
Now let’s describe the mechanism for connecting tables from Apache Hive.
You can read the subscription data in the GP through an external table. To create an external table:
· Determine the name of the schema and table with subscription data in Hive (for proxied and materialized subscriptions). The schema can be found in Hue by the system subscription name provided during checkout.
· Get a list of fields (and their types) of a table materialized on a DDA cluster.
Given the need to cast the Apache Hive and Greenplum data types according to the schema:
Field type in Hive
Field type in GreenPlum
1) write a procedure that connects the source table to the DBMS and brings the data types in line with the DBMS:
CREATE OR REPLACE FUNCTION s_grnplm_ld_audit_da_sandbox_oaop.create_external_table( in schema_name text, in table_name text, in text_in text, out text_out text ) AS $$ begin text_in := replace(text_in, '`', ''); text_in := replace(text_in , concat(schema_name, '.'), ''); schema_name := replace(schema_name, 'prx_', ''); schema_name := replace(schema_name, 'cap_', ''); text_in := replace(text_in, ' tinyint', ' int2'); text_in := replace(text_in, '''', ''); text_in := replace(text_in, ' timestamp', ' timestamp'); text_in := replace(text_in, ' string', ' text'); text_in := replace(text_in, ' smallint', ' int2'); text_in := replace(text_in, ' float', ' float4'); text_in := replace(text_in, ' double', ' float8'); text_in := replace(text_in, ' boolean', ' bool'); text_in := replace(text_in, ' binary', ' bytea'); text_in := replace(text_in, ' bigint', ' int8'); text_in := replace(text_in, ' decimal', ' numeric'); text_in := replace(text_in, 'CREATE EXTERNAL TABLE ', 'CREATE EXTERNAL TABLE s_grnplm_ld_audit_da_sandbox_oaop.'); text_in := concat(text_in, 'location'); text_in := replace(text_in, '<схема>', schema_name); text_in := replace(text_in, '<таблица>', table_name); text_out:= text_in; execute text_out; RETURN; END $$ LANGUAGE plpgsql;
2) Create a connection to an external table:
select s_grnplm_ld_audit_da_sandbox_oaop.create_external_table( '<схема>', '<таблица>', '<show create table .....>');
For a successful connection, you need to get the list of columns and data types from the source table:
show create table ...
And further from the query result we are interested in:
At this stage, you can already work with data in the DBMS, they do not take up server disk space, but are still processed by Apache Hive.
The speed of query execution from a non-materialized database can be compared with previously obtained values:
Request execution speed, sec.
Greenplum (materialized table)
Greenplum (immaterialized table)
To create an independent database, it is necessary to materialize the data already in the DBMS environment:
3) Create a table for data materialization:
create table s_grnplm_ld_audit_da_sandbox_oaop.<таблица>_m (like s_grnplm_ld_audit_da_sandbox_oaop.<таблица>) distributed randomly
4) Copy the data to the table:
insert into s_grnplm_ld_audit_da_sandbox_oaop.<таблица>_m select * from s_grnplm_ld_audit_da_sandbox_oaop.<таблица> where <при необходимости>;
Thus, I showed a working solution to reduce the processing time of queries to the database through the use of the Greenplum DBMS using the table materialization tool.