The value of a relevant comment

Each of us has a problem we have struggled with long and unsuccessfully, one we have studied inside and out yet still cannot crack. Meanwhile, some people are certain the problem is fundamentally unsolvable, and others insist there is no problem at all. For me, that problem was the duration of queries to our Greenplum data warehouse. And how satisfying it is to finally write that I beat it!

When trying to understand why requests take so long

When working with the warehouse, I regularly had to wait several minutes for a query result. And it made no difference whether the query selected data from a table of ten rows or of a million: the waiting time was about the same. Delays like that turn a creative process like analytics into pure torture!

I made many attempts to fix this behavior. I started by tuning the SQL client: DBeaver offers users a whole pile of tweaks, and I think I tried them all, without success. Then I tried the settings of the PostgreSQL driver itself; the effect was the same.

All this time I had before my eyes the example of a specialized Greenplum client, Aginity Workbench, which runs similar queries noticeably faster out of the box. The program was an extra irritant, because I could not find the reason for the speedup: its source code is closed, it works only under a license, and it connects to Greenplum with native tools rather than the usual JDBC driver.

The insight came when I looked into the pg_stat_activity system view. It turned out that each request for data from the client actually sends not one query to the database but two: first the data itself is fetched, and then the metadata for that result set is read from pg_catalog. Moreover, reading the metadata takes the larger share of the total execution time.
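The observation is easy to reproduce by watching pg_stat_activity while a client query is in flight. Below is a minimal JDBC sketch; the class name and connection handling are mine, the snippet assumes a live connection, and note that older Greenplum versions (based on PostgreSQL 8.x) expose procpid/current_query instead of pid/query:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ActiveQueries {
    // The view that exposed the second, hidden query. Column names are the
    // PostgreSQL 9.2+ / Greenplum 6 ones.
    static final String SQL =
        "SELECT pid, state, query FROM pg_stat_activity WHERE state <> 'idle'";

    // Prints every non-idle statement, making both the data fetch and the
    // follow-up pg_catalog metadata query visible side by side.
    static void dump(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(SQL)) {
            while (rs.next()) {
                System.out.printf("%d | %s | %s%n",
                        rs.getLong("pid"), rs.getString("state"), rs.getString("query"));
            }
        }
    }
}
```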

At that moment I felt a whole storm of emotions, from joy at finally finding the problem to bewilderment at the nature of the find. I already knew the catalog was under a decent load: more than 250 source systems are now connected to the warehouse, each with more than a hundred tables, and the tables themselves can have two-level partitioning. Greenplum counts each partition as a separate entity, so system views like pg_class bloat considerably. But do you really need to hit the catalog for every data request?

Fortunately, the PostgreSQL JDBC driver is open source, so you can see for yourself what happens under the hood. I looked into the code and found the class I needed, PgResultSetMetaData, and its method that requests the metadata, fetchFieldMetaData. The contents are encouraging: it looks like the metadata can be dispensed with. fetchFieldMetaData is called from a handful of other methods that are not important for my purposes, and those methods themselves allow returning an empty result, which hints that all calls to fetchFieldMetaData can be cut out of the code entirely, avoiding the slow catalog queries.

More about the methods:
  • isAutoIncrement. This is the easiest method to strip of its metadata request: there are no such fields in our warehouse, and the driver comment "It is believed that PostgreSQL does not support this feature" hints that it will not be useful for other databases either.

      public boolean isAutoIncrement(int column) throws SQLException {
        fetchFieldMetaData();
        Field field = getField(column);
        FieldMetadata metadata = field.getMetadata();
        return metadata != null && metadata.autoIncrement;
      }
  • isNullable. In the absence of metadata in the result set, all fields will be considered nullable, which is not a big loss.

      public int isNullable(int column) throws SQLException {
        fetchFieldMetaData();
        Field field = getField(column);
        FieldMetadata metadata = field.getMetadata();
        return metadata == null ? ResultSetMetaData.columnNullable : metadata.nullable;
      }
  • getBaseColumnName. In the driver code this method is used by updatable result sets when values are updated. I have no plans to use that functionality, but just in case, such methods can be wired to throw an exception when called.

      public String getBaseColumnName(int column) throws SQLException {
        Field field = getField(column);
        if (field.getTableOid() == 0) {
          return "";
        }
        fetchFieldMetaData();
        FieldMetadata metadata = field.getMetadata();
        return metadata == null ? "" : metadata.columnName;
      }
  • getBaseSchemaName and getBaseTableName. These return the name of the schema / table containing the requested attribute. The methods are not used anywhere else in the driver code, and I do not really need this information either.

      public String getBaseSchemaName(int column) throws SQLException {
        fetchFieldMetaData();
        Field field = getField(column);
        FieldMetadata metadata = field.getMetadata();
        return metadata == null ? "" : metadata.schemaName;
      }
    
      public String getBaseTableName(int column) throws SQLException {
        fetchFieldMetaData();
        Field field = getField(column);
        FieldMetadata metadata = field.getMetadata();
        return metadata == null ? "" : metadata.tableName;
      }
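Putting the pieces together, the patch boils down to replacing each metadata-dependent body with the default it already falls back to when metadata is absent. A minimal sketch of the idea (the class name is mine, the signatures are simplified, and this illustrates the approach rather than reproducing the actual pgjdbc source):

```java
import java.sql.ResultSetMetaData;

// Sketch: with fetchFieldMetaData() no longer called, each accessor simply
// returns the "no metadata" default that the original code already allowed.
public class NoCatalogMetaData {

    public boolean isAutoIncrement(int column) {
        // No catalog round-trip: report "not auto-increment" for every column.
        return false;
    }

    public int isNullable(int column) {
        // Without metadata, every column is reported as nullable.
        return ResultSetMetaData.columnNullable;
    }

    public String getBaseColumnName(int column) {
        // Updatable result sets are not used here; an empty name suffices.
        return "";
    }
}
```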

I check the impact on the code, comment out the calls to fetchFieldMetaData, build the driver jar, connect to the database with it, aaand…

To say that it worked is an understatement. Queries simply fly now; the speedup is severalfold. I sent the driver to my colleagues for testing, and the feedback was enthusiastic.

An added bonus that warms the heart: by eliminating the unnecessary metadata queries against pg_class, we take needless load off the catalog. I am not sure this has any significant effect for the warehouse as a whole, but here, as in the fight for the environment, it is pleasing to improve the situation even a little.

In general, do not give up if a problem resists for a long time: most likely the answer is somewhere nearby, and only some trifle separates you from success. In my case, that trifle was a pair of comment characters placed in the right spots.

P.S. In the final version of the code I added a new driver connection parameter, runtimeMetaDisable, and tied the metadata calls and the exception throwing to its value. This approach is more flexible than hard-coding the method bodies and lets you control the driver's behavior to suit your needs. I posted the code on github… If your database has a heavy catalog and you want to try the driver in action, but you are not familiar with the Java world and do not know how to build the driver jar, write in the comments!
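For completeness, here is how such a parameter could be passed from client code. This is a sketch under the assumption that the patched jar is on the classpath; the user, password, and URL are placeholders:

```java
import java.util.Properties;

public class DisabledMetaConnect {
    // Builds connection properties with the custom flag; "runtimeMetaDisable"
    // is the parameter added in the patched driver described in the post.
    static Properties props() {
        Properties p = new Properties();
        p.setProperty("user", "analyst");            // placeholder credentials
        p.setProperty("password", "secret");
        p.setProperty("runtimeMetaDisable", "true"); // skip pg_catalog metadata fetches
        return p;
    }

    // With the patched jar on the classpath, connect as usual:
    // Connection c = DriverManager.getConnection(
    //         "jdbc:postgresql://gp-host:5432/dwh", props());
}
```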

