Curious and non-obvious features when working with Snowflake

No long introductions, straight to the point.

Did you know that in Snowflake it is possible to create objects with empty names? For example:

CREATE DATABASE "";
CREATE SCHEMA ""."";
CREATE TABLE ""."".""("" NUMBER);

This still works at the time of publication and can cause plenty of problems for external systems that do not expect such a twist. It also never fails to delight DWH administrators.

More interesting and practical tips below.

Free automatic clustering

Many people know that you can specify keys for automatic data clustering on a table:

CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] );

This allows Snowflake to store data in micro-partitions more efficiently and to speed up reads significantly when a suitable filter is present in the query.

But few people know that almost the same result can be achieved “for free” if you only load and delete data in the table in small blocks that are organized strictly by the desired key.

For example, instead of loading the entire table in one big transaction:

COPY INTO my_events FROM 's3://my_events/*/*.csv'

It is better to divide the incoming data into partitions and perform several small transactions, one for each date:

COPY INTO my_events FROM 's3://my_events/2022-01-01/*.csv'
COPY INTO my_events FROM 's3://my_events/2022-01-02/*.csv'
COPY INTO my_events FROM 's3://my_events/2022-01-03/*.csv'

In this case, you end up with a naturally clustered table without spending a single penny on rewriting micro-partitions. Depending on the amount of data in your account, this simple technique can save you hundreds, thousands, or even tens of thousands of dollars.

You can check the result using the SYSTEM$CLUSTERING_INFORMATION function, as well as in the query execution profile by comparing “Partitions scanned” with “Partitions total”. The fewer partitions a query reads, the better.
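For illustration, here is a minimal sketch of such a check using the Snowflake Python Connector. The table name my_events comes from the example above; the event_date clustering expression and the connection parameters are assumptions:

import json
import snowflake.connector

# placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_db",
    schema="my_schema",
)

cur = conn.cursor()
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('my_events', '(event_date)')")

# the function returns a JSON document describing the micro-partition layout
info = json.loads(cur.fetchone()[0])

# average_depth close to 1 means micro-partitions barely overlap on the key
print(info["cluster_by_keys"], info["average_depth"])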

Invalid views

In some cases, VIEW objects can break and stop working even if their SQL text is perfectly correct. Unfortunately, when TABLE objects change, Snowflake does not update the VIEWs that depend on them.

In practice, this means that users get errors that can only be fixed by manually re-creating the VIEW. For example:

CREATE TABLE my_table(id NUMBER);
CREATE OR REPLACE VIEW my_view AS SELECT * FROM my_table;

SELECT * FROM my_view;
-- it works

So far so good. Now let’s add another column to the table and break the VIEW:

ALTER TABLE my_table ADD name VARCHAR;

SELECT * FROM my_view;
-- it fails: view declared 1 column(s), but view query produces 2 column(s)

Re-creating the VIEW fixes the situation:

CREATE OR REPLACE VIEW my_view AS SELECT * FROM my_table;

SELECT * FROM my_view;
-- it works again

The natural question is: how do you find such VIEWs and automate their re-creation at minimal cost?

Through trial and error, we found an approach that requires neither INFORMATION_SCHEMA, nor an active WAREHOUSE, nor elaborate stored procedures. The idea is to check every VIEW with the .describe() command of the Snowflake Python Connector.

It lets you compile a query without actually executing it. If a VIEW object is broken, the command raises an exception that can be handled by re-creating the VIEW. All of these operations are “free”.
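Here is a minimal sketch of such a check, assuming snowflake-connector-python 2.4.6 or newer (where describe() is available) and a hypothetical hard-coded list of views; in practice the list could come from SHOW VIEWS:

import snowflake.connector
from snowflake.connector.errors import ProgrammingError

# placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)

cur = conn.cursor()

# hypothetical list of views to verify
views_to_check = ["my_db.my_schema.my_view"]

for view in views_to_check:
    try:
        # describe() compiles the query and returns column metadata
        # without executing it, so no warehouse time is consumed
        cur.describe(f"SELECT * FROM {view}")
    except ProgrammingError as err:
        print(f"{view} is invalid: {err}")
        # re-create the view here from its DDL stored in Git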

Role hierarchy

The Snowflake documentation mentions the benefits of building a well-designed role hierarchy for access control, but it does not give concrete examples. Because of this, many novice administrators do not pay enough attention to it at the start, make many mistakes, and quickly get confused as complexity grows.

Below, I will briefly describe a concrete example of a three-tier role hierarchy that has shown excellent results in practice.

  • Tier 1: roles that grant privileges on specific objects via GRANTS and FUTURE GRANTS.

    For example: “the role allows you to read all tables in schema XXX”, “the role allows you to use warehouse YYY”, “the role gives write access to tables ZZ1 and ZZ2”.

  • Tier 2: roles that combine one or more T1 roles into a business function.

    For example: “role of financial analyst”, “role of BI developer”, “role of external auditor”.

  • Tier 3: roles that combine one or more T2 roles and are assigned to a specific user.

    For example: “Alice is a business analyst”, “Bob is an external auditor”, “Victor is a business analyst in the AAA project, but also an administrator in the BBB project”.

Most of the roles in this system are easy to create automatically.

For example, you can create separate T1 roles for each schema, with FUTURE GRANTS covering:

  • ownership of all objects in the schema (OWNERSHIP);

  • reading all objects in the schema (READ);

  • writing to all objects in the schema (WRITE);

Additionally, you can create a T1 role for every WAREHOUSE with USAGE and OPERATE rights.

Finally, T3 roles can be created and assigned to each user automatically. The only thing left to configure manually is the T2 business roles and their assignment to specific users.
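As an illustration, here is a hedged sketch of generating such roles with the Snowflake Python Connector. All database, schema, warehouse, role, and user names are hypothetical, and a real script would iterate over SHOW SCHEMAS / SHOW WAREHOUSES / SHOW USERS instead of hard-coding them:

import snowflake.connector

# placeholder connection parameters; role creation and grants
# are typically performed under SECURITYADMIN
conn = snowflake.connector.connect(
    account="my_account",
    user="my_admin",
    password="my_password",
    role="SECURITYADMIN",
)
cur = conn.cursor()

# Tier 1: privileges on specific objects, one role per schema per access level
cur.execute("CREATE ROLE IF NOT EXISTS my_db__my_schema__read")
cur.execute("GRANT USAGE ON DATABASE my_db TO ROLE my_db__my_schema__read")
cur.execute("GRANT USAGE ON SCHEMA my_db.my_schema TO ROLE my_db__my_schema__read")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA my_db.my_schema TO ROLE my_db__my_schema__read")
cur.execute("GRANT SELECT ON FUTURE TABLES IN SCHEMA my_db.my_schema TO ROLE my_db__my_schema__read")

# Tier 1: warehouse usage
cur.execute("CREATE ROLE IF NOT EXISTS wh__analytics__usage")
cur.execute("GRANT USAGE, OPERATE ON WAREHOUSE analytics TO ROLE wh__analytics__usage")

# Tier 2: a business function combining T1 roles (configured manually)
cur.execute("CREATE ROLE IF NOT EXISTS business__financial_analyst")
cur.execute("GRANT ROLE my_db__my_schema__read TO ROLE business__financial_analyst")
cur.execute("GRANT ROLE wh__analytics__usage TO ROLE business__financial_analyst")

# Tier 3: a personal role that collects T2 roles and is granted to one user
cur.execute("CREATE ROLE IF NOT EXISTS user__alice")
cur.execute("GRANT ROLE business__financial_analyst TO ROLE user__alice")
cur.execute("GRANT ROLE user__alice TO USER alice")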

The time spent managing such a system is close to zero. It also makes auditors and security officers very happy, since they do not have to untangle a complex web of hacks and workarounds.

Package management for Java & Python UDFs

Snowflake is currently actively developing the concept of UDF functions, which allow you to execute almost arbitrary code written in Java or Python right inside a WAREHOUSE, without moving the data out.

This is potentially a very powerful tool, but using it raises a number of practical questions, one of which is: how do you manage packages (.JAR, .WHL)?

One simple and effective solution is to treat packages as just another type of object inside Snowflake, one that depends on a STAGE and is updated together with the FUNCTION within the same CI/CD process.

For example, if you store object definitions in Git, the order of applying them would be as follows:

  1. Create DATABASE.

  2. Create SCHEMA.

  3. Create STAGE (internal) for packages.

  4. Upload .JAR / .WHL files to STAGE.

  5. Create the FUNCTION, which depends on the uploaded files.

In other words, package management should happen not “before”, not “after”, and not “manually on the side”, but strictly between the creation of the STAGE and the creation of the FUNCTION. In that case everything works without errors.
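Here is a minimal sketch of this sequence driven by the Snowflake Python Connector. The database, stage, package, and function names are placeholders; the example uses a Python UDF with its code packaged into a .zip, but the same order applies to .JAR files and Java UDFs:

import snowflake.connector

# placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    role="SYSADMIN",
)
cur = conn.cursor()

# 1-3. Database, schema and an internal stage for packages
cur.execute("CREATE DATABASE IF NOT EXISTS my_db")
cur.execute("CREATE SCHEMA IF NOT EXISTS my_db.my_schema")
cur.execute("CREATE STAGE IF NOT EXISTS my_db.my_schema.udf_packages")

# 4. Upload the package; PUT is only available via connectors / SnowSQL
cur.execute(
    "PUT file:///builds/my_package.zip "
    "@my_db.my_schema.udf_packages AUTO_COMPRESS = FALSE OVERWRITE = TRUE"
)

# 5. Create the function that imports the uploaded package
cur.execute("""
    CREATE OR REPLACE FUNCTION my_db.my_schema.my_udf(x VARCHAR)
    RETURNS VARCHAR
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.8'
    HANDLER = 'my_package.handler'
    IMPORTS = ('@my_db.my_schema.udf_packages/my_package.zip')
""")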

This process has one technical subtlety: how do you know whether the files already in the STAGE need to be updated? Snowflake has a standard LIST command that returns the list of files in a STAGE along with their MD5 sums.

But the problem is that these MD5 sums are calculated not from the original file but from the encrypted file, which makes them unsuitable for comparison. To keep the original MD5, you can additionally upload to the STAGE empty marker files that carry the hash in their name.

Or, if there are not many packages yet, they can simply be re-uploaded in full on every CI/CD run.
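Here is a hedged sketch of the marker-file approach; the stage, path, and file names are hypothetical:

import hashlib
import pathlib
import snowflake.connector

# placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    database="my_db",
    schema="my_schema",
)
cur = conn.cursor()

package = pathlib.Path("/builds/my_package.zip")
local_md5 = hashlib.md5(package.read_bytes()).hexdigest()
marker_name = f"{package.name}.{local_md5}.md5"

# LIST shows the stage contents, but its MD5 column is computed from the
# encrypted files, so we rely on our own marker files instead
staged_files = {row[0] for row in cur.execute("LIST @udf_packages").fetchall()}

if not any(name.endswith(marker_name) for name in staged_files):
    # the package has changed (or was never uploaded): upload it again
    cur.execute(
        f"PUT file://{package} @udf_packages AUTO_COMPRESS = FALSE OVERWRITE = TRUE"
    )
    # upload a tiny marker file whose name carries the original MD5
    marker = package.parent / marker_name
    marker.write_text(local_md5)
    cur.execute(
        f"PUT file://{marker} @udf_packages AUTO_COMPRESS = FALSE OVERWRITE = TRUE"
    )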

SnowDDL: an open source tool for object schema manipulation

When I first started working with Snowflake, I was surprised by the lack of full-fledged declarative tools for managing the object schema. The most popular options (schemachange and Terraform) have significant gaps, and even the vendor itself recommends using them together, which is quite inconvenient.

To improve this situation a little, I created and open-sourced a new tool, SnowDDL (docs). Its main features:

  1. No “state”.

  2. Ability to roll back changes.

  3. Support for ALTER COLUMN where possible.

  4. A built-in role hierarchy, as described earlier in this article.

  5. Invalid VIEWs are re-created automatically.

  6. Simpler code review, thanks to splitting operations into nominally “dangerous” and nominally “safe”.

  7. Ability to create multiple isolated “environments” for each developer within one Snowflake account.

  8. Simple and explicit dependency management between objects.

  9. Package management for UDFs, as described earlier in this article.

In order not to make this article too long, I’ll stop there for now.

If readers are interested enough, I’ll be happy to cover the implementation details of SnowDDL, as well as other non-obvious features when working with Snowflake, in subsequent issues.

Enjoy!
