Getting Started with Snowflake ❄️

Exploring the hype and delving into the free trial of Snowflake’s Cloud Data Platform

Emile Gill
Towards Data Science


Image by Gerd Altmann from Pixabay

Whether you’ve heard of Snowflake Inc. from its record-breaking IPO this September, or from its ever-growing presence in data job ads, there is definitely a lot of hype around this company at the moment. In this article, we explore why, and how we can get started with its free trial.

What is Snowflake?

A data warehouse is essential for any company that wants to be data-driven, and all companies should aspire to be data-driven; having business data at the fingertips of analysts and decision-makers has never been more important. That said, it is no easy feat to architect an on-premise data warehouse.

Requiring heavy IT support to set up and maintain the necessary hardware and software, on-premise data warehouses have traditionally only been viable for large companies, and have often proved a source of friction between IT and analytics teams. This is humorously demonstrated in one of Snowflake’s ads.

Snowflake elegantly solves the problem of data warehousing. A cloud-based solution, Snowflake requires none of the heavy infrastructure set-up or up-front costs associated with traditional on-premise data warehouses. It also allows users to scale up and down easily, whilst only paying for the storage and compute that they use. This makes it a great option for startups, where raising the up-front costs needed for a traditional data warehouse may not be feasible, as well as for companies moving up the data hierarchy of needs, which need to start small but scale up as they progress towards a more data-driven culture.

Why Snowflake?

Image by Natalie White from Pixabay

Its unique selling point comes in its layered architecture. At the base we have the “Database Storage” layer: this is where Snowflake stores and efficiently organises data loaded into the platform. Above this sits the “Compute” layer, where our virtual warehouses live. Each of these is an independent compute cluster with access to the storage layer. Being independent, these virtual warehouses do not compete for the same resources; hence, we can scale the computational power we use to query our storage layer indefinitely. On top of all of this sits Snowflake’s “Cloud Services” layer. This is the layer we directly interact with; it coordinates the way we work with Snowflake’s underlying architecture, all through the universal language of SQL.

The separation of these layers, and the ability to scale each independently, is what marks Snowflake out from the crowd, giving organisations the flexibility to pay only for the storage and compute they need, and to scale indefinitely as those needs grow.

Getting Started — Signing up for a free trial

Photo by Matt Duncan on Unsplash

One of the great things about cloud tech is that there’s almost always a free trial you can try out without having to pay anything upfront. This makes learning the platform accessible not just to those privileged enough to work for a company willing to try out new tech or invest in learning and development, but also to individuals who are curious and trying to upskill themselves for the job market.

In a recent article, I detailed how one could set up a cloud PostgreSQL instance on AWS, Google Cloud Platform, or Heroku. Similarly, here, we will see that Snowflake offers a generous free-trial option that will allow us to explore what the platform has to offer.

To get started, we simply need to register here. Upon registration, an email will shortly arrive in our inbox with a link to our newly created Snowflake instance, for which we can set a username and password. From this point onwards we have $400 of Snowflake credits at our disposal to get a feel for the platform and explore the power of the features it has to offer.

Exploring the Web UI

Let’s begin by taking a look at the user interface of our newly set up instance. At the top of the console, we can see a taskbar comprising ‘Databases’, ‘Shares’, ‘Data Marketplace’, ‘Warehouses’, ‘Worksheets’ and ‘History’. We’ll go through these one by one to get a feel for some of the fundamental concepts and features of Snowflake.

Databases

Here we can see a list of all the databases we have access to. We can also use the user interface to create, clone, drop or transfer ownership of a database. You will see that in our free trial we have already been provided with several databases. The database SNOWFLAKE_SAMPLE_DATA is of particular use, as this is pre-loaded with several datasets that we can immediately get started with.

Clicking on a database, we can see the Tables, Views, Schemas, Stages, File Formats, Sequences and Pipes it comprises. The first three of these will be familiar to anyone with a working knowledge of SQL; the latter four are powerful additions that help us load data into Snowflake, some of which we will explore later.

Shares

The “Shares” tab demonstrates another feature unique to Snowflake, allowing us to easily share Tables and Views between different accounts. Owing to its layered architecture, data stored on a provider’s account may be queried by a consumer without the need for any replication or transfer of data. The top-level “Cloud Services” layer simply manages access to the stored data, such that the consumer incurs charges only for the compute resources they use to query the data, and no additional storage costs.

This all sounds simple, but is quite revolutionary when it comes to sharing data between third-party vendors, or between independent business functions within an organisation. It eliminates the need for any archaic form of data transfer like SFTP, or any replication of the data. The services layer from the consumer account simply reaches over to the provider’s storage layer and can see the data as it is, using its own compute resources to perform any queries.
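Under the hood, sharing is driven by ordinary SQL on the provider’s side. As a rough sketch of how a provider might expose a table (the share name, object names and account identifiers below are all illustrative placeholders):

-- provider side: create a share and grant access to the objects to be exposed
create share my_share;
grant usage on database my_db to share my_share;
grant usage on schema my_db.my_schema to share my_share;
grant select on table my_db.my_schema.my_table to share my_share;
alter share my_share add accounts = <consumer_account_locator>;

-- consumer side: mount the share as a read-only database and query it directly
create database shared_db from share <provider_account>.my_share;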

Data Marketplace

As well as the direct sharing of data from providers to consumers we saw in the previous section, Snowflake also offers a “Data Marketplace” feature. This is an “app store” of data, if you will. Various companies offer datasets, some available instantly for us to import and query, others being “Personalised” options which can be requested and purchased from the company under specific commercial terms agreed with the consumer. Again, it is something simple, yet entirely unique to Snowflake, and it has created a whole industry of companies providing and purchasing data in a seamless manner.

Warehouses

As discussed previously, “virtual warehouses” provide the computational element of the Snowflake platform, powering the queries and data loading tasks we perform. In the “Warehouses” tab we see a list of the warehouses we have available, as well as, analogous to what we saw in the “Databases” tab, the ability to create, drop and transfer ownership of warehouses.

We also have the option to configure the properties of a warehouse and to suspend or resume it. Remember, each warehouse is an independent compute cluster with a specified size. This size determines the computational power we have available to run a query, but also the number of credits that we will be billed. We can provision compute resources to meet our individual needs and suspend or resume these as required so that we are not billed while they are not in use. The “Auto Suspend” property of a warehouse means that we do not have to suspend and resume manually; after a specified period of inactivity, the warehouse will be suspended automatically, resuming once a new query is executed.
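The same controls are also available through SQL. As a quick sketch, assuming a warehouse named test_wh (the illustrative name we will use again when provisioning our own warehouse later):

-- suspend a warehouse manually so it stops consuming credits
alter warehouse test_wh suspend;

-- resume it when compute is next needed
alter warehouse test_wh resume;

-- or let Snowflake suspend it automatically after 3 minutes (180 seconds) of inactivity
alter warehouse test_wh set auto_suspend = 180;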

Worksheets

The “Worksheets” tab provides a neat editor where we can write and execute SQL commands. The sidebar on the left visualises database objects: databases, schemas, tables and views. The tabbed layout at the top of the page allows us to create multiple “worksheets”, which are saved by Snowflake, so we don’t need to worry about losing our work. We can also load a locally stored .sql script into a worksheet. We see the results of any queries we execute at the bottom of the UI, and have the option of exporting this data to a CSV, or even copying it to our clipboard.

History

Here we can see all queries that have been executed on our Snowflake instance, going back 14 days. We can also see various metadata about these executions, including the execution status, the warehouse used and the duration the command took to run. We can even see the results of queries run in the last 24 hours. This is useful if we want to retrieve information about queries that we have run, or to check up on the status of any queries that might currently be executing.
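The same information can also be pulled back with SQL via the QUERY_HISTORY table function; a minimal sketch might look like this:

-- most recent queries run on the account, newest first
select query_text, warehouse_name, execution_status, total_elapsed_time
from table(information_schema.query_history(result_limit => 100))
order by start_time desc;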

Fundamentals of Snowflake

Photo by Aaron Burden on Unsplash

Now that we have a feel for the concepts behind the Snowflake platform and the user interface that it provides, let’s consider a simple use case. Say we have a CSV dataset that we want to interrogate in Snowflake; how would we go about this task?

Connecting to Snowflake

When it comes to connecting to Snowflake, we have several options to choose from.

We have seen that the web UI provides us with a “Worksheets” tab, where we can execute SQL commands, and that we can use the UI itself to create objects. In addition to this, we can also connect to Snowflake using the SnowSQL command-line interface, or through its various connectors for common languages and frameworks. Here we will stay language agnostic by exploring the SnowSQL CLI.

The installation process for the SnowSQL CLI is covered here in the official documentation. Once it has been installed, connecting is easy. We simply provide our connection parameters (the full list of which may be found here) and we’re in:

snowsql -a <account-name> -u <username>

Creating Database Objects

Before we load any data, we must first set up the required database objects in Snowflake to house our CSV data, as well as provision the resources required for the load process.

Let’s first provision the compute resources we will use for loading our data. To do this we need to create a new “virtual warehouse”, specifying the properties we want it to have. In this case, since our task will require minimal compute power, we provision an “X-SMALL” warehouse, which will only consume 1 Snowflake credit per hour. We will also set an auto-suspend of 3 minutes, so that our warehouse will automatically shut down after 3 minutes of inactivity, conserving the credits we use.
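A statement along the following lines does the job (the warehouse name test_wh is just an illustrative choice):

create warehouse test_wh
  warehouse_size = 'X-SMALL' -- smallest size: 1 credit per hour while running
  auto_suspend = 180         -- suspend after 180 seconds of inactivity
  auto_resume = true;        -- wake up automatically when the next query arrives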

Next, we will need a database to house the tables and views for our project. We can do this simply by running:
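For example (test_db is the database name that the staging commands later in this post assume):

create database test_db;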

Now that we have our warehouse and database set up, we can check the context we are working in, with:
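One simple way is to select Snowflake’s context functions:

-- if the session is not already pointing at our new objects, set the context first:
-- use warehouse test_wh;
-- use database test_db;
select current_warehouse(), current_database(), current_schema();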

If we correctly executed the above commands, then this should return the following:

Checking current context

All that remains is to create a table and then we can go ahead and start learning how to load in our data. For comparison, I’ll use the same data as in my A Practical Guide to Getting Set-up with PostgreSQL post, which used the Covid-19 Geographic Case Distribution dataset provided by the European Centre for Disease Prevention and Control. You can get the latest version of this dataset here as a CSV.
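A create table statement along the following lines will do. The column names and types below are only an illustrative sketch, so check them against the headers of the CSV you download; the load we perform later maps CSV columns to table columns by position:

create table worldwide_cases (
    date_reported varchar, -- kept as text to avoid date-format surprises on load
    day integer,
    month integer,
    year integer,
    cases integer,
    deaths integer,
    country varchar,
    geo_id varchar,
    country_code varchar,
    population integer,
    continent varchar
);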

Loading Data

When it comes to loading data into Snowflake, we must first create a “Stage”. This is simply a reference to a cloud storage location from which Snowflake can pull our data. There are two forms of data stage that we might use: external stages, which reference cloud storage outside of Snowflake (i.e. Amazon S3, Google Cloud Storage or Microsoft Azure), and internal stages, which are cloud storage hosted within the Snowflake platform.

To create an external stage we would, in keeping with the above syntax for creating database objects, simply need to run:
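For example, for an S3 bucket (the stage name, bucket path and credentials below are placeholders to be filled in with your own values):

create stage my_s3_stage
  url = 's3://<bucket>/<path>/'
  credentials = (aws_key_id = '<your_aws_key_id>' aws_secret_key = '<your_aws_secret_key>');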

The parameters will vary slightly depending on the cloud service provider (AWS, GCP, Azure) we are connecting to. The example I have given here is for AWS, but the documentation here provides examples for all three, as well as the full list of optional parameters that can be supplied when creating a stage.

For the purposes of this tutorial, though, I will assume that we don’t already have access to cloud storage, and will leverage the internal data storage provided by Snowflake. Creating an internal stage is even simpler: running the above without specifying a URL or credentials parameter will create our stage within Snowflake:
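In our case, the following is enough (test_stage being the name the put command below refers to):

create stage test_stage;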

Once we have our internal Snowflake stage set up, we can load in our CSV using the SnowSQL CLI put command from our local machine.

put file://C:\Users\emile\OneDrive\Documents\GitHub\snowflake-basics\ecdc_covid_cases.csv @test_db.public.test_stage;

If all went well, our file will now be stored in our Snowflake cloud storage, ready to be loaded into our table. We can quickly check the contents of a Snowflake data stage by running the command:
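For our internal stage, that is simply:

list @test_stage;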

This will return a list of the files stored in our stage. In our case, we should see only one file, named ecdc_covid_cases.csv.gz. The .gz suffix shows that Snowflake has automatically compressed our file, reducing the cloud storage required to host it.

Now that we have our data staged in the cloud, loading it into our target table couldn’t be easier. We simply run a COPY INTO command, specifying our table and the data stage we wish to load from.
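A command along the following lines matches the file-format options described below; adjust them to suit your own CSV:

copy into worldwide_cases
from @test_stage
file_format = (
    type = csv
    skip_header = 1                        -- skip the header row
    field_optionally_enclosed_by = '"'     -- ignore commas inside double-quoted strings
    error_on_column_count_mismatch = false -- don't fail if the column counts differ
);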

As well as defining the table we seek to populate and the stage to gather data from, we also specify the FILE_FORMAT parameter. In this case, it lets Snowflake know that we are loading CSV data, that we want to skip the first row (which contains headers), to ignore missing columns, and to ignore commas contained within strings enclosed by double quotes. There are many other options we could have specified here, and Snowflake even allows us to execute a create file format command to create a named format, should we want to persist this for later use. In this case, though, specifying it inline in our COPY INTO command will suffice.
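For reference, a named format covering the same options might look like this (my_csv_format is an illustrative name):

create file format my_csv_format
    type = csv
    skip_header = 1
    field_optionally_enclosed_by = '"'
    error_on_column_count_mismatch = false;

-- which can then be referenced by name when copying:
copy into worldwide_cases
from @test_stage
file_format = (format_name = 'my_csv_format');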

Following the execution of this command, we will be greeted with a line confirming the success of our execution. If not, we will see a red error message, suggesting we need to tweak our load command to ensure that the file format is correctly specified and that the data corresponds to the table we configured previously. Assuming our command has indeed executed correctly, we can now test the data we have loaded by querying the first few lines:

select * from worldwide_cases
limit 5;

The query should return the first 5 rows of our dataset, indicating that our data has loaded correctly. We can now query away to our heart’s content!

Conclusion

Photo by Alessandro Erbetta on Unsplash

We now have a basic feel for the fundamentals of the Snowflake Data Platform: we can create database objects, stage and load data, and query it, both from the web UI and the SnowSQL command-line interface. We have also explored why, in my opinion at least, Snowflake is worthy of the hype that surrounds it.

In my next post on Snowflake, I hope to take this further and explore a more complex use case where we need to continuously load data into a Snowflake database from cloud storage. But for now…

Thanks for reading!
