Amazon Redshift Data Warehouse
- Step 1: Create an AWS Account
- Step 2: Set Up Billing Alarms
- Step 3: Create a Redshift Cluster
- Step 4: Update VPC for External Access
- Step 5: Create an IAM User
- Step 6: Configure DBeaver
- Step 7: Load Data into S3
- Step 8: Load Data into Redshift
Step 1: Create an AWS Account
- Go to aws.amazon.com and create an account.
- Follow through with the payment information and identity verification.
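Some of the later steps also use AWS CLI commands alongside the console instructions. If you want to follow those, the sketch below shows one way to get the CLI ready; it assumes you have pip available and that you will paste in the access keys created in Step 5.
# Install and configure the AWS CLI (the access key pair is created in Step 5)
pip install awscli            # or use the installer from the AWS documentation
aws configure                 # prompts for access key ID, secret access key, default region, and output format
aws sts get-caller-identity   # confirms the CLI can reach your account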
Step 2: Set Up Billing Alarms
- Navigate to My Billing Dashboard from the top menu.
- Click on AWS Budgets.
- Click on Create a budget.
- Choose your budget type. For this example, we’ll select Cost budget.
- Fill in the details at the Set your budget step. This is where you can specify the budget period (monthly or annually), and the budget amount.
- When you’ve set your budget, scroll to the bottom and click Configure Alerts.
- At the Configure alerts screen, you can choose to send alerts based on actual costs or forecasted costs.
- We’ll set ours to Actual Costs, with an alert threshold of 80%.
- Type in the email address(es) you would like alerts to be sent to.
- Scroll to the bottom and click Confirm budget.
- Review the details on the screen, then click Create.
- You should now see a new entry in the AWS Budgets screen.
- We’ve only created one alarm at $10, but you’ll probably want to create others to alert you when you go over different limits, e.g. one at $10, one at $50, and one at $100, depending on your estimated usage and budget.
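If you prefer the command line, the same budget and alert can be created through the AWS Budgets API. The sketch below is illustrative only; the budget name, amount, threshold, and email address are placeholder values, and it assumes the AWS CLI is already configured.
# Placeholder budget name, amount, and email address; adjust to your own limits
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws budgets create-budget \
  --account-id "$ACCOUNT_ID" \
  --budget '{"BudgetName":"monthly-cost-budget","BudgetLimit":{"Amount":"10","Unit":"USD"},"BudgetType":"COST","TimeUnit":"MONTHLY"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'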
Step 3: Create a Redshift Cluster
- Drop down the Services menu from the top of the page, and open the Amazon Redshift service
- Click on Quick launch cluster
- Take a note of the region you’re currently set to from the top right corner. We’ll use US East (Ohio).
- Increase the number of nodes to 4
- Provide a value for the Cluster identifier field (following conventions*)
- Database port can be left as the default of 5439
- Provide a user name
- Provide a password, then confirm your password
- Click Launch cluster (charges will begin once you click this button)
*lowercase string, words separated by hyphens.
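For reference, an equivalent cluster can also be launched from the AWS CLI. This is a sketch only; the cluster identifier, node type, user name, and password below are example values, so substitute your own, and remember that charges start as soon as the cluster is created.
# Example values only; charges begin as soon as this command succeeds
aws redshift create-cluster \
  --cluster-identifier my-redshift-cluster \
  --node-type dc2.large \
  --cluster-type multi-node \
  --number-of-nodes 4 \
  --db-name dev \
  --port 5439 \
  --master-username awsuser \
  --master-user-password 'ChangeMe123' \
  --region us-east-2
# Wait until the cluster reports an available status
aws redshift wait cluster-available --cluster-identifier my-redshift-cluster --region us-east-2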
Step 4: Update VPC for External Access
- From the AWS Management Console, search for the VPC service
- From the VPC dashboard, select Security Groups on the left
- Select the security group attached to your Redshift cluster, then, to grant the DBeaver application access, open the Inbound Rules tab and click Edit rules
- Click Add Rule and leave the type as Custom TCP Rule
- Set the port to be 5439
- Set the source value to be your IPv4 address, followed by /32
- Save the rule
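The same inbound rule can be added from the CLI. The security group ID and IP address below are placeholders; use the group attached to your cluster and your own public IPv4 address.
# Placeholder security group ID and IP address; substitute your own values
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5439 \
  --cidr 203.0.113.25/32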
Step 5: Create an IAM User
- From the AWS Management Console, search for the IAM service
- Select Users from the menu on the left, then click Add user
- Provide a user name
- For the access type, select Programmatic access
- Click Next: Permissions
- Click on Attach existing policies directly
- Search for redshift. Click on the AmazonRedshiftFullAccess policy name
- Click Next: Tags
- Click Next: Review
- Review your selections, then click Create User
- Below the Success message, click on Download.csv
- This downloads your credentials, including access key ID and secret access key. Store this somewhere safe
- Return to your Amazon Redshift service dashboard
- Your cluster should now be available
- Opening your cluster reveals three dropdown menus: Cluster, Database, and Backup
- Clusters cannot be stopped and resumed. When you’ve finished your work, or want to stop being charged, you’ll need to:
1. Create a snapshot of the cluster
2. Delete the cluster
3. Create a new cluster using the snapshot
Command to generate a final snapshot and then delete the cluster:
aws redshift delete-cluster --cluster-identifier <your-cluster-identifier> --final-cluster-snapshot-identifier <your-snapshot-identifier>
Command to restore a cluster from the snapshot:
aws redshift restore-from-cluster-snapshot --cluster-identifier <your-cluster-identifier> --snapshot-identifier <your-snapshot-identifier>
Snapshots incur S3 data storage charges, but they are considerably cheaper than keeping the cluster up.
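Putting the two commands together, a typical pause/resume cycle looks something like the sketch below; the cluster and snapshot identifiers are placeholders.
# Placeholder identifiers; substitute your own cluster and snapshot names
aws redshift delete-cluster --cluster-identifier my-redshift-cluster --final-cluster-snapshot-identifier my-final-snapshot
aws redshift wait cluster-deleted --cluster-identifier my-redshift-cluster
# Later, when you want to resume work:
aws redshift restore-from-cluster-snapshot --cluster-identifier my-redshift-cluster --snapshot-identifier my-final-snapshot
aws redshift wait cluster-available --cluster-identifier my-redshift-cluster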
Step 6: Configure DBeaver
- Download DBeaver from their website
- Install and open DBeaver
- Click on the Databases menu and select New Database Connection
- Click on All then select Redshift
- Click Next
- Click Edit Driver Settings
- Click Download/Update to download the latest driver
- You should now see the .jar file under the Libraries tab
- Click OK
- On the General tab of the Connect to database window, fill in the following (all of these values can be pulled from the configuration page of your cluster in AWS; see the CLI sketch at the end of this step for another way to retrieve them):
Host: the cluster endpoint, i.e. the section of the JDBC URL between // and the : that precedes the port
Port: 5439
Database: the database name shown on the cluster configuration page (dev by default)
User: the master user name
Password: the master user password
- Click Test connection to ensure it’s successful
- Click Finish
- Run the following from a query window to check the connection:
select version();
- This will show the version of Redshift you’re running, together with the Postgres version
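If you’d rather not hunt through the console for the connection details, they can also be read from the CLI; the cluster identifier below is a placeholder.
# Placeholder cluster identifier; returns the endpoint address, port, database name, and master user
aws redshift describe-clusters \
  --cluster-identifier my-redshift-cluster \
  --query 'Clusters[0].[Endpoint.Address,Endpoint.Port,DBName,MasterUsername]' \
  --output table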
Step 7: Load Data into S3
- From AWS, search for the S3 service
- Click on Create bucket
- Provide a globally unique name for your bucket
- Set the region to be the same as your Redshift cluster
- In Step 3: Create a Redshift Cluster, we used US East (Ohio)
- Click Next
- For this guide, we won’t enable versioning or logging. Click Next
- At the Set permissions screen, select Block all public access
- Click Next
- Click Create bucket
- At the S3 buckets screen, click on the bucket you just created
- Click Create folder
- Give your folder a name, then click Save
- Download the AWS Sample Files
- This will trigger the download of files for creating the database schema
- From your Amazon S3 bucket folder, click Upload
- Drag the files from the unzipped folder into the Upload window
- Click Upload
- Each object is uploaded with a storage class of STANDARD
- You can find out more about what each storage class means in the Amazon S3 documentation
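The bucket creation and upload can also be scripted. In the sketch below the bucket name, folder name, and local path are placeholders (bucket names must be globally unique), and the region matches the cluster region from Step 3.
# Placeholder bucket, folder, and local path; region matches the cluster from Step 3
aws s3 mb s3://my-redshift-sample-data --region us-east-2
aws s3 cp ./unzipped-sample-files s3://my-redshift-sample-data/sample-data/ --recursive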
Step 8: Load Data into Redshift
Schema creation
- Run the following SQL from DBeaver to return a list of table names in Redshift:
select distinct(tablename) from pg_table_def where schemaname = 'public';
- Run the following SQL scripts to create our schema:
CREATE TABLE part (
  p_partkey INTEGER NOT NULL,
  p_name VARCHAR(22) NOT NULL,
  p_mfgr VARCHAR(6),
  p_category VARCHAR(7) NOT NULL,
  p_brand1 VARCHAR(9) NOT NULL,
  p_color VARCHAR(11) NOT NULL,
  p_type VARCHAR(25) NOT NULL,
  p_size INTEGER NOT NULL,
  p_container VARCHAR(10) NOT NULL
);
CREATE TABLE supplier (
  s_suppkey INTEGER NOT NULL,
  s_name VARCHAR(25) NOT NULL,
  s_address VARCHAR(25) NOT NULL,
  s_city VARCHAR(10) NOT NULL,
  s_nation VARCHAR(15) NOT NULL,
  s_region VARCHAR(12) NOT NULL,
  s_phone VARCHAR(15) NOT NULL
);
CREATE TABLE customer (
  c_custkey INTEGER NOT NULL,
  c_name VARCHAR(25) NOT NULL,
  c_address VARCHAR(25) NOT NULL,
  c_city VARCHAR(10) NOT NULL,
  c_nation VARCHAR(15) NOT NULL,
  c_region VARCHAR(12) NOT NULL,
  c_phone VARCHAR(15) NOT NULL,
  c_mktsegment VARCHAR(10) NOT NULL
);
CREATE TABLE dwdate (
  d_datekey INTEGER NOT NULL,
  d_date VARCHAR(19) NOT NULL,
  d_dayofweek VARCHAR(10) NOT NULL,
  d_month VARCHAR(10) NOT NULL,
  d_year INTEGER NOT NULL,
  d_yearmonthnum INTEGER NOT NULL,
  d_yearmonth VARCHAR(8) NOT NULL,
  d_daynuminweek INTEGER NOT NULL,
  d_daynuminmonth INTEGER NOT NULL,
  d_daynuminyear INTEGER NOT NULL,
  d_monthnuminyear INTEGER NOT NULL,
  d_weeknuminyear INTEGER NOT NULL,
  d_sellingseason VARCHAR(13) NOT NULL,
  d_lastdayinweekfl VARCHAR(1) NOT NULL,
  d_lastdayinmonthfl VARCHAR(1) NOT NULL,
  d_holidayfl VARCHAR(1) NOT NULL,
  d_weekdayfl VARCHAR(1) NOT NULL
);
CREATE TABLE lineorder (
  lo_orderkey INTEGER NOT NULL,
  lo_linenumber INTEGER NOT NULL,
  lo_custkey INTEGER NOT NULL,
  lo_partkey INTEGER NOT NULL,
  lo_suppkey INTEGER NOT NULL,
  lo_orderdate INTEGER NOT NULL,
  lo_orderpriority VARCHAR(15) NOT NULL,
  lo_shippriority VARCHAR(1) NOT NULL,
  lo_quantity INTEGER NOT NULL,
  lo_extendedprice INTEGER NOT NULL,
  lo_ordertotalprice INTEGER NOT NULL,
  lo_discount INTEGER NOT NULL,
  lo_revenue INTEGER NOT NULL,
  lo_supplycost INTEGER NOT NULL,
  lo_tax INTEGER NOT NULL,
  lo_commitdate INTEGER NOT NULL,
  lo_shipmode VARCHAR(10) NOT NULL
);
- Now run the table check SQL again, and we should see some results:
select distinct(tablename) from pg_table_def where schemaname = 'public';
Data load
- From the AWS IAM service, click on Users and open the user you created
- You should already have the AmazonRedshiftFullAccess policy
- Click Add permissions, then Attach existing policies directly
- Select the AmazonS3FullAccess policy, then click Next: Review
- Click Add permissions
- We can now load data from S3 into our Redshift schema
- Open the credentials.csv file downloaded earlier
- In the commands below, replace [ACCESS-KEY-ID] with your access key ID
- Replace [SECRET-ACCESS-KEY] with your AWS secret key
- Load dwdate:
copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate' credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' gzip compupdate off region 'us-west-2';
- Load customer:
copy customer from 's3://awssampledbuswest2/ssbgz/customer' credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' gzip compupdate off region 'us-west-2';
- Load lineorder:
copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder' credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' gzip compupdate off region 'us-west-2';
- Load part:
copy part from 's3://awssampledbuswest2/ssbgz/part' credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' gzip compupdate off region 'us-west-2';
- Load supplier:
copy supplier from 's3://awssampledbuswest2/ssbgz/supplier' credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' gzip compupdate off region 'us-west-2';
- You can check the tables are now populated using the following SQL:
select count(*) from LINEORDER;
select count(*) from PART;
select count(*) from CUSTOMER;
select count(*) from SUPPLIER;
select count(*) from DWDATE;
This should return the following results:
Table      Rows
LINEORDER  600,037,902
PART       1,400,000
CUSTOMER   3,000,000
SUPPLIER   1,000,000
DWDATE     2,556
- To confirm the load happened, navigate to the Loads tab of your Redshift cluster
- If the load fails, you can find out why it failed by running the SQL below:
select query, substring(filename,22,25) as filename, line_number as line,
       substring(colname,0,12) as column, type, position as pos,
       substring(raw_line,0,30) as line_text, substring(raw_field_value,0,15) as field_text,
       substring(err_reason,0,45) as reason
from stl_load_errors
order by query desc
limit 10;
- If you encounter any issues during the data loads, the best place to look for a solution is the official AWS documentation
Other useful SQL scripts:
-- View the last commit
select query, trim(filename) as file, curtime as updated
from stl_load_commits
order by updated desc;

-- Check data distribution across slices; each slice should hold approximately the same number of records
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name='customer' and col=0
order by slice,col;

-- Check for disk spills
select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = [YOUR-QUERY-ID]
order by workmem desc;

-- Check distkey, sortkey, and encoding for a given table
select "column", type, encoding, distkey, sortkey, "notnull"
from pg_table_def
where tablename = 'lineorder';