Let’s Play with R & Amazon Redshift - Part 1: Basic Connection

POSTED IN: Data Analytics & Visualization Blog

Started with a Quick Look through Documentation - high points below

Let’s Play with R & Amazon Redshift

Amazon Redshift Management Guide: http://docs.aws.amazon.com/redshift/latest/mgmt/redshift-mgmt.pdf

An Amazon Redshift cluster consists of nodes. There are two types of nodes: dense storage and dense compute. The dense storage node types are recommended for substantial data storage needs, while dense compute node types are optimized for performance-intensive workloads. Each cluster has a leader node and a compute node. If you have a single-node cluster, the node is shared for leader and compute node functionality. If you have a multi-node cluster, the leader node is separate from the compute node or nodes. To prevent connection issues between SQL client tools and the Amazon Redshift database, we recommend disabling TCP/IP jumbo frames by setting the maximum transmission unit (MTU) to 1500 on the network interface (NIC) of your Amazon EC2 instances. You can shut down your cluster if you want to stop it from running and incurring charges. When you shut it down, you can optionally create a final snapshot. If you create a final snapshot, Amazon Redshift will create a manual snapshot of your cluster before shutting it down.You can later restore that snapshot if you want to resume running the cluster and querying data.


Database Ports

Type a port number through which you will connect from client applications to the database.The port number must be included in the connection string when opening JDBC or ODBC connections to the databases in the cluster.  The port number must meet the following conditions:

• It must contain only numeric characters.

• It must fall in the range of 1150 to 65535. The default port is 5439.

• It must specify an open port that accepts inbound connections, if you are behind a firewall.

ODBC (and thus RODBC ) is supported


After creating an Amazon Redshift cluster, you can create, manage, and delete databases in the cluster by using client applications or tools that execute SQL statements through the PostgreSQL ODBC or JDBC drivers.

• For information about installing client SQL tools and connecting to a cluster, see Connecting to a Cluster (p. 138). • For information about designing databases and the SQL statements supported by Amazon Redshift,go to the Amazon Redshift Database Developer Guide.



 mazon Redshift QUERY API — is a Amazon Redshift management API you can call by submitting a Query request. Query requests are HTTP or HTTPS requests that use the HTTP verbs GET or POST with a query parameter named Action. Calling the Query API is the most direct way to access the Amazon Redshift service, but requires that your application handle low-level details such as error handling and generating a hash to sign the request. 

• For information about building and signing a Query API request, see Signing an HTTP Request (p. 131).

• For information about the Query API actions and data types for Amazon Redshift, go to the Amazon

Redshift API Reference.

AWS SDKs — Amazon Web Services provides Software Development Kits (SDKs) that you can use to perform Amazon Redshift cluster-related operations. Several of the SDK libraries wrap the underlying Amazon Redshift Query API.They integrate the API functionality into the specific programming language and handle many of the low-level details, such as calculating signatures, handling request retries, and error handling. Calling the wrapper functions in the SDK libraries can greatly simplify the process of writing an application to manage an Amazon Redshift cluster. 

• Amazon Redshift is supported by the AWS SDKs for Java, .NET, PHP, Python, Ruby, and Node.js.

The wrapper functions for Amazon Redshift are documented in the reference manual for each SDK. For a list of the AWS SDKs and links to their documentation, go to Tools for Amazon Web Services.

• This guide provides examples of working with Amazon Redshift using the Java SDK. For more general AWS SDK code examples, go to Sample Code & Libraries.


Amazon Redshift requires that every request you send to the management API be authenticated with a

signature. This topic explains how to sign your requests.  If you are using one of the AWS Software Development Kits (SDKs) or the AWS Command Line Interface, request signing is handled automatically, and you can skip this section. For more information about using AWS SDKs, see Using the Amazon Redshift Management Interfaces (p. 128). For more information about using the Amazon Redshift Command Line Interface, go to Amazon Redshift Command Line Reference. To sign a request, you calculate a digital signature by using a cryptographic hash function.

You can use a GET or POST request to send requests to Amazon Redshift. The difference between the two is that for the GET request your parameters are sent as query string parameters. For the POST request they are included in the body of the request. To work with data in your cluster, you need JDBC or ODBC drivers for connectivity from your client computer or instance. Code your applications to use JDBC or ODBC data access APIs, and use SQL client tools that support either JDBC or ODBC.

Amazon Redshift: Database Developer Guide


Getting Started – The Getting Started Guide includes an example that walks you through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.


Works: http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-windows.html

Connect Without Using the Server Certificate with ODBC

If you want to connect to your cluster without using the Amazon Redshift server certificate, you can configure your ODBC DSN to use one of the following SSL modes: allowprefer, or require. With these settings, the connection will use SSL but will not verify the server certificate.


### BASIC CONNECTIVITY TESTING - R STUDIO (Windows Deployment) via RODBC to Amazon Redshift
# Ryan Anderson, June 2015
# Need to install this on Windows first (ODBC COnnection)
# http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-windows.html  # Then TEST it from Microsoft 
# For reference - # http://stackoverflow.com/questions/25783571/writing-data-from-an-r-data-set-into-redshift (careful, author had performance issues with this method


# what ODBC handles are options to open up a connection channel?  need to pre-configure and TEST OK your AMazon 
odbcDataSources(type = c("all", "user", "system"))  ## show what ODBC Options are on system - should see Amazon

getwd()  #where are we
password <- read.table(file="private.txt", header=FALSE) # where I'm holding pw outside public code , for now
password <- paste(password[1,1],sep="")  # ugly - but masks my password in public code (can also use registry, may update later)

channel <- odbcConnect("AWS_hydrogen_source", uid = "master_user", pwd = password)
channel # works!  if a positive integer, you are connected

odbcTables(channel, catalog = NULL, schema = NULL, tableName = NULL, tableType = NULL, literal = FALSE)


df <- data.frame(open=rnorm(50), low=rnorm(50), high=rnorm(50), close=rnorm(50))
#careful about putting everything in lower case - LOWER CASE AWS (no "Low" must be "low")
sqlSave(channel,df,"test_table", rownames=F)  # 50 rows about half a minute - 5k Long time
test_data <- sqlQuery(channel,"select * from test_table where close > '0'") # reading is fast. subset
sqlDrop(channel, "test_table", errors = FALSE) # clean up our toys

#### iris test

colnames(iris) <- tolower(colnames(iris)) # I think AWS does not like caps in column names
sqlSave(channel,iris,"iris", rownames=F) ## SLOOOOOOW!  SO SLOW! Must be a better way 150 ~1.5 minutes
iris_results <- sqlQuery(channel,"select * from iris where species = 'virginica'") # fast subset. this does work and shows up on AWS Redshift Dashboard
sqlFetch(channel, "iris", max = 5)
sqlColumns(channel, "iris")

## clean up our toys
sqlDrop(channel, "iris", errors = FALSE) # clean up our toys

################## END OF CODE ###

## needs work or not sure what these do yet
odbcFetchRows(channel, max = 0, buffsize = 1000, nullstring = NA_character_, believeNRows = TRUE)
## not working
sqlUpdate(channel, iris_results, "iris")   


updates here - https://github.com/rustyoldrake/RODBC_AWS_Redshift



Interested in more content by this author?

About the Author

Ryan Anderson

Ryan Anderson

Hi! I like to play with data, analytics and hack around with robots and gadgets in my garage. Lately I've been learning about machine learning.

About this blog

Description is...<br/>Data Analytics & Visualization Blog - Generating insights from Data since 2013

Created: July 25, 2014


Up Next