The Cloud data warehouse space is crowded with tough competition these days. Every major cloud platform comes with its own data warehouse solution and its own set of rich features that grow more impressive every year.
The last two years have been busy for the Amazon Redshift team; they’ve worked hard to compete against the likes of Snowflake and Google BigQuery for their place in the Cloud data warehouse arena. It’s time for a rundown of the top five reasons Redshift is no longer simply a fast columnar-store database. It’s packing a lot of features worth a closer look.
Ad-Hoc ETL (Redshift Spectrum)
The “Data Lake” is now commonplace among organizations: they treat their data as a business asset and collect it as safely as possible every chance they get. However, it doesn’t always make sense to transform that data to fit a columnar-store database, or to take on the associated ETL maintenance overhead.
Enter Redshift Spectrum. You can now query files in your S3 data lake and join them directly to tables in your Redshift cluster without writing any ETL at all. This is ideal for less frequently accessed records you may not want to keep in your cluster, or for iterating on a table design against real data without having to load it into your database first.
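As a rough sketch of what this looks like in practice, the snippet below uses the Redshift Data API via boto3 to expose a Glue Data Catalog database as an external schema and then join a Parquet-backed data lake table to a table already loaded in the cluster. The cluster identifier, IAM role, schema, and table names are placeholders for illustration.

```python
import boto3

# Redshift Data API client; all identifiers below are placeholders.
client = boto3.client("redshift-data")

# One-time setup: expose the Glue Data Catalog database that describes
# the files in S3 as an external schema inside the cluster.
create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# Ad hoc query: join Parquet files in the lake (spectrum.click_events)
# to a dimension table already loaded in the cluster (public.customers).
join_query = """
SELECT c.customer_name, COUNT(*) AS clicks
FROM spectrum.click_events e
JOIN public.customers c ON c.customer_id = e.customer_id
WHERE e.event_date >= '2020-01-01'
GROUP BY c.customer_name;
"""

for sql in (create_schema, join_query):
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```

Note that the Data API runs statements asynchronously, so a real script would poll describe_statement for completion before fetching results.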
Federated Querying
Often another team has already invested heavily in curating data stored in an external relational database and has established credibility with stakeholders as the company’s source of truth. Duplicating this data into your Redshift cluster is certainly one way to leverage its value to the business, but it means yet another pipeline with its own maintenance and uptime concerns.
Redshift Federated Querying allows you to join to this external data without any pipeline at all. Similar in concept to Redshift Spectrum, a federated query reads data directly from an external database (such as Amazon RDS or Aurora PostgreSQL) using the compute resources of your existing cluster. No ETL is required to make the data accessible: you can join a table in a completely different database to a table in your cluster.
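Here is a minimal sketch of the setup, again using the Redshift Data API: the first statement maps an external PostgreSQL database into the cluster as a schema, with credentials pulled from AWS Secrets Manager, and the second joins one of its live tables to a table in the cluster. The host, IAM role, secret ARN, and table names are all placeholders.

```python
import boto3

client = boto3.client("redshift-data")

def run(sql: str) -> None:
    # Submit a statement to the cluster through the Redshift Data API.
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )

# Map the external PostgreSQL database into the cluster as a schema.
run("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS orders_fed
FROM POSTGRES
DATABASE 'ordersdb' SCHEMA 'public'
URI 'orders-db.example.internal' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:orders-db-creds';
""")

# Join the live operational table to a table that lives in the cluster,
# with no pipeline in between.
run("""
SELECT o.order_id, o.order_total, c.customer_segment
FROM orders_fed.orders o
JOIN public.customer_dim c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - 7;
""")
```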
Integrated Security (Lake Formation)
The advent of the data lake was attractive in many ways, but it often meant administrators managed access to the data in no fewer than two places: the permissions on the database engine itself and the permissions on the S3 locations in the data lake. It’s for this reason that many admins insisted that access to the data occur only in the database, reducing permission management to a single location. It also meant the business was missing out on other data in the lake that hadn’t yet made it to the database.
Lake Formation offers a single place to manage access to data inside and outside of your Redshift cluster. The column- and row-level permissions available in Lake Formation mean administrators can secure data lake records just like Redshift table records. As long as the external tables you query through your cluster are registered in the Glue Data Catalog, a person with the right permissions can query across your Redshift cluster and data lake in a secure manner. Each time a user runs a query, Lake Formation verifies that the specific principal has access to the table, vends temporary credentials to Redshift Spectrum, and the query runs.
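As a hypothetical example, the sketch below uses boto3 to grant an analyst role SELECT access to a few columns of a catalog table; the role, database, table, and column names are placeholders, and the same grant could be made in the Lake Formation console instead.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a subset of columns of a Glue Data Catalog table to an
# analyst role. Lake Formation enforces this whether the table is read
# through Redshift Spectrum, Athena, or another integrated service.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "datalake_db",
            "Name": "click_events",
            "ColumnNames": ["event_date", "customer_id", "page"],
        }
    },
    Permissions=["SELECT"],
)
```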
On-Demand Compute (Concurrency Scaling)
For anyone taking full advantage of the Cloud, provisioning for peak compute needs has gone the way of the dodo; autoscaling groups and containerization have largely solved this problem.
Ironically, it took several years for the database, often the most heavily demanded resource in an architecture, to finally make it to the autoscaling party. How exactly would one containerize a SQL Server database, licensing and all? Licensing was not the least of the reasons that was never a good idea, so instead one simply built a server big enough to handle peak load without breaking or putting the business at risk. This posed the classic compute cost problem: paying for idle capacity during quiet periods, and risking too little capacity during peak load. Unfortunately, Redshift was no exception.
Redshift Concurrency Scaling is the answer to these database scaling woes. During periods of high query demand on your cluster, Redshift adds transient compute capacity to meet the demand and removes it once demand has subsided. You pay for the concurrency-scaling clusters only while they’re running, which means you pay for peak capacity during peak demand, not during times of low demand.
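Enabling it is a cluster parameter group change rather than a query-time switch. The sketch below, with a placeholder parameter group name and queue settings, turns on concurrency scaling for a manual WLM queue and caps how many transient clusters Redshift may add on top of the main cluster.

```python
import json

import boto3

redshift = boto3.client("redshift")

# A manual WLM queue with concurrency scaling enabled: when the queue
# backs up, eligible queries are routed to transient scaling clusters.
wlm_config = [
    {
        "query_concurrency": 5,
        "concurrency_scaling": "auto",
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",  # placeholder name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        },
        {
            # Cap the number of transient clusters Redshift may add.
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "4",
        },
    ],
)
```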
Automated Query Management (Automatic WLM and SQA)
In the same vein as concurrency, managing different query needs across an organization can be a difficult task. Workload Management (WLM) queues were Redshift’s early answer to this, allowing administrators to assign priority to certain users so their experience stayed optimal. The CEO’s daily report could jump the line and get really fast results, while less critical queries waited for resources to become available. Long-running ETL jobs could also be prioritized accordingly. This was a tedious process in and of itself and required revisiting priorities often.
Automatic WLM and Short Query Acceleration (SQA) handle most of this balancing act for you. Automatic WLM determines the compute each query needs and adjusts the concurrency level accordingly. When demanding queries are in the system (for example, hash joins between large tables), concurrency is lower; when lighter queries (such as inserts, deletes, scans, or simple aggregations) are submitted, concurrency is higher. Automatic WLM and SQA work together to ensure lightweight queries complete even while long-running, demanding queries are in flight.
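Turning both on is again a parameter group change. The sketch below assumes a placeholder parameter group name and uses the wlm_json_configuration format to switch the cluster to Automatic WLM and enable SQA.

```python
import json

import boto3

redshift = boto3.client("redshift")

# Automatic WLM plus Short Query Acceleration: Redshift manages
# concurrency and memory, and short queries get an express path.
wlm_config = [
    {"auto_wlm": True},
    {"short_query_queue": True},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",  # placeholder name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```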
If any of this sounds like something that can help your business, please reach out to us at info@1Strategy.com. We’d love to schedule a call to discuss how 1Strategy can help you get started.