Use Amazon Redshift Spectrum with row-level and cell-level security policies defined in AWS Lake Formation

Data warehouses and data lakes are key to an enterprise data management strategy. A data lake is a centralized repository that consolidates your data in any format at any scale and makes it available for different kinds of analytics. A data warehouse, on the other hand, has cleansed, enriched, and transformed data that is optimized for faster queries. Amazon Redshift is a cloud-based data warehouse that powers a lake house architecture, which enables you to query the data in a data warehouse and an Amazon Simple Storage Service (Amazon S3) data lake using familiar SQL statements and gain deeper insights.

Data lakes often contain data for multiple business units, users, locations, vendors, and tenants. Enterprises want to share their data while balancing compliance and security needs. To satisfy compliance requirements and to achieve data isolation, enterprises often need to control access at the row level and cell level. For example:

  • If you have a multi-tenant data lake, you may want each tenant to be able to view only those rows that are associated with their tenant ID
  • You may have data for multiple portfolios in the data lake and you need to control access for various portfolio managers
  • You may have sensitive information or personally identifiable information (PII) that can be viewed by users with elevated privileges only

AWS Lake Formation makes it easy to set up a secure data lake and access controls for these kinds of use cases. You can use Lake Formation to centrally define security, governance, and auditing policies, thereby achieving unified governance for your data lake. Lake Formation supports row-level security and cell-level security:

  • Row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user
  • Cell-level security builds on row-level security by allowing you to apply filter expressions on each row to hide or show specific columns

Amazon Redshift is the fastest and most widely used cloud data warehouse. Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to query data from and write data back to Amazon S3 in open formats. You can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in Amazon S3 using familiar ANSI SQL. This gives you the flexibility to store highly structured, frequently accessed data in an Amazon Redshift data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in Amazon S3. Redshift Spectrum integrates with Lake Formation natively. This integration enables you to define data filters in Lake Formation that specify row-level and cell-level access control for users on your data and then query it using Redshift Spectrum.

In this post, we present a sample multi-tenant scenario and describe how to define row-level and cell-level security policies in Lake Formation. We also show how these policies are applied when querying the data using Redshift Spectrum.

Solution overview

In our use case, Example Corp has built an enterprise data lake on Amazon S3. They store data for multiple tenants in the data lake and query it using Redshift Spectrum. Example Corp maintains separate AWS Identity and Access Management (IAM) roles for each of their tenants and wants to control access to the multi-tenant dataset based on their IAM role.

Example Corp needs to ensure that each tenant can view only those rows that are associated with them. For example, Tenant1 should see only rows where tenantid = 'Tenant1', and Tenant2 should see only rows where tenantid = 'Tenant2'. In addition, tenants can view sensitive columns such as phone, email, and date of birth only for records associated with specific countries.

The following is a screenshot of the multi-tenant dataset we use to demonstrate our solution. It has data for two tenants: Tenant1 and Tenant2. tenantid is the column that distinguishes the data associated with each tenant.

To solve this use case, we implement row-level and cell-level security in Lake Formation by defining data filters. When Example Corp’s tenants query the data using Redshift Spectrum, the service checks filters defined in Lake Formation and returns only the data that the tenant has access to.

Lake Formation metadata tables contain information about data in the data lake, including schema information, partition information, and data location. You can use them to access underlying data in the data lake and manage that data with Lake Formation permissions. You can apply row-level and cell-level security to Lake Formation tables. In this post, we provide a walkthrough using a standard Lake Formation table.

The following diagram illustrates our solution architecture.

The solution workflow consists of the following steps:

  1. Create IAM roles for the tenants.
  2. Register an Amazon S3 location in Lake Formation.
  3. Create a database and use AWS Glue crawlers to create a table in Lake Formation.
  4. Create data filters in Lake Formation.
  5. Grant access to the IAM roles in Lake Formation.
  6. Attach the IAM roles to the Amazon Redshift cluster.
  7. Create an external schema in Amazon Redshift.
  8. Create Amazon Redshift users for each tenant and grant access to the external schema.
  9. Users Tenant1 and Tenant2 assume their respective IAM roles and query data in their external schemas in Amazon Redshift using the SQL query editor or any SQL client.

Prerequisites

This walkthrough assumes that you have the following prerequisites:

Create IAM roles for the tenants

Create IAM roles Tenant1ReadRole and Tenant2ReadRole for users with elevated privileges for the two tenants, with Amazon Redshift as the trusted entity, and attach the following policy to both roles:

{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": [
			"lakeformation:GetDataAccess",
			"glue:GetTable",
			"glue:GetTables",
			"glue:SearchTables",
			"glue:GetDatabase",
			"glue:GetDatabases",
			"glue:GetPartition",
			"glue:GetPartitions"
		],
		"Resource": "*"
	}]
}
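For readers who script their setup, the trust relationship can be expressed programmatically. The following is a minimal sketch, not part of the original walkthrough: the `redshift.amazonaws.com` service principal is the standard trusted entity for Redshift cluster roles, and the boto3 call in the comment shows how the role could be created.

```python
import json

# Trust policy that makes Amazon Redshift the trusted entity for the
# tenant roles (Tenant1ReadRole and Tenant2ReadRole).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Sketch only: with AWS credentials configured, the role could be created with
#   boto3.client("iam").create_role(
#       RoleName="Tenant1ReadRole",
#       AssumeRolePolicyDocument=json.dumps(trust_policy))
print(json.dumps(trust_policy, indent=2))
```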

Register an Amazon S3 location in Lake Formation

We use the sample multi-tenant dataset SpectrumRowLevelFiltering.csv. Complete the following steps to register the location of this dataset in Lake Formation:

  1. Download the dataset and upload it to the Amazon S3 path s3://<your_bucket>/order_details/SpectrumRowLevelFiltering.csv.
  2. On the Lake Formation console, choose Data lake locations in the navigation pane.
  3. Choose Register location.
  4. For Amazon S3 path, enter the S3 path of your dataset.
  5. For IAM role, choose either the AWSServiceRoleForLakeFormationDataAccess service-linked role (the default) or the Lake Formation administrator role mentioned in the prerequisites.
  6. Choose Register location.
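If you prefer to script this step, the registration can be sketched with Lake Formation's RegisterResource API. This is an assumption-laden sketch: `<your_bucket>` stays a placeholder, and `UseServiceLinkedRole=True` corresponds to choosing the default AWSServiceRoleForLakeFormationDataAccess service-linked role.

```python
# Parameters for Lake Formation's RegisterResource API; sketch only.
# <your_bucket> is the walkthrough's placeholder, not a real bucket name.
register_params = {
    "ResourceArn": "arn:aws:s3:::<your_bucket>/order_details",
    # True selects the default AWSServiceRoleForLakeFormationDataAccess role
    "UseServiceLinkedRole": True,
}

# With credentials configured, the call would be:
#   boto3.client("lakeformation").register_resource(**register_params)
print(register_params["ResourceArn"])
```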

Create a database and a table in Lake Formation

To create your database and table, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator.
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Choose Create database.
  4. For Name, enter rs_spectrum_rls_blog.
  5. If Use only IAM access control for new tables in this database is selected, uncheck it.
  6. Choose Create database.
    Next, you create a new data lake table.
  7. On the AWS Glue console, choose Crawlers in the navigation pane.
  8. Choose Add crawler.
  9. For Crawler name, enter order_details.
  10. For Specify crawler source type, keep the default selections.
  11. For Add data store, choose Include path, and choose the S3 path to the dataset (s3://<your_bucket>/order_details/).
  12. For Choose IAM Role, choose Create an IAM role and enter the suffix rs_spectrum_rls_blog.
  13. For Frequency, choose Run on demand.
  14. For Database, choose the database you just created (rs_spectrum_rls_blog).
  15. Choose Finish to create the crawler.
  16. Grant CREATE TABLE permissions and DESCRIBE/ALTER/DELETE database permissions to the IAM role you created in Step 12.
  17. To run the crawler, in the navigation pane, choose Crawlers.
  18. Select the crawler order_details and choose Run crawler.
    When the crawler is complete, you can find the table order_details created under the database rs_spectrum_rls_blog in the AWS Glue Data Catalog.
  19. On the AWS Glue console, in the navigation pane, choose Databases.
  20. Select the database rs_spectrum_rls_blog and choose View tables.
  21. Choose the table order_details.

The following screenshot is the schema of the order_details table.
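Steps 7–18 above can also be scripted. The following boto3 sketch mirrors the crawler definition; the role name is an assumption based on the suffix entered in step 12 (the console prefixes AWSGlueServiceRole- to it), and `<your_bucket>` remains a placeholder.

```python
# Crawler definition mirroring the console steps; sketch only.
crawler_params = {
    "Name": "order_details",
    # Assumed name for the role created in step 12 (the console prefixes
    # "AWSGlueServiceRole-" to the suffix you enter).
    "Role": "AWSGlueServiceRole-rs_spectrum_rls_blog",
    "DatabaseName": "rs_spectrum_rls_blog",
    "Targets": {"S3Targets": [{"Path": "s3://<your_bucket>/order_details/"}]},
}

# With credentials configured:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_params)
#   glue.start_crawler(Name="order_details")  # the "Run on demand" equivalent
```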

Create data filters in Lake Formation

To implement row-level and cell-level security, first you create data filters. Then you choose that data filter while granting SELECT permission on the tables. For this use case, you create two data filters: one for Tenant1 and one for Tenant2.

  1. On the Lake Formation console, choose Data catalog in the navigation pane, then choose Data filters.
  2. Choose Create new filter.
    Let’s create the first data filter filter-tenant1-order-details restricting the rows Tenant1 is able to see in table order_details.
  3. For Data filter name, enter filter-tenant1-order-details.
  4. For Target database, choose rs_spectrum_rls_blog.
  5. For Target table, choose order_details.
  6. For Column-level access, select Include columns and then choose the following columns: c_emailaddress, c_phone, c_dob, c_firstname, c_address, c_country, c_lastname, and tenantid.
  7. For Row filter expression, enter tenantid = 'Tenant1' and c_country in ('USA','Spain').
  8. Choose Create filter.
  9. Repeat these steps to create another data filter filter-tenant2-order-details, with the row filter expression tenantid = 'Tenant2' and c_country in ('USA','Canada').
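The two data filters can equivalently be expressed through Lake Formation's CreateDataCellsFilter API. The following sketch shows the request shape; `<account_id>` is a placeholder for your AWS account ID (the catalog ID), not a value from the walkthrough.

```python
COLUMNS = ["c_emailaddress", "c_phone", "c_dob", "c_firstname",
           "c_address", "c_country", "c_lastname", "tenantid"]

def data_filter(name, row_filter):
    # Shape of the TableData argument to
    #   boto3.client("lakeformation").create_data_cells_filter(TableData=...)
    return {
        "TableCatalogId": "<account_id>",  # placeholder: your AWS account ID
        "DatabaseName": "rs_spectrum_rls_blog",
        "TableName": "order_details",
        "Name": name,
        "RowFilter": {"FilterExpression": row_filter},   # row-level security
        "ColumnNames": COLUMNS,  # cell-level security: only these columns
    }

tenant1_filter = data_filter(
    "filter-tenant1-order-details",
    "tenantid = 'Tenant1' and c_country in ('USA','Spain')")
tenant2_filter = data_filter(
    "filter-tenant2-order-details",
    "tenantid = 'Tenant2' and c_country in ('USA','Canada')")
```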

Grant access to IAM roles in Lake Formation

After you create the data filters, you need to attach them to the table to grant access to a principal. First let’s grant access to order_details to the IAM role Tenant1ReadRole using the data filter we created for Tenant1.

  1. On the Lake Formation console, in the navigation pane, under Permissions, choose Data Permissions.
  2. Choose Grant.
  3. In the Principals section, select IAM users and roles.
  4. For IAM users and roles, choose the role Tenant1ReadRole.
  5. In the LF-Tags or catalog resources section, choose Named data catalog resources.
  6. For Databases, choose rs_spectrum_rls_blog.
  7. For Tables, choose order_details.
  8. For Data filters, choose filter-tenant1-order-details.
  9. For Data filter permissions, choose Select.
  10. Choose Grant.
  11. Repeat these steps with the IAM role Tenant2ReadRole and data filter filter-tenant2-order-details.
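The same grants can be sketched against the GrantPermissions API, scoped to a data cells filter rather than the whole table. Again, `<account_id>` is a placeholder.

```python
def grant_filter_access(role_arn, filter_name):
    # Request shape for
    #   boto3.client("lakeformation").grant_permissions(**request)
    # scoped to a data cells filter instead of the whole table.
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"DataCellsFilter": {
            "TableCatalogId": "<account_id>",  # placeholder
            "DatabaseName": "rs_spectrum_rls_blog",
            "TableName": "order_details",
            "Name": filter_name,
        }},
        "Permissions": ["SELECT"],
    }

grant_tenant1 = grant_filter_access(
    "arn:aws:iam::<account_id>:role/Tenant1ReadRole",
    "filter-tenant1-order-details")
```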

Attach the IAM roles to the Amazon Redshift cluster

To attach your roles to the cluster, complete the following steps:

  1. On the Amazon Redshift console, in the navigation menu, choose CLUSTERS, then select the name of the cluster that you want to update.
  2. On the Actions menu, choose Manage IAM roles.
    The IAM roles page appears.
  3. Either choose Enter ARN and enter an ARN of the Tenant1ReadRole IAM role, or choose the Tenant1ReadRole IAM role from the list.
  4. Choose Add IAM role.
  5. Choose Done to associate the IAM role with the cluster.
    The cluster is modified to complete the change.
  6. Repeat these steps to add the Tenant2ReadRole IAM role to the Amazon Redshift cluster.

Amazon Redshift allows you to attach up to 50 IAM roles to a cluster to access other AWS services.
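For scripted setups, attaching both roles can be sketched with the Redshift ModifyClusterIamRoles API. `<cluster_id>` and `<account_id>` are placeholders, not values from the walkthrough.

```python
# Request shape for attaching both tenant roles in one call; sketch only.
attach_params = {
    "ClusterIdentifier": "<cluster_id>",  # placeholder: your cluster name
    "AddIamRoles": [
        "arn:aws:iam::<account_id>:role/Tenant1ReadRole",
        "arn:aws:iam::<account_id>:role/Tenant2ReadRole",
    ],
}

# With credentials configured:
#   boto3.client("redshift").modify_cluster_iam_roles(**attach_params)
```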

Create an external schema in Amazon Redshift

Create an external schema on the Amazon Redshift cluster, one for each IAM role, using the following code:

CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_tenant1
FROM DATA CATALOG DATABASE 'rs_spectrum_rls_blog'
IAM_ROLE '<<Tenant1ReadRole ARN>>'
REGION 'us-east-1';

CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_tenant2
FROM DATA CATALOG DATABASE 'rs_spectrum_rls_blog'
IAM_ROLE '<<Tenant2ReadRole ARN>>'
REGION 'us-east-1';

Create Amazon Redshift users for each tenant and grant access to the external schema

Complete the following steps:

  1. Create Amazon Redshift users to restrict access to the external schemas (connect to the cluster with a user that has permission to create users or superusers) using the following code:
    CREATE USER tenant1_user WITH PASSWORD '<password>';
    CREATE USER tenant2_user WITH PASSWORD '<password>';

  2. Create the read-only role tenant1_ro to provide read-only access to the spectrum_tenant1 schema:
    create role tenant1_ro;

  3. Grant usage on the spectrum_tenant1 schema to the read-only tenant1_ro role:
    grant usage on schema spectrum_tenant1 to role tenant1_ro;

  4. Now assign the user to the read-only tenant1_ro role:
    grant role tenant1_ro to tenant1_user;

  5. Repeat the same steps to grant permission to the user tenant2_user:
    create role tenant2_ro;
    grant usage on schema spectrum_tenant2 to role tenant2_ro;
    grant role tenant2_ro to tenant2_user;

Tenant1 and Tenant2 users run queries using the SQL editor or a SQL client

To test the permission levels for different users, connect to the database using the query editor with that user.

In the Query Editor in the Amazon Redshift console, connect to the cluster with tenant1_user and run the following query:

-- Query table 'order_details' in schema spectrum_tenant1 with role Tenant1ReadRole

SELECT * FROM spectrum_tenant1.order_details;

In the following screenshot, tenant1_user is only able to see records where the tenantid value is Tenant1 and only the customer PII fields specific to the US and Spain.

To validate the Lake Formation data filters, the following screenshot shows that Tenant1 can’t see any records for Tenant2.

Reconnect to the cluster using tenant2_user and run the following query:

-- Query table 'order_details' in schema spectrum_tenant2 with role Tenant2ReadRole

SELECT * FROM spectrum_tenant2.order_details;

In the following screenshot, tenant2_user is only able to see records where the tenantid value is Tenant2 and only the customer PII fields specific to the US and Canada.

To validate the Lake Formation data filters, the following screenshot shows that Tenant2 can’t see any records for Tenant1.

Conclusion

In this post, you learned how to implement row-level and cell-level security on an Amazon S3-based data lake using data filters and access control features in Lake Formation. You also learned how to use Redshift Spectrum to access the data from Amazon S3 while adhering to the row-level and cell-level security policies defined in Lake Formation.

You can further enhance your understanding of Lake Formation row-level and cell-level security by referring to Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security.

To learn more about Redshift Spectrum, refer to Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

For more information about configuring row-level access control natively in Amazon Redshift, refer to Achieve fine-grained data security with row-level access control in Amazon Redshift.


About the authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS. Her expertise is in building large-scale data warehouses, both on premises and in the cloud. She provides architectural guidance to our customers on end-to-end data warehousing implementations and migrations.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS.
