Hydrating Data Mesh from AWS Lake Formation, Glue, and DataBrew

Hülya Pamukçu Crowell
Feb 20, 2023 · 4 min read


In our previous articles, we looked at defining and building a data mesh experience, starting with a meta-architecture, a data model, a catalog UI, and a mesh graph. So far, we have laid the foundation for converging our data ecosystems on the core data mesh principles, whether we already have resources in the cloud or on premises, or are starting from scratch. The data model's entities, relations, and constraints will help us integrate with existing metadata stores and data services; the catalog and the mesh graph will be the unifying landing page. In this article, we will start integrating with AWS Lake Formation, Glue, and DataBrew and hydrate entities in Data Mesh for a unified, data product-based experience.

Approach

AWS Lake Formation (LF) and Glue provide a central metadata catalog and central permission governance; DataBrew adds native data quality and lineage capabilities. Starting with existing resources in the AWS lake catalog, we will add data product metadata and introduce constraints and mappings to create Data Mesh entities. Finally, we will view our catalog's central dashboard and mesh graph hydrated from AWS.

AWS Resources

For this exercise, we create a small set of resources to build a mesh with a single domain, two data products, and two datasets, enough to illustrate the concepts.

We are starting with the following AWS resources:

  • S3 source table
  • Glue catalog database and tables
  • Glue crawler (crawls the S3 source and creates the catalog table)
  • Lake Formation catalog
  • DataBrew dataset (derived from the S3 source table)
  • DataBrew quality rulesets and profile jobs (check the quality of the source and destination tables)
  • DataBrew recipe job (a series of recipes to unnest and transform the data)
  • DataBrew output dataset (created by the recipe job and registered in the catalog; the location is also S3)
  • IAM users and roles (data-analyst, data-engineer, glue-role, databrew-role, etc.)

Note: In our example, we use a single Lake Formation account and region, but it can be extended to multiple accounts and regions. AWS also supports cross-account sharing of LF resources.

To identify the resources represented in the mesh and to define the right associations, we are introducing the following constraints:

  • LF databases — as DataProducts — should include the following LFTags:
    – LFTag dataproduct=yes
    – LFTag domain=[domain-name]
  • LF tables — as Datasets — should use TBAC (tag-based access control) permissions for reader and writer grants
    – For example, a persona=analyst LFTag added to a table, along with an LFTag policy of [Identity] | dataproduct=yes && persona=analyst | SELECT, will result in Data Mesh metadata marking [Identity] as a "reader" of all matching tables
  • IAM users and roles — as Roles — should be assigned to a domain with role tags domain=[domain-name]
    – We need this information to enforce cross-domain permissions, as discussed here.
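The reader derivation from TBAC grants can be sketched in Python. This is a minimal sketch under the constraints above: the record shape mirrors the entries returned by boto3's Lake Formation `list_permissions` call, but here we hard-code one sample grant instead of calling AWS, and the function name is our own.

```python
def readers_from_permissions(permissions):
    """Map LF tag-based SELECT grants to mesh 'reader' roles.

    Each record mirrors an entry of lakeformation.list_permissions():
    a Principal, an LFTagPolicy resource (the tag expression), and the
    granted permissions.
    """
    readers = []
    for perm in permissions:
        if "SELECT" not in perm.get("Permissions", []):
            continue
        policy = perm.get("Resource", {}).get("LFTagPolicy")
        if policy is None:
            continue
        expr = {e["TagKey"]: e["TagValues"] for e in policy["Expression"]}
        # Per our constraint, only tag policies scoped to data products count.
        if "yes" in expr.get("dataproduct", []):
            readers.append({
                "principal": perm["Principal"]["DataLakePrincipalArn"],
                "expression": expr,
                "role": "reader",
            })
    return readers

# Sample grant: the persona=analyst example from the constraint above.
perms = [{
    "Principal": {"DataLakePrincipalArn":
                  "arn:aws:iam::123456789012:role/data-analyst"},
    "Resource": {"LFTagPolicy": {
        "ResourceType": "TABLE",
        "Expression": [
            {"TagKey": "dataproduct", "TagValues": ["yes"]},
            {"TagKey": "persona", "TagValues": ["analyst"]},
        ],
    }},
    "Permissions": ["SELECT"],
}]
mesh_readers = readers_from_permissions(perms)
```

In a live integration, the same function would consume pages from `lakeformation.list_permissions` directly; "writer" roles follow the same pattern with INSERT/ALTER grants.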

Querying the metadata from AWS

The following summarizes the mappings we use, together with the constraints above, to query metadata from AWS and create Data Mesh entities.

  • Domains: collections of data products, datasets, and roles, as labeled by LFTags and tags. An LF catalog can contain multiple domains.
  • DataProducts: an LF/Glue database maps to a single data product
  • Datasets: tables map to datasets
  • Roles: IAM users or roles map to mesh roles; they are readers or readers/writers based on the LFTag policies defined in Lake Formation
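The database-to-DataProduct and table-to-Dataset mappings can be sketched as a pure hydration step. The input shapes are our assumptions, mirroring `glue.get_databases()["DatabaseList"]`, `glue.get_tables()["TableList"]`, and the tags returned by `lakeformation.get_resource_lf_tags`; the function and entity field names are hypothetical, not a fixed Data Mesh schema.

```python
def hydrate_entities(databases, tables_by_db, lf_tags_by_db):
    """Build Data Mesh entities from Glue/LF metadata.

    databases      -- mirrors glue.get_databases()["DatabaseList"]
    tables_by_db   -- database name -> glue.get_tables()["TableList"]
    lf_tags_by_db  -- database name -> LFTags from get_resource_lf_tags
    """
    domains, products, datasets = {}, [], []
    for db in databases:
        name = db["Name"]
        tags = {t["TagKey"]: t["TagValues"]
                for t in lf_tags_by_db.get(name, [])}
        # Constraint: only databases tagged dataproduct=yes become DataProducts.
        if "yes" not in tags.get("dataproduct", []):
            continue
        domain = tags.get("domain", ["unassigned"])[0]
        domains.setdefault(domain, {"name": domain, "dataProducts": []})
        domains[domain]["dataProducts"].append(name)
        products.append({"name": name, "domain": domain})
        # Every table in a DataProduct database maps to a Dataset.
        for table in tables_by_db.get(name, []):
            datasets.append({"name": table["Name"], "dataProduct": name})
    return {"domains": list(domains.values()),
            "dataProducts": products,
            "datasets": datasets}

# Sample metadata for our two-dataset exercise (hypothetical names).
mesh = hydrate_entities(
    databases=[{"Name": "sales"}, {"Name": "staging"}],
    tables_by_db={"sales": [{"Name": "orders"}, {"Name": "orders_clean"}]},
    lf_tags_by_db={"sales": [
        {"TagKey": "dataproduct", "TagValues": ["yes"]},
        {"TagKey": "domain", "TagValues": ["commerce"]},
    ]},
)
```

The untagged `staging` database is skipped, which is the point of the dataproduct=yes constraint: only databases explicitly published as data products surface in the mesh catalog.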

Data Mesh experience built on AWS resources

Below is the mesh catalog showing data products, domains, and health info.

Below is the mesh graph showing relations between DataProducts, Datasets, and Roles.

Recap

In this article, we outlined an approach to applying mesh concepts to AWS Lake Formation and Glue resources and viewing them in the Data Mesh catalog and the mesh graph. As next steps, we can add ownership of data products and domains, cross-account and multi-region resources, navigation to resources, and mutation operations.

Previous articles
