Hydrating Data Mesh from AWS Lake Formation, Glue, and DataBrew

Hülya Pamukçu Crowell
Feb 20, 2023 · 4 min read


In our previous articles, we looked at defining and building a data mesh experience, starting with a meta-architecture, a data model, a catalog UI, and a mesh graph. So far, we have laid the foundation for converging our data ecosystems on the core data mesh principles, whether we already have resources in the cloud or on premises, or are starting from scratch. The data model's entities, relations, and constraints will help us integrate with existing metadata stores and data services; the catalog and the mesh graph will be the unifying landing page. In this article, we will start integrating with AWS Lake Formation, Glue, and DataBrew and hydrate entities in Data Mesh for a unified, data product-based experience.

Approach

AWS Lake Formation (LF) and Glue provide a central metadata catalog and central permission governance; DataBrew adds native data quality and lineage capabilities. Starting with existing resources in the AWS lake catalog, we will add data product metadata and introduce constraints and mappings to create Data Mesh entities. Finally, we will view our catalog's central dashboard and mesh graph hydrated from AWS.

AWS Resources

For this exercise, we create a small set of resources to build a mesh with a single domain, two data products, and two datasets, enough to illustrate the concepts.

We are starting with the following AWS resources:

  • S3 source table
  • Glue catalog database and tables
  • Glue crawler (crawls the S3 source and creates the catalog table)
  • Lake Formation catalog
  • DataBrew dataset (derived from the S3 source table)
  • DataBrew quality rulesets and profile jobs (check the quality of the source and destination tables)
  • DataBrew recipe job (a series of recipes to unnest and transform the data)
  • DataBrew output dataset (created by the recipe job and registered in the catalog; the location is also S3)
  • IAM users and roles (data-analyst, data-engineer, glue-role, databrew-role, etc.)

Note: In our example, we use a single Lake Formation account and region, but it can be extended to multiple accounts and regions. AWS also supports cross-account sharing of LF resources.

To identify the resources represented in the mesh and to define the right associations, we are introducing the following constraints:

  • LF databases — as DataProducts — should include the following LFTags:
    – LFTag dataproduct=yes
    – LFTag domain=[domain-name]
  • LF tables — as Datasets — should use TBAC (tag-based access control) permissions for reader and writer grants
    – For example, a persona=analyst LFTag added to a table, along with an LFTag policy of [Identity] | dataproduct=yes && persona=analyst | SELECT, will result in Data Mesh metadata marking [Identity] as a "reader" of all matching tables
  • IAM users and roles — as Roles — should be assigned to a domain with role tags domain=[domain-name]
    – We need this information to enforce cross-domain permissions, as discussed here.
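The reader derivation from TBAC grants can be sketched in Python. This is a minimal sketch under the constraints above: the record shape mirrors the entries returned by boto3's Lake Formation `list_permissions` call, but here we hard-code one sample grant instead of calling AWS, and the function name is our own.

```python
def readers_from_permissions(permissions):
    """Map LF tag-based SELECT grants to mesh 'reader' roles.

    Each record mirrors an entry of lakeformation.list_permissions():
    a Principal, an LFTagPolicy resource (the tag expression), and the
    granted permissions.
    """
    readers = []
    for perm in permissions:
        if "SELECT" not in perm.get("Permissions", []):
            continue
        policy = perm.get("Resource", {}).get("LFTagPolicy")
        if policy is None:
            continue
        expr = {e["TagKey"]: e["TagValues"] for e in policy["Expression"]}
        # Per our constraint, only tag policies scoped to data products count.
        if "yes" in expr.get("dataproduct", []):
            readers.append({
                "principal": perm["Principal"]["DataLakePrincipalArn"],
                "expression": expr,
                "role": "reader",
            })
    return readers

# Sample grant: the persona=analyst example from the constraint above.
perms = [{
    "Principal": {"DataLakePrincipalArn":
                  "arn:aws:iam::123456789012:role/data-analyst"},
    "Resource": {"LFTagPolicy": {
        "ResourceType": "TABLE",
        "Expression": [
            {"TagKey": "dataproduct", "TagValues": ["yes"]},
            {"TagKey": "persona", "TagValues": ["analyst"]},
        ],
    }},
    "Permissions": ["SELECT"],
}]
mesh_readers = readers_from_permissions(perms)
```

In a live integration, the same function would consume pages from `lakeformation.list_permissions` directly; "writer" roles follow the same pattern with INSERT/ALTER grants.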

Querying the metadata from AWS

The following summarizes the mappings we use, together with the constraints above, to query metadata from AWS and create Data Mesh entities.

  • Domains: collections of data products, datasets, and roles, as labeled by LFTags and tags. An LF catalog can contain multiple domains.
  • DataProducts: an LF/Glue database maps to a single data product
  • Datasets: tables map to datasets
  • Roles: IAM users or roles map to mesh roles; they are readers or readers/writers based on the LFTag policies defined in Lake Formation
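The database-to-DataProduct and table-to-Dataset mappings can be sketched as a pure hydration step. The input shapes are our assumptions, mirroring `glue.get_databases()["DatabaseList"]`, `glue.get_tables()["TableList"]`, and the tags returned by `lakeformation.get_resource_lf_tags`; the function and entity field names are hypothetical, not a fixed Data Mesh schema.

```python
def hydrate_entities(databases, tables_by_db, lf_tags_by_db):
    """Build Data Mesh entities from Glue/LF metadata.

    databases      -- mirrors glue.get_databases()["DatabaseList"]
    tables_by_db   -- database name -> glue.get_tables()["TableList"]
    lf_tags_by_db  -- database name -> LFTags from get_resource_lf_tags
    """
    domains, products, datasets = {}, [], []
    for db in databases:
        name = db["Name"]
        tags = {t["TagKey"]: t["TagValues"]
                for t in lf_tags_by_db.get(name, [])}
        # Constraint: only databases tagged dataproduct=yes become DataProducts.
        if "yes" not in tags.get("dataproduct", []):
            continue
        domain = tags.get("domain", ["unassigned"])[0]
        domains.setdefault(domain, {"name": domain, "dataProducts": []})
        domains[domain]["dataProducts"].append(name)
        products.append({"name": name, "domain": domain})
        # Every table in a DataProduct database maps to a Dataset.
        for table in tables_by_db.get(name, []):
            datasets.append({"name": table["Name"], "dataProduct": name})
    return {"domains": list(domains.values()),
            "dataProducts": products,
            "datasets": datasets}

# Sample metadata for our two-dataset exercise (hypothetical names).
mesh = hydrate_entities(
    databases=[{"Name": "sales"}, {"Name": "staging"}],
    tables_by_db={"sales": [{"Name": "orders"}, {"Name": "orders_clean"}]},
    lf_tags_by_db={"sales": [
        {"TagKey": "dataproduct", "TagValues": ["yes"]},
        {"TagKey": "domain", "TagValues": ["commerce"]},
    ]},
)
```

The untagged `staging` database is skipped, which is the point of the dataproduct=yes constraint: only databases explicitly published as data products surface in the mesh catalog.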

Data Mesh experience built on AWS resources

Below is the mesh catalog showing data products, domains, and health info.

Below is the mesh graph showing relations between DataProducts, Datasets, and Roles.

Recap

In this article, we outlined an approach to applying mesh concepts to AWS Lake Formation and Glue resources and viewing them in the Data Mesh catalog and the mesh graph. As next steps, we can add ownership of data products and domains, cross-account and multi-region resources, navigation to resources, and mutation operations.

Previous articles
