Query Engine Overview

The Immuta Query Engine acts as a proxy, implemented as an extension to the open source database PostgreSQL. Using the Foreign Data Wrapper (FDW) framework and planner hooks exposed by PostgreSQL, the Query Engine adds support for querying data in remote databases over both ODBC and REST as if the data existed locally within the Query Engine. However, no data is ever copied or persisted in the Query Engine; data is fetched directly from the remote database, transformed into Postgres tuples, and returned to the user.

Query Lifecycle

View the slides below for an overview of the Query Engine lifecycle and a detailed illustration of how a query is rewritten.

Query Lifecycle Explained

1. User authenticates to the Immuta Query Engine(QE) with Immuta SQL credentials.

To properly authenticate users, enforce policies, and ensure that queries are correctly audited, the QE must be able to communicate with the Immuta Web Service. When the Immuta service is started it will attempt to call out to Postgres and write relevant metadata to immutasys.service_config. The values in that table will be used when establishing future connections to the Immuta service from the QE. An Immuta system account and an API key for that account are also written to the immutasys.dbtoken table. The system account will be used to make system specific calls to the Immuta service. The API key will be used to periodically refresh a token used to make these calls. The token will be shared in a shared memory object within Postgres. The API Key associated with the account can be manually refreshed in the event of error via the Immuta Application Configuration page. Without a valid token/api key all queries will fail as the QE will not be able to check/enforce policies or audit those queries.

2. User executes a SQL SELECT statement against an Immuta virtual table.

When a new connection is opened to PostgreSQL and a user connects to the Immuta database with the role bodata_user (all normal user accounts) Immuta verifies that the Postgres account is still valid with the Immuta web service. If the account is no longer valid (or the user hasn’t authenticated with the Immuta service within the required time window) the user will be denied access.

The vast majority of remote databases supported by the QE are supported via ODBC. Due to various license restrictions as well as the fact that some ODBC drivers are not free, Immuta does not ship with drivers for all of our supported databases. Customers can use the Immuta Application Configuration page to upload drivers for these databases which will then be installed on all QE nodes so that they can be used by Immuta. When a new driver is uploaded the relevant metadata is written to the immutasys.driver-metadata table. Similar to Kerberos, a PostgreSQL background worker is used to periodically check for new drivers and when found to install and register those drivers with the ODBC manager on the QE nodes, unixODBC. Immuta uses a custom fork of unixODBC that enables deep linking of all drivers via RTLD_DEEPBIND.

3. Query Engine checks with the Immuta service to determine the policy decision.

Immuta code in the Query Engine checks with the Immuta service to determine the policy decision (can the user query this table, what rows should be hidden, columns masked, etc).

Immuta manages the subscription; the user does not need to have access to the data outside of Immuta.

User attributes from the identity management system used to make the policy decision are cached for the configured latency tolerance on the data source.

4. The original query is rewritten based on the policy decision.

For queries against the Immuta database by a PostgreSQL account with the bodata_user role Immuta leverages the PostgreSQL post_parse_analyze_hook to determine if the query is listing tables. This is determined by checking if any of the columns referenced by the query belong to system tables and include relkind or relname. If so Immuta alters the parse tree to ensure that foreign tables are returned and that tables are filtered based on the context of the current user account. This is done to hide tables within BI Tools that the user doesn’t have access to. While there is no security risk associated with showing users the names of tables and schema to which they don’t have access to, it does provide a cleaner user experience.

At this point Immuta separates conditions into those that can and can’t be pushed down, retrieve the policies related to the query and return the updated plan with the relevant metadata attached.

5. The query is converted to the native database syntax and executed as a query on the native database.

Generating the remote SQL statement happens as part of generating the query plan. The bulk of driver specific work is accomplished during query generation. In addition to translating the PostgreSQL parse tree into generic SQL for the remote database Immuta also generates SQL based on the policy information retrieved in the previous step. This includes appending WHERE statements, manipulating SELECT statements, and potentially blocking queries if users attempt to bypass our policies or run a query which would leak data.

The query on the native database is executed using the credentials used to create the data source.

The user and query are fully audited in Immuta.

6. Results are streamed back to the user through the Immuta Query Engine.

Query results flow back through the Query Engine to the user. No data is ever copied or stored in the Query Engine.

Architecture

The diagram of the Immuta ecosystem below illustrates how the Query Engine interacts with remote databases and analytic tools in the Immuta Access Control Plane (where policies are enforced).

In contrast, when customers use a native access pattern (Databricks, Snowflake, Presto, etc.), which are explored in the next section, queries do not flow through the Query Engine. Instead, queries are executed and policies are enforced directly (“natively”) in the remote database.