Immuta’s performance considerations vary from customer to customer depending on the environment, access patterns, integrations, users, and use case.
Below are a few sample performance issues and their solutions.
Immuta’s Query Engine
Problems in this area can have a wide variety of causes; after examining this particular customer environment, the following solution was found.
When using Immuta, each distinct database connection maps to a different foreign server in the Query Engine.
In practice, this means that two data sources backed by the same database host can map to different foreign servers in Immuta if any of the connection details differ: they could use different database users, or one might use a fully qualified host name while the other uses a short name.
Immuta depends on query push-down to ensure query performance. When two data sources from the same database use different foreign servers, Immuta cannot push down queries that join between them. This often manifests as poor query performance: queries that should take seconds may take minutes, or may never complete.
The most common cause is the use of different credentials to create data sources from a single database. In this case, recommend that users create all of their data sources with a single account. One way to help ensure this is to use bulk data source creation and schema monitoring (available since 2020.2).
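To make the mapping concrete, here is a minimal sketch (not Immuta code; the field names and grouping key are assumptions for illustration) of how differing connection details split data sources across foreign servers, which in turn blocks push-down for joins between them:

```python
# Hypothetical sketch: group data sources by the connection details that
# determine their foreign server. Any two sources in different groups
# cannot have joins between them pushed down.

def foreign_server_key(conn):
    # Fields assumed to distinguish foreign servers: host and user.
    # Note that "db.example.com" and the short name "db" count as
    # different hosts here, mirroring the behavior described above.
    return (conn["host"], conn["user"])

def group_by_foreign_server(data_sources):
    """Return foreign-server groups; more than one group for the same
    logical database means cross-group joins will not push down."""
    groups = {}
    for name, conn in data_sources.items():
        groups.setdefault(foreign_server_key(conn), []).append(name)
    return groups

sources = {
    "sales":   {"host": "db.example.com", "user": "etl_user"},
    "returns": {"host": "db.example.com", "user": "analyst"},  # different user
    "stores":  {"host": "db", "user": "etl_user"},             # short host name
}
groups = group_by_foreign_server(sources)
# Three distinct keys: joins among these three sources cannot push down.
print(len(groups))
```

Creating all three sources with one account and one host spelling would collapse them into a single group, restoring push-down for joins among them.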
It’s important to consider two different performance testing scenarios:
General overhead: this compares query performance on a cluster without Immuta to performance on a cluster with Immuta but no Immuta policies enforced. Remember, though, that there are never truly “no” policies enforced, because Immuta must check for the existence (or absence) of a policy on every query.
Policy overhead: this compares non-policy-enforced queries with policy-enforced queries.
Use an Immuta-enabled cluster and a non-Immuta cluster that share the same metastore.
Ensure both clusters are sized identically.
It is best if you are the only user on both clusters, to eliminate load from other users.
Avoid configuring auto-scaling on both the Immuta-enabled and non-Immuta clusters. This removes worker node availability as a variable that could impact performance.
The Immuta SecurityManager, which is required for R and Scala clusters, can add overhead. On a SQL or Python cluster, you should turn the SecurityManager off with these steps (note that in Immuta 2021.1 or greater, these steps are not required as long as the cluster is not an R or Scala cluster):
a. Change the IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI environment variable to reference our pythonAndSQLCallingClasses.json configuration file (this will be available on the download portal or internally to provide to customers).
b. Remove the spark.executor.extraJavaOptions configuration from the cluster’s Spark configuration.
c. Validate that spark.databricks.repl.allowedLanguages is set to include only python and/or sql.
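The steps above might look like the following in a Databricks cluster configuration. This is a sketch only: the dbfs path is an assumed example location, and exact values vary by deployment.

```shell
# a. Environment variable (cluster "Environment variables" section):
#    point the allowed-calling-classes URI at the Python/SQL-only file.
#    The dbfs path below is an assumed example location.
IMMUTA_INIT_ALLOWED_CALLING_CLASSES_URI=dbfs:/immuta/pythonAndSQLCallingClasses.json

# b. Spark config: remove any spark.executor.extraJavaOptions entry
#    from the cluster's Spark configuration.

# c. Spark config: restrict the cluster to Python and SQL only.
spark.databricks.repl.allowedLanguages python,sql
```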
Execute the jobs multiple times, remembering that Databricks job runtimes have quite a large standard deviation; sometimes the Immuta overhead will be smaller than that standard deviation, especially if you account for cluster startup on the first job and general cluster load.
The initial job will be slower than subsequent jobs because Immuta caches the metastore results relevant to policies.
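A minimal sketch of how to analyze the collected runtimes, following the guidance above: drop the first (warm-up) run, then compare means against the run-to-run standard deviation. The timing numbers are made up purely for illustration.

```python
import statistics

def summarize(runtimes_s):
    """Drop the first (warm-up/cache-miss) run, then report mean and stdev."""
    steady = runtimes_s[1:]
    return statistics.mean(steady), statistics.stdev(steady)

# Illustrative runtimes in seconds; the first run includes startup/caching.
baseline      = [95.0, 61.0, 63.0, 60.0, 62.0]   # no Immuta
immuta_nopol  = [110.0, 64.0, 66.0, 63.0, 65.0]  # Immuta, no policies
immuta_policy = [118.0, 70.0, 72.0, 69.0, 71.0]  # Immuta, policies enforced

b_mean, b_std = summarize(baseline)
n_mean, _ = summarize(immuta_nopol)
p_mean, _ = summarize(immuta_policy)

general_overhead = (n_mean - b_mean) / b_mean * 100  # general overhead (%)
policy_overhead  = (p_mean - n_mean) / n_mean * 100  # policy overhead (%)

# If an overhead is smaller than the baseline's run-to-run stdev, it is
# within noise and more samples are needed before drawing conclusions.
print(f"general {general_overhead:.1f}%, policy {policy_overhead:.1f}%, "
      f"baseline stdev {b_std:.2f}s")
```

Separating the two numbers keeps the testing scenarios distinct: general overhead measures Immuta’s presence alone, while policy overhead isolates the cost of enforcement.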
Our internal tests are run with TPC-DS.