Azure Databricks Best Practices

Learn Azure Databricks best practices from experts.

Azure Databricks is a data analytics platform created specifically for the Azure Cloud Platform. Azure Databricks provides three environments for creating and developing data-intensive applications: Azure Databricks SQL, Azure Databricks Data Science and Engineering, and Azure Databricks Machine Learning.

Read more below about the best practices for Azure Databricks.

Single Set Of Workspaces

While many users prefer splitting workspaces for efficiency, some Azure Databricks customers require only a single set of workspaces. These customers find that all their needs can be met by a single set of workspaces, especially thanks to newly added features such as Repos, Unity Catalog, and persona-based landing pages.

The best practices for a single set of workspaces are:

  • Because everything lives in the same environment, guard against workspace clutter: keep assets organized (for example, with folders, naming conventions, and tags) so that assets from different teams aren’t mixed and cost and usage across multiple projects and teams aren’t diluted.
  • Administrative overhead is significantly reduced, as there is only one set of workspaces to manage.

Sandbox Workspaces

A sandbox workspace is an environment where users can formulate, develop, and incubate exploratory work that may prove valuable. The sandbox environment lets users explore and work with data while protecting existing workloads from unintentional changes. Cluster policies can be implemented to keep the sandbox environment’s effects on other workloads to a minimum.

The best practices for sandbox workspaces include:

  • Host the sandbox environment in a completely separate cloud account that contains no production or sensitive data.
  • Utilize Cluster Policies to set up guardrails so that users can have some degree of freedom in the environment without requiring administrator management.
  • Communicate clearly that the sandbox environment is self-service.
  • For users bringing existing Hadoop workloads, Azure Databricks is a strong option.
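
As a sketch of the guardrails above, a cluster policy can cap autotermination and restrict node types so sandbox users stay within safe limits without administrator involvement. The definition below uses the Databricks cluster-policy format; the specific limits and node types are hypothetical and should be tuned to your organization:

```json
{
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 60,
    "defaultValue": 30
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
  },
  "custom_tags.environment": {
    "type": "fixed",
    "value": "sandbox"
  }
}
```

The fixed `custom_tags.environment` attribute also tags every sandbox cluster, which keeps sandbox spend visible and separable in cost reports.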

Data Isolation And Sensitivity

Data that comes from a variety of sources is highly valuable. It is used to aggregate information about customers and form actionable insights that drive organizational strategy. This data also carries a high risk of breach; therefore, it is essential to keep it separated, protected, and segregated. Azure Databricks offers ACLs, secure sharing, and many other security options to protect data and reduce risk for the organization.

The best practices for data isolation and sensitivity are:

  • Understand data governance in the context of your organization. Each organization has a different strategy and different needs, and therefore needs to develop its data governance strategy accordingly.
  • Implement policies and controls at both the metastore and storage levels. Following the principle of least privilege, use S3 bucket policies and ADLS ACLs. Leverage Unity Catalog as an additional layer of security and control.
  • It is a best practice to physically and logically separate sensitive and non-sensitive data. Most Azure Databricks customers already segregate their sensitive and non-sensitive data in this way.
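
To illustrate the least-privilege layer above, Unity Catalog lets you grant access narrowly with SQL `GRANT` statements. This is a minimal sketch; the catalog, schema, table, and group names are hypothetical:

```sql
-- Sensitive data lives in its own catalog; grant access narrowly.
GRANT USE CATALOG ON CATALOG sensitive_finance TO `finance-analysts`;
GRANT USE SCHEMA ON SCHEMA sensitive_finance.payroll TO `finance-analysts`;
GRANT SELECT ON TABLE sensitive_finance.payroll.salaries TO `finance-analysts`;

-- Non-sensitive data can be granted more broadly.
GRANT SELECT ON SCHEMA shared_analytics.web_events TO `all-users`;
```

Keeping sensitive data in its own catalog mirrors the physical/logical separation recommended above: the catalog boundary makes it hard to grant broad access by accident.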

Disaster Recovery And Regional Backup

Disaster recovery is essential to ensure that sensitive information and production workloads are not lost in any situation. The best practice is to create and maintain a separate workspace in a different region from the standard production workspace. The regional backup strategy can vary between organizations.

Some customers prefer real-time access and backups between two regions and therefore adopt an active-active configuration; this is one of the most costly backup and disaster recovery configurations. Other customers prefer only the minimum backups required to ensure business continuity; data is backed up occasionally, and cost is minimized.

The best practices for disaster recovery and regional backup are as follows:

  • A Git repository, either on-premises or in the cloud, can be used to store code. Repos can synchronize it to Azure Databricks whenever required.
  • Delta Lake should be used together with Deep Clone to make a copy and backup of the data.
  • For items not stored in Delta Lake, cloud providers’ native tools should be used to keep backups.
  • Terraform should be used to back up jobs, clusters, secrets, notebooks, and other workspace objects.
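
The Deep Clone step above can be sketched in Delta Lake SQL. The table and catalog names below are hypothetical; `backup_region` stands in for storage in your secondary region:

```sql
-- Create a full, independent copy of a production Delta table
-- in the backup location.
CREATE TABLE IF NOT EXISTS backup_region.sales.orders
DEEP CLONE prod.sales.orders;

-- Re-running the clone refreshes the backup incrementally:
-- only data changed since the last clone is copied.
CREATE OR REPLACE TABLE backup_region.sales.orders
DEEP CLONE prod.sales.orders;
```

Because deep clones are incremental on re-run, scheduling this statement as a periodic job keeps backup costs closer to the "minimum required" end of the spectrum described above.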

Henson Group is one of the best managed service providers (MSPs) for Microsoft Azure and has a strong global network. If you are considering using Azure Databricks, get in touch with us. We can help you get started with Azure Databricks in no time.