What is Microsoft Azure Data Lake?


Azure Data Lake offers the capabilities and services that developers, Data Scientists, and Analysts need to store data of any size, shape, and speed. It supports all kinds of processing and analytics across platforms and languages, and it simplifies and speeds up storing and ingesting data for batch, streaming, and interactive analytics.

 

What is Azure Data Lake Storage?

Azure Data Lake Storage (ADLS) is a secure and scalable Data Lake for high-performance analytics workloads. It is also known as Azure Data Lake Store. It offers a single storage platform for integrating large volumes of organizational data. It is cost-effective and provides tiered storage and policy management. ADLS also offers single sign-on and access controls through Azure Active Directory, and it is compatible with the Hadoop Distributed File System (HDFS), so tools that support HDFS can work with it.

Benefits of Azure Data Lake

The Data Lake in Azure solution is designed for organizations that want to take advantage of Big Data. It provides a data platform that can help Developers, Data Scientists, and Analysts store data of any size and format and perform all types of processing and analytics across multiple platforms using various programming languages. It can also work with existing solutions, such as identity management and security solutions. Moreover, it integrates with other data warehouses and cloud environments. It can be useful for organizations that need the following:

  • Azure Active Directory:

Azure Active Directory (AAD) lets you apply identity and Role-Based Access Control (RBAC) within the solution. Applications and services can authenticate either through a service principal or through a managed identity. With a service principal, the principal’s credentials must be stored and managed by whatever connects as that principal, whereas a managed identity is tied directly to the service, so there is no credential storage to manage.

  • Multi-protocol SDK:

It’s a newer version of the Blob Storage SDK that can be used with Azure Data Lake. It handles reading and writing data in ADLS and retries when a transient failure occurs. It has some limitations, however: it cannot perform atomic manipulation or control access.

  • Low-cost Storage:

Azure Storage is a cost-effective option for data storage, with features such as migration of data from hot storage to colder storage, life-cycle management, high performance, archive storage, and more.

  • Reliability:

Azure Storage allows users to make copies of their data to prepare for data center failure or a natural disaster. Also, the advanced threat detection system integrates with the data storage and detects malicious programs or software that might damage the data or compromise your privacy.

  • Scalability:

Azure Storage is massively scalable. Microsoft has documented default storage account limits of 2 petabytes in the USA and Europe and 500 terabytes in other regions; check the current Azure documentation, because these limits change over time. It supports both horizontal and vertical scaling.
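
To make the "retry if a transient failure occurs" behavior mentioned under the Multi-protocol SDK item more concrete, here is a minimal sketch of a retry loop with exponential backoff. It is not the SDK's actual implementation: `TransientError`, `with_retries`, and the default attempt count and delay are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The delay doubles after each failed attempt and gets a little random
    jitter so that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to exercise without real waiting; a real client would also need to decide which errors count as transient.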

How Azure Data Lake Works

Azure Data Lake is built on Azure Blob storage, the Microsoft object storage solution for the cloud. The solution features low-cost, tiered storage and high-availability/disaster recovery capabilities. It integrates with other Azure services, including Azure Data Factory, a tool used for creating and running extract, transform, and load (ETL) and extract, load, and transform (ELT) processes.

The solution is based on the Apache Hadoop YARN (Yet Another Resource Negotiator) cluster management platform. It can scale dynamically across SQL servers within the data lake, as well as servers in the Azure SQL Database and the Azure SQL Data Warehouse.

To start using Azure Data Lake, you need to create a free account on the Microsoft Azure portal. From the portal, you will be able to access all of the Azure services.

 

ADLS and Big Data Processing

With ADLS, we can store data from anywhere without transforming it first. There is no need to define a schema before loading data. ADLS can store files of different sizes and formats from sources such as on-premises legacy systems and existing cloud stores, which lets it handle structured, semi-structured, and unstructured data.

Azure Data Lake Storage – Gen2

Recently, Microsoft announced ADLS Gen2, a superset of ADLS Gen1 that offers new capabilities dedicated to analytics built on top of Azure Blob storage.

Described by Microsoft as a “no-compromise data lake,” ADLS Gen2 extends Azure Blob storage capabilities and is best optimized for analytics workloads. Users can store data once and access it through existing blob storage and HDFS-compliant file system interfaces without any changes in programming or data copying while performing database operations.

ADLS Gen2 includes most of the features from both ADLS Gen1 and Azure Blob storage, including:

  • Limitless storage capacity
  • Azure Active Directory (AAD) integration
  • Hierarchical namespace (a file-system-style directory structure)
  • Read-access geo-redundant storage
  • 5 TB file size limit
  • Blob tiers (Hot, Cool, Archive)

Azure Data Lake Storage Gen2 is Microsoft’s latest version of cloud-based Big Data storage. The prior version, Gen1, did not offer hot/cool storage tiers or redundant storage. Azure Blob storage had hot and cool tiers, but it lacked features such as directories and file-level security, which are available in Gen1. To close this gap in storage and features, Microsoft released Gen2.

Gen2 is built on Azure Blob storage. It contains several features from Gen1, such as file system semantics, directory, file-level security, and scalability, along with features like low-cost, tiered storage, and high availability/disaster recovery capabilities from Azure Blob storage.
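
As a rough illustration of how tiered storage can lower cost, the sketch below picks a blob tier from how long ago the data was last accessed. The tier names Hot, Cool, and Archive come from the list above; the function name `choose_tier` and the 30-day and 180-day thresholds are assumptions made for this example, not values defined by Azure.

```python
def choose_tier(days_since_last_access, cool_after=30, archive_after=180):
    """Suggest a storage tier for data based on how recently it was used.

    The thresholds are illustrative defaults, not Azure-defined values.
    """
    if days_since_last_access >= archive_after:
        return "Archive"  # rarely read data: cheapest storage, slowest access
    if days_since_last_access >= cool_after:
        return "Cool"     # infrequently read data
    return "Hot"          # actively used data


for age in (3, 45, 400):
    print(age, "days ->", choose_tier(age))
```

In practice a rule like this would be expressed as a life-cycle management policy on the storage account rather than in application code.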

Azure Data Lake Store Security

When implementing a Big Data solution, security shouldn’t be optional. To conform to security standards and limit sensitive information visibility, data must be secured in transit and at rest. ADLS provides rich security capabilities so users can have peace of mind when storing their assets in the ADLS infrastructure. Users can monitor performance, audit usage, and access control through the integrated Azure Active Directory.

Auditing

ADLS creates audit logs for all operations performed in it. These logs can be analyzed with U-SQL scripts.

Access Control

ADLS provides access control through the support of POSIX-compliant access control lists (ACL) on files and folders stored in its infrastructure. It also manages authentication through the integration of AAD based on OAuth tokens from supported identity providers. Tokens will carry the user’s security group data, and this information will be passed through all the ADLS microservices.
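
The following is a simplified sketch of how a POSIX-style ACL check can be evaluated for a file or folder. It follows the general POSIX ACL order (owner, named users, groups, then other) and is not ADLS's own implementation; the `Acl` class, the `is_allowed` function, and the sample users and groups are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    owner: str
    owner_perms: str                                   # e.g. "rwx"
    group: str
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)    # user -> perms
    named_groups: dict = field(default_factory=dict)   # group -> perms
    mask: str = "rwx"                                  # caps named and group entries


def is_allowed(acl, user, groups, wanted):
    """Return True if `user` (a member of `groups`) has permission `wanted`."""
    def masked(perms):
        # Named-user and group entries are limited by the mask.
        return set(perms) & set(acl.mask)

    if user == acl.owner:
        return wanted in acl.owner_perms
    if user in acl.named_users:
        return wanted in masked(acl.named_users[user])
    group_entries = []
    if acl.group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        # A matching group entry decides the outcome; "other" is not consulted.
        return any(wanted in masked(p) for p in group_entries)
    return wanted in acl.other_perms


acl = Acl(owner="alice", owner_perms="rwx", group="analysts",
          group_perms="r-x", other_perms="---",
          named_users={"bob": "rw-"}, mask="r-x")
print(is_allowed(acl, "bob", [], "w"))  # bob's "w" is removed by the mask
```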

Data Encryption

ADLS encrypts data in transit and at rest, providing server-side encryption of data with the help of keys, including customer-managed keys in the Azure Key Vault.

Data Encryption Key Types

ADLS uses a Master Encryption Key (MEK) stored in Azure Key Vault to encrypt and decrypt data. Users have the option to manage this key themselves, but if the key is lost, the data can no longer be decrypted. ADLS also includes the following keys:

  • Block Encryption Key (BEK): These are keys generated for each block of data
  • Data Encryption Key (DEK): These keys are encrypted by the MEK and are responsible for generating BEKs to encrypt data blocks
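
The relationship between the three keys can be sketched as follows. This is a toy illustration of the hierarchy only and must not be used as real cryptography: the XOR-based wrap, the HMAC-based derivation, and the function names `wrap_dek`, `unwrap_dek`, and `derive_bek` are all assumptions made for this example, not the algorithms ADLS uses.

```python
import hashlib
import hmac
import secrets


def _xor(data, pad):
    return bytes(a ^ b for a, b in zip(data, pad))


def wrap_dek(mek, dek, nonce):
    """Toy wrap: XOR the 32-byte DEK with a pad derived from the MEK."""
    pad = hmac.new(mek, b"wrap" + nonce, hashlib.sha256).digest()
    return _xor(dek, pad)


# XOR with the same pad undoes the wrap, so unwrapping is the same operation.
unwrap_dek = wrap_dek


def derive_bek(dek, block_index):
    """Derive a per-block key from the DEK and the block number (assumed scheme)."""
    message = b"block" + block_index.to_bytes(8, "big")
    return hmac.new(dek, message, hashlib.sha256).digest()


mek = secrets.token_bytes(32)    # would live in Azure Key Vault
dek = secrets.token_bytes(32)    # stored only in wrapped form
nonce = secrets.token_bytes(16)

wrapped = wrap_dek(mek, dek, nonce)
recovered = unwrap_dek(mek, wrapped, nonce)
block_keys = [derive_bek(recovered, i) for i in range(3)]  # one BEK per block
```

The point of the hierarchy is that only the MEK needs to be protected in the vault; losing it makes the wrapped DEK, and therefore every BEK, unrecoverable.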

Azure Data Lake Store Pricing

Data Lake Store is currently available in the US-2 region and offers preview pricing rates (excluding Outbound Data transfer):

Usage                     Cost
Data Stored               US$0.04 per GB per month
Data Lake Transactions    US$0.07 per million transactions
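
As a worked example of these preview rates, the sketch below estimates a monthly bill. It assumes transactions are billed pro rata rather than in whole blocks of a million, and it ignores outbound data transfer; the function name `monthly_cost` and the sample volumes are hypothetical.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.04")      # data stored, from the rates above
PRICE_PER_MILLION_TXN = Decimal("0.07")   # transactions, from the rates above


def monthly_cost(stored_gb, transactions):
    """Estimate the monthly cost in US$ for whole-number GB and transaction counts."""
    storage = Decimal(stored_gb) * PRICE_PER_GB_MONTH
    txn = Decimal(transactions) / Decimal(1_000_000) * PRICE_PER_MILLION_TXN
    return (storage + txn).quantize(Decimal("0.01"))


# 500 GB stored: 500 x 0.04 = 20.00
# 20 million transactions: 20 x 0.07 = 1.40
print(monthly_cost(500, 20_000_000))  # 21.40
```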

In the next section of this Azure Data Lake Tutorial, you will learn to get started with Analytics.

How do I get started?

Getting started with Azure Data Lake Analytics is extremely easy. Here’s what you’ll need:

  • An Azure subscription — grab a free trial if you don’t have one.
  • An Azure Data Lake Analytics account — create one in your Azure subscription
    • You’ll also have to create a Store account during this step.
  • Some data to play with — start with text or images.

You don’t need to install anything on your personal computer to use it. You can write and submit the necessary jobs in your browser.

Components of Azure Data Lake

The full solution consists of three components that provide storage, analytics service, and cluster capabilities.

Azure Data Lake Storage is a massively scalable and secure Data Lake for high-performance analytics workloads. Azure Data Lake Storage was formerly known as, and is sometimes still referred to as, the Azure Data Lake Store. Designed to eliminate data silos, it provides a single storage platform that organizations can use to integrate their data.

The storage can help optimize costs with tiered storage and policy management. It also provides role-based access controls and single sign-on capabilities through Azure Active Directory. Users can manage and access data within the storage using the Hadoop Distributed File System (HDFS). Therefore, any HDFS-based tool that you are using will work with ADLS.

Azure Data Lake Analytics is an on-demand analytics platform for Big Data. Users can develop and run parallel data transformation and processing programs in U-SQL, R, Python, and .NET over petabytes of data. U-SQL is a Big Data query language created by Microsoft for the Azure Data Lake Analytics service. With Azure Data Lake Analytics, users pay per job to process data on demand in an analytics-as-a-service environment. It is a cost-effective analytics solution because you pay only for the processing power that you use.

Azure HDInsight is a cluster management solution that offers easy, fast, and cost-effective ways to process massive amounts of data. It’s a cloud deployment of Apache Hadoop that lets users take advantage of optimized open-source analytic clusters for Apache Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server. With these frameworks, you can support a broad range of functions, such as ETL, data warehousing, machine learning, and IoT. Azure HDInsight also integrates with Azure Active Directory for role-based access controls and single sign-on capabilities.

Why Use Azure Data Lake?

The Azure Data Lake offers the following benefits and facilities:

  • Data warehousing: Since the solution supports any type of data, you can use it to integrate all of your enterprise data into a single data warehouse.
  • Internet of Things (IoT) capabilities: The Azure platform provides tools for processing streaming data in real-time from multiple types of devices.
  • Support for hybrid cloud environments: You can use the Azure HDInsight component to extend an existing on-premises Big Data infrastructure to the Azure cloud.
  • Enterprise features: The environment is managed and supported by Microsoft and includes enterprise features for security, encryption, and governance. You can also extend your on-premises security solutions and controls to the Azure cloud environment.
  • Speed to deployment: It’s easy to get up and running with the Azure Data Lake solution. All of the components are available through the portal and there are no servers to install or infrastructure to manage.

About Azure Data Lake Store

According to Microsoft, the Azure Data Lake store is a hyper-scale repository for Big Data Analytics workloads and a Hadoop Distributed File System for the cloud. Some of its features include:

  • Imposes no fixed limits on file size
  • Imposes no fixed limits on account size
  • Allows unstructured and structured data in their native formats
  • Allows massive throughput to increase analytic performance
  • Offers high durability, availability, and reliability
  • Is integrated with Azure Active Directory access control

Other than the fact that both Azure Data Lake Store and Amazon S3 provide unlimited storage space, the two don’t have much in common. When comparing S3 to an Azure service, you’ll get better mileage with the Azure Storage Service. The store, on the other hand, provides an integrated analytics service and places no limits on file size.

[Figure: Azure Data Lake store diagram. Source: Microsoft]

The Data Lake store can handle any data in its native format, as is, without requiring prior transformations. It does not require a schema to be defined before the data is uploaded, leaving it up to the individual analytic framework to interpret the data and define a schema at the time of the analysis. Being able to store files of arbitrary size and format makes it possible for a Data Lake store to handle structured, semi-structured, and even unstructured data.
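
The idea of defining the schema at analysis time (often called schema-on-read) can be sketched in a few lines. The raw text below stands in for a file stored as is in the lake; the sample data, the column names, and the helper `read_with_schema` are hypothetical and only illustrate that the same stored bytes can be interpreted in different ways by different readers.

```python
import csv
import io

# Raw data as it might sit in the lake: no schema was declared on upload.
RAW = "2024-01-05,sensor-a,21.5\n2024-01-05,sensor-b,19.0\n"


def read_with_schema(raw_text, schema):
    """Apply a schema, given as (column_name, converter) pairs, while reading."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows


# One analysis treats the third column as a number...
readings = read_with_schema(
    RAW, [("day", str), ("device", str), ("celsius", float)])

# ...while another keeps it as text, without touching the stored data.
as_text = read_with_schema(
    RAW, [("day", str), ("device", str), ("reading", str)])
```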

Azure Data Lake store file system (adl://)

The Data Lake Store can be accessed from Hadoop (available with an HDInsight cluster) using the WebHDFS-compatible REST APIs. In addition, the Azure Data Lake store introduced a new file system called AzureDataLakeFilesystem (adl://), which is optimized for performance and available in HDInsight. Data in the Data Lake Store can be accessed using:

adl://<data_lake_store_name>.azuredatalakestore.net
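
A small sketch of building and taking apart such a URI with Python's standard library is shown below. The helper names `build_adl_uri` and `parse_adl_uri`, the store name `mystore`, and the sample path are assumptions for illustration; only the `adl://` scheme and the `.azuredatalakestore.net` host suffix come from the pattern above.

```python
from urllib.parse import urlparse

ADLS_HOST_SUFFIX = ".azuredatalakestore.net"


def build_adl_uri(store_name, path=""):
    """Build an adl:// URI for a path inside the named store."""
    return f"adl://{store_name}{ADLS_HOST_SUFFIX}/{path.lstrip('/')}"


def parse_adl_uri(uri):
    """Split an adl:// URI into (store_name, path)."""
    parts = urlparse(uri)
    if parts.scheme != "adl" or not parts.netloc.endswith(ADLS_HOST_SUFFIX):
        raise ValueError(f"not an adl:// URI: {uri}")
    store = parts.netloc[: -len(ADLS_HOST_SUFFIX)]
    return store, parts.path


uri = build_adl_uri("mystore", "/raw/events.csv")
print(uri)                 # adl://mystore.azuredatalakestore.net/raw/events.csv
print(parse_adl_uri(uri))  # ('mystore', '/raw/events.csv')
```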

Azure Data Lake Store Security

Data Lake Store uses Azure Active Directory for authentication and Access Control Lists (ACLs) to manage access to your data. Azure Data Lake uses all AAD features, including multi-factor authentication, conditional access, role-based access control, application usage monitoring, security monitoring, and alerting. Azure Data Lake Store supports the OAuth 2.0 protocol for authentication within the REST interface. Similarly, the Data Lake Store provides access control by supporting POSIX-style permissions exposed by the WebHDFS protocol.
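
To show how an OAuth 2.0 token fits into a WebHDFS-style REST call, the sketch below builds (but does not send) a directory listing request. The store name, the folder, and the token value are placeholders, and `build_list_request` is a hypothetical helper; obtaining a real token from Azure Active Directory is outside the scope of this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_list_request(store_name, folder, access_token):
    """Build a WebHDFS LISTSTATUS request carrying an OAuth 2.0 bearer token."""
    url = (
        f"https://{store_name}.azuredatalakestore.net/webhdfs/v1/"
        f"{quote(folder.strip('/'))}?{urlencode({'op': 'LISTSTATUS'})}"
    )
    headers = {"Authorization": f"Bearer {access_token}"}
    return Request(url, headers=headers, method="GET")


# Placeholder values; a real token would come from Azure Active Directory.
req = build_list_request("mystore", "/raw/2024", "PLACEHOLDER_TOKEN")
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would return the folder listing only if the token's identity passes the ACL checks described above.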

Conclusion

Azure Data Lake is an important part of Microsoft’s ambitious cloud offering. With Data Lake, Microsoft provides a service to store and analyze data of any size at an affordable cost. In this blog, we have looked in detail at Azure Data Lake, its components and features, how it works, and more.