What is Databricks Auto Loader?
Updated: Sep 29, 2022
Databricks is a scalable big data analytics platform designed for data science and data engineering. It is built on top of Apache Spark, a fast, general-purpose engine for large-scale data processing, and it offers industry-leading performance and integration with the major cloud platforms: Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Auto Loader is a Databricks feature that streams or incrementally ingests millions of files per hour from data lake storage. It achieves this by reading only newly arriving files, and it accepts many file formats, including JSON, CSV, Parquet, Avro, ORC, text, and binary files.
With more businesses using a variety of IoT devices and applications, we need a method to ingest the varied data into a usable format. Auto Loader simplifies this process to enable quick and easy access to the data.
Data Mastery recently deployed an Auto Loader solution with an OEM to ingest and transform telemetry data from WAGO PLC devices to IoT Hub. Previously, the OEM was manually downloading the data files on site, then manually transforming each data file in Excel to produce a readable report. Auto Loader completely automates this process and delivers a readable report automatically, eliminating one man-hour of manual data processing per shift and saving more than 21 man-hours per week per site (24-hour operation). Needless to say, our client was delighted with the result!
How does Auto Loader process cloud files?
On Azure storage accounts, Auto Loader uses the native components Event Grid and Storage Queues to identify new files efficiently as they arrive, and Databricks Structured Streaming processes the incoming data immediately.
Auto Loader can run as a continuous stream or in batch mode. In both cases it keeps a checkpoint under the covers, so each run picks up where the previous one left off. This is something Data Mastery ❤️ as it's a huge cost saving for our clients when combined with auto-termination on a Databricks cluster.
Auto Loader can also merge the incoming stream into the target table instead of performing a standard append. This is very useful when used in conjunction with Delta Live Tables.
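The choice between a continuous stream and a batch-style run can be sketched as follows. This is a minimal illustration, not the configuration used in the project above: the helper names, the 10-second interval and the Delta target are assumptions, and it presumes a runtime whose PySpark supports the `availableNow` trigger.

```python
def trigger_kwargs(continuous: bool) -> dict:
    """Pick the keyword argument passed to DataStreamWriter.trigger()."""
    if continuous:
        # Keep the stream running and poll for new files at a fixed interval
        # (the interval here is an arbitrary illustrative value).
        return {"processingTime": "10 seconds"}
    # Process everything that is currently pending, then stop (batch-style run).
    return {"availableNow": True}


def start_ingest(df, target_path: str, checkpoint_path: str, continuous: bool = False):
    """Start writing a streaming DataFrame `df` to a Delta location.

    The checkpoint location is where progress is recorded, so a later run
    resumes after the last file that was already processed.
    """
    return (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(continuous))
        .start(target_path)
    )
```

Run with `continuous=False` on a scheduled job, the query stops once the pending files are processed, which is what lets the cluster auto-terminate between runs.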
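One common way to merge rather than append, outside of Delta Live Tables, is to hand each micro-batch to a Delta Lake MERGE through `foreachBatch`. The sketch below takes that approach as an assumption; it is not necessarily how the merge was implemented in the solution described here, and the table name and key columns are hypothetical.

```python
def merge_condition(key_columns):
    """Build the join condition that matches source rows to target rows."""
    return " AND ".join(f"target.{c} = source.{c}" for c in key_columns)


def make_upsert(spark, target_table: str, key_columns):
    """Return a function suitable for DataStreamWriter.foreachBatch()."""

    def upsert_batch(batch_df, batch_id):
        # Imported lazily so this module can be loaded without Delta Lake installed.
        from delta.tables import DeltaTable

        target = DeltaTable.forName(spark, target_table)
        (
            target.alias("target")
            .merge(batch_df.alias("source"), merge_condition(key_columns))
            .whenMatchedUpdateAll()      # update rows that already exist
            .whenNotMatchedInsertAll()   # insert rows that are new
            .execute()
        )

    return upsert_batch
```

A streaming DataFrame could then be written with something like `df.writeStream.foreachBatch(make_upsert(spark, "telemetry", ["device_id", "event_time"]))`, where `telemetry`, `device_id` and `event_time` are placeholder names.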
How are new files detected?
Databricks Auto Loader supports two methods to detect new files in your cloud storage:
Directory Listing: By default, Auto Loader automatically detects whether a given directory is suitable for incremental listing by checking and comparing the file paths of previously completed directory listings. To ensure eventual completeness of data in auto mode, Auto Loader automatically triggers a full directory listing after completing 7 consecutive incremental listings. You can explicitly choose between incremental listing and full directory listing by setting cloudFiles.useIncrementalListing to true or false.
File Notification: Auto Loader can automatically set up a notification service and a queue service that subscribe to file events from the input directory. It uses cloud services such as Azure Event Grid and Queue Storage, AWS SNS and SQS, or GCS Notifications and Google Cloud Pub/Sub. File notification mode is more performant and scalable for large input directories or a high volume of files, but it requires additional cloud permissions to set up.
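The two detection modes map onto a small set of `cloudFiles` options. The helper below is a hypothetical convenience function (its name and the mode labels are assumptions made for illustration); only the option keys come from Auto Loader itself.

```python
def detection_options(mode: str, incremental: str = "auto") -> dict:
    """Return the cloudFiles options for the chosen file detection mode."""
    if mode == "directory_listing":
        if incremental not in ("auto", "true", "false"):
            raise ValueError(f"unsupported incremental setting: {incremental!r}")
        return {
            "cloudFiles.useNotifications": "false",
            # "auto" lets Auto Loader decide; "true"/"false" force the choice.
            "cloudFiles.useIncrementalListing": incremental,
        }
    if mode == "file_notification":
        # Auto Loader sets up the notification and queue services itself,
        # provided the required cloud permissions are in place.
        return {"cloudFiles.useNotifications": "true"}
    raise ValueError(f"unknown detection mode: {mode!r}")
```

The resulting dictionary can be passed to the stream reader with `.options(**detection_options("file_notification"))`.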
Supported Cloud Storage
AWS S3, Azure Data Lake Storage Gen2, Google Cloud Storage, Azure Blob Storage, and Databricks File System.
Supported file formats
JSON, CSV, text, Parquet, Avro, binary file (binaryFile), and ORC.
Auto Loader components
There are two main components of Auto Loader:
Cloud Files: A Structured Streaming source provided by Auto Loader for reading files on cloud storage. It is an Apache Spark data reader specifically for Auto Loader, similar to the Spark data readers for Parquet, JSON, CSV, etc.
For further details on the cloudFiles configuration options, see the Databricks Auto Loader documentation.
Cloud Notifications: An optional service of cloudFiles. When 'useNotifications' is enabled, this component creates the Event Grid topic, event subscriptions and storage queue for Auto Loader. This is the service that lets Auto Loader process only newly arriving files instead of listing the full directory.
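A cloudFiles reader configuration might look like the sketch below. The helper names and any paths passed in are hypothetical; `cloudFiles.format` and `cloudFiles.schemaLocation` are Auto Loader options, and the `extra` argument is simply a place to add others such as `cloudFiles.useNotifications`.

```python
# Values accepted here mirror the supported file formats listed earlier.
SUPPORTED_FORMATS = {"json", "csv", "text", "parquet", "avro", "binaryFile", "orc"}


def cloudfiles_options(file_format: str, schema_location: str, extra: dict = None) -> dict:
    """Assemble the options for a cloudFiles stream reader."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format!r}")
    options = {
        "cloudFiles.format": file_format,
        # Location where Auto Loader stores the inferred schema.
        "cloudFiles.schemaLocation": schema_location,
    }
    options.update(extra or {})
    return options


def read_with_auto_loader(spark, source_path: str, options: dict):
    """Create a streaming DataFrame over `source_path` using the cloudFiles source."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

On a Databricks cluster, `read_with_auto_loader(spark, source_path, cloudfiles_options("json", schema_path))` would return a streaming DataFrame, with `source_path` and `schema_path` standing in for your own storage locations.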
Auto Loader is an excellent tool that can be applied across any business in any sector to get the benefits of a simple, time-efficient solution and real-time data.
I hope you have found this helpful and that it saves you time in understanding the basics of Databricks Auto Loader.
Please connect on LinkedIn, and share your thoughts, questions and suggestions.