Deep Lake, a Lakehouse for Deep Learning: Machine Learning Use Cases

5 Jun 2024

Authors:

(1) Sasun Hambardzumyan, Activeloop, Mountain View, CA, USA;

(2) Abhinav Tuli, Activeloop, Mountain View, CA, USA;

(3) Levon Ghukasyan, Activeloop, Mountain View, CA, USA;

(4) Fariz Rahman, Activeloop, Mountain View, CA, USA;

(5) Hrant Topchyan, Activeloop, Mountain View, CA, USA;

(6) David Isayan, Activeloop, Mountain View, CA, USA;

(7) Mark McQuade, Activeloop, Mountain View, CA, USA;

(8) Mikayel Harutyunyan, Activeloop, Mountain View, CA, USA;

(9) Tatevik Hakobyan, Activeloop, Mountain View, CA, USA;

(10) Ivo Stranic, Activeloop, Mountain View, CA, USA;

(11) Davit Buniatyan, Activeloop, Mountain View, CA, USA.

5. MACHINE LEARNING USE CASES

In this section, we review the applications of Deep Lake.

A typical scenario in a deep learning application starts with:

(1) A raw set of files that is collected on an object storage bucket. It might include images, videos, and other types of multimedia data in their native formats such as JPEG, PNG or MP4.

(2) Any associated metadata and labels stored on a relational database. Optionally, they could be stored on the same bucket along with the raw data in a normalized tabular form such as CSV, JSON, or Parquet format.

As shown in Fig. 4, an empty Deep Lake dataset is created first. Then, empty tensors are defined for storing both the raw data and the metadata. The number of tensors is arbitrary. A basic image classification task would have two tensors:

• images tensor with htype of image and sample compression of JPEG

• labels tensor with htype of class_label and chunk compression of LZ4.

After the tensors are declared, data can be appended to the dataset. If a raw image's compression matches the tensor's sample compression, the binary is copied directly into a chunk without additional decoding. Label data is extracted from a SQL query or CSV table, converted into categorical integers, and appended to the labels tensor, whose chunks are stored using LZ4 compression. All Deep Lake data is stored in the bucket and is self-contained. Once stored, the data can be accessed through a NumPy interface or as a streamable deep learning dataloader. The model running on a compute machine then iterates over the stream of image tensors and stores its output in a new tensor called predictions. Below, we discuss how one can train on, version control, query, and inspect the quality of a Deep Lake dataset.
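The two ingestion rules described above (direct copy when the compression already matches, and categorical encoding of labels) can be sketched in plain Python. This is an illustration only and does not use the Deep Lake API: `is_jpeg`, `ingest_image`, and `encode_labels` are hypothetical helpers, a chunk is modeled as a simple list of byte strings, and LZ4 compression of the label chunks is left out.

```python
import csv
import io

JPEG_MAGIC = b"\xff\xd8\xff"  # leading bytes of a JPEG file


def is_jpeg(blob: bytes) -> bool:
    """Return True if the bytes start with the JPEG signature."""
    return blob.startswith(JPEG_MAGIC)


def ingest_image(chunk: list, blob: bytes) -> bool:
    """Append an image to a chunk (modeled here as a list of byte strings).

    When the input is already JPEG, matching the sample compression
    assumed in this sketch, the bytes are copied without decoding.
    Returns True when the direct-copy path was taken.
    """
    if is_jpeg(blob):
        chunk.append(blob)  # direct copy, no decode or re-encode
        return True
    # A real pipeline would decode and re-encode here; not sketched.
    raise ValueError("re-encoding of non-JPEG input is not sketched here")


def encode_labels(csv_text: str, column: str = "label"):
    """Map label strings from a CSV column to categorical integers."""
    class_names = []   # position in this list is the integer class id
    index_of = {}
    encoded = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[column]
        if name not in index_of:
            index_of[name] = len(class_names)
            class_names.append(name)
        encoded.append(index_of[name])
    return encoded, class_names


# Hypothetical metadata table and a fake JPEG payload for illustration.
csv_text = "file,label\na.jpg,cat\nb.jpg,dog\nc.jpg,cat\n"
labels, class_names = encode_labels(csv_text)

chunk = []
copied_directly = ingest_image(chunk, JPEG_MAGIC + b"\xe0fake-jpeg-payload")
```

Class ids are assigned in order of first appearance, so the mapping depends on row order; a production pipeline would more likely fix the class list up front.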

5.1 Deep Learning Model Training

Deep learning models are trained at multiple levels in an organization, ranging from exploratory training occurring on personal computers to large-scale training that occurs on distributed machines involving many GPUs. The time and effort required to bring the data from long-term storage to the training client are often comparable to the training itself. Deep Lake solves this problem by enabling rapid streaming of data without bottlenecking the downstream training process, thus avoiding the cost and time required to duplicate data on local storage.
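One general way to keep remote reads from stalling a training loop is to fetch ahead in a background thread. The sketch below shows that generic producer-consumer pattern using only the Python standard library; it is not Deep Lake's streaming implementation, and `prefetching_stream` and `fake_fetch` are hypothetical names.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def prefetching_stream(fetch, keys, buffer_size=4):
    """Yield fetch(key) for each key, in order, while a background
    thread keeps up to buffer_size results ready ahead of the consumer."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for key in keys:
                buffer.put(fetch(key))  # blocks while the buffer is full
        finally:
            # Always signal the end so the consumer never waits forever,
            # even if fetch raised an exception.
            buffer.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


def fake_fetch(key):
    """Stand-in for reading one sample from object storage."""
    return f"sample-{key}".encode()


streamed = list(prefetching_stream(fake_fetch, range(5)))
```

The bounded queue caps memory use: the producer pauses once `buffer_size` samples are waiting, so fetching overlaps with consumption without reading the whole dataset ahead of time.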

5.2 Data Lineage and Version Control

Deep learning data constantly evolves as new data is added and existing data is quality-controlled. Analytical and training workloads run in parallel while the data is changing. Hence, knowing which data version was used by a given workload is critical to understanding the relationship between the data and model performance. Deep Lake enables deep learning practitioners to determine which version of their data was used in any analytical workload and to time travel across these versions if an audit is required. Since all data is mutable, it can be edited to meet compliance-related privacy requirements. Like Git for code, Deep Lake also introduces the concept of data branches, allowing experimentation and editing of data without affecting colleagues' work.
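The ideas of commits, branches, and time travel can be illustrated with a toy in-memory store. This is a conceptual sketch and not how Deep Lake stores versions: `VersionedLabels` is a hypothetical class that copies a full snapshot on every commit, which a real system would avoid.

```python
class VersionedLabels:
    """Toy versioned store: each commit is a snapshot of a label dict."""

    def __init__(self):
        self.commits = {}                # commit id -> (parent id, snapshot, message)
        self.branches = {"main": None}   # branch name -> head commit id
        self.current = "main"
        self.working = {}                # editable state of the checked-out branch
        self._next_id = 0

    def commit(self, message):
        """Record the working state as a new commit on the current branch."""
        commit_id = f"c{self._next_id}"
        self._next_id += 1
        parent = self.branches[self.current]
        self.commits[commit_id] = (parent, dict(self.working), message)
        self.branches[self.current] = commit_id
        return commit_id

    def checkout(self, branch, create=False):
        """Switch branches; uncommitted edits are discarded in this sketch."""
        if create:
            self.branches[branch] = self.branches[self.current]
        head = self.branches[branch]
        self.working = dict(self.commits[head][1]) if head is not None else {}
        self.current = branch

    def at(self, commit_id):
        """Time travel: a copy of the data as of the given commit."""
        return dict(self.commits[commit_id][1])


store = VersionedLabels()
store.working["img_0"] = "cat"
store.working["img_1"] = "dog"
v1 = store.commit("initial labels")

# Edit on a separate branch without touching main.
store.checkout("relabel", create=True)
store.working["img_1"] = "cat"
v2 = store.commit("fix img_1")

store.checkout("main")
main_view = dict(store.working)
```

After these steps, `main` still holds the original label for `img_1`, while `store.at(v2)` returns the relabeled version, so a workload can be tied to the exact commit it read.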

5.3 Data Querying and Analytics

Deep learning models are rarely trained on all the data an organization collects for a particular application. Training datasets are often constructed by filtering the raw data on conditions that improve model performance, such as balancing the data, eliminating redundant data, or selecting data that contains specific features. Deep Lake provides the tools to query and analyze data so that deep learning engineers can create the datasets that yield the most accurate models.
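As a concrete example of such filtering, the sketch below removes exact duplicates and then caps the number of samples per class. It is plain Python rather than Deep Lake's query tooling, and `balanced_subset` along with the sample layout of `(id, label)` tuples are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(samples, label_of, per_class, seed=0):
    """Drop exact duplicates, then keep at most per_class samples per label."""
    seen = set()
    by_label = defaultdict(list)
    for sample in samples:
        if sample in seen:
            continue  # redundant sample, skip it
        seen.add(sample)
        by_label[label_of(sample)].append(sample)

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        subset.extend(group[:per_class])
    return subset


# Hypothetical samples: three cats, and two distinct dogs with one duplicate.
samples = [
    ("a", "cat"), ("b", "cat"), ("c", "cat"),
    ("d", "dog"), ("d", "dog"), ("e", "dog"),
]
subset = balanced_subset(samples, label_of=lambda s: s[1], per_class=2)
counts = Counter(label for _, label in subset)
```

Here `counts` holds two samples for each class. Near-duplicate detection or feature-based selection would need a similarity measure, which this sketch does not attempt.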

5.4 Data Inspection and Quality Control

Though unsupervised learning is becoming more applicable in real-world use cases, most deep learning applications still rely on supervised learning. Any supervised learning system is only as good as the quality of its data, which is often achieved by manual and exhaustive inspection. Since this process is time-consuming, it is critical to provide the humans in the loop with tools to examine vast amounts of data very quickly. Deep Lake allows inspecting deep learning datasets of any size from the browser without any setup time or need to download data. Furthermore, the tools can be extended to compare model results with ground truth. Combined with querying and version control, this can be applied to the iterative improvement of data to achieve the best possible model.
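Comparing model results with ground truth can be as simple as listing the samples where the two disagree, so that a reviewer looks at those first. The following is a minimal sketch with hypothetical helper names (`review_queue`, `per_class_accuracy`) and made-up labels; it is not part of Deep Lake's tooling.

```python
from collections import Counter


def review_queue(labels, predictions):
    """Indices where the prediction differs from the ground-truth label."""
    return [i for i, (y, p) in enumerate(zip(labels, predictions)) if y != p]


def per_class_accuracy(labels, predictions):
    """Fraction of correctly predicted samples for each ground-truth class."""
    total = Counter(labels)
    correct = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {c: correct[c] / total[c] for c in total}


# Illustrative values only: categorical integer labels and model outputs.
labels = [0, 1, 0, 1, 1]
predictions = [0, 1, 1, 1, 0]

to_review = review_queue(labels, predictions)
accuracy = per_class_accuracy(labels, predictions)
```

A disagreement can mean either a model error or a labeling error, which is why the flagged samples go to a human rather than being corrected automatically.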

This paper is available on arXiv under a CC 4.0 license.