Integrated Data Lake Service

Idea

The Integrated Data Lake (IDL) is a repository that allows you to store structured and unstructured data in its native format until it is needed. It handles large data pools for which the schema and data requirements are not defined until the data is queried. This offers more agility and flexibility than traditional data management systems.

The Integrated Data Lake Service allows you to store your data as is, analyze it using dashboards and visualizations or use it for big data processing, real-time analytics, and machine learning.

Access

To access the Integrated Data Lake Service, you need the respective roles listed in Data lake Services roles and scopes.

A user can only interact with objects within their tenant and subtenants.

Basics

Signed URL

The Integrated Data Lake Service enables data upload and download using AWS or Azure. The signed URLs for AWS and the Shared Access Signatures (SAS) for Azure have an expiration date and time and can only be used by authorized tenant users or services.
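As a rough illustration of the expiration behavior, the following sketch computes when a link stops working. It assumes the validity windows listed under Limitations on this page (two hours for signed URLs, twelve hours for SAS) and assumes the window is counted from the moment the link is issued. The names `VALIDITY`, `expires_at` and `is_still_valid` are hypothetical client-side helpers, not part of the service API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Validity windows as listed under Limitations on this page.
VALIDITY = {
    "aws": timedelta(hours=2),     # signed URLs
    "azure": timedelta(hours=12),  # Shared Access Signatures (SAS)
}


def expires_at(provider: str, issued_at: datetime) -> datetime:
    """Return the moment a link issued at `issued_at` is assumed to expire."""
    return issued_at + VALIDITY[provider]


def is_still_valid(provider: str, issued_at: datetime,
                   now: Optional[datetime] = None) -> bool:
    """Check whether a link can still be used; `now` defaults to current UTC time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < expires_at(provider, issued_at)
```

A client could call `is_still_valid` before each transfer and request a fresh link once it returns `False`.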

Data upload and download  | AWS                 | Azure
Using                     | service signed URLs | Service Principal
Maximum object size limit | 5 GB                | 256 GB
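A minimal sketch of preparing an upload to an AWS-style signed URL is shown here. It assumes that the signed URL has already been obtained from the service, that the upload is a plain HTTP PUT of the object bytes, and that "5 GB" means 5 * 1024**3 bytes. None of these details are confirmed by this page, and `build_upload_request` is a hypothetical helper name.

```python
import urllib.request

# Assumption: the 5 GB signed URL limit is read as 5 * 1024**3 bytes.
MAX_SIGNED_URL_BYTES = 5 * 1024**3


def build_upload_request(signed_url: str, payload: bytes,
                         max_bytes: int = MAX_SIGNED_URL_BYTES) -> urllib.request.Request:
    """Build (but do not send) a PUT request for a previously issued signed URL."""
    if len(payload) > max_bytes:
        raise ValueError(
            f"object is {len(payload)} bytes, limit is {max_bytes} bytes"
        )
    # Assumption: the signed URL accepts an HTTP PUT with the raw object bytes.
    return urllib.request.Request(
        signed_url,
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


# Sending the request performs a real network call, so it is left commented out:
# with urllib.request.urlopen(build_upload_request(url, data)) as response:
#     print(response.status)
```

Checking the size before sending avoids starting a transfer that the service would reject anyway.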

Time Series Import

The Integrated Data Lake Service allows authorized tenant users or services to import time series data into the data lake. This enables on-demand time series upload for analytics and machine learning tools.

Metadata

The Integrated Data Lake Service assigns each object a unique identifier. Additionally, the object can be assigned a set of extended metadata tags.

Data Access

Using these services, you can enable (and disable) read and write access to your data for a specific tenant. For example, you can enable analytics tools to directly access your data for analyses without having to download it. This saves storage space and eliminates the need for regular data synchronization.

Alternatively, the Integrated Data Lake Service can generate temporary read and write access to your data, as shown in the following table:

Deviation            | AWS                      | Azure
Data access services | Cross account access     | Service Principal
Access limits        | 5 cross account accesses | 5 Service Principals

The Security Token Service or Service Principal can only be used by an authorized tenant user or service.

For AWS only: Suppose an IDL user has a third-party application, such as Tableau Server, enabled on their AWS account and wants to give it access to the data that resides in IDL. This can be done by enabling the AWS account using cross account access. It is also possible to grant read, write and delete access to the enabled cross account through the API or through the IDL manager.

Notification

The Integrated Data Lake Service provides a notification functionality, which reports when objects are ingested, updated or deleted using the service. Authorized tenant users or services can subscribe to notifications. Currently, a tenant user or service can subscribe to a maximum of 15 notifications.

Info

If the permission to send notifications to the SNS topic is removed, the tenantAdmin is notified by email to check and respond. The email is sent for a week, after which the subscription is removed.

Features

The data lake services expose an API for the following tasks:

  • Import time series data
  • Generate signed URLs to upload, update or download objects
  • Delete objects
  • Add, update and delete tags for objects
  • Receive notifications
  • Cross account access for AWS only
  • Cross account accesses for AWS only
  • Subtenancy support
  • Bulk batch upload of objects

The data lake services expose a UI for the following functionalities:

  • Cross account access - Enables an AWS account to read data from IDL
  • Cross account accesses - Gives prefix-level access to the enabled cross account
  • Time Series Import functionality
  • Service Principal - Enables an Azure account to read and write data in IDL
  • Data Explorer - Lets you explore the files and objects

Limitations

  • All requests pass through MindSphere Gateway and must adhere to the MindSphere Gateway Restrictions.
  • Maximum supported object size for object upload and download using signed URL is 5 GB.
  • Maximum supported object size for object upload and download using Shared Access Signatures (SAS) is 5 GB.
  • Signed URLs expire after two hours.
  • Shared Access Signatures (SAS) expire after twelve hours.
  • Objects are not version controlled.
  • During the revocation process, cross account accesses are emptied and stop working, as per the bucket policy.
  • The token will remain active until its expiry before deprovisioning.
  • The S3 signed URL will remain active until its expiry before deprovisioning.
  • All Bulk Import limitations also apply to the time series import functionality in IDL.
  • Only 10 cross account accesses can be created in the disabled state for AWS.
  • Only 5 cross account accesses can be enabled at any given time for AWS.
  • Only 5 service principals can be enabled at any given time for Azure.
  • A user can have a maximum of 15 subscriptions.
  • The data in UTS might take 48 hours to be reflected.
  • Write access cannot be granted on the Time Series Import folder through Cross Account Accesses or Service Principal.
  • Upload pre-signed URLs in AWS and Shared Access Signatures (SAS) in Azure cannot be created for the Time Series Import folder.
  • The user path must be prefixed with Time Series Import to download the time series data files.
  • Characters used in file names must be in the character set '[a-zA-Z0-9.!*'() _-/=]'. Spaces are not allowed at the beginning or at the end. Consecutive spaces are also not allowed within the name.
  • Objects uploaded using native URLs can only be deleted using native URLs. IDL Service URLs do not support the deletion of files that were uploaded using native URLs.
  • Only 2 secrets can be generated at any given time for each Service Principal.
  • A secret is active for a maximum of 90 days, after which it expires.
  • In the current release of Integrated Data Lake, a Service Principal cannot be generated for already existing folders. This feature will be enabled in subsequent releases of Integrated Data Lake.
  • For event notification, the user must provide a topic from the EU1 region only. Integrated Data Lake cannot send notifications to topics in other regions.
  • It will take approximately 5-10 minutes for the data to be available in the search, after uploading in Integrated Data Lake.
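The file name rule in the list above can be checked on the client before an upload is attempted. The sketch below is one possible reading of that rule: it treats the published set '[a-zA-Z0-9.!*'() _-/=]' as a list of literal characters (letters, digits, and `. ! * ' ( ) space _ - / =`), which is an assumption because `_-/` is not a valid range in a regular expression. The function name `is_valid_object_name` is hypothetical.

```python
import re

# Assumption: every symbol in the published set is a literal character,
# so the hyphen is escaped instead of being read as a range operator.
_ALLOWED = re.compile(r"[a-zA-Z0-9.!*'() _\-/=]+")


def is_valid_object_name(name: str) -> bool:
    """Return True if `name` satisfies the character and spacing rules."""
    # Only allowed characters, and at least one of them.
    if not _ALLOWED.fullmatch(name):
        return False
    # No space at the beginning or at the end.
    if name.startswith(" ") or name.endswith(" "):
        return False
    # No consecutive spaces within the name.
    return "  " not in name
```

For example, `is_valid_object_name("flights/2009-2019/report 1.csv")` passes all three checks, while a name containing `#` or a double space is rejected.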

To get the current list of limitations go to release notes and choose the latest date. From there go to "MindAccess Developer Plan Subscribers and MindAccess Operator Plan Subscribers" and pick the IoT service you are interested in.

Example Scenario

The quality assurance representative of an airline company wants to upload flight data (years 2009-2019) to MindSphere so that they can run analytics tools and make the data accessible for querying.

They can use the MindSphere Integrated Data Lake Service to upload Excel sheets and enable data access from other accounts. This allows the airline company to integrate analytics tools like AWS Glue or Power BI on Azure and quickly perform queries. For example, they can query for "the most popular airport in last 10 years" or "the airport with most cancelled flights in the past year".

Except where otherwise noted, content on this site is licensed under the MindSphere Development License Agreement.