Integrated Data Lake Service¶
The Integrated Data Lake (IDL) is a repository that allows you to store structured and unstructured data in its native format until it is needed. It handles large data pools for which the schema and data requirements are not defined until the data is queried. This offers more agility and flexibility than traditional data management systems.
The Integrated Data Lake Service allows you to store your data as is, analyze it using dashboards and visualizations or use it for big data processing, real-time analytics, and machine learning.
To access the Integrated Data Lake Service, you need the respective roles listed in Data lake Services roles and scopes.
A user can only interact with objects within their tenant and subtenants.
The Integrated Data Lake Service enables data upload and download using signed URLs. The signed URLs have an expiration date and time and can only be used by authorized tenant users or services. The maximum object size for data upload and download using signed URLs is 5 GB.
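As a rough illustration of the upload side, the following Python sketch prepares an HTTP request for a signed URL and rejects payloads above the 5 GB limit before anything is sent. The function name `build_upload_request`, the use of the `PUT` method, the placeholder URL and the reading of "5 GB" as binary gigabytes are assumptions made for this example. They are not taken from the service's API specification.

```python
import urllib.request

# 5 GB limit from the text above, interpreted here as binary gigabytes (assumption).
MAX_OBJECT_SIZE_BYTES = 5 * 1024**3


def build_upload_request(signed_url, payload, max_size=MAX_OBJECT_SIZE_BYTES):
    """Prepare (but do not send) an upload request for a signed URL.

    Hypothetical helper: the PUT method is an assumption, check the API
    specification for the method the signed URL actually expects.
    """
    if len(payload) > max_size:
        raise ValueError(
            f"payload is {len(payload)} bytes, limit is {max_size} bytes"
        )
    return urllib.request.Request(signed_url, data=payload, method="PUT")


# Placeholder URL: a real signed URL is generated by the service.
request = build_upload_request(
    "https://example.invalid/object?signature=abc", b"hello"
)
# Sending would be: urllib.request.urlopen(request)
```

Checking the size locally avoids starting a transfer that the service would reject anyway.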
Time Series Import¶
The Integrated Data Lake Service allows authorized tenant users or services to import time series data into the data lake. This enables on-demand time series upload for analytics and machine learning tools.
The Integrated Data Lake Service assigns each object a unique identifier. Additionally, the object can be assigned a set of extended metadata tags.
Using this service, you can enable (and disable) read-only access to your data for a specific tenant. For example, you can enable analytics tools to directly access your data for analyses without having to download it. This saves storage space and eliminates the need for regular data synchronization.
Alternatively, the Integrated Data Lake Service can generate temporary STS tokens for read-only access to your data. The token can only be used by an authorized tenant user or service.
Using cross account access, a user can enable an AWS account to access data in the MindSphere data lake. For example, an IDL user runs a third-party application, such as Tableau Server, in their AWS account and wants to give that application access to the data that resides in IDL. To do so, the user enables the AWS account using cross account access. It is also possible to provide read/write and delete access to the enabled cross account through the API or through the IDL manager.
Currently, a user can have a maximum of 5 cross account accesses enabled at any given time.
The Integrated Data Lake Service provides a notification functionality, which reports when objects are ingested, updated or deleted using the service. Authorized tenant users or services can subscribe to notifications. Currently, a tenant user or service can subscribe to at most 15 notifications.
The data lake service exposes an API for the following tasks:
- Import time series data
- Generate signed URLs to upload, update or download objects
- Delete objects
- Add, update and delete tags for objects
- Receive notifications
- Cross account access
- Cross account accesses
- Subtenancy support
The data lake service exposes a UI for the following functionalities:
- Cross account access - For enabling AWS account to read the data from IDL
- Cross account accesses - For giving prefix level access to the enabled cross account
- TimeSeries Import functionality
Limitations¶
- All requests pass through MindSphere Gateway and must adhere to the MindSphere Gateway Restrictions.
- Maximum supported object size for object upload and download using signed URL is 5 GB.
- Signed URLs expire after two hours.
- Objects are not version controlled.
- During the revocation process, the bucket policy is emptied, so any cross account accesses that were enabled before stop working.
- A token that is active before deprovisioning remains usable until it expires.
- An S3 signed URL that is active before deprovisioning remains usable until it expires.
- All Bulk Import limitations also apply to the time series import functionality in IDL.
- At most 10 cross account accesses can be created in disabled state.
- At most 5 cross account accesses can be enabled at any given time.
- A user can create at most 15 notification subscriptions.
- Data might take 48 hours to be reflected in UTS.
- ‘Write’ access cannot be provided on the Time Series Import folder in Cross Account Accesses.
- An upload pre-signed URL cannot be created for the Time Series Import folder.
- The user path should be prefixed with Time Series Import for downloading the time series data files.
To get the current list of limitations, go to the release notes and choose the latest date. From there, go to "MindAccess Developer Plan Subscribers and MindAccess Operator Plan Subscribers" and pick the IoT service you are interested in.
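Some of the numeric limits listed above can be checked on the client side before a request is made. The sketch below collects them as constants and adds two small helpers. All names are hypothetical, and treating a signed URL as unusable exactly at the two-hour mark is an assumption. The service remains the authority on what it accepts.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the limitations listed above.
SIGNED_URL_LIFETIME = timedelta(hours=2)
MAX_ENABLED_CROSS_ACCOUNTS = 5
MAX_DISABLED_CROSS_ACCOUNTS = 10
MAX_SUBSCRIPTIONS = 15


def signed_url_is_usable(issued_at, now=None):
    """Return True while less than two hours have passed since issue.

    Hypothetical helper: `issued_at` is the time the caller recorded when
    the signed URL was generated (timezone-aware datetime).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - issued_at < SIGNED_URL_LIFETIME


def can_enable_cross_account(enabled_count):
    """Return True if one more cross account access may be enabled."""
    return enabled_count < MAX_ENABLED_CROSS_ACCOUNTS


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(signed_url_is_usable(issued, now=issued + timedelta(minutes=90)))
print(can_enable_cross_account(5))
```

Because the limits can change between releases, keeping them in one place makes them easy to update against the release notes.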
The quality assurance representative of an airline company wants to upload flight data (years 2009-2019) to MindSphere so that they can run analytics tools and make the data accessible for querying.
They can use the MindSphere Data Lake Service to upload Excel sheets and enable data access from other accounts. This allows the airline company to integrate analytics tools like AWS Glue and quickly perform queries. For example, they can query for "the most popular airport in the last 10 years" or "the airport with the most cancelled flights in the past year".
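To make the two example queries concrete, the following Python sketch runs them over a handful of in-memory records. It is a plain-Python stand-in, not AWS Glue. The record layout (`year`, `origin`, `cancelled`), the airport codes and the sample rows are invented for illustration only.

```python
from collections import Counter

# Hypothetical sample records; real data would come from the uploaded sheets.
flights = [
    {"year": 2018, "origin": "MUC", "cancelled": False},
    {"year": 2019, "origin": "MUC", "cancelled": False},
    {"year": 2019, "origin": "MUC", "cancelled": True},
    {"year": 2019, "origin": "FRA", "cancelled": True},
    {"year": 2019, "origin": "FRA", "cancelled": True},
    {"year": 2017, "origin": "BER", "cancelled": False},
]


def most_popular_airport(records):
    """Return the origin airport with the most flights, or None if empty."""
    counts = Counter(r["origin"] for r in records)
    return counts.most_common(1)[0][0] if counts else None


def airport_with_most_cancellations(records, year):
    """Return the origin airport with the most cancelled flights in `year`."""
    counts = Counter(
        r["origin"] for r in records if r["year"] == year and r["cancelled"]
    )
    return counts.most_common(1)[0][0] if counts else None


print(most_popular_airport(flights))
print(airport_with_most_cancellations(flights, 2019))
```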
Except where otherwise noted, content on this site is licensed under the MindSphere Development License Agreement.