
Semantic Data Interconnect (SDI) Data Management

Idea

The Semantic Data Interconnect (SDI) Data Management APIs handle the entire workflow of data registration and preparation. SDI provides a simple way to prepare data for establishing semantic correlations and for data query processing. The stages for using SDI are data registration, custom data type definition, data ingest, and schema search.

For more information about these stages, refer to Basics.

Access

To access this service, you need the respective roles listed in SDI roles and scopes.

Application users can access the REST APIs using a REST client. Depending on the API, users need different roles to access the SDI Data Query Service.

Note

Access to the SDI Data Query Service APIs is protected by MindSphere authentication methods, using OAuth credentials.

Basics

Data Registration

A data scientist or analyst decides on the data source, the categorization of the data, the data tag name, the data upload strategy (replace or append), and the file type (JSON, CSV, or XML). Once these decisions are made, the Data Registration APIs can be used to create the registry.
Data Registration APIs are used to organize incoming data. When configuring a data registry, you can update your data based on a replace or an append strategy: during each data ingest operation, the replace strategy replaces the existing schema and data, whereas the append strategy updates the existing schema and data. For example, if the schema and the incoming data files are completely different every time, use the replace strategy.
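As an illustration, a minimal Python sketch of creating such a registry via the POST /dataRegistries endpoint listed under Features (the gateway host, API version, token handling, property names, and response fields are assumptions; consult the API specification for the exact schema):

  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Register a source/data tag pair; the property names below are illustrative.
  registry = {"sourceName": "erp",               # data source name
              "dataTag": "orders",               # sub-source/table within the source
              "fileUploadStrategy": "append",    # or "replace"
              "filePattern": "orders_.*\\.csv"}  # restrict which files may be uploaded
  resp = requests.post(f"{SDI_BASE}/dataRegistries", json=registry, headers=HEADERS)
  resp.raise_for_status()
  registry_id = resp.json()["id"]  # assumed response field; store it for later uploads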

Custom Data Types

By default, SDI identifies basic data types for each property, such as String, Integer, Float, and Date. Once the data source and data types are identified, the user can provide custom data types with regex patterns so that SDI can apply them during schema creation. The developer can use the Custom Data Types APIs to manage these types. A custom data type consists of a data type name and one or more regular expression patterns that incoming data must match. SDI also provides an API that suggests data types based on user-provided sample test values: it returns a list of possible regex matchers for the given tests and sample values. Users can pick the pattern that best matches the sample values and register it as a custom data type.
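A minimal Python sketch of this suggest-then-register flow, using the /suggestPatterns and /dataTypes endpoints listed under Features (host, API version, token handling, and payload shapes are assumptions):

  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Ask SDI to suggest regex patterns for sample values; payload shape is illustrative.
  samples = {"sampleValues": ["DE-1234", "DE-5678"], "testValues": ["DE-0001"]}
  resp = requests.post(f"{SDI_BASE}/suggestPatterns", json=samples, headers=HEADERS)
  resp.raise_for_status()
  print(resp.json())  # pick the returned pattern that best matches the samples

  # Register the chosen pattern as a custom data type; field names are illustrative.
  data_type = {"name": "COUNTRY_CODE_ID", "patterns": ["DE-\\d{4}"]}
  resp = requests.post(f"{SDI_BASE}/dataTypes", json=data_type, headers=HEADERS)
  resp.raise_for_status()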

Data Ingest

The developer can use the Data Ingest APIs to bring data into SDI so that schemas can be created for query processing and semantic model creation. Integrated Data Lake (IDL) users should follow the Data Lake APIs to prepare data for SDI. Data Ingest is the starting point for creating schemas and managing their data. Once valid registries are created for a data source, the user can upload files and start the data ingest process to create schemas; files can be uploaded from various systems. Currently, SDI supports the JSON, XML, and CSV file formats for enterprise data and the Parquet format for time series data. SDI supports two ways of data ingestion:

For Integrated Data Lake (IDL) customers

If you are a new IDL customer using SDI with IDL for enterprise and IoT data, follow these steps:

  1. Purchase the SDI and IDL base plan.
  2. By default, SDI enables cross-account access to provisioned tenants under the sdi folder.
  3. SDI uses the IDL POST /objectEventSubscriptions endpoint to subscribe to the SDI topic when SDI and IDL are provisioned to the tenant. IDL will send a notification whenever this folder changes.
  4. Retrieve <storageAccount> using the IDL GET /objects API. Use the storageAccount from the response to register the IDL data lake with SDI.
  5. Register the IDL data lake with SDI by calling SDI POST /dataLakes with the payload {"type": "MindSphere", "name": "idl", "basePath": "<storageAccount>/data/ten="} (see the sketch after this list).
  6. Enterprise data uploaded into the sdi folder, or MindSphere IoT data imported into it, will be processed based on the notifications received from IDL.
  7. If you want SDI to process files from a folder other than sdi, repeat steps 2 and 3 so that SDI will process files uploaded to that folder.
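
A minimal sketch of steps 4 and 5 in Python (the gateway host, API versions, token handling, and the storageAccount response field are assumptions to verify against your tenant's API catalog; the /dataLakes payload is as documented above):

  import requests

  IDL_BASE = "https://gateway.eu1.mindsphere.io/api/datalake/v3"  # assumed base path
  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"       # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Step 4: retrieve the storage account via the IDL GET /objects endpoint.
  resp = requests.get(f"{IDL_BASE}/objects", headers=HEADERS)
  resp.raise_for_status()
  storage_account = resp.json()["storageAccount"]  # assumed response field name

  # Step 5: register the IDL data lake with SDI using the documented payload.
  payload = {"type": "MindSphere", "name": "idl",
             "basePath": f"{storage_account}/data/ten="}
  resp = requests.post(f"{SDI_BASE}/dataLakes", json=payload, headers=HEADERS)
  resp.raise_for_status()
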
Enterprise Data Flow
  1. Use the SDI documentation to create a data registry as explained under Data Registry APIs. Once the data registry is created, SDI returns a registryId.
  2. Store the registryId retrieved from this data registry.
  3. Identify the files that need to be uploaded for the given registryId and create metadata for them using the IDL POST /objectMetadata/{objectpath} API (a sketch of this step follows the list):

  {"tags": ["registryId:<registry id>"]}
If the input file contains XML and no default rootTag is identified for XML files, or you want to provide a different rootTag for a file, add the rootTag to the metadata as well:

  {"tags": ["registryId:<registry id>", "rootTag:<root tag>"]}
4. Upload the file. SDI retrieves a message from IDL for each upload and creates a schema from the uploaded files. If you are using Postman to upload the file, make sure to choose the binary option before uploading the file via the IDL-generated URL.
5. Use SDI's searchSchemas to retrieve all the schemas for the uploaded files and create a query using the SDI APIs.
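
A minimal sketch of step 3 in Python (the gateway host, API version, token handling, and object path are assumptions; the tags payload follows the format above):

  import requests

  IDL_BASE = "https://gateway.eu1.mindsphere.io/api/datalake/v3"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  registry_id = "<registry id>"           # returned by the data registry creation
  object_path = "sdi/orders/orders1.xml"  # hypothetical object path under the sdi folder

  # Attach the registryId (and, for XML without a default rootTag, a rootTag) as tags.
  tags = {"tags": [f"registryId:{registry_id}", "rootTag:<root tag>"]}
  resp = requests.post(f"{IDL_BASE}/objectMetadata/{object_path}",
                       json=tags, headers=HEADERS)
  resp.raise_for_status()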

IoT Data Flow
  1. Identify the asset and aspect for which you want SDI to process data. Create an IoT data registry for the given asset and aspect using the SDI POST /iotDataRegistries endpoint.
  2. Once the IoT data registry is created, perform the import using the IDL POST /timeSeriesImportJobs endpoint (see the sketch after this list).
  3. Once the time series import job has run and the data is stored in the path that SDI is subscribed to, IDL sends a message to SDI.
  4. Time series data imported into the SDI-subscribed folder is consumed by SDI and is ready to be used in queries.
  5. Use SDI's searchSchemas to retrieve all the schemas for the uploaded files and start writing queries using the SDI APIs.
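
A minimal sketch of steps 1 and 2 in Python (hosts, API versions, token handling, and all payload field names are assumptions; only the endpoint names come from the steps above):

  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"       # assumed base path
  IDL_BASE = "https://gateway.eu1.mindsphere.io/api/datalake/v3"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Step 1: create the IoT data registry; field names are illustrative.
  iot_registry = {"assetId": "<asset id>", "aspectName": "<aspect name>"}
  resp = requests.post(f"{SDI_BASE}/iotDataRegistries", json=iot_registry, headers=HEADERS)
  resp.raise_for_status()

  # Step 2: trigger the time series import through IDL; payload shape is illustrative.
  import_job = {"assetId": "<asset id>", "aspectName": "<aspect name>",
                "from": "2024-01-01T00:00:00Z", "to": "2024-01-31T23:59:59Z"}
  resp = requests.post(f"{IDL_BASE}/timeSeriesImportJobs", json=import_job, headers=HEADERS)
  resp.raise_for_status()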

For Customer Data Lakes

Users need the additional SDI Data Storage Upgrade to enable data storage in SDI. Users can use the SDI POST /dataUpload API to upload JSON, CSV, or XML files.
Once the data is uploaded to SDI, the user can follow this set of APIs to ingest the data into SDI and review the job status; the ingest job must match a valid data registry. POST /ingestJobs generates a jobId for each job, and GET /ingestJobStatus/{id} can be used to track the status of that job. The job is successful when SDI has created a schema and the data is ready for query. A sketch of this flow follows.
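
A minimal sketch of this upload-ingest-poll cycle in Python (host, API version, token handling, multipart field name, request payload, and response/status field values are assumptions; the endpoints are as named above):

  import time
  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Upload a file to SDI storage (requires the SDI Data Storage Upgrade).
  with open("orders.csv", "rb") as f:
      resp = requests.post(f"{SDI_BASE}/dataUpload",
                           files={"file": ("orders.csv", f, "text/csv")},  # field name assumed
                           headers=HEADERS)
  resp.raise_for_status()

  # Start an ingest job against a valid data registry; field names are illustrative.
  job = {"sourceName": "<source name>", "dataTag": "<data tag>", "filePath": "orders.csv"}
  resp = requests.post(f"{SDI_BASE}/ingestJobs", json=job, headers=HEADERS)
  resp.raise_for_status()
  job_id = resp.json()["id"]  # assumed response field name

  # Poll the job status until SDI has created the schema.
  while True:
      status = requests.get(f"{SDI_BASE}/ingestJobStatus/{job_id}", headers=HEADERS).json()
      if status.get("status") in ("FINISHED", "FAILED"):  # status values assumed
          break
      time.sleep(10)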

Search Schema

The schema is available once the ingest job succeeds. The schema registry allows a user to retrieve schemas based on the following:

  • The source name, data tag, or schema name for the Enterprise category.
  • The assetId, aspectName, or schema name for the IoT category.

Features

Data registration

This is the first step before any data is ingested into or connected to SDI for schema extraction or query execution. Data analysts or admins need to register the data sources whose data will be used for analysis and semantic modeling.

The registration consists of data source names, data tags (sub-sources or tables within a source), file patterns, and file upload strategies. For file-based batch data ingestion, SDI provides various file upload data management policies. Currently, SDI provides Append and Replace as the two policies that can be set for each data tag within a data source.

  • Append: This policy joins files ingested for a source and data tag. If the schema of a new file matches the existing one, it returns a success response; otherwise, it appends the new schema to the existing schema. This policy can be used for batch-based data ingestion through the data upload API.
  • Replace: This policy replaces the entire data set for the corresponding source and data tag. It is useful for updating metadata-like information.

Data Registry service is primarily used for two purposes:

  1. Maintaining the registry for a tenant: Using this service, you can create your domain-specific registry entries. The registry is the starting point for any analytics and file upload. A data registry allows you to restrict the kind of file that can be uploaded, based on the file pattern and the area it will be uploaded to. SDI either replaces or appends to existing data based on the file upload strategy. The following endpoints can be used to create and retrieve the data registries created by the customer (see the sketch after this list):

    • /dataRegistries POST
    • /dataRegistries/{id} PATCH
    • /dataRegistries/{id} GET
    • /dataRegistries GET
  2. Creating custom data types for a given tenant: This endpoint allows you to create sample regular expression patterns that can be used during schema extraction. It helps generate regular expression patterns from the available sample values, register one or more of the system-generated patterns, and retrieve the generated patterns. By default, the SDI system uses a set of regular expressions when extracting the schema from an uploaded file. If the tenant has provided custom data types with custom regular expressions through this service, those are used as well to infer the data types of the uploaded file. The following endpoints can be used (see the sketch after this list):

    • /suggestPatterns POST - Generates regular expression patterns for a given set of sample values.
    • /dataTypes/{name} GET - Retrieves data types for a tenant and data type name.
    • /dataTypes GET - Retrieves data types for a tenant.
    • /dataTypes POST - Registers data types to a tenant, based on generated patterns or customer-created data types.
    • /dataTypes/{name}/addPatterns POST - Updates registered data types.
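
Complementing the creation sketches under Basics, a hedged Python example of the update endpoints in the two lists above (property names and payload shapes are assumptions):

  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Switch an existing registry to the replace strategy; property name is illustrative.
  resp = requests.patch(f"{SDI_BASE}/dataRegistries/<registry id>",
                        json={"fileUploadStrategy": "replace"}, headers=HEADERS)
  resp.raise_for_status()

  # Add another regex pattern to a registered custom data type; payload shape assumed.
  resp = requests.post(f"{SDI_BASE}/dataTypes/COUNTRY_CODE_ID/addPatterns",
                       json={"patterns": ["AT-\\d{4}"]}, headers=HEADERS)
  resp.raise_for_status()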

Data Ingest

Once registration is done, raw data can be ingested from either the Integrated Data Lake or customer data lakes; for more information, refer to the Basics section. This serves as the data ingestion starting point for the SDI application. Currently, SDI supports CSV, JSON, and XML formatted domain-specific files. There are two scenarios for uploading a file:

  1. Quick SDI processing: This mode allows uploading files without registering sources through the Data Registry service. SDI processes files uploaded this way with the default policy for schema generation.
  2. Upload with a valid data source registry: This is the preferred mode, since it allows more validation against the data registry and creates multiple schemas based on the different domains created under the data registry. Using this mode, you can create combinations of schemas from different domains, query them, or use them for analytical modeling.

In Data Management, SDI schedules an automatic Extract, Load and Transform (ELT) job to extract and infer a schema from this data. The schema is then stored per data source and data tag. If the schema changes, it is appended and a consolidated schema is provided to the user. Users can search for a schema of the ingested or linked data to:

  • Build queries on the basis of physical schema
  • Develop the semantic model by mapping business properties to physical schema attributes
  • Get an initial inferred semantic model for selected schemas
  • Build queries based on a created semantic model

The Schema Registry service is primarily used for maintaining the schemas for a tenant. Schemas are stored with the default name format datasource_dataTag. If files are ingested on the fly (Quick SDI processing), the schema name is the name of the file.

The SDI system extracts and stores the schema once the file is uploaded. The user can search for a schema created by the SDI system using the data tag, schema name, and source name. Multiple schemas can be retrieved by passing an empty list, or a filtered list based on the above parameters can be used to search the schemas generated by the system.

Search Schema

This allows the user to search for schemas based on data tag, schema name, or source name. The elements of the search schema array must contain identical sets of search criteria.

POST Method: /searchSchemas

SDI follows a schema-on-read philosophy: users do not need to know the schema before data is ingested into MindSphere SDI. SDI can infer and extract a schema, consisting of attributes and data types, from the ingested data and store it per tenant and data tag. The Data Ingest Service API supports XML, JSON, and CSV as input data file formats. For files in XML format, the customer can provide the root element to be processed.
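
A minimal Python sketch of a schema search (host, API version, token handling, and the request/response shapes are assumptions; the endpoint is as named above):

  import requests

  SDI_BASE = "https://gateway.eu1.mindsphere.io/api/sdi/v4"  # assumed base path
  HEADERS = {"Authorization": "Bearer <technical-user-token>"}

  # Search by source name and data tag; an empty list would return all schemas.
  # The payload and response shapes are illustrative.
  search = {"schemas": [{"sourceName": "erp", "dataTag": "orders"}]}
  resp = requests.post(f"{SDI_BASE}/searchSchemas", json=search, headers=HEADERS)
  resp.raise_for_status()
  for schema in resp.json().get("schemas", []):
      print(schema.get("schemaName"))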

Limitations

  1. The number of data sources that can be registered depends on the offering plan subscribed to by the user tenant.
  2. The Data Ingest POST method supports files up to 100 MB in size.
  3. SDI allows a maximum ingest rate of 70 MBPS.
  4. Once the data is ingested successfully, the schema is available for search as soon as the job is finished.
  5. Users can create a maximum of 500 registries per tenant.
  6. Users can create a maximum of 200 custom data types per tenant, and each data type cannot contain more than 10 regular expression patterns.
  7. Search schema and infer ontology requests are limited to 20 entries per search.
  8. The search schema response contains at most the 20 most recently ingested original file names.
  9. SDI supports a maximum of 250 properties per schema.
  10. Delimiters such as semicolons or pipes are not accepted in CSV files; input CSV files must use a comma (,) as the delimiter.
  11. For input files in JSON format, JSON key names must not contain special characters such as dot, space, comma, semicolon, curly braces, brackets, newline, or tab.



Except where otherwise noted, content on this site is licensed under the MindSphere Development License Agreement.