Semantic Data Interconnect (SDI) Data Management

Idea

The Semantic Data Interconnect (SDI) Data Management APIs handle the entire workflow of data registration and preparation. SDI provides a simple way to prepare data for establishing semantic correlations and for query processing. Using SDI involves the following stages: Data Registration, Custom Data Types, Data Ingest, and Search Schema.

For more information about these stages, refer to Basics.

Access

To access this service, you need the respective roles listed in SDI roles and scopes.

Application users can access the REST APIs using a REST client. Depending on the API, users need different roles to access the SDI Data Query Service.

Note

Access to the SDI Data Query Service APIs is protected by MindSphere authentication methods, using OAuth credentials.

Basics

Data Registration

The data scientist or analyst decides on the data source and the categorization of the data, the data tag name, the data upload strategy (replace or append), and the file type (JSON, CSV, and/or XML). Once these decisions are made, the Data Registration APIs can be used to create the registry.
Data Registration APIs are used to organize the incoming data. When configuring a data registry, you can choose to update your data based on a replace or an append strategy. During each data ingest operation, the replace strategy replaces the existing schema and data, whereas the append strategy updates the existing schema and data. For example, if the schema changes and the incoming data files are completely different every time, use the replace strategy.

Custom Data Types

By default, SDI identifies basic data types for each property, such as String, Integer, Float, and Date. Once the data source and data types are identified, the user can provide custom data types with regex patterns so that SDI can apply those types during schema creation. The developer can use the Custom Data Types APIs to create and manage custom types. A custom data type contains a data type name and one or more regular expression patterns that incoming data must match. SDI also provides an API that suggests data types based on user-provided sample and test values: it returns a list of possible regex matchers for the given test and sample values. Users can pick the regex patterns that match the most sample values and register those patterns as custom data types. SDI also supports deleting an unused custom data type; a custom data type that is used by a schema cannot be deleted.
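Picking the pattern that matches the most sample values can be sketched locally before registering anything. The snippet below is only an illustration: the sample values, the candidate patterns, and the helper name rank_patterns are hypothetical, and it does not call or reproduce the SDI suggestion API.

```python
import re


def rank_patterns(candidates, samples):
    """Return (pattern, match_count) pairs, best-matching pattern first."""
    scored = []
    for pattern in candidates:
        compiled = re.compile(pattern)
        # Count the samples that the pattern matches in full.
        hits = sum(1 for value in samples if compiled.fullmatch(value))
        scored.append((pattern, hits))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical sample values and candidate patterns, for illustration only.
samples = ["AB-1234", "CD-5678", "EF-01"]
candidates = [r"[A-Z]{2}-\d{4}", r"[A-Z]{2}-\d+", r"\d+"]

ranked = rank_patterns(candidates, samples)
best_pattern, best_hits = ranked[0]
print(best_pattern, best_hits)
```

The pattern at the top of the ranking is the natural candidate to register as a custom data type.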

Data Ingest

The developer can use the Data Ingest APIs to bring data into SDI so that schemas can be created for querying and for the semantic model creation process. Integrated Data Lake (IDL) users should follow the Data Lake APIs to process the data for SDI. Data Ingest is the starting point for creating schemas and managing their data. Once valid registries are created for a data source, the user can upload files from various systems and start the data ingest process to create schemas. Currently, SDI supports the JSON, XML, and CSV file formats for enterprise data and the Parquet format for time series data. SDI supports two ways of data ingestion:

For Integrated Data Lake (IDL) customers

If you are a new IDL customer using SDI with IDL for Enterprise and IoT data, follow the steps below:

  1. Purchase the SDI and IDL base plan.
  2. By default, SDI enables cross-account access to provisioned tenants under the sdi folder.
  3. SDI uses the IDL POST /objectEventSubscriptions endpoint to subscribe to the SDI topic when SDI and IDL are provisioned to a tenant. IDL sends a notification whenever this folder changes.
  4. Retrieve <storageAccount> using the IDL API GET /objects. Use the storageAccount from the response to register the IDL data lake with SDI.
  5. Register the IDL data lake with SDI by calling SDI POST /dataLakes with the payload {"type": "MindSphere", "name": "idl", "basePath": "<storageAccount>/data/ten="}.
  6. Enterprise data uploaded into the sdi folder, or MindSphere IoT data imported into the sdi folder, is processed based on a notification received from IDL.
  7. If you want SDI to process files from a folder other than sdi, repeat steps 2 and 3 so that SDI processes files uploaded to that folder.
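
The registration payload from step 5 can be assembled programmatically once the storage account is known. This is a minimal sketch: the helper name and the example storage account value are hypothetical, and the basePath suffix is copied exactly as it appears in step 5. Sending the request (authentication, base URL) is left out.

```python
import json


def build_data_lake_payload(storage_account):
    """Build the body for SDI POST /dataLakes as shown in step 5."""
    return {
        "type": "MindSphere",
        "name": "idl",
        # Suffix copied verbatim from the documented payload.
        "basePath": f"{storage_account}/data/ten=",
    }


# "examplestorage" is a placeholder for the value returned by IDL GET /objects.
payload = build_data_lake_payload("examplestorage")
body = json.dumps(payload)
print(body)
```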

Enterprise Data Flow
  1. Create a data registry as explained under Data Registry APIs in the SDI documentation. Once the data registry is created, SDI returns a registryId.
  2. Store the registryId returned for this data registry.
  3. Identify the files that need to be uploaded for the given registryId and create metadata for the files using the IDL POST /objectMetadata/{objectpath} API.

  {"tags": ["registryId_<registry id>"]}
If the input file is XML and no default rootTag is identified for XML files, or you want to provide a different rootTag for a file, add this tag to the metadata above:

  {"tags": ["registryId_<registry id>", "rootTag_<root tag>"]}
  4. Upload the file. SDI receives a message from IDL for each upload and creates a schema from the uploaded files. If you use Postman to upload the file, choose the binary option before uploading with the IDL-generated URL.
  5. Use SDI's searchSchema to retrieve the schemas for the uploaded files and create a query using SDI APIs.

IoT Data Flow
  1. Identify the asset and aspect for which you want SDI to process data. Create an IoT data registry for the given asset and aspect using the SDI POST /iotDataRegistries endpoint.
  2. Once the IoT data registry is created, use the IDL POST /timeSeriesImportJobs endpoint to perform the import.
  3. Once the time series import job has run and the data is stored in the path that SDI is subscribed to, IDL sends a message to SDI.
  4. Time series data imported into the SDI-subscribed folder is consumed by SDI and is then ready to be used in queries.
  5. Use SDI's search schema to retrieve the schemas for the uploaded files and start writing queries using SDI APIs.

For Customer Data Lakes

Users need the additional SDI Data Storage Upgrade to enable data storage in SDI. Users can then upload JSON, CSV, or XML files with the SDI POST /dataUpload API.
Once the data is uploaded to SDI, use the following APIs to ingest the data and review the job status. The ingest job must match a valid data registry. POST /ingestJobs generates a jobId for each job, and GET /ingestJobStatus/{id} can be used to track the status of the job. The job is successful when SDI has created a schema and the data is ready for query.
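
Tracking a job until it finishes usually means polling the status endpoint. The sketch below shows one way to structure that loop. It makes several assumptions: the status lookup is passed in as a callable (standing in for a real GET /ingestJobStatus/{id} request), and the state names "STARTED", "IN_PROGRESS", "FINISHED", and "FAILED" are hypothetical placeholders, not values confirmed by this documentation.

```python
import time


def wait_for_ingest_job(get_status, job_id, done_states,
                        poll_interval_s=5.0, max_polls=60):
    """Poll get_status(job_id) until it returns a state listed in done_states."""
    for _ in range(max_polls):
        state = get_status(job_id)
        if state in done_states:
            return state
        time.sleep(poll_interval_s)
    raise TimeoutError(f"ingest job {job_id} did not finish after {max_polls} polls")


# Stand-in for a real status request; the state names are hypothetical.
_fake_states = iter(["STARTED", "IN_PROGRESS", "FINISHED"])


def fake_get_status(job_id):
    return next(_fake_states)


final_state = wait_for_ingest_job(
    fake_get_status, "job-123",
    done_states={"FINISHED", "FAILED"},
    poll_interval_s=0.0,
)
print(final_state)
```

Passing the lookup as a callable keeps the loop independent of any particular HTTP client and makes it easy to test without network access.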

Search Schema

The schema is available once the ingest job is successful. The schema registry allows a user to retrieve schemas based on the following:

  • The source name, data tag, or schema name for the Enterprise category.
  • The assetId, aspectName, or schema name for the IoT category.

Features

Data registration

This is the first step before any data is ingested or connected to SDI for schema extraction or query execution. Data analysts or admins need to register the data sources from which data will be used for analysis and semantic modeling.

A registration consists of data source names, data tags (sub-sources or tables within a source), file patterns, and file upload strategies. SDI currently supports CSV, JSON, and XML, so the file pattern must end with the csv, xml, or json file type. Example file patterns for the different types are [a-z]+.json (JSON), [a-z]+.csv (CSV), and [a-z]+.xml (XML). SDI also accepts multiple extensions in this format: [a-z]+.(json|csv|xml). Replace [a-z]+ with the file name filter you need.
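
A file pattern can be checked against candidate file names locally before it is registered. The sketch below assumes that a pattern has to match the whole file name; how SDI itself applies the pattern is not specified here. The file names are hypothetical.

```python
import re


def matches_file_pattern(pattern, file_name):
    """Return True if the whole file name matches the pattern."""
    return re.fullmatch(pattern, file_name) is not None


multi_extension = r"[a-z]+.(json|csv|xml)"
results = {
    name: matches_file_pattern(multi_extension, name)
    for name in ["orders.json", "orders.csv", "orders.txt", "Orders.json"]
}
print(results)

# In standard regex syntax an unescaped "." matches any character,
# so the documented form also accepts names without a literal dot.
loose = matches_file_pattern(r"[a-z]+.json", "ordersxjson")
strict = matches_file_pattern(r"[a-z]+\.json", "ordersxjson")
print(loose, strict)
```

If only names with a literal dot before the extension should pass, escaping the dot (for example [a-z]+\.json) is the stricter form in standard regex syntax.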

For file-based batch data ingestion, SDI provides various file upload data management policies. Currently, SDI provides Append and Replace as two policies that can be set for each datatag within a data source.

  • Append: Joins the files ingested for a source and data tag. It returns a success response if the schema matches; otherwise, it appends to the existing schema. This policy can be used for batch-based data ingestion through the data upload API.
  • Replace: Replaces the entire data set for the corresponding source and data tag. This policy is useful for updating metadata-type information.

Data Registry service is primarily used for two purposes:

  1. Maintaining the registry for a tenant: Using this service, you can create your domain-specific registry entries. This registry is the starting point for any analytics and file upload. The data registry allows you to restrict the kind of file that can be uploaded, based on the file pattern, and the area to which it is uploaded. SDI either replaces or appends to existing data based on the file upload strategy. The following endpoints can be used to create, update, and retrieve data registries:

    • /dataRegistries POST
    • /dataRegistries/{id} PATCH
    • /dataRegistries/{id} GET
    • /dataRegistries GET
  2. Creating custom data types for a given tenant: This service lets you generate sample regular expression patterns that can be used during schema extraction. It generates regular expression patterns from the available sample values, lets you register one or more of the system-generated patterns, and lets you retrieve the generated patterns. By default, SDI uses a built-in set of regular expressions when extracting the schema from an uploaded file. If a tenant has provided custom data types with custom regular expressions through this service, those are also used to infer the data types in the uploaded file. The following endpoints can be used:

    • /suggestPatterns POST - Generates regular expression patterns for a given set of sample values.
    • /dataTypes/{name} GET - Retrieves the data type for a tenant and data type name.
    • /dataTypes/{name} DELETE - Deletes the data type for a tenant and data type name, only if it is not used by any schema.
    • /dataTypes GET - Retrieves the data types for a tenant.
    • /dataTypes POST - Registers data types for a tenant, based on generated sample patterns or customer-created data types.
    • /dataTypes/{name}/addPatterns POST - Updates registered data types.

Data Ingest

Once registrations are done, raw data can be ingested either from integrated data lakes or from customer data lakes. For more information, refer to the Basics section. Data Ingest serves as the starting point for data ingestion for the SDI application. Currently, SDI supports CSV, JSON, and XML formatted domain-specific files. There are two ways to upload a file:

  1. Upload the file using IDL: This is the preferred mode. SDI starts processing the file once it is uploaded to IDL with the correct configuration. For more information on configuring IDL, refer to the Integrated Data Lake Service section.
  2. Upload the file with a valid data source registry: This approach is for SDI-only customers. It allows more validation against the data registry and creates multiple schemas based on the different domains created under the data registry. Using this mode, you can create a combination of schemas from different domains, query them, or use them for analytical modeling.

In Data Management, SDI schedules an automatic Extract, Load and Transform (ELT) job to extract or infer the schema from this data. The schema is then stored per data source and data tag.

Schema Evolution: Schema changes apply to the append strategy only. If the schema changes from one data ingest to another, SDI consolidates the schema when new properties are found. If a property contains incompatible data types across different ingestions, SDI updates the type to an encompassing type. The encompassing type is accommodated for up to 5000 records of existing data; after 5000 records, the schema is considered stable for the existing data with respect to incompatible data type changes.

Users can search for the schemas of this ingested or linked data to:

  • Build queries based on the physical schema
  • Develop the semantic model by mapping business properties to physical schema attributes
  • Get an initial inferred semantic model for selected schemas
  • Build queries based on a created semantic model

The Schema Registry service is primarily used for maintaining the schemas for a tenant. Schemas are stored with the default name format datasource_dataTag. If files are ingested on the fly (Quick SDI processing), the schema name is the name of the file.

SDI extracts and stores the schema once the file is uploaded. The user can search for a schema created by SDI using the data tag, schema name, and source name. Multiple schemas can be retrieved by passing an empty list, or by passing a filtered list based on the above parameters.

SDI currently recognizes UTC format dates (for example, 2020-02-15T04:46:13Z) and W3C format dates (for example, 2020-10-15T04:46:13+00:00), if present in the raw data files. They will be identified as timestamp data type in the resulting schema.
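
To check in advance whether values in a raw file follow one of these two formats, a local parse can help. This sketch uses Python's standard datetime parsing, whose %z directive accepts both "Z" and "+00:00" style offsets (Python 3.7 or later). It illustrates the two formats only and is not SDI's own detection logic.

```python
from datetime import datetime, timezone

# One format string covers both examples, because %z accepts "Z" and "+hh:mm".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_timestamp(value):
    """Return an aware datetime, or None if the value does not fit the format."""
    try:
        return datetime.strptime(value, TIMESTAMP_FORMAT)
    except ValueError:
        return None


utc_style = parse_timestamp("2020-02-15T04:46:13Z")
w3c_style = parse_timestamp("2020-10-15T04:46:13+00:00")
other = parse_timestamp("15/02/2020 04:46")
print(utc_style, w3c_style, other)
```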

Search Schema

This allows the user to search the schema based on data tag, schema name or source name. The search schema array must contain identical elements in the search criteria.

POST Method: /searchSchemas

SDI follows a schema-on-read philosophy: users do not need to know the schema before data is ingested into MindSphere SDI. SDI can infer or extract a schema, consisting of attributes and data types, from ingested data and store it specific to the tenant and data tag. The Data Ingest Service API supports XML, JSON, and CSV as input data file formats. For a file in XML format, the customer can provide the root element to be processed.

Limitations

  1. The number of data sources that can be registered depends upon the offering plan subscribed by the user tenant.
  2. Data Ingest POST method supports files up to 100 MB in size.
  3. SDI allows a maximum ingest rate of 70 MBPS.
  4. The schema is available for search only after the ingest job has finished successfully.
  5. Users can create a maximum of 500 registries per tenant.
  6. Users can create a maximum of 200 custom data types per tenant and each data type cannot contain more than 10 regular expression patterns.
  7. Search schema and infer ontology requests are limited to 20 entries per search.
  8. The response of the search schema will contain a maximum of 20 last ingested original file names.
  9. The SDI supports a maximum of 250 properties per schema.
  10. Input CSV files must use a comma (,) as the delimiter. Other delimiters, such as semicolon or pipe, are not accepted.
  11. For input files in JSON format, key names should not contain special characters such as dot, space, comma, semicolon, curly braces, brackets, new line, or tab.
  12. Schema evolution supports incompatible data type changes for up to 5000 records.
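
Some of these limits can be checked on the client side before a file is uploaded. The sketch below covers the 100 MB upload limit and the JSON key name restriction. It rests on assumptions: 1 MB is taken as 1024 * 1024 bytes, the character set is limited to the characters named in item 11 (the original list ends with "etc.", so it may be incomplete), and "brackets" is read as both square and round brackets. The helper names and the sample document are hypothetical.

```python
import re

# Assumption: 1 MB = 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024

# Dot, whitespace (space, new line, tab), comma, semicolon, braces, brackets.
FORBIDDEN_KEY_CHARS = re.compile(r"[.\s,;{}\[\]()]")


def exceeds_upload_limit(size_bytes):
    """Return True if a file of this size is over the upload limit."""
    return size_bytes > MAX_UPLOAD_BYTES


def invalid_json_keys(document):
    """Recursively collect key names that contain a forbidden character."""
    bad = []
    if isinstance(document, dict):
        for key, value in document.items():
            if FORBIDDEN_KEY_CHARS.search(key):
                bad.append(key)
            bad.extend(invalid_json_keys(value))
    elif isinstance(document, list):
        for item in document:
            bad.extend(invalid_json_keys(item))
    return bad


# Hypothetical document, for illustration only.
sample = {"order id": 1, "items": [{"unit.price": 2, "qty": 3}]}
problems = invalid_json_keys(sample)
print(problems)
```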


Except where otherwise noted, content on this site is licensed under the MindSphere Development License Agreement.