Custom object format
Constructor cannot cover every object format that you may want to store in the data catalog. You can implement your own format, but note that the web viewer may not be able to display it properly.
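To store objects of your format in a DataFrame, implement a class inherited from DataObject with these mandatory methods:
- load and serialize, which read the object from the data catalog and convert it to bytes for saving
- from_native, which creates the object from its native value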
An example that implements a numpy matrix as a data catalog object is provided below:
- NumpyMatrixObject is inherited from DataObject. The @object_manager.add_object decorator is used to register this type to the SDK.
```python
from typing import TYPE_CHECKING, ClassVar, Optional

import numpy

from research_sdk.structures import DataObject
from research_sdk.structures.manager import object_manager

if TYPE_CHECKING:
    from research_sdk.storage.base import DataStorageInterface, DataStorageMeta


@object_manager.add_object
class NumpyMatrixObject(DataObject):
    object_type: ClassVar[str] = "numpy_matrix/binary"
    _native: ClassVar[Optional["numpy.matrix"]]

    @classmethod
    def load(cls, storage: "DataStorageInterface", meta: "DataStorageMeta") -> "NumpyMatrixObject":
        raise NotImplementedError

    def serialize(self) -> bytes:
        raise NotImplementedError

    @classmethod
    def from_native(cls, native: numpy.matrix) -> "NumpyMatrixObject":
        raise NotImplementedError
```
- The mime type is defined in the object_type class variable.
```python
object_type: ClassVar[str] = "numpy_matrix/binary"
```
| Information | There is no way to check whether this type already exists in the data catalog. |
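To illustrate why the class is registered under its mime type, here is a minimal sketch of a decorator-based registry. It does not show how research_sdk implements object_manager. The names DataObjectSketch, ObjectManagerSketch, TextObjectSketch, resolve and the mime type "text_sketch/plain" are all hypothetical and exist only for this illustration.

```python
from typing import ClassVar, Dict, Type


class DataObjectSketch:
    """Hypothetical stand-in for the SDK's DataObject base class."""

    object_type: ClassVar[str] = ""

    @classmethod
    def load(cls, raw: bytes) -> "DataObjectSketch":
        raise NotImplementedError


class ObjectManagerSketch:
    """Hypothetical registry that maps a mime type to the class handling it."""

    def __init__(self) -> None:
        self._registry: Dict[str, Type[DataObjectSketch]] = {}

    def add_object(self, cls: Type[DataObjectSketch]) -> Type[DataObjectSketch]:
        # Used as a class decorator: remember the class, then return it unchanged.
        self._registry[cls.object_type] = cls
        return cls

    def resolve(self, object_type: str) -> Type[DataObjectSketch]:
        # Look up the class registered for a mime type found in stored metadata.
        return self._registry[object_type]


manager = ObjectManagerSketch()


@manager.add_object
class TextObjectSketch(DataObjectSketch):
    object_type: ClassVar[str] = "text_sketch/plain"

    def __init__(self, text: str) -> None:
        self.text = text

    @classmethod
    def load(cls, raw: bytes) -> "TextObjectSketch":
        return cls(raw.decode("utf-8"))


# Dispatch on the mime type, as a loader might do after reading metadata.
handler = manager.resolve("text_sketch/plain")
loaded = handler.load(b"hello")
```

In this sketch the mime type is the only link between stored data and the class that can load it, which is why an unregistered class would never be chosen during loading.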
The class has one more property, which returns the _native class variable. Overriding it in your class is not recommended, as there is no need to.
You must implement the three methods yourself. The steps below describe what each one does.
- To define a structured storage format, the SDK provides special helpers for dataclasses.
```python
from dataclasses import dataclass

from research_sdk.serializers import bytes_to_object, object_to_bytes
from research_sdk.serializers.serializer import DataclassSerializerBase, custom_encoder_manager


@dataclass
class NumpyMatrixStorage:
    data_type: str
    shape: list
    data: bytes


@custom_encoder_manager.add_serializer
class NumpyMatrixStorageSerializer(DataclassSerializerBase):
    data_class = NumpyMatrixStorage
```
The NumpyMatrixStorage defines the structure of an object in the Data Catalog, where
- data_type and shape are the attributes of the numpy matrix.
- data is the content of the matrix as raw bytes.
| Information | It is possible to create custom serialization for non-primitive types, but that is outside the scope of this documentation. Use simple Python types in such dataclasses. We support int, float, bool, str and bytes. |
NumpyMatrixStorageSerializer is the serializer class for the NumpyMatrixStorage dataclass. It is registered with the @custom_encoder_manager.add_serializer decorator to enable proper serialization and deserialization of the data.
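The round trip that such a serializer enables can be sketched with the standard library alone. The encoding actually used by object_to_bytes and bytes_to_object is not described here, so the JSON plus base64 format below is purely an assumption, and MatrixStorageSketch, to_bytes_sketch and from_bytes_sketch are hypothetical names.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class MatrixStorageSketch:
    data_type: str
    shape: list
    data: bytes


def to_bytes_sketch(obj: MatrixStorageSketch) -> bytes:
    """Hypothetical stand-in for object_to_bytes: dataclass -> bytes."""
    fields = asdict(obj)
    # JSON has no bytes type, so the payload is base64-encoded as text.
    fields["data"] = base64.b64encode(obj.data).decode("ascii")
    return json.dumps(fields).encode("utf-8")


def from_bytes_sketch(raw: bytes) -> Dict[str, Any]:
    """Hypothetical stand-in for bytes_to_object: bytes -> dict of fields."""
    fields = json.loads(raw.decode("utf-8"))
    fields["data"] = base64.b64decode(fields["data"])
    return fields


original = MatrixStorageSketch(data_type="float64", shape=[2, 2], data=b"\x00" * 32)
# The decoder returns a plain dict, so the dataclass is rebuilt with ** unpacking.
restored = MatrixStorageSketch(**from_bytes_sketch(to_bytes_sketch(original)))
```

Because the decoder returns a dict keyed by field name, the dataclass fields should stay simple types that survive the encoding unchanged.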
- Define the load method. It loads an object from the data catalog and is called by the SDK when it detects your mime type during the object loading process.
```python
@classmethod
def load(cls, storage: "DataStorageInterface", meta: "DataStorageMeta") -> "NumpyMatrixObject":
    # Read the raw bytes of the stored object and map them onto the dataclass.
    serialized = storage.read_object_raw(meta.id)
    obj = NumpyMatrixStorage(**bytes_to_object(serialized))
    return cls(
        name=meta.name,
        description=meta.description,
        native=numpy.frombuffer(obj.data, dtype=obj.data_type).reshape(obj.shape),
        object_id=meta.id,
    )
```
The read_object_raw function of the storage returns the object as bytes. To deserialize the bytes to a dict and map it onto the proper fields, use the bytes_to_object function.
After that, you need to instantiate the NumpyMatrixObject class with the loaded data, properly filling the native parameter.
| Information | IMPORTANT: Pass object_id to the constructor at this step. The object_id is the unique identifier of the object in the data catalog. |
- Implement the from_native method. It creates a DataObject from a native object stored in a table cell when the object name is unknown, so it is recommended to set a meaningful name and description.
The class must be registered so that data with your mime type is handled properly when it is loaded from the data catalog.
```python
@classmethod
def from_native(cls, native: numpy.matrix) -> "NumpyMatrixObject":
    return cls(
        name=f"Matrix {native.shape}",
        native=native,
    )
```
- Implement the serialize method to convert the object to bytes.
```python
def serialize(self) -> bytes:
    serialized = NumpyMatrixStorage(
        data_type=self._native.dtype.name,
        # numpy returns the shape as a tuple; convert it to match the declared list field.
        shape=list(self._native.shape),
        data=self._native.tobytes(),
    )
    return object_to_bytes(serialized)
```
Fill the NumpyMatrixStorage dataclass and then call the object_to_bytes function to convert the dataclass to bytes. The SDK calls serialize when it saves an object.
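The numpy side of serialize and load can be checked on its own, without the SDK. The sketch below stores the same three pieces (dtype name, shape, raw buffer) and rebuilds the array from them. The variable names are illustrative only.

```python
import numpy

# A small matrix standing in for the object's native value.
matrix = numpy.matrix([[1.0, 2.0], [3.0, 4.0]])

# The three pieces stored by serialize(): dtype name, shape, raw buffer.
data_type = matrix.dtype.name
shape = list(matrix.shape)
data = matrix.tobytes()

# The reconstruction used in load(): reinterpret the buffer, then reshape.
restored = numpy.frombuffer(data, dtype=data_type).reshape(shape)

# frombuffer gives a read-only ndarray over the bytes; copy it and wrap it
# if a writable numpy.matrix is needed.
as_matrix = numpy.asmatrix(restored.copy())
```

Two details are worth keeping in mind. The result of numpy.frombuffer is a plain, read-only ndarray rather than a numpy.matrix. In addition, dtype.name does not record byte order, so this approach assumes that the data is written and read with the same byte order.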