Getting Started
The Data Abstraction & Virtualization (DAV) component is part of the “internal” DataPorts architecture; therefore, no UI is implemented for it. Below are the starting guidelines for each subcomponent:
(Pre)Processing & Filtering Software (PaFS)

The (Pre)Processing & Filtering Software (PaFS) performs the preprocessing, cleaning and filtering of every incoming dataset. It is a subcomponent that receives data streams through a GET REST API. All kinds of data can be accepted, since PaFS (and DAV in general) is generic with regard to the data it receives and handles. It preprocesses the datasets by fully collecting them, transforming them into an Apache Spark / Python friendly format and ensuring a proper column–row structure. It cleans the datasets by detecting “dirty” values (such as NaNs, empty fields, outliers and wrong values). It filters the datasets by eliminating all the dirty values found, either by replacing them or by removing them (along with their rows in the dataset / dataframe), as sketched in the example below. Before any further information, it should be noted that PaFS is intended to run as an Apache Spark job; therefore, an Apache Spark cluster should already be deployed (this implementation has been tested with Apache Spark version 3.0.1). Also, the VDR subcomponent must be built first.
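The following is a minimal PySpark sketch of the kind of cleaning and filtering step described above. The column names, example values and outlier threshold are hypothetical and do not reflect PaFS’s actual configuration.

```python
# Minimal sketch of a PaFS-style cleaning / filtering step.
# Column names ("id", "value") and the outlier threshold are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pafs-cleaning-sketch").getOrCreate()

# Assume the incoming data stream has already been collected into a DataFrame.
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, 10000.0), (4, 12.5)],
    ["id", "value"],
)

# Detect "dirty" values: nulls / NaNs and simple outliers above a threshold,
# and remove the affected rows.
cleaned = df.filter(F.col("value").isNotNull() & (F.col("value") < 1000.0))

# Alternatively, replace dirty values instead of dropping the rows.
replaced = df.fillna({"value": 0.0})

cleaned.show()
spark.stop()
```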
Virtual Data Repository (VDR)

The Virtual Data Repository (VDR) is the place where all the pre-processed, cleaned and filtered datasets coming from PaFS are saved. It can be seen as a data lake containing all the pre-processed datasets. After a dataset has passed through all the PaFS functions, it is stored in VDR (along with its columns’ correlation matrix). VDR consists of a MongoDB instance, carrying modifications and custom parameters in order to comply with DAV’s efficiency standards. An important note is that the following construction / implementation steps are for a local cluster system with physical machines / servers. Furthermore, VDR is the subcomponent that has to be built first. It is intended to run on a Kubernetes cluster; therefore, a Kubernetes cluster should already be deployed.
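As an illustration, the sketch below shows how a pre-processed dataset and its correlation matrix could be written to a MongoDB instance. The connection URI, database and collection names (“vdr”, “datasets”) and document layout are assumptions for the example, not VDR’s actual schema.

```python
# Hypothetical sketch of storing a PaFS output dataset in VDR's MongoDB.
import pandas as pd
from pymongo import MongoClient

# Assumed MongoDB service exposed by the Kubernetes cluster.
client = MongoClient("mongodb://localhost:27017")
db = client["vdr"]

# Pre-processed dataset produced by PaFS (illustrative data).
df = pd.DataFrame({"temperature": [21.0, 22.5, 23.1], "humidity": [40, 42, 45]})

document = {
    "name": "example_dataset",
    "records": df.to_dict(orient="records"),
    # The columns' correlation matrix is stored alongside the dataset,
    # as described above.
    "correlation_matrix": df.corr().to_dict(),
}

db["datasets"].insert_one(document)
```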
Virtual Data Container (VDC)

The Virtual Data Container (VDC) is the layer between DAV and any potential data recipient. It is the agent through which communication with data recipients is achieved, so that the data stored in VDR can be made available. It is a generic subcomponent whose role is to further process and filter the data by applying specific filtering rules defined by the data consumers via HTTP POST requests. Through these requests, the data consumer also defines the format in which they want to receive the data (data transformation). Furthermore, VDC is responsible for exposing useful metadata (size, number of rows and variables, timestamp of last update, etc.) for each dataset stored in VDR. This metadata is available via a RESTful API. Before any further information, it should be noted that VDC is intended to run on a NiFi cluster and on a Spark cluster. Also, the VDR and PaFS subcomponents must be built first.
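The sketch below illustrates how a data consumer might interact with VDC over HTTP: one POST request defining filtering rules and the desired output format, and one GET request for a dataset’s metadata. The host, endpoint paths and payload fields are hypothetical; consult the VDC API documentation for the actual contract.

```python
# Hypothetical sketch of a data consumer's interaction with VDC.
import requests

VDC_URL = "http://vdc.example.org:8080"  # assumed VDC host

# Define filtering rules and the desired output format via an HTTP POST request.
payload = {
    "dataset": "example_dataset",
    "filters": [{"column": "temperature", "operator": ">", "value": 20.0}],
    "format": "csv",
}
response = requests.post(f"{VDC_URL}/query", json=payload, timeout=30)
print(response.status_code, response.text[:200])

# Retrieve the exposed metadata (size, rows, variables, last update) for a dataset.
metadata = requests.get(f"{VDC_URL}/metadata/example_dataset", timeout=30).json()
print(metadata)
```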