How to use it
DAV consists of the three aforementioned subcomponents. The installation of each one is presented below:
(Pre)Processing & Filtering Software (PaFS)
PaFS's main code is written in Python and, more specifically, as a PySpark (Python Spark) job. For the code to run as a Spark job in a Spark environment, some steps must be completed. Those steps are as follows:
- Make sure that an Apache Spark cluster has been set up on the physical machine(s) that PaFS will run upon.
- Download all the necessary .jar files / libraries (available at DAV’s repository) and put them in the <user_path>/spark/jars folder.
- Repeat the step above on all the nodes / workers of the Spark cluster. This is useful because the user might need to change the roles of the machines inside the cluster, using different masters and workers at times. Whenever the PaFS PySpark job is moved to the current master, it will work immediately, since the .jar files will already be present.
- Submit the PaFS’s .py file as a Spark job to the master.
- Check that PaFS is running by examining its Spark job logs; an indicative submission command is sketched below
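As an indication, and assuming a standalone Spark master reachable at spark://<master-host>:7077 with the PaFS job saved as pafs_job.py (both names are placeholders used here for illustration only), the submission could look like this:

spark-submit --master spark://<master-host>:7077 pafs_job.py

The job should then appear as RUNNING in the Spark master web UI (port 8080 by default in standalone mode), where its logs can also be inspected.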
Virtual Data Repository (VDR)
As mentioned earlier, the construction / implementation steps given here are for a local cluster system with physical machines / servers. This means that these steps are not for an NFS server, or any implementation of that kind:
- Make sure that Kubernetes points at DAV’s namespace (as always)
- Create a storageclass.yaml file, defining the storage class that shall be used by the MongoDB – VDR
- Define multiple persistent volumes inside a persistentvolumes.yaml file. For that to work properly, the volumes must have already been created at /mnt/disk/. The created volumes (such as vol1, vol2, vol3 etc.) inside the disk folder will be referenced through the .yaml file as persistent volumes available for use by Mongo – VDR. Note that there is no need to create persistent volume claims (PVCs), since these will be created automatically by the statefulSet.yaml file later on
- Create a MongoDB “headless” service .yaml file, for Mongo – VDR to function as a service inside the Kubernetes cluster (an indicative sketch is given after this list)
- Create the last .yaml file needed, the statefulSet one. This is where the critical details of the MongoDB deployment are set, such as the number of replicas available after launch
- Apply all the .yaml files above, in the same order as created, to the Kubernetes cluster (kubectl create -f YAMLFILENAME)
- Create a temporary mongo shell as a Kubernetes service, in order to enter the newly created MongoDB instance and configure it properly.
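As a minimal sketch of the “headless” service mentioned above, and assuming the MongoDB pods carry the label app: mongo and the DAV namespace is called dav (both hypothetical names that must be replaced with the actual ones), the .yaml file could look roughly like this:

apiVersion: v1
kind: Service
metadata:
  name: mongo            # hypothetical service name
  namespace: dav         # hypothetical DAV namespace
  labels:
    app: mongo
spec:
  clusterIP: None        # "headless": no cluster IP, the pods are addressed individually
  ports:
    - port: 27017        # default MongoDB port
  selector:
    app: mongo           # must match the pod labels defined in the statefulSet

The statefulSet .yaml then points to this service through its serviceName field and defines the number of MongoDB replicas in spec.replicas.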
Virtual Data Container (VDC)
First and foremost, VDC is intended to function on a Nifi cluster and on a Spark cluster. Also, the VDR and PaFS subcomponents must be built first. For Nifi, we provide a docker image (iccs/vdc:1.0) and a docker-compose.yml template file in order to make the installation as easy as possible. After setting all the desired parameters in the docker-compose.yml, open a terminal in its directory and run the following command:
sudo docker-compose up -d
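For reference, the docker-compose.yml mentioned above could look roughly like the minimal sketch below; the port mapping and the environment variable name are illustrative assumptions only, as the actual keys are defined in the template file provided in DAV’s repository:

version: "3"
services:
  vdc:
    image: iccs/vdc:1.0
    container_name: vdc
    ports:
      - "8443:8443"                 # hypothetical mapping of the Nifi UI port
    environment:
      - NIFI_WEB_HTTPS_PORT=8443    # illustrative parameter; the template defines the exact keys
    restart: unless-stopped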
After a while, a new vdc container will be created and the Nifi UI page will be available. When this happens, the system admin must log in using their credentials at https://nifi_web_host:nifi_web_port/nifi/. By dragging and dropping from the template button on the menu bar, they can instantiate any of the three preinstalled APIs:
- vdc_metadata_api
- vdc_with_spark
- ack_vdc_api

Next comes the activation of the various controller services for each API. This can be done by right-clicking on the API workflow and selecting “Configure”. In the controller services tab, the admin must click the little thunder (lightning) button to enable each service. Finally, by right-clicking on each workflow API, the admin may start or stop it. Creating Nifi components from ready-made templates is very easy; instructions can be found here: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates. Nifi’s powerful UI makes the installation process very straightforward. WARNING: Some controller services require secret credentials which must be set by the admin in the properties of the service. An example of this type of controller service is the MongoDBControllerService, which requires the username and password of the MongoDB user.
Regarding VDC, its communication and request handling operations are achieved through Apache Nifi, which is used to create the data flows that allow DAV to expose the cleaned and processed data to the data consumers. To this end, the figure below presents the flow developed to implement the VDC Rules System (data processing and filtering based on user-defined rules), which is analysed in the following subsection. As can be seen in that figure, the actual processing, filtering, and transformation of the data is performed by Spark. The following bullets briefly describe the functionality executed by each part of the depicted flow:
- VDC API -> The HTTP POST endpoint for the VDC API
- PARSE JSON REQUEST DATA AND GENERATE SPARK JOB SPECIFICATIONS -> Parse the input data (rules file, desired output data format, etc.) and create the corresponding Spark Job request
- INVOKE SPARK JOB AND SEND ACK -> Send the request to the Spark Listener and also send an acknowledgement message to the user who made the request
- HANDLING ERRORS -> Generate an HTTP response in case something goes wrong during the processing of the input data (e.g., errors in the structure of the rules file)
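Purely as an illustration of how such a request could be issued, and assuming the VDC API endpoint is exposed at a hypothetical address http://vdc_host:vdc_port/vdc with the rules stored in a local rules.json file (all of these names are placeholders), the call could look like this:

# All names below (host, port, path, file name) are placeholders for illustration only
curl -X POST "http://vdc_host:vdc_port/vdc" -H "Content-Type: application/json" --data-binary @rules.json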
Correspondingly, the second figure below shows the metadata extraction flow. The different parts of the flow implement the following functionalities:
- METADATA API LISTENER -> The HTTP GET endpoint for the datasets’ metadata API
- COMMUNICATING WITH MongoDB -> Run the appropriate MongoDB queries for collecting the datasets’ statistics
- PARSING DATASETS METADATA AND SENDING AS JSON RESPONSE -> Parse the collected metadata into the format of the API specification and send an HTTP response
- HANDLING ERRORS -> Generate an HTTP response in case something goes wrong during the execution (runtime errors such as collection or DB not found)
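Similarly, and again with purely hypothetical host, port and path values, the metadata endpoint could be queried with a plain HTTP GET:

# Placeholder address; the actual host, port and path depend on the deployed flow
curl "http://vdc_host:vdc_port/metadata"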
A very important feature of the Virtual Data Container (VDC) is the rules system. This rule structure is very simple. The three core elements of each rule are a “subject column”, an “operator” and the “object”. The expected rules list format is a JSON Array, which shall include rules (JSON Objects), containing those string values. VDC parses this list and applies the rules to the requested dataset. This is the architecture of the incoming rules JSON file:
- A JSON Object, containing
- A string field with the dataset’s name, and another one with the dataset’s id
- A JSON array containing the rules as JSON Objects
- Each JSON Object (rule) in the array shall contain a string field with its name, and another JSON Object with the rule itself
- Each rule Object shall include the “subject_column”, “operator” and “object” fields
- In case the “operator” is a disjunction, meaning the “or” expression, the “object” field shall be a JSON Array containing two (or more) objects, each with single string “operator” and “object” fields
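To make the structure above more concrete, the sketch below shows an indicative rules file; the exact key names (“dataset_name”, “dataset_id”, “rules”, “rule_name”, “rule”) and all values in angle brackets are illustrative assumptions rather than the official specification:

{
  "dataset_name": "<dataset_name>",
  "dataset_id": "<dataset_id>",
  "rules": [
    {
      "rule_name": "example_simple_rule",
      "rule": {
        "subject_column": "<column_name>",
        "operator": ">=",
        "object": "<threshold_value>"
      }
    },
    {
      "rule_name": "example_disjunction_rule",
      "rule": {
        "subject_column": "<column_name>",
        "operator": "or",
        "object": [
          { "operator": "==", "object": "<first_value>" },
          { "operator": "==", "object": "<second_value>" }
        ]
      }
    }
  ]
}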
The figure above indicates all the accepted operators that can be used by the rules’ author (data scientist, application developer, end-user etc.). A new addition is that of “--” and “++”, for deleting or keeping selected columns, respectively. Any other operator is not recognized by the VDC; therefore, any rule object containing unknown operators is discarded. In addition, there are two main principles on which the rules system is based. These principles enable a user to better understand the rules’ nature:
- A rule’s main goal is to apply filters to one subject (column) at a time, not to combine subjects (columns). If more than one column is involved as a subject in one rule, then this step could be regarded more as “pre-processing” and less as “filtering”. Moreover, changing the content of specific rows / values, or removing rows with a specific value type, is also a step at a lower level than what the rules system suggests
- DAV is, by its nature, a generic framework, meaning that it can be used for all kinds of datasets. Making a rule more complex than the standard “subject - operator - object” architecture simply violates DAV’s generic nature. It is relatively easy and simple to implement basic pre-processing steps for selected datasets; it usually does not take more than a couple of lines of code. However, these kinds of pre-processing steps would not be applicable to other datasets, given that any kind of dataset can enter DAV. Therefore, the data scientists would have to resort to a conditional solution, such as “if the incoming dataset is X, then apply these selected lines of code”. This is a very easy solution, but it ruins DAV’s fundamental generic nature
Here is the list of the accepted operators, explained:
- “++” : when reading a record of data, it creates a subset of this record consisting of only the columns whose names are in the object field; the value of the subject_column field must be a valid column name for the specified dataset, but is otherwise irrelevant; the format of the object field value is: “<column1_name>|<column2_name>|…”
- “--” : when reading a record of data, it creates a subset of this record by excluding the columns whose names are in the object field and keeping all other columns; the value of the subject_column field must be a valid column name for the specified dataset, but is otherwise irrelevant; the format of the object field value is: “<column1_name>|<column2_name>|…”
- “<=”: when reading data, it keeps the records whose subject_column contains values less than or equal to the one given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “>=”: when reading a dataset, it keeps the records whose subject_column contains values greater than or equal to the one given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “<”: when reading data, it keeps the records whose subject_column contains values less than the one given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “>” : when reading a dataset, it keeps the records whose subject_column contains values greater than the one given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “==” : when reading data, it keeps those records whose subject_column has the value given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “!=”: when reading a record of data, it keeps those records whose subject_column does not have the value given in the object field; the value of the subject_column field must be a valid column name for the specified dataset; the format of the object field value is: “<string_value>”
- “or”: this disjunction operator takes an array containing two additional operators and two object fields, all corresponding to the one given subject_column. It is provided to apply two filter rules at the same time to that subject_column. Inside the array, one should create two objects, each containing one operator and one object field.
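As an indication of how the “++” column-selection operator described above is written in practice, a rule keeping only two columns could look like the following sketch (the rule name, the subject_column value and the column names are placeholders; per the description above, the subject_column only has to be a valid column of the dataset):

{
  "rule_name": "keep_selected_columns",
  "rule": {
    "subject_column": "<any_valid_column>",
    "operator": "++",
    "object": "<column1_name>|<column2_name>"
  }
}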
An indicative example of the rules system’s usability is derived from the specific filtering actions that were requested by the Automatic Model Training Engine to be applied to the Valencia PCS traffic dataset. Those actions are presented in the figure below:
Regarding the first two actions for the “Arrival” variable (column), these should be implemented as a rule with disjunction, meaning “or”. The JSON object of this rule (containing the two “Arrival” column actions) should be written according to the following example:
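An indicative sketch of such a rule object is given below; since the exact “Arrival” actions come from the figure, the operators and values are left as placeholders, and the key names follow the illustrative structure used earlier:

{
  "rule_name": "arrival_rule",
  "rule": {
    "subject_column": "Arrival",
    "operator": "or",
    "object": [
      { "operator": "<first_operator>", "object": "<first_value>" },
      { "operator": "<second_operator>", "object": "<second_value>" }
    ]
  }
}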
Regarding the third action, related to the “Status” variable (column), the rule’s author should use the “not equal to” operator, meaning “!=“. The JSON object of this rule should be written according to the following example:
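An indicative sketch, with the excluded “Status” value left as a placeholder and the key names again being illustrative:

{
  "rule_name": "status_rule",
  "rule": {
    "subject_column": "Status",
    "operator": "!=",
    "object": "<excluded_status_value>"
  }
}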
All three remaining actions are related to the “equal to” operator, meaning “==“. Therefore, the proper way to write down the corresponding rules JSON file should be (taking the “Regular line” column as an example):
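An indicative sketch for the “Regular line” case, again with a placeholder value and illustrative key names:

{
  "rule_name": "regular_line_rule",
  "rule": {
    "subject_column": "Regular line",
    "operator": "==",
    "object": "<desired_value>"
  }
}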
In conclusion, the goal is to assist any potential DataPorts user in defining a JSON Array of a dataset’s actions / rules with ease. For that reason, its structure is kept simple, and the principles behind the rules’ architecture are sound. Consequently, those filtering rules can be specified not only by data scientists, but also by users with limited or no technical background within a ports ecosystem, who are interested in exploiting DataPorts cognitive services. The rules system can be applied to all kinds of datasets, based on their cleaning & filtering needs.