Abstract

The creation of new knowledge by manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Of the governing Vs of Big Data, namely Volume, Velocity, Variety, Veracity, Validity, Volatility and Value, the first three are considered the primary ones. Most Big Data research effort has focused on Volume and Velocity, while Variety, “the ugly duckling” of Big Data, is often neglected and difficult to address. A principal challenge posed by Variety is understanding and comprehending the data well enough to gain insight from it. Organisations have been investing in analytics that rely on internal and external data to gain a competitive advantage. However, legal and regulatory acts imposed nationally and internationally have become a challenge. The proposed approach centres on self-learning systems that enable automatic compliance checking of data against regulatory requirements, along with the capability to generate valuable, readily usable metadata for data classification. For data confidentiality, a framework that combines algorithmic classification with workflow capabilities is proposed. Such a rule-based system, implementing the corporate data classification policy, will minimise the risk of exposure by helping users identify the approved guidelines and enforce them quickly. Two experiments, on confidential data identification and on data characterisation, were conducted to evaluate the feasibility of the approach. Their focus was to confirm that repetitive manual tasks can be automated, reducing the effort a Data Scientist must spend on data identification and freeing more time for the extraction and analysis of the data itself. In addition, a survey featuring and evaluating a working prototype was conducted with subject matter experts, a diverse audience of academics and senior business executives in the fields of security and data management. The proof of concept showcased the model’s capabilities and gave experts hands-on experience to better understand the proposal. The experimental work confirmed that: a) the use of algorithmic techniques contributed to a substantial decrease in false positives in the identification of confidential information; b) a fraction of a data set, combined with statistical analysis and supervised learning, is sufficient to identify the structure of the information within it; c) the model for corporate confidentiality is viable, and the proposed features of the system are of value. With this proposal, the difficulty of understanding the nature of data can be mitigated, enabling a greater focus on the meaningful interpretation of heterogeneous data, while organisations can secure their data and ensure confidentiality and compliance.

Keywords

Big Data, Variety, Booster Metrics, Data Characterisation, Data Confidentiality, Data Loss Prevention

Document Type

Thesis

Publication Date

2022
