
L. Bellatreche and F. Mota Pinto (Eds.): MEDI 2011, LNCS 6918, pp. 71–84, 2011. © Springer-Verlag Berlin Heidelberg 2011

Securing Data Warehouses: A Semi-automatic Approach for Inference Prevention at the Design Level

Salah Triki¹, Hanene Ben-Abdallah¹, Nouria Harbi², and Omar Boussaid²

¹ Laboratoire Mir@cl, Département d’Informatique, Faculté des Sciences Economiques et de Gestion de Sfax, Tunisie, Route de l’Aéroport Km 4 – 3018 Sfax, BP. 1088
{Salah.Triki,Hanene.BenAbdallah}@Fsegs.rnu.tn

² Laboratoire ERIC, Université Lyon 2, 5 avenue P. Mendès France 69676 Bron, Cedex, France
{Nouria.Harbi,Omar.Boussaid}@univ-lyon2.fr

Abstract. Data warehouses contain sensitive data that must be secured in two ways: by defining appropriate access rights for the users and by preventing potential data inferences. Inspired by development methods for information systems, the first way of securing a data warehouse has been treated in the literature during the early phases of the development cycle. However, despite the high risks of inferences, the second way is not sufficiently taken into account in the design phase; it is rather left to the administrator of the data warehouse. Yet managing inferences during the exploitation phase may induce high maintenance costs and complex OLAP server administration. In this paper, we propose an approach that, starting from the conceptual model of the data sources, assists the designer of the data warehouse in identifying multidimensional sensitive data and those that may be subject to inferences.

Keywords: Data warehouse, Security, Precise Inference, Partial inference.

1 Introduction

Organizations have a significant amount of data that can be analyzed to identify trends, examine the effectiveness of their activities, and make decisions to increase their profits. By gathering and consolidating data drawn from the organization’s information system, a data warehouse (DW) allows decision makers to perform decision analyses and financial forecasts. In fact, several tools dedicated to data warehousing offer various operations for OnLine Analytical Processing (OLAP), assisting users in the decision analysis process.

On the other hand, data in an organization’s DW are proprietary and sensitive and should not be accessed without controls. Indeed, some data, like medical data or religious or ideological beliefs, are personal and may harm their owners if disclosed. For this reason, several governments have passed laws for the protection of citizens’ private lives. Among these laws, HIPAA¹ (Health Insurance Portability and Accountability Act) aims to protect patient medical data by forcing American health care establishments to follow strict safety rules. Similarly, GLBA² (Gramm-Leach-Bliley Act) requires U.S. financial institutions to protect customer data; Safe Harbor³ allows conforming companies to transfer and use data on European Internet users; and the Sarbanes-Oxley⁴ Act guarantees the reliability of corporate financial data. Organizations must apply strict safety rules to comply with these laws; otherwise they are penalized.

Securing a DW is a twofold task. The first task fixes the access rights of the DW users. Similar to information systems, this security task can be treated at a conceptual or logical level; the fixed access rights are enforced by the OLAP server. The second security task seeks to prevent malicious users from inferring prohibited information through permitted accesses. There are two types of inferences: precise inferences, where the exact data values are deduced, and partial inferences, where data values are partially disclosed. Inference prevention at the design level reduces the administration and maintenance costs of OLAP servers. Despite this, inference prevention at the design level has not received enough interest from researchers.

The aim of this paper is to propose an approach to model the prevention of inferences using the data source design represented as a class diagram. Our approach has two advantages over existing approaches. The first advantage is its genericity since it is applicable to any business domain. The second advantage is that it takes into account the data available to the malicious user to detect inferences; the majority of inference cases are produced by combining available data.

The remainder of the paper is organized as follows: in Section 2, we present a state of the art in the DW security domain at the requirement and design levels. In Section 3, we detail our approach. Section 4 presents an example illustrating the use of our approach. Finally, Section 5 summarizes the work done and outlines our work in progress.

2 Related Work

The need for securing DW was felt long ago [1] [2]. Several proposed approaches tackled the DW security problem at the requirement, design or logical levels.

At the requirement level, [3] propose a profile based on i* and an approach that can model the security requirements. The proposed profile takes into account the RBAC ("Role Based Access Control") model and the MAC ("Mandatory Access Control") model. Thus, for each data item to be protected, a security class must be defined in terms of security role, security level and compartment. Using this profile, the proposed approach to model security requirements operates in three stages: i) analyzing the rules and privacy policies that exist in the organization; ii) interviewing the personnel in charge of security to define the data to be secured; and iii) assigning a security class to each data item. This approach is informally presented.

1 http://www.hhs.gov/ocr/privacy/index.html
2 http://www.gpo.gov/fdsys/pkg/PLAW-106publ102/content-detail.html
3 http://www.export.gov/safeharbor/
4 http://www.soxlaw.com/


At design level, several studies have been carried out. [9] propose a UML profile for modeling security and extensions of OCL (Object Constraint Language) to specify security constraints. The UML profile, called SECDW (Secure Data Warehouse), includes new types, stereotypes and tagged values to model the RBAC and MAC models. [5] extended SECDW to represent the concept of conflict among multidimensional elements. However, neither work proposed an approach to design a secure DW model.

On the other hand, [6] proposed an approach using the UML state-transition diagram to detect inferences in a DW design. In the state-transition diagram, the states represent the data to display and the transitions represent users’ multidimensional queries. The approach takes into account the possibility of inferences from empty cells in a cube (i.e., unavailable data for measures), without addressing the possibility of inferences from available data. For their part, [7] proposed an approach specific to the field of market research in general and to the particular case of the company GFK [8]. This approach addressed the case of partial and precise inferences. Precise inferences occur when the exact measure values are deduced, while partial inferences occur when “an idea” about the measure values is deduced. In this work, inferences were detected manually by studying the application domain of market research.

At logical level, [4] treated the case of implementation in the multidimensional relational model based on an extension of the CWM (“Common Warehouse Metamodel”). This extension allows the definition of security constraints and audit rules for each element of the relational model. Security constraints allow the implementation of RBAC and MAC, and audit rules can log access attempts to analyze problematic cases.

After our review of the state of the art of DW security, we noticed the following four points:

- Modeling access rights has been treated at the requirement, design and logical levels. Existing works ([3] [4] [5] [9]) offer notations for modeling the MAC and RBAC models. However, the proposed approaches were informally described.

- Prevention of inferences has been widely treated at the physical level ([10] [11] [12] [13] [14]). Handling inferences at this level can induce high administration and maintenance costs.

- Prevention of inferences at the design level has not been sufficiently addressed. The existing works ([6] [7]) do not take into account the potential inferences from the data available and are specific to a particular application domain.

- Existing proposals lack assistance in identifying data from the DW that are potentially subject to inferences.

In this paper, we treat the last two points by proposing at the design level: i) a UML-based language for modeling data potentially subject to inferences, and ii) a semi-automatic approach to identify such data.

3 Proposed Approach

The approach we propose is based on the data sources’ class diagram. In addition, it assumes that the DW schema is already designed and mapped to the data sources.


In fact, our approach fits in and complements the three types of DW design approaches: bottom-up ([15]), top-down ([16]), and mixed ([17]). In all three types of design approaches, once the DW schema is developed, it must be matched with the data sources to indicate the source of the elements that will be used to load each element of the DW; this mapping is vital for the definition of the ETL procedures. In the case of bottom-up and mixed approaches, this mapping is produced by default since the DW schema definition is developed from the data sources. As for the top-down approaches, this mapping is needed to validate the specified DW schema.

Our approach (see Fig. 1) comprises three phases. The first phase, carried out by the security designer, identifies the elements to be protected in the DW design. In the second phase, we first automatically build an inference graph used to detect the elements which may lead to inferences; then, the designer distinguishes the elements that lead to precise inferences from those that lead to partial inferences. In the third phase, we automatically enrich the DW schema model with UML annotations highlighting the elements subject to both types of inferences. Note that in this paper we use the star schema to model the DW schema.

3.1 Definition of Sensitive Data

Given a DW schema, the definition of sensitive data annotates the elements of the multidimensional model. It is made by the DW security designer who may be assisted by an expert in the field. The role of the domain expert is to identify the data to be protected. This data is indicated by annotations with the UML stereotype “Sensitive data” (see Fig. 1).

3.2 Inference Graph Construction

Definition 1: An inference graph is a set of nodes connected by directed arcs. The nodes represent the data (in the source) and the arcs indicate the direction of inference and the inference type (partial/precise).

Graphical notations: An inference graph is graphically composed of:

- Two types of nodes: nodes colored in gray represent the sensitive data, and nodes colored in white represent the non-sensitive data.

- Two types of arcs: dotted arcs indicate partial inferences and solid arcs indicate precise inferences.

Take the healthcare domain as an example: disease (sensitive data), treatment and service are represented by nodes. The arcs between them capture the inferences and their direction (Fig. 2). Knowing the treatment, one can infer the disease. This inference is precise because two different diseases may not have the same treatment, so we have a solid arc from treatment to illness. On the other hand, if each service in a hospital treats a number of diseases, then, knowing the service, one can have an idea about the kind of disease but not its name; this is modeled by the dotted arc from service to illness.
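Definition 1 and the health example can be written down as a small data structure. The following is only a minimal sketch: the names `InferenceGraph`, `InferenceType` and `threats_to` are hypothetical and do not come from the paper, and the representation (dictionaries keyed by node name and by arc) is one possible choice among many.

```python
from dataclasses import dataclass, field
from enum import Enum


class InferenceType(Enum):
    PRECISE = "precise"  # drawn as a solid arc
    PARTIAL = "partial"  # drawn as a dotted arc


@dataclass
class InferenceGraph:
    # node name -> True for sensitive data (gray node), False otherwise (white node)
    nodes: dict = field(default_factory=dict)
    # (source, target) -> inference type; knowing `source` discloses `target`
    arcs: dict = field(default_factory=dict)

    def add_node(self, name, sensitive=False):
        self.nodes[name] = sensitive

    def add_arc(self, source, target, kind):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be declared as nodes first")
        self.arcs[(source, target)] = kind

    def threats_to(self, target):
        """Return {source: inference type} for every arc pointing at `target`."""
        return {s: k for (s, t), k in self.arcs.items() if t == target}


# The health example: Illness is sensitive, Treatment and Service are not.
health = InferenceGraph()
health.add_node("Illness", sensitive=True)
health.add_node("Treatment")
health.add_node("Service")
health.add_arc("Treatment", "Illness", InferenceType.PRECISE)
health.add_arc("Service", "Illness", InferenceType.PARTIAL)
```

Calling `health.threats_to("Illness")` then lists Treatment as a precise threat and Service as a partial one, which mirrors the solid and dotted arcs of the example.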


Fig. 1. Proposed Approach for DW schema security. Steps shown: (1) sensitive data identification; (2.a) inference graph construction; (2.b) partial inference identification; (2.c) calculating the transitive closure; (3) DW schema enrichment. Artifacts shown: DW schema, data sources class diagram, inference graph, enriched DW schema.


Fig. 2. Inference graph: Example

The construction of an inference graph involves the class diagram of the data sources that will load the DW. We use the mapping to prune the inference graph built from the data sources and restrict it to the nodes corresponding to data used for loading the DW schema. In our approach (see Fig. 1), the inference graph is built automatically based on the cardinalities of the class diagram of the data sources. To do this, we apply the following six rules:

R1. Each class is represented by a node colored in gray if the corresponding data is sensitive and colored in white otherwise.

R2. Each binary association / aggregation between two classes C1 and C2 will be represented by an arc according to the following three cases:

Case 1 (see Fig. 3 (a, b)): if the association/aggregation has cardinality * on C1’s side and cardinality 1 or 0..1 on C2’s side, then an arc from C1 to C2 is added to the inference graph (see Fig. 3 (c)).

Case 2 (see Fig. 4 (a, b)): if the association/aggregation has cardinality * on C1’s side and cardinality 1 or 0..1 on C2’s side, and C1 is also connected to C3 by an association/aggregation with cardinality * on C1’s side and cardinality 1 or 0..1 on C3’s side, then two arcs are added to the inference graph: the first from C2 to C3 and the second from C3 to C2 (see Fig. 4 (c)).

Case 3 (see Fig. 5 (a)): if the association has an association class C3, then two arcs are added to the inference graph: one from C3 to C1 and another from C3 to C2 (see Fig. 5 (b)).

Fig. 3. Inference first case (panels a to c)

Fig. 4. Inference second case (panels a to c)

Fig. 5. Inference third case (panels a and b)

Fig. 6. Representing composition (panels a to d)

R3. Each composition (see Fig. 6 (a)) will be represented by an arc from the component to the composite (see Fig. 6 (b)). If, in addition, the cardinality on the component side is 1 or 0..1 (see Fig. 6 (c)), then a second arc is added from the composite to the component (see Fig. 6 (d)).

R4. Each n-ary association with cardinalities * and 1 or 0..1 will be represented by arcs from classes with cardinalities 1 or 0..1 to those with the cardinality *. For example, the ternary association in Fig. 7 (a) is represented by the graph in Fig. 7 (b).

Fig. 7. Representing n-ary association (panels a and b)

R5. If an inheritance relationship exists between two classes C1 (parent) and C2 (child) and if C1 is connected to C3 by an association with cardinality * on C1’s side and cardinality 1 or 0..1 on C3’s side (see Fig. 8 (a)), then an arc is added from C2 to C3 (see Fig. 8 (b)).

R6. If an inheritance relationship exists between two classes C1 (parent) and C2 (child) and if C1 is connected to C3 by an association with cardinality 1 or 0..1 on C1’s side and cardinality * on C3’s side (see Fig. 8 (c)), then an arc is added from C3 to C2 (see Fig. 8 (d)).

Fig. 8. Representing an inheritance relationship (panels a to d)

The automatic construction of the inference graph does not indicate the type of inferences: partial or precise. Unfortunately, this indication cannot be deduced automatically. Thus, after constructing the inference graph, the designer must identify the partial inferences (drawn as dotted arcs).
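The cardinality-based rules can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `Link`, `build_nodes`, `build_arcs` and `parent_of` are hypothetical, cardinalities are plain strings limited to `*`, `1` and `0..1`, the three cases of R2 are applied whenever their condition holds (the text does not say whether they are exclusive), and R4 (n-ary associations) and the pruning by the DW mapping are left out. The arcs produced are untyped, since the partial/precise distinction is left to the designer.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional

ONE = {"1", "0..1"}  # cardinalities meaning "at most one"


@dataclass(frozen=True)
class Link:
    c1: str
    card1: str                        # cardinality written on c1's side
    c2: str
    card2: str                        # cardinality written on c2's side
    composition: bool = False         # if True, c1 is the component, c2 the composite
    assoc_class: Optional[str] = None


def build_nodes(classes, sensitive):
    """R1: one node per class, flagged True (gray) when the data is sensitive."""
    return {c: c in sensitive for c in classes}


def build_arcs(links, parent_of=None):
    """Apply R2, R3, R5 and R6; return a set of untyped (source, target) arcs."""
    parent_of = parent_of or {}       # child class -> parent class
    arcs, direct, targets = set(), set(), {}
    for link in links:
        if link.composition:                           # R3
            arcs.add((link.c1, link.c2))
            if link.card1 in ONE:
                arcs.add((link.c2, link.c1))
            continue
        if link.assoc_class:                           # R2, case 3
            arcs.add((link.assoc_class, link.c1))
            arcs.add((link.assoc_class, link.c2))
        ends = [(link.c1, link.card1, link.c2, link.card2),
                (link.c2, link.card2, link.c1, link.card1)]
        for many, card_many, one, card_one in ends:
            if card_many == "*" and card_one in ONE:   # R2, case 1
                direct.add((many, one))
                targets.setdefault(many, set()).add(one)
    for ones in targets.values():                      # R2, case 2
        for a, b in combinations(sorted(ones), 2):
            arcs.update({(a, b), (b, a)})
    for child, parent in parent_of.items():
        for src, dst in direct:
            if src == parent:
                arcs.add((child, dst))                 # R5
            if dst == parent:
                arcs.add((src, child))                 # R6
    return arcs | direct


# Case 2 configuration: C1 is on the * side of two associations.
case2 = build_arcs([Link("C1", "*", "C2", "1"), Link("C1", "*", "C3", "0..1")])
```

With this input, `case2` contains the two case 1 arcs (C1 to C2, C1 to C3) and the two case 2 arcs between C2 and C3.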


In addition, to ensure that all possible inferences have been determined, our approach continues with the automatic calculation of the transitive closure of the graph. On the resulting graph, we distinguish two types of paths:

- Precise path: it is a path where all connected nodes allow precise inferences.

- Partial path: it is a path with at least one node that allows partial inferences.

3.3 Enrichment of the DW Schema

The inference graph is used to enrich the DW schema (see Fig. 1). To do so, our approach assumes that the mapping between elements of the DW schema and the source is already done. We exploit this mapping to apply the two following enrichment rules:

- For each element of the inference graph belonging to a precise path, annotate its corresponding element in the DW schema with “Precise Inference: ElementName: NameInferedData”.

- For each element of the inference graph belonging to a partial path, annotate its corresponding element in the DW schema with “Partial Inference: ElementName: NameInferedData”.

4 Example

Fig. 9 contains the class diagram of a fictitious data source in the healthcare domain. Table 1 contains details of various classes.

Fig. 10 presents a DW schema that analyzes the costs and durations of the diagnostic along the analysis axes: Disease, Treatment, Critical Illness, Transfer, Date Time, and Doctor's specialty. The latter takes the value generalist if the doctor who performed the diagnostic is a generalist and the specialty of the doctor otherwise.

Fig. 9. Class diagram of the data sources
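Before the example is developed, the two enrichment rules of Section 3.3 can be sketched in code. This is a sketch under several assumptions that are not in the paper: the function name `enrich` and its parameters are hypothetical, inferences are given as already typed pairs (for instance the result of a transitive closure), only inferences that reach sensitive data are annotated, and `ElementName` is filled with the name of the DW element. Because every element on a path towards a sensitive node has an arc to that node in the closed graph, annotating the source of each such arc is one way to cover all the elements of the path.

```python
def enrich(typed_inferences, mapping, sensitive):
    """Build DW schema annotations from typed inferences.

    typed_inferences: {(source_element, inferred_element): "precise" | "partial"}
    mapping: source element name -> DW schema element name
    sensitive: names of the sensitive source elements
    """
    annotations = {}
    for (src, inferred), kind in sorted(typed_inferences.items()):
        # Assumption: skip inferences that do not reach sensitive data
        # and source elements that are not loaded into the DW.
        if inferred not in sensitive or src not in mapping:
            continue
        label = "Precise Inference" if kind == "precise" else "Partial Inference"
        dw_element = mapping[src]
        annotations.setdefault(dw_element, []).append(
            f"{label}: {dw_element}: {inferred}"
        )
    return annotations


# Hypothetical input: types and DW element names are illustrative only.
inferences = {
    ("Treatment", "Illness"): "precise",
    ("Date and Time", "Illness"): "partial",
    ("Date and Time", "Doctor"): "partial",  # Doctor is not sensitive here
}
mapping = {"Treatment": "Treatment", "Date and Time": "Date Time"}
result = enrich(inferences, mapping, sensitive={"Illness"})
```

Here `result` attaches one precise annotation to the Treatment element and one partial annotation to the Date Time element; the inference towards Doctor produces no annotation because Doctor is not marked as sensitive in this input.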


Table 1. Details of Fig. 9 classes

Class | Details
Date and Time | Date and time of a patient admission.
Admission | Patient admission.
Diagnostic | Patient diagnostic.
Doctor | Doctor who made the diagnostic.
Specialty | Specialty of doctor who made the diagnosis.
Illness | Illness diagnosed.
Treatment | Treatment necessary to cure the disease.
Critical illness | Serious disease requiring patient transfer to another hospital where he will receive appropriate care.
Transfer | This association class contains the date of transfer of the patient and the hospital that will welcome him.

Fig. 10. Multidimensional Model

4.1 Inference Graph

Illness corresponding to a given patient is sensitive information since it is part of professional secrecy in medical activities. In our example, we look at the data that allow us to infer a patient's illness. Fig. 11 contains the inference graph constructed from the cardinalities of the class diagram of the source data. In this graph, gray nodes are the sensitive data; dotted arcs represent the inferences that we considered partial, and those in solid line are believed to be precise inferences. This graph shows seven partial inferences and one precise inference. The calculation of the transitive closure of the graph has highlighted other potential inferences. For the sake of clarity, we have shown in Fig. 12 the new paths.

These paths are partial because they contain nodes that allow partial inferences. In Table 2, the first column shows some new inferences composed of the paths listed in the second column.

From Table 2 and Fig. 12, we can deduce that:

- A user with access to the dates and times of admission and transfer data may infer that the diagnosed illness was critical.


Fig. 11. Initial Inferences graph

Fig. 12. Inference graph after calculating the transitive closure

- Access to the date and time of admission and to the specialty of the doctor who performed the corresponding diagnosis on admission allows the user to infer the type of illness the patient has.

- Access to the treatment received by a patient allows a user to infer the disease the patient has.
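As an illustration of the transitive-closure step on this example, the sketch below uses only the arcs that can be read off Table 2, plus the treatment-to-illness arc mentioned above. The figure of the initial graph is not reproduced here, so the type attached to each arc is an assumption: the Table 2 arcs are taken as partial and Treatment → Illness as precise. The function name `transitive_closure` is hypothetical. Following the definition of precise and partial paths, a derived inference is precise only when every arc on its path is precise.

```python
PRECISE, PARTIAL = "precise", "partial"

# Arcs of the example (the types are assumptions, see the text above).
arcs = {
    ("Date and Time", "Admission"): PARTIAL,
    ("Admission", "Diagnostic"): PARTIAL,
    ("Diagnostic", "Doctor"): PARTIAL,
    ("Diagnostic", "Transfer"): PARTIAL,
    ("Diagnostic", "Illness"): PARTIAL,
    ("Treatment", "Illness"): PRECISE,
}


def transitive_closure(arcs):
    """Warshall-style closure over typed arcs.

    A derived inference is precise only if both halves of its path are
    precise; otherwise it is partial. An inference already known to be
    precise is never downgraded.
    """
    closed = dict(arcs)
    nodes = sorted({node for arc in arcs for node in arc})
    for mid in nodes:
        for src in nodes:
            if (src, mid) not in closed:
                continue
            for dst in nodes:
                if src == dst or (mid, dst) not in closed:
                    continue
                both_precise = closed[(src, mid)] == closed[(mid, dst)] == PRECISE
                kind = PRECISE if both_precise else PARTIAL
                if closed.get((src, dst)) != PRECISE:
                    closed[(src, dst)] = kind
    return closed


closed = transitive_closure(arcs)
new_inferences = {pair: kind for pair, kind in closed.items() if pair not in arcs}
from_date = sorted(dst for (src, dst) in new_inferences if src == "Date and Time")
```

Under these assumptions, `from_date` lists Diagnostic, Doctor, Illness and Transfer, which are the four inferences of Table 2, all of them partial. The closure also yields inferences starting at Admission that the table does not list.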

Nodes shown in Figs. 11 and 12: Illness, Critical Illness, Treatment, Diagnostic, Transfer, Doctor, Specialty, Date and Time, Admission.


Table 2. Partial paths

Inference | Partial path
Date and Time → Diagnostic | Date and Time → Admission, Admission → Diagnostic
Date and Time → Doctor | Date and Time → Admission, Admission → Diagnostic, Diagnostic → Doctor
Date and Time → Transfer | Date and Time → Admission, Admission → Diagnostic, Diagnostic → Transfer
Date and Time → Illness | Date and Time → Admission, Admission → Diagnostic, Diagnostic → Illness

4.2 DW Schema Enrichment

Based on the inference graph and the mapping between the data source and the DW schema, we automatically get the security annotations for the DW schema elements with potential inferences (see Fig. 13). In this model, the dimensions of time and date have the same annotation to specify that the two sets can lead to inferences. In Fig. 13, for the sake of clarity, we have not listed all the annotations.

Fig. 13. DW schema annotated with the security information

5 Conclusion

In this paper, we presented an approach to produce a conceptual multidimensional model annotated with information for the prevention of inferences. Our approach has two advantages over existing approaches. The first is its independence from the data domain. The second advantage is the use of available data to detect inferences. Our approach constructs a graph of inferences based on the class diagram of the data sources. With the assistance of the domain expert, the class diagram allows us to identify the elements that lead to precise and partial inferences. These elements will be annotated in the multidimensional model.

Currently, we are studying how to transfer to the logical level the annotations defined at the design level.

References

1. Bhargava, B.K.: Security in data warehousing (Invited talk). In: Kambayashi, Y., Mohania, M., Tjoa, A.M. (eds.) DaWaK 2000. LNCS, vol. 1874, pp. 287–288. Springer, Heidelberg (2000)

2. Pernul, G., Priebe, T.: Towards OLAP security design - survey and research issues. In: 3rd ACM International Workshop on Data Warehousing and OLAP DOLAP 2000, Washington, DC, November 10, pp. 114–121 (2000)

3. Soler, E., Stefanov, V., Mazón, J.-N., Trujillo, J., Fernández-Medina, E., Piattini, M.: Towards comprehensive requirement analysis for data warehouses: Considering security requirements. In: The Third International Conference on Availability, Reliability and Security ARES 2008, Barcelona, Spain, pp. 104–111. IEEE Computer Society, Los Alamitos (2008)

4. Soler, E., Villarroel, R., Trujillo, J., Fernández-Medina, E., Piattini, M.: Representing security and audit rules for data warehouses at the logical level by using the common warehouse metamodel. In: The First International Conference on Availability, Reliability and Security ARES 2006, Vienna, Austria, pp. 914–921. IEEE Computer Society, Los Alamitos (2006)

5. Triki, S., Ben-Abdallah, H., Feki, J., Harbi, N.: Modeling Conflict of Interest in the design of secure data warehouses. In: The International Conference on Knowledge Engineering and Ontology Development 2010, Valencia, Spain, pp. 445–448 (2010)

6. Carlos, B., Ignacio, G., Eduardo, F.-M., Juan, T., Mario, P.: Towards the Secure Modelling of OLAP Users’ Behaviour. In: The 7th VLDB Conference on Secure Data Management, Singapore, September 17, pp. 101–112. Springer, Heidelberg (2010)

7. Steger, J., Günzel, H.: Identifying Security Holes in OLAP Applications. In: Proc. Fourteenth Annual IFIP WG 11.3 Working Conference on Database Security, Schoorl (near Amsterdam), The Netherlands, August 21-23 (2000)

8. Icon Group Ltd. GFK AG: International Competitive Benchmarks and Financial Gap Analysis (Financial Performance Series). Icon Group International (2000)

9. Villarroel, R., Fernández-Medina, E., Piattini, M., Trujillo, J.: A uml 2.0/ocl extension for designing secure data warehouses. Journal of Research and Practice in Information Technology 38(1), 31–43 (2006)

10. Haibing, L., Yingjiu, L.: Practical Inference Control for Data Cubes. IEEE Transactions on Dependable and Secure Computing 5(2), 87–98 (2008)

11. Cuzzocrea, A.: Privacy Preserving OLAP and OLAP Security. In: Encyclopedia of Data Warehousing and Mining, pp. 1575–1158 (2009)

12. Zhang, N., Zhao, W.: Privacy-Preserving OLAP: An Information-Theoretic Approach. IEEE Transactions on Knowledge and Data Engineering 23(1), 122–138 (2011)

13. Terzi, E., Zhong, Y., Bhargava, B.K., Pankaj, Madria, S.K.: An Algorithm for Building User-Role Profiles in a Trust Environment. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2002. LNCS, vol. 2454, pp. 104–113. Springer, Heidelberg (2002)


14. Bhargava, B.K., Zhong, Y., Lu, Y.: Fraud Formalization and Detection. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2003. LNCS, vol. 2737, pp. 330–339. Springer, Heidelberg (2003)

15. Golfarelli, M., Rizzi, S.: A Methodological Framework for Data Warehouse Design. In: ACM First International Workshop on Data Warehousing and OLAP DOLAP, Bethesda, Maryland, USA, pp. 3–9 (November 1998)

16. Feki, J., Nabli, A., Ben-Abdallah, H., Gargouri, F.: An Automatic Data Warehouse Conceptual Design Approach. In: Wang, J. (ed.) Encyclopedia of Data Warehousing and Mining, 2nd edn. (2008)

17. Luján-Mora, S., Trujillo, J.: A Comprehensive Method for Data Warehouse Design. In: Fifth International Workshop on Design and Management of Data Warehouses, DMDW 2003, Berlin, Germany (September 2003)
