
NATIONAL TECHNICAL UNIVERSITY OF ATHENS

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING DIVISION OF COMPUTER SCIENCE

Optimization of On-Line Analytical Processing Systems: Conceptual Data Modeling and Query Processing Techniques

Ph.D. Thesis

ARIS TSOIS Dipl. Electrical and Computer Engineering N.T.U.A. (1995)

Athens, February 2005


Advisory Committee:

T. Sellis
Y. Vasiliou
F. Afrati

Approved by the seven-member examining committee on February 28, 2005.

T. Sellis Professor N.T.U.A.


Y. Vasiliou Professor N.T.U.A.


F. Afrati Professor N.T.U.A.


E. Zachos Professor N.T.U.A.


A. Stafylopatis Professor N.T.U.A.


Y. Ioannidis Professor Univ. of Athens


V. Vassalos Assistant Professor Athens Univ. of Economics and Business

Athens, February 2005

ARIS TSOIS Ph.D. Electrical and Computer Engineering N.T.U.A. © 2005 – All rights reserved

Abstract

On-Line Analytical Processing (OLAP) is a technology that encompasses applications used by directors, managers and analysts to obtain a better understanding of their business and therefore to make better decisions. The two main characteristics of OLAP systems are: (a) the ability to provide their users with a multidimensional conceptual view of the data and support for analysis based on multiple hierarchies, and (b) the ability to respond fast to all user queries, even though, more often than not, these queries require complex grouping and aggregation operations on enormous quantities of data. In this thesis we study problems related to the above characteristics and propose solutions that can be used to implement more efficient OLAP systems. In particular, in the first part of the thesis we tackle the issue of providing OLAP users with an adequate multidimensional conceptual model. The result of this effort is a new conceptual data model named MAC. In the second part we investigate how an OLAP system can achieve fast response times for ad-hoc queries. To this end we propose a number of optimization techniques that can be used in combination with a recently proposed, and very promising, multidimensional data structure. For the most efficient of these techniques we provide a thorough analysis of when it can and should be used. As part of this analysis we define a generalization of the technique, called Generalized Pre-Grouping, which is applicable even without the use of specialized multidimensional data structures.

Contents

1 Introduction
   1.1 On-Line Analytical Processing and Data Warehousing
   1.2 Contribution
   1.3 Outline

Part I: Conceptual Modeling

2 Conceptual Data Modeling for OLAP
   2.1 Introduction
   2.2 Requirements
   2.3 Related Work

3 The MAC Model
   3.1 Introduction
   3.2 Dimension Levels
      3.2.1 Notation
      3.2.2 Graphical notation for Dimension Levels
      3.2.3 Properties of Dimension Levels
      3.2.4 Operations on Dimension Levels
   3.3 Dimension Paths
      3.3.1 Graphical notation for Dimension Paths
      3.3.2 Operations on Dimension Paths
   3.4 Dimensions
      3.4.1 Graphical notation for Dimensions
      3.4.2 Operations on Dimensions
   3.5 Dimension Domains
   3.6 The Multidimensional Aggregation Cubes (MACs)
      3.6.1 Graphical representation of Cubes
      3.6.2 Discussion
      3.6.3 Operations on MACs
   3.7 Example application of the MAC model

Part II: Query Processing Techniques

4 Processing Multidimensional-Analysis Queries
   4.1 Query Processing Overview
   4.2 Related Work
   4.3 Logical Database Schema
   4.4 Query Classification
   4.5 The Abstract Execution Plan
   4.6 Analysis of the Optimization Opportunities
   4.7 The Hierarchical Pre-Grouping Optimization Technique
   4.8 The Pull-Up Selection Optimization Technique
   4.9 Sort Order Based Optimization
   4.10 Experimental Results


5 Cost-Based Optimization
   5.1 The Cost Model
      5.1.1 The hybrid-hash grouping algorithm
      5.1.2 The simple-hash grouping algorithm
      5.1.3 The merge-join algorithm
      5.1.4 The main-memory nested-loop join algorithm
      5.1.5 The index-based join algorithm
      5.1.6 The statistics estimation problem
      5.1.7 The cost-based optimization process
   5.2 Case study: Cost estimation of an example query
   5.3 Case study: The cost estimation
      5.3.1 Applying the cost model to TB
      5.3.2 Estimation of the cost of the original query plan
   5.4 Case study: The alternative query plans
      5.4.1 The non-pre-grouping alternative
      5.4.2 An optimized execution plan
      5.4.3 Overview of the three alternative plans
   5.5 Conclusions

6 The Generalized Pre-Grouping
   6.1 Definitions
      6.1.1 Attributes and Relations
      6.1.2 NULL values
      6.1.3 Operators
      6.1.4 Integrity Constraints
      6.1.5 Generalized Aggregate functions
      6.1.6 The Generalized Projection operator Л
      6.1.7 Interesting classes of aggregate functions
   6.2 The Generalized Pre-Grouping Transformation
   6.3 Decomposing Pre-Grouping
      6.3.1 Elementary transformations
      6.3.2 Simple transformations
      6.3.3 Complex transformations
   6.4 Conclusions

7 Epilogue
   7.1 Summary and Conclusions
   7.2 Future Work
      7.2.1 Future work on conceptual modeling
      7.2.2 Future work on query processing and optimization

References
Appendix


List of Figures

Figure 1-1: The dimensions and hierarchies of the analysis
Figure 1-2: Example section of a cube
Figure 1-3: An example star-schema
Figure 2-1: The analysis paths
Figure 2-2: Example result for query Q1
Figure 3-1: Dimension Levels
Figure 3-2: Dimension Members
Figure 3-3: Dimension Paths
Figure 3-4: Examples of Dimension Paths
Figure 3-5: Dimensions
Figure 3-6: Dependency in a Dimension
Figure 3-7: The graph of the dimension Location
Figure 3-8: The graph of the dimension Client
Figure 3-9: Cubes
Figure 3-10: The dimension D_Product
Figure 3-11: The D_Date dimension
Figure 3-12: The D_Customer dimension
Figure 3-13: The D_Salesman dimension
Figure 3-14: The D_Warehouse dimension
Figure 3-15: The dimensions D_TransactionType, D_PaymentType, D_SpecialIDL
Figure 3-16: The cube definition
Figure 4-1: The schema of the data warehouse
Figure 4-2: The dimension hierarchies
Figure 4-3: The original execution plan
Figure 4-4: The transformed execution plan
Figure 4-5: Star schema with h-surrogate attributes
Figure 4-6: The abstract processing plan
Figure 4-7: All the possible execution plan areas and operations


Figure 4-8: The hierarchies of the dimension table used in the experiments
Figure 4-9: The schema used in the experiments
Figure 4-10: RDB / MDB ratio per query
Figure 4-11: Elapsed time RDB/MDB ratio – template grouping
Figure 4-12: Quartiles of the elapsed time ratio per template
Figure 4-13: Selectivity
Figure 4-14: The execution plan of the sample query on RDB
Figure 4-15: The execution plan of the sample query on MDB
Figure 5-1: The example query
Figure 5-2: The original execution plan
Figure 5-3: The query processing steps for the original plan
Figure 5-4: The non-pre-grouping alternative plan
Figure 5-5: The optimal execution plan (OPT)


List of Tables

Table 2-1: Evaluation of multidimensional models
Table 3-1: The cells of the example cube C1
Table 3-2: The query Q4
Table 4-1: The ad-hoc star query template
Table 4-2: The term translation table
Table 4-3: The sizes of tables used in the experiments
Table 4-4: Quartiles for the two selectivity groups
Table 4-5: The sample query
Table 4-6: The execution plan of the sample query on RDB
Table 4-7: The execution plan of the sample query on MDB
Table 5-1: The cost formulas and memory requirements of the algorithms
Table 5-2: The profile of the original execution plan
Table 5-3: The profile of the non-pre-grouping execution plan
Table 5-4: The profile of the optimal query plan
Table 5-5: Overview of actual performance and estimation cost
Table 6-1: Evaluation of expressions with NULL values and "unknown" states
Table 6-2: The StudentGrades relation
Table 6-3: Elementary transformations
Table 6-4: Transformations


List of Definitions

Definition 3-1: Dimension Level
Definition 3-2: Dimension Path "Links"
Definition 3-3: Strict Paths
Definition 3-4: Complete Path
Definition 3-5: Non-overlapping Path
Definition 3-6: Dimension Path
Definition 3-7: Name consistency for Paths
Definition 3-8: Dimension
Definition 3-9: Dimension Value
Definition 3-10: Dimension Domain
Definition 3-11: Multidimensional Aggregation Cube
Definition 4-1: Hierarchical Level (hlevel)
Definition 4-2: Grouping Order (GO(Di))
Definition 4-3: Aggregation Order (AO(Di))
Definition 4-4: Dimension Order (DO(Di))
Definition 6-1: Functional Dependency
Definition 6-2: Non-strict Functional Dependency
Definition 6-3: Super-key
Definition 6-4: Strict Super-key
Definition 6-5: Inclusion Dependency
Definition 6-6: Strict Inclusion Dependency
Definition 6-7: Non-strict Inclusion Dependency
Definition 6-8: Distribution Property
Definition 6-9: Identity Property
Definition 6-10: Generalized Hierarchical Pre-Grouping


List of Algorithms

Algorithm 3-1: Create Links of Derived Dimension Path
Algorithm 3-2: Sort Levels in Path
Algorithm 3-3: Create Dimension from Paths
Algorithm 4-1: The heuristic Hierarchical Pre-Grouping algorithm
Algorithm 4-2: The sort order optimization algorithm

List of Theorems

Theorem 6-1: Split-Л
Theorem 6-2: Surrogate-Join
Theorem 6-3: Surrogate-Join-Early-GroupBy
Theorem 6-4: Main Theorem


1 Introduction

In the following sections we introduce the topics of this thesis and discuss the issues and problems that motivated our work. We then present a summary of our contributions and, finally, describe the outline of the rest of this document.

1.1 On-line Analytical Processing and Data Warehousing

Companies and organizations have been using computers and information systems in their daily work for quite some time. Today, almost all companies and organizations use some type of information system to gather and process the majority of their business information. These systems range from small, single-user, isolated applications running on a PC and covering a specific business process to large, integrated, multi-user Enterprise Resource Planning (ERP) systems running on multiple servers and desktop computers and covering all aspects of the business.

By using information systems, companies and organizations have managed not only to make their daily transactions more efficient and reliable, but also to gather and digitally store the information that describes their activities. This holds regardless of the complexity or the integration level of the systems being used. It soon became obvious that these large amounts of collected data could be put to good use in making more informed decisions. Using the appropriate software tools, decision makers can summarize historical trails, discover trends and relationships within the data, and visualize the various business data in proper graphical forms. All these results help directors, managers, analysts, and other decision makers obtain a much better understanding of their business and therefore make better decisions. Even information systems that were not initially designed to support decision makers proved very useful when their data was analyzed with the appropriate software tools.

Such software tools are usually called On-line Analytical Processing (OLAP) systems. The term OLAP was coined by Codd in 1993 [Codd93] and has since been used to describe a large variety of systems. Each of these systems satisfies, to some degree, the 12 properties that Codd initially defined. However, not all authors

1-1

and vendors of OLAP systems agree on the 12 properties stated by Codd. In fact, the authors of the OLAP Report [OLAP04] define a far more compact set of five critical properties that seems to be widely adopted. Their FASMI test states that an OLAP system must be, first of all, Fast: the system must respond to the user in a matter of seconds regardless of the complexity of the user action. Any delay longer than 30 seconds is assumed to interrupt the analysis that the user is trying to perform. The second property is Analysis: the OLAP system must be able to perform the required statistical analysis and allow the users to define ad-hoc calculations. The third property is Shared: the system should provide access to multiple users and deal with consistency and security issues, especially when the users perform update operations. The fourth property is Multidimensional. This is the most important and characteristic property of OLAP systems: they must provide a multidimensional conceptual view of the data. This means that an OLAP system must be able to organize data at a conceptual level based on multiple dimensions, supporting hierarchies and multiple hierarchies on these dimensions. This is acknowledged as the only natural and logical way to analyze data. Finally, the fifth property is Information, meaning that the system must be able to handle the usually large amounts of information involved in the analysis process.

OLAP systems can be categorized as ROLAP, MOLAP and HOLAP based on the type of data structures they primarily use. ROLAP (relational OLAP) systems use relational databases to obtain and store their data. MOLAP (multidimensional OLAP) systems are based on multidimensional data structures. These systems are mainly main-memory systems, although some use vendor-specific multidimensional databases.
Finally, HOLAP (hybrid OLAP) systems combine both approaches, using main-memory multidimensional structures for in-memory operations and relational databases for the storage and retrieval of raw data.

To present the main concepts used in an OLAP system, we use the following simplified example. A retail company wishes to analyze the sales data recorded at its stores. The analysts are interested in viewing total sales figures using the following three properties of a sales transaction: the date of the transaction, the store where the transaction took place, and the product that was sold. Based on the date, the analysts need to aggregate the total sales values by day, month, quarter and year. Based on the store property, the values need to be

1-2

aggregated according to the geographical location of the stores. In particular, the company groups its stores according to predefined areas, cities and regions. Finally, the total sales figures need to be grouped by product, by the brand of the products, and by the product categories defined by the company. These categories are not related to brands but correspond to the department to which each product belongs.

In OLAP terminology, the above requirements define an analysis process with three dimensions. The dimensions correspond to the three properties: date, store and product. Each dimension is organized into one or more hierarchies according to the way aggregation must be performed on the particular property. In our example, the date dimension would have one hierarchy containing the levels day, month, quarter and year. The store dimension would have one hierarchy with the levels store, area, city and region, while the product dimension would have two hierarchies, (product, brand) and (product, category), as shown in Figure 1-1. This is because the grouping of products by brand is totally independent of the grouping of products by category.

[Figure 1-1: The dimensions and hierarchies of the analysis. The Date dimension has the hierarchy Day, Month, Quarter, Year; the Store dimension has the hierarchy Store, Area, City, Region; and the Product dimension has the two hierarchies (Product, Brand) and (Product, Category).]

The above dimensions are used to define a three-dimensional data cube. Each cell in this cube contains the measure used in the analysis: total sales. So, when the analyst requests the total sales figures for the month "November 2004", for products in the category "mobile phones" that were sold in the city "Athens", the OLAP system identifies the corresponding cell in the data cube, computes its value if it is not precomputed, and shows the result to the user. Figure 1-2 illustrates a section of the data cube and the cell previously identified. This operation is considered a slice-and-dice


operation. The analyst may roll up on the date dimension and request the corresponding value for the entire year 2004. He may also drill down to view a detailed list of total sales figures per store for all stores in the city of Athens, maintaining the restrictions on the identified products and time period.
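The roll-up described above is, at heart, a grouping and aggregation step over the cube's cells. The following minimal Python sketch illustrates a day-to-month roll-up on a toy three-dimensional cube; the cell coordinates and sales figures are invented for illustration only and are not taken from the thesis.

```python
from collections import defaultdict

# Toy cells of the cube: (day, city, category) -> total_sales.
# All coordinates and figures are invented for illustration.
cells = {
    ("2004-11-02", "Athens", "mobile phones"): 120.0,
    ("2004-11-17", "Athens", "mobile phones"): 80.0,
    ("2004-12-01", "Athens", "mobile phones"): 50.0,
    ("2004-11-05", "Paris",  "mobile phones"): 200.0,
}

def roll_up_to_month(day_cells):
    """Roll up the date dimension from day to month by summing the measure."""
    out = defaultdict(float)
    for (day, city, category), sales in day_cells.items():
        month = day[:7]                      # 'YYYY-MM-DD' -> 'YYYY-MM'
        out[(month, city, category)] += sales
    return dict(out)

monthly = roll_up_to_month(cells)
# Slice-and-dice: fix the month and category coordinates.
print(monthly[("2004-11", "Athens", "mobile phones")])   # 200.0
```

Drilling back down is simply a return to the finer-grained `cells`; the coarser view can always be recomputed from the finer one, which is why pre-computing aggregates is an optimization rather than a necessity.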

[Figure 1-2: Example section of a cube. The section shows cities (Athens, Paris, Rome, London), months (October '04, November '04, December '04) and product categories (Handheld PCs, Mobile phones, Navigation systems).]

The previous example illustrates the basic OLAP concepts. Dimensions, hierarchies and data cubes are the most important ways in which data is viewed in an OLAP system, while slice-and-dice, drill-down and roll-up are the most common operations.

The area of OLAP has been a major research area in the database community for quite some time ([CD97a]). A fundamental issue faced by researchers in the OLAP domain, as well as by vendors of OLAP systems, is the modeling of data. The well-studied conceptual and logical models used in other database areas, like the E/R model or the relational model, do not seem to be sufficient for the OLAP case ([Kimb96], [TBC99], [S++98], [Kimb97]). Vendors have adopted various models, while standardization bodies and researchers have developed and studied additional models. All these models share some common concepts, like measures or hierarchies,


but there is still no formally defined and widely accepted (logical or conceptual) data model. The first part of this thesis tackles this modeling issue and defines a new conceptual data model for OLAP.

In order for an OLAP system to be useful, it needs access to the appropriate information. The sources of this information are the various information systems of the company. Nevertheless, best practice indicates that these sources should not be accessed directly by the OLAP system; rather, a specialized database system, a data warehouse, must be used to integrate and store all the relevant data. The OLAP system can then use the data warehouse as the source of its information. Data warehouses are built on top of relational database systems that have been enhanced in various ways to make them suitable for data warehousing.

There are numerous reasons that make the creation and usage of such specialized database systems necessary. One of the main reasons is the integration of data that comes from different sources. Since information comes not from one but from many information systems, each of which may have its own way of representing and encoding information and real-world entities, integrating a number of sources into one common model is a very complicated task. To solve this problem, data warehouses have special extraction, transformation and loading (ETL) mechanisms that are built based on the particularities of the data sources. An additional important reason for building a data warehouse is the preservation of historical data. The source information systems may discard old data when the data becomes useless to the system or too expensive to maintain. However, a data warehouse can be used to store all this deleted information, sometimes in a compressed or aggregated manner. The historical perspective of data values is one of the most frequently requested pieces of information in OLAP.
A third and very important reason for using data warehouses is performance. A typical OLAP system performs complex operations on large data volumes. Furthermore, these operations must be executed in a matter of seconds. Therefore, the OLAP system must be based on a specialized database system that can achieve these goals. The commonly used database systems are not adequate because they are OLTP systems tuned for a very different type of workload: large numbers of concurrent, short and relatively light transactions that each interact with a very small fraction of the


overall database. In an OLAP system, the number of users is limited and the frequency of operations is also low. However, the vast majority of the operations require complex calculations, which usually involve aggregations of large amounts of detailed data. Data warehouses are designed to meet, as far as possible, the requirements of OLAP systems.

Vendors of data warehouse systems, as well as researchers in this area, have developed a number of specialized techniques appropriate for OLAP. Such techniques include the usage of denormalized schemata, materialized views, bitmap indexes, multidimensional indexes, multidimensional access methods, clustering and compression, as well as specialized query processing techniques. The main idea in most of these techniques is the usage of pre-computation and redundancy. According to this approach, systems can pre-compute anticipated operations and store the results. This, however, introduces redundancy in the database, which must be carefully managed. The idea is not new: the trade-off between space and time complexity is a well-known fact in computer science. In fact, all database systems exploit this idea when they use indexes: B-trees and the other types of indexes are a form of pre-computation and contain redundant information that can speed up query processing algorithms. In the same way, denormalized schemata, multidimensional indexes, bitmap indexes and materialized views are different ways of trading space for speed. Of course, the usage of new data structures requires proper modifications to the query processing algorithms.

The typical denormalized schema used in data warehouses to store data cubes is the star-schema. According to this schema, the database contains a large table, named the fact table, and a number of additional tables called dimension tables. The fact table contains the measure values of what we previously described as the data cube.
It also contains foreign keys to the dimension tables. Each dimension table encodes a dimension of the data cube along with its hierarchies. Therefore, dimension tables contain functionally related attributes and are not in the third normal form (3NF). Figure 1-3 shows an example star-schema for the data cube we described earlier in this section. The schema contains one fact table, named Sales, and three dimension tables. The attribute total_sales is the measure on which the analysis is performed.
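As an illustration, a star-schema of this shape can be built and queried with a few lines of SQL. The sketch below uses SQLite with invented table contents; the Date dimension table is renamed DateDim only to avoid a clash with its own Date column.

```python
import sqlite3

# A minimal sketch of a star-schema like the one in Figure 1-3.
# Table and column names follow the thesis example; the data is invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DateDim (Date TEXT PRIMARY KEY, Day INTEGER, Month TEXT,
                      Quarter TEXT, Year INTEGER);
CREATE TABLE Store   (Store TEXT PRIMARY KEY, StoreArea TEXT, City TEXT, Region TEXT);
CREATE TABLE Product (Product TEXT PRIMARY KEY, Brand TEXT, Category TEXT);
-- The fact table: one measure plus foreign keys to the dimension tables.
CREATE TABLE Sales   (Date TEXT REFERENCES DateDim, Product TEXT REFERENCES Product,
                      Store TEXT REFERENCES Store, total_sales REAL);
""")
con.executemany("INSERT INTO DateDim VALUES (?,?,?,?,?)",
                [("2000-01-15", 15, "Jan-2000", "Q1-2000", 2000)])
con.executemany("INSERT INTO Store VALUES (?,?,?,?)",
                [("S1", "A1", "Athens", "Attica")])
con.executemany("INSERT INTO Product VALUES (?,?,?)",
                [("P1", "BrandX", "Appliances"), ("P2", "BrandY", "Appliances")])
con.executemany("INSERT INTO Sales VALUES (?,?,?,?)",
                [("2000-01-15", "P1", "S1", 100.0), ("2000-01-15", "P2", "S1", 50.0)])

# A typical OLAP aggregation: total sales per brand, reached via a dimension table.
rows = con.execute("""
    SELECT p.Brand, SUM(s.total_sales)
    FROM Sales s JOIN Product p ON s.Product = p.Product
    GROUP BY p.Brand ORDER BY p.Brand
""").fetchall()
```

The denormalization is visible in the dimension tables: City functionally depends on Store, yet both live in the same table, which is exactly what makes the single-join aggregation above possible.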


[Figure 1-3 shows the star-schema: a Sales fact table (Date, Product, Store, total_sales) with three dimension tables, namely Date (Date, Day, Month, Quarter, Year), Store (Store, Store Area, City, Region) and Product (Product, Brand, Category).]

Figure 1-3: An example star-schema

One of the recently proposed techniques for data warehouses is the Multidimensional Hierarchical Clustering and Hierarchical Indexing (MHC/HI) technique ([MRB99], [KS01]). This technique is based on the use of clustering access methods. Its main goal is to reduce the number of I/O operations required to answer large aggregate queries. According to this method, the fact table of a data warehouse is stored, indexed and clustered with respect to the dimension hierarchies. Since most aggregation queries apply restrictions on the dimension hierarchies, the fact table data needed to answer such queries are found clustered in a relatively small number of disk pages, which improves performance.

In the presence of this new data organization, new possibilities emerge for the optimization of query processing plans. By exploiting the multidimensional hierarchical index of the fact table, a number of optimization strategies can be used to drastically reduce the complexity of answering typical OLAP queries. The second part of this thesis studies the implications of the MHC/HI technique on query processing and proposes new optimization techniques.
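The clustering idea can be illustrated with a compound key built from a hierarchy path. This is only a sketch of the general principle, not the actual MHC/HI encoding of [MRB99]; the hierarchy and measures below are invented.

```python
# Sketch of hierarchy-based clustering: order facts by a compound key built
# from the hierarchy path (here: Region -> City -> Store). Facts that share a
# hierarchy prefix end up adjacent, so a range scan answers prefix restrictions.
facts = [
    (("Attica", "Athens", "S2"), 10),
    (("Crete",  "Chania", "S5"), 70),
    (("Attica", "Athens", "S1"), 40),
    (("Attica", "Piraeus", "S3"), 20),
]
clustered = sorted(facts)  # lexicographic order on the hierarchy path

# All facts restricted to Region = 'Attica' now form one contiguous run,
# so they can be fetched with a single range scan instead of scattered reads.
attica = [m for (path, m) in clustered if path[0] == "Attica"]
```

The same effect is what the clustered fact table achieves on disk: a restriction on a high hierarchy level translates into a small number of contiguous page reads.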

1.2 Contribution

Based on the above discussion, one can observe that two of the major issues to be addressed are Conceptual Modeling for data warehouses and Query Optimization in the presence of specialized indexing methods. It is exactly these two issues that this thesis tackles. A summary of the contributions of this thesis follows.


• We define a new conceptual data model for OLAP, the MAC (Multidimensional Aggregation Cube) model, along with a number of essential operations that manipulate the concepts of the model.

• We analyze the optimization opportunities that emerge when the MHC/HI technique is used in data warehouses and develop three optimization techniques.

• We define a cost model that can be used to estimate the cost of alternative execution plans and to identify situations where the application of the various optimization techniques (with particular emphasis on a technique called “Hierarchical Pre-Grouping”) is not beneficial.

• We generalize the Hierarchical Pre-Grouping technique so that it is applicable to larger classes of queries and define the Generalized Pre-Grouping transformation.

• We formally prove that nine particular conditions are sufficient for the application of the Generalized Pre-Grouping transformation and identify the relationship of this transformation to other, previously known, transformations used in relational database query optimization.

1.3 Outline

The thesis is separated into two parts. The first part tackles the issue of conceptual data modeling for OLAP and defines the MAC data model. In chapter 2 we present a number of requirements for an OLAP conceptual data model and compare existing models against these requirements. In chapter 3 we define the MAC model along with a set of operations, propose a graphical representation for MAC schemata and present an example MAC schema based on the model of an actual data warehouse.

In the second part of the thesis we discuss query processing and optimization techniques for OLAP queries on relational data warehouses that use the MHC/HI technique. In chapter 4 we start by giving a detailed definition of the problem and present related work. Then we define three optimization techniques and present experimental results for the most important technique: Hierarchical Pre-Grouping. In chapter 5 we analyze the problem of detecting situations where the Hierarchical Pre-Grouping technique does not improve the query execution plan and propose a cost


model that can be used to identify such problematic situations. This chapter also includes experimental results that support our proposal. In chapter 6 we first define the generalization of the Hierarchical Pre-Grouping technique, the Generalized Pre-Grouping transformation, so that it is applicable to a larger class of queries. Then we decompose the transformation into a sequence of known and new transformations and prove that nine conditions are sufficient to guarantee the validity of the transformation in the context of query optimization in relational databases. Finally, chapter 7 presents our conclusions and ideas for future work.


PART I CONCEPTUAL MODELING


2 Conceptual Data Modeling for OLAP

In this chapter we give a general overview on the issue of conceptual data modeling for OLAP, we derive a number of requirements for such conceptual models and we use them to review existing OLAP models.

2.1 Introduction

In the area of On-Line Analytical Processing the involved data structures are inherently complex. The multiple dimensions, the hierarchies within the dimensions and the multiple aggregation levels complicate the representation of data. Even in a simple scenario, representing measure values that can be analyzed at various granularity levels, using multiple dimensions and the hierarchies defined on each of these dimensions, leads to a complicated model. Furthermore, the various logical models and database architectures that have been developed to address performance issues, or to suit a particular type of DBMS, complicate the data-modeling problem even more. As a result, the data used for OLAP is represented by highly complicated data models, which can be managed only by database experts.

In order to make the data model usable by a larger group of users, ideally the end-users, there is one obvious solution: to use a higher-level model, a conceptual model, that represents the data in a more abstract way [BCN92]. According to [EN94], “Conceptual data models provide concepts that are close to the way many users perceive data”. The higher abstraction level, and the fact that the concepts used are closer to the way users perceive the modeled information, are the main arguments in favor of using a conceptual model for OLAP applications.

One can argue that complex data models are a reality even for large OLTP systems and that for these systems conceptual models are used only by database architects and not by end-users. Indeed, large OLTP systems have huge and very complicated logical data models, and the corresponding Entity-Relationship models, which are the dominant type of conceptual model used in OLTP, are hardly ever used by end-users. However, there is a crucial difference between the two cases. In a large OLTP system, any single user transaction refers to only a very small fraction of the overall schema, usually a couple of tables. On the contrary, typical OLAP


operations refer to large sections of the overall schema, and therefore the user is exposed to the complexity of the data model. In [CTW97] the authors claim that “conceptual modeling will be one of the major approaches for describing reality and the basis for human-database interfaces”. As the complexity of the stored information increases, the only viable approach for human-database interfaces seems to be the use of higher-level models that describe information based on its semantics, without taking into account the logical data organization used by the particular database.

The previous observations motivated our search for a suitable conceptual model for OLAP applications. But what are the virtues of a suitable conceptual model for OLAP? Codd, in the paper that introduced the term OLAP [Codd93], defined a number of requirements for an OLAP data model, some of which apply even to the case of conceptual models. Later, other researchers ([BSHD98], [PeJe99], [Vas00], [ASS02]) also presented various requirements for logical and conceptual OLAP models. In 2001 the Common Warehouse Metamodel (CWM) [OMG01] was established by the OMG (Object Management Group) as a standard metamodel for data warehouses. However, CWM is a data warehouse metamodel suitable for data exchange and for the integration of data warehouses; it is not a conceptual model intended to be used by end-users.

In the following section we compile our own list of requirements for a usable conceptual model for OLAP using an example scenario. The related work was taken into consideration without, however, adopting all of its arguments. Then, in section 2.3, we review a number of published data models for data warehouses and OLAP and compare them against the defined requirements.

2.2 Requirements

In this section we will present a set of requirements that we believe to be of key importance for a conceptual data model used in multidimensional analysis and OLAP applications. We will present those requirements through the use of an example scenario. The scenario is based on a real data warehousing / OLAP project in the development of which we have been involved.


Assume the following example: A chain of stores selling electrical home appliances has built a data warehouse in order to analyze its sales data. The sales data are loaded into the data warehouse from the OLTP system. For each sales transaction the OLTP system records the following information:

• The date of the transaction.
• The cashier ID where the transaction took place.
• The IDs of the products being sold.
• The customer ID.
• The sales price for each product being sold.

We assume that all the above information is somehow stored in the data warehouse. We shall not talk about the design process of the data warehouse, or about how data is loaded from the OLTP system, since our model does not address these issues. The conceptual data model in which we are interested is intended to be suitable for the end-users of the data warehouse, who analyze the information through the use of an OLAP tool.

As described by a plethora of OLAP papers and books [Mendel], [Inmo96], [Kimb96], multidimensional analysis is mainly based on drill-down, roll-up, slice and dice operations that are performed on a multidimensional view of data. Measure values are selected and aggregated using various predefined dimensions, dimension levels and hierarchies. The dimension levels, the hierarchies used for aggregation, the dimensions and the measures are the main concepts used in the analysis. For our example scenario, assume that the analysis is performed on the sale price (the price of items sold) using the hierarchies defined in Figure 2-1. When the hierarchies are used for the analysis of data through drill-down and roll-up operations we call them analysis paths.


The analysis paths and their grouping into dimensions are the following:

• Client dimension: P1 (Customer → Residence Area → Residence City → Residence Region), P2 (Customer → Profession).
• Time dimension: P3 (Date → Month → Year).
• Location dimension: P4 (Cashier → Store → Store Area → Store City → Store Region), P5 (Store → Warehouse → Warehouse Area → Warehouse City → Warehouse Region), P6 (Cashier → DBServer).
• Item dimension: P7 (Product → Product Group → Product Class), P8 (Product → Advertising Method), P9 (Product → Brand).

Figure 2-1: The analysis paths

The analysis paths shown in Figure 2-1 are grouped into four distinct dimensions based on the most detailed level of each path. The most detailed level of each dimension corresponds to a basic property of a product’s sales price as recorded by the transactions of the OLTP system. For example, the customer ID is used for the Customer level of the Client dimension. Each path is constructed out of two or more levels and the grouping/classification relationships that link these levels. The paths represent sequences of valid roll-up and drill-down operations that can be performed during the analysis. For example, products can be grouped by brand using the grouping/classification relationship


defined by the path P9. This relationship links each product in the Product level to some brand in the Brand level. The designer of the OLAP application defines the schema of the dimensions and the hierarchies mostly at design time, but an ad-hoc query might need to define its own analysis path. The paths, the levels and the grouping/classification relationships that link them are defined based on the needs of the analysis process. For example, one possible analysis path is used to drill down from the Warehouse Area level to the Store level. This drill-down operation reveals the stores supplied by the warehouses of a particular area. The designer of the OLAP application has defined this path (P5) but has decided to leave the Cashier level out of it. This decision could mean that, when doing roll-up and drill-down operations on this analysis path, it is not meaningful to drill down to the Cashier level.

From the above discussion one can realize that a conceptual data model suitable for multidimensional analysis should fulfill the following requirements:

R1. Dimension levels: Provide the means with which to define levels within a dimension.

R2. Grouping/classification relationships: Provide the means with which to define grouping/classification relationships that link members of dimension levels.

R3. Analysis paths: Provide the means with which to define analysis paths on dimensions (i.e. valid sequences of drill-down/roll-up operations on the levels of a dimension).

These first-class concepts of multidimensional analysis must have an appropriate and straightforward representation within a conceptual model. Note that the dimension levels are in fact attributes that can characterize the measure being analyzed, and the analysis paths are valid sequences of drill-down/roll-up operations.

According to the first path (P1) of our example scenario, for each customer we have the residence address in the form of area, city and region. Assume now that for some customers the city attribute is not set (perhaps because their residence area does not belong to any city). In this case the residence area of the customer will be linked by the grouping/classification relationship directly to a residence region and not to a


residence city, as happens with most of the other residence areas. Generally speaking, a grouping/classification relationship may involve more than two levels, since a member of a level may drill down to members of different levels.

There are also some other aspects of the grouping/classification relationship that we found to be important. In some cases a member of a level is linked to more than one member of the next level. For example, a member of the Product level will probably be linked to more than one member of the Advertising Method level, since a product can be simultaneously advertised in various ways (newspapers, TV, radio, etc.). Also, some members of a level may have no links with the more detailed level. For example, we may have a member of the Store Area level that is not linked with any member of the Store level. This would mean that at the given time there is no store in that area, or that the system does not know, or does not want to show, which stores belong to this area. The above examples show that a natural model of grouping/classification relationships might involve n-way relationships among levels, and that those relationships might not reference all members of the involved levels. Therefore, the deduced requirements are:

R4. No obligatory roll-up: A member of a dimension level need not necessarily be linked to a member of a “higher” level through some grouping/classification relationship.

R5. No obligatory drill-down: A member of a dimension level need not necessarily be linked to a member of a “lower” level through some grouping/classification relationship.

R6. Many-to-many relationships: The grouping/classification relationships should be able to define many-to-many relationships among dimension levels. A member of a level may be linked to many members of a “lower” dimension level (drill-down) but also to many members of a “higher” dimension level (roll-up).

R7. Hops in the analysis paths: Within one analysis path, a member of a dimension level may lack a link to a member of the immediately “higher” level but may be linked to a member of another level, further “up” in the analysis path.
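Requirements R4 and R7 can be made concrete with a small sketch of path P1. The link table, the member names and the roll_up helper below are our own illustration, not part of any model defined in this thesis.

```python
# Sketch of requirements R4 and R7 on path P1 (Customer -> Residence Area ->
# Residence City -> Residence Region). Members are (level, name) pairs; the
# table records each member's roll-up link, when one exists. Data is invented.
rolls_up_to = {
    ("Area", "A1"): ("City", "C1"),    # normal roll-up to the next level
    ("Area", "A2"): ("Region", "R1"),  # hop: area with no city (R7)
    ("City", "C1"): ("Region", "R1"),
    # ("Area", "A3") has no entry: no obligatory roll-up (R4)
}

def roll_up(member, target_level):
    """Follow roll-up links until target_level is reached; None if no link exists."""
    while member is not None and member[0] != target_level:
        member = rolls_up_to.get(member)
    return member

r1 = roll_up(("Area", "A1"), "Region")  # reaches the region via the City level
r2 = roll_up(("Area", "A2"), "Region")  # hops directly to the region
r3 = roll_up(("Area", "A3"), "Region")  # no link at all: stays unmapped
```

Both A1 and A2 roll up to the same region even though A2 skips the City level, while A3 simply has no roll-up, which is exactly what a query aggregating by Residence City must be prepared to handle.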


All four of the above requirements interfere with the summarizability [LeSh97] of a multidimensional data cube. For example, if we sum product sales using the advertising method to which the products are related, then a product that is related to two advertising methods will be counted twice. However, this may lead to wrong results only if the analyst is not aware of the properties of the analysis path and misinterprets the results.

Let us now consider four example queries that the analysts may pose. The first query is:

Q1: Give me the sum of sales for the year 2000 per Month, Product Group, Product Class, Store City and Store Region.

The above query requires aggregation on several levels of the same path. For example, it requires the sum of sales for each product group and also the sum of sales for each product class. The query result could be represented in a grid fashion as shown in Figure 2-2. Notice that in this figure the measure values for the region Store_Region_2 do not correspond to the sum of values for the City_3 and City_4 cities, probably because Store_Region_2 includes some areas that are not part of any city.
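A Q1-style aggregation over several levels of the same path can be sketched in SQL. The table and data below are invented, and since SQLite lacks GROUPING SETS, a UNION ALL of two GROUP BY queries stands in for it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(month TEXT, prod_group TEXT, prod_class TEXT, amount REAL);
INSERT INTO sales VALUES
  ('Jan', 'Group 1', 'Class 1', 3), ('Jan', 'Group 2', 'Class 1', 1),
  ('Jan', 'Group 3', 'Class 2', 7), ('Jan', 'Group 4', 'Class 2', 5);
""")
# Aggregate at two levels of the same path (Group and Class) in one result,
# mimicking GROUPING SETS with a UNION ALL of one GROUP BY per level.
rows = con.execute("""
    SELECT month, prod_group AS level_member, SUM(amount)
    FROM sales GROUP BY month, prod_group
    UNION ALL
    SELECT month, prod_class, SUM(amount)
    FROM sales GROUP BY month, prod_class
    ORDER BY 2
""").fetchall()
```

Each class total is the sum of its group totals here only because every sale carries both a group and a class; under requirements R4 to R7 that containment can fail, which is exactly the Store_Region_2 effect described above.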

[Figure 2-2 presents the result of Q1 as a grid: for each month of 2000, sums of sales with rows for the product classes and their product groups, and columns for the store regions and their store cities.]

Figure 2-2: Example result for query Q1

Now consider the case where the analyst uses two paths of the same dimension for the classification and grouping of sales data.


Q2: Give me the sum of sales for the year 2000 per customer Profession and Residence Region.

The two levels used for grouping, Profession and Residence Region, are in fact independent, although they both belong to the same dimension. So, grouping on both of them at the same time defines a two-dimensional space. If the two levels were related, as Store City and Store Region are in the previous query, we would have a one-dimensional space.

The next query shows the necessity of supporting multiple measures defined over the same set of dimensions.

Q3: Give me the sum of sales, the maximum sale value and the number of sales per Store and Month for the year 2000.

In most cases some of the measures will be functionally dependent on other, more primitive measures. Generally speaking, a set of primitive and derived measures can simultaneously be required by an analysis process.

Finally, we present a query with a somewhat more complicated selection condition:

Q4: Give me the sum of sales per Month, Store Area and Brand, selecting only the store areas that have increased their total sales for the year 2000 by more than 10 percent over the previous year.

This query requires the calculation of the total sales value per store area for the years 2000 and 1999 and then the selection of the areas whose sum of sales for the year 2000 is greater than 110% of their sum of sales for the year 1999. For those store areas the query requires the sum of sales calculated by Month and Brand. This is a typical query that performs selection based on aggregated data at a different level than the data required for its output.

Our experience shows that all the above queries are common OLAP queries, and we strongly believe that a conceptual model suitable for multidimensional analysis should accommodate such queries.
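A Q4-style selection on aggregated data can be sketched in SQL as a subquery that aggregates at the Store Area level and feeds the outer, finer-grained aggregation. The schema and data below are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(year INTEGER, month TEXT, store_area TEXT, brand TEXT, amount REAL);
INSERT INTO sales VALUES
  (1999, 'Jan', 'A1', 'B1', 100), (2000, 'Jan', 'A1', 'B1', 120),
  (1999, 'Jan', 'A2', 'B1', 100), (2000, 'Jan', 'A2', 'B1', 105);
""")
# Q4: keep only store areas whose 2000 total exceeds 110% of their 1999 total,
# then aggregate those areas' 2000 sales per month, store area and brand.
rows = con.execute("""
    SELECT month, store_area, brand, SUM(amount)
    FROM sales
    WHERE year = 2000 AND store_area IN (
        SELECT a.store_area
        FROM (SELECT store_area, SUM(amount) AS t FROM sales
              WHERE year = 2000 GROUP BY store_area) a
        JOIN (SELECT store_area, SUM(amount) AS t FROM sales
              WHERE year = 1999 GROUP BY store_area) b
          ON a.store_area = b.store_area
        WHERE a.t > 1.1 * b.t
    )
    GROUP BY month, store_area, brand
""").fetchall()
```

Area A1 grew by 20% and survives the filter; A2 grew by only 5% and is dropped. The selection happens at one aggregation level (Store Area per year) while the output is produced at another (Month, Store Area, Brand).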
This means that the structure of the query result as well as the structure of any other information involved in the query definition must be easily represented by the concepts of the model. The above requirement comes as a result of our intention to have a conceptual model that can efficiently represent all


kinds of information handled by OLAP applications, and not only the raw (source detailed) data. Therefore, the conceptual data model should also be able to represent the results of queries and cube calculations. Based on the example queries we derive the following set of requirements:

R8. Aggregation based on multiple levels of a dimension: Allow the definition of measure data that corresponds to aggregations on arbitrary combinations of levels (of different paths), even if these levels belong to the same dimension.

R9. Cubes with measures of various granularities: Allow the definition of multidimensional cubes that contain measure data of various granularity levels (i.e. two distinct measure values of one cube can represent aggregations based on different dimension levels, for any of their dimensions).

R10. Multiple measures: Allow multiple measures to be defined for a given set of dimensions and, in some cases, represent them in one concept. This would reflect the fact that those measures are semantically linked.

2.3 Related work

Modeling multidimensional data is not an OLAP-specific issue. In the database community, several research areas, like statistical databases, scientific databases, geographical databases and temporal databases, deal with multidimensional data. Still, each of these areas has particular modeling needs and has developed specialized multidimensional data models. The area closest to data warehouses and OLAP is the statistical database area [Shos97], where several multidimensional models have been proposed [OOM85], [RR91]. In fact, those models were proposed long before the appearance of the term “OLAP” [Codd93].

In the data warehouse and OLAP area the first multidimensional data models were developed by product vendors, as research in the OLAP domain has followed the evolution of industrial products. Also, various standardization bodies have defined their own models [Olap97], [TPC99], [OMG01]; [VaSe99] provides a good overview and comparison of these models. During the last few years a plethora of multidimensional data models for data warehouses and OLAP have been proposed. A comparison of some of them can be


found in [VaSe99] and [SBH99]. We are aware of 12 models that have been published in research papers. Most of them are logical data models and only a few ([TBC99], [S++98]) can be considered purely conceptual. Each of these models has taken a somewhat different modeling approach, ranging from a simple global table to sophisticated object classes.

We evaluated the 12 published models against the requirements described in the previous section. This may not be fair to the purely logical models, since the requirements represent conceptual modeling needs. Still, the evaluation is done only to demonstrate that none of the models published so far has the required expressive power. In fact, the evaluation shows that one requirement is not satisfied by any of the models, and even for the remaining requirements there is no model satisfying all of them.

The requirements of our evaluation have been presented in the previous section and are summarized in the following list. Each requirement states what the model should be capable of representing:

R1. Dimension levels
R2. Grouping/classification relationships
R3. Analysis paths
R4. No obligatory roll-up
R5. No obligatory drill-down
R6. Many-to-many relationships
R7. Hops in the analysis paths
R8. Aggregation based on multiple levels of a dimension
R9. Cubes with measures of various granularities
R10. Multiple measures

Note that the aggregation level of a measure value is the lowest dimension level that can be used to characterize this value. Also, an analysis path is a lattice of grouping/classification relationships defined on a set of levels. This lattice prevents the user from performing a meaningless (according to the schema designer) drill-down or roll-up operation to an arbitrary level of the dimension outside the lattice.


[Table 2-1 is a matrix evaluating twelve published models ([AGS97], [CaTo98], [DaTh97], [GoRi98], [GyLa97], [Lehn98], [LiWa96], [PeJe99], [S++98], [TBC99], [Truj99], [Vass98]) against requirements R1–R10, marking which requirements each model satisfies.]

Table 2-1: Evaluation of multidimensional models

The result of our evaluation is shown in Table 2-1. Note that some of the models ([GyLa97], [Truj99], [AGS97]) represent relationships among levels using user-defined functions, which are then used in operations. Also, other models ([LiWa96], [Lehn98], [DaTh97]) leave the relationships to be defined by the particular data instances and provide no schema definition for them. In both cases we considered that requirements R2, R4, R5, R6 and R7, which involve grouping/classification relationships as part of the schema, are not met. The requirement not met by any of the models is the concept of an analysis path. We believe that this information is an important structural part of the dimension design and should be represented at the conceptual level.

There are a few requirements mentioned in several papers ([Codd93], [PeJe99], [TBC99], [GyLa97]) which we did not take into consideration. In our opinion, the most important of these requirements is the support for correct aggregation of data. As described in [LeSh97], measures cannot always be consistently aggregated by an arbitrary aggregation function. In order to provide support for correct aggregations the model must include additional information regarding measures and grouping/classification relationships. Our model does not include such additional information, since we believe that this kind of information, as well as information


about aggregation functions and derived measures, can be described by an independent functional model which will supplement the conceptual model. A second important requirement stated by various papers ([GyLa97], [PeJe99], [AGS97]) is the need for symmetric treatment of dimensions and measures. It is important to note that what the authors mean by symmetric treatment is the ability to transform a measure into a dimension and the other way around. All models claiming to support this requirement ([GyLa97], [PeJe99], [AGS97]) do so by providing the appropriate transformation operations. Hence, this requirement does not mean that dimensions and measures are represented in the same manner by the model.

Having presented the requirements that we deem necessary for conceptual OLAP models to fulfill, as well as a short review of related work, we next move on to introduce our proposal, the MAC model.


3 The MAC model

In this chapter we give the formal definition of the MAC conceptual data model, define operations on the concepts of the model and propose a graphical representation for MAC schemata. A preliminary version of the MAC conceptual data model appeared in [TKS01].

3.1 Introduction

The MAC (Multidimensional Aggregation Cube) model has been designed to be a conceptual model suitable for Multidimensional Analysis and OLAP. The use of general-purpose models in the area of OLAP and Multidimensional Analysis has proven to be problematic due to the structural complexity of the information. The main problems we have identified when using a general-purpose model, like the Entity-Relationship model, are the following:

• General-purpose models cannot represent the specific characteristics of the domain in a compact way. The result when using such models is highly complicated and unclear schemata, understandable only by the designers and not by the users. However, Multidimensional Analysis and OLAP constantly require the examination and modification of the structure of the information. This means that the users need to access, understand and modify the structure of the information as part of their daily operations. New dimensions need to be defined and others need to be modified. Dimension levels need to be created and new grouping and aggregation relationships need to be used. Additional measures need to be computed and derived from the initial information. None of these operations can be done efficiently when the schema is too complicated for the user to manage.

• The concepts used by the users in Multidimensional Analysis are not modeled in a unique way. The designer can use various ways and strategies to map a user's concept, like a dimension, to structures of the model. For example, when using the Entity-Relationship model a dimension can be


modeled as an attribute, a single entity or even a number of entities related by various relationships. This fact adds to the complexity of the schema and hampers its maintainability: different users can give different interpretations of a schema and modify it using different tactics.

• Additional problems appear when a logical model is used instead of a conceptual model. For performance reasons the logical schema may discard some information or it may add to the complexity of the problem. For example, when defining a dimension using a relational table with the attributes Day, Month, Year that has Day as a key, the information that each month belongs to only one year is lost. A different example is when a cube is split into a number of sub-cubes for performance reasons. Such cases not only confuse the users but also make it very hard to optimize queries on these schemata. This fact is of critical importance, since OLAP queries are known to be very expensive and the optimizations can have a significant impact.

A specialized conceptual model for OLAP and Multidimensional Analysis should solve (as much as possible) the above problems, hopefully without creating others. In the literature various models have been proposed, as explained in section 2.3. The MAC model, presented next, was designed to solve the above problems and provide a modeling solution mainly for OLAP and Multidimensional Analysis applications.

While designing the model a number of additional goals and sub-goals were considered. The most important of these goals are:

• The model should use concepts that are close to the way users like to perceive the information, helping them to understand and use the model.

• The model should be simple and use as few concepts as possible, but each concept should have a clear and distinct role. This helps the user to understand the model and restricts the number of operations that need to be considered, improving the optimization opportunities.

• The model should be as broad as possible in order to be usable even with strange and unusual modeling requirements.

• The operations on the concepts should be as few and as simple as possible. However, the model should distinguish between simple and expensive operations. This separation aims at creating optimization opportunities.


• The operations should have minimal or no overlap. The purpose and effect of an operation should be clear and distinguishable from those of other operations.

• The MAC model should be as similar as possible to other existing models.

The MAC model uses concepts that are familiar to OLAP users, like cubes, dimensions and levels. The MAC model formally defines these concepts, as well as a number of operations that appear to be useful when working with them. Just like many other models, MAC models the elementary data using attributes. The attributes are the building blocks for the Dimension Levels, which are linked and organized using Dimension Paths and Dimensions. The Multidimensional Aggregation Cubes (MACs, or simply Cubes) are the most complex concept of the MAC model, representing a collection of measures that are related to particular dimensions.

A set of MAC-model concepts that describe a Multidimensional Database is called a MAC schema. A MAC schema can contain the definition of Dimension Levels, Dimension Paths, Dimensions and Cubes (MACs). When working with a database schema it is necessary to have a way to identify each member (concept) of the schema. For example, in a relational database schema each table has a unique name. Likewise, a MAC schema assigns a unique name to each of its contained concepts: each Dimension Level, Dimension Path, Dimension and Cube contained in a MAC schema is assigned, within the schema, a unique identifying name. These names can be used in operations and expressions as well as in the graphical representation of MAC concepts.

A distinguishing property of the MAC model is the definition of the Dimension Path concept in combination with the definition of the Dimension concept. A Dimension Path defines a meaningful hierarchy of Dimension Levels and is used not only to define the aggregation relationships among the various Dimension Levels in a compact way, but also to encode an important aspect of the OLAP information: the valid sequences of abstraction operations (drill-down/roll-up) that the user can request. No other OLAP model known to us explicitly encodes this information. Some of them store this information as part of Dimensions, restricting Dimensions to only one hierarchy. Others ignore it completely, providing only the aggregation relationships between individual Dimension Levels. The MAC model uses the Dimension Path concept to encode the valid sequences of abstraction operations and at

3-27

the same time to provide a compact representation of all the individual aggregation relationships that exist among the Levels of a Path. The Dimension Paths are then used in the definition of Dimensions. Dimensions are defined as sets of Paths. This general definition allows the modeling of very complicated scenarios with multiple hierarchies, multiple detailed and top levels or even multiple hierarchies that appear unrelated. In the following sections we give the complete and formal definition of the above concepts and define operations on each of these concepts. The operations defined by the MAC model serve two purposes. First they define a way to define queries at the conceptual level. At this level the queries should be easier to formulate and be more understandable by the users. Secondly, operations allow for the definition of derived instances (views). The definition of views is equivalent to the definition of consistency constrains. So, using the defined operations the designer can define derived concepts and impose consistency constrains.

3.2 Dimension Levels

A Dimension Level is a set of Dimension Members. The Dimension Members are the most detailed modeling concepts of the MAC model and represent instances of real-world properties that OLAP measures may have. In our example scenario, the sale price is one measure of our multidimensional analysis. A property of this measure is the location where the sale took place: the cashier where the sale was recorded. For our example scenario we would define the Dimension Level Cashier in order to represent the cashiers where the sales transactions take place, and each particular cashier of each store would be modeled as a Dimension Member of this Level.

The Dimension Members have one or more attributes. Each Dimension Level (also called a Level for short) uses a subset of these attributes to define the key of the Dimension Level. In most cases a single attribute acts as the key, but multi-attribute keys can also exist. The key is needed because of the set semantics of Dimension Levels: the set semantics guarantees that each Dimension Member represents a property instance that is uniquely identifiable within the Level. Note that the attributes that form the key of a Dimension Level are the only attributes that are required to be common among the Dimension Members of a Level.


Each Dimension Member can have an arbitrary set of other attributes. This "freedom" allows the definition of Levels containing Dimension Members with different sets of attributes; only the key attributes are known to be common among all Dimension Members. In fact, a Dimension Level does not have a predefined schema but only a predefined key. This design choice does not follow the structured-data paradigm, where the schema is clearly separated from the instance. Dimension Levels can be considered semi-structured, having only a well-defined key at design time and not a well-defined attribute structure. We follow this modeling choice in order to provide flexibility to the model and allow the natural representation of real-life scenarios.

A typical example is the Level Product, which usually contains not only a product-ID key attribute but also details concerning the individual products. These details differ across product groups, since different products have different properties and need different descriptive attributes. For example, a computer motherboard product would have an attribute describing the CPUs supported by the motherboard, while such a descriptive attribute is meaningless for a dishwasher.

Our example scenario requires several Dimension Levels to be defined. In fact, all levels shown in Figure 2-1 are modeled as Dimension Levels of the MAC model. The attributes of each Level (i.e. the attributes of the Dimension Members contained in the Level) depend on the available data and the analysis requirements. For example, the Residence City Level could have the attributes ID, Name and Population, where ID is the key of the Level, Name is the real-world name of the city and Population stores the estimated population of the city.

Next we give the formal definition of what we call a Dimension Level:

Definition 3-1: Dimension Level: A Dimension Level is defined by the tuple <Dm, Dkey> where Dkey represents a set of attribute names and Dm represents a set of Dimension Members, each of which contains the attributes of Dkey and assigns to them a unique value (within Dm). If Dkey is an empty set then Dm can contain at most one Member. If Dm is an empty set then Dkey can contain any attribute.

The attribute names, as well as all other names used in the following, are strings of characters. We can always assume the existence of a countable set of character strings to which all such names belong.
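To make Definition 3-1 concrete, the following is a minimal Python sketch (our own illustration, not part of the thesis): a Member is a plain dict of attribute name to value, and a Level is built from a set of Members plus a key that must identify each Member uniquely within the Level. The function name make_level is our invention.

```python
# Illustrative sketch of Definition 3-1: a Dimension Level as (Dm, Dkey).
# A Member is a dict; the key is a set of attribute names that every Member
# must define with values unique within the Level.

def make_level(members, key):
    key = frozenset(key)
    seen = set()
    for m in members:
        # Every Member must contain all the key attributes.
        if not key.issubset(m):
            raise ValueError("Member %r lacks key attributes" % (m,))
        # Key values must be unique within Dm.  Note that with an empty
        # key every Member has the same (empty) key value, so an empty
        # key admits at most one Member, as the definition requires.
        kv = tuple(sorted((a, m[a]) for a in key))
        if kv in seen:
            raise ValueError("duplicate key value %r" % (kv,))
        seen.add(kv)
    return {"members": list(members), "key": key}

customer = make_level(
    [{"ID": 1, "Name": "Adam", "FavoriteSport": "Tennis"},
     {"ID": 2, "Name": "Eva", "FavoriteColor": "Red"}],
    key={"ID"})
```

Note how the two Members carry different non-key attributes, mirroring the semi-structured nature of Levels described above.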


3.2.1 Notation:

In the following, in order to represent a Dimension Member, which we call Member for short, we will use the notation <attribute_1, …, attribute_n> where each attribute_m (1≤m≤n) is of the form attribute_name_m=value_m.

If M is a Dimension Member then M.attribute_name represents the value of the Member's attribute attribute_name.

Note that in our definitions and examples we use italics for variables (for example attribute_1), courier font for example names (for example Level1) and bold courier font for reserved words (for example Key).

Example: <ID=1,Name="Athens",Country="Hellas">.Country = "Hellas"

To define a Dimension Level from a set of Members we use the syntax:
Level(Members={member_1, …, member_n}, Key={key_attribute_1,…,key_attribute_k})

Example: A very simple Customer Level could be defined as:
Level(Members={<ID=1,Name="Adam",FavoriteSport="Tennis">, <ID=2,Name="Eva",FavoriteColor="Red">}, Key={ID})

3.2.2 Graphical notation for Dimension Levels

Dimension Levels are represented by rectangles, as shown in Figure 3-1. The name of the Dimension Level, which serves as its unique identifier within a schema, appears in the top section of the rectangle. Apart from the name, the rectangle may contain a second section listing the key attributes of the Dimension Level (Figure 3-1a). When we do not need to show the key attributes we can use the simpler notation of Figure 3-1b.


[Figure omitted: (a) a rectangle with the Level name on top and a second section listing the key attribute names; (b) the simplified form showing the Level name only]

Figure 3-1: Dimension Levels

The Dimension Members of a Level are seldom shown in graphical form. If needed, the Members can be represented as records in a third section of the Dimension Level's rectangle. Figure 3-2 shows an example.

Users (key: ID):
<ID="john", Full_name="John Black", Age=33>
<ID="suzan", Full_name="Suzan Smith", Age=25>

Figure 3-2: Dimension Members

3.2.3 Properties of Dimension Levels

In order to use Dimension Levels in MAC-algebra expressions we define a number of properties for Dimension Levels. We use the common object-oriented syntax: if L1 is a Level and P is a property name, then L1.P is the value of this property for the Level L1. Note that the names used for properties are considered reserved words and are denoted in bold in the following. The defined properties are:

Key: The set of key attributes of the Dimension Level. Syntax: Level.Key

Attributes: The set of all attributes defined by the Members of the Dimension Level. Note that not all Members of a Level need to have the same attributes. Syntax: Level.Attributes

IsEmpty: True if the Dimension Level contains no Members, false otherwise. Syntax: Level.IsEmpty
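The three properties can be mirrored directly in code. The following is a small self-contained Python sketch (ours, with an ad-hoc Level class) showing that Attributes must be computed as a union over Members, precisely because Members need not share a schema.

```python
# Illustrative sketch: the Key, Attributes and IsEmpty properties of a
# Dimension Level, with Members represented as plain dicts.

class Level:
    def __init__(self, members, key):
        self.members, self.key = list(members), frozenset(key)

    @property
    def Key(self):
        return self.key  # the set of key attribute names

    @property
    def Attributes(self):
        # Union of the attributes of all Members: there is no fixed
        # schema to read this from, only the Members themselves.
        attrs = set()
        for m in self.members:
            attrs.update(m.keys())
        return attrs

    @property
    def IsEmpty(self):
        return not self.members

users = Level([{"ID": "john", "Full_name": "John Black", "Age": 33},
               {"ID": "suzan", "Full_name": "Suzan Smith"}], {"ID"})
```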

3.2.4 Operations on Dimension Levels

Select: Select a number of Members of a (source) Dimension Level to create the result Level. The result Level has the same key as the source Level. A Boolean condition is used to distinguish the qualifying Members. The condition is applied to each Member of the source Level separately and has access to two variables: source, which represents the entire source Level, and member, which represents the Member under examination. So, an expression like member.attrib1="value1" qualifies only the Members that have the attribute attrib1 with the value "value1". Note that NULL values are treated just like in SQL expressions, using a three-valued logic and treating Unknown as False. Also, invalid expressions are considered false: if a Member does not have an attribute named attrib1 then the above example expression is false for that Member.

Syntax: Level.Select(member_select_condition)

Example: Starting from a Level L that contains all the products, create a new Level by selecting only the products that were released in 2003. If we assume that the release year of a product is stored in the attribute releaseYear of the Members in L, then we can use the following expression to create the new Level L2:
L2=L.Select(member.releaseYear=2003)

Project: Starting from a (source) Dimension Level, create a new Level that contains the same set of Dimension Members but with a modified schema. The operation can modify the schema of the Dimension Members by renaming attributes, removing non-key attributes, adding new attributes and modifying the values of non-key attributes. The key of the resulting Level contains all the key attributes of the source Level (possibly renamed) plus any of the added attributes. The operation is restricted from removing key attributes or modifying their values, as this could create multiple Members with equal key values; only non-key attributes can be removed or have their values modified.

The parameters of the operation define the attributes to be added (to all Members) by giving their names and the expressions that assign a value to each of these new attributes; the parameters also define which of the new attributes (if any) are added to the key of the resulting Level. A second parameter defines which attributes are to be removed (from all Members that have such attributes). A third parameter defines attribute renamings and a final parameter defines the attribute value modification expressions. A value modification expression defines the new value of an attribute only for the Members that already have that attribute; Members that do not have the attributes mentioned in the modification expressions are not altered. The expressions used in the parameters have access to two variables: level, which represents the entire source Level, and member, which represents the Member being examined.

Syntax: Level.Project(list_of_added_attributes, list_of_deleted_attributes, list_of_renamed_attributes, list_of_modification_expressions)

Example: We assume the existence of a Customer Level that contains, among others, the following information: the address of the customer in the form of a StreetAddress and a RegionAddress attribute, the average annual income of the customer in the form of a numerical AnnualIncome attribute that represents the income in Drachmas (the Greek currency before the Euro), the home telephone number of the customer in the attribute HomeTelephone, and the passport number of the customer in the attribute PassportNum. We assume also the existence of the function AddressToZipCode that returns the zip code of a given address. Starting from the Customer Level we can create a new Level that contains all the attributes of the Customer Level except for the PassportNum attribute and, in addition, contains the zip code of the address in the attribute PCode. Furthermore, we require the new Level to rename the HomeTelephone attribute to Ph and to change the value of the AnnualIncome attribute so that it represents Euros and not Drachmas. The following operation creates the new Level NCustomer:
NCustomer=Customer.Project(add PCode = AddressToZipCode(member.StreetAddress, member.RegionAddress), remove PassportNum, rename Ph=HomeTelephone, modify AnnualIncome=member.AnnualIncome/340.75)
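The behavior of Select and of a simplified Project can be sketched in Python (our own dict-based representation, not the thesis notation; the order in which rename/remove/modify/add are applied is our assumption, as the text leaves it open):

```python
# Illustrative sketch: Select and a simplified Project over Members
# represented as dicts.  As in the text, a selection expression that
# references a missing attribute simply evaluates to false.

def select(members, condition):
    result = []
    for m in members:
        try:
            keep = bool(condition(m))
        except KeyError:      # Member lacks a referenced attribute
            keep = False
        if keep:
            result.append(m)
    return result

def project(members, add=None, remove=(), rename=None, modify=None):
    out = []
    for m in members:
        m = dict(m)
        for old, new in (rename or {}).items():   # rename attributes
            if old in m:
                m[new] = m.pop(old)
        for attr in remove:                        # drop non-key attributes
            m.pop(attr, None)
        for attr, f in (modify or {}).items():     # only if attribute exists
            if attr in m:
                m[attr] = f(m)
        for attr, f in (add or {}).items():        # add new attributes
            m[attr] = f(m)
        out.append(m)
    return out

products = [{"ID": 1, "releaseYear": 2003}, {"ID": 2, "releaseYear": 1999}]
l2 = select(products, lambda m: m["releaseYear"] == 2003)

customers = [{"ID": 1, "HomeTelephone": "210-1", "AnnualIncome": 340750.0}]
ncustomer = project(customers,
                    rename={"HomeTelephone": "Ph"},
                    modify={"AnnualIncome": lambda m: m["AnnualIncome"] / 340.75})
```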


Derive: Use the Members of the (source) Dimension Level to create a new Dimension Level. This operation is very general and allows for the creation of a completely new Level. The new Dimension Members are created from one or more Dimension Members of the source Level. A number of attributes can be used to group the source Dimension Members, and for each group a new Dimension Member is created in the result Level. If the grouping attributes contain the key of the source Level then each source Dimension Member contributes to exactly one Member of the result. The attributes of the Members in the resulting Level are defined using (aggregation) functions on the attributes of the source Members from which they are created. The operation is similar to the generalized projection operation of the extended relational algebra.

The parameters of the operation specify the grouping attributes (if any), the names of the attributes in the result Level and the expressions that define their values. These expressions can be a simple equality with an attribute of the source Level or a complex expression involving aggregation functions. The key of the resulting Level contains the grouping attributes defined by the operation. The source Level Members are grouped according to their values on the grouping attributes; for each group a new Member is created in the result Level and the expressions that define the values of the new attributes are evaluated. These expressions have access to the variables level, which represents the entire source Level, and member, which represents a random Member of the group; within an aggregation function, member represents the set of Members of the group. Note that, just like in SQL, the grouping process treats NULL values as equal. If the grouping attributes are not specified, the entire source Level is considered as one group and the result Level contains (at most) one Member.

Syntax: Level.Derive(list_of_derived_attributes, list_of_grouping_attributes)

Example: Starting from the NCustomer Level presented in the previous example (the result of the Project operation) we can derive a new Level CustAvrgIncPCode that contains the average annual income for each customer residence region, identified by a zip code. The Members of the NCustomer Level must be grouped according to the zip code and the average annual income must be calculated for each group. The resulting Level needs to contain only two attributes: the zip code and the average annual income. The following operation defines the new Level CustAvrgIncPCode:
CustAvrgIncPCode=NCustomer.Derive({AvrgInc=AVRG(member.AnnualIncome)},{PCode})

Union: Create the union of the Members of the two source Levels. The operation is similar to the set union operation; the difference lies in the special handling of key attributes and conflicts. The key of the resulting Level is the union of the keys of the two sources. If a Member of one of the source Levels does not have one of the key attributes of the result Level then a default value is assumed. The parameters of the operation can define the default value for the key attributes of the result Level; if a default value is not defined, NULL is used.

A conflict exists whenever there is a pair of Dimension Members, one from each Level, that have the same values for the (result Level) key attributes but differ in the values of other attributes. The operation can contain directives on how to resolve a conflict: discard the mismatching attributes, discard the first or second Member completely, or change the values of some of the non-key attributes of the resulting Member. If such a directive is not given, the operation follows the default conflict resolution strategy and discards all attributes that do not match.

For example, if Level A has the key {a1, a2} and Level B has the key {a1, b1}, then the union of these two Levels has the key {a1, a2, b1}. When a Member of A does not define the b1 attribute, the default value for b1 is assumed; similarly, when a Member of B does not define the a2 attribute, the default value is assumed. If m1=<a1=1, a2=NULL, x=10> is a Member of A, m2=<a1=1, b1=NULL, x=11> is a Member of B and NULL is the default value for all key attributes of the union of A and B, then m1 and m2 create a conflict: both Members have the same result key value <a1=1, a2=NULL, b1=NULL> but they assign different values to the x attribute. The default conflict resolution policy is to discard x and create a Member in the result Level with only the remaining attributes a1, a2 and b1.

Syntax: Union(Level1, Level2 [; default_key_value, conflict_resolution_policy])

Example: Starting from the Levels StoresInWestEurope and StoresInRussia we can create a new Level AllStores that contains all the stores in West Europe and Russia. Assume that the key of the Level StoresInWestEurope is {CountryID, StoreID} while the key of the Level StoresInRussia is only {StoreID}. The resulting Level AllStores will have the key {CountryID, StoreID} and we can use as the default value for the CountryID attribute the country code for Russia (which does not exist in StoresInWestEurope). In this way the following union operation runs into no conflicts, maintaining all information from both Levels:
AllStores = Union(StoresInWestEurope, StoresInRussia; CountryID="Russia")

Difference: Create a new Level from the Members of the first source Level, excluding the Members that match Members of the second Level. The key of the result Level is the key of the first source Level. The operation is similar to the set-theoretic difference; the difference is that there are various ways to decide when two Members are equal. The parameter of the operation defines a Boolean matching expression, which is evaluated for each combination of Members, one from the first source Level and one from the second. If the expression evaluates to true then the involved Member of the first source Level is excluded from the result. Note that if an expression references an attribute that does not exist for the examined Member then the expression evaluates to false; NULL values are treated as in SQL Boolean expressions. The default matching expression requires equality on all attributes of the first source Level's Members: for a Member m1 to be excluded from the result, the second source Level must have a Member m2 that contains (at least) all the attributes of m1 and assigns to them the same values as m1.

Syntax: Difference(Level1, Level2 [; matching_expression])

Example: Assume OldProducts is a Level containing very old products that stores do not sell any more. Then the difference of the Products Level with the OldProducts Level creates a new Level containing only the Members of Products that are not in OldProducts.
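The grouping behavior of Derive and the default matching rule of Difference resemble SQL's GROUP BY and set difference. A rough Python sketch under our dict-based representation (function names and data are ours; None plays the role of NULL and, as in the text, NULLs compare equal during grouping):

```python
# Illustrative sketch: Derive (group Members, aggregate per group) and
# Difference with its default matching rule (a Member of the first Level is
# excluded when the second Level has a Member containing all its attributes
# with equal values).

def derive(members, derived, group_by=()):
    """derived maps a new attribute name -> function over the group (a list)."""
    groups = {}
    for m in members:
        gk = tuple(m.get(a) for a in group_by)   # missing/NULL values group together
        groups.setdefault(gk, []).append(m)
    result = []
    for gk, group in groups.items():
        new = dict(zip(group_by, gk))            # grouping attributes form the key
        for attr, f in derived.items():
            new[attr] = f(group)
        result.append(new)
    return result

def difference(members1, members2):
    def matches(m1, m2):
        return all(a in m2 and m2[a] == v for a, v in m1.items())
    return [m1 for m1 in members1
            if not any(matches(m1, m2) for m2 in members2)]

ncustomer = [{"ID": 1, "PCode": "111", "AnnualIncome": 1000.0},
             {"ID": 2, "PCode": "111", "AnnualIncome": 3000.0},
             {"ID": 3, "PCode": "222", "AnnualIncome": 500.0}]
avg_by_pcode = derive(
    ncustomer,
    {"AvrgInc": lambda g: sum(m["AnnualIncome"] for m in g) / len(g)},
    group_by=("PCode",))
```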


Join: Create a new Level from the combination of particular Members of the two source Levels. The operation is similar to the relational join. The operation defines a join condition that is evaluated for every pair of Members, one from each Level. If the condition evaluates to true, a new Member is added to the result Level. This new Member contains all attributes of both Members, but the attributes are renamed by adding a supplied prefix to the original names; a different (non-null) prefix must be specified for the first and the second source Level. The key of the result Level is the set containing the renamed key attributes of both sources. The join condition is a Boolean expression that uses the renamed attribute names of the two examined Members, as well as the variables level1 and level2 to refer to the first and second source Level.

Syntax: Join(Level1, Level2 ; name_prefix1, name_prefix2, join_condition)

Example: Assume the Levels CustAvrgIncPCode and NCustomer defined in the examples of the Derive and Project operations. Assume also that for a number of customers in NCustomer the attribute AnnualIncome has a NULL value due to lack of information. We can create a new Level that contains all customers of NCustomer and in which the attribute EstAnnualIncome is assigned either the actual non-NULL value of AnnualIncome or an estimated value: the average annual income of all other customers that live in the same area, where areas are identified by zip code. To create this Level we use a Join operation that joins each Member of the NCustomer Level to the appropriate Member of the CustAvrgIncPCode Level containing the average annual income value. Then a Project operation creates the EstAnnualIncome attribute and removes the redundant attributes:
Join(NCustomer, CustAvrgIncPCode; "_", "Avrg_", _PCode=Avrg_PCode).Project(add EstAnnualIncome=IF(IsNull(_AnnualIncome) then Avrg_AvrgInc else _AnnualIncome), remove {Avrg_PCode, Avrg_AvrgInc})

Other common operations like intersection, semi-join and outer-join can also be defined using the above operations.
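The prefix-renaming step of the Join operation can be sketched as follows (a nested-loop join in our dict representation; names and sample data are ours):

```python
# Illustrative sketch: Join of two Levels.  The attributes of each side are
# renamed with the supplied prefix before the join condition is evaluated,
# so the condition sees the renamed attribute names.

def join(members1, members2, prefix1, prefix2, condition):
    result = []
    for m1 in members1:
        for m2 in members2:
            combined = {prefix1 + a: v for a, v in m1.items()}
            combined.update({prefix2 + a: v for a, v in m2.items()})
            if condition(combined):
                result.append(combined)
    return result

# A customer with an unknown (NULL) income, joined to the per-zip-code average.
ncustomer = [{"ID": 1, "PCode": "111", "AnnualIncome": None}]
averages = [{"PCode": "111", "AvrgInc": 2000.0}]
joined = join(ncustomer, averages, "_", "Avrg_",
              lambda r: r["_PCode"] == r["Avrg_PCode"])
```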


3.3 Dimension Paths

The Dimension Paths define hierarchical relationships among Dimension Levels, modeling meaningful sequences of drill-down/roll-up operations. In the field of multidimensional analysis, the drill-down and roll-up operations follow pre-designated paths rather than arbitrary group-by sets. A Path is defined on the Dimension Levels from which and to which the drill-down and roll-up operations are performed. For example, if the user wants to roll up from days to months, then the Levels Day and Month will be part of a Path.

Each Path defines a total ordering of its Levels according to the hierarchy that it represents. The first Level is the most detailed Level of the hierarchy (the bottom-level) while the last Level is the top-level of the hierarchy (the most aggregated Level). Often, the top-level of a Path is the ALL Level. This Level contains only one Member (all) and represents the aggregation of all lower-level Members (into one Member). For example, the Path Day-Month-ALL could represent the roll-up/drill-down operations performed from days to months and then the aggregation of all months (from months to all). However, Paths do not treat the ALL Level in a special way and one can define Paths that do not have ALL as the top Level.

Apart from being ordered, the Levels of a Path are assigned unique names. The names are used to identify the concept that each Level represents in the Path. This information is needed both by the users and by the operations applied on Paths. For example, the union operation applied on two Paths uses the Level names to identify which Levels are common to both Paths. The Levels on which a Dimension Path is defined, along with the names and the ordering assigned to them, define the schema of the Path. The Path instance is a set of "links" that relate Members of the involved Dimension Levels.
Definition 3-2: Dimension Path "links": A "link" of a Dimension Path is a relationship between a Member of a Level A and a Member of a Level B, where A, B are two distinct Levels of the Path. A Path "link" may contain a finite non-negative number of properties. These properties are defined by tuples of the form <attribute_name, attribute_value>.


The "links" define the hierarchical relationships that exist among the Levels of the Path. We use the notation <A.m1, B.m2, P> to describe a "link", where A and B are Level names in the Path and m1, m2 are key values identifying Members in A and B respectively. The P part is a set of attribute names along with values for these attributes; these attributes provide additional (optional) descriptive information about the "link".

The "links" are used to define aggregation/decomposition relationships among Levels. For example, in a typical Day-Month-Year Path the Members of the Day Level are linked to the corresponding month in the Month Level and the months are linked to the corresponding year in the Year Level. So, the "links" identify how lower-level Members are grouped into higher-level Members and, at the same time, how to drill down from a higher-level Member to the lower-level Members.

The typical usage of "links" is to relate the Members of a Level with Members of the exactly next Level (in the order defined by the Path). For example, in the Path Store-City-Country, the Members of the Store Level are related to the corresponding Members of the City Level, and the Members of the City Level are related to the corresponding Members of the Country Level. In this case there is no "link" relating a store directly to a country. However, the MAC model does not impose this restriction, in order to allow the modeling of arbitrary hierarchies: a Member of a Level can be related through a "link" to a Member of any other Level, not necessarily the next in order. Even in a simple Store-City-Country Path the user may need to relate a store directly to a country and not to a city, if, for example, the city is unknown or the store is not within a city.

The typical and simplest case is when each Member of a Level is linked to exactly one Member of the order-wise next Level and to no other Member. In this case we have a strict, complete and non-overlapping hierarchical relationship in the Path.

Definition 3-3: Strict Path. We call a Path strict when the "links" relate the Members of a Level only to Members of the order-wise next Level. Recall that the Path defines a total ordering among the participating Levels, so each Level (except for the top-level) has exactly one order-wise next Level.


Definition 3-4: Complete Path. A Path is called complete if each Member of a Level is related to both a Member in a lower Level and a Member in a higher Level (when such Levels exist).

Definition 3-5: Non-overlapping Path. A Path is called non-overlapping when no Member is related to more than one higher-level Member.
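Definitions 3-3 to 3-5 are easy to check mechanically. Below is an illustrative Python sketch (our own representation: the Path's Levels as an ordered list of names, each "link" as a tuple (levelA, memberA, levelB, memberB) with levelA below levelB; link parameters are omitted for brevity):

```python
# Illustrative sketch: checking the strict, complete and non-overlapping
# properties of a Dimension Path on a simplified link representation.

def is_strict(level_names, links):
    # Links may only relate a Level to the order-wise next Level.
    nxt = {a: b for a, b in zip(level_names, level_names[1:])}
    return all(nxt.get(la) == lb for la, _, lb, _ in links)

def is_complete(level_names, members_of, links):
    # Every Member must be related both downwards and upwards,
    # when a lower/higher Level exists.
    lower_end = {(la, ma) for la, ma, _, _ in links}
    upper_end = {(lb, mb) for _, _, lb, mb in links}
    for i, lv in enumerate(level_names):
        for m in members_of[lv]:
            if i < len(level_names) - 1 and (lv, m) not in lower_end:
                return False   # no link to any higher Level
            if i > 0 and (lv, m) not in upper_end:
                return False   # no link from any lower Level
    return True

def is_non_overlapping(links):
    # No Member may be linked to more than one higher-level Member.
    seen = {}
    for la, ma, lb, mb in links:
        if seen.setdefault((la, ma), (lb, mb)) != (lb, mb):
            return False
    return True

day_month = [("Day", "2005-02-01", "Month", "2005-02"),
             ("Day", "2005-02-02", "Month", "2005-02")]
```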

For example, in the Path Product-Advertising_Method a product may be related to two or more advertising methods (Members of the Advertising_Method Level); in this case the Path is called an overlapping Path. If each product is linked to at most one advertising method then we get a non-overlapping Path.

Although most Paths are expected to have the above properties, there are cases where the user requirements lead to a number of "anomalies": non-strict Paths, incomplete Paths or even overlapping Paths. In order to allow the modeling of such hierarchies and Paths with the MAC model, the definition of a Dimension Path is quite generic and imposes a minimum number of restrictions.

The ordering of Levels in a Path implies a number of dependencies: each Level is considered to be dependent on the previous Level of the Path. Recall that the Dimension Members are used to describe properties of measures. Assume that a measure value has a property that is described by the Dimension Member Mx of the Level LevelX, and that LevelX+1 is the next (higher) Level in the Path. In this case all the properties of the particular measure value that are described by Members of LevelX+1 are considered to be determined by the links of the Path that connect the Mx Member to Member(s) of LevelX+1. So, we say that the Level LevelX+1 is dependent on the Level LevelX.

The previous statement describes the semantics of using Paths for analyzing the properties of measures. This means that a Path should not contain an arbitrary combination of Dimension Levels, but only Levels that describe properties that are related and dependent on one another.


Definition 3-6: Dimension Path. A Dimension Path is a tuple of the form <ListOfLevels, SetOfLinks> where ListOfLevels is a list of the form <LevelName1:Level1, LevelName2:Level2, …, LevelNameN:LevelN> and SetOfLinks is a set of "links". Both ListOfLevels and SetOfLinks can be empty.

ListOfLevels assigns a distinct name to each participating Level: Level1 is assigned the name LevelName1, Level2 the name LevelName2, and so on. The names are used to identify the Levels within the Path, so a name cannot appear twice in the list ListOfLevels. Note that Level1 and Level2 need not be distinct Levels. The list ListOfLevels defines a total ordering for the participating Levels: LevelName1 < LevelName2 < … < LevelNameN. We call the Level LevelName1 the detailed Level (bottom-level) and the Level LevelNameN the top-level.

The set SetOfLinks contains "links", which are tuples of the form <LevelNameA.memberA, LevelNameB.memberB, LinkParameters>. For each "link", LevelNameA and LevelNameB must be Level names assigned in the ListOfLevels and LevelNameA < LevelNameB must hold. Also, memberA must be the key value of a Member of the Level identified by the name LevelNameA and memberB must be the key value of a Member of the LevelNameB Level.

LinkParameters is a set of <attribute_name, attribute_value> pairs that contains no two members with the same attribute_name part. The attribute_name part defines an attribute name while the attribute_value part contains the value for this attribute.
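Definition 3-6 can be turned into a small validating constructor. The sketch below is our own illustration (the function make_path and its representation of levels and links are inventions, not thesis notation); it checks the two structural constraints of the definition: distinct Level names, and links that respect the total ordering.

```python
# Illustrative sketch: a Dimension Path per Definition 3-6.
# list_of_levels: ordered (name, level) pairs; links: tuples
# (nameA, memberA, nameB, memberB, params) with params a dict.

def make_path(list_of_levels, links):
    names = [name for name, _ in list_of_levels]
    if len(set(names)) != len(names):
        raise ValueError("Level names within a Path must be distinct")
    order = {name: i for i, name in enumerate(names)}
    for na, ma, nb, mb, params in links:
        if na not in order or nb not in order:
            raise ValueError("link references an unknown Level name")
        if not order[na] < order[nb]:
            raise ValueError("links must point from a lower to a higher Level")
    return {"levels": list(list_of_levels), "links": list(links)}

# The same Level may participate twice under different names, as in the
# Product Competition example of the text.
product = {"members": [{"ID": "p1"}, {"ID": "p2"}], "key": {"ID"}}
competition = make_path(
    [("Original Product", product), ("Competitive Product", product)],
    [("Original Product", "p1", "Competitive Product", "p2", {})])
```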

3.3.1 Graphical notation for Dimension Paths

In order to represent Dimension Paths in graphical form we use the notation of Figure 3-3. Just like Dimension Levels, Dimension Paths are assigned a unique name within a schema. The Levels of the Path are linked to the circles of the Path according to the total order defined by the Path: the bottom Level (the most detailed) is linked to circle number 1 while the top Level (the most aggregated) is linked to the last circle of the Path. Each link from a Level to a circle has a name: the name assigned by the Path to the Level.


[Figure omitted: a column of numbered circles, 1 at the bottom through N at the top, labeled with the Path name; each circle is linked to a Dimension Level rectangle, and each link carries the name the Path assigns to that Level]

Figure 3-3: Dimension Paths

In Figure 3-4 we show two examples of Dimension Paths. The first is the Path P7 shown in Figure 2-1. The top Level of the Path is the ALL Level while the most detailed Level is the Product Level. In this case the Path does not rename any Level, maintaining the names assigned to the Dimension Levels in the schema. In such cases the link from the Level to the Path's circle may appear without a name in order to simplify the notation. The second example Path is the Product Competition. This Path contains the Level Product twice and illustrates the case where the same Level may appear more than once in a Path. The Path links each product of the Original Product Level to its main competitive product in the Competitive Product Level. Both Original Product and Competitive Product are in fact the same Level of the schema: the Product Level.


[Figure omitted: Path P7 with the Levels Product (circle 1), Product Group (circle 2), Product Class (circle 3) and ALL (circle 4); and Path Product Competition, in which the Product Level appears twice, as Original Product (circle 1) and Competitive Product (circle 2)]

Figure 3-4: Examples of Dimension Paths

3.3.2 Operations on Dimension Paths

Select: Select a number of links from the source Path maintaining the same Pathschema. The select operation creates a new Path that has the same structure with the source Path but contains only a subset of its links. So, the source and result Path differ only in the links that they contain. The parameter of the operation is a Boolean condition that is evaluated for each link of the source Path. The result Path contains all the links for which the condition evaluates to true. The conditions have access to the variables: fromMember representing the Member of the lower-level that participates in the link, fromLevel representing the Level to which fromMember belongs to, toMember representing the higher-level Member that participates in the link, toLevel representing the Level of toMember, parameters representing the parameters (descriptive attributes) of the examined link and source representing the entire source Path. Note that the Members of the Levels are not modified; only the links are restricted. Syntax: Path.Select(<link_select_condition>) Example: Assume we have the simple Path DaysWeeks in which every week is linked to the corresponding 7 days. From this Path a user might want to define the


Working_DaysWeeks Path that links to each week only the corresponding working days. Such a Path is needed in order to aggregate over the working days only, and not over all the days. A selection operation on DaysWeeks can create Working_DaysWeeks: if the Day Level has an attribute HolidayFlag that identifies the non-working days, then the selection condition is fromMember.HolidayFlag=false. The following operation applies this condition on the Path DaysWeeks, removing all links that start from non-working days and creating a new Path:

DaysWeeks.Select(fromMember.HolidayFlag=false)

Project: Modify the structure of a Path by renaming, removing or reordering the Levels of the source Path. The project operation is used to modify the structure of a Path. It can change the names assigned to the Dimension Levels within the Path as well as their order, and it can also remove one or more Levels completely from the Path, along with all the related links. The parameters of the operation are the renaming expression, the Level selection expression and the Level ordering expression. The renaming expression defines mappings between old and new names in the form new_level_name=old_level_name. The Level selection expression is a Boolean condition that is evaluated for each Level; it has access to the variables level (representing the Level being examined) and path (representing the entire source Path), and the resulting Path contains all Levels for which it evaluates to true. The Level ordering expression defines the order of the Levels in the resulting Path and has the form of a list of Level names; if omitted, the order is that of the source Path.

Syntax: Path.Project(renaming_expression, level_selection_expression, level_ordering_expression)

Example: Assume that we have the Path P1: Store-Area-City-Region-Country-ALL that is used to group stores based on their geographic location. Starting from this Path we can create a new Path CustArea-CustRegion-CustCountry-ALL that will be used to group the customers. A project operation can be used to remove the Store and City Levels and to rename the Level Area to CustArea, the Level Region to CustRegion and the Level Country to CustCountry. The following operation creates the required Path:

P1.Project({CustArea=Area, CustRegion=Region, CustCountry=Country}, path.LevelName(level) NOT IN ("Store", "City"))

Derive: Create a new Path by modifying the Levels of the source Path and inheriting its links. The derive operation is used to modify the Levels that participate in the Path. The first parameter of the operation defines a Level transformation expression. This expression is evaluated for each Level of the source Path, and each evaluation results in a new Level that is included in the resulting Path. So, each Level X of the source Path is represented in the result Path by the result of the Level transformation expression for X. The resulting Path uses the same names and ordering of Levels as the source Path.

The links of the resulting Path are created from the links of the source Path using a set of matching conditions, one for each Level in the Path; this set is the second parameter of the derive operation. For each Level in the Path, the corresponding matching condition defines a mapping from the Members of the original Level to the Members of the transformed Level. The matching condition can also be seen as a join condition for the join of the original Level (contained in the source Path) with the transformed Level (contained in the resulting Path). If a Dimension Member mx (of the original Level) joins with a Member my (of the transformed Level), then the links referencing mx are inherited in the resulting Path, except that they reference my instead of mx. A similar process also modifies the other end of each link, so that the links in the resulting Path reference only Members of the transformed Levels.
According to this process, a link in the source Path can generate zero or more links for the resulting Path, and more than one generated link can connect the same Dimension Level Members. However, a Path cannot contain duplicate links (links connecting the same Members). The operation therefore groups the generated links and, for each group of links connecting the same Dimension Level Members, includes only one link in the resulting Path. This link connects the corresponding Members and has as parameters the union of all parameters of the links in the group. If two or more links in the group give different values to the same parameter and the user has not defined an aggregation function for this parameter, then the parameter is assigned the NULL value. Alternatively, the user may define aggregation functions that are used to compute the values of parameters.

Due to the complexity of the derive operation, the previous informal description leaves room for misinterpretation. In the following we formalize the semantics of the operation using an algorithmic approach. The derive operation has three parameters: the Level transformation expression E(level), the set of matching conditions M and the set of parameter computation expressions PE.

The matching conditions in M correspond to Levels of the source Path, and we write M.X to denote the matching condition that corresponds to the Level X of the source Path. Each such condition has access to the attributes of the original Level (of the source Path) through the variable sourceMember and to the attributes of the transformed Level through the variable newMember. If a matching condition for a Level is not specified in M, then a default condition is assumed: it requires equality on the common key attributes of the source and transformed Levels, and if there is no common key attribute it always returns true.

The set PE contains tuples of the form <UpperLevel, LowerLevel, ParameterName, Function>. UpperLevel and LowerLevel are Level names of the Path, ParameterName is the name of a parameter and Function is an aggregation function that is used to compute the value of the parameter ParameterName.

For each Level X of the source Path, the Level transformation expression E(level) is evaluated, resulting in a Level NewX=E(X). The expression has access to X through the variable level and can reference the entire source Path through the variable path. The resulting Path contains NewX and assigns to it the name NameX, where NameX is the name used in the source Path to identify X. The resulting Path maintains the Level ordering defined in the source Path: NameX>NameY holds in the resulting Path if and only if it holds in the source Path.

The links of the resulting Path are computed using the following algorithm:

Algorithm 3-1: Create links of derived Dimension Path

Let T be an initially empty bag of links. For each link <X.mx, Y.my, P> in the source Path:
- Let JX be the result of the join of the Level X.mx with the Level E(X) using the join condition M.X. The Level X.mx is the Level with only one Member: the Member of X identified by the key value mx. If jmx is a Member of JX, then we use the notation jmx.new_key to refer to the value that jmx has for the key attributes of E(X).
- Let JY be the result of the join of the Level Y.my with the Level E(Y) using the join condition M.Y.
- For each Member jmx of JX and for each Member jmy of JY, add to T the link <X.(jmx.new_key), Y.(jmy.new_key), P>.

Group the links of T according to the Members they connect (ignoring the parameters). Each group contains one or more links. For each such group LG, where all links in LG have the starting Member Z.mz and the end Member W.mw, the resulting Path contains the link <Z.mz, W.mw, GP>. The parameters GP are computed in the following way: for each parameter named Px that exists in at least one link of LG, GP contains the parameter named Px with the value Vx, computed according to the following rules. Let LGPx be the subset of LG that contains only the links that assign a value to Px. If <Z, W, Px, F> exists in the parameter computation expression set PE, where F is an aggregation function, then Vx=F(LGPx). If <Z, W, Px, F> does not exist in PE and all links in LGPx assign the same value Cx to Px, then Vx=Cx. Otherwise, Vx=NULL.

Syntax: Path.Derive(level_transformation_expression, set_of_matching_conditions, set_of_parameter_computation_expressions)

Example: Assume that we have the Path P2: Product-Group-Brand-ALL in which some of the products are not linked to any group in the Group Level and some groups are not linked to any brand in the Brand Level. This means that the unique all Member of the ALL Level is not linked to all the groups and all the products. A user may need to remove from the Path all the products and groups that are not linked to the all Member, in order to compute aggregated measures. To remove the unlinked Members, the Levels of the Path need to be modified by a select operation, and the derive operation can be used to apply this select operation on all Levels. We assume that there is a function IsConnected(Level1, member1, Level2, member2) which can be applied on a Path and which returns true when there is a sequence of links in the Path connecting member1 (of Level1) to member2 (of Level2), or when the two Members and Levels are the same. Using this function, the following operation performs the required removal of unlinked Members:

P2.Derive(level.Select(path.IsConnected(level, member, ALL, all)))

ModifyParameters: Modify the parameters of the links. The ModifyParameters operation creates a new Path that is identical to the source Path except for modifications to the parameters of the links. The operation can remove, rename, add or alter the values of the parameters that the source links have; each link of the source Path exists in the resulting Path, modified according to the parameters of the operation. The parameters of the operation define the remove set, the rename set, the add set and the alter set. The remove set defines the names of the parameters that are not maintained in the links of the resulting Path. The add set defines the names of the added parameters, along with expressions that assign values to them; these expressions can reference all the attributes of the source link (the key values of the connected Members and the values of the parameters that the link has in the source Path). If a source link already contains a parameter defined in the add set, then this parameter is overridden by the added parameter. The rename set defines pairs of parameter names of the form A=B: the parameters named B in the source links are named A in the links of the resulting Path. The alter set defines parameter names along with the functions that assign values to these parameters. The alter set differs from the add set in that only the source links that contain the parameters defined in the alter set are modified accordingly; links that do not contain these parameters are not affected.

Syntax: Path.ModifyParameters(remove_set, rename_set, add_set, alter_set)

Union: Create a new Path defined on the Levels of both source Paths. The resulting Path contains all the links defined in both source Paths. The Union operation is used


to unite two Paths, taking into consideration the names assigned to the various Levels. When the two source Paths assign the same name to one of their Levels, these two Levels are considered equivalent, and the resulting Path contains only one Level that corresponds to both. This new Level is constructed as the (Level) union of the two source Levels that share the same name. In this case the key of the resulting Level may differ from (be a superset of) the keys of the two source Levels; according to the definition of the Level union operation, this happens when the two contributing Levels do not have the same key. When such a key mismatch exists, the corresponding source Path links (from both sources) must be modified appropriately before they can be inherited in the resulting Path. This process is similar to the one performed by the Level union operation: the key values in the links are extended by assigning a default value to the missing key attributes. The default values can be specified by the parameters of the union operation.

As already mentioned, the resulting Path contains the (possibly modified) links from both source Paths. If two links (one from each source Path) connect the same pair of Dimension Members in the resulting Path, then only one such link is included in the resulting Path; recall that a Path cannot contain duplicate links. The parameters of this link are the union of the parameters of the two source links. If the two links assign different values to the same parameter name, then a default or user-defined conflict resolution strategy is used to determine the value assigned to this parameter name in the resulting link.

The ordering of Levels in the resulting Path is biased in favor of the first source Path. It is constructed according to the following rules: the Levels that exist in the first source Path are sorted according to the order defined in this Path; the Levels of the second source Path that do not exist in the first source Path are placed using a common Level that is either below or above them; and the ordering defined in the second source Path is maintained as long as it does not contradict the previous rules. The following algorithm gives a more formal definition of how the Levels of the resulting Path are ordered.

Algorithm 3-2: Sort Levels in Path

Let Path1 be the first source Path of the union operation and Path2 the second. For each pair of Levels LevelX and LevelY of Path1, the ordering LevelX<LevelY holds in the resulting Path if and only if it holds in Path1. The Levels of Path2 are ordered according to one of the following rules:

A) If Path2 has no common Level with Path1: let LevelBottom1 be the bottom Level of Path1 (LevelBottom1<LevelX for any other Level LevelX of Path1) and let LevelTop2 be the top Level of Path2. Then in the resulting Path we have LevelTop2<LevelBottom1. Also, for any two Levels LevelZ and LevelW of Path2, LevelZ<LevelW holds in the resulting Path if and only if it holds in Path2.

B) If Path1 and Path2 have common Levels: let CommonLevels be the set of these common Levels. For each LevelZ in Path2 that is not in CommonLevels we have:

B1) If there is a Level LevelW in CommonLevels for which LevelW<LevelZ: let LevelK be the highest such Level (there is no LevelW in CommonLevels for which LevelK<LevelW in Path1 and LevelW<LevelZ in Path2). Then in the resulting Path we have LevelK<LevelZ. Also, for each Level LevelJ of Path1 for which Path1 has LevelK<LevelJ, the resulting Path has LevelZ<LevelJ.

B2) If B1 does not hold, then there is a Level LevelW in CommonLevels for which LevelZ<LevelW. Let LevelK be the lowest such Level. Then in the resulting Path we have LevelZ<LevelK. Also, for each Level LevelJ of Path1 for which Path1 has LevelJ<LevelK, the resulting Path has LevelJ<LevelZ.

In addition to B1 and B2, for each Level LevelW in Path2 that is not in CommonLevels and for which Path2 defines LevelZ<LevelW, we have the following rule: if the resulting Path does not already imply LevelZ<LevelW or LevelW<LevelZ due to the B1 or B2 rules, then the resulting Path has LevelZ<LevelW.

Proposition 3.1: Algorithm 3-2 defines a total ordering of the Levels of the resulting Path.
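The ordering rules of Algorithm 3-2 can be sketched as follows. This is a simplification under stated assumptions: Paths are reduced to bottom-to-top lists of Level names, each Level appears at most once per Path, and the "in addition" rule is approximated by processing the Levels of Path2 in their Path2 order, so earlier placements constrain later ones.

```python
# Sketch of Algorithm 3-2: order the Levels of a Path union, biased toward
# the first source Path. Paths are lists of Level names, bottom to top.
def union_order(path1, path2):
    common = set(path1) & set(path2)
    if not common:
        # Rule A: all of path2 sits below the bottom Level of path1.
        return path2 + path1
    result = list(path1)
    for i, z in enumerate(path2):
        if z in common:
            continue
        # Levels of path2 below z that are already placed in the result.
        below = [w for w in path2[:i] if w in result]
        if below:
            # Rule B1: place z just above the highest placed Level below it.
            k = max(below, key=result.index)
            result.insert(result.index(k) + 1, z)
        else:
            # Rule B2: place z just below the lowest placed Level above it.
            above = [w for w in path2[i + 1:] if w in result]
            k = min(above, key=result.index)
            result.insert(result.index(k), z)
    return result

# P3: Store-Area-City united with P4: Area-City-Region-Country-ALL.
print(union_order(["Store", "Area", "City"],
                  ["Area", "City", "Region", "Country", "ALL"]))
# -> ['Store', 'Area', 'City', 'Region', 'Country', 'ALL']
```

The example reproduces the P34=Union(P3, P4) ordering from the text: the common Levels Area and City anchor the placement of Region, Country and ALL above City.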


Syntax: Union(Path1, Path2 [; set_of_default_key_values, set_of_conflict_resolution_policies])

Example: The Union operation is very useful for extending a Path with new Levels. Assume that we have the Paths P3: Store-Area-City and P4: Area-City-Region-Country-ALL. The union operation can create a complete Path that aggregates the stores up to the Country Level: P34=Union(P3, P4).

Difference: Create a new Path similar to the first source Path, but without the links of the second source Path. The result Path has the same Levels as the first source Path; it also assigns to them the same names and maintains the order defined in the first source Path. The links of the result Path are the links of the first source Path that do not exist in the second source Path. We say that a link <X,mx,Y,my,P> exists in the second source Path if the second source Path contains a link <X,mx,Y,my,O>, where the parameters O may differ from the parameters P. So, the process that identifies the links to be excluded from the resulting Path ignores the parameters of the links.

Syntax: Difference(Path1, Path2)

Example: The Difference operation is used to restrict a Path according to a similar Path. Usually the two source Paths are the result of operations on the same initial Path. For example, starting from an initial Path P1: Product-Group-Category-Class-ALL, the user created the Path P2 that has the same structure but from which a number of products, groups and categories have been removed. After using P2, the user may want to perform aggregations using only the removed products, groups and categories. The operation Difference(P1,P2) produces the required Path: all the products of P1 exist in the result Path, but links exist only for the products, groups and categories that are not in P2.
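The rule that link comparison in Difference ignores parameters can be illustrated directly. In this sketch links are reduced to (from_member, to_member, params) tuples, and the function name is ours, not MAC syntax.

```python
# Sketch of the Difference operation on Paths: keep the links of the first
# Path that do not exist in the second, comparing links by the Members they
# connect and ignoring their parameters, as the text specifies.
def path_difference(links1, links2):
    present = {(f, t) for f, t, _ in links2}   # parameters are ignored
    return [(f, t, p) for f, t, p in links1 if (f, t) not in present]

p1 = [("p1", "g1", {}), ("p2", "g1", {}), ("p3", "g2", {})]
p2 = [("p1", "g1", {"note": "kept"})]
print(path_difference(p1, p2))  # -> [('p2', 'g1', {}), ('p3', 'g2', {})]
```

Note that the p1-g1 link is excluded even though its parameters differ between the two Paths, which is exactly the parameter-blind matching described above.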

3.4 Dimensions

A Dimension is a concept used to define meaningful groups of Dimension Paths. This grouping is essential in order to model the semantic relationships that exist among the various Paths, and it allows powerful OLAP modeling, as we will show in the examples to follow.

Dimensions are complex concepts. They are used in our model to represent the various properties of measures and to assist the process of multidimensional analysis. With their complex structure, Dimensions classify properties into Dimension Levels, define the hierarchical relationships among properties of different Levels and define meaningful analysis paths: the Dimension Paths. Furthermore, Dimensions allow the users to group analysis Paths based on the semantics of the Paths, even if these Paths appear unrelated from a structural point of view.

Although Dimensions are complex structures, this complexity is mostly hidden within the Dimension Paths on which the Dimensions are defined. Dimensions themselves appear simple, as they are defined as sets of Dimension Paths; to each of these Paths the Dimension assigns a unique name that is used to identify and manipulate the Path. Apart from the set of Paths, a Dimension also contains a set of dependencies among Levels. The dependencies and their meaning are described in the following section, which defines the Dimension Domain.

The reason for grouping Dimension Paths into Dimensions is the semantic relationship that usually exists among the various Paths. For example, consider the Paths P4 and P5 of Figure 2-1. The Path P4 contains the Levels Cashier, Store, Store Area, Store City and Store Region, and P5 the Levels Store, Warehouse, Warehouse Area, Warehouse City and Warehouse Region. Both Paths characterize the sales measures from the store point of view. In both Paths the Level Store represents the stores where the sales are recorded, so it is meaningful to group these two Paths and view them as one Location Dimension of the sales measure.
Generally speaking, the number of Paths involved in defining a Dimension depends on the content of the Dimension and on the complexity of the scenario being modeled. Considering each Path as one independent dimension is not conceptually correct, and in practice Dimensions with multiple hierarchies are used in many OLAP systems ([Kimb96], [Inmo96]). The MAC model goes beyond the typical support of multiple hierarchies in that it does not require the Paths that form a Dimension to share a common detail Level. In fact, there is no restriction on the Paths that are grouped into a Dimension: these Paths may appear totally unrelated, and it is up to the user to use the semantics of the Paths and group them into a Dimension.

The reason for allowing Paths that do not share even one common Level to participate in a Dimension is requirement-driven: it is easy to see that at some point the user may need such a Dimension. For example, consider the Paths P1: Store-Area-City-Region and P2: Store-Accountant-SkillLevel. These two Paths describe the measures from the store point of view. They group the stores either by geographic location or by the responsible accountant of the store and his or her skill level. It is natural to create only one Dimension with both these Paths. However, during the analysis the user may stop being interested in the information at the detailed Store Level and request the removal of the Store Level from the Dimension. At this point the two Paths appear unrelated, although their semantics, which the user knows, indicate that they analyze the same property of the measures: the location where the sale took place. So, these two Paths must belong to the same Dimension.

In the above example both Paths, P1 and P2, have a Level named Store. However, “Store” is only the name assigned to a Level within each Path. Although it is natural to consider that Levels identified by the same name are actually the same Level, there is no such formal requirement. Recall that a Dimension Level is a set of Dimension Members and a key. What happens if the Store Level of the Path P1 does not contain all the Members of the Store Level of the Path P2? Or what if these two Levels are completely different and have different keys? Would it then make sense to put P1 and P2 into one Dimension? We already stated that the names assigned to the Dimension Levels within a Dimension Path represent the semantics of the Level. If the mapping from the semantics of a Level to the name assigned to the Level is maintained when defining P1 and P2, then the two Levels named Store in P1 and P2 must be identical. We call such Paths name consistent.
The following definition states formally when a set of Paths is name consistent.

Definition 3-7: Name consistency for Paths. A set S of Dimension Paths is name consistent when, for any two distinct Paths in S that assign the same name to one of their Levels, and for each such name, the Levels identified by this common name are identical.
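Definition 3-7 amounts to a simple pairwise check, which can be sketched as follows. The representation is our own simplification: each Path exposes a mapping from Level names to Level objects, modelled here as (members, key) tuples, with two Levels identical when these tuples are equal.

```python
# A small check for Definition 3-7 (name consistency of a set of Paths).
def name_consistent(paths):
    """paths: list of dicts mapping a Level name to a (members, key) tuple."""
    seen = {}
    for levels in paths:
        for name, level in levels.items():
            if name in seen and seen[name] != level:
                return False           # same name, different Levels
            seen[name] = level
    return True

store = (frozenset({"s1", "s2"}), ("StoreID",))
p1 = {"Store": store, "Area": (frozenset({"a1"}), ("AreaID",))}
p2 = {"Store": store, "Accountant": (frozenset({"ac1"}), ("AccID",))}
p3 = {"Store": (frozenset({"s1"}), ("StoreID",))}  # fewer Members
print(name_consistent([p1, p2]))  # -> True
print(name_consistent([p1, p3]))  # -> False
```

The third Path fails the check precisely for the reason discussed above: its Store Level has the same name and key but does not contain all the Members of the Store Level of P1.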


Two Levels are identical when their defining tuples are equal. This means that they have exactly the same Dimension Members and the same attribute names as their key. Just like in the case of the Paths P1 and P2 in the previous example, in a Dimension we expect the participating Paths to be consistent in their assignment of names to Levels. In fact, we consider that the Dimension contains only once the Levels that are common among the participating Paths. For example, it is natural to consider that the Level Store exists only once in the Dimension Location, although two Paths contain it. Due to the way that Dimension Paths are defined, the MAC model does not follow this approach: the Dimension Levels are contained within the Dimension Paths, so Dimensions that contain Paths assigning the same name to one of their Levels contain such “synonym” Levels once in every such Path. Having Dimensions where such synonym Levels are not identical is a problem rather than a useful feature. Hence, the MAC model requires the set of Paths that participate in a Dimension to be name consistent. In order to clarify the notion of a Dimension we give the following formal definition:

Definition 3-8: Dimension. A Dimension is a tuple of the form <SetOfPaths, SetOfDependencies>. SetOfPaths is a set with members of the form <PathName:Path>, and there are no two tuples in this set with the same PathName or the same Path; PathName is a name assigned to the Dimension Path Path. The Paths appearing in SetOfPaths constitute a name consistent set of Paths. SetOfDependencies is a set with members of the form <FromLevelName, ToLevelName>, where FromLevelName and ToLevelName are names assigned to some Dimension Levels by some of the Dimension Paths in SetOfPaths. Both SetOfPaths and SetOfDependencies can be empty, and if SetOfPaths is empty then SetOfDependencies must also be empty.

In a Dimension definition, SetOfPaths assigns a distinct name to each of the participating Paths. The names are used to identify the Paths within the Dimension, so a name cannot appear twice in the set SetOfPaths.


A Dimension can be constructed from any set of Dimension Paths. When a Dimension is constructed from a number of Paths, these Paths must be name consistent with each other. If they are not, then the following Algorithm 3-3 can modify them and make them name consistent. The main idea of the algorithm is to create the union of synonym Levels and to replace the original Levels, in the various Paths where they appear, with the result of the union.

Algorithm 3-3: Create Dimension from Paths

Let PathSet be the examined set of Dimension Paths. Create the set LevelNames containing all the names assigned to Levels by the Paths in PathSet. For each name X in LevelNames: define the set PathsOfX to contain the subset of PathSet in which every Path assigns X to a Level. If PathsOfX contains more than one Path, then modify these Paths in the following way. Create the set of Levels XSet containing all the Levels identified by the name X in the Paths of PathsOfX. Create the Level union of all Levels in XSet; let UX=Union(XSet) be the resulting Level. The default key values and conflict resolution policies are used for the union operation. Replace, in all Paths in PathsOfX, the Level identified by X with the Level UX. Note that it is the Level that is changed in the Path, not the name of the Level. The replacement is done through a derive operation on each Path. The matching conditions are derived from the key attributes of the original X Level of the Path and the key attributes of the UX Level. If the original X Level has the attributes A as a key and the UX Level has {A,B} as its key, then the matching condition is:

newMember.A = sourceMember.A AND newMember.B = NULL

The above algorithm is used in the Derive and Union operations on Dimensions that are described below. These operations need to create a new Dimension from a number of arbitrary Paths.
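The core of Algorithm 3-3, replacing synonym Levels with their union, can be sketched as follows. This is a heavy simplification under stated assumptions: Levels are reduced to member sets sharing a single common key, so the key-extension step and the derive-based rewriting of links are omitted.

```python
# Sketch of Algorithm 3-3: make a set of Paths name consistent by replacing
# every Level with the union of all Levels carrying the same name.
def make_name_consistent(paths):
    """paths: list of dicts mapping Level names to member sets.
    Returns new Path dicts in which synonym Levels are identical."""
    union = {}
    for levels in paths:
        for name, members in levels.items():
            union.setdefault(name, set()).update(members)
    # Each Path keeps its own Level names, now bound to the united Levels.
    return [{name: set(union[name]) for name in levels} for levels in paths]

p1 = {"Store": {"s1", "s2"}, "Area": {"a1"}}
p2 = {"Store": {"s2", "s3"}, "Accountant": {"ac1"}}
q1, q2 = make_name_consistent([p1, p2])
print(q1["Store"] == q2["Store"] == {"s1", "s2", "s3"})  # -> True
```

After the transformation the two Store Levels are identical, so the resulting Paths satisfy Definition 3-7 and can form a Dimension.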

3.4.1 Graphical notation for Dimensions

Dimensions are represented by rectangles to which the participating Dimension Paths are linked. Figure 3-5 illustrates the general case. Each Path can be assigned a role name by assigning names to the links (just like in the case of Levels and Dimension Paths). The Dimensions may also contain a number of dependencies among the participating Levels. These dependencies are represented by directed links among the circles that are connected to the involved Levels. Figure 3-6 illustrates an example where a dependency exists from L-2 of Path-1 to L-4 of Path-2. If required, the dependency links can be named after the Dimension to which they belong. Note that the arrowhead of a link that represents a dependency differs from the arrowhead used for the links of Dimension Paths.


[Figure content: K Dimension Paths (Path 1 to Path K), each with Levels L 1.1 up to L K.NK, linked through their Path names to a rectangle labeled Dimension Name.]

Figure 3-5: Dimensions

[Figure content: the Dimension D-1 containing Path-1 (Levels L-1, L-2, L-3) and Path-2 (Levels L-4, L-5, L-6), with a dependency link from L-2 of Path-1 to L-4 of Path-2.]

Figure 3-6: Dependency in a Dimension


3.4.2 Operations on Dimensions

Project: Modify the structure of the Dimension without adding new information. The operation can rename Dimension Paths or remove them completely from the Dimension, and it can also remove Level dependencies. The result of the operation is a new Dimension that is identical to the source Dimension except for the modifications identified by the parameters of the operation. These parameters are a renaming expression, a Path selection expression and a Level dependencies selection expression. The renaming expression defines mappings between old and new Path names in the form new_path_name=old_path_name. The Path selection expression is a Boolean condition that is evaluated for each Path; it has access to the Path for which it is evaluated through the variable path and to the entire source Dimension through the variable dimension, and the resulting Dimension contains all Paths for which it evaluates to true. The Level dependencies selection expression is also a Boolean expression, evaluated for each Level dependency defined in the Dimension. It has access to the variables first_level and second_level, which represent the Levels (not their names) identified by the dependency being examined, as well as to the variables first_level_name, second_level_name and dimension, with the obvious meanings. The resulting Dimension contains all the Level dependencies for which this expression evaluates to true.

Syntax: Dimension.Project(path_selection_expression [, path_rename_expression, level_dependencies_selection_expression])

Example: The typical usage of the project operation is to simplify a Dimension by removing one or more Paths. For example, if a Dimension contains the Paths P1: Day-Month-Year-ALL and P2: Day-Week-Year-ALL, then one may remove the second Path and obtain a simple Dimension with only one hierarchy.

Derive: Create a new Dimension by modifying the Paths of the source Dimension and inheriting its Level dependencies. The Derive operation is used to modify the Paths that participate in a Dimension. The first parameter of the operation defines a Path transformation expression. This expression is evaluated for each Path of the source Dimension, and each evaluation results in a new Path that is included in the resulting Dimension. So, each Path X of the source Dimension is represented in the result Dimension by the result of the Path transformation expression for X. The set of Paths resulting from the transformation expressions is modified according to Algorithm 3-3, so that the set is name consistent. The Path transformation expression has access to the Path for which it is applied through the variable path and to the entire Dimension through the variable dimension; it must always result in a Dimension Path.

The Level dependencies are inherited from the source Dimension. Since the Path transformation expression may change the Level names, a parameter of the operation provides a set of Level renaming expressions, similar to the ones described for the Project operation on Paths. A Level renaming expression defines a mapping between an old and a new name in the form new_level_name=old_level_name. Each Level dependency of the source Dimension appears in the resulting Dimension after being modified according to the set of Level renaming expressions. For example, if the source Dimension contains the dependency <LevelName1, LevelName2> and one of the Level renaming expressions is NewLevelName=LevelName1, then the resulting Dimension will contain the dependency <NewLevelName, LevelName2>. Additional dependencies can be defined in the set of additional dependencies, which can appear as a parameter of the operation. Duplicate dependencies, if created, are removed from the resulting Dimension.

Syntax: Dimension.Derive(path_transformation_expression [, set_of_level_renaming_expressions, set_of_additional_dependencies])

Example: Consider the Dimension Location described in Figure 2-1. The Dimension contains two Paths: P4:Cashier-Store-StoreArea-StoreCity-StoreRegion and P5:Store-Warehouse-WarehouseArea-WarehouseCity-WarehouseRegion. The initial Dimension contains information about all Cashiers of all Stores and so on. Assume that we are interested only in stores with more than one floor. The information about the number of floors that a store has is stored in the attribute NumberOfFloors of the Members of the Store Level. The following Derive operation can be used on the initial Dimension to create the restricted Dimension. Note that the Members of all Levels in all Paths of the Dimension are affected by the selection condition. For example, if a warehouse does not supply stores with more than one floor then this warehouse does not appear in the resulting Dimension.

Location.Derive(path.Derive(level.Select(path.IsConnected(level, member, Store, NumberOfFloors>1))))

Union: Create the union of two Dimensions. The resulting Dimension contains all Paths from both source Dimensions. If two Paths are assigned the same name then these Paths are united into a single resulting Path using the Path Union operation. Algorithm 3-3 is used in order to ensure that the resulting set of Paths is name consistent and can form a Dimension. The set of Level dependencies of the resulting Dimension is constructed from the union of the sets of Level dependencies defined by the two source Dimensions and the set of additional Level dependencies defined as an optional parameter of the operation.

Syntax: Union(Dimension1, Dimension2 [, set_of_additional_dependencies ])

Example: The Union operation can be used to add Paths to a Dimension. For example, if we have a Dimension D1 with only one Path P1:Day-Month-Year and we want to add to this Dimension the Path P2:Day-Week-Year, we can define a new Dimension D2 that contains only P2 and create the union of D1 and D2. A different usage of the operation is to combine Dimensions that contain the same Path, where these Paths have been restricted with different conditions. A Dimension D1 containing the Path P1:Day-Month-Year may have been restricted to contain only the days, months and years for the range of years 1995-2000. A second, similar Dimension D3 that contains the Path P1:Day-Month-Year may contain only the days, months and years for the range of years 1980-1990.
By creating the union of D1 and D3 we get a Dimension that has only one Path (P1), but this Path contains the days, months and years of both ranges: 1980-1990 and 1995-2000. This Union operation creates the union of the two Paths named P1, which results in creating the union of the Levels Day, Month and Year.
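The Union semantics can be sketched in Python. The representation below (a Dimension as a dict mapping Path names to Paths, a Path as a dict mapping Level names to Member sets) is our own illustration, and the toy member sets stand in for the full 1995-2000 and 1980-1990 ranges:

```python
# Sketch of the Dimension Union semantics on an illustrative representation.

def path_union(p1, p2):
    """Unite two same-named Paths Level by Level (assumes identical Level lists)."""
    return {level: p1[level] | p2[level] for level in p1}

def dimension_union(d1, d2):
    """All Paths of both Dimensions; same-named Paths are united."""
    result = dict(d1)
    for name, path in d2.items():
        result[name] = path_union(result[name], path) if name in result else path
    return result

# D1 restricted to 1995-2000, D3 restricted to 1980-1990 (toy member sets)
D1 = {"P1": {"Day": {"1995-01-01"}, "Month": {"1995-01"}, "Year": {"1995"}}}
D3 = {"P1": {"Day": {"1980-01-01"}, "Month": {"1980-01"}, "Year": {"1980"}}}

D = dimension_union(D1, D3)
# The single resulting Path P1 contains the Members of both ranges:
print(sorted(D["P1"]["Year"]))   # ['1980', '1995']
```

The name-consistency repair of Algorithm 3-3 and the merging of Level dependencies are omitted from the sketch.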


3.5 Dimension Domains

The Dimensions in our model are used to define the structure of the Multidimensional Aggregation Cubes (called MACs or Cubes for simplicity). Still, a Cube definition does not involve Dimensions as sets of Paths but rather as sets of what we call Dimension Values(1). Each Dimension implicitly defines a set of Dimension Values called a Dimension Domain. The Dimension Values are used in the Cubes as the coordinates of the Cube's cells.

(1) Note the important difference between the terms dimension value and dimension member.

Before we define what exactly a Dimension Value is, we need to define the graph of a Dimension. The graph is used to represent the relationships among the Levels that participate in a Dimension. The nodes of the graph are the Dimension Level names used by all Paths of the Dimension. As we have already discussed in the previous section, the name consistency property required for the set of Paths that form a Dimension guarantees that, for a given Dimension, a Level name appearing in one of its Paths uniquely identifies a Dimension Level instance (or multiple identical Dimension Level instances, which are treated in the Dimension as one). The edges of the graph are the Dimension Level dependencies that hold within the Dimension. Each Dimension Path implicitly defines a number of dependencies: each Level of a Path is dependent on the next Level of the Path. For example, in the Path P1:Day-Month-Year the Level Day is dependent on the Level Month and the Level Month is dependent on the Level Year. In the graph of a Dimension that contains P1 there would be an arrow from the Level Day to the Level Month and from the Level Month to the Level Year.

Apart from the dependencies implicitly defined by the Paths, the Dimension also contains a set of explicit Level dependencies that are defined by the user. As already presented in the definition of Dimensions, each Level dependency contains a from Level name and a to Level name. These two Level names cannot belong to the same Path; they must belong to different Paths of the Dimension. The semantics of a Level dependency is that the Members of the from Level determine (in some sense) the Members of the to Level, as well as the Members of all other Levels that are higher than the to Level in any of the Dimension's Paths.

For example, consider the Dimension Location described in Figure 2-1 and the Paths P4:Cashier-Store-StoreArea-StoreCity-StoreRegion and P6:Cashier-DBServer-PerformanceLevel. Assume that the Level StoreArea is dependent on the Level DBServer. The DBServer is the from Level and the StoreArea is the to Level of the dependency. If a transaction takes place at a cashier that is connected to the database server DB1 (a Member of DBServer), we know from the explicit Level dependency that the StoreArea of the transaction is restricted to a particular set of areas. However, we do not know which areas these are. In order to identify the areas from the database server, the Dimension should have contained a Path with DBServer as the bottom Level (detail Level) and StoreArea as the top Level. The links of this Path would then identify which areas are connected to which database servers.

The previous observation demonstrates that an explicit Level dependency is like a Path with only two Levels, where the from Level is the bottom Level, except that this Path contains no links. In fact, the Dimensions could have been defined without the set of Level dependencies, with dummy Paths representing such dependencies instead. We have chosen not to follow the "dummy Paths" approach because we prefer a clear distinction between the Paths that are used for the multidimensional analysis and the Level dependencies that are used for the restriction of the Dimension Domain. Using Paths to represent both would confuse the users, although it would simplify the definition of Dimensions.

The knowledge about explicitly dependent Levels is used to restrict the many ways in which the various properties of measures (which are defined through the Levels of the Dimension) can be used for analysis. In the previous example, knowing that the DBServer property and the StoreArea property of a transaction are related (one is dependent on the other) means that it is redundant to aggregate the transactions with regard to both these properties. Using the Level dependencies implied by the Path, and by following a chain of dependencies, we also conclude that it is redundant to aggregate the transactions with regard to DBServer and StoreCity or StoreRegion. So, the Level dependencies restrict the ways in which measures can be grouped and analyzed. These restrictions are reflected in the Dimension Domains as described next.


Each explicit Level dependency is used to form a directed edge in the Dimension graph, just like the dependencies implied by the Paths. For example, the explicit dependency <DBServer, StoreArea> adds an edge from the DBServer node to the StoreArea node. Figure 3-7 shows the graph of the Dimension Location, which appears in Figure 2-1, under the assumption that the Dimension is defined with the explicit Level dependency <DBServer, StoreArea>.

A Dimension Value can either be a single Dimension Member or a set of Dimension Members. Recall that a Dimension Member is a member of a Dimension Level, and Levels are used to define Paths, which are grouped into Dimensions.

Definition 3-9: Dimension Value: We define a Dimension Value of a Dimension D to be a set of one or more Dimension Members belonging to distinct and non-related Levels of Paths of D.
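The construction of the Dimension graph described above can be sketched as follows. The dict-of-lists representation of Paths is our own illustration:

```python
# Sketch: deriving the edges of a Dimension graph. Each Path contributes
# an edge from every Level to the next Level in the Path; explicit Level
# dependencies contribute additional edges.

def dimension_graph(paths, explicit_deps=()):
    """paths: dict PathName -> list of Level names (bottom to top);
    explicit_deps: iterable of (from_level, to_level) pairs."""
    edges = set()
    for levels in paths.values():
        edges.update(zip(levels, levels[1:]))   # implicit Path dependencies
    edges.update(explicit_deps)                 # user-defined dependencies
    return edges

location_paths = {
    "P4": ["Cashier", "Store", "StoreArea", "StoreCity", "StoreRegion"],
    "P6": ["Cashier", "DBServer", "PerformanceLevel"],
}
edges = dimension_graph(location_paths, [("DBServer", "StoreArea")])
print(("DBServer", "StoreArea") in edges)   # True
print(("Cashier", "Store") in edges)        # True
```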

Figure 3-7: The graph of the Dimension Location (nodes: Cashier, DBServer, Store, Warehouse, StoreArea, WarehouseArea, StoreCity, WarehouseCity, StoreRegion, WarehouseRegion)

The term "distinct" means that a Dimension Value can contain at most one Dimension Member from a Level. Furthermore, two or more Levels participating in a Dimension are called "non-related" if there is no path in the directed graph of the Dimension linking any two of these Levels. A single Dimension Level is always "non-related".

Definition 3-10: Dimension Domain: The set of all possible Dimension Values defined for a Dimension is called a Dimension Domain.

It is now obvious that each edge in the Dimension graph restricts the number of non-related Levels and so it restricts the Dimension Domain. Note that the Paths of the Dimension, along with the explicit Level dependencies, determine the structure of the Dimension Domain by defining all possible combinations of non-related Levels. Then the instances of the Dimension Levels, the Dimension Members, determine the Dimension Domain.

Take as an example the Dimension Client, the graph of which is shown in Figure 3-8. Each Member of the Levels Customer, ResidenceArea, ResidenceCity, ResidenceRegion and Profession is a Dimension Value. If Athens_North and Athens_East are two Members of the ResidenceArea Level and doctor and teacher are two Members of the Profession Level, then each of these Members is a Dimension Value. Also, each combination of Members, one from the Profession Level and one from the ResidenceArea Level, is a valid Dimension Value. The Dimension Value {Athens_North, doctor} represents a grouping of all customers living in the North of Athens that are doctors. The Level Profession is also non-related to the Levels ResidenceCity and ResidenceRegion, and the various combinations of Dimension Members are valid Dimension Values. The combination of Members is possible only for non-related Levels. This means that if Patra is a Member of ResidenceCity then the set {Athens_North, Patra} is definitely not a valid Dimension Value. Either Athens_North is an area of the city Patra, and it is redundant to mention Patra, or Athens_North is not within Patra, and the combination of these Members is meaningless.
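The validity test for Dimension Values can be sketched as a reachability check on the Dimension graph. The edge set and member-to-level mapping below follow the Client example but are our own illustration:

```python
# Sketch: testing whether a set of Members forms a valid Dimension Value.
# Levels are "non-related" when no directed path in the Dimension graph
# links any two of them.

def reachable(edges, src, dst):
    """Depth-first search over a set of (from, to) edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(t for f, t in edges if f == node)
    return False

def is_valid_value(members, level_of, edges):
    levels = [level_of[m] for m in members]
    if len(set(levels)) != len(levels):   # "distinct": one Member per Level
        return False
    return not any(reachable(edges, a, b) or reachable(edges, b, a)
                   for i, a in enumerate(levels) for b in levels[i + 1:])

client_edges = {("Customer", "ResidenceArea"), ("ResidenceArea", "ResidenceCity"),
                ("ResidenceCity", "ResidenceRegion"), ("Customer", "Profession")}
level_of = {"Athens_North": "ResidenceArea", "Patra": "ResidenceCity",
            "doctor": "Profession"}

print(is_valid_value({"Athens_North", "doctor"}, level_of, client_edges))  # True
print(is_valid_value({"Athens_North", "Patra"}, level_of, client_edges))   # False
```

The second call fails because ResidenceCity is reachable from ResidenceArea in the graph, so the two Levels are related.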


Figure 3-8: The graph of the Dimension Client (nodes: Customer, ResidenceArea, Profession, ResidenceCity, ResidenceRegion)

The semantics of the Dimension Domain is that its members, the Dimension Values, represent all possible properties and combinations of properties that can be used for the multidimensional analysis with respect to the particular Dimension. For example, a query may ask for customers living in Athens_North, or it may ask for customers with the doctor profession that live in the Athens_North area. Still, it is not meaningful to ask for customers that live in the Athens_North area and in Patra city at the same time, since the latter does not include the former.

When a Dimension is defined with finite elements the resulting Dimension Domain is also finite. This is stated in the following proposition:

Proposition: When a Dimension is defined using a finite set of Dimension Paths, and each participating Path is defined with a finite set of Dimension Levels, and each such Level contains a finite number of Dimension Members, then the Dimension Domain of this Dimension is finite.

The proof of the above proposition is trivial.

3.6 The Multidimensional Aggregation Cubes (MACs)

The Multidimensional Aggregation Cube (MAC) is the main and most complex concept of our model. All other concepts previously described are directly or indirectly used in the definition of a MAC. The Dimension Levels model properties of measures, the Dimension Paths define relationships among Levels, and the


Dimensions group Paths. The MAC is the only concept that associates property values with actual measure values and builds on the complex hierarchical structure defined by Dimensions.

A multidimensional aggregation cube can be considered as an N-way relationship relating N Dimension Domains. This relationship has one or more attributes, which represent the measures of the MAC. Each instance of this relationship is called a cell and relates one Dimension Value from each of the involved Domains. The cell is annotated with the values of the cube attributes, the measure values. The N Dimension Values that a cell relates are called the coordinates of the cell. Obviously, the measures of a cube are functionally dependent on its coordinates.

Definition 3-11: Multidimensional Aggregation Cube: A Multidimensional Aggregation Cube (MAC or simply Cube) is a tuple of the form <SetOfDimensions, SetOfMeasures, SetOfCells> where:

- SetOfDimensions is a set with members of the form <DimensionName: Dimension> and there are no two members with the same DimensionName part.
- SetOfMeasures is a set of measure names.
- SetOfCells is a set of tuples of the form <Coordinates:MeasureValues> where there are no two tuples with the same Coordinates part.
- Coordinates is a set with exactly N elements, where N is the number of elements in the SetOfDimensions (the dimensionality of the cube). Each member of the Coordinates set has the form <DName:DVal> where each DName is a dimension name defined in the SetOfDimensions and there are no two members with the same DName part. The DVal part of each member is a Dimension Value in the Dimension Domain of the Dimension identified by the DName.
- MeasureValues is a set of tuples of the form <MeasureName:MeasureValue>. The MeasureName part of each member must be a valid measure name defined in the SetOfMeasures, while MeasureValue is the value assigned to the measure.


The elements SetOfDimensions and SetOfMeasures are considered the schema of the Cube, while SetOfCells is considered the instance of the Cube. Note that according to our definition a cell is not required to provide values for all the measures of the Cube. In fact, some cells may be completely missing from the Cube's instance, and for those cells there is no value defined for any of the measures. By definition we agree that when a cell does not define a particular value for a measure of the Cube, the value of that measure (at that particular cell) is considered to be NULL. Therefore, the NULL value has a special meaning when assigned to measures of the Cube, and we can assign the NULL value to any measure of any Cube. As a consequence, it is equivalent to state that the value of a measure (at a particular cell) is NULL and to skip the definition of the value of that measure (for that particular cell).
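The NULL convention above can be sketched with a minimal Cube instance. Encoding the coordinates as a frozenset of name-value pairs is our own choice, and None stands for NULL:

```python
# Sketch: a Cube instance as a mapping from coordinates to measure values.
# A measure missing from a cell (or a missing cell) is, by definition, NULL.

class Cube:
    def __init__(self, dimensions, measures):
        self.dimensions, self.measures = dimensions, measures
        self.cells = {}   # coordinates -> {measure_name: value}

    def set_cell(self, coordinates, **measure_values):
        self.cells[frozenset(coordinates.items())] = measure_values

    def measure(self, coordinates, name):
        cell = self.cells.get(frozenset(coordinates.items()), {})
        return cell.get(name)   # missing measure or missing cell -> None (NULL)

c = Cube({"Item", "Location"}, {"S"})
c.set_cell({"Item": "Product=P_A", "Location": "Cashier=Cashier00701"}, S=10)
print(c.measure({"Item": "Product=P_A", "Location": "Cashier=Cashier00701"}, "S"))  # 10
print(c.measure({"Item": "Product=P_B", "Location": "Cashier=Cashier00701"}, "S"))  # None
```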

3.6.1 Graphical representation of Cubes

MAC cubes are represented with a 3-dimensional cube shape that is linked to the appropriate Dimensions. The links between the Dimensions and the 3-dimensional cube shape are named according to the names assigned by the Cube to its Dimensions. The name assigned by the MAC schema to the Cube appears on the front side of the 3-dimensional cube shape. The names of the Cube’s measures appear in a rectangle attached at the bottom of the 3-dimensional cube shape. Figure 3-9 presents the notation.

Figure 3-9: Cubes (the notation: a 3-dimensional cube shape carrying the Cube name on its front side, a rectangle with the names of the measures attached at its bottom, and links to Dimension 1, Dimension 2, ..., Dimension N labeled with the corresponding Dimension names)

3.6.2 Discussion

Assume the following example: The Cube C1 is defined over the Domains of the Dimensions Location and Item (Figure 2-1) having only Sum of sales as its


measure. C1 contains the cells defined in Table 3-1, where S represents the Sum of sales measure. Note that we have given names to the Cube and its cells only for presentation reasons. The MAC model does not assign any names to the Cube's cells, and Cubes are assigned names only as part of a MAC schema.

Cell name | Coordinates: Item | Coordinates: Location   | Measure S
cell_A    | Product = P_A     | Cashier = Cashier00701  | 10
cell_B    | Product = P_B     | Cashier = Cashier00701  | 20
cell_C    | Brand = B_1       | Cashier = Cashier00701  | 30

Table 3-1: The cells of the example cube C1

The measure of cell_A represents the sum of sales done at Cashier00701 for the product P_A. Likewise, the measure of cell_B represents the sum of sales done at Cashier00701 for the product P_B. Finally, cell_C represents the sum of sales done at Cashier00701 for all products of the brand B_1. Assuming that P_A and P_B are the only products of the brand B_1, cell_C then represents the aggregation of cell_A and cell_B. This means that the measure of cell_C must be equal to the sum of the measures of cell_A and cell_B.

The above example reveals the key property of the MAC model: the instance of a Cube (the cells of a Cube) can contain measure values of different granularities even if there is a functional dependency among them. In our example, cell_A refers to sales measured per Product and Cashier while cell_C refers to sales measured per Brand and Cashier. These cells, although defined at different levels of granularity, can be part of the same Cube. This is due to the definition of the Dimension Domain, which states that all Members of all participating Levels are valid Dimension Values.

We believe that the above property of Cubes is crucial for the compact and intuitive representation of multidimensional data. As explained in Section 2.1, OLAP users usually handle data defined over various levels of granularity. Queries may impose selection conditions on various Dimension Levels and may require their result at a completely different granularity. Furthermore, even the granularity of


source data may vary. For example, a store may change for a time period the way it records its sales. The store could record only the sum of sales per product for all its cashiers, and not per product and cashier as it used to do before. Another example is when a Cube contains predicted and actual sales. The predicted values may not be computable at the lowest detail level but only at some higher levels. Furthermore, for security reasons the detailed data may not be available for stores of a particular area.

If we had only single-granularity cubes, then each time we needed a new combination of Levels we would have to add a new Cube to the schema. The schema would get complicated and so would the queries. The schema would not only have to include the Dimensions and the Cubes but also the functional dependencies among the existing Cubes. The queries would have to be aware of the available Cubes as well as their functional dependencies. They would have to select which ones to use and probably join some of them before defining one or more grouping operations on parts of those Cubes.

Our approach allows a single Cube to include data of all meaningful granularities. In this way, we simplify the schema, keeping it usable by the users. Also, this approach allows queries to be expressed in a very elegant and straightforward way, avoiding any declaration of joins. Consider the following example: Let C2 be a Cube defined over the domain of the Dimensions Time, Location, Item, and Client. The structure of these Dimensions is illustrated in Figure 2-1 with the addition of the special Level ALL as the top Level of each Dimension. Assume SumOfS is the unique measure of C2 and it represents the sum of sales. Using this schema the query Q4, described in Section 2.1, can be expressed in a very simple manner. We use a QBE notation style and define the query Q4 in Table 3-2. The P. notation defines which coordinates and measures are to be returned in the result set.

Time      | Location        | Item    | Client | SumOfS
P.Month   | P.Store_Area._x | P.Brand | ALL    | P.
Year.2000 | Store_Area._x   | ALL     | ALL    | >_s*1.1
Year.1999 | Store_Area._x   | ALL     | ALL    | _s

Table 3-2: The query Q4


An additional advantage of our approach is that the most typical OLAP operations, drill-down, roll-up, slice and dice, are translated to simple selection queries on the Cube. For example, a drill-down on the results of Q4 to the Product Level can be expressed by simply replacing Brand with Product at the Item coordinate of Table 3-2.

Nevertheless, there is an important argument against this modeling approach. The Cube can represent cells with measures that are functionally dependent, but it cannot guarantee their consistency. The measure of cell_C could have been 50 in Table 3-1 without violating in any way the definition of the Cube C1. Still, from a semantic point of view this value would be inconsistent with the values of cell_A and cell_B. This inability to guarantee consistency comes from the total absence of the involved aggregation functions. Each measure of a Cube is semantically related to some aggregation. In the above examples the measure Sum of sales is obviously related to the SUM aggregation function. Still, the Cube definition does not include this relationship, making any consistency checking impossible.

We have intentionally chosen not to include the aggregation functions in the MAC model for a number of reasons. Using aggregation functions as part of the Cube's definition can serve two purposes. First, it can allow the definition of derived cells: cells that are not stored but whose values are computed whenever required. This is similar to non-materialized views used in relational databases. Second, it can impose consistency rules on the values of the cells. Checking such consistency rules is in fact the same process as computing the derived cells. Since the computation of aggregation functions is known to be a very expensive operation, both imposing such consistency rules and supporting derived cells would be computationally expensive.
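Such a consistency rule can of course still be checked outside the model, as the thesis argues a separate functional model should do. A minimal sketch over the cells of Table 3-1 (the tuple-based cell encoding and the brand_of mapping, which stands for the Product-Brand Path links, are our own illustration):

```python
# Sketch: an external consistency check verifying that a Brand-level cell
# aggregates its Product-level cells. The MAC model itself does not
# enforce this rule.

cells = {                      # (Item coordinate, Location coordinate) -> S
    ("Product=P_A", "Cashier=Cashier00701"): 10,
    ("Product=P_B", "Cashier=Cashier00701"): 20,
    ("Brand=B_1",   "Cashier=Cashier00701"): 30,
}
brand_of = {"P_A": "B_1", "P_B": "B_1"}   # Path links from Product to Brand

def consistent(cells, brand_of, brand, location):
    """True when the Brand cell equals the sum of its Product cells."""
    total = sum(s for (item, loc), s in cells.items()
                if loc == location and item.startswith("Product=")
                and brand_of[item.split("=")[1]] == brand)
    return cells[(f"Brand={brand}", location)] == total

print(consistent(cells, brand_of, "B_1", "Cashier=Cashier00701"))   # True
```

Replacing the value 30 of cell_C with 50 would make the check fail without violating the Cube definition, which is exactly the point made above.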
Had the MAC model incorporated aggregation functions within the definition of Cubes, it would be very difficult to estimate the cost of updating the values of a cell, or even the cost of reading a cell's measure values. Such simple operations could in fact trigger the execution of expensive computations. We believe that it is better to have a separate functional model through which one can define derived cubes and consistency rules. In this way one can better control the complexity of the various operations and achieve optimization.


Furthermore, there are many types of aggregation functions, and each type requires a different algorithm to compute the result values. For example, computing a moving-window average is very different from computing a simple count. Providing a general algebra in which all types of aggregation functions could be expressed would be very complicated, making a Cube definition very complex. Depending on the application domain, the operations defined for the MAC model can be extended to support the various aggregation functions that are required. Using this set of operations one can then define consistency rules and Cubes with derived cells (if required). In this way, the MAC model can be used in a wide range of situations and is not restricted by the aggregation functions we would have selected.

Finally, in some applications it might be acceptable to view and analyze data that includes inconsistent parts (maybe due to missing, incomplete or wrong information). A separate functional model can be tailored to cope with operations on such data. By forcing consistency at the data modeling level we would make our model inappropriate for these applications.

3.6.3 Operations on MACs

Select: The operation creates the resulting Cube by selecting a subset of the source Cube's measure values. The schema of the resulting Cube is identical to the schema of the source Cube. The parameter of the operation is a set of Boolean selection conditions on the measures of the source Cube. Each condition of this set has the form measureName:condition. If the measureName part is empty then the condition is evaluated for all the measures that are not mentioned by any other selection condition. If a selection condition evaluates to true for a particular measure of a cell then this measure value is included in the result Cube (at the corresponding cell). This holds even if there is a second condition, for the same measure, that evaluates to false. So, the selection conditions are considered to be disjunctive. The Boolean selection condition can reference the variables measure, cell, and source, with obvious meanings.

Syntax: Cube.Select(set_of_conditions)


Example: An analyst may want to simplify the Cube he is analyzing by removing all sales values that are less than $100. If C1 is the name of the original Cube, he can perform the following Select operation: C1.Select({SalesValue:measure>100})

Project: This operation can be used to modify the measures and Dimensions of a Cube. Using this operation one or more measures may be removed or renamed, while others may be created. Also, some of the Dimensions may be removed or renamed, or even new Dimensions may be added to the Cube.

The first parameter of the operation is a set of measure modification statements. Each statement can be either a "set measure_name = measure_value_expression" or a "delete measure_name" statement. In the set type of statement the measure_name represents the name of a new or existing measure whose values will be modified (or set). The measure_value_expression defines the values of this measure and is evaluated once for each cell of the Cube. The expression can reference the measures of the corresponding source Cube cell through the variable cell, or even the entire source Cube through the variable source. The delete type of statement removes the measure named measure_name from the result Cube.

The second parameter of the operation is a set of Dimension modification statements. There are three types of Dimension modification statements. The first type has the form "rename new_name=initial_name". This type of statement simply alters the names assigned by the Cube to some of its Dimensions. The second type is the delete statement. The delete statements have the form "delete dimension_name keeping coordinate". Such a statement removes the Dimension called dimension_name. Removing a Dimension is not a simple task. If we simply eliminated one coordinate from the set of Cube cells we would probably get many cells with identical coordinates. However, a Cube cannot have more than one cell with the same coordinates. The coordinate parameter solves this problem by identifying which cells to keep (among the many possible similar cells). The coordinate parameter is a Dimension Value of the Dimension dimension_name. One can consider that the actions taken to implement a delete statement are the following:


Remove all Cube cells whose coordinate for the Dimension dimension_name is not coordinate. Then remove the Dimension dimension_name from the Cube's definition. Finally, reduce the dimensionality of all cells by removing the coordinate corresponding to the removed Dimension (the value of which is coordinate).

The third type is the add statement. The add statements add a new Dimension to the Cube and have the form "add dimension_name=new_dimension using join_expression". The new Dimension added to the source Cube is defined by the parameter new_dimension and it is assigned the name dimension_name. The join_expression parameter is used to create the cell coordinates of the resulting Cube. The expression is evaluated for all combinations of source Cube cells (scell) and Dimension Values (dval) of the new_dimension. For each combination (scell, dval) for which the expression join_expression(scell, dval) evaluates to true, a cell is added to the result Cube. This cell has all the measure values and coordinates of the source cell (scell) plus one more coordinate for the new_dimension: the value of this coordinate is the matching Dimension Value (dval). The result Cube contains only the cells created by the above process. The expression has access to the coordinates and the measures of the source cell (scell) through the variable cell. Also, it can reference the variables dval, cube and new_dimension. The variable dval represents the examined Dimension Value, the variable cube corresponds to the source Cube, and the variable new_dimension to the added Dimension.

Note that, just like all other operations we have defined, this operation does not actually modify the source Cube (the Cube on which the operation is applied). One can consider that the operation creates the resulting Cube by creating a Cube identical to the source Cube and then applying to it the actions defined by the parameters of the operation.

Syntax: Cube.Project(measure_mod_statements, dimension_mod_statements)

Derive: The operation is used to modify the Dimensions of the Cube. A transformation expression can be defined for each of the Dimensions, along with a Dimension Value mapping expression. When a Dimension is modified its Dimension Domain may change. Therefore, a mapping expression is needed in order to map the old Dimension Values (the ones in the source Cube) to Dimension Values that exist in


the resulting Cube. The transformation expressions are declared in the first parameter of the operation, set_of_dimension_transformation_expressions. This parameter is a set of tuples of the form <dimension_name:transformation:mapping_expr>. The second parameter of the operation is an aggregation expression for measure values, aggregation_expression. This expression may be required in order to compute the measure values of a cell when more than one cell from the source Cube is mapped to a single cell in the resulting Cube. When the mapping expression mapping_expr is used to transform the Dimension Values of the source Cube into Dimension Values of the resulting Cube, there is no guarantee that the relationship is going to be one-to-one. In such cases more than one coordinate in a Dimension of the source Cube maps to a single coordinate in the corresponding Dimension of the resulting Cube. Consequently, two or more cells of the source Cube will be mapped to the same cell in the resulting Cube. In order to compute the correct measure values at such cells the user can define the appropriate aggregation expression. If such an expression is not provided, the operation will simply assign a NULL value whenever the corresponding source Cube cells do not define a unique value for a measure.

Syntax: Cube.Derive(set_of_dimension_transformation_expressions, aggregation_expression)
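The many-to-one mapping behavior, with and without an aggregation expression, can be sketched as follows (the flat coord-to-value cells and the Month-to-Year mapping are our own illustration; None stands for NULL):

```python
# Sketch of the Derive value mapping: several source coordinates may map
# to one result coordinate. An aggregation expression combines the
# clashing measure values; without one the result is NULL (None) unless
# the source cells agree on a unique value.

def derive_measure(cells, mapping, aggregate=None):
    """cells: coord -> value; mapping: old coord -> new coord."""
    groups = {}
    for coord, value in cells.items():
        groups.setdefault(mapping[coord], []).append(value)
    if aggregate is not None:
        return {c: aggregate(vs) for c, vs in groups.items()}
    return {c: vs[0] if len(set(vs)) == 1 else None
            for c, vs in groups.items()}

sales = {"1995-01": 10, "1995-02": 20}            # per Month
to_year = {"1995-01": "1995", "1995-02": "1995"}  # many-to-one mapping
print(derive_measure(sales, to_year, aggregate=sum))   # {'1995': 30}
print(derive_measure(sales, to_year))                  # {'1995': None}
```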

Pull: The operation creates and adds to the Cube a new Dimension based on the values of one or more of its measures. The resulting Cube is similar to the source Cube; the only difference is the additional Dimension. Even the measures that are used to generate the new Dimension are retained in the resulting Cube. The parameters of the operation identify the names of the measures that are used (set_of_measure_names) and the name of the new Dimension (new_dimension_name). The new Dimension is a simple, one-Path and one-Level Dimension. The unique Path of the new Dimension has no links since it has only one Level. This new Level is constructed to contain as Dimension Members the distinct values of the identified measures, as they appear in the cells of the Cube. In this case the NULL values are treated just like any other value. The Level's attributes, which are all key attributes of the Level, are named after the measures they represent.


The cells of the resulting Cube are constructed from the cells of the source Cube by adding one more coordinate: the value of this coordinate is defined by the values of the corresponding measures (set_of_measure_names).

Syntax: Cube.Pull(set_of_measure_names, new_dimension_name)

Example: The operation is particularly useful when a measure has only a limited number of values. Assume that we have a Cube C1 that has a measure called Grade and this measure is assigned only the values 1, 2, 3, 4 and 5. If we want to analyze the Cube's data based on the value of Grade (for example, if we want to count how many 1, 2, 3, 4 and 5 Grade values we have in some part of the Cube) we can transform the Grade measure into a Dimension with a Pull operation. Each cell that has a Grade value of 1 would be associated with a Dimension Member (in the new Dimension) that has Grade=1, each cell that has a Grade value of 2 would be associated with a Dimension Member with Grade=2, and so on. Using the new Dimension we can then aggregate the required parts of C1 and obtain our results. The syntax of the operation would simply be: C1.Pull({Grade}, Grade_dim)

Push: Transforms a Dimension into measures. This operation is the reverse of Pull. The new measures contain the values of the coordinate that corresponds to the transformed Dimension. The parameters of the operation define a measure name prefix (name_prefix), which is explained later, the name of the Dimension being used (dimension_name) and an arbitrary set of Level names (set_of_level_names). The Level names must identify Levels of the selected Dimension. If set_of_level_names is empty the operation uses all the Levels of the selected Dimension.

The resulting Cube is similar to the source Cube. The only difference is the additional measures that the result Cube contains. All the new measures have a fixed prefix in their names: the name_prefix specified as parameter. This is used in order to prevent cases where a measure name that needs to be added is already used by an existing measure. The rule is that the specified prefix must not match the prefix of any existing measure of the Cube. So, if there is a measure named m1_sum the prefix cannot be m1.


The way in which the new measures are created and assigned values, for each cell of the resulting Cube, is the following: let d_val be the coordinate of the cell that corresponds to the selected Dimension (dimension_name), and let d_val be created by the set of Dimension Members {dm1, dm2, …, dmK}. For each such Dimension Member dmi check whether the name (dmi_level_name) of the Dimension Level to which dmi belongs exists in the list set_of_level_names. If it is in the list then add to the cell the key attributes of dmi as measures. The names of the added measures are created using the formula: name_prefix + dmi_level_name + key attribute name. The values of the added measures are the values of dmi.

Syntax: Cube.Push(name_prefix, dimension_name, set_of_level_names)

Example: Although the operation looks complicated it is not. Assume we have a Cube C2 with a Dimension named Children that contains only one Level. The key attribute of this Level is num_of_children. We perform the operation C3=C2.Push("m_", "Children", {}) and obtain the result Cube C3, which is similar to C2 with the addition of the measure m_Children_num_of_children. Each cell that is associated with the Dimension Member with num_of_children=1 (that is, the cell's coordinate for the Dimension Children is <num_of_children=1>) would have the value 1 for this new measure.

Union: The operation creates a new Cube by unifying the information from two Cubes. The result Cube has all the Dimensions and all the measures of both source Cubes. If the two source Cubes have common measures (identical names) then these measures appear only once in the result Cube. The same holds for the Dimensions. If the two source Cubes have synonym Dimensions then these Dimensions are tagged as shared Dimensions and will appear only once in the result Cube. However, when we have two synonym Dimensions (one from each source Cube) it does not necessarily mean that these two Dimensions are identical (the same Paths and dependencies). Therefore, the operation performs a Dimension Union operation for each such pair of synonym Dimensions, and the new Cube contains the result of these Union operations and not the initial synonym Dimensions. Note that the Dimension Union operation can trigger a Path Union operation, which may require the execution of Level Union operations. This leads to modifications of Dimension Members and, therefore, the modification of Dimension Values, which are used as coordinates in the Cube. However, since the modifications are one-to-one (one Dimension Member is transformed into one new Dimension Member) we have no problem in linking the affected cells to the new Dimension Members.

For each Dimension of the second source Cube that does not have a synonym in the first source Cube, the parameters of the operation must specify a Dimension Value to which the cells of the first source Cube will be linked. The same holds for the non-shared Dimensions of the first source Cube. The parameter first_set_of_coordinates contains a set of tuples of the form <dimension_name:dimension_value>. These tuples define the coordinates to which all cells of the second source Cube will be linked. A second parameter, second_set_of_coordinates, contains a similar set for the non-shared Dimensions of the second source Cube and is used to link the cells of the first source Cube.

The result Cube contains all measure values found in both source Cubes. One can consider that this is achieved using the following algorithm: each cell of the first source Cube that contains measure values is mapped to a cell in the result Cube. To achieve this, the cell's coordinates for some of the shared Dimensions may have to be adjusted (if the Dimension Union operation modified Dimension Members). For the coordinates of the remaining Dimensions (those that a cell in the result Cube has and the cell in the first source Cube does not have) we use the ones specified in the parameter second_set_of_coordinates. After the mapping has been achieved, the measure values of the first source Cube cell are copied to the mapped cell in the result Cube. A similar process is then used for the cells of the second source Cube, except that now the missing coordinates are provided by first_set_of_coordinates.

If a result Cube cell is mapped both to a cell in the first source Cube and to a cell in the second source Cube, and the two cells assigned different values to the same measure, then a conflict resolution policy is used. The conflict resolution policy may be defined as a parameter of the operation. If it is not defined, the default policy is to assign a NULL value to the conflicting measure. The expression that defines the resolution policy is used to assign a value to the conflicting measure. It can reference the value of the first Cube through the variable firstMeasure and the value of the second Cube through secondMeasure. The expression also has access to the variables firstCell, firstCube, secondCell and secondCube.


Syntax: Union(Cube1, Cube2, first_set_of_coordinates, second_set_of_coordinates, conflict_resolution)

Example: Consider a company that has two branches, one in the USA and one in Europe. For autonomy reasons each of the branches collects its own sales data and uses it to populate its own sales Cube. The company has largely predefined the structure (the Dimensions) of the sales Cube, but each of the two branches made some minor additions. So, the company ends up with two Cubes with similar but not identical structure and different data sets. Using a Union operation the company can combine the data (the cells) of the two Cubes and obtain a global view of the data.

Difference: The parameters of the operation are two Cubes with identical structure (Dimensions). The operation creates a result Cube that is identical in structure to the two source Cubes and which contains the measure values that are found in the first source Cube but not in the second source Cube. So, for every cell C of the result Cube there are two corresponding (same coordinates) cells: one in the first source Cube (C1) and one in the second source Cube (C2). For every measure m of C1, if the value of m in C1 is different from the value of m in C2 (C1.m <> C2.m) then C will contain the measure value C1.m (C.m=C1.m). Otherwise, C.m=NULL. Note that in our model it is exactly the same thing to have a cell that does not contain a measure and a cell that assigns a NULL value to it.

Syntax: Difference(Cube1, Cube2)
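To make the semantics concrete, the Pull and Difference operations can be sketched in Python over a toy cube represented as a dict from coordinate tuples to measure dicts. This flat representation, the function names and the sample data are illustrative assumptions only; they are not part of the MAC formalism (in particular, Paths, Levels and schema bookkeeping are omitted).

```python
def pull(cube, measure_names, new_dimension_name):
    """Pull: turn the listed measures into one extra coordinate.

    Every cell keeps all its measures; the new coordinate is the tuple
    of the pulled measure values (None/NULL is treated like any other
    value). The new Dimension's name would be recorded in the schema;
    this flat sketch carries it only for documentation.
    """
    result = {}
    for coords, measures in cube.items():
        new_coord = tuple(measures.get(m) for m in measure_names)
        result[coords + (new_coord,)] = dict(measures)
    return result


def difference(cube1, cube2):
    """Difference: keep the measure values of cube1 that differ in cube2.

    Where the two cubes agree, the measure is simply absent from the
    result (a missing measure and a NULL measure are the same thing in
    the model).
    """
    result = {}
    for coords, measures in cube1.items():
        other = cube2.get(coords, {})
        kept = {m: v for m, v in measures.items() if other.get(m) != v}
        if kept:
            result[coords] = kept
    return result


# C1 has coordinates (product, customer) and measures Grade and Sales.
C1 = {("p1", "c1"): {"Grade": 1, "Sales": 10.0},
      ("p1", "c2"): {"Grade": 2, "Sales": 30.0}}

# Pulling Grade adds a third coordinate holding the Grade value.
C1p = pull(C1, ["Grade"], "Grade_dim")
assert ("p1", "c1", (1,)) in C1p and C1p[("p1", "c1", (1,))]["Grade"] == 1

# Difference keeps only the measure values that changed.
C2 = {("p1", "c1"): {"Grade": 1, "Sales": 10.0},
      ("p1", "c2"): {"Grade": 2, "Sales": 99.0}}
assert difference(C2, C1) == {("p1", "c2"): {"Sales": 99.0}}
```

The sketch makes visible why Pull is useful for analysis: after pulling Grade, cells with the same Grade value share a coordinate and can be grouped and aggregated like any other Dimension.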

3.7 Example application of the MAC model

In this section we give an example OLAP scenario and show how the MAC model can be used to represent the data. The OLAP scenario is based on a real-life OLAP application with which we have been involved as part of a research project. Assume the following scenario: A large retail company, selling home appliances, wishes to analyze its transaction data in order to come up with better marketing strategies. To do that, the company builds a data warehouse that stores (at least) the transaction data at the item granularity. Every line of every invoice and every shipping record is recorded as a fact, a transaction, in the data warehouse. Hence, every item (or batch of identical items) that is sold at a store, returned, or moved from/to the warehouses of the company is recorded as a transaction along with a number of measures. According to the design, an invoice line or shipping record line can be linked to another transaction, called the root transaction. The data warehouse is built using a relational database system and the resulting schema (presented in Figure 4-9) contains one central fact table with 27 measures and 12 dimension tables. The data warehouse is fed with new data from the OLTP system on a daily basis.

A MAC schema that contains Dimension Levels, Dimension Paths, Dimensions and one Cube can represent the data in the data warehouse. In order to define the schema we examine the key properties that characterize the measures in the fact table of the data warehouse. These properties are:
1. The product
2. The date when the transaction was performed
3. The date when the items were exported from a store/warehouse
4. The date when the items were delivered
5. The date of the root transaction
6. The customer
7. The salesman
8. The transaction type
9. The transaction type of the root transaction
10. The warehouse or store that issued the transaction
11. The warehouse or store that exported the items
12. The warehouse or store that delivered the items
13. The warehouse or store that issued the root transaction
14. The type of payment for the transaction
15. Special properties of the transaction

There are 27 measures that are characterized by these properties in the data warehouse. Among them we name only a few with obvious meanings: number of items, unit cost, unit interest, unit sale price, total value, VAT value, surcharges and discount. Based on the above information we know that the schema will contain a Cube with 15 Dimensions and 27 measures. However, in order to define the Dimensions we need to know what kind of aggregation the analysis may require for each of these 15 properties. The designer of the data warehouse analyzed these needs and came up with a design that in a MAC schema would be represented by the Dimension Levels, the Dimension Paths and the Dimensions described next.

The product property is recorded in the OLTP system through a product code attribute. Products can be grouped into so-called product groups and these groups may be further grouped into product categories. A different kind of grouping of product codes can be done by the brand of the product, while another grouping can be done using the origin country of the product. Therefore, the D_Product Dimension has 3 Dimension Paths: the first one, P_PGroup, contains the Levels Product, Product_Group, Product_Category and ALL. Recall that ALL is a special Level that has only one member (named all) and is used to aggregate all members of a Dimension Level. The second Path, P_PBrand, contains the Levels Product, Brand, ALL and the third Path, P_PCountry, contains the Levels Product, Country and ALL. Figure 3-10 illustrates the structure of the Dimension D_Product.

[Figure 3-10 diagram: the Dimension D_Product, whose three Paths P_PGroup (Product, ProdGroup, ProdCategory, ALL), P_PBrand (Product, Brand, ALL) and P_PCountry (Product, Country, ALL) share the Product Level.]

Figure 3-10: The Dimension D_Product

The D_Product Dimension defines the product that is being sold, returned or moved within the company (among warehouses and stores). Each product has a unique product identifier named code. This attribute is the key attribute of the most detailed Level of the Dimension: the Product Level. The members of this Level have the following additional attributes: description, measurement_code, measurement_unit, measurement_abbr, vat_category_code, vat_category, negative_value, service_item, creation_date, inactivation_date, item_type, item_status_mnem_code, item_status_code, item_status.

One can see that any of these attributes could have been used to create a new Level and a new Path that would group products based on some other property (for example the VAT category). The choice was made by the analysts, the users and designers of the OLAP application, who considered these attributes not useful for aggregation in the analysis they want to perform. Therefore, these attributes appear simply as descriptive attributes in the Product Level, just in case their value is needed. As in the case of the Level Product, we have key attributes and additional attributes in most of the other Levels. We are not going to explain the attributes of each individual Level but refer the reader to Figure 4-9 for more details.

The links in the Dimension Paths P_PGroup, P_PBrand and P_PCountry are created based on the information the system has about each product. Assuming we have a relational table Product that contains, among others, the attributes code (as the key of the table) and brand_code, and that brand_code is the key attribute of the Level Brand, we can identify the links of the Dimension Path P_PBrand from the information in the following query:

SELECT code, brand_code
FROM Product
WHERE brand_code IS NOT NULL

The remaining Dimensions are constructed in a similar manner. Note, however, that among the 15 key properties of the measures there are four properties that refer to dates. In our MAC schema we do not need to define four identical date Dimensions but just one that is used (included) four times in the definition of the Cube. Of course, the Cube still has 15 Dimensions, and each of the four date Dimensions is assigned a different name by the Cube: the role name of the Dimension.


The resulting MAC schema contains the following 8 Dimensions. These are the names assigned to the Dimensions by the MAC schema, while the Cube assigns other names to some of them.
1. D_Product
2. D_Date
3. D_Customer
4. D_Salesman
5. D_TransactionType
6. D_Warehouse
7. D_PaymentType
8. D_SpecialIDL

The Levels and Dimension Paths of each Dimension are shown, using the graphical notation that was introduced earlier, in the following figures. The first one, Figure 3-11, depicts the Dimension D_Date. This Dimension is quite a complicated one since it contains 18 Levels and 7 Paths. The most detailed Level of the Dimension is the Hour Level. However, this Level is only used in one Path, the P_Date. For most of the other Paths the most detailed Level is the Day Level. Furthermore, most of the Paths have the ALL Level as the top Level, allowing the aggregation of all dates into one single value. However, this is not the case for the Path P_DType. This Path does not allow the aggregation of the members of the TypeOfDay Level because such aggregation is not considered meaningful. The TypeOfDay Level contains four Dimension Members: isHoliday, isWorkingDay, isWeekDay, isWeekEndDay. We can use this Level to group days based on any of these properties (but not on a combination of them). Since a day can be both a holiday and a weekday, we see that the relationship between Day and TypeOfDay is many-to-many. Therefore it is not meaningful to aggregate the Members of TypeOfDay, as we would end up counting the same days many times. If we wanted to group days using both their "weekday" and "holiday" properties we should have made a different design where different Paths would represent these properties. This is the case for some of the Paths of the D_Customer Dimension (for example P_CHasCar and P_CHasCreditCard).

One common point of misunderstanding is the content of the Level Day. The Members of the Day Level represent individual actual days and not just the day number within the year. If the data covers a period of 10 years we would have about 3652 Members in the Day Level. A similar comment holds for the Month, Week, Quarter and Half Levels. In order to group the days based on their day number within the year we have the Path P_YearDay, as well as the Paths P_MonthDay and P_WeekDay to group days based on the day number within the month and within the week. A similar Path, P_YearMonth, exists to group months based on their names (or number within the year: 1-12). Finally, the Path P_DWeek is used to group days into weeks, which can be further grouped based on their number (1-53) within the year.
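The distinction between the actual-day Members of the Day Level and the seven Members of a P_WeekDay-style grouping can be illustrated with a small Python sketch. The chosen year and the grouping code are hypothetical illustrations, not MAC syntax:

```python
from datetime import date, timedelta
from collections import defaultdict

# Members of the Day Level are actual calendar days, one Member per date.
start = date(2004, 1, 1)          # an arbitrary example year (a leap year)
days = [start + timedelta(n) for n in range(366)]

# A P_WeekDay-style Path instead groups those days by their number
# within the week (ISO numbering: 1 = Monday ... 7 = Sunday).
by_weekday = defaultdict(list)
for d in days:
    by_weekday[d.isoweekday()].append(d)

assert len(days) == 366            # 366 distinct Day Members for 2004
assert len(by_weekday) == 7        # but only seven WeekDay Members
assert sum(len(g) for g in by_weekday.values()) == 366
```

Each of the seven WeekDay Members groups roughly 52 actual days, which is exactly why the Day and WeekDay Levels must not be confused.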


[Figure 3-11 diagram: the D_Date Dimension. Its Levels include Hour, Day, Week, WeekNum, Month, MonthName, MonthDay, WeekDay, YearDay, Quarter, Half, Year, FiscalMonth, FiscalQuarter, FiscalHalf, FiscalYear, TypeOfDay and ALL; the Paths shown are P_Date, P_FiscalDate, P_DWeek, P_YearDay, P_MonthDay, P_WeekDay, P_YearMonth and P_DType.]

Figure 3-11: The D_Date Dimension

The Dimension D_Customer has a large number of Paths: a total of 13 Paths that can be used to analyze and group customers. Five of those use the Customer Level as the most detailed Level and the YesNo Level as a second Level. YesNo is a very simple Level with only two Members, yes and no, and it can be used to represent Boolean properties. Each of these five Paths assigns a different name to this Level because it uses it with a different meaning. For example, the Path P_CHasCreditCard is used to group customers based on whether they have a credit card. So, the Path assigns the name HasCCard to its YesNo Level and links all customers that have a credit card to the yes Member and the remaining customers to the no Member. In the analysis, any combination of properties can be used for grouping the customers by selecting the appropriate Dimension Value.


[Figure 3-12 diagram: the D_Customer Dimension. Its Levels include Person, Area, City, County, GDepartment, Country, Profession, ProfGroup, CustGroup, Credibility, CustStatus, Type, Store, Sex, the role-named YesNo Levels HasCCard, HasLifeIn, ECom, HasHouse and HasCar, and ALL; the Paths shown are P_CGeography, P_CProfession, P_CGroup, P_CCredib, P_CStatus, P_CType, P_CIniStore, P_CSex, P_CHasCreditCard, P_CHasLifeIn, P_CECom, P_CHasHouse and P_CHasCar.]

Figure 3-12: The D_Customer Dimension

Recall that Dimension Values are the coordinates of the cube cells and that they can be constructed from a number of Dimension Members of distinct and non-related Levels (of a Dimension). Since the YesNo Levels are assigned different names by the Paths that use them, the Dimension treats them as different Levels, and since they are all on different Paths they are non-related. For example, the Dimension Value {HasCCard:yes, HasCar:no} is a valid Dimension Value that identifies the customers who have a credit card but not a car.
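How such a composite Dimension Value selects cells can be sketched in Python. The customer data and the helper function below are hypothetical illustrations of the selection semantics, not part of the MAC model:

```python
# Each customer carries its Member for every role-named YesNo Level.
customers = {
    "anna":  {"HasCCard": "yes", "HasCar": "no"},
    "bob":   {"HasCCard": "yes", "HasCar": "yes"},
    "carol": {"HasCCard": "no",  "HasCar": "no"},
}

def matches(member_props, dimension_value):
    """A customer matches a Dimension Value if it agrees on every
    (level name, Member) pair the value specifies."""
    return all(member_props.get(level) == member
               for level, member in dimension_value.items())

# The Dimension Value {HasCCard:yes, HasCar:no} from the text:
dv = {"HasCCard": "yes", "HasCar": "no"}
selected = [name for name, props in customers.items()
            if matches(props, dv)]
assert selected == ["anna"]
```

Because the Levels lie on different Paths, any subset of them can be combined freely; a Dimension Value that mentions only HasCCard would select both anna and bob.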


[Figure 3-13 diagram: the D_Salesman Dimension. Its Levels include Salesman, Area, City, County, GDepartment, Country, Department, Prim, Speciality, Sex, YesNo and ALL; the Paths shown are P_SGeography, P_SDepartment, P_SPrim, P_SSpeciality, P_SSex and P_SIsMarried.]

Figure 3-13: The D_Salesman Dimension

The D_Salesman Dimension has a structure similar to, but simpler than, that of D_Customer. However, it has 6 independent ways to group salesmen: the 6 Paths of the Dimension. It becomes clear that the usage of multiple hierarchies within one dimension is an important requirement for an OLAP data model. If these Paths were each treated as separate dimensions of the cube, as some models suggest, we would end up with cubes that have tens of dimensions and we would have a hard time establishing the relationships that exist among these dimensions.


[Figure 3-14 diagram: the D_Warehouse Dimension. Its Levels include Warehouse, Area, City, County, GDepartment, Country, Branch, WarehouseType, WarehouseGroup, ComericialGroup, CostCenter, YesNo and ALL; the Paths shown are P_WGeography, P_WBranch, P_WType, P_WGroup, P_WCGroup, P_WCostCenter and P_WIsThirdParty.]

Figure 3-14: The D_Warehouse Dimension

Note that three Dimensions, D_Warehouse, D_Salesman and D_Customer, have Paths that group information based on the geographical location. All three Dimensions (and their Paths) use the same Levels (Area, City, County, GDepartment and Country) for this geographical grouping. This reduces redundancy and makes it easier to update the geographical information used.


[Figure 3-15 diagram: the Dimensions D_TransactionType (Levels TransactionType, TransGroup and ALL; Paths P_TType, P_TGroups), D_PaymentType (Levels PaymentType, the role-named YesNo Levels InCash and NeedApproval, and ALL; Paths P_PIsCash, P_PNeedsApproval) and D_SpecialIDL (Levels SpecialFlags, DeliveryCode, CoveredCode, ReservedCode and ALL; Paths P_IDeliveryC, P_ICoveredC, P_IReservedC).]

Figure 3-15: The Dimensions D_TransactionType, D_PaymentType and D_SpecialIDL

The Cube of our example MAC schema contains 27 measures and 15 Dimensions. Note that the Cube assigns various names to its Dimensions in order to identify the way in which each Dimension is used. In Figure 3-16 we show the graphical notation for the SalesCube and suppress the names of the measures, since the actual list of names is not important in the model. Recall that the model does not define the relationships that might exist among the various measures. Something like that can be achieved only by defining consistency-checking expressions.
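The role-naming of shared Dimensions described above can be sketched as a simple mapping from role names to Dimension names. This dict-based sketch is an illustration, not MAC syntax, and lists only some of the fifteen roles:

```python
# SalesCube role names mapped to the MAC Dimensions that back them.
# The single D_Date Dimension is reused under four different role names.
sales_cube_dimensions = {
    "TransDate_Dim":   "D_Date",
    "ExportDate_Dim":  "D_Date",
    "DeliverDate_Dim": "D_Date",
    "RootDate_Dim":    "D_Date",
    "Product_Dim":     "D_Product",
    "Customer_Dim":    "D_Customer",
    # (the remaining role names are omitted here for brevity)
}

# One shared D_Date definition backs all four date roles:
date_roles = [role for role, dim in sales_cube_dimensions.items()
              if dim == "D_Date"]
assert len(date_roles) == 4
```

The benefit is the one stated in the text: the date hierarchy is defined once, and a change to it is automatically reflected in all four roles.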


[Figure 3-16 diagram: the SalesCube and the role names it assigns to its Dimensions — Product_Dim (D_Product); TransDate_Dim, ExportDate_Dim, DeliverDate_Dim and RootDate_Dim (D_Date); TransactionType_Dim and RootTransType_Dim (D_TransactionType); Customer_Dim (D_Customer); Salesman_Dim (D_Salesman); IssueWH_Dim, ExportWH_Dim, DeliverWH_Dim and RootWH_Dim (D_Warehouse); PaymentType_Dim (D_PaymentType); Special_Dim (D_SpecialIDL).]

Figure 3-16: The Cube definition


PART II QUERY PROCESSING TECHNIQUES


4 Processing Multidimensional-Analysis Queries

Managers, analysts and other decision makers use Multidimensional-Analysis Queries in order to obtain complex information about their business. The queries usually involve the aggregation of measured data based on predefined hierarchies using some selection criteria. The challenging work of the analysts is to define meaningful dimension hierarchies and use the appropriate selection criteria. In most cases front-end analysis tools are used to form the queries. These tools use specialized user interfaces in order to simplify the task of defining these complex business queries. The queries are then sent to the DBMS, which undertakes the effort of computing the answers. The results returned by the DBMS are processed by the front-end tools and displayed in a user-friendly way, which usually includes 2D or 3D graphics.

In this chapter we discuss some of the techniques used by the DBMS to process such queries. Processing these queries can be a very demanding task since large volumes of data are typically involved. Using a standard RDBMS to answer such queries has proven to be inefficient and sometimes even impossible for real-life cases. The queries either take too long to be answered or demand too many system resources. Hence, the need for specialized optimization techniques was obvious from the beginning. The query processing technique that we present in this chapter is based on the Multidimensional Hierarchical Clustering / Hierarchical Indexing (MHC/HI) approach. This is the starting point on which we base our research in the processing and optimization methods for Multidimensional-Analysis Queries.

The chapter is organized as follows: In section 4.1 we present an overview of the MHC/HI framework and present, through an example query, the general processing steps in the MHC/HI environment. The example also demonstrates the key concepts of the Hierarchical Pre-Grouping optimization technique. In section 4.2 we present the related work.
In section 4.3 we start the detailed presentation of the query processing technique by first defining the database schema assumed. Then, in section 4.4 we identify the query class that our query processing technique supports. Given the database schema and query class, we present in section 4.5 the general query-processing plan. This general plan can be optimized in various ways. Section 4.6 provides a general discussion on the various optimization opportunities that have been identified. The main optimization technique, the Hierarchical Pre-Grouping transformation, is the subject of section 4.7, while other optimization techniques are discussed in sections 4.8 and 4.9. Finally, section 4.10 presents experimental results and concludes the chapter.

4.1 Query Processing Overview

In the Data Warehousing (DW) and On-Line Analytical Processing (OLAP) areas, the need for fast response times to large aggregation queries has motivated research and implementation efforts for quite some time. Various methods and solutions have been proposed by both industry and academia. The well-known star schema, the specialized access methods and the usage of materialized views containing pre-aggregated data are some of the proposed technologies that have been implemented and used in real-life systems. All these solutions are implemented on top of the very successful relational database technology.

One of the recently proposed techniques is the Multidimensional Hierarchical Clustering and Hierarchical Indexing technique ([MRB99], [KS01]). This technique is based on the star-schema organization with a focus on the usage of clustering access methods. The main goal is the reduction of the number of I/O operations required to answer the large aggregate queries. According to this method, the fact table of a data warehouse is stored, indexed and clustered with respect to the dimension hierarchies by using special attributes called hierarchical surrogate keys. Since most aggregation queries apply restrictions on the dimension hierarchies, the fact-table data needed to answer such queries are found clustered in a relatively small number of disk pages, improving the performance.

In the presence of this new data organization, new processing and optimization techniques have recently been proposed ([KTS02], [PER03], [TT03]). One of these techniques, called Hierarchical Pre-Grouping, exploits the existence of the hierarchical surrogate keys in order to improve the query execution time even further. The technique modifies the query execution plan in an attempt to aggregate the fact-table tuples as early as possible and avoid redundant joins. In order to illustrate the importance of this technique we next present an example scenario where the Hierarchical Pre-Grouping technique is applied according to the heuristic algorithm presented in section 4.7.

Example: Applying Hierarchical Pre-Grouping

Consider the simplified data warehouse schema shown in Figure 4-1. The data warehouse stores sales transactions recorded per item, store, customer and date. It contains one fact table SALES_FACT, which is defined over the dimensions PRODUCT, CUSTOMER, DATE and LOCATION. The single measure of SALES_FACT is sales, representing the sales value for an item bought by a customer at a store on a specific day. Each dimension is stored in a dimension table and is organized according to a hierarchy. For example, the LOCATION dimension is organized into a hierarchy with three levels: Store-Area-Region. Stores are grouped into geographical areas and the areas are grouped into regions. The attributes store_id, area and region are called hierarchical attributes because they are
