Netside where knowledge is shared, ideas are spread.

[Notes] CSCI 585 Database Business Intelligence and Data Warehouses

Credit to: Prof. Saty Raghavachary, CSCI 585, Spring 2020

Chapter 13

outline

  • How business intelligence provides a comprehensive business decision support framework
  • About business intelligence architecture, its evolution, and reporting styles
  • About the relationship and differences between operational data and decision support data
  • What a data warehouse is and how to prepare data for one
  • What star schemas are and how they are constructed
  • About data analytics, data mining, and predictive analytics
  • About online analytical processing (OLAP)
  • How SQL extensions are used to support OLAP-type data manipulations

Business Intelligence (BI)

  • Comprehensive, cohesive, integrated set of tools and processes
    • Captures, collects, integrates, stores, and analyzes data
  • Purpose - Generate and present information to support business decision making
  • Allows a business to transform:
    • Data into information
    • Information into knowledge
    • Knowledge into wisdom

Figure 13.1 - Business Intelligence FrameworkBusiness Intelligence Tools

Business Intelligence Benefits

  • Improved decision making
  • Integrating architecture
  • Common user interface for data reporting and analysis
  • Common data repository fosters single version of company data
  • Improved organizational performance

Figure 13.3 - Evolution of BI Information Dissemination Formats

Decision Support Data (warehouse data)

  • Effectiveness of BI depends on quality of data gathered at operational level
  • Operational data
    • Seldom well-suited for decision support tasks
    • Stored in relational database with highly normalized structures
    • Optimized to support transactions representing daily operations
  • Differ from operational data in:
    • Time span
    • Granularity
      • Drill down: Decomposing a data to a lower level
      • Roll up: Aggregating a data into a higher level
    • Dimensionality

Table 13.5 - Contrasting Operational and Decision Support Data Characteristics

Decision Support Database Requirements

  • Database schema
    • Must support complex, non-normalized data representations
    • Data must be aggregated and summarized
    • Queries must be able to extract multidimensional time slices
  • Data extraction and loading
    • Allow batch and scheduled data extraction
    • Support different data sources and check for inconsistent data or data validation rules
    • Support advanced integration, aggregation, and classification
  • Database size should support:
    • Very large databases (VLDBs)
    • Advanced storage technologies
    • Multiple-processor technologies

Table 13.8 - Characteristics of Data Warehouse Data and Operational Database Data

Figure 13.5 - The ETL Process

Data Marts

  • Small, single-subject data warehouse subset
  • Provide decision support to a small group of people
  • Benefits over data warehouses
    • Lower cost and shorter implementation time
    • Technologically advanced
    • Inevitable people issues
Rule NO. DESCRIPTION
1 The data warehouse and operational environments are separated
2 The data warehouse data are integrated
3 The data warehouse contains historical data over a long time
4 The data warehouse data are snapshot data capture ata a given point in time
5 The data warehouse data are subject oriented
6 The data warehouse data are mainly read-only with periodic batch updates from operational data. No online update are allowed
7 The data warehouse development life cycle differs from classical systems development. Data warehouse development is data-drivenl; the classical approach is process-driven
8 The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data.
9 The data warehouse environment is characterized by read-only transactions very large data sets. The operational environment s is characterized by numerous updata transactions to a few data entities at a time
10 The data warehouse environment has a system that traces data source, transformations and storage.
11 Tje data warehouse’s metadata are a critical component of this environment. The metadata identify and define all data elements, The metadata provide the source, transformation, integration, storage, usage, relationship, and history of each data element.
12 The data warehouse contains a chargeback mechanism for resource usage that enforce optimal use of the data by end users.

Star Schema

  • Data-modeling technique
  • Maps multidimensional decision support data into a relational database
  • Creates the near equivalent of multidimensional database schema from existing relational database
  • Yields an easily implemented model for multidimensional data analysis

Components of Star Schemas

  • Facts
    • Numeric values that represent a specific business aspect
  • Dimensions
    • Qualifying characteristics that provide additional perspectives to a given fact
  • Attributes
    • Used to search, filter, and classify facts
    • Slice and dice: Ability to focus on slices of the data cube for more detailed analysis
  • Attribute hierarchy
    • The attribute hierarchy provides the capability to perform drill-down and roll-up searches in a data warehouse.

Figure 13.10 in multidimensional analysisAttribtute hierarchies

Star Schema Representation

  • Facts and dimensions represented by physical tables in data warehouse database
  • Many-to-one (M:1) relationship between fact table and each dimension table
  • Fact and dimension tables
    • Related by foreign keys
    • Subject to primary and foreign key constraints

example

  • Primary key of a fact table
    • Is a composite primary key because the fact table is related to many dimension tables
    • Always formed by combining the foreign keys pointing to the related dimension tables We have the fact table (of transactions) at the center, and denormalized (all-in-one) dimension tables all around.

Here is another representation.

Note: “dimensions are qualifying characteristics that provide additional perspectives to a given fact; dimensions provide descriptive characteristics about the facts through their attributes.”

Slicing and dicing the cube provides specific insights..

Techniques Used to Optimize Data Warehouse Design

  • Normalizing dimensional tables
  • Snowflake schema: Dimension tables can have their own dimension tables
  • Maintaining multiple fact tables to represent different aggregation levels
  • Denormalizing fact tables

snowflake schema

Data Analytics

  • Encompasses a wide range of mathematical, statistical, and modeling techniques to extract knowledge from data
    • Subset of BI functionality
  • Classification of tools
    • Explanatory analytics: Focuses on discovering and explaining data characteristics and relationships based on existing data
    • Predictive analytics: Focuses on predicting future outcomes with a high degree of accuracy

Online Analytical Processing (OLA)

  • Advanced data analysis environment that supports decision making, business modeling, and operations research
  • Characteristics
    • Multidimensional data analysis techniques
    • Advanced database support
    • Easy-to-use end-user interfaces

Figure 13.19 - OLAP Architecture

Figure 13.20 - OLAP Server with Local Miniature Data Marts

Relational Online Analytical Processing (ROLAP)

  • Provides OLAP functionality using relational databases and familiar relational tools to store and analyze multidimensional data
  • Extensions added to traditional RDBMS technology
    • Multidimensional data schema support within the RDBMS
    • Data access language and query performance optimized for multidimensional data
    • Support for very large databases (VLDBs)

Multidimensional Online Analytical Processing (MOLAP)

  • Extends OLAP functionality to multidimensional database management systems (MDBMSs)
    • MDBMS: Uses proprietary techniques store data in matrix-like n-dimensional arrays
    • End users visualize stored data as a 3D data cube
      • Grow to n dimensions, becoming hypercubes
      • Held in memory in a cube cache to speed access
  • Sparsity: Measures the density of the data held in the data cube

Table 13.12 - Relational vs. Multidimensional OLAP

SQL Extensions for OLAP

  • The ROLLUP extension
    • Used with GROUP BY clause to generate aggregates by different dimensions
    • Enables subtotal for each column listed except for the last one, which gets a grand total
    • Order of column list important

ROLLUP extension “Subtotal for each vendor - all products; sum of (the only set of) subtotals”.

  • The CUBE extension
    • Used with GROUP BY clause to generate aggregates by the listed columns
    • Includes the last column

CUBE extension

“Subtotal for each month - all products; subtotal for each product - all months; sum of (either set of) subtotals”.

Data Lakes

A ‘traditional’ data warehouse is an ETL-based, historical record of transactions - very RDB-like.

A ‘modern’ alternative is a ‘data lake’, which offers a more continuous form of analytics, driven by the rise of unstructed data, streaming, cloud storage, etc. In a data lake, data is NOT ETLd, rather, it is stored in its ‘raw’ (“natural”) form [even incomplete, untransformed…].