[Notes] CSCI 585 Database Business Intelligence and Data Warehouses
25 Feb 2020 | CSCI585, English, Note. Credit to: Prof. Saty Raghavachary, CSCI 585, Spring 2020
Chapter 13
outline
- How business intelligence provides a comprehensive business decision support framework
- About business intelligence architecture, its evolution, and reporting styles
- About the relationship and differences between operational data and decision support data
- What a data warehouse is and how to prepare data for one
- What star schemas are and how they are constructed
- About data analytics, data mining, and predictive analytics
- About online analytical processing (OLAP)
- How SQL extensions are used to support OLAP-type data manipulations
Business Intelligence (BI)
- Comprehensive, cohesive, integrated set of tools and processes
- Captures, collects, integrates, stores, and analyzes data
- Purpose - Generate and present information to support business decision making
- Allows a business to transform:
- Data into information
- Information into knowledge
- Knowledge into wisdom
Business Intelligence Tools
Business Intelligence Benefits
- Improved decision making
- Integrating architecture
- Common user interface for data reporting and analysis
- Common data repository fosters single version of company data
- Improved organizational performance
Decision Support Data (warehouse data)
- Effectiveness of BI depends on quality of data gathered at operational level
- Operational data
- Seldom well-suited for decision support tasks
- Stored in relational database with highly normalized structures
- Optimized to support transactions representing daily operations
- Differ from operational data in:
- Time span
- Granularity
- Drill down: decomposing data to a lower level of aggregation (finer detail)
- Roll up: aggregating data to a higher level (coarser summary)
- Dimensionality
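The granularity ideas above can be made concrete with a small sketch. The record layout and field names here (`month`, `store`, `product`, `amount`) are hypothetical, chosen just to show how rolling up drops a dimension and re-aggregates:

```python
from collections import defaultdict

# Hypothetical sales records at a fine granularity (month, store, product).
sales = [
    {"month": "Jan", "store": "S1", "product": "P1", "amount": 100},
    {"month": "Jan", "store": "S2", "product": "P1", "amount": 150},
    {"month": "Feb", "store": "S1", "product": "P2", "amount": 200},
]

def roll_up(rows, keys):
    """Aggregate amounts over only the listed dimensions (a coarser view)."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r["amount"]
    return dict(totals)

# Detail level: one total per (month, store) pair.
by_month_store = roll_up(sales, ["month", "store"])
# Rolled up to month alone: {('Jan',): 250, ('Feb',): 200}
by_month = roll_up(sales, ["month"])
```

Drilling down is simply the reverse direction: moving from `by_month` back toward the per-store (or per-day) detail rows.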
Decision Support Database Requirements
- Database schema
- Must support complex, non-normalized data representations
- Data must be aggregated and summarized
- Queries must be able to extract multidimensional time slices
- Data extraction and loading
- Allow batch and scheduled data extraction
- Support different data sources and check for inconsistent data or data validation rules
- Support advanced integration, aggregation, and classification
- Database size should support:
- Very large databases (VLDBs)
- Advanced storage technologies
- Multiple-processor technologies
Data Marts
- Small, single-subject data warehouse subset
- Provide decision support to a small group of people
- Benefits over data warehouses
- Lower cost and shorter implementation time
- Technologically advanced
- Inevitable people issues
Rule No. | Description |
---|---|
1 | The data warehouse and operational environments are separated |
2 | The data warehouse data are integrated |
3 | The data warehouse contains historical data over a long time horizon |
4 | The data warehouse data are snapshot data captured at a given point in time |
5 | The data warehouse data are subject oriented |
6 | The data warehouse data are mainly read-only with periodic batch updates from operational data. No online updates are allowed |
7 | The data warehouse development life cycle differs from classical systems development. Data warehouse development is data-driven; the classical approach is process-driven |
8 | The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data |
9 | The data warehouse environment is characterized by read-only transactions against very large data sets. The operational environment is characterized by numerous update transactions to a few data entities at a time |
10 | The data warehouse environment has a system that traces data sources, transformations, and storage |
11 | The data warehouse's metadata are a critical component of this environment. The metadata identify and define all data elements, and provide the source, transformation, integration, storage, usage, relationship, and history of each data element |
12 | The data warehouse contains a chargeback mechanism for resource usage that enforces optimal use of the data by end users |
Star Schema
- Data-modeling technique
- Maps multidimensional decision support data into a relational database
- Creates the near equivalent of multidimensional database schema from existing relational database
- Yields an easily implemented model for multidimensional data analysis
Components of Star Schemas
- Facts
- Numeric values that represent a specific business aspect
- Dimensions
- Qualifying characteristics that provide additional perspectives to a given fact
- Attributes
- Used to search, filter, and classify facts
- Slice and dice: Ability to focus on slices of the data cube for more detailed analysis
- Attribute hierarchy
- The attribute hierarchy provides the capability to perform drill-down and roll-up searches in a data warehouse.
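Slice-and-dice can be sketched over a tiny cube. The dimensions and cell values below (`region`, `product`, `quarter`) are invented for illustration; a slice fixes one dimension, while a dice restricts several at once:

```python
# A tiny 3-D data cube: amounts indexed by (region, product, quarter).
cube = {
    ("East", "P1", "Q1"): 10, ("East", "P1", "Q2"): 12,
    ("East", "P2", "Q1"): 7,  ("West", "P1", "Q1"): 9,
    ("West", "P2", "Q2"): 11,
}

def slice_cube(cube, axis, value):
    """A slice fixes one dimension (axis 0=region, 1=product, 2=quarter)."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice_cube(cube, regions, products, quarters):
    """A dice restricts several dimensions to subsets at once."""
    return {(r, p, q): v for (r, p, q), v in cube.items()
            if r in regions and p in products and q in quarters}

east_slice = slice_cube(cube, 0, "East")                 # all East cells
sub_cube = dice_cube(cube, {"East"}, {"P1"}, {"Q1", "Q2"})
```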
Attribute hierarchies
Star Schema Representation
- Facts and dimensions represented by physical tables in data warehouse database
- Many-to-one (M:1) relationship between fact table and each dimension table
- Fact and dimension tables
- Related by foreign keys
- Subject to primary and foreign key constraints
- Primary key of a fact table
- Is a composite primary key because the fact table is related to many dimension tables
- Always formed by combining the foreign keys pointing to the related dimension tables

We have the fact table (of transactions) at the center, and denormalized (all-in-one) dimension tables all around.
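A minimal star schema can be written down directly in SQL. This sketch uses SQLite via Python's `sqlite3`; the table and column names (`fact_sales`, `dim_time`, etc.) are hypothetical, but the structure follows the rules above: one fact table whose composite primary key is formed from the foreign keys into the dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized dimension tables surround the central fact table.
cur.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INT, quarter INT, month INT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The fact table's primary key is the composite of its dimension foreign keys,
-- giving the M:1 relationship from fact to each dimension.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue    REAL,
    PRIMARY KEY (time_id, product_id, store_id)
);
""")
```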
Note: “dimensions are qualifying characteristics that provide additional perspectives to a given fact; dimensions provide descriptive characteristics about the facts through their attributes.”
Slicing and dicing the cube provides specific insights.
Techniques Used to Optimize Data Warehouse Design
- Normalizing dimensional tables
- Snowflake schema: Dimension tables can have their own dimension tables
- Maintaining multiple fact tables to represent different aggregation levels
- Denormalizing fact tables
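Snowflaking, the first technique above, can be sketched as one more level of normalization on a dimension. Continuing the hypothetical SQLite schema, the repeated category text is pulled out of the product dimension into its own dimension table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake schema: the product dimension now has its own dimension table,
-- normalizing the repeated category text out of dim_product.
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    brand       TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
""")
```

The trade-off: less redundancy in the dimension, but queries now need an extra join to reach `category_name`.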
Data Analytics
- Encompasses a wide range of mathematical, statistical, and modeling techniques to extract knowledge from data
- Subset of BI functionality
- Classification of tools
- Explanatory analytics: Focuses on discovering and explaining data characteristics and relationships based on existing data
- Predictive analytics: Focuses on predicting future outcomes with a high degree of accuracy
Online Analytical Processing (OLAP)
- Advanced data analysis environment that supports decision making, business modeling, and operations research
- Characteristics
- Multidimensional data analysis techniques
- Advanced database support
- Easy-to-use end-user interfaces
Relational Online Analytical Processing (ROLAP)
- Provides OLAP functionality using relational databases and familiar relational tools to store and analyze multidimensional data
- Extensions added to traditional RDBMS technology
- Multidimensional data schema support within the RDBMS
- Data access language and query performance optimized for multidimensional data
- Support for very large databases (VLDBs)
Multidimensional Online Analytical Processing (MOLAP)
- Extends OLAP functionality to multidimensional database management systems (MDBMSs)
- MDBMS: Uses proprietary techniques to store data in matrix-like n-dimensional arrays
- End users visualize stored data as a 3D data cube
- Grow to n dimensions, becoming hypercubes
- Held in memory in a cube cache to speed access
- Sparsity: Measures the density of the data held in the data cube
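The sparsity measure can be computed directly: compare the number of cells that actually hold data against the number of possible cells in the cube. The dimension values below are invented for illustration:

```python
# Sparsity/density: fraction of the cube's possible cells that hold data.
regions  = ["East", "West"]
products = ["P1", "P2"]
quarters = ["Q1", "Q2"]

# Only 2 of the cube's cells are actually populated.
filled = {("East", "P1", "Q1"): 10, ("West", "P2", "Q2"): 5}

possible_cells = len(regions) * len(products) * len(quarters)  # 8 cells
density = len(filled) / possible_cells                         # 0.25 -> a sparse cube
```

Real-world cubes are typically very sparse (most dimension combinations never occur), which is why MDBMSs use compressed array storage rather than materializing every cell.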
SQL Extensions for OLAP
- The ROLLUP extension
- Used with GROUP BY clause to generate aggregates by different dimensions
- Enables subtotal for each column listed except for the last one, which gets a grand total
- Order of column list important
“Subtotal for each vendor - all products; sum of (the only set of) subtotals”.
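Since SQLite (used for the earlier schema sketches) does not support `GROUP BY ROLLUP`, the semantics can be simulated in plain Python. The vendor/product rows are hypothetical; the point is the output shape of `ROLLUP(vendor, product)`: detail rows, a subtotal per vendor (product set to `None`, playing the role of SQL's NULL marker), and one grand total:

```python
from collections import defaultdict

rows = [  # hypothetical (vendor, product, sales) rows
    ("V1", "A", 10), ("V1", "B", 20), ("V2", "A", 5),
]

def rollup(rows):
    """Mimic GROUP BY ROLLUP(vendor, product): every column listed gets a
    subtotal except the last, which is covered by the grand total."""
    out = defaultdict(int)
    for vendor, product, sales in rows:
        out[(vendor, product)] += sales   # detail level
        out[(vendor, None)]    += sales   # subtotal per vendor, all products
        out[(None, None)]      += sales   # grand total
    return dict(out)

result = rollup(rows)
# result[("V1", None)] == 30 (V1 subtotal), result[(None, None)] == 35 (grand total)
```

Note how column order matters: `ROLLUP(vendor, product)` subtotals by vendor only; swapping the columns would subtotal by product instead.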
- The CUBE extension
- Used with GROUP BY clause to generate aggregates by the listed columns
- Includes the last column
“Subtotal for each month - all products; subtotal for each product - all months; sum of (either set of) subtotals”.
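`CUBE` can be simulated the same way. The difference from `ROLLUP` is that *every* combination of the listed columns gets a subtotal, including the last column. Again the month/product rows are hypothetical:

```python
from collections import defaultdict

rows = [  # hypothetical (month, product, sales) rows
    ("Jan", "A", 10), ("Jan", "B", 20), ("Feb", "A", 5),
]

def cube(rows):
    """Mimic GROUP BY CUBE(month, product): subtotals for every
    combination of the listed columns, plus the grand total."""
    out = defaultdict(int)
    for month, product, sales in rows:
        for key in [(month, product),   # detail level
                    (month, None),      # subtotal per month, all products
                    (None, product),    # subtotal per product, all months
                    (None, None)]:      # grand total
            out[key] += sales
    return dict(out)

result = cube(rows)
# result[("Jan", None)] == 30, result[(None, "A")] == 15, result[(None, None)] == 35
```

With n columns, `CUBE` produces 2^n grouping levels, versus n+1 for `ROLLUP`.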
Data Lakes
A ‘traditional’ data warehouse is an ETL-based, historical record of transactions - very RDB-like.
A ‘modern’ alternative is a ‘data lake’, which offers a more continuous form of analytics, driven by the rise of unstructured data, streaming, cloud storage, etc. In a data lake, data is NOT ETL'd; rather, it is stored in its ‘raw’ (“natural”) form [even incomplete, untransformed…].