Data Warehousing
systems taht record orders, production, billing and so forth. Characterized by short transactions that are executed large number of times.
Having longer term data (historical data) and additional data means that we'll have to deal with possibly far more data than operational databases.
What will be big may not be so much what we STORE, if we aggregate data, but what we PROCESS, because aggregation is a costly operation.
Different Grain of Information -> Aggregate
Aggregates, unfortunately, require by definition processing huge volumes of data, sorts, and computations all along. All operations that require I/ Os and CPU. The workload is completely different from an operational database workload and this is why data warehousing and business intelligence are considered a topic apart in information systems.
Bill Inmon, 1990s
Shdow IT vs. Official IT
Because databases are huge, schemas are complicated and queries difficult to write, a large number of people, especially in higher circles, were mostly working with figures extracted from various sources (including printed reports), and that a lot of strategic data was living outside "official IT"
This shadow IT mostly lives in spreadsheets, shared by email or (today) in a public cloud. Needless to say, data organization is poor, consistency weak, several versions circulate, and security is low.
Inmon basically suggested creating a single repository that would act as a reference and could be managed with professional (IT) standards.
Definition: a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions.
("themes", Read only, Clean, Historical data)
This repository should be distinct from operational databases not only because the schema might be different, but because the query patterns will be very different, with almost no repeated queries (or repeated only once a day, week or month)
- Read only
- we can index happily
- Disposable queries
- we can hard-code happily
Data is periodically uploaded from databases and other operational sources.
Coherent subsets are extracted to create "datamarts".
Reduce the scope, data domain, volume.
Availability requirements will also be low compared to operational databases. Dataware House uploads are scheduled on a regular basis, every night, every week-end, or whatever your needs are.
Don't underestimate "Extract/Transform/Load" (ETL) processes. Extraction mustn't interfere too much with operations. Transformation and loading may be complicated (it has spawned an entire industry of tools)
To outsmart competition
Generating efficient queries for people who cannot spell SQL
An enterprise has one data warehouse, and data marts source their information from the data warehouse.
starting with quickly designed, denormalized datamarts in departments.
Departmental datamarts, the data warehouse is the enterprise-wide collection of datamarts.
Every non key attribute must provide a fact about the key (1NF), the whole key (2NF), and nothing but the key (3NF).
2NF: Some columns store data derived from other columns
Facts, to make it simple, are principally figures.
Dimensions are the axes along which these figures can be aggregated (as we are in a world where we want synthesis).
I'd like to see sales in USD per country, month and per line of product.
This is a typical managerial request. Sales are facts. Everything that follows a "per" is a dimension. Simple.
A facts table is big. Its composite PK is made of FK to dimensions. The more dimensions, the finer the grain (the level of detail you can obtain). Other columns are (usually) numbers.
All facts aren't equally suitable for aggregation.
- Additive facts
Can be summed up for any dimension
- Semi-additive facts
Can be summed up for some dimensions
- Non-additive facts
Cannot be summed up
Date dimensions illustrate perfectly the Kimball approach.
Kimball's reasoning is the following: if I store a date column, that will be difficult to query. If I want to aggregate by date or month, I'll need to apply (complicated and different in all DBMS products) date functions that I'll never be able to index because most date functions are NOT determininistic. The day is likely to be my smallest time unit. What are 20 years, with one row per day? Under 7,000 rows? Very tiny today. Let's have one row per date, and decline each date under every possible form, and index every column.
We'll store date hierarchies that will ease aggregates.
Year, Querter, Month, Week, Day
When you just have one facts table surrounded by dimensions, you have a "star schema".
With additional joins you get a snowflake and Kimball dislikes it.
The problem is that dimensions change over the years.
Take stores, for instance ("per store" could be an interesting sales result): new stores can be opened, other stores closed, a store can move to better premises in the same city.
One way to handle them is to ignore them.
Keep as it was first inserted
Update dimension - lose old record
Add a new record –- Disconnect old facts/new facts
Add old/new/date information to the dimension
In practice most SCDs are type 1 or 2.
All this of course requires due consideration and has created a new job of Guardian of the Data.
- Speeding up aggregates
- Pre-computing aggregates
What was the sum of sales per state and month last year?
We'll return at most 50x12 rows, which is peanuts. We'll scan a small table of states, and a small table of dates. Neither "state" nor "month" will be very selective.
Because of 2NF, cardinality is often low.
For low cardinality columns, advanced products such as Oracle implements "bitmap indexes". Their entries are chunk of bits that cover many rows and contain a bit that says whether the row contains the key or not.
MON 10000001000000100000010000001000000
TUE 01000000100000010000001000000100000
WED 00100000010000001000000100000010000
THU 00010000001000000100000010000001000
FRI 00001000000100000010000001000000100
SAT 00000100000010000001000000100000010
SUN 00000010000001000000100000010000001
The advantage of bitmap indexes is that by applying AND and OR operations to the bitmaps you can quickly identify the rows that satisfy several conditions (Shenzhen and November) and fetch them.
This transformation is just adding redundant conditions that make the choice of the suitable plan irresistible to the optimizer.
Oracle also implements a parameter that enables the "star schema transformation"
These redundant conditions are added for every dimension in the query.
Once you know exactly which rows you want, you can prey on them in the big table.
Several companies have created dedicated hardware that is more specially geared towards the data warehouse type of queries. Some databases, called columnar databases, store data by columns rather than rows to speed up aggregates.
Another, simple way to speed up aggregates is NOT to compute them on the fly but, if we have an idea about the aggregates that will be computed most often, to precompute them every time we upload the database, at least partially.
NOT views
"Materialized views", which are actually tables, are saved query results (very often aggregates) that are prescheduled and automatically refreshed. You can query the materialized view directly, but some optimizers are smart enough to recognize when they can use a pre-computed result in a query on the original table. It's OK on data that isn't "real time data".
View = saved query text
Materialized view = saved query result
You specify in the creation statement the refresh schedule. It can be a full cancel and replace or, sometimes, an incremental refresh that requires a trigger-populated log on the original table(s).
Some products go further and try to compute every possible aggregate possible, detail AND all possible sums. If you don't ask for anything that strays from the predefined path, you get immediate results.
Things become funny when you have MANY dimensions – the number of combinations explodes ...
...

