Pythian Blog: Technical Track

Implementing Many-to-many Relationships in Data Warehousing

This article will discuss how to make many-to-many relationships in data warehousing easily queried by novice SQL users using point-and-click query tools.

This is a big problem with Oracle Discoverer-like tools where the metadata layer is basically a set of pre-joined tables from which the user simply clicks on columns and hits the run button. You can create custom complex queries that they can run, but then every query is custom, which defeats the purpose of the tool in the first place.

The design goal is to create a structure that is simple for the end user and which normally translates to something as flat as possible. This article will go through the different methods of implementing many-to-many relationships, and look at their effect on query complexity, especially for someone who use a tool that hides the SQL.

The typical data warehouse data model is a fact table surrounded by many dimension tables.

In the world of data warehousing, many-to-many relationships are sometimes unavoidable, and we all know that you do not want to normalize your fact tables. One example would something like sports interests or job categories that you’re qualified to work for. The model for that in a normalized OLTP environment would be something like:

The data warehouse world does not function with normalized fact tables since you need to start doing lots of select count(distinct) resulting in facts columns not being additive. I won’t get into that discussion here, but search for articles or literature by Ralph Kimball, the guru of data warehouse design.

Fortunately, there are a few ways of implementing many-to-many relationships for data warehousing. Typically, most DBAs pick either the Boolean Column, the Multiple Column, or the Bridge Table method.

The boolean column method consists of creating a column for each possible value, like so:

Each boolean column can be a numeric (1 or 0), or character (Y or N).

The multiple column method consists of having columns for the number match the number of choices one can make. This has its limitations since it is tightly coupled to the application; but is easily transformed. The typical data model would look like:

In this case, you could only have five sports even though you could have several dozen choices.

The last method is the Bridge Table method, which you may find in data warehouse literature. It is basically a many-to-many relationship attached to a non-normalized fact table. It would typically look like:

Note that the weighing factor is there mainly to distribute the dollar amounts evenly across the categories, where the sum of the parts cannot be greater than the total — for example, if one user has interests in Football and Baseball and spends $10, and the other has interests in Football and Soccer and spends $5. A revenue by breakdown cannot be…

Football: $15
Baseball: $10
Soccer:$5
Total: $30

…when in reality I only made $15. The weighing factor here would be equal to the number of interests for that user (i.e. it would be 2 for each user in this instance).

The result would then be:
Football: $7.5
Baseball: $5
Soccer: $2.5
Total: $15

Back to the main topic. Now that we have our methods, let’s see how easily we can write queries against them. I will have 4 queries for each method:

  1. Breakdown count by sport
  2. an “or” condition: count of basketball or football
  3. an “and” condition: count of basketball and football
  4. complex boolean condition: count of (basketball and football) or (soccer and hockey)

The Boolean Method

Breakdown count by sport:

select 'football' sports, count(*) total from fact f, sports_interest_dm s 
    where f.sports_dm_id = s.id and football_ind = 1
union
select 'basketball' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and basketball_ind = 1
union
select 'hockey' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and hockey_ind = 1
union
select 'baseball' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and baseball_ind = 1
union
select 'soccer' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and soccer_ind = 1
;

You can also do it with decodes and have your query tool transpose the results.

select sum(decode(football_ind,1,1,0)) football,
       sum(decode(basketball_ind,1,1,0)) basketball,
       sum(decode(hockey_ind,1,1,0)) hockey,
       sum(decode(baseball_ind,1,1,0)) baseball,
       sum(decode(soccer_ind,1,1,0)) soccer
from fact f, sports_interest_dm s where f.sports_dm_id = s.id;

An “or” condition: count of basketball or football:

select count(*) from fact f, sports_interest_dm s where f.sports_dm_id = s.id and (football_ind = 1 or baseball_ind =1);

An “and” condition: count of basketball and football:

select count(*) from fact f, sports_interest_dm s where f.sports_dm_id = s.id and (football_ind = 1 and baseball_ind =1);

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f, sports_interest_dm s
where f.sports_dm_id = s.id
and ((football_ind = 1 and baseball_ind =1) or (soccer_ind=1 and hockey_ind=1));

The Multiple Column Method

Breakdown count by sport query:

select sport, sum(total) from (
    select sports_1 sport, count(*) total from sports_dm s, fact f 
        where s.id= f.sports_dm_id group by sports_1
    union all
    select sports_2 sport, count(*) total from sports_dm s, fact f 
        where s.id= f.sports_dm_id group by sports_2
    union all
    select sports_3 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_3
    union all
    select sports_4 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_4
    union all
    select sports_5 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_5
)
group by sport;

An “or” condition: count of basketball or football:

select count(*) from sports_dm s, fact f where s.id= f.sports_dm_id
and ( sports_1 in ('football','basketball')
    or sports_2 in ('football','basketball')
    or sports_3 in ('football','basketball')
    or sports_4 in ('football','basketball')
    or sports_5 in ('football','basketball')
) ;

An “and” condition: count of basketball or football:

select count(*) from fact f
where sports_dm_id in  ( select id from sports_dm
        where sports_1 in ('football')
        or sports_2 in ('football')
        or sports_3 in ('football')
        or sports_4 in ('football')
        or sports_5 in ('football')
    intersect
    select id from sports_dm
        where sports_1 in ('basketball')
        or sports_2 in ('basketball')
        or sports_3 in ('basketball')
        or sports_4 in ('basketball')
        or sports_5 in ('basketball')
) ;

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f
where sports_dm_id in  ( (select id from sports_dm
        where sports_1 in ('football')
        or sports_2 in ('football')
        or sports_3 in ('football')
        or sports_4 in ('football')
        or sports_5 in ('football')
    intersect
    select id from sports_dm
        where sports_1 in ('basketball')
        or sports_2 in ('basketball')
        or sports_3 in ('basketball')
        or sports_4 in ('basketball')
        or sports_5 in ('basketball')
        )
    union
    (select id from sports_dm
        where sports_1 in ('soccer')
        or sports_2 in ('soccer')
        or sports_3 in ('soccer')
        or sports_4 in ('soccer')
        or sports_5 in ('soccer')
    intersect
    select id from sports_dm
        where sports_1 in ('hockey')
        or sports_2 in ('hockey')
        or sports_3 in ('hockey')
        or sports_4 in ('hockey')
        or sports_5 in ('hockey')
        )
) ;

Just imagine a few dimensions like that, and you would end up exceeding the SQL query text limit very quickly.

The Bridge Method

Breakdown count by sport query:

select s.description sports, count(*) from fact f, sports_ref s, sports_bridge_details sbd
    where f.sports_bridge_id(+) = sbd.sports_bridge_id
    and  sbd.sports_id(+) = s.id
group by s.description;

An “or” condition: count of basketball or football:

select  count(*) from fact f, sports_bridge sb
    where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
    where sbd.sports_id = s.id
    and s.description in ('basketball','football'));

An “and” condition: count of basketball or football:

select  count(*) from fact f, sports_bridge sb
    where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
    where sbd.sports_id = s.id
    and s.description in ('basketball','football')
    group by sbd.sports_id having count(*) >1);

This works, but you could also have an intersect in the subquery.

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select  count(*) from fact f, sports_bridge sb
where f.sports_bridge_id = sb.id
and sb.id in (  select sbd.sports_id from sports_bridge_details sbd, sports_ref s
        where sbd.sports_id = s.id
        and s.description in ('basketball','football')
        group by sbd.sports_id having count(*) >1
    union
    select sbd.sports_id from sports_bridge_details sbd, sports_ref s
        where sbd.sports_id = s.id
        and s.description in ('soccer','hockey')
        group by sbd.sports_id having count(*) >1);

Again, intersects can be substituted for group by having count(*) >1.

The following table summarizes query complexity vs. method for ease of query writing in an ad hoc query tool such as Oracle Discoverer.

Boolean method Multiple column method Bridge table method
Breakdown count by sport query Complex Complex Easy
An “or” condition: count of basketball or football Easy Complex Easy
An “and” condition: count of basketball or football Easy Complex Medium
Complex boolean condition: count of (basketball and football) or (soccer and hockey) Easy Complex Complex

As you can see, the Boolean Method is far superior in every category except the breakdown, and the Bridge is a close second. You have to use the Bridge method if the number of potential values in the dimension exceeds 100 columns or so. You can do it, but it looks ugly. The best of both worlds would be to merge the Boolean and Bridge column methods. Its fairly straight–forward: add the boolean columns to the bridge table as follows:

I hope this was informative, and I look forward to your feedback.

No Comments Yet

Let us know what you think

Subscribe by email