Normalizing your data model

A positive thing about Qlikview is that it is oriented to technical and business users. However, having inexperienced people creating data models will, with no doubt, create headaches in the near future.

I thought it would be useful to share with you how a good data model should look like, by normalizing the tables or data we have.

Normalization is the processes of organizing data in a database. When we say “organizing” we mean, creating proper table structures in a way that we can then stablish relationships among them. One of the main reasons for this is to avoid data redundancy.

In this article we are going to cover only the first three normal forms.

Let’s start with a real example that you can easily find in any of your company spreadsheets.

1. Source Document

1. Source Document

In this example, a sales person can have 2 managers, and they report sales from up to 3 countries for one of the  managers and one country for the other manager. The Cons of this approach:

– Not easily escalable
– Repeating the sames type of data in different columns.

FIRST NORMAL FORM
The first step to normalize a table or dataset is to eliminate redundant columns. In our example we have 2 columns for the managers and 4 columns to display sales by country by managers.
We want to eliminate Manager1, Manager2, Country1-Manager1, Country2-Manager1, Country3-Manager1, Country1-Manager2, and replace them with 2 columns. Manager and Country.

2. First Normal form

2. First Normal form

At this stage we can also start thinking about primary keys. A primary key is an identifier that can uniquely identify a record in a table. Each table will have a primary key.
In our example, SalesPerson is a primary key. This is because we can identify one person by it’s name (altohugh we will use IDs in real life) and with its name we can have only one record with one phone number and one email address.

SECOND NORMAL FORM
In this step we want to eliminate redundant data. We can eliminate redundant data by identifying fields with repeated values.

Our model will now look like this:

3. Second Normal form

3. Second Normal form

THIRD NORMAL FORM
In this step, we eliminate everything that is not dependant on the primary key.
Rember that a primary key is a unique identifier that can uniquely identify a row.
Our primary keys are:

SalesPerson: by name or ID we can identify a sales person and their contact details
Managers: by name or ID we can identify a manager. One person can have more than one manager and one manager can manage more than one sales person, so we will have 2 keys (Sales person and Manager)
Country: same as managers. Two keys to identify which country a sales person is responsible for.

Our final third normal form tables would look like this:

Thrid Normal form

Thrid Normal form

Note that Manager and Country have a many to many relationship. In order to identify which sales person reports to a specific manager, we need both fields, SalesPerson and Manager.

Finally, our Qlikview model would look like this:

5. Normalized Data Model

5. Normalized Data Model

JV

Advertisements

4 responses to “Normalizing your data model”

  1. suraj says :

    thank you so much 🙂

  2. Raj says :

    Very good post JV..

    But I do need some your valuable inputs regarding below statements from your BI experience.

    That having a star schema is what Qliktech is recommending
    and the data model in third normalization form can form too many tables – so much of maintenance and CPU usage is very high as we are having many tables.

    QlikView just duplicates the data and only unique values are responsible for RAM/CPU consumption…should we really worry about data redundancy is still a issue with QlikView.

    Based out of your experience, if possible can you brief out how QlikView handles a star schema and snow flake schema. Does it really matter ?

    Thanks
    Raj

    • biexperience says :

      Hello Raj, thanks for your comment.
      As a matter of fact you are right. A normalized model will, indeed, tend to have more tables. However, this is the way it should be.

      A star schema is a denormalized model. An Entity Relationship model and a Star Schema are different things. I consider I would first show how to create a normalized model so I can then explain how to create a Star and Snowflake model.

      Most of the problems seen in Qlikview are due to an incorrect design of the data model.

      So to clarify, any model will work, but I will usually use a Star Schema, based on a normalized schema containing everything, while the star schema will contain specific type of data and will be created it with a purpose (Sales, Finanance, HR, etc).

      Regarding your RAM/CPU consumption, you should ALWAYS be concern about data redundancy. Even if Qlikview works with pointers, the data is still there, and the pointers too. Think of it as the following:

      You have a table with 30M records in which you have Country of Birth and Country of Residence. You have 2 columns with similar values. Even if we say we use the same values, you still have 30M records with pointers. This is a lot, and you will for sure see the difference if you remove one of this columns.

      Best
      JV

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: