Microsoft Dot Net: January 2012

(If you open this page in google chrome, there is a translation tool that shows up on the left above the article. Please use that to translate this to any language in the world. )

Normalizing a database means laying out the tables in an efficient way with the main focus being on preserving the integrity of data. If the database is normalized, there is no repetition of data.

The need for Normalization

Consider the image above. If you notice, you could have a different Email associated with the same userID in both the tables. There is no restriction in the database to prevent this inconsistency from happening. The database totally relies on the UI application (that enters data into the database) to prevent this from happening. If the database was normalized it would never accept inconsistent data.

Forms of Normalization

There are 7 forms of Normalization. The seven forms of Normalization are

First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Domain/Key Normal Form

Many database designers normalize OLTP databases only up to the 3rd Normal form (and reporting OLAP databases up to 2nd Normal Form). The 3rd Normal form reduces redundancy to a reasonable level without complicating the design too much. In this article we will be proceeding only up to the third normal form.

Example of Database Normalization

The Denormalized table

Consider the table below. This database stores data for products purchased by people online. This database also stores their employer information. In this example we will assume, that a person can only have one employer.

SSN	UserName	Product1	Product2	MoreProducts	EmployerName	EmployerAddress
332345432	Amy	M			Google	1 California drive
666666666	Kevin	A	B	C,D	Facebook	22nd Street Sanfrancisco
919919919	Raj	D			Google	1 California drive

In the tables below the columns that make up the primary key are denoted in yellow headings. If you are unfamiliar with SQL keys, please check out this article
http://dotnetanalysis.blogspot.com/2012/01/sql-key-super-key-candidate-key-primary.html

First Normal Form:

You can only have one value in a column
You should not create multiple columns for a one to many relationship.

(There are more rules to the first normal form, such as two rows cannot be identical etc, which are automatically enforced by the SQL server. Hence I won't be listing them here.)

The column MoreProducts in the denormalized table is in conflict with the first rule. Which is eliminated in the First Normal Form.

The product1, product2, MoreProducts columns in the denomalized table is in conflict with the second rule of first normal form. Hence instead of having multiple columns for products, we put it into multiple rows.

SSN	UserName	EmployerName	EmployerAddress	Product
332345432	Amy	Google	1 California drive	M
666666666	Kevin	Facebook	22nd Street Sanfrancisco	A
666666666	Kevin	Facebook	22nd Street Sanfrancisco	B
666666666	Kevin	Facebook	22nd Street Sanfrancisco	C
666666666	Kevin	Facebook	22nd Street Sanfrancisco	D
919919919	Raj	Google	1 California drive	D

In the above table, following the rules of the primary key, SSN and Product together have been chosen as the primary key.

Second Normal Form:

The table is in First Normal Form
All the non primary key columns in the table should depend on the entire primary key.

"The following explanations make this more specific:

If the table has a one-column primary key, the attribute must depend on that key.
If the table has a composite primary key, the attribute must depend on the values in all its columns taken as a whole, not on one or some of them.
If the attribute also depends on other columns, they must be columns of a candidate key; that is, columns that are unique in every row."

Source: http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.ddi.doc/ddi54.htm

In the First Normal Form table above, the SSN+Product columns together form the primary key. The UserName column does not depend on the entire primary key. It only depends on a part of the primary key (ie SSN). The UserName column is clearly in violation of the second rule.

Similarly the reader can also notice that the EmployerName and EmployerAddress columns are also in violation of the second rule.

These violations are fixed in the Second Normal Form shown below.

SSN	UserName
332345432	Amy
666666666	Kevin
919919919	Raj

SSN	EmployerName	EmployerAddress
332345432	Google	1 California drive
666666666	Facebook	22nd Street Sanfrancisco
919919919	Google	1 California drive

SSN	Product
332345432	M
666666666	A
666666666	B
666666666	C
666666666	D
919919919	D

In the Second Normal Form, every column is dependent on the entire primary key in that table and not part of the primary key (as was the case in the first normal form).

Third Normal Form

Database is in Second Normal form
There are no transitive dependencies (That is every non primary key is dependent directly on the primary key. )

In the Second Normal Form the EmployerAddress column depends on EmployerName which in turn depends on SSN. That is a transitive (indirect) dependency, which needs to be removed in the third normal form.

SSN	UserName
332345432	Amy
666666666	Kevin
919919919	Raj

SSN	EmployerName
332345432	Google
666666666	Facebook
919919919	Google

EmployerName	EmployerAddress
Google	1 California drive
Facebook	22nd Street Sanfrancisco

SSN	Product
332345432	M
666666666	A
666666666	B
666666666	C
666666666	D
919919919	D

Third Normal form in the real world applications

What is explained above is how to lay out a normalized database with natural keys (keys that make sense to the business) . For most practical applications, once this is done, a surrogate key is introduced which replaces the natural key as the primary key. In which case the database would look like this.

UserId	SSN	UserName
1	332345432	Amy
2	666666666	Kevin
3	919919919	Raj

UserEmployerId	UserId	EmployerId
1	1	1
2	2	2
3	3	1

EmployerId	EmployerName	EmployerAddress
1	Google	1 California drive
2	Facebook	22nd Street Sanfrancisco

UserProductId	UserId	Product
1	1	M
2	2	A
3	2	B
4	2	C
5	2	D
6	3	D

Advantages of using surrogate keys:

Suppose Martha Haley's name changes to to Martha Griffin after marriage. If last name was a part of the primary key, it would be hard to establish that Martha Haley and Martha Griffin are the same person. With surrogate key, this won't happen.
If one of the columns forming natural keys needs to be updated, the foreign keys can still remain unchanged. (Suppose first name and last name together form a natural key. Even if the person's last name changes, if we are using a surrogate key, the foreign key that connects this table to other tables remains unchanged).
With surrogate keys joins are faster.
Clustered indexes on surrogate keys (which are usually ints) are smaller and faster than the clustered indexes on natural keys with multiple columns.
Popular ORM tools like nhibernate, are more compatible with surrogate keys.

However note that when a surrogate key is introduced, a unique key constraint should be set on the natural key, to preserve the integrity of the data.

The drawback of a normalized database is that data access is slower. As you can see, if I need to find the address of Amy's employer in the 3rd Normal Form, I would have to join 3 tables. That is why reporting databases should never be normalized beyond 2nd Normal form.

Microsoft Dot Net

Translate

Sunday, January 29, 2012

Database Normalization Tutorial with example