(If you open this page in google chrome, there is a translation tool that shows up on the left above the article. Please use that to translate this to any language in the world. )
Normalizing a database means laying out the tables in an efficient way with the main focus being on preserving the integrity of data. If the database is normalized, there is no repetition of data.
The need for Normalization
Consider the image above. If you notice, you could have a different Email associated with the same userID in both the tables. There is no restriction in the database to prevent this inconsistency from happening. The database totally relies on the UI application (that enters data into the database) to prevent this from happening. If the database was normalized it would never accept inconsistent data.
Forms of Normalization
There are 7 forms of Normalization. The seven forms of Normalization are
Many database designers normalize OLTP databases only up to the 3rd Normal form (and reporting OLAP databases up to 2nd Normal Form). The 3rd Normal form reduces redundancy to a reasonable level without complicating the design too much. In this article we will be proceeding only up to the third normal form.
Example of Database Normalization
The Denormalized table
Consider the table below. This database stores data for products purchased by people online. This database also stores their employer information. In this example we will assume, that a person can only have one employer.
First Normal Form:
(There are more rules to the first normal form, such as two rows cannot be identical etc, which are automatically enforced by the SQL server. Hence I won't be listing them here.)
Second Normal Form:
In the First Normal Form table above, the SSN+Product columns together form the primary key. The UserName column does not depend on the entire primary key. It only depends on a part of the primary key (ie SSN). The UserName column is clearly in violation of the second rule.
Similarly the reader can also notice that the EmployerName and EmployerAddress columns are also in violation of the second rule.
These violations are fixed in the Second Normal Form shown below.
In the Second Normal Form, every column is dependent on the entire primary key in that table and not part of the primary key (as was the case in the first normal form).
Third Normal Form
In the Second Normal Form the EmployerAddress column depends on EmployerName which in turn depends on SSN. That is a transitive (indirect) dependency, which needs to be removed in the third normal form.
Third Normal form in the real world applications
What is explained above is how to lay out a normalized database with natural keys (keys that make sense to the business) . For most practical applications, once this is done, a surrogate key is introduced which replaces the natural key as the primary key. In which case the database would look like this.
The drawback of a normalized database is that data access is slower. As you can see, if I need to find the address of Amy's employer in the 3rd Normal Form, I would have to join 3 tables. That is why reporting databases should never be normalized beyond 2nd Normal form.
Normalizing a database means laying out the tables in an efficient way with the main focus being on preserving the integrity of data. If the database is normalized, there is no repetition of data.
The need for Normalization
Consider the image above. If you notice, you could have a different Email associated with the same userID in both the tables. There is no restriction in the database to prevent this inconsistency from happening. The database totally relies on the UI application (that enters data into the database) to prevent this from happening. If the database was normalized it would never accept inconsistent data.
Forms of Normalization
There are 7 forms of Normalization. The seven forms of Normalization are
- First Normal Form
- Second Normal Form
- Third Normal Form
- Boyce-Codd Normal Form
- Fourth Normal Form
- Fifth Normal Form
- Domain/Key Normal Form
Many database designers normalize OLTP databases only up to the 3rd Normal form (and reporting OLAP databases up to 2nd Normal Form). The 3rd Normal form reduces redundancy to a reasonable level without complicating the design too much. In this article we will be proceeding only up to the third normal form.
Example of Database Normalization
The Denormalized table
Consider the table below. This database stores data for products purchased by people online. This database also stores their employer information. In this example we will assume, that a person can only have one employer.
SSN
|
UserName
|
Product1
|
Product2
|
MoreProducts
|
EmployerName
|
EmployerAddress
|
332345432
|
Amy
|
M
|
Google
|
1 California drive
|
||
666666666
|
Kevin
|
A
|
B
|
C,D
|
Facebook
|
22nd Street Sanfrancisco
|
919919919
|
Raj
|
D
|
Google
|
1 California drive
|
In the tables below the columns that make up the primary key are denoted in yellow headings. If you are unfamiliar with SQL keys, please check out this article
http://dotnetanalysis.blogspot.com/2012/01/sql-key-super-key-candidate-key-primary.html
http://dotnetanalysis.blogspot.com/2012/01/sql-key-super-key-candidate-key-primary.html
First Normal Form:
- You can only have one value in a column
- You should not create multiple columns for a one to many relationship.
(There are more rules to the first normal form, such as two rows cannot be identical etc, which are automatically enforced by the SQL server. Hence I won't be listing them here.)
The column MoreProducts in the denormalized table is in conflict with the first rule. Which is eliminated in the First Normal Form.
The product1, product2, MoreProducts columns in the denomalized table is in conflict with the second rule of first normal form. Hence instead of having multiple columns for products, we put it into multiple rows.
The product1, product2, MoreProducts columns in the denomalized table is in conflict with the second rule of first normal form. Hence instead of having multiple columns for products, we put it into multiple rows.
SSN
|
UserName
|
EmployerName
|
EmployerAddress
|
Product
|
332345432
|
Amy
|
Google
|
1 California drive
|
M
|
666666666
|
Kevin
|
Facebook
|
22nd Street Sanfrancisco
|
A
|
666666666
|
Kevin
|
Facebook
|
22nd Street Sanfrancisco
|
B
|
666666666
|
Kevin
|
Facebook
|
22nd Street Sanfrancisco
|
C
|
666666666
|
Kevin
|
Facebook
|
22nd Street Sanfrancisco
|
D
|
919919919
|
Raj
|
Google
|
1 California drive
|
D
|
In the above table, following the rules of the primary key, SSN and Product together have been chosen as the primary key.
Second Normal Form:
- The table is in First Normal Form
- All the non primary key columns in the table should depend on the entire primary key.
"The following explanations make this more specific:
- If the table has a one-column primary key, the attribute must depend on that key.
- If the table has a composite primary key, the attribute must depend on the values in all its columns taken as a whole, not on one or some of them.
- If the attribute also depends on other columns, they must be columns of a candidate key; that is, columns that are unique in every row."
In the First Normal Form table above, the SSN+Product columns together form the primary key. The UserName column does not depend on the entire primary key. It only depends on a part of the primary key (ie SSN). The UserName column is clearly in violation of the second rule.
Similarly the reader can also notice that the EmployerName and EmployerAddress columns are also in violation of the second rule.
These violations are fixed in the Second Normal Form shown below.
SSN
|
UserName
|
332345432
|
Amy
|
666666666
|
Kevin
|
919919919
|
Raj
|
SSN
|
EmployerName
|
EmployerAddress
|
332345432
|
Google
|
1 California drive
|
666666666
|
Facebook
|
22nd Street Sanfrancisco
|
919919919
|
Google
|
1 California drive
|
SSN
|
Product
|
332345432
|
M
|
666666666
|
A
|
666666666
|
B
|
666666666
|
C
|
666666666
|
D
|
919919919
|
D
|
In the Second Normal Form, every column is dependent on the entire primary key in that table and not part of the primary key (as was the case in the first normal form).
- Database is in Second Normal form
- There are no transitive dependencies (That is every non primary key is dependent directly on the primary key. )
In the Second Normal Form the EmployerAddress column depends on EmployerName which in turn depends on SSN. That is a transitive (indirect) dependency, which needs to be removed in the third normal form.
SSN
|
UserName
|
332345432
|
Amy
|
666666666
|
Kevin
|
919919919
|
Raj
|
SSN
|
EmployerName
|
332345432
|
Google
|
666666666
|
Facebook
|
919919919
|
Google
|
EmployerName
|
EmployerAddress
|
Google
|
1 California drive
|
Facebook
|
22nd Street Sanfrancisco
|
SSN
|
Product
|
332345432
|
M
|
666666666
|
A
|
666666666
|
B
|
666666666
|
C
|
666666666
|
D
|
919919919
|
D
|
Third Normal form in the real world applications
What is explained above is how to lay out a normalized database with natural keys (keys that make sense to the business) . For most practical applications, once this is done, a surrogate key is introduced which replaces the natural key as the primary key. In which case the database would look like this.
UserId
|
SSN
|
UserName
|
1
|
332345432
|
Amy
|
2
|
666666666
|
Kevin
|
3
|
919919919
|
Raj
|
UserEmployerId
|
UserId
|
EmployerId
|
1
|
1
|
1
|
2
|
2
|
2
|
3
|
3
|
1
|
EmployerId
|
EmployerName
|
EmployerAddress
|
1
|
Google
|
1 California drive
|
2
|
Facebook
|
22nd Street Sanfrancisco
|
UserProductId
|
UserId
|
Product
|
1
|
1
|
M
|
2
|
2
|
A
|
3
|
2
|
B
|
4
|
2
|
C
|
5
|
2
|
D
|
6
|
3
|
D
|
Advantages of using surrogate keys:
- Suppose Martha Haley's name changes to to Martha Griffin after marriage. If last name was a part of the primary key, it would be hard to establish that Martha Haley and Martha Griffin are the same person. With surrogate key, this won't happen.
- If one of the columns forming natural keys needs to be updated, the foreign keys can still remain unchanged. (Suppose first name and last name together form a natural key. Even if the person's last name changes, if we are using a surrogate key, the foreign key that connects this table to other tables remains unchanged).
- With surrogate keys joins are faster.
- Clustered indexes on surrogate keys (which are usually ints) are smaller and faster than the clustered indexes on natural keys with multiple columns.
- Popular ORM tools like nhibernate, are more compatible with surrogate keys.
this one is the best explanation of normalization..thank you.
ReplyDeleteperfect explanation :)
ReplyDeletethis is awesome.....
ReplyDeleteClear Explanation. It is really helpful for me. Thanks
ReplyDeleteVery good explanation. Do you write blogs about Dot Net?
ReplyDeletenice explanation...thanks
ReplyDeletenice work bro!!!
ReplyDeleteThank you...that's was a wonderful explanation...you gave a clear picture of it with your apt examples...even a layman can understand it very quickly..Can you also please throw some light on the remaining normal forms...
ReplyDeleteHi Srikanth, I am glad you liked it :) . The aim of the article is to keep it simple so that most people can understand. Going beyond 3rd Normal form will complicate it besides its rarely ever used.
DeleteClearest explanation on database normalization! Thanks!
ReplyDeleteTutorial is amazing. keep it up dude
ReplyDeleteExcellent article.
ReplyDeleteFantastic explanation! Flawless with your delivery on the topic of Normalization. My 3 year old son could probably understand this! As the great Albert Einstein said "If you can't explain it simply, you don't understand it well enough." I can see that you have a fundamental understanding of normalization concepts from reading this article.
ReplyDeleteThe author's inclusion of Surrogate key is a GEM in this read.
ReplyDeleteNice explained
ReplyDeleteThanks