Google
 
   
Login
Username:

Password:


Lost Password?

Register now!
Search
Main Menu
top books
Polls
What do you think about php-deluxe.net?
Excellent!
Cool
Hmm..not bad
What the hell is this?
encyclopedia
recommendation
compare webbrowser
Freenet DSL
Who's Online
8 user(s) are online (7 user(s) are browsing encyclopedia)

Members: 0
Guests: 8

more...
browser tip
Unix Befehle
manual of unix befehle
recommendation!
Sponsored
partner

Database normalization

Database normalization relates to the characterization of redundancy in a relational database s structure. The key idea is to store links to potentially-duplicated data, rather than allowing copies of that information to proliferate across several tables. Well-normalized databases have a Database design that reflects the true functional dependency between tracked quantities—allowing quick updates to data with little risk of introducing inconsistencies.

There are formal methods for quantifying how normalized a relational database is, and these classifications are called normal forms (or NF). Though Algorithms exist for converting a given database between forms, these may require splitting existing tables into ones that are re-joined each time a query is issued. This can lead to poor performance, so intentional Denormalization is sometimes preferred—especially in systems that do not use the relational model (such as OLAP).

=History=

Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form in his paper A Relational Model of Data for Large Shared Data Banks Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387[http://www.acm.org/classics/nov95/toc.html].

Codd stated:

:There is, in fact, a very simple elimination* procedure which we shall call normalization. Through decomposition nonsimple domains are replaced by domains whose elements are atomic (nondecomposable) values.

:* His term eliminate is misleading, as nothing is lost in normalization.

In his paper, Codd used the term nonsimple domains to describe a heterogeneous data structure, but later researchers would refer to such a structure as an abstract data type.

=Normal Forms=

One can only describe a database as having a normal form if the relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood, but most database designers model the relationships in terms of an idealized schema . (The mathematical supports come back into play in proofs regarding the process of transforming from one form to another.)

, but a non-fully normalized database may be vulnerable to data corruption (referred to as update anomalies). Full normalization, even when not fully implemented in the target technology, is considered a good exercise to help discover all potential internal database consistency problems.

==First normal form==

To understand first normal form (1NF), consider these two examples of things you might know: : What is your favorite color : What food will you not eat A difference between these two questions is that, while you can have only one favorite color, there may be many foods you do not eat.

Data that has a single value such as person s favorite color is inherently in first normal form. Such data can be stored in a single table with a simple key/value combination. Data that has multiple values, however, must be stored differently.

Codd argued that there was one best way to keep multi-valued data such as food a person will not eat. He suggested that the database should contain a separate table for the multi-value data and then store each food as a separate row in that table. Known as first normal form, his approach has been a standard for decades. An example of data in proper first normal form follows:

:TABLE 1 ::Person / Favorite Color ::Bob / blue ::Jane / green

:TABLE 2 ::Person / Foods Not Eaten ::Bob / okra ::Bob / brussel sprouts ::Jane / peas

Database systems when Codd was writing were relatively primitive. Newer databases now support abstract data types and other data-storage methods that usually offer better performance for the management of such data. However, such methods could be considered denormalization, as they often require the hierarchical model, which was abandoned mostly due to its inflexibility, to be reintroduced into the architecture of the system.

It is almost never a good idea to store data like this:

:TABLE 3 ::Person / Favorite Color / Foods Not Eaten 1 / Foods Not Eaten 2 / Foods Not Eaten 3

This schema is not in the 1NF, and does not accurately represent the relationship as it exists in the real world. Even if the database designer believes that users will not need to store more than three foods not eaten, this system makes it difficult to add a fourth or fifth food in the future, thus limiting expandability. There are cases, however, where denormalized structures like this are desirable. For instance:

:TABLE 4 ::Person / Website / AIM / Yahoo! Screenname / ICQ

This schema is not in the 1NF since people can have more than one AIM screenname. Most websites choose to ignore this, however, and only allow a user to register one screenname with their account.

==Second normal form==

Second normal form (2NF) prescribes full functional dependency on the primary key. It most commonly applies to tables that have composite primary keys, where two or more attributes comprise the primary key. It requires that there are no non-trivial functional dependency of a non-key attribute on a part (subset) of a candidate key. A table is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e. not partially dependent on candidate key).

Consider a table named part describing machine parts with the following attributes:

PART_NUMBER (PRIMARY KEY) SUPPLIER_NAME (PRIMARY KEY) PRICE SUPPLIER_ADDRESS

The PART_NUMBER and SUPPLIER_NAME form the composite primary key, because the same part can be supplied by multiple suppliers. In this example, PRICE is correctly placed on the part table, because it is fully dependent on the primary key i.e. different suppliers will charge a different price for the same part.

SUPPLIER_ADDRESS, however, is only dependent on the SUPPLIER_NAME, and therefore this table breaks 2NF. This attribute should be placed on a second table named supplier comprising:

SUPPLIER_NAME (PRIMARY KEY) SUPPLIER_ADDRESS

In order to find if a table is in 2NF, ask whether any of the non-key attributes of the table could be derived from a subset of the composite key, rather than the whole composite key. If the answer is yes, it s not in 2NF. This is solved sometimes by using a correlation file, such as the supplier table above.

Easily understood definition: A unique key. A column of values that uniquely identify each row in each table.

==Third normal form==

Third normal form (3NF) requires that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A table is in 3NF if none of the non-primary key attributes is a fact about any other non-primary key attribute. In summary, all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).

Consider a table that defines a machine part as having the following attributes.

PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME MANUFACTURER_ADDRESS

In this case, the manufacturer address does not belong on this table, because it is a fact about the manufacturer of the part, rather than the part itself. MANUFACTURER_ADDRESS should therefore be moved into a separate table with the attributes:

MANUFACTURER_NAME (PRIMARY KEY) MANUFACTURER_ADDRESS

...and the original table should be redefined as:

PART_NUMBER (PRIMARY KEY) MANUFACTURER_NAME

The problem with a table not being in 3NF is that for every MANUFACTURER_NAME we have to maintain a redundant MANUFACTURER_ADDRESS (i.e. an address for each part_number, rather than one for each MANUFACTURER_NAME).

Easily understood definition: Ensures that each table contains unique data. In other words, it ensures that a table of customer identification data does not contain order data, and so on.

==Boyce-Codd normal form==

Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something other than a superset of a candidate key (called a superkey). At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A table is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a table is in BCNF if it is in 3NF and the only determinants are the candidate keys.

==Fourth normal form==

Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.

Consider a case where we need the record of an employee with any professional qualifications they have gained, and internal training courses they have been on (an employee may have none, one or multiple of each of these). One way to record this information would be to create a relation as follows:

EMPLOYEE_ID QUALIFICATION_ID TRAINING_COURSE_ID

The problem with this design is that we might have to enter employee s qualification id more than once (or leave the qualification field blank) if they have been on more than one training course, which causes ambiguity on data storage and retrieval. Therefore this table is not in 4NF.

There are actually 2 distinct many-to-many relationships here: one between Employee and Qualification ID, and one between Employee and Training Course ID. So 2 separate tables should be created to capture these.

employee_qualification table: EMPLOYEE_ID QUALIFICATION_ID employee_training_course table: EMPLOYEE_ID TRAINING_COURSE_ID This example has assumed that the two fields being recorded for the employees are independent; contrast a different case where for example the information to be recorded is the academic qualifications possessed by each employee, and the academic institution where the qualification was achieved. In this case the two values are not independent - it is necessary to record both the qualification and the institution (as a pair of values) in each case, and so a relation such as the following would be correct:

EMPLOYEE_ID DEGREE_ID UNIVERSITY_ID And this would require no changes to fit the fourth normal form requirements.

Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen s theorem is also applicable on multi-valued dependencies.

==Fifth normal form==

Fifth normal form (5NF and also PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.

==Domain/key normal form==

Domain/key normal form (or DKNF) requires that each key uniquely identifies each row in a table. A domain is the set of permissible values for an attribute. By enforcing key and domain restrictions, the database is assured of being freed from modification anomalies.

While sometimes called the 6NF, the DKNF should not be considered together with the seven other normal forms (1–6 and Boyce-Codd), because contrary to them it is not always achievable; furthermore, tables in the real 6NF are not always in the DKNF.

==Sixth normal form==

This normal form was, do not take into account this work, and most temporal extensions to SQL are not relational.

=Further Reading=

  • [http://www.datamodel.org/NormalizationRules.html Rules Of Data Normalization]
  • Date, C. J., & Lorentzos, N., & Darwen, H. (2002). [http://www.elsevier.com/wps/product/cws_home/680662 Temporal Data & the Relational Model] -Link doesn t work (1st ed.). Morgan Kaufmann. ISBN 1-55860-855-9.
  • Date, C. J. (1999), [http://www.aw-bc.com/catalog/academic/product/0,1144,0321197844,00.html An Introduction to Database Systems] (8th ed.). Addison-Wesley Longman. ISBN 0-321-19784-4.
  • Kent, W. (1983) [http://www.bkent.net/Doc/simple5.htm A Simple Guide to Five Normal Forms in Relational Database Theory] , Communications of the ACM, vol. 26, pp. 120-125
  • Date, C.J., & Darwen, H., & Pascal, F. [http://www.dbdebunk.com Database Debunkings]