Database normalization
|
Database normalization (also spelt database normalisation) relates to the level of redundancy in a database's structure. Well-normalized databases have a schema that reflects the true dependencies between tracked quantities. This means that updates can be quickly performed with little risk of data becoming inconsistent.
In the relational model, formal methods exist for quantifying "how normalized" a database is. These classifications are called normal forms (or NFs), and there are algorithms for converting a given database between them.
Any increase in normalization generally involves splitting existing tables into multiple ones, which must be re-joined each time a query is issued. This can sometimes lead to poor performance, so intentional denormalization is used for applications like on-line analytical processing.
Contents |
Normal Forms
One can only ascribe a normal form to a database if relationships between quantities have been rigorously defined. It is possible to use set theory to express this knowledge once a problem domain has been fully understood. Such formality is usually sidestepped by database designers, who instead model the relationships in terms of an "idealized schema". (The mathematical underpinnings come back into play in proofs regarding the process of transforming from one form to another.)
Edgar F. Codd originally established three normal forms: 1NF, 2NF and 3NF. There are now others that are generally accepted, but 3NF is sufficient for most practical applications. Normalizing beyond that point rarely yields enough benefit to warrant the added complexity.
First normal form
First normal form (or 1NF) requires that all column values in a table are atomic (e.g., a number is an atomic value, while a list or a set is not). This basic format eliminates repeating groups and attributes by putting each into a separate table, then connecting them with a primary key-foreign key relationship.
Consider a relation for capturing an order as follows:
- ORDER_NUMBER (PRIMARY_KEY)
- CUSTOMER_NAME
- CUSTOMER_ADDRESS
- ITEM_1_NAME
- ITEM_1_PRICE
- ITEM_1_QUANTITY
- ITEM_2_NAME
- ITEM_2_PRICE
- ITEM_2_QUANTITY
...
- ITEM_N_NAME
- ITEM_N_PRICE
- ITEM_N_QUANTITY
The attributes for holding information about each Item on the Order are repeated for the number of different Items ordered. These attributes should instead be placed on a separate relation called ORDER_ITEM containing the following attributes
- ITEM_NAME
- ORDER_NUMBER
- ITEM_PRICE
- ITEM_QUANTITY
An ORDER relation can then reference many ORDER_ITEMs.
Second Normal Form
Second normal form (or 2NF) applies to relations that have Composite Primary Keys, where 2 or more attributes comprise the Primary Key. It requires that there are no non-trivial functional dependencies of a non-key attribute on a part (subset) of a candidate key. A relation is said to be in the 2NF if and only if it is in the 1NF and every non-key attribute is irreducibly dependent on the primary key (i.e.,not partially dependent on candidate key).
Consider a table describing Parts with the following attributes:
- PART_NUMBER (PRIMARY_KEY)
- SUPPLIER_NAME (PRIMARY_KEY)
- PRICE
- SUPPLIER_ADDRESS
The part number and supplier name form the composite primary key, because the same part can be supplied by multiple suppliers. In this example, price is correctly placed on the Part relation, because it is fully dependent on the Primary Key i.e. different suppliers will charge a different price for the same part.
Supplier Address, however, is only dependent on the Supplier Name, and therefore this relation breaks 2NF. This attribute should be placed on a second relation comprising:
- SUPPLIER_NAME (PRIMARY_KEY)
- SUPPLIER_ADDRESS
In order to find if a relation is in 2NF, ask whether any of the non-key attributes of the table could be derived from a subset of the composite key, rather than the whole composite key.
Third normal form
Third normal form (or 3NF) requires that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A relation is in 3NF if none of the non-Primary Key attributes are a fact about any other non-Primary Key attribute. Another way of saying this is that all non-key attributes are mutually independent (i.e. there should not be transitive dependencies).
Consider a relation that defines a Part as having the following attributes.
- PART_NUMBER (PRIMARY_KEY)
- MANUFACTURER_NAME
- MANUFACTURER_ADDRESS
In this case, the manufacturer address does not belong on this relation, because it is a fact about the Manufacturer of the Part, rather than the Part itself. Manufacturer should therefore be made into a separate relation with the attributes
- MANUFACTURER_NAME (PRIMARY_KEY)
- MANUFACTURER_ADDRESS
and the Part relation should be redefined as
- PART_NUMBER (PRIMARY_KEY)
- MANUFACTURER_NAME
The problem with a table not being in 3NF is that for every MANUFACTURER_NAME we have to maintain a MANUFACTURER_ADDRESS which is redundancy.
Boyce-Codd normal form
Boyce-Codd normal form (or BCNF) requires that there are no non-trivial functional dependencies of attributes on something else than a superset of a candidate key. At this stage, all attributes are dependent on a key, a whole key and nothing but a key (excluding trivial dependencies, like A->A). A relation is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.
Fourth normal form
Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on something else than a superset of a candidate key. A relation is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.
Consider a case where an Employee relation has multiple job categories and multiple locations where they work. It might be tempting to create a relation as follows to capture this information:
- EMPLOYEE_ID
- JOB_CATEGORY_ID
- LOCATION_ID
The problem with this relation is that we might have to enter Employee's Job Category more than once if they fulfill the same Job Category at more than one location. Therefore this relation is not in 4NF.
There are actually 3 distinct many-to-many relationships here, one between Employee and Job Category, one between Employee and Location, and one between Job Category and Location. So 3 relations should be created to capture these.
EMPLOYEE_JOB_CATEGORY relation:
- EMPLOYEE_ID
- JOB_CATEGORY_ID
EMPLOYEE_LOCATION relation:
- EMPLOYEE_ID
- LOCATION_ID
JOB_CATEGORY_LOCATION relation:
- JOB_CATEGORY_ID
- LOCATION_ID
Ronald Fagin demonstrated that it is always possible to achieve 4NF (but not always desirable). Rissanen's theorem is also applicable on multi-valued dependencies.
Fifth normal form
Fifth normal form (or 5NF or PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A relation R is said to be in the 5NF if and only if it is in 4NF and every join dependency in R is implied by the candidate keys.
References
- This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.
es:Normalización de una base de datos fr:Formes normales fi:Tietokannan normalisointi ja:リレーションの正規化 sv:Normalform (databaser)