Normalization helps eliminate redundancies and inconsistencies in table data. It is the process of reducing tables to a set of columns where all the non-key columns depend on the primary key column. If this is not the case, the data can become inconsistent during updates. Definition: Normalization is the process of structuring relational database schema such that most ambiguity is removed. The stages of normalization are referred to as normal forms and progress from the least restrictive (First Normal Form) through the most restrictive (Fifth Normal Form). Generally, most database designers do not attempt to implement anything higher than Third Normal Form or Boyce-Codd Normal Form. If you've been working with databases for a while, chances are you've heard the term normalization. Perhaps someone's asked you "Is that database normalized?" or "Is that in BCNF?" All too often, the reply is "Uh, yeah." Normalization is often brushed aside as a luxury that only academics have time for. However, knowing the principles of normalization and applying them to your daily database design tasks really isn't all that complicated and it could drastically improve the performance of your DBMS.

So, what is normalization? Basically, it's the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminate redundant data (for example, storing the same data in more than one table) and ensure data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen and won't be discussed on this page.

Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

First normal form (1NF) sets the very basic rules for an organized database:

  • Eliminate duplicative columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Second normal form (2NF) further addresses the concept of removing duplicative data:
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.
Third normal form (3NF) goes one large step further:
  • Remove columns that are not dependent upon the primary key.
Finally, fourth normal form (4NF) has one requirement:
  • A relation is in 4NF if it has no multi-valued dependencies.
Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.

This section briefly reviews the rules for first, second, third, fourth and fifth normal forms.

Form Description
First At each row and column position in the table, there exists one value, never a set of values (see First Normal Form).
Second Each column that is not part of the key is dependent upon the key (see Second Normal Form).
Third Each non-key column is independent of other non-key columns, and is dependent only upon the key (see Third Normal Form).
Fourth No row contains two or more independent multi-valued facts about an entity.
Fifth The aim of fifth normal form is to have relations that cannot be decomposed further. A relation in 5NF cannot be constructed from several smaller relations (see Fifth Normal Form)..

First Normal Form

A table is in first normal form if there is only one value, never a set of values, in each cell. A table that is in first normal form does not necessarily satisfy the criteria for higher normal forms.

For example, the following table violates first normal form because the WAREHOUSE column contains several values for each occurrence of PART.

Table Violating First Normal Form

PART (Primary Key) WAREHOUSE
P0010 Warehouse A, Warehouse B, Warehouse C
P0020 Warehouse B, Warehouse D

The following example shows the same table in first normal form.

Table Conforming to First Normal Form

PART (Primary Key) WAREHOUSE (Primary Key) QUANTITY
P0010 Warehouse A 400
P0010 Warehouse B 543
P0010 Warehouse C 329
P0020 Warehouse B 200
P0020 Warehouse D 278

Second Normal Form

A table is in second normal form if each column that is not part of the key is dependent upon the entire key.

Second normal form is violated when a non-key column is dependent upon part of a composite key, as in the following example:

Table Violating Second Normal Form

PART (Primary Key) WAREHOUSE (Primary Key) QUANTITY WAREHOUSE_ADDRESS
P0010 Warehouse A 400 1608 New Field Road
P0010 Warehouse B 543 4141 Greenway Drive
P0010 Warehouse C 329 171 Pine Lane
P0020 Warehouse B 200 4141 Greenway Drive
P0020 Warehouse D 278 800 Massey Street

The primary key is a composite key, consisting of the PART and the WAREHOUSE columns together. Because the WAREHOUSE_ADDRESS column depends only on the value of WAREHOUSE, the table violates the rule for second normal form.

The problems with this design are:

  • The warehouse address is repeated in every record for a part stored in that warehouse.
  • If the address of a warehouse changes, every row referring to a part stored in that warehouse must be updated.
  • Because of this redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
  • If at some time there are no parts stored in a warehouse, there might not be a row in which to record the warehouse address.

The solution is to split the table into the following two tables:

PART_STOCK Table Conforming to Second Normal Form

PART (Primary Key) WAREHOUSE (Primary Key) QUANTITY
P0010 Warehouse A 400
P0010 Warehouse B 543
P0010 Warehouse C 329
P0020 Warehouse B 200
P0020 Warehouse D 278

WAREHOUSE Table Conforms to Second Normal Form

WAREHOUSE (Primary Key) WAREHOUSE_ADDRESS
Warehouse A 1608 New Field Road
Warehouse B 4141 Greenway Drive
Warehouse C 171 Pine Lane
Warehouse D 800 Massey Street

There is a performance consideration in having the two tables in second normal form. Applications that produce reports on the location of parts must join both tables to retrieve the relevant information.

Third Normal Form

A table is in third normal form if each non-key column is independent of other non-key columns, and is dependent only on the key.

The first table in the following example contains the columns EMPNO and WORKDEPT. Suppose a column DEPTNAME is added. The new column depends on WORKDEPT, but the primary key is EMPNO. The table now violates third normal form. Changing DEPTNAME for a single employee, John Parker, does not change the department name for other employees in that department. Note that there are now two different department names used for department number E11. The inconsistency that results is shown in the updated version of the table.

Unnormalized EMPLOYEE_DEPARTMENT Table Before Update

EMPNO (Primary Key) FIRSTNAME LASTNAME WORKDEPT DEPTNAME
000290 John Parker E11 Operations
000320 Ramlal Mehta E21 Software Support
000310 Maude Setright E11 Operations

Unnormalized EMPLOYEE_DEPARTMENT Table After Update

Information in the table has become inconsistent.
EMPNO (Primary Key) FIRSTNAME LASTNAME WORKDEPT DEPTNAME
000290 John Parker E11 Installation Mgmt
000320 Ramlal Mehta E21 Software Support
000310 Maude Setright E11 Operations

The table can be normalized by creating a new table, with columns for WORKDEPT and DEPTNAME. An update like changing a department name is now much easier; only the new table needs to be updated.

An SQL query that returns the department name along with the employee name is more complex to write, because it requires joining the two tables. It will probably also take longer to run than a query on a single table. Additional storage space is required, because the WORKDEPT column must appear in both tables.

The following tables are defined as a result of normalization:

EMPLOYEE Table After Normalizing the EMPLOYEE_DEPARTMENT Table

EMPNO (Primary Key) FIRSTNAME LASTNAME WORKDEPT
000290 John Parker E11
000320 Ramlal Mehta E21
000310 Maude Setright E11

DEPARTMENT Table After Normalizing the EMPLOYEE_DEPARTMENT Table

DEPTNO (Primary Key) DEPTNAME
E11 Operations
E21 Software Support


Fourth Normal Form

A table is in fourth normal form if no row contains two or more independent multi-valued facts about an entity.

Consider these entities: employees, skills, and languages. An employee can have several skills and know several languages. There are two relationships, one between employees and skills, and one between employees and languages. A table is not in fourth normal form if it represents both relationships, as in the following example:

Table Violating Fourth Normal Form

EMPNO (Primary Key) SKILL (Primary Key) LANGUAGE (Primary Key)
000130 Data Modelling English
000130 Database Design English
000130 Application Design English
000130 Data Modelling Spanish
000130 Database Design Spanish
000130 Application Design Spanish

Instead, the relationships should be represented in two tables:

EMPLOYEE_SKILL Table Conforming to Fourth Normal Form

EMPNO (Primary Key) SKILL (Primary Key)
000130 Data Modelling
000130 Database Design
000130 Application Design

EMPLOYEE_LANGUAGE Table Conforming to Fourth Normal Form

EMPNO (Primary Key) LANGUAGE (Primary Key)
000130 English
000130 Spanish

If, however, the attributes are interdependent (that is, the employee applies certain languages only to certain skills), the table should not be split.

A good strategy when designing a database is to arrange all data in tables that are in fourth normal form, and then to decide whether the results give you an acceptable level of performance. If they do not, you can rearrange the data in tables that are in third normal form, and then reassess performance.

Fifth Normal Form