Normalization
Motivation 💪🏻
Imagine you are starting out with an e-commerce website. You design the pages, get your items, enlist them in your website and make it live🌐. Now, when the orders come, you have to store in a table. So, make a table with the name myBigTableThatHasAllTheData😁 and start writing data. This is not a good practice for several reasons. Lets break down each one of them ...
Let's say we maintain a food court. I am gonna trivialize the details for easy understanding. Let's consider the orders:
Flat way
| Customer Name | Item Name | Store Name | Total Order Price | Purchase Time |
|---|---|---|---|---|
| Hungry Man1 | Pizza | Pizza in 5 | 608.03 | 2026-01-05 19:47:08 |
| Hungry Man1 | Burger | Burg Me In | 203.87 | 2026-01-05 19:47:08 |
| Hungry Man2 | Pizza | Pizza in 5 | 196.27 | 2026-01-07 11:15:18 |
| Hungry Man2 | Ice Cream | Dreamy Creamy | 85.79 | 2026-01-07 13:27:49 |
| Hungry Man1 | Ice Cream | Dreamy Creamy | 401.43 | 2026-01-07 22:35:28 |
| Hungry Man1 | Burger | Burg Me In | 1247.22 | 2026-01-08 22:35:28 |
Normalized way🤘🏻
| CustId | Customer Name |
|---|---|
| H01 | Hungry Man1 |
| H02 | Hungry Man2 |
| ItemId | Item Name | Store Name |
|---|---|---|
| It01 | Pizza | Pizza in 5 |
| It02 | Burger | Burg Me In |
| It03 | Ice Cream | Dreamy Creamy |
| OrderId | CustId | ItemId | TotalPrice | Purchase Time |
|---|---|---|---|---|
| 1 | H01 | It01 | 608.03 | 2026-01-05 19:47:08 |
| 2 | H01 | It02 | 203.87 | 2026-01-05 19:47:08 |
| 3 | H02 | It02 | 196.27 | 2026-01-07 11:15:18 |
| 4 | H02 | It03 | 85.79 | 2026-01-07 13:27:49 |
| 5 | H01 | It03 | 401.43 | 2026-01-07 22:35:28 |
| 6 | H01 | It02 | 1247.22 | 2026-01-08 22:35:28 |
This is the normalized way, where we have all the complicated Ids referring to other tables. Why go through the pain🤕 of all this ...
Redundancy
Count the number of characters consumed to store data the flat way. Lets assume any character (a number or an alphabet or anything else for that matter) consumes 1 byte of storage. Count the total number...
Take your time🕛... but it will be ~266 bytes and for the normalized way it comes to ~246 bytes. It does not look impressive, but this saving becomes more pronounced when there are huge number of rows
Updates
Lets say, we need to change the name of that Pizza place from Pizza in 5 to Pizza5. We can run an update query on the flat way, which would update it at 2 places. Currently it is 2 for a small table, but as the data grows this number would also grow
But in the normalized way, it would update in exactly 1 row in the Store table. So, its faster⏩ and safer
Deletion anomaly
Lets say you delete the records for Hungry Man2 from the flat table. Then there would be no indication from the Database that he ever existed
Lets say you delete that entry from the Orders table, his identity would still exist in the Customer table.
This is useful in other ways too. Imagine a scenario where we are storing the shipping🚢 address in the flat table. If the Hungry Man's order is deleted, so is his address. Next time, he orders something, he needs to enter his address again. But in the normalized way, since the details are stored in another place, the info is safe🔒. This not only applies to address, but any personal info like credit card💳 info, preference on website etc...
Foreign key relationships
Foreign key relationships are an important part of Relationship DataBase Management Systems (RDBMS). It prevents orphan records
Lets say, we are blocking🫥 Hungry Man1 from our system. He might have ordered too many burgers🍔 from our website. Or he might just be a test user🧑🏻💻, and we want to delete these records before going live (Now, you might be thinking, we delete all records before going live, but... lets say we created this user to test a coupon🎟️ or a new feature or to fix a bug🐞 ...). In either case, if we delete the record from Customer table, the records from Orders table will automatically get deleted. This is called cascading effect, and the database automatically handles that. In fact, this feature is so well integrated that we can set the database to either cascade, insert NULL, or some default value, or raise error because some other table depends in this table
Website design
Lets say you are listing items from your DB on to the website. It is sufficient to expose only the Store table as it helps populate the items in the website. The other 2 tables can be kept safely away and we can expose info from it only when needed (like when user is logged in and connection is secure🔐, or for ADMIN🥷🏻 panel etc ...). This decreases the attack surface🤺
Disadvantages
Normalized way, is the way. Loud and clear. But like all things in this world, this way has its disadvantages
Don't normalize everything
Here, in the above example, suppose the store table was once again normalized
Store Flat way
| ItemId | Item Name | Store Name |
|---|---|---|
| It01 | Pizza | Pizza in 5 |
| It02 | Burger | Burg Me In |
| It03 | Ice Cream | Dreamy Creamy |
Store Normalized way
| ItemId | Item Name | Store Id |
|---|---|---|
| It01 | Pizza | S01 |
| It02 | Burger | S02 |
| It03 | Ice Cream | S03 |
| Storeid | Store Name |
|---|---|
| S01 | Pizza In 5 |
| S02 | Burg Me In |
| S03 | Dreamy Creamy |
We can observe that it did not yield us a good benefit. In fact it made it worse. Now we just added a bunch of Ids that occupy more space than the flat version. So, we must be aware how much and how many tables need normalization
Difficult to remember the JOINs
Generally speaking, there are many tables in the system. And to acquire the necessary information, the analyst must remember which tables to JOIN and on what keys🔑. If the tables are heavily normalized, then number of JOINs increases accordingly
That is the reason why, during analysis, data is not stored in normalized way. Otherwise, even for simple data to make sense, we need to JOIN many tables