Normalization

Motivation 💪🏻

Imagine you are starting out with an e-commerce website. You design the pages, get your items, enlist them in your website and make it live🌐. Now, when the orders come, you have to store in a table. So, make a table with the name myBigTableThatHasAllTheData😁 and start writing data. This is not a good practice for several reasons. Lets break down each one of them ...

Let's say we maintain a food court. I am gonna trivialize the details for easy understanding. Let's consider the orders:

Flat way

Customer Name	Item Name	Store Name	Total Order Price	Purchase Time
Hungry Man1	Pizza	Pizza in 5	608.03	2026-01-05 19:47:08
Hungry Man1	Burger	Burg Me In	203.87	2026-01-05 19:47:08
Hungry Man2	Pizza	Pizza in 5	196.27	2026-01-07 11:15:18
Hungry Man2	Ice Cream	Dreamy Creamy	85.79	2026-01-07 13:27:49
Hungry Man1	Ice Cream	Dreamy Creamy	401.43	2026-01-07 22:35:28
Hungry Man1	Burger	Burg Me In	1247.22	2026-01-08 22:35:28

Normalized way🤘🏻

CustId	Customer Name
H01	Hungry Man1
H02	Hungry Man2

ItemId	Item Name	Store Name
It01	Pizza	Pizza in 5
It02	Burger	Burg Me In
It03	Ice Cream	Dreamy Creamy

OrderId	CustId	ItemId	TotalPrice	Purchase Time
1	H01	It01	608.03	2026-01-05 19:47:08
2	H01	It02	203.87	2026-01-05 19:47:08
3	H02	It02	196.27	2026-01-07 11:15:18
4	H02	It03	85.79	2026-01-07 13:27:49
5	H01	It03	401.43	2026-01-07 22:35:28
6	H01	It02	1247.22	2026-01-08 22:35:28

This is the normalized way, where we have all the complicated Ids referring to other tables. Why go through the pain🤕 of all this ...

Redundancy

Count the number of characters consumed to store data the flat way. Lets assume any character (a number or an alphabet or anything else for that matter) consumes 1 byte of storage. Count the total number...

Take your time🕛... but it will be ~266 bytes and for the normalized way it comes to ~246 bytes. It does not look impressive, but this saving becomes more pronounced when there are huge number of rows

Updates

Lets say, we need to change the name of that Pizza place from Pizza in 5 to Pizza5. We can run an update query on the flat way, which would update it at 2 places. Currently it is 2 for a small table, but as the data grows this number would also grow

But in the normalized way, it would update in exactly 1 row in the Store table. So, its faster⏩ and safer

Deletion anomaly

Lets say you delete the records for Hungry Man2 from the flat table. Then there would be no indication from the Database that he ever existed

Lets say you delete that entry from the Orders table, his identity would still exist in the Customer table.

This is useful in other ways too. Imagine a scenario where we are storing the shipping🚢 address in the flat table. If the Hungry Man's order is deleted, so is his address. Next time, he orders something, he needs to enter his address again. But in the normalized way, since the details are stored in another place, the info is safe🔒. This not only applies to address, but any personal info like credit card💳 info, preference on website etc...

Foreign key relationships

Foreign key relationships are an important part of Relationship DataBase Management Systems (RDBMS). It prevents orphan records

Lets say, we are blocking🫥 Hungry Man1 from our system. He might have ordered too many burgers🍔 from our website. Or he might just be a test user🧑🏻‍💻, and we want to delete these records before going live (Now, you might be thinking, we delete all records before going live, but... lets say we created this user to test a coupon🎟️ or a new feature or to fix a bug🐞 ...). In either case, if we delete the record from Customer table, the records from Orders table will automatically get deleted. This is called cascading effect, and the database automatically handles that. In fact, this feature is so well integrated that we can set the database to either cascade, insert NULL, or some default value, or raise error because some other table depends in this table

Website design

Lets say you are listing items from your DB on to the website. It is sufficient to expose only the Store table as it helps populate the items in the website. The other 2 tables can be kept safely away and we can expose info from it only when needed (like when user is logged in and connection is secure🔐, or for ADMIN🥷🏻 panel etc ...). This decreases the attack surface🤺

Disadvantages

Normalized way, is the way. Loud and clear. But like all things in this world, this way has its disadvantages

Don't normalize everything

Here, in the above example, suppose the store table was once again normalized

Store Flat way

ItemId	Item Name	Store Name
It01	Pizza	Pizza in 5
It02	Burger	Burg Me In
It03	Ice Cream	Dreamy Creamy

Store Normalized way

ItemId	Item Name	Store Id
It01	Pizza	S01
It02	Burger	S02
It03	Ice Cream	S03

Storeid	Store Name
S01	Pizza In 5
S02	Burg Me In
S03	Dreamy Creamy

We can observe that it did not yield us a good benefit. In fact it made it worse. Now we just added a bunch of Ids that occupy more space than the flat version. So, we must be aware how much and how many tables need normalization

Difficult to remember the JOINs

Generally speaking, there are many tables in the system. And to acquire the necessary information, the analyst must remember which tables to JOIN and on what keys🔑. If the tables are heavily normalized, then number of JOINs increases accordingly

That is the reason why, during analysis, data is not stored in normalized way. Otherwise, even for simple data to make sense, we need to JOIN many tables