MAN: Data warehousing is one of the hottest topics both in business and in data science. But if you're new to the field, you're probably wondering what a data warehouse is, why we need it, and how it works. Don't worry because, in four minutes, you'll know the answers to all these questions.

All right. First, let's start with a definition. What is the meaning of the phrase single source of truth? In information systems theory, the single source of truth is the practice of structuring all the best quality data in one place.

Let's look at a very simple example. Surely you have worked on a file and ended up creating many different versions of it. How do you name such a file? Well, once you are done, you often place the word final at the end. This results in a bunch of files whose names end in final, final final, final final final, or, my favorite, really final final.

If this is you, you are not alone. It seems that even corporations never know where the most recent or most appropriate file is. But what if you knew that there was one single place where you would always find the single source of information? That would be quite helpful, wouldn't it? Well, a data warehouse exists to fill that need.

So what is a data warehouse exactly? It is the place where companies store their valuable data assets, including customer data, sales data, employee data, and so on. In short, a data warehouse is the de facto single source of data truth for an organization. It is usually created and used primarily for data reporting and analysis purposes.

There are several defining features of a data warehouse. It is subject-oriented, integrated, time-variant, non-volatile, and summarized. Let's quickly go through these one by one. Subject-oriented means that the information in a data warehouse revolves around some subject. Therefore, it does not contain every piece of company data, only the subject matters of interest. For instance, data on your competitors need not appear in a data warehouse. However, your own sales data will most certainly be there.

Integrated corresponds to the example from the beginning of the video. Each database, or each team, or even each person has their own preferences when it comes to naming conventions. That is why common standards are developed to make sure that the data warehouse picks the best quality data from everywhere. This relates to master data governance but that is a topic for another time.
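
To make the idea of integration a bit more concrete, here is a minimal Python sketch. The two source systems, their field names, and the mapping table are all made up for illustration; in practice the common standard would come from the company's own governance rules.

```python
# Hypothetical rows from two source systems that name the same fields differently.
crm_rows = [{"CustName": "Ada Lovelace", "cust_id": 1}]
billing_rows = [{"customer_name": "Ada Lovelace", "CustomerID": 1}]

# Assumed common standard: each source's field names mapped to warehouse names.
FIELD_MAPS = {
    "crm": {"CustName": "customer_name", "cust_id": "customer_id"},
    "billing": {"customer_name": "customer_name", "CustomerID": "customer_id"},
}


def standardize(row, source):
    """Rename a source row's fields to the warehouse's common names."""
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in row.items()}


integrated = [standardize(r, "crm") for r in crm_rows] + [
    standardize(r, "billing") for r in billing_rows
]
print(integrated)
```

After standardization, both rows use the same field names, so they can be compared and combined no matter which system they came from.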

Time-variant relates to the fact that a data warehouse contains historical data, too. As said before, we mainly use a data warehouse for analysis and reporting, which implies we need to know what happened five or ten years ago. Non-volatile implies that the data only flows into the data warehouse as is. Once there, it cannot be changed or deleted. Summarized once again touches upon the fact that the data is used for data analytics. Often it is aggregated or segmented in some way in order to facilitate analysis and reporting.
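
The last three features can be sketched together in a few lines of Python. This is only a toy model, not how a real warehouse is built: the `SalesHistory` class, its method names, and the sample sales are all invented for illustration.

```python
from collections import defaultdict
from datetime import date


class SalesHistory:
    """Toy append-only store: rows can be added and read, never changed or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, sale_date, region, amount):
        # Each row keeps its date, so history stays available (time-variant).
        self._rows.append((sale_date, region, amount))

    def rows(self):
        # Hand back an immutable copy; there is no update or delete method (non-volatile).
        return tuple(self._rows)

    def total_by_year(self):
        # Aggregate amounts per year for reporting (summarized).
        totals = defaultdict(float)
        for sale_date, _region, amount in self._rows:
            totals[sale_date.year] += amount
        return dict(totals)


history = SalesHistory()
history.load(date(2015, 3, 1), "EU", 100.0)
history.load(date(2015, 9, 12), "US", 50.0)
history.load(date(2024, 1, 20), "EU", 75.0)
print(history.total_by_year())  # {2015: 150.0, 2024: 75.0}
```

The point of the sketch is that the store only offers ways to add and read data, and that reports are computed from the full history instead of overwriting it.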

All right. So that's what a data warehouse is, a very well structured and non-volatile de facto single source of truth for a company. If you enjoyed this video, don't forget to hit the like button and share it with your friends. And if you'd like to become an expert in all things data science, subscribe to our channel. Thanks for watching and good luck.