Tuesday 19 June 2012

Data Warehousing Concepts


Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.
Dimensional Data Model: The dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique, and the two common schema types, star schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem, and describes the three ways of handling it, with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of data model.
Logical Data Model: What is a logical data model, its features, and an example of this type of data model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and contrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
What is OLAP: Definition of OLAP.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they differ from one another, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have different views of the relationship between the data warehouse and the data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: when to use it and why it is useful.
Conformed Dimension: Discusses the concept of a conformed dimension: what it is and why it is important.

Recipes For Failure


This section describes 8 situations where the data warehousing effort is destined to fail, often despite the best of intentions.
1. Focusing On Ideology Rather Than Practicality
There are many good textbooks on data warehousing out there, and many schools are offering data warehousing classes. Having read the textbooks or completed a course, however, does not make a person a data warehousing guru.
For example, I have seen someone insisting on enforcing foreign key rules between dimension tables and fact tables, starting from the first day of development. This is not prudent for several reasons: 1) The development environment by its very nature involves a lot of playing with data -- many updates, deletes, and inserts -- and having the foreign key constraint only makes the development effort take longer than necessary. 2) It slows the ETL load time. 3) The constraint is redundant: when data gets loaded into the fact table, we already have to go to the dimension table to get the proper foreign key, which accomplishes exactly what a foreign key constraint would accomplish.
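The point about the lookup can be sketched in a few lines. The following is a minimal illustration using Python's built-in sqlite3 module; the table and column names are invented for the example:

```python
import sqlite3

# Hypothetical star-schema fragment: a product dimension and a sales fact.
# Note that fact_sales deliberately declares no foreign key constraint.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_code TEXT UNIQUE
    );
    CREATE TABLE fact_sales (
        product_key INTEGER,   -- no FOREIGN KEY declared
        amount      REAL
    );
    INSERT INTO dim_product (product_key, product_code)
        VALUES (1, 'WIDGET-A'), (2, 'WIDGET-B');
""")

# Incoming source rows carry the natural key (product_code), not the
# warehouse key, so the load has to consult the dimension table anyway.
source_rows = [("WIDGET-A", 100.0), ("WIDGET-B", 250.0)]
for code, amount in source_rows:
    row = conn.execute(
        "SELECT product_key FROM dim_product WHERE product_code = ?",
        (code,),
    ).fetchone()
    if row is None:
        # The lookup itself catches orphan rows -- exactly what a
        # foreign key constraint would have enforced.
        raise ValueError("unknown product code: " + code)
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (row[0], amount))

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded)  # 2
```

Because every fact row passes through the dimension lookup, a declared constraint would add checking overhead during the load without adding any protection.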
2. Making The Process Unnecessarily Complicated
Data warehousing is an inherently complex undertaking, and there is no need to make it even more complex.
Here is an example: The source file comes in with one fact. The person responsible for project management insists that this fact be broken into several different metrics during ETL. The idea sounds reasonable: to cut down on the number of rows in the fact table, so that the front-end tool can generate reports more quickly. Unfortunately, there are several problems with this approach: First, the ETL became unnecessarily complex. Not only were "case"-type statements now needed in ETL, but because the fact could not always be broken down into the corresponding metrics cleanly due to inconsistencies in the data, it became necessary to create a lot of logic to take care of the exceptions. Second, it is never advisable to design the data model and the ETL process based on what suits the front-end tool the most. Third, at the end of the day, the reports ended up having to sum these separate metrics back together to get what the users were truly after, meaning that all the extra work was for naught.
3. Lack of Clear Ownership
Because data warehousing projects typically touch upon many different departments, it is natural that the project involves multiple teams. For the project to be successful, though, there must be clear ownership of the project. Not clear ownership of different components of the project, but of the project itself. I have seen a case where multiple groups each owned a portion of the project. Needless to say, such projects never got finished as quickly as they should have, tended to underdeliver, and had inflexible infrastructure (as each group would do what was best for that group, not for the whole project). What I have seen is that such an arrangement is tailor-made for finger-pointing: if something is wrong, it is always another group's fault. At the end of the day, nobody is responsible for anything, and it is no wonder the project is full of problems. Making sure one person or one group is fully accountable for the success of the data warehousing project is paramount in ensuring a successful project.
4. Not Understanding Proper Protocol
Whether you are working as a consultant or an internal resource, you need to understand the organization's protocol in order to build a successful data warehouse / data mart.
I have been in a project where the team thought all the development was done, tested, documented, migrated to the production system, and ready to deliver by the deadline, and was ready to celebrate with the bonus money the client promised for an on-time delivery. One thing that was missed, though, was that the client always requires any production system to go through its QA group first. The project manager in this case did not know that. Hence, rather than delivering a project on time and within budget, the project had to be delayed for an additional four months before it could go online, all because project management was not familiar with the organization's protocol.
5. Not Fully Understanding Project Impact Before The Project Starts
Here, I am talking about cases where the project impact turns out to be much smaller than anticipated. I have seen data mart efforts where a significant amount of resources was thrown into the project, and at the completion of the project, there were only two users. This is clearly a case where someone did not make the proper call, as these resources could have been better utilized on different projects.
6. Trying To Bite Off More Than You Can Chew
This means that the project attempts to accomplish something grander than it is supposed to. There are two examples below:
There are data warehousing projects that attempt to control everything -- even to the point of dictating how the source system should be built to capture data, and exactly how data should be captured. While the idea is noble -- often during a project, we find that the source system data has a lot of problems, and hence it makes sense to make sure the source system is built right -- in reality this is not practical. First, source systems are built the way they are for specific reasons, and data analysis should be only one of the concerns, not the only concern. In addition, this approach leads to a data warehousing system that is pretty in theory, but very inflexible in reality.
In the same vein, I have seen data mart efforts where the project owner attempted to push his own ideas onto the rest of the company and instructed his team to build the system in a way that could accommodate that possibility. Of course, what really happened is that no one else ended up adopting his ideas, and much time and effort were wasted.
7. Blindly Sticking To Certain Standards
I have seen cases where a concerted effort is put on ensuring that different data marts employ the same infrastructure, from the tools used (for example, a certain ETL tool must be used for doing ETL, regardless of how simple that ETL process is) to the user experience (for example, users must be able to access the same set of report selection criteria).
This is an absurd way of building data marts. The very reason that different data marts exist is because there are differences among them, so insisting on making sure they all conform to a certain standard is an exercise in futility. I have seen ETL tools blindly placed on ETL processes that require only a series of SQL statements.
As far as the front end goes, that makes even less sense. First of all, different projects, even though they may be very similar, are still different. Otherwise they would belong to the same project. Furthermore, users really do not care if their views into different data marts have exactly the same look and feel. What they care about is whether the data is there on time, and whether the numbers are dependable.
8. Bad Project Management
Bad project management can manifest itself in several ways, and some of the examples listed previously illustrate the danger of bad project management. In short, it is safe to say a bad project manager will certainly doom a project.
For data warehousing projects, the key is experience, especially hands-on experience. This is not a job for someone who just completed his or her MBA program, or someone who has only read through all the data warehousing books, but has had no practical experience.

Incremental Enhancements


Task Description
Once the data warehousing system goes live, there is often a need for incremental enhancements. I am not talking about a new data warehousing phase, but simply small changes that follow the business itself. For example, the original geographical designations may change: the company may have started with 4 sales regions, but because sales are going so well, it now has 10.
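As a toy illustration of such an enhancement, here is what the sales-region change might look like at the dimension-table level, using Python's built-in sqlite3 module. The region names and table layout are invented, and in practice the change would of course go through development and QA first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT)"
)

# The original 4 sales regions (hypothetical names).
original = ["North", "South", "East", "West"]
conn.executemany(
    "INSERT INTO dim_region (region_name) VALUES (?)", [(r,) for r in original]
)

# Incremental enhancement: the business grows to 10 regions. New rows are
# appended; the existing surrogate keys (and any facts referencing them)
# are left untouched.
new_regions = ["Northeast", "Northwest", "Southeast",
               "Southwest", "Central", "Mountain"]
conn.executemany(
    "INSERT INTO dim_region (region_name) VALUES (?)", [(r,) for r in new_regions]
)

count = conn.execute("SELECT COUNT(*) FROM dim_region").fetchone()[0]
print(count)  # 10
```

The design choice worth noting is that the change is additive: because facts reference surrogate keys rather than region names, history is preserved without touching the fact table.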
Deliverables
  • Change management documentation
  • Actual change to the data warehousing system
Possible Pitfalls
Because the changes are often simple to make, it is very tempting to just go ahead and make them directly in production. This is a definite no-no, and many unexpected problems will pop up if it is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.

Production Maintenance


Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.

Rollout To Production


Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping a switch, but usually that is not the case. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.
Time Requirement
1 - 3 days.
Deliverables
  • Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than to spend several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.

Quality Assurance


Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it was necessary to go through the client QA process before the project could go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
  • QA Test Plan
  • QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.

Query Optimization


For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.
First, we offer some guiding principles for query optimization:
1. Understand how your database is executing your query
Nowadays all databases have their own query optimizer and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this: in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan, while in Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]".
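As a runnable illustration, SQLite (via Python's built-in sqlite3 module) exposes the same facility through EXPLAIN QUERY PLAN. The table and index names below are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(42, 19.99), (7, 5.00)],
)

# Ask the database how it would execute the query, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"
).fetchall()

# The last column of each plan row is a human-readable description;
# it names the index, confirming the optimizer will use it.
detail = plan[0][-1]
print(detail)
print("idx_orders_customer" in detail)  # True
```

The same habit carries over to any database: before tuning a slow query, look at the plan first to see whether the expected index is actually being used.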
2. Retrieve as little data as possible
The more data returned from the query, the more resources the database needs to expend to process and store that data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
3. Store intermediate results
Sometimes logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. For those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.
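A minimal sketch of the temporary-table approach, again using Python's sqlite3 with invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 10.0), ("North", 20.0), ("South", 5.0)],
)

# Instead of burying the aggregation in a subquery or inline view,
# materialize the intermediate result in a temporary table...
conn.execute("""
    CREATE TEMP TABLE region_totals AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")
# ...and index it so that subsequent steps can look rows up quickly.
conn.execute("CREATE INDEX idx_region ON region_totals (region)")

total = conn.execute(
    "SELECT total FROM region_totals WHERE region = ?", ("North",)
).fetchone()[0]
print(total)  # 30.0
```

The trade-off is as described above: a little more query-management complexity (the temporary table must be created and eventually dropped) in exchange for computing the expensive intermediate result once instead of inside every enclosing query.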
Below are several specific query optimization strategies.
  • Use Index
    Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is treated as a topic in its own right.
  • Aggregate Table
    Pre-populating tables at higher levels of aggregation so that less data needs to be parsed.
  • Vertical Partitioning
    Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.
  • Horizontal Partitioning
    Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
  • Denormalization
    The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.
  • Server Tuning
    Each server has its own parameters, and often tuning server parameters so that it can fully take advantage of the hardware resources can significantly speed up query performance.
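To make one of these strategies concrete, here is a small denormalization sketch using Python's sqlite3 (all names invented): the frequently-joined dimension attributes are folded into a single reporting table, so queries against it need no join at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY,
                            store_name TEXT, city TEXT);
    CREATE TABLE fact_sales (store_key INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'Downtown', 'Boston'),
                                 (2, 'Mall', 'Boston');
    INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0), (1, 25.0);
""")

# Denormalization: perform the fact-to-dimension join once, up front,
# and store the widened result.
conn.execute("""
    CREATE TABLE sales_denorm AS
    SELECT f.amount, d.store_name, d.city
    FROM fact_sales f
    JOIN dim_store d ON f.store_key = d.store_key
""")

# Reporting queries now filter on dimension attributes with no join.
city_total = conn.execute(
    "SELECT SUM(amount) FROM sales_denorm WHERE city = 'Boston'"
).fetchone()[0]
print(city_total)  # 175.0
```

The cost of this choice is redundancy: the city name is now stored on every row, so the denormalized table takes more space and must be rebuilt or updated when the dimension changes.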

Performance Tuning


Task Description
There are three major areas where a data warehousing system can use a little performance tuning:
  • ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly batch job), and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes just barely on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
  • Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. We present a number of query optimization ideas.
  • Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
  • Performance tuning document - Goal and Result
Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible - Performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.

Report Development


Task Description
Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint with the data warehousing system is the reports they see. So report development, although not as time-consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.
One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing a report.
User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take those factors into consideration so that users can get the information they need in the shortest amount of time possible.
Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, allowing end users to slice and dice the data on the report without having to pull data from an external source.
Access privileges: Special attention needs to be paid to who has what access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director.
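Rules like these usually live in the reporting tool's security layer, but the idea can be sketched as a simple lookup. Everything below -- the role names, the metric list, the scoping -- is hypothetical, invented only to mirror the CEO-versus-District-Sales-Director example above:

```python
# Hypothetical metric catalog for a sales report.
ALL_METRICS = ["revenue", "units", "margin", "returns", "cost",
               "discount", "forecast", "pipeline"]

# Hypothetical access rules: the CEO sees all 8 metrics company-wide,
# while a District Sales Director sees only 5, scoped to one district.
ACCESS_RULES = {
    "ceo":                     {"metrics": ALL_METRICS,     "scope": "company"},
    "district_sales_director": {"metrics": ALL_METRICS[:5], "scope": "district"},
}

def visible_metrics(role):
    """Return the metric columns a role may see on the report."""
    rule = ACCESS_RULES.get(role)
    if rule is None:
        return []   # unknown roles see nothing by default
    return rule["metrics"]

print(len(visible_metrics("ceo")))                      # 8
print(len(visible_metrics("district_sales_director")))  # 5
print(visible_metrics("intern"))                        # []
```

The deny-by-default branch is the important design choice: a role that is not explicitly granted access sees nothing, rather than everything.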
Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:
1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report into the front end. There is no need to wait for a major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.
Time Requirement
1 - 2 weeks.
Deliverables
  • Report Specification Documentation.
  • Reports set up in the front end / reports delivered to user's preferred channel.
Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.

Front End Development


Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.
So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports need to be delivered over the web, so that the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department install programs on end users' desktops just so that they can view reports. So, whatever strategy one pursues, the ability to deliver over the web is a must.
The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-level products such as Actuate. In addition, many OLAP vendors offer front-ends of their own. When choosing a vendor tool, make sure it can be easily customized to suit the enterprise, especially as the reporting requirements of the enterprise change. Possible changes include not just differences in report layout and report content, but also changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Windows 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.

OLAP Cube Design


Task Description
Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information that they feel they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.
Time Requirement
1 - 2 weeks.
Deliverables
  • Documentation specifying the OLAP cube dimensions and measures.
  • Actual OLAP cube / report.
Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse load to sit at the bottom of the nightly batch run, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.

ETL


Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.
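The shape of the work can be seen even in a toy end-to-end pass: extract raw source rows, transform them (type conversion plus a business rule), and load the result into a staging table. This sketch uses Python's built-in sqlite3; the record layout, the rule, and the table names are all invented:

```python
import sqlite3

# Extract: comma-separated lines as they might arrive from a source
# system. The second row is missing its amount -- a typical data problem.
raw_source = [
    "WIDGET-A,100",
    "WIDGET-B,",
    "WIDGET-C,250",
]

def transform(line):
    """Apply type conversion and a simple (invented) business rule."""
    code, amount = line.split(",")
    if not amount:          # rule: reject incomplete rows
        return None
    return (code, float(amount))

# Load: insert only the rows that survived the transformation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (product_code TEXT, amount REAL)")
clean = [row for row in map(transform, raw_source) if row is not None]
conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", clean)

loaded = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(loaded)  # 2 of the 3 source rows survive the quality rule
```

Real ETL differs mainly in scale: the transform step grows to encode the business rules, exception handling, and dimension lookups described elsewhere on this page, which is precisely why this phase takes so long to develop.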
Time Requirement
1 - 6 weeks.
Deliverables
  • Data Mapping Document
  • ETL Script / ETL Package in the ETL tool
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove suicidal to the project because end users will usually tolerate less formatting, longer time to run reports, less functionality (slicing and dicing), or fewer delivered reports; one thing that they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This principle, however, is sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.

Data Modeling

Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out early where the data exists -- or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it becomes a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
  • Identification of data sources.
  • Logical data model.
  • Physical data model.
Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or can be someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.

Physical Environment Setup

Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.
It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers.
Having different environments is very important for the following reasons:
  • All changes can be tested and QA'd first without affecting the production environment.
  • Development and QA can occur during the time users are accessing the data warehouse.
  • When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
  • Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments. Environment separation is then achieved either by a directory structure or by setting up distinct instances of the database. This is problematic for the following reasons:
1. Sometimes the server hosting the development environment needs to be rebooted. Having a separate development environment prevents the production environment from being impacted by this.
2. There may be interference when different database environments sit on a single box. For example, having multiple long queries running on the development database could affect the performance of the production database.

Requirement Gathering


Task Description
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting.
The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.
Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBA's will attest, this can happen very quickly after the project goes live.
Time Requirement
2 - 8 weeks.
Deliverables
  • A list of reports / cubes to be delivered to the end users by the end of this current phase.
  • An updated project plan that clearly identifies resource loads and milestone delivery dates.
Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or cannot start in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.

Data Warehouse Design


After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in a data warehousing project cycle:
  • Requirement Gathering
  • Physical Environment Setup
  • Data Modeling
  • ETL
  • OLAP Cube Design
  • Front End Development
  • Report Development
  • Performance Tuning
  • Query Optimization
  • Quality Assurance
  • Rolling out to Production
  • Production Maintenance
  • Incremental Enhancements
Each phase listed above represents a typical data warehouse design step, and its discussion has several sections:
  • Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.
  • Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.
  • Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.
  • Possible Pitfalls: Things to watch out for. Some are obvious, some not so obvious, but all of them are real.
The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.

Data Warehouse Architecture


Different data warehousing systems have different structures. Some may have an ODS (operational data store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system.
In general, all data warehouse systems have the following layers:
  • Data Source Layer
  • Data Extraction Layer
  • Staging Area
  • ETL Layer
  • Data Storage Layer
  • Data Logic Layer
  • Data Presentation Layer
  • Metadata Layer
  • System Operations Layer
The picture below shows the relationships among the different components of the data warehouse architecture:
[Figure: Data Warehouse Architecture]
Each component is discussed individually below:
    Data Source Layer
    This represents the different data sources that feed data into the data warehouse. The data source can be of any format: a plain text file, a relational database, another type of database, or an Excel file can all act as a data source.
    Many different types of data can be a data source:
  • Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.
  • Web server logs with user browsing data.
  • Internal market research data.
  • Third-party data, such as census data, demographics data, or survey data. All these data sources together form the Data Source Layer.
    Data Extraction Layer
    Data gets pulled from the data source into the data warehouse system. Some minimal data cleansing is likely at this stage, but major data transformation is not.
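As a minimal sketch of this layer, the example below pulls rows from a source (a CSV file here) into a staging table, applying only light cleansing (trimming whitespace) and no business transformation. SQLite stands in for the staging database; the table and column names are assumptions.

```python
# Sketch of the data extraction layer: source rows land in staging with
# only minimal cleansing (whitespace trimmed), still stored as raw text.
import csv
import io
import sqlite3

# A stand-in for a source system export.
source = io.StringIO("product_id,amount\n 101 , 25.00\n102,13.50\n")

staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_sales (product_id TEXT, amount TEXT)")

for row in csv.DictReader(source):
    staging.execute(
        "INSERT INTO stg_sales VALUES (?, ?)",
        (row["product_id"].strip(), row["amount"].strip()),
    )
staging.commit()

print(staging.execute("SELECT * FROM stg_sales").fetchall())
# → [('101', '25.00'), ('102', '13.50')]
```

Note that the amounts are deliberately left as text: typing, validation, and aggregation belong to the ETL layer, not extraction.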
    Staging Area
    This is where data sits prior to being scrubbed and transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.
    ETL Layer
    This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens.
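The sketch below illustrates the idea with SQLite as a stand-in engine: raw transactional rows are cleansed (a bad row is rejected) and aggregated into a daily sales fact, moving the data from a transactional to an analytical shape. All table names and the sample data are assumptions.

```python
# Sketch of an ETL-layer transformation: cleanse staged rows, then
# aggregate them into an analytical daily sales fact table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE stg_sales (sale_date TEXT, product_id TEXT, amount TEXT);
CREATE TABLE fact_daily_sales (sale_date TEXT, product_id TEXT, total REAL);
""")
db.executemany(
    "INSERT INTO stg_sales VALUES (?, ?, ?)",
    [("2012-06-19", "101", "25.00"),
     ("2012-06-19", "101", "10.00"),
     ("2012-06-19", "999", "oops"),   # dirty row, rejected by cleansing
     ("2012-06-20", "102", "13.50")],
)
# Cleansing + transformation in one pass: keep well-formed amounts only,
# then aggregate to one row per product per day.
db.execute("""
INSERT INTO fact_daily_sales
SELECT sale_date, product_id, SUM(CAST(amount AS REAL))
FROM stg_sales
WHERE amount GLOB '[0-9]*.[0-9]*'
GROUP BY sale_date, product_id
""")
print(db.execute(
    "SELECT * FROM fact_daily_sales ORDER BY sale_date, product_id"
).fetchall())
# → [('2012-06-19', '101', 35.0), ('2012-06-20', '102', 13.5)]
```

The "intelligence" here is exactly the rules applied in the middle step: which rows count, how they are typed, and how they are summarized for analysis.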
    Data Storage Layer
    This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.
    Data Logic Layer
    This is where business rules are stored. Business rules stored here do not affect the underlying data transformation rules, but they do affect what the report looks like.
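A small sketch of the distinction: the business rule below ("flag any region under its sales quota") is applied at report time and changes only how the result is presented, never the stored rows. The quota value and region figures are assumptions for illustration.

```python
# Sketch of the data logic layer: a report-level business rule applied
# on top of stored data, without modifying that data.
QUOTA = 50_000.0  # assumed business rule threshold

stored_rows = [            # as they sit in the data storage layer
    ("East", 62_000.0),
    ("West", 41_000.0),
]

def apply_report_rules(rows):
    # The rule changes how the report looks, not the underlying data.
    return [(region, total, "BELOW QUOTA" if total < QUOTA else "OK")
            for region, total in rows]

print(apply_report_rules(stored_rows))
# → [('East', 62000.0, 'OK'), ('West', 41000.0, 'BELOW QUOTA')]
```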
    Data Presentation Layer
    This refers to the information that reaches the users. It can take the form of a tabular / graphical report in a browser, an emailed report that gets automatically generated and sent every day, or an alert that warns users of exceptions, among others.
    Metadata Layer
    This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the metadata layer.
    System Operations Layer
    This layer includes information on how the data warehouse system operates, such as ETL job status, system performance, and user access history.
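ETL job status is the most commonly consulted piece of this layer. The sketch below is a hypothetical job-status table, with SQLite as a stand-in; the table, column, and job names are assumptions.

```python
# Sketch of the system operations layer: a job-status table recording
# each ETL run, so operators can see what ran and whether it succeeded.
import sqlite3

ops = sqlite3.connect(":memory:")
ops.execute("""
CREATE TABLE etl_job_status (
    job_name   TEXT,
    started_at TEXT,
    status     TEXT   -- 'RUNNING', 'SUCCEEDED', or 'FAILED'
)""")

def record(job, started, status):
    ops.execute("INSERT INTO etl_job_status VALUES (?, ?, ?)",
                (job, started, status))

record("load_sales_fact", "2012-06-19 02:00", "SUCCEEDED")
record("load_hr_dim",     "2012-06-19 02:15", "FAILED")

# An operator's first question each morning: did anything fail overnight?
print(ops.execute(
    "SELECT job_name FROM etl_job_status WHERE status = 'FAILED'"
).fetchall())
# → [('load_hr_dim',)]
```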
Data Warehouse Definition


    Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following:
    A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
    Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
    Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
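To make the integration idea concrete, the sketch below maps two sources' native product codes onto a single warehouse key. The codes and the mapping are assumptions for illustration; in practice this lookup usually lives in a conformed product dimension.

```python
# Sketch of the "integrated" property: two sources identify the same
# product differently; the warehouse resolves both to one key.
conformed_product = {
    ("source_a", "WID-42"): 1001,   # source A's product code (assumed)
    ("source_b", "0042-W"): 1001,   # source B's code for the same product
}

def to_warehouse_key(source, native_id):
    return conformed_product[(source, native_id)]

# Both source rows resolve to the one warehouse identity.
print(to_warehouse_key("source_a", "WID-42"),
      to_warehouse_key("source_b", "0042-W"))
# → 1001 1001
```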
    Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
    Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
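The customer-address example above can be sketched as follows: instead of overwriting the address (as a transaction system would), the warehouse appends a new row and closes the old one, so every historical address survives unchanged. The names and dates are assumptions for illustration.

```python
# Sketch of the time-variant and non-volatile properties: the customer
# dimension keeps every address with the date range it was valid for.
customer_dim = [
    # (customer_id, address,     valid_from,   valid_to)
    (7, "12 Oak St",  "2010-01-01", "2012-03-31"),
    (7, "98 Elm Ave", "2012-04-01", None),       # current address
]

def address_on(customer_id, date):
    # ISO-format date strings compare correctly as plain strings.
    for cid, addr, start, end in customer_dim:
        if cid == customer_id and start <= date and (end is None or date <= end):
            return addr

print(address_on(7, "2011-06-15"))  # → 12 Oak St
print(address_on(7, "2012-06-19"))  # → 98 Elm Ave
```

This row-versioning pattern is the same idea the Slowly Changing Dimension section develops in more detail.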
    Ralph Kimball provided a more concise definition of a data warehouse:
    A data warehouse is a copy of transaction data specifically structured for query and analysis.
    This is a functional view of a data warehouse. Kimball did not address how the data warehouse is built, as Inmon did; rather, he focused on the functionality of a data warehouse.

    Data Warehousing

    This Data Warehousing site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a business intelligence professional, both as a client and as a vendor.
    This site is divided into five main areas.
    - Tools: The selection of business intelligence tools and the selection of the data warehousing team. Tools covered are:
    • Database, Hardware
    • ETL (Extraction, Transformation, and Loading)
    • OLAP
    • Reporting
    • Metadata
    - Steps: This section contains the typical milestones for a data warehousing project, from requirement gathering and query optimization to production rollout and beyond. I also offer my observations on the data warehousing field.
    - Business Intelligence: Business intelligence is closely related to data warehousing. This section discusses business intelligence, as well as the relationship between business intelligence and data warehousing.
    - Concepts: This section discusses several concepts particular to the data warehousing field. Topics include:

    • Dimensional Data Model
    • Star Schema
    • Snowflake Schema
    • Slowly Changing Dimension
    • Conceptual Data Model
    • Logical Data Model
    • Physical Data Model
    • Conceptual, Logical, and Physical Data Model
    • Data Integrity
    • What is OLAP
    • MOLAP, ROLAP, and HOLAP
    • Bill Inmon vs. Ralph Kimball
    - Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data warehousing industry.
    - Glossary: A glossary of common data warehousing terms.