Wednesday, 20 June 2012

Data Warehousing Concepts

Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.

Dimensional Data Model:
Dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique, and the two common schema types, star schema and snowflake schema.

Slowly Changing Dimension: This is a common issue facing data warehousing practioners. This section explains the problem, and describes the three ways of handling this problem with examples.

Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of data model.

Logical Data Model: What is a logical data model, its features, and an example of this type of data model.

Physical Data Model: What is a physical data model, its features, and an example of this type of data model.

Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and constrasts the three different types of data models.

Data Integrity:
What is data integrity and how it is enforced in data warehousing.

What is OLAP: Definition of OLAP.

MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they are different from the other, and the advantages and disadvantages of each.

Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the role between data warehouse and data mart.

Factless Fact Table:
A fact table without any fact may sound silly, but there are real life instances when a factless fact table is useful in data warehousing.

Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful.

Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is it important.

Tuesday, 19 June 2012

Data Warehousing Concepts


Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.
Dimensional Data Model: Dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique, and the two common schema types, star schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practioners. This section explains the problem, and describes the three ways of handling this problem with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of this type of data model.
Logical Data Model: What is a logical data model, its features, and an example of this type of data model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and constrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
What is OLAP: Definition of OLAP.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they are different from the other, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the role between data warehouse and data mart.
Factless Fact Table: A fact table without any fact may sound silly, but there are real life instances when a factless fact table is useful in data warehousing.
Junk Dimension: Discusses the concept of a junk dimension: When to use it and why is it useful.
Conformed Dimension: Discusses the concept of a conformed dimension: What is it and why is it important.

Recipes For Failure


This section describes 8 situations where the data warehousing effort is destined to fail, often despite the best of intentions.
1. Focusing On Idealogy Rather Than Practicality
There are many good textbooks on data warehousing out there, and many schools are offering data warehousing classes. Having read the textbooks or completed a course, however, does not make a person a data warehousing guru.
For example, I have seen someone insisting on enforcing foreign key rules between dimension tables and fact tables, starting from the first day of development. This is not prudent for several reasons: 1) The development environment by its very nature means a lot of playing with data -- many updates, deletes, and inserts. Having the foreign key constraint only makes the development effort take longer than necessary. 2) This slows the ETL load time. 3) This constraint is useless because when data gets loaded into the fact table, we already have to go to the dimension table to get the proper foreign key, thus already accomplished what a foreign key constraint would accomplish.
2. Making The Process Unnecessarily Complicated
Data warehousing is inherently a complex enough project, and there is no need to make it even complex.
Here is an example: The source file comes in with one fact. The person responsible for project management insists that this fact be broken into several different metrics during ETL. The ideal sound reasonable: To cut down on the number of rows for the fact table, and so that the front-end tool could generate the report quicker. Unfortunately, there are several problems with this approach: First, the ETL became unnecessarily complex. Not only are "case"-type statements now needed in ETL, but because the fact cannot always be broken down into the corresponding metrics nicely due to inconsistencies in the data, it became necessary to create a lot of logic to take care of the exceptions. Second, it is never advisable to design the data model and the ETL process based on what suits the front end tool the most. Third, at the end of the day, the reports ended up having to sum back these separate metrics back together to get what the users were truly after, meaning that all the extra work was for naught.
3. Lack of Clear Ownership
Because data warehousing projects typically touch upon many different departments, it is natural that the project involves multiple teams. For the project to be successful, though, there must be clear ownership of the project. Not clear ownership of different components of the project, but the project itself. I have seen a case where multiple groups each own a portion of the project. Needless to say, these projects never got finished as quickly as they should, tended to underdeliver, had inflexbile infrastructure (as each group would do what is best for the group, not for the whole project). What I have seen coming out of such projects is that it is tailor-made for finger-pointing. If something is wrong, it's always another group's fault. At the end of the day, nobody is responsible for anything, and it's no wonder why the project is full of problems. Making sure one person/one group is fully accountable for the success of the data warehousing project is paramount in ensuring a successful project.
4. Not Understanding Proper Protocol
Whether you are working as a consultant or an internal resource, you need to understand the organization protocol in order by build a successful data warehouse / data mart.
I have been in a project where the team thought all the development was done, tested, documented, migrated to the production system, and ready to deliver by the deadline, and was ready to celebrate with the bonus money the client promised for an on-time delivery. One thing that was missed, though, was that the client always requires any production system to go through its QA group first. The project manager in this case did not notice that. Hence, rather than delivering a project on time and within budget, the project had to be delayed for an additional four months before it could go online, all because project management was not familiar with the organization protocol.
5. Not Fully Understand Project Impact Before The Project Starts
Here, I am talking about the project impact turns out to be much smaller than anticipated. I have seen data mart efforts where significant amount of resources were thrown into the project, and at the completion of the project, there were only two users. This is clearly the case where someone did not make the proper call, as these resources clearly could have been better utilized in different projects.
6. Try To Bite Off More Than You Can Chew
This means that the project attempts to accomplish something more grandeur than it is supposed to. There are two examples below:
There are data warehousing projects that attempt to control the entire project -- even to the point of dictating how the source system should be built to capture data, and exactly how data should be captured. While the idea is noble -- often during a project, we find that the source system data has a lot of problems, and hence it makes sense to make sure the source system is built right -- in reality this is not practical. First, source systems are built the way they are for specific reasons -- and data analysis should only be one of the concerns, not the only concern. In addition, this will lead to a data warehousing system that is pretty in theory, but very inflexible in reality.
In the same vein, I have seen data mart efforts where the project owner attempts to push his own ideas to the rest of the company, and that person instructed his team to build the system in a way that can accomodate that possibility. Of course, what really happens is that no one else ends up adopting his ideas, and much time and effort were wasted.
7. Blindly Sticking To Certain Standards
I have seen cases where a concerted effort is put on ensuring that different data marts employ the same infrastructure, from the tools used (for example, a certain ETL tool must be used for doing ETL, regardless of how simple that ETL process is) to the user experience (for example, users must be able to access the same set of report selection criteria).
This is an absurd way of building data marts. The very reason that different data marts exist is because there are differences among them, so insisting on making sure they all conform to a certain standard is an exercise in futility. I have seen ETL tools blindly placed on ETL processes that require only a series of SQL statements.
As far as the front end goes, that makes even less sense. First of all, different projects, even though they may be very similar, are still different. Otherwise they would belong to the same project. Furthermore, users really do not care if their views into different data marts have exactly the same look and feel. What they care is whether the data is there on time, and whether the numbers are dependable.
8. Bad Project Management
Bad project management can manifest itself in several ways, and some of the examples listed previously illustrate the danger of bad project management. In short, it is safe to say a bad project manager will certain doom a project.
For data warehousing projects, the key is experience, especially hands-on experience. This is not a job for someone who just completed his or her MBA program, or someone who has only read through all the data warehousing books, but has had no practical experience.

Incremental Enhancements


Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about a new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may be different, the company may originally have 4 sales regions, but now because sales are going so well, now they have 10 sales regions.
Deliverables
  • Change management documentation
  • Actual change to the data warehousing system
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.

Production Maintenance


Task Description
Once the data warehouse goes production, it needs to be maintained. Tasks as such regular backup and crisis management becomes important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.

Rollout To Production


Task Description
Once the QA team gives thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes take up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going production sometimes as easy as sending out an URL via email.
Time Requirement
1 - 3 days.
Deliverables
  • Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating to spend several months to develop and QA the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse

Quality Assurance


Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though, the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
  • QA Test Plan
  • QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.