Sunday 1 July 2012

Data Warehousing Glossary


Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.
Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.
Conformed Dimension: A dimension that has exactly the same meaning and content when being referred to from different fact tables.
Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.
Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process (as defined by Bill Inmon).
Data Warehousing: The process of designing, building, and maintaining a data warehouse system.
Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.
Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the "fact", or measures.
Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table.
Drill Across: Data analysis across dimensions.
Drill Down: Data analysis to a child attribute.
Drill Through: Data analysis that goes from an OLAP cube into the relational database.
Drill Up: Data analysis to a parent attribute.

Data Warehousing Books


Below is a list of 5 most recently-published books related to data warehousing. You can also view the books according to the following subject areas:
Oracle Warehouse Builder
Oracle Warehouse Builder 11G R2: Getting Started
By
Published on

This book talks about how to use Oracle Warehouse Builder 11G R2.


Microsoft Data Warehouse Toolkit
The Microsoft Data Warehouse Toolkit: With SQL Server 2008 R2 and the Microsoft Business Intelligence Toolset
By
Published on

This book explains how to build a data warehouse using the Microsoft SQL Server 2008 R2 and Office 2010 technology, including discussions on PowerPivot for Excel and SharePoint, Master Data Services, as well as updated capabilities of SQL Server Analysis, Integration, and Reporting Services.


Business Intelligence Strategy
Business Intelligence Strategy; A Practical Guide for Achieving BI Excellence
By
Published on

Geared toward IT management and business executives seeking to excel in business intelligence initiatives, this practical guide explores creating business alignment strategies that help prioritize business requirements, build organizational and cultural strategies, increase IT efficiency, and promote user adoption.


Pentaho Kettle Solutions
Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration
By
Published on

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. Get up to speed on Kettle basics and learn how to apply Kettle to create ETL solutions. Learn how to design and build every phase of an ETL solution.


Data Modeling A Beginners Guide
Data Modeling, A Beginner's Guide
By
Published on

This book teaches you techniques for gathering business requirements and using them to produce conceptual, logical, and physical database designs.


ETL Books OLAP Books Business Intelligence Books
General Books Data Modeling Books Vendor-Specific Books

Saturday 30 June 2012

Data Warehousing - Data Warehouse Team Selection


There are two areas of discussion: First is whether to use external consultants or hire permanent employees. The second is on what type of personnel is recommended for a data warehousing project.
The pros of hiring external consultants are:
1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.
The pros of hiring permanent employees are:
1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.
2. They are less likely to leave. With consultants, whether they are on contract, via a Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.
The following roles are typical for a data warehouse project:
  • Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project.
  • DBA: This role is responsible to keep the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.
  • Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations.
  • ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse.
  • Front End Developer: This person is responsible for developing the front-end, whether it be client-server or over the web.
  • OLAP Developer: This role is responsible for the development of OLAP cubes.
  • Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system.
  • Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and model it into a schema that is suitable for OLAP analysis.
  • QA Group: This role is responsible for ensuring the correctness of the data in the data warehouse. This role is more important than it appears, because bad data quality turns away users more than any other reason, and often is the start of the downfall for the data warehousing project. The above list is roles, and one person does not necessarily correspond to only one role. In fact, it is very common in a data warehousing team where a person takes on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.
  • Data Warehousing - Open Source Business Intelligence


    What is open source business intelligence?

    Open source BI are BI software can be distributed for free and permits users to modify the source code. Open source software is available in all BI tools, from data modeling to reporting to OLAP to ETL.
    Because open source software is community driven, it relies on the community for improvement. As such, new feature sets typically come from community contribution rather than as a result of dedicated R&D efforts.

    Advantages of open source BI tools

    Easy to get started
    With traditional BI software, the business model typically involves a hefty startup cost, and then there is an annual fee for support and maintenance that is calculated as a percentage of the initial purchase price. In this model, a company needs to spend a substantial amount of money before any benefit is realized. With the substantial cost also comes the need to go through a sales cycle, from the RFP process to evaluation to negotiation, and multiple teams within the organization typically get involved. These factors mean that it's not only costly to get started with traditional BI software, but the amount of time it takes is also long.
    With open source BI, the beginning of the project typically involves a free download of the software. Given this, bureaucracy can be kept to a minimum and it is very easy and inexpensive to get started.
    Lower cost
    Because of its low startup cost and the typically lower ongoing maintenance/support cost, the cost for open source BI software is lower (sometimes much lower) than traditional BI software.
    Easy to customize
    By definition, open source software means that users can access and modify the source code directly. That means it is possible for developers to get under the hood of the open source BI tool and add their own features. In contrast, it is much more difficult to do this with traditional BI software because there is no way to access the source code.

    Disadvantages of open source BI tools

    Features are not as robust
    Traditional BI software vendors put in a lot of money and resources into R&D, and the result is that the product has a rich feature set. Open source BI tools, on the other hand, rely on community support, and hence do not have as strong a feature set.
    Consulting help not as readily available
    Most of the traditional BI software - MicroStrategy, Business Objects, Cognos, Oracle and so on, have been around for a long time. As a result, there are a lot of people with experience with those tools, and finding consulting help to implement these solutions is usually not very difficult. Open source BI tools, on the other hand, are a fairly recent development, and there are relatively few people with implementation experience. So, it is more difficult to find consulting help if you go with open source BI.

    Open source BI tool vendors

    Friday 29 June 2012

    Data Warehousing - Metadata Tool Selection


    Buy vs. Build
    Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because doing so requires resources that are intimately familiar with the operational, technical, and business aspects of the data warehouse system, and such resources are difficult to come by. Even when such resources are available, there are often other tasks that can provide more value to the organization than to build a metadata tool from scratch.
    In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays an extremely important role in a successful data warehousing implementation, this does not always mean that a tool is needed to keep all the "data about data." It is possible to, say, keey such information in the repository of other tools used, in a text documentation, or even in a presentation or a spreadsheet.

    Having said the above, though, it is author's believe that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, how metadata in the data warehousing system will be stored.
    Metadata Tool Functionalities
    This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to call this a selection of the metadata strategy. Traditionally, people have put the data modeling information into a tool such as ERWin and Oracle Designer, but it is difficult to extract information out of such data modeling tools. For example, one of the goals for your metadata selection is to provide information to the end users. Clearly this is a difficult task with a data modeling tool.
    So typically what is likely to happen is that additional efforts are spent to create a layer of metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at means, it is clearly inefficient because all that information already resides somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP tool, or the reporting tool.
    There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen to the data warehousing industry. As of right now, though, the author has not really seen that many tools leveraging this standard, so clearly it has not quite caught on yet.
    So what does this mean about your metadata efforts? In the absence of everything else, I would recommend that whatever tool you choose for your metadata support supports XML, and that whatever other tool that needs to leverage the metadata also supports XML. Then it is a matter of defining your DTD across your data warehousing system. At the same time, there is no need to worry about criteria that typically is important for the other tools such as performance and support for parallelism because the size of the metadata is typically small relative to the size of the data warehouse.

    Reporting Tool Selection


    Buy vs. Build
    There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following:
    • Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but they also already have report management systems to make maintenance and support functions easier.
    • Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.
    • Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.
    Reporting Tool Functionalities
    Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.
    Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:
  • Data source connection capabilities In general there are two types of data sources, one the relationship database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, but to connect to the other type of data source, it becomes a difficult exercise in programming.
  • Scheduling and distribution capabilities In a realistic data warehousing usage scenario by senior executives, all they have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say the sales numbers), and that's how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features.
    Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.
  • Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
  • Customization Every one of us has had the frustration over spending an inordinate amount of time tinkering with some office productivity tool only to make the report/presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts jobs much easier, and the time savings are tremendous.
  • Export capabilities The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver.
  • Integration with the Microsoft Office environment Most people are used to work with Microsoft Office products, especially Excel, for manipulating data. Before, people used to export the reports into Excel, and then perform additional formatting / calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience to the users.
    Popular Tools
  • Business Intelligence: OLAP Tool Selection

    Buy vs. Build
    OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch.
    OLAP Tool Functionalities
    Before we speak about OLAP tool selection criterion, we must first distinguish between the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).
    1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When user generates a report request, the MOLAP tool can generate the create quickly because all data is already pre-aggregated within the cube.
    2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
    Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendor recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detail level information, levels where the traditional cubes do not get to for performance and size reasons.
    So what are the criteria for evaluating OLAP vendors? Here they are:
  • Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help loading the data into the cubes as quickly as possible.
  • Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.
  • Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. This is because in many cases, especially for ROLAP implementations, OLAP tools often can be used as a reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.
  • Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
  • Metadata support: Because OLAP tools aggregates the data into the cube and sometimes serves as the front-end tool, it is essential that it works with the metadata strategy/tool you have selected. Popular Tools