Monday, 25 June 2012

Data Warehousing - ETL Tool Selection


Buy vs. Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:
  • Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
  • Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch (a minimal sketch of such a routine appears after this list).
  • Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
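
For the "build" route, a hand-built ETL routine can start out as a simple script that extracts from the source, applies the transformation and cleansing rules, and loads the warehouse table. Below is a minimal sketch in Python; the table names, column names, and cleansing rules are illustrative assumptions only, not a recommendation for any particular product or schema.

    import sqlite3  # stand-in for the source and warehouse databases

    def run_etl(source_conn, warehouse_conn):
        # Extract: pull raw rows from the (hypothetical) source table
        rows = source_conn.execute(
            "SELECT customer_id, order_date, amount FROM src_orders"
        ).fetchall()

        # Transform / cleanse: drop rows with missing amounts, truncate the
        # timestamp to a date, and round amounts to two decimals
        cleaned = []
        for customer_id, order_date, amount in rows:
            if amount is None:
                continue
            cleaned.append((customer_id, order_date[:10], round(float(amount), 2)))

        # Load: insert the cleansed rows into the (hypothetical) warehouse fact table
        warehouse_conn.executemany(
            "INSERT INTO fact_orders (customer_id, order_date, amount) VALUES (?, ?, ?)",
            cleaned,
        )
        warehouse_conn.commit()

    # Example wiring; real source systems would use their own drivers
    source = sqlite3.connect("source.db")
    warehouse = sqlite3.connect("warehouse.db")
    run_etl(source, warehouse)
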
ETL Tool Functionalities

While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it is not a must. When you evaluate ETL tools, it pays to look for the following characteristics:
  • Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, typical ETL tools are geared towards strong transformation capabilities or strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there will be many different data transformations, it makes sense to pick a tool that is strong in transformation.
  • Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.
  • Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.

Popular Tools
    • IBM WebSphere Information Integration (Ascential DataStage)
    • Ab Initio
    • Informatica
    • Talend
    Business Intelligence Tools


    The most common tools used for business intelligence are listed below, in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.
    Excel
    Can you guess what the most common business intelligence tool is? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:
    1. It's relatively cheap.
    2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers.
    3. It has most of the functionalities users need to display data.
    In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built.
    Excel is best used for business operations reporting and goals tracking.
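
    As a small, hedged illustration of the "export to Excel" point above, a home-built reporting solution often ends up with a routine like the one below. It uses the pandas library (plus an Excel writer engine such as openpyxl); the column names and file name are made up for the example.

        import pandas as pd

        # Hypothetical report data pulled from the data warehouse
        report = pd.DataFrame({
            "region": ["East", "West", "Central"],
            "sales": [125000, 98000, 143500],
        })

        # Write the numbers out so business users can open them in Excel
        report.to_excel("monthly_sales.xlsx", sheet_name="Sales", index=False)
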
    Reporting tool
    In this discussion, I am including both custom-built and commercial reporting tools together. They give each user some flexibility to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.
    Business operations reporting and dashboard are the most common applications for a reporting tool.
    OLAP tool
    OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.
    OLAP tools are used for multidimensional analysis.
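
    To make "looking at the data from multiple dimensions" concrete, the sketch below uses a pandas pivot table to slice a small, made-up sales dataset by two dimensions at once; an OLAP tool performs the same kind of slicing and dicing against a cube or star schema, at much larger scale.

        import pandas as pd

        # Made-up fact data with two dimensions (region, product) and one measure (sales)
        facts = pd.DataFrame({
            "region":  ["East", "East", "West", "West"],
            "product": ["A", "B", "A", "B"],
            "sales":   [100, 150, 200, 250],
        })

        # Aggregate the measure across both dimensions, similar to an OLAP cross-tab
        cube = facts.pivot_table(values="sales", index="region", columns="product", aggfunc="sum")
        print(cube)
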
    Data mining tool
    Data mining tools are usually used only by very specialized users, and even in large organizations there are typically only a handful of users working with data mining tools.
    Data mining tools are used for finding correlation among different factors.
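
    As a rough illustration of "finding correlation among different factors", the sketch below computes a correlation matrix over a few made-up numeric factors. Dedicated data mining tools go much further (clustering, classification, forecasting), but this is the flavor of the analysis.

        import pandas as pd

        # Made-up factors for a handful of customers
        factors = pd.DataFrame({
            "age":           [25, 34, 45, 52, 61],
            "annual_income": [40000, 55000, 72000, 81000, 90000],
            "yearly_spend":  [1200, 1800, 2500, 2600, 3100],
        })

        # Pairwise correlation between the factors
        print(factors.corr())
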

    Business Intelligence


    Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business intelligence?
    Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructure, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.
    This section includes the following:
    Business intelligence tools: Tools commonly used for business intelligence.
    Business intelligence uses: Different forms of business intelligence.
    Business intelligence news: News in the business intelligence area.

    Conformed Dimension


    A conformed dimension is a dimension that has exactly the same meaning and content when referenced from different fact tables. A conformed dimension can be referenced by multiple fact tables in multiple data marts within the same organization. For two dimension tables to be considered conformed, they must either be identical or one must be a subset of the other. There cannot be any other type of difference between the two tables. For example, two dimension tables that are exactly the same except for the primary key are not considered conformed dimensions.
    Why is a conformed dimension important? This goes back to the definition of a data warehouse being "integrated." Integrated means that even if a particular entity had different meanings and different attributes in the source systems, there must be a single version of this entity once the data flows into the data warehouse.
    The time dimension is a common conformed dimension in an organization. Usually the only rules to consider with the time dimension are whether there is a fiscal year in addition to the calendar year, and the definition of a week. Fortunately, both are relatively easy to resolve. In the case of fiscal vs. calendar year, one may go with either, or an alternative is to have two separate conformed dimensions, one for the fiscal year and one for the calendar year. The definition of a week is also something that can differ in large organizations: finance may use Saturday to Friday, while marketing may use Sunday to Saturday. In this case, we should decide on a definition and move on. The nice thing about the time dimension is that once these rules are set, the values in the dimension table will never change. For example, October 16th will never become the 15th day in October.
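
    As a sketch of how these rules might be encoded when generating the time dimension, consider the following; the July fiscal-year start and the Saturday-to-Friday week are assumptions chosen purely for illustration.

        from datetime import date, timedelta

        FISCAL_YEAR_START_MONTH = 7  # assumption: fiscal year starts in July

        def build_time_dimension(start, end):
            """Generate one row per day with calendar year, fiscal year, and week start."""
            rows = []
            day = start
            while day <= end:
                # Days on or after the fiscal-year start month belong to the next fiscal year
                fiscal_year = day.year + 1 if day.month >= FISCAL_YEAR_START_MONTH else day.year
                # Assumed week definition: Saturday through Friday
                days_since_saturday = (day.weekday() - 5) % 7
                week_start = day - timedelta(days=days_since_saturday)
                rows.append({
                    "date_key": day.isoformat(),
                    "calendar_year": day.year,
                    "fiscal_year": fiscal_year,
                    "week_start_date": week_start.isoformat(),
                })
                day += timedelta(days=1)
            return rows

        # Once generated, these rows never change (October 16th stays October 16th)
        time_dim = build_time_dimension(date(2012, 1, 1), date(2012, 12, 31))
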
    Not all conformed dimensions are as easy to produce as the time dimension. An example is the customer dimension. In any organization with some history, there is a high likelihood that different customer databases exist in different parts of the organization. To achieve a conformed customer dimension means those data must be compared against each other, rules must be set, and data must be cleansed. In addition, when we are doing incremental data loads into the data warehouse, we'll need to apply the same rules to the new values to make sure we are only adding truly new customers to the customer dimension.
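
    The incremental-load point can be shown with a small sketch: apply the same conforming rule to each incoming batch, and insert only those customers whose cleansed natural key is not already in the dimension. The rule used here, a lower-cased email address, is purely an assumption for the example.

        def load_new_customers(existing_dim, incoming_batch):
            """Add only truly new customers to the conformed customer dimension.

            existing_dim: dict mapping natural key -> customer record
            incoming_batch: list of dicts from a source system
            """
            for record in incoming_batch:
                # Assumed conforming rule: match customers on lower-cased email address
                natural_key = record["email"].strip().lower()
                if natural_key not in existing_dim:
                    existing_dim[natural_key] = {
                        "email": natural_key,
                        "name": record["name"].title(),  # simple cleansing rule
                    }
            return existing_dim

        # Example: the second record is a duplicate of the first and is not re-added
        dim = load_new_customers({}, [
            {"email": "Jane@Example.com", "name": "jane smith"},
            {"email": "jane@example.com ", "name": "Jane Smith"},
        ])
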
    Building a conformed dimension is also part of the process in master data management, or MDM. In MDM, one must not only make sure the master data dimensions are conformed, but that conformity also needs to be brought back to the source systems.

    Junk Dimension


    In data warehouse design, we frequently run into a situation where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep that information in the fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues.
    A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content of the junk dimension table is the combination of all possible values of the individual indicator fields.
    Let's look at an example. Assuming that we have the following fact table:
    Fact Table Before Junk Dimension
    In this example, the last 3 fields are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:
    Fact Table With Junk Dimension
    Note that now the number of dimensions in the fact table went from 7 to 5.
    The content of the junk dimension table would look like the following:
    Junk Dimension Example
    In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.
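
    A minimal sketch of how this 12-row junk dimension could be generated is the cross product of the individual indicator values, as shown below. The specific TXN_CODE values are assumptions for illustration, since only the field names are given above; the surrogate key handling is likewise illustrative.

        from itertools import product

        # Possible values of the individual indicator fields
        txn_codes   = ["SALE", "RETURN", "VOID"]   # assumed 3 values for TXN_CODE
        coupon_inds = ["Y", "N"]                   # 2 values for COUPON_IND
        prepay_inds = ["Y", "N"]                   # 2 values for PREPAY_IND

        # Cross product of all indicator values: 3 x 2 x 2 = 12 rows
        junk_dimension = [
            {"JUNK_KEY": key, "TXN_CODE": t, "COUPON_IND": c, "PREPAY_IND": p}
            for key, (t, c, p) in enumerate(product(txn_codes, coupon_inds, prepay_inds), start=1)
        ]

        # Each fact row now carries a single JUNK_KEY instead of the three indicator fields
        lookup = {(r["TXN_CODE"], r["COUPON_IND"], r["PREPAY_IND"]): r["JUNK_KEY"] for r in junk_dimension}
        fact_junk_key = lookup[("SALE", "Y", "N")]
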
    By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This results in a data warehousing environment that offers better performance and is easier to manage.