Thursday, 28 June 2012

Data Warehousing - ETL Tool Selection


Buy vs. Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:
  • Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
  • Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.
  • Data volume. Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
ETL Tool Functionalities While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following characteristics:
  • Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, the typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
  • Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.
  • Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy. Popular Tools
    • IBM WebSphere Information Integration (Ascential DataStage)
    • Ab Initio
    • Informatica
    • Talend
  • Data Warehousing - Database/Hardware Selection




    Buy vs. Build

    The only choices here are what type of hardware and database to purchase, as there is basically no way that one can build hardware/database systems from scratch.
    Database/Hardware Selections
    In making selection for the database/hardware platform, there are several items that need to be carefully considered:


  • Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it's mature, and base any testing numbers from there.
  • Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can perform a part of the task, all at the same time. When I first started working with massively parallel computers in 1993, I had thought that it would be the best way for any large computations to be done within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had originally thought.
  • RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that is hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent. True Case: One of the projects I have worked on was with a major RDBMS provider paired with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about the bug not being fixed because the support level for the particular type of hardware that client had chosen was Level 3, which basically meant that no one in the RDBMS support organization will fix any bug particular to that hardware platform.
    Popular Relational Databases
    Popular OS Platforms