Posts Tagged ‘White Paper’

Tips for Quality in Data Analytics Projects

July 18, 2011 Leave a comment


If you are thinking of launching a data analytics project at the departmental level. Following best practices in gathering, preparing and loading data into your analytics system will be helpful. This white paper was written by ZENeSYS for iQCodex, the maker of leading data collection and transformation tool – iQCodex. Visit for more information.

What is Data Analytics?
Companies are eager to harness the sea of information around them for deep insights. For example, with the recent standardization of medical health records, health care providers now have access to historical data on chronic illness from a variety of sources, which can be helpful for better diagnosis. The Internet, business databases, news feeds partners, clients and internal company online data availability means access to market conditions in near real time. Because of this, companies are feeling the pressure to process information for operational efficiency and competitive advantage using Data Analytics – the science of analyzing data to develop insights.

Data Quality is Key
A significant job in data analytics task is in data collection, scrubbing and validation or better known as ETL (extraction, transformation and load). According to some studies between 40% and 70% of the cost in a data analytics ends up being spent in extraction, transformation and loading functions.

Business Units Take on Data Analytics Projects
With the arrival of affordable and easy to use ETL tools, business units rather than IT departments are taking on data analytics initiatives. This is understandable because the input and output of a data analytics initiative are the business units themselves. Input being raw data, which only the business units know where to find and output being insights, which again, only business units are capable of defining. If your business unit is considering a data analytics project then it is important to understand the best practices involved in capturing, transforming, validating and updating data for analysis.

Best Practices for Ensuring Data Quality
Extraction, Transformation and Loading (ETL) in simple terms means the job of (1) Collecting or capturing data (2) Ensuring Accuracy (3) Conversion for relevance and homogeneity and (4) Updating for keeping it fresh.

We will explain each of these in more detail now.

Collection of Data: Data will inevitably come from a variety of sources and we can be rest assured that the format will vary from one source to another.If it comes from a structured source such as a database, then it will be in a database language such as SQL. This is good news as structured data are easiest to deal with.

However the bad news is, structured data from external sources such as medical records or financial transactions have a different twist to it. These are likely to be in XML file formats – an industry specific data exchange language. These XML standards tend to vary for each industry e.g. insurance, finance or health.

Non-structured data can come from text, PDF, Word, or Excel files, though in most cases it will be Excel or a comma separated file also known as CSV files. This type of data generally comes from customers, news, industry reports, and internal project documents. Tapping into this category of data is often underestimated. By some estimates, up to 40% to 70% of data in analytics projects comes from MS Excel, MS Access or some other form of desktop or legacy application.
The important thing to note here is that you have to make sure the ETL tool you use is able to handle all your required input formats with equal ease.

Accuracy of Data: The foremost thing in ensuring accuracy is avoiding manual data entry. This means an ETL tool should be able to “read” all data sources electronically. For this to happen the ETL interface should be able to handle all the formats described in the previous section i.e. SQL, XML, Excel, Access, and plain text CSV files.

Electronic reading not only assures accuracy, but it reduces cost, speeds up data collection and most importantly, safeguards against errors and fraud – something that is becoming increasingly important as the regulators tighten the laws.
A good ETL tool will also provide a means of cross checking data for errors. For example, there should be validation checks for date formats, phone numbers, postcodes. Some ETL tools can provide look-up tables and programming capability to cross check for data accuracy.

For example, an insurance claim record may contain claim date for an incident that occurred prior to coverage date. So here a date format check will simply not be enough, we need a way to “validate” the incident date by fetching the policy validity dates before the claim data is allowed to go through.

Data Conversion: Conversion of data is probably the most complex feature in an ETL tool. We will go through several aspects of data conversion requirements. Sometimes data is not in the format we need. A simple example would be the need to convert currency. ETL tools should have the means to covert to the desired alternative units on the fly.

The second type of conversion requirement is semantic normalization. Consider a name that is being referred to in multiple ways e.g. a student, an apprentice, a trainee, or an observer. If they all mean the same thing for your analysis then it has to be normalized into a single preferred label before it enters your analytics engine.

Another type of conversion is a bit more complex than the date conversion example. What if we need to grab only “net of tax sales figures” where only gross sales figures are available at source? Your ETL tool should be able to compute the “net sales” from gross sales numbers by deducting the prevailing tax rates for that region as the source data is captured. This is a capability for allowing a “computed value” as input.

A fourth type of data conversion is the need for “protocol conversion”. With the advent of standardization, interchange of data between organizations is now being facilitated with communications standards (also called protocols) such as SWIFT in the financial services industry, HL7 for health records, ACCORD for insurance industry. A good ETL tool must provide for the necessary “hooks” to translate these standards or protocols and render the data to a format that is ready for analysis in your analytics engine.

The fifth type of conversion challenge is that of handling ad-hoc formats. Even today there are many organizations that have “system islands” meaning different systems are running different computer applications with no connection between them. The challenge here is that a data analytics project may need to tap into data from one or many disjointed system such as human resources, accounts payable, inventory management, or a customer relationship management. Some of these systems may be old with proprietary data formats. In such cases an ETL tool must be able to read comma-separated files – something that older systems are capable of generating. However, conversion from such file formats will never go according to plan.

When an error is encountered, an ETL tool needs to “flag” an exception. An administrator of the ETL tool will then examine the failed conversion step and make the conversion manually. Sophisticated ETL tools will “learn” these exception conditions so the same type of encounter again will not result in a manual intervention request.

A duplicate data is a very common exception condition. Here the same piece of data may appear more than once and quite likely with conflicting parts to it. For example, an accounts payable system has a customer bank account number as XYZ and a purchase order by the same customer shows their bank account number as ABC. In such cases the ETL tool must de-duplicate or in case of a conflict establish the correct bank account number.

Stripping of non-relevant data is another exception condition commonly encountered. This can and will happen when it is coming from legacy systems or an external organization. Take for example a data analytics exercise on eye related problems in diabetic patients. The data sources on patient history will likely contain many other data points apart from eye history. The ETL tool in this case has to systematically filter in only eye related data from diabetic patient records from the incoming source.

Updating of Data: Data Analytics is not a one-time effort. It is an ongoing talk that requires data refreshing on a continuous basis. However, to update data, an ETL tool needs to do more than just fetching new data.

Synchronizing the sequence of data updates: An address change update needs to be done prior to applying new eligibility rates by region in a particular customer record for instance. Each update action should have a way to take into account the dependency of another update or the planned time of update.

Timing of data updates: Even though certain data may be available immediately, the actual update needs to occur at a specific time to ensure no loss of data integrity. This happens often when dealing with legacy or older systems where processing takes place at night and the required data can only be extracted after 5 am. Here the ETL tool needs to ensure that the updates that are dependent on that nightly batch of data must start only after 5 am.

Partial or selective updates: Ad hoc sources of data such as external systems and personal data files are not capable of providing “selective updates”. Meaning they will contain the full set of data records each time. So if you are importing the credit history of your customers from an external source, each time you want an update, you will have to deal with the entire data set – even though you are only interested in those that have changed since your last upload. Sometimes there will be a flag or an indicator on which items have changed since last batch and but most likely not. It is up to the ETL system now to detect those flags and only update the changed records. In case there are no change indicator flags, the ETL tool needs to first determine which records have changed. Doing a full load for each update is not a good idea because it can slow down your Data Analytics server significantly. Another drawback is that you may end up with data that you no longer need.

Changes in transformation logic: It is quite apparent that the entire ETL process is a series of algorithms and business rules for enabling the capture, transformation and loading of data. Over a course of time, these algorithm or logic will need to be modified because of changes at source, discovery of a better way of doing things or change in policy. An ETL tool should be able to allow administrator to do this easily. To modify the formula or logic, the administrator should first be able to understand how it is setup currently and then give them the ability to edit it. This is actually a very important consideration to keep in mind while selecting an ETL tool. Those that do not provide this feature of a transparent setup will become totally unmanageable at some point as people forget how the logic was setup long time ago.

Business process changes: Lets say today your Data Analytics system is setup to import credit ratings directly from a rating agency. Tomorrow your accounts department wants to take on the role of approving the ratings because you have access to multiple rating agency data. Some ETL systems are workflow enabled to accommodate such powerful features of building in a approval processes. iQRoute from iQCodex excels in this features with many different workflow features such as escalating, authorizing, approving, and overriding.

Keeping an audit trail: With the advent of new laws such as Sarbanes Oxley in the US and Solvency II in EU, the need for proper audit trails is becoming vital. Notwithstanding, the impending regulatory pressure on compliance, it is a good practice to have an audit trail feature so if things go wrong, the system can pin point what was changed, who made the changes, and when the change was done.

Ability to track back on changes: Often things go wrong and mistakes are discovered after days or weeks. In such instances the audit trail feature will indicate the source of the problem. However to get back on our feet again, a roll back to the former condition becomes a huge lifesaver. A good ETL system will provide snapshots by date and time stamps of changes made. An administrator then can review each historical change and decide how far to go back to start all over again.

Several organizations are launching Data Analytics initiatives for getting invaluable market and operational insights at the departmental level. However, data analytics outcomes are dependent upon the quality of data. Several affordable tools are available on the market to provide ETL (extract, transform and load) capability. Before you decide to launch a data analytics project, consult this checklist to ensure you select an ETL tool that has all these features. These features are essential for quality of data and therefore the validity of insights derived.


Ingredients of a White Paper – What goes around comes around

Thanks to white papers, a vast amount of knowledge is readily available on the Net today. White papers are instrumental on getting up to speed on concepts we are not familiar with when suddenly there is a need to become an “expert” overnight!

However, what goes around, will eventually come around. This means the more white papers industry contributes, the more will researchers, potential clients, and business will benefit from this ecosystem.

But what is the hallmark of a good white paper?

  1. Subject of Current Interest: It needs to be a topic that will catch attention of the potential client community who may be grappling with sudden changes in the industry such as a new technology, regulation or point of view
  2. Short and Focused: Should not be more than 4 to 5 pages long. This can be easily done by focusing on a very specific area of the issue. Generic or all-encompassing papers can become very lengthy, expensive to create and unlikely to be read by the target community
  3. Proving the Viewpoint: There should be empirical data (facts and figures) to support the insights in the paper. Even though most readers will not follow up on references, these should be provided to indicate that its not a subjective viewpoint
  4. The 80:20 Rule for Impartial Insight versus Plug-in: At least 80% of the paper should provide impartial insight on the topic. In all honesty, it should be fair to provide 80% “education value” in exchange for a maximum 20% “promotional content”. This of course, assumes that the promotional content legitimately fits into the solution  under discussion
  5. Interesting Reading: Should be engaging to read. Must have some charts and/or pictures to break the monotony of words. Pictures are worth a thousand words. In addition, they convey the sense that someone took the time to make it easy to understand and save time for the reader
  6. References must be Quoted: Quoting references not only attributes credit to the original contributors but provides a means of assessing the degree of content legitimacy for the reader