Why Data Modeling is the Wrong Tool for Integrating Data Models
May 4, 2009
An integrated data model can connect the applications, functions, and disciplines of a company. It can be a foundation for service oriented architectures, business process automation, and master data management. The process towards establishing an integrated model can be long and winding. While data modeling methodologies are suitable for documenting the end result of this process, they may hinder more than facilitate its progress. In particular, prematurely introducing a precise common representation may alienate groups of stakeholders, leading to a “common” model with a bias towards a few perspectives. An approach more adjusted to group dynamics and social learning is needed. Physical data models and logical information models should be complemented by conceptual knowledge models. This post presents some of the challenges involved, while a later post outlines a knowledge architecture approach to integrating data models.
The Problem is not Primarily Technical
The Butler Group estimates that a typical data integration project spends 20% of its time and resources on implementation, and 80% on developing and agreeing upon a common data architecture. They report that such projects often take longer to complete, and consume more resources, than originally estimated. Resistance to changing established ways of working is commonly thought to be an important obstacle. Such diagnoses contribute little towards understanding or solving the problems. Could it be that the resistance is rational and well founded? Could it be the case that the methods we apply are simply not able to deal with the underlying business, organizational, and social issues?
Most approaches apply established modeling languages like ER (Entity Relationship) or class diagrams. These diagrams precisely represent a single data model, but generally fail to capture multiple views, possibly inconsistent with each other, organized according to different classification principles, dealing with different aspects of the domain, etc. Such representations are needed to support the negotiation of meaning, open-ended communication and emergent social learning processes that lead from a set of local data models to a common information architecture.
How Do We Fail?
Many projects fail because they are unable to reach agreement, or because the common information model becomes too complex for any single person to comprehend, and difficult to maintain as the business reality evolves. Implementation of common data hubs often faces resistance because what was believed to be a common model was in fact dominated by a single discipline or function. Incremental development of a common model can be difficult because the departments and disciplines that participate in the first rounds will often impose their own perspective on the core data structures.
Superficial agreement can create misunderstandings that do not surface until towards the end of the project. We have for instance been involved in a project where 25% of the errors discovered during final testing were due to different interpretations of the data for a single procedure call between the new system and two existing applications. Wrong use of logical data models is often a culprit in these situations.
We have developed a modeling approach that utilizes knowledge architectures to arrive at integrated information and data architectures. By following this approach, you create a conceptual knowledge model suitable for interdisciplinary, cross-functional and cross-organizational communication. The conceptual knowledge architecture should be converted into a logical information model by experts on data management, to ensure that business knowledge is complemented by technical insight.
Semantic Interoperability – But not through Semantic Technologies
A knowledge architecture approach should not be confused with semantic methods, which are really just data modeling tools with an incomprehensible syntax. By focusing on the formalisation of the data models, semantic technologies provide no interpretive freedom for the stakeholders, no room for the contextual and pragmatic meaning of the data. We agree with Friesen that “the goal of increased interoperability both within and between communities will clearly not be achieved through further formalization and abstraction”. We take a pragmatic and contextual perspective on knowledge, not a formal one in which the removal of contextual interpretation is the key objective. We see knowledge models as a social web, not as a semantic web.
Supporting Social Learning Processes
People grasp the world from different points of view. Common understanding is created through open discussions and mutual respect. This demands a certain freedom to interpret terms differently. You should not start enforcing a highly structured logical information model before a common understanding has emerged. Doing so easily causes debates about terminology, rather than open discussion about pragmatic meaning. Disagreement can cause a “war” over the power of definition of the core terms, making it difficult to establish trust among the participants. With a minimum of good will, ambiguity allows people to speak the “same language”, even before they completely agree. Ambiguity is not absence of meaning; it is a prerequisite for participation in an open dialogue to create shared meaning.
This implies that models should be seen as boundary objects, cf. e.g. Lave & Wenger on communities of practice. Boundary objects have a clear identity which is shared across the different groups, to make sure that everybody is “talking about the same thing”. Each community may however have different views on which features define the object, because they apply it for different purposes, in different contexts. While identity is unified and global, description and classification will often be local. A methodology for establishing a common data model should therefore use concrete objects and phenomena as starting points, exemplars and instances, rather than structured classes.
Data and Metadata
Seamless integration and interoperability between applications face barriers on different layers:
- Metadata, data about data, is generally the primary focus, as interpretation and transformation are needed between the languages used by the different applications.
- Data exchange can still be problematic, even with a common language. You still have to transform data values, especially borderline values, and ensure that the identity of elements can be recreated in roundtrips between applications. Data quality is a recurring challenge with most systems.
- Metametadata can also cause problems. Different information architectures, storage structures or processing approaches in application platforms can influence the interpretation of data. For instance, an object class will be managed differently in a relational database, XML, and an object-oriented database. Even within a single paradigm, different applications may encode the same information differently. For instance, there are several ways of encoding specialization and dependencies in relational databases.
Our experience is that most methodologies over-emphasize metadata. A thorough understanding of different solutions for metametadata can yield general rules that will simplify integrated data models.
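To make the metametadata point concrete, here is a minimal sketch of two common ways of encoding specialization in a relational database: a single table with a discriminator column, and one table per subclass. The table names, column names and equipment records are hypothetical, chosen only for illustration. The two reader functions map both encodings to the same neutral form, which is the kind of general rule that can simplify an integrated model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical application A: single table with a discriminator column
    CREATE TABLE a_equipment (
        id TEXT PRIMARY KEY, kind TEXT,
        rated_power_kw REAL, diameter_mm REAL);
    INSERT INTO a_equipment VALUES ('P-101', 'pump', 75.0, NULL);
    INSERT INTO a_equipment VALUES ('V-201', 'valve', NULL, 150.0);

    -- Hypothetical application B: base table plus one table per subclass
    CREATE TABLE b_equipment (id TEXT PRIMARY KEY);
    CREATE TABLE b_pump (
        id TEXT PRIMARY KEY REFERENCES b_equipment(id), rated_power_kw REAL);
    CREATE TABLE b_valve (
        id TEXT PRIMARY KEY REFERENCES b_equipment(id), diameter_mm REAL);
    INSERT INTO b_equipment VALUES ('P-101');
    INSERT INTO b_equipment VALUES ('V-201');
    INSERT INTO b_pump VALUES ('P-101', 75.0);
    INSERT INTO b_valve VALUES ('V-201', 150.0);
""")


def read_single_table(conn):
    """Read encoding A: the subclass is a value in the 'kind' column."""
    rows = conn.execute(
        "SELECT id, kind, rated_power_kw, diameter_mm "
        "FROM a_equipment ORDER BY id")
    result = []
    for id_, kind, power, diameter in rows:
        attrs = {"rated_power_kw": power, "diameter_mm": diameter}
        # Columns that do not apply to this subclass are NULL; drop them.
        attrs = {k: v for k, v in attrs.items() if v is not None}
        result.append((id_, kind, attrs))
    return result


def read_table_per_subclass(conn):
    """Read encoding B: the subclass is implied by which table holds the row."""
    result = []
    for id_, power in conn.execute("SELECT id, rated_power_kw FROM b_pump"):
        result.append((id_, "pump", {"rated_power_kw": power}))
    for id_, diameter in conn.execute("SELECT id, diameter_mm FROM b_valve"):
        result.append((id_, "valve", {"diameter_mm": diameter}))
    return sorted(result, key=lambda row: row[0])


# Both encodings yield the same neutral (id, kind, attributes) records.
same = read_single_table(conn) == read_table_per_subclass(conn)
```

The metadata (pump, valve, rated power, diameter) is identical in both applications; only the storage convention differs, and that difference is handled once, in the readers.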
Problems associated with values and identity are easily overlooked, but they will surface eventually. The Butler Group identifies exchange of product data as a main challenge. Here identification structures are often developed locally within the company, or even specifically for each project. These structures violate IT professionals’ basic tenet that identification and meaning should be separated. For practical reasons, meaningful identification codes and keys are applied to locate the product within structures for system, discipline, size, material, physical location etc. This makes it just as important to tackle data values as it is to deal with terms and languages.
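One way to live with meaningful identification codes is to keep them as aliases while a separate opaque key carries identity. The sketch below assumes a hypothetical code format, system-discipline-size-material-sequence (for example "42-PI-0150-CS-0017"). The field names, the `Registry` class and the use of random keys are illustrative assumptions, not a description of any particular product data system.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical layout of a meaningful code: SYSTEM-DISCIPLINE-SIZE-MATERIAL-SEQ
FIELDS = ("system", "discipline", "size_mm", "material", "sequence")


def parse_code(code):
    """Split a meaningful code into the attributes it encodes."""
    parts = code.split("-")
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected code format: {code!r}")
    attrs = dict(zip(FIELDS, parts))
    attrs["size_mm"] = int(attrs["size_mm"])
    return attrs


@dataclass
class Product:
    key: str                                  # opaque, stable identity
    codes: set = field(default_factory=set)   # every code ever used
    attributes: dict = field(default_factory=dict)


class Registry:
    """Keeps identity stable while meaningful codes change over time."""

    def __init__(self):
        self._by_code = {}

    def register(self, code):
        product = Product(key=uuid.uuid4().hex, codes={code},
                          attributes=parse_code(code))
        self._by_code[code] = product
        return product

    def recode(self, old_code, new_code):
        # The product keeps its key; only the alias and attributes change.
        product = self._by_code[old_code]
        product.codes.add(new_code)
        product.attributes = parse_code(new_code)
        self._by_code[new_code] = product
        return product

    def lookup(self, code):
        return self._by_code[code]


registry = Registry()
valve = registry.register("42-PI-0150-CS-0017")
registry.recode("42-PI-0150-CS-0017", "43-PI-0150-CS-0017")  # moved to system 43
```

After the recode, both the old and the new code resolve to the same product, so references held by other applications survive the change.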
Improving Data Quality
Data errors, imprecision, and shortcomings are quite common. Some have claimed that 25-50% of all product data received from suppliers contain errors. Internally in a company we have experienced that just 30-40% of the relationships between product components were registered in the product data management application, even though the data structures were designed to capture all of them. This made change management, traceability and analysis difficult.
In order to improve data quality, we should again explore the underlying organizational factors. Often one group benefits from the registration of a data set, while another group gets the added burden of inputting the data. A typical example is management reporting, which creates additional work for the employees. Employees seldom see reporting as their most productive activity, and tend to prioritize more urgent or meaningful tasks. On the other hand, management tends to over-emphasize the importance of such data collection, often deciding to introduce new reporting procedures without a sound business case. Consultants should always keep these factors in mind. It is important to grasp the situation of different stakeholders, to investigate dependencies through holistic systems thinking.
Most organizations will have a company culture where norms ensure that necessary information is captured, at least to a certain extent. Mutual dependencies create a common understanding of the importance of keeping others up to date. Formal routines, procedures and technical solutions can also contribute to improved data capture. Still, many stories show that the formal and technical structures cannot deviate too far from social and cultural norms. Applications that require too many fields of data to be registered often cause people to put in meaningless dummy values, just to satisfy the system rules.
In this way, automated data exchange may in some cases decrease the quality of data, even with all kinds of rules and checks implemented. If a technical solution replaces more direct communication between people, the sense of belonging and mutual dependency may weaken. Studies in industry, banking and insurance have shown that lack of knowledge about who uses the information we produce can lead to alienation and poor motivation. Integration solutions should thus include channels for informal human communication, by technical or organizational means. If someone for the nth time is asked about information he should have registered in a computer system, he will often change the way he works. Informal communication channels are also crucial for dealing with exceptions and unforeseen events.
For an integrated data model, problems in data quality imply that you can never assume that the source data are complete or consistent. A common model may increase data quality by bringing together more sources, so that shortcomings in one source can be fixed by adding data from another overlapping source. Often dependencies that are represented as direct relationships in one source may be stored as indirect relationships in a more detailed data set. An active knowledge architecture should, in order to fill the role of a common data model, include rules and processes for this kind of quality improvement.
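A rule of this kind could look roughly like the sketch below. It assumes two hypothetical sources: one that registers direct dependencies between major components, and a more detailed one that records every connection, including intermediate pipe segments. Collapsing the detailed connections yields the direct relationships they imply, which can then be compared with what was actually registered. The function name and the component tags are invented for illustration.

```python
def collapse_indirect(edges, major):
    """Derive direct relationships between major components.

    `edges` are (source, target) pairs from the detailed data set.
    Two major components are directly related when a path connects them
    that passes only through components outside `major`.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    direct = set()
    for start in major:
        stack = list(successors.get(start, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in major:
                direct.add((start, node))   # stop at the first major component
            else:
                stack.extend(successors.get(node, ()))
    return direct


# Hypothetical detailed source: pump -> line segments -> tank -> line -> valve
detailed = [
    ("P-101", "L-1"), ("L-1", "L-2"), ("L-2", "T-301"),
    ("T-301", "L-3"), ("L-3", "V-201"),
]
major_components = {"P-101", "T-301", "V-201"}

# Hypothetical coarse source where only part of the relationships were registered
registered = {("P-101", "T-301")}

derived = collapse_indirect(detailed, major_components)
missing = derived - registered   # candidates for filling the gaps
```

Here `missing` contains the tank-to-valve dependency that the coarse source lacks. Whether such candidates are added automatically or presented to a person for confirmation is an organizational choice rather than a technical one.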
Due to underlying organizational factors, we can never assume that all errors can be corrected. Therefore we should be careful about introducing too strict consistency and integrity rules in a common data model. This would only lead to poor quality data being kept outside of the integrated model, e.g. handled manually. The total overview and data quality would thus be decreased.
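One possible way to apply this, shown as a minimal sketch with hypothetical rule names and record fields, is to let rules annotate records instead of rejecting them. Poor quality data then stays inside the integrated model where it is visible, rather than being pushed out into manual side channels.

```python
def check_record(record, rules):
    """Return the names of the rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]


def load(records, rules):
    """Accept every record, attaching its rule violations as annotations."""
    return [{"data": record, "issues": check_record(record, rules)}
            for record in records]


# Hypothetical rules, for illustration only
rules = {
    "has_material": lambda r: bool(r.get("material")),
    "positive_size": lambda r: isinstance(r.get("size_mm"), (int, float))
                               and r["size_mm"] > 0,
}

records = [
    {"id": "P-101", "material": "CS", "size_mm": 150},
    {"id": "V-201", "material": "", "size_mm": -1},
]

store = load(records, rules)
needs_attention = [entry["data"]["id"] for entry in store if entry["issues"]]
```

Both records end up in `store`, and `needs_attention` lists the one that someone should follow up on.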
The degree of rule-based control in a database must thus be aligned with the organizational culture and practices, the way they are, not the way we want them to be. The need to invest in changing organizational culture and norms is generally underestimated in projects that develop data integration platforms. You can achieve a lot through clearly defined data ownership, and with managers who lead by example. The challenges that remain are however extremely difficult to uncover.
Problems in data quality also imply that the integration platform must “wash” the data coming into shared databases. Wrong values should be removed, and all values must be transformed into correct data types.
More fundamentally, data quality directs attention to the problem of prioritizing which data are critical to the company. Many companies today register data or demand reports which cannot be justified by cost-benefit analysis. Here we have found analysis methods that uncover the customers and providers of data to be useful. In some cases both parties in an interaction view themselves as providers, and neither of them is the customer of the produced result. This opens up for significant cost savings.
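A washing step might be sketched as follows. The list of dummy values, the schema and the sample record are assumptions made for illustration. Real platforms will need domain-specific rules, for example for decimal commas or units.

```python
# Hypothetical placeholder values that people type to satisfy mandatory fields
DUMMY_VALUES = {"", "-", "n/a", "na", "tbd", "unknown", "xxx"}


def wash(record, schema):
    """Convert values to the types in `schema`, setting aside wrong values.

    Returns (clean, rejected): `clean` holds the converted values, while
    `rejected` keeps the raw values that were removed, for later follow-up.
    """
    clean, rejected = {}, {}
    for name, convert in schema.items():
        raw = record.get(name)
        if raw is None:
            continue                      # missing value: nothing to wash
        if isinstance(raw, str):
            raw = raw.strip()
            if raw.lower() in DUMMY_VALUES:
                rejected[name] = raw      # dummy value, not real data
                continue
        try:
            clean[name] = convert(raw)
        except (ValueError, TypeError):
            rejected[name] = raw          # cannot be turned into the right type
    return clean, rejected


schema = {"id": str, "size_mm": int, "weight_kg": float, "material": str}
incoming = {"id": " P-101 ", "size_mm": "0150",
            "weight_kg": "12,5", "material": "xxx"}

clean, rejected = wash(incoming, schema)
```

In this example `clean` keeps the identifier and the size, while the weight (written with a decimal comma) and the dummy material code end up in `rejected` instead of polluting the shared database.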
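A provider and customer analysis can start from something as simple as the sketch below, which lists data sets that someone produces but nobody claims to use. The data set names and roles are hypothetical. In practice this information comes from interviews and workshops, not from a script.

```python
def unconsumed(flows):
    """Return data sets that have providers but no customers."""
    return sorted(name for name, roles in flows.items()
                  if roles["providers"] and not roles["customers"])


# Hypothetical result of asking each group which data sets it provides and uses
flows = {
    "weekly_progress_report": {"providers": ["engineering", "project_office"],
                               "customers": []},
    "component_relationships": {"providers": ["engineering"],
                                "customers": ["maintenance", "procurement"]},
}

candidates_for_removal = unconsumed(flows)
```

The first data set has two groups that see themselves as providers and no customer, which makes it a candidate for discussion before more effort is spent on its quality.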
Towards a Model Integration Methodology
We have previously defined principles for knowledge architecture modeling. This post has outlined some reasons why this approach, with multiple views, multi-dimensional classification, and reflective, instance- and aspect-oriented models, is suitable for supporting the process towards establishing an integrated information architecture. This methodology will be presented in detail in a forthcoming post.