It is sometimes difficult to remember that the World Wide Web made its public debut only about 22 years ago, with Tim Berners-Lee’s first post to the alt.hypertext newsgroup. Yet in that relatively short time it has had a profound impact on the availability, accessibility and flexibility of information. The Internet’s ability to turn anybody into a publisher has led to a virtual explosion of data (i.e. “Big Data”) that does not lend itself well to traditional analysis and management tools. Much of this data is unstructured in format, resides in very large data sets and is minimally curated at best (i.e. noisy). At the same time, the business community has started to become aware of the potential benefits of leveraging these new sources of information to enrich existing corporate business data and provide new business intelligence. This presents the industry with a challenge: on one hand, new data sources are being recognized for their inherent value; on the other, turning that data into meaningful business information can be complicated.
So, the democratization of data facilitated by the Internet, while creating new opportunities for businesses to leverage this valuable information, has also required organizations to rethink how they look at and process these new sources of data. Many of the existing data architectures simply do not work well in this new data paradigm. Note that the recommendations contained herein are based on a natural language data domain (i.e. the analytics derived from free-form text, consisting of entities, relationships, properties, conceptual relationships, events, etc.), since data architectures are somewhat dependent on the data domain they are dealing with.
As I stated before, much of this new data is unstructured in format, requiring not only sophisticated analytics to extract meaning from the data, but also new methods for storing and accessing the results of that analysis. Many existing relational data architectures do not lend themselves well to the results of natural language analytics, for the following reasons:
- Natural language analysis generates an explosion of entity data, along with an even greater number of “edges” connecting these entities, which define the myriad relationships that can exist between them. The complexity and number of “join” operations required by relational models to pull meaningful information from this type of data result in sub-optimal query performance.
- A distributed data model is required to provide the necessary scale and performance for storing and retrieving the large volume of information resulting from natural language analytics. Distributing any database requires the data to be segmented (i.e. data sharding) across multiple servers. Experience has shown that sharding relational data models requires very complex sharding algorithms that are difficult to maintain and scale, and thus relational models are not the best choice for this particular type of data.
- Natural language data is, by its nature, variable in content (i.e. not normalized). The resulting analytics are best stored in a “schema-less” data model that provides the necessary flexibility. Relational models require the data to be normalized into predefined schemas that are difficult to change midstream; a schema-less data model can handle data that does not conform to any fixed schema or specific data domain.
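The two points above can be sketched together in a few lines of Python. This is a minimal, illustrative sketch (the entity keys and `NUM_SHARDS` value are made up, not from any real system): each record is a free-form dictionary carrying only the fields it actually has, and a stable hash of the record’s key assigns it to a shard, which is why key-addressed, schema-less data distributes so much more easily than joined relational tables.

```python
import hashlib

NUM_SHARDS = 4  # illustrative cluster size


def shard_for(key: str) -> int:
    """Map an entity key to a shard with a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Schema-less records: each entity carries only the fields it has,
# with no predefined table schema to migrate when a new field appears.
entities = [
    {"id": "person:alice", "type": "Person", "employer": "Acme Corp"},
    {"id": "org:acme", "type": "Organization", "hq": "Nashville"},
    {"id": "person:bob", "type": "Person"},  # no employer known yet
]

# Placement is a pure function of the key, so any node can route a
# read or write without consulting a central catalog.
shards = {i: [] for i in range(NUM_SHARDS)}
for entity in entities:
    shards[shard_for(entity["id"])].append(entity)
```

Because each record is self-contained and located purely by its key, adding servers only changes the modulus (real systems use consistent hashing to avoid wholesale reshuffling); there is no cross-shard join logic to maintain.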
We experienced this challenge several years ago with an implementation in which we analyzed millions of documents with our natural language analytics, generating hundreds of millions of entities, associated properties, and conceptual relationships. This information was persisted in our schema-less database and, at the customer’s request, in a large RDBMS used to perform specific types of entity queries. Serious performance issues arose with the RDBMS, both during data ingestion (the overhead of updating indices as data was added to the tables) and when querying the system. The problems were so pronounced in the pilot demonstration that, at the customer’s request, the RDBMS was dropped from the system design.
The need for schema-less data models led to several relatively recent efforts resulting in the development of new data store paradigms, namely “key/value” data models (commonly grouped under the NoSQL label) and graph data models.
Two of the more popular NoSQL implementations are HBase (derived from Google’s BigTable) and Cassandra (derived from Amazon’s Dynamo). While the de-normalization of data leads to data duplication in these models, this is more than compensated for by their superior storage and query performance, and by the partition tolerance (i.e. ability to distribute data) these approaches offer.
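To make the denormalization trade-off concrete, here is a toy wide-row store in the general style of BigTable/HBase, sketched in Python (the row keys, family names and values are invented for illustration, not taken from any real deployment). The employer name is deliberately duplicated onto every person row so that a single keyed row read answers the query with no join ever computed:

```python
# Toy wide-row store: each row key maps to column families, and each
# family holds arbitrary (column, value) pairs -- no fixed schema.
table = {}


def put(row_key, family, column, value):
    """Write one cell, creating the row and family on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get(row_key, family=None):
    """Read a whole row, or just one column family of it."""
    row = table.get(row_key, {})
    return row.get(family, {}) if family else row


# Denormalization in action: "Acme Corp" is stored on both person rows,
# trading duplicated storage for single-read query performance.
put("person:alice", "profile", "name", "Alice")
put("person:alice", "profile", "employer_name", "Acme Corp")
put("person:bob", "profile", "name", "Bob")
put("person:bob", "profile", "employer_name", "Acme Corp")

print(get("person:alice", "profile")["employer_name"])  # Acme Corp
```

The cost is that renaming the employer means rewriting every row that duplicates it, which is exactly the write-amplification-for-read-speed bargain these systems make.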
Examples of graph databases include Neo4j and Titan. A graph data model essentially consists of nodes and edges (and properties) defining the relationships between nodes. What makes a graph data model so appealing for storing natural language data analysis is that relationships between entities do not have to be computed (or “joined”), because they are inherent in how the data is stored, making it easy to quickly identify n-order relationships between nodes by simply traversing the graph.
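The “traverse instead of join” idea can be shown in a few lines. This is a minimal sketch with made-up node names, not how Neo4j or Titan store data internally: relationships live directly in adjacency lists, so finding everything within n hops is a breadth-first walk rather than a chain of join computations.

```python
from collections import deque

# Toy property graph as adjacency lists: the relationships are part of
# the storage itself, so no join needs to be computed at query time.
graph = {
    "Alice": ["Acme Corp"],
    "Acme Corp": ["Alice", "Bob"],
    "Bob": ["Acme Corp", "Carol"],
    "Carol": ["Bob"],
}


def neighbors_within(start, n):
    """Return every node reachable from `start` in at most n hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    result = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:
            continue  # do not expand past n hops
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                result.add(nxt)
                frontier.append((nxt, depth + 1))
    return result
```

Here `neighbors_within("Alice", 2)` finds Alice’s employer and her second-order connection Bob in one traversal; the relational equivalent would need one self-join per hop.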
This data structure lends itself especially well to use cases involving the understanding of relationships between entities of interest (i.e. people and organizations) and the activities they are involved in. The entities (and their properties) are represented by nodes in the graph, with the edges defining the relationships and activities between these entities. With this model, it is also easy to represent special properties, such as spatial and temporal information, as first-class data objects that provide another view of the data, making it easy not only to determine the time and place associated with a specific entity, but also the flip-side: which entities are connected to a specific time or place.
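A hypothetical sketch of that “flip-side” lookup, with invented entity, place and time identifiers: by modeling times and places as first-class nodes and indexing every edge in both directions, “where and when was this entity?” and “who was at this place?” are each a single dictionary lookup.

```python
# Times and places as first-class nodes, connected to entities by
# labeled edges (identifiers below are illustrative only).
edges = [
    ("person:alice", "at_place", "place:london"),
    ("person:alice", "at_time", "time:2013-06-01"),
    ("person:bob", "at_place", "place:london"),
]

# Index each edge in both directions so either endpoint can be the
# starting point of a query.
forward, reverse = {}, {}
for src, rel, dst in edges:
    forward.setdefault(src, []).append((rel, dst))
    reverse.setdefault(dst, []).append((rel, src))

# Entity -> its times and places:
print(forward["person:alice"])
# Place -> the entities connected to it (the flip-side view):
print([src for _, src in reverse["place:london"]])
```

The second lookup returns both Alice and Bob from the same London node, which is precisely the co-location question that is awkward to express over normalized relational tables.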
In order for a business to achieve the maximum benefit from “Big Data”, it must adopt a “silo-less” data approach. By this I mean creating an enterprise “Knowledge Layer” (or Knowledge Graph) that correlates the various data resources into a common view of the data, resolving entities across all of the data and identifying important facts and relationships about these resolved entities. Since many businesses keep their data separated into different databases (or silos), correlating data between silos often requires a complex workflow that is fairly inflexible and difficult to maintain.
Creating a common knowledge graph requires that key metadata from these data sources be harmonized into a common, domain-independent knowledge graph, such as the one provided by Digital Reasoning’s language model. This does not mean consolidating the data into one giant database. Rather, it means creating an abstract data layer that correlates the important information into a common understanding of the key information contained in these databases. Digital Reasoning’s approach to creating this overarching knowledge graph combines structured data with sophisticated unstructured data analysis, providing the type of data correlations across all of the data that leads to the discovery of interesting, actionable business intelligence.
A key advantage of creating a common knowledge graph is that unique insights can be gained by correlating existing corporate information with other data sources available on the Internet. For example, traditional customer information owned by a corporation can be augmented and validated with information gathered from social networks, news articles, government sites, etc. Related to this use case is uncovering relationships between people and organizations from analyzing emails, web postings, chats, social media sites, etc. (i.e. understanding relationships between internal people and outside businesses to determine the best way to approach them for new business).
Another use case from the financial world is regulatory compliance. This involves analyzing emails for potential rogue trading or insider-trading activity, and could also include transcripts of telephone conversations, social media posts and other public sources of information. These types of data sources have historically been underutilized due to their size and because they are often only available in hard-to-analyze formats (i.e. not structured). Many enterprises are now realizing the benefits of enhancing their business intelligence analytics with information from public sources of data, but are also somewhat intimidated by the associated infrastructure and analytical requirements.
So, I would say that there are several really important lessons to be learned from all this. Although some clients initially struggle to understand how to leverage new data architectures, the desire to deliver knowledge back to the business in a way that can substantially improve the business’ customer relations, revenue and productivity, while reducing risks and potential exposures, quickly becomes a compelling event. This, in turn, drives the need to embrace these new data architectures in order to gain benefits that historically were not achievable through traditional data architectures.
As the expression goes, nothing ventured, nothing gained. And in this case, it couldn’t be more appropriate. It’s easier to say that a traditional technology strategy is working rather than undertake the deployment of new technologies. But when new technologies, analytics and data architectures can deliver unparalleled value back to the business, such as an enterprise knowledge graph, it becomes the responsibility of all key stakeholders to consider the change in order to transform the way a business operates and the insights and outcomes a business can derive from the data around it.
When considering HBase and Cassandra in light of the CAP Theorem (Consistency, Availability, and Partition tolerance) for distributed systems, the chief difference between the two implementations is that HBase provides consistency and partition tolerance, while Cassandra generally provides high availability and partition tolerance.
About the Author: Harry Schultz is senior vice president of product development and solutions here at Digital Reasoning. Harry has focused the last 20 years of his career on managing large-scale IT projects and software development groups, with an emphasis on team building and developing the processes and environment necessary to support successful software development efforts.