“The future belongs to those who see possibilities before they become obvious.” – John Sculley
Emails, Word documents, PDFs, spreadsheets, and social media such as instant messaging and news feeds contain millions of facts about entities (people, organizations, and locations). Identifying which entities represent a real-world person, organization or location across millions of documents and merging them to aggregate knowledge about them is a major challenge for enterprises and government agencies.
Our patented machine learning platform performs three steps of analysis: Read, Resolve, and Reason.
- The “Read” step performs a number of in-document text processing steps, including Tokenization, Part of Speech assignment, Named Entity Recognition, Coreference Resolution, Relationship Extraction, and Document Classification.
- The “Resolve” step assembles, organizes, and relates the results from the “Read” step to perform corpus-level Coreference Resolution (Entity Resolution).
- The “Reason” step is used to help users correlate and understand information discovered in the “Read” and “Resolve” steps.
Through our most recently patented innovation, called “System and Method for Coreference Resolution” (U.S. Patent No. 8457950 B1), we have enhanced the analytics that perform Coreference Resolution at the corpus level.
Large-scale entity resolution is made computationally tractable via probabilistic reasoning over factor graphs, using a scalable hierarchical representation of data. This approach follows published state-of-the-art research. Because the data representation is essentially a forest in the graph-theoretic sense, our system can leverage a distributed platform by randomly distributing trees across machines in an iterative manner. Unsurprisingly, doing this successfully at large scale raises various challenges, including but not limited to memory restrictions, power law effects, and the need for high-performance data structures.
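The idea of randomly redistributing trees across machines on each iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production implementation: the function names (`assign_trees`, `local_inference`, `run_rounds`) are hypothetical, trees are represented only by their ids, and the per-machine reasoning step is a placeholder.

```python
import random


def assign_trees(tree_ids, num_machines, rng):
    """Shuffle the trees and deal them out so each machine gets a random share."""
    shuffled = list(tree_ids)
    rng.shuffle(shuffled)
    # Slice with a stride so the shares differ in size by at most one tree.
    return [shuffled[m::num_machines] for m in range(num_machines)]


def local_inference(trees):
    """Placeholder (assumption): reasoning a machine would do over its own trees."""
    return trees


def run_rounds(tree_ids, num_machines, num_rounds, seed=0):
    """Repeat: randomly redistribute trees, reason locally, gather the results."""
    rng = random.Random(seed)
    trees = list(tree_ids)
    history = []
    for _ in range(num_rounds):
        partitions = assign_trees(trees, num_machines, rng)
        results = [local_inference(part) for part in partitions]
        # Gather every machine's trees back together before the next shuffle.
        trees = [tree for part in results for tree in part]
        history.append(partitions)
    return history


if __name__ == "__main__":
    for round_no, parts in enumerate(run_rounds(range(10), 3, 2, seed=1)):
        print(round_no, parts)
```

Because the assignment is reshuffled every round, trees that never shared a machine in one round may share one in the next, which is what lets purely local reasoning eventually consider merges across the whole forest.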
To be able to resolve “GlaxoSmithKline” in one document to “GSK” in a different document, we need to define a similarity function that, at a minimum, computes real-valued similarity scores, with normalization being a bonus. Traditional techniques (such as those based on cosine similarity) tend to run into issues in high dimensions. This similarity function should also define token similarities with fuzzy match support and be able to quantify the similarity of the context surrounding these mentions. Context similarity is quantified using relationships, co-occurring terms, and normalized geo/temporal entities, combined through a weighting scheme. Another challenging aspect is the number of features we would like to store in the hierarchical structure. Storing features in the non-leaf nodes supports faster reasoning but requires memory-efficient representations and dimension reduction techniques.
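As a rough illustration of such a similarity function, the Python sketch below combines a fuzzy name match with a weighted comparison of context features into one normalized score in [0, 1]. Everything in it is an assumption made for illustration: the capital-letter acronym heuristic, the weighted Jaccard measure used for context, the 50/50 weighting, and the function names are hypothetical and do not describe the functions used in our platform.

```python
from difflib import SequenceMatcher


def initials(name):
    """Capital letters of a name, e.g. 'GlaxoSmithKline' -> 'GSK'."""
    return "".join(ch for ch in name if ch.isupper())


def is_acronym_of(short, long_name):
    """True if `short` equals the capital letters of `long_name` (assumed heuristic)."""
    caps = initials(long_name)
    return len(caps) >= 2 and short.upper() == caps


def name_similarity(a, b):
    """Fuzzy string similarity in [0, 1], boosted to 1.0 for an acronym match."""
    fuzzy = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    acronym = 1.0 if is_acronym_of(a, b) or is_acronym_of(b, a) else 0.0
    return max(fuzzy, acronym)


def context_similarity(ctx_a, ctx_b):
    """Weighted Jaccard over feature -> non-negative weight dictionaries."""
    keys = set(ctx_a) | set(ctx_b)
    shared = sum(min(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    total = sum(max(ctx_a.get(k, 0.0), ctx_b.get(k, 0.0)) for k in keys)
    return shared / total if total else 0.0


def mention_similarity(name_a, ctx_a, name_b, ctx_b, name_weight=0.5):
    """Convex combination of name and context similarity, so the result stays in [0, 1]."""
    return (name_weight * name_similarity(name_a, name_b)
            + (1.0 - name_weight) * context_similarity(ctx_a, ctx_b))


if __name__ == "__main__":
    score = mention_similarity(
        "GlaxoSmithKline", {"london": 1.0, "pharmaceutical": 2.0},
        "GSK", {"pharmaceutical": 2.0, "vaccine": 1.0},
    )
    print(round(score, 2))
```

The context dictionaries stand in for the relationships, co-occurring terms, and normalized geo/temporal entities mentioned above; in this sketch the weights are supplied by the caller rather than learned.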
Thus, given the inherent complexity of understanding human language and the desired scale, it is evident that large-scale entity resolution is a challenging task. Analytics improvements can only be obtained through applied machine learning (ML) experiments, which ultimately lead to a better understanding of language processing. To make this feasible from a research and development standpoint, we use Scala, a functional programming language, which is at the core of our product’s coreference components. These rigorous and challenging ML-driven experiments constitute a typical day of work at Digital Reasoning, where we invent new techniques to scale and improve our analytics capabilities every day. The following gives a sense of what a researcher typically works with.
In some of our research testing, a distributed cluster is used to reason over a set of tens of millions of mention references. These references are obtained by running our core analytics stack on unstructured text, making it a true end-to-end system. Time is first spent generating various data structures, adding features, and finally creating the aforementioned forest structure; the iterative core algorithm is then run, and sample resolutions are shown in the following section. Running for a longer time also improves the resolution, owing to the inherent randomness in the algorithm. This is novel because it represents a real trade-off between computational investment and algorithm quality, and it provides more value to end users over time.
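The trade-off between run time and resolution quality can be illustrated with a toy randomized search. The Python sketch below is a deliberately simplified stand-in, not the algorithm described above: it uses greedy stochastic moves rather than probabilistic reasoning over factor graphs, scores pairs with plain string similarity, and all names (`pair_score`, `objective`, `resolve`) and the 0.5 threshold are assumptions made for illustration.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def pair_score(a, b, threshold=0.5):
    """Positive when two mentions look alike, negative otherwise (assumed threshold)."""
    return SequenceMatcher(None, a, b).ratio() - threshold


def objective(mentions, assignment):
    """Sum of pair scores over all pairs of mentions placed in the same cluster."""
    return sum(
        pair_score(mentions[i], mentions[j])
        for i, j in combinations(range(len(mentions)), 2)
        if assignment[i] == assignment[j]
    )


def resolve(mentions, num_steps, seed=0):
    """Randomly propose moving one mention into another's cluster; keep non-worsening moves."""
    rng = random.Random(seed)
    assignment = list(range(len(mentions)))  # every mention starts in its own cluster
    best = objective(mentions, assignment)
    for _ in range(num_steps):
        i, j = rng.sample(range(len(mentions)), 2)
        if assignment[i] == assignment[j]:
            continue
        proposal = assignment.copy()
        proposal[i] = assignment[j]
        score = objective(mentions, proposal)
        if score >= best:
            assignment, best = proposal, score
    return assignment, best


if __name__ == "__main__":
    mentions = [
        "king zahir shah", "former king zahir shah",
        "hyundai motor co", "hyundai motor company",
    ]
    for steps in (5, 50, 500):
        print(steps, resolve(mentions, steps))
```

Since a move is kept only when it does not lower the objective, a longer run with the same seed can never end with a worse score than a shorter one; it simply has more chances to stumble on a useful merge.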
Below are some examples of the resolution, showing unique mentions throughout the text:
ex-king zahir shah
exiled king zahir shah
former afghan king mohammed zahir shah
former afghan king zahir shah
former afghan monarch zahir shah
former king mohammad zahir shah
former king mohammed zahir shah
former king zaher shah
former king zahir shah
king zahir shah
mohammad zahir shah
mohammed zahir shah
dreamworks animation skg inc.
dominant electricity producer
largest power producer
hyundai motor america
hyundai motor co
hyundai motor company
chairperson benazir bhutto
former premier benazir bhutto
former prime minister benazir bhutto
late benazir bhutto
mohtarma benazir bhutto
ms benazir bhutto
opposition leader benazir bhutto
ppp chairperson benazir bhutto
prime minister benazir bhutto
Resolutions like these deliver real-world benefits in circumstances where linking similar information can be the difference between revealing a critical relationship or risk and missing something that is important to a business, organization, or nation. Before you can ‘connect the dots’, you need to have the dots to connect. When references to entities from a wide variety of source material are collected into a single, linked knowledge graph, patterns and relationships become clear links between known dots. By connecting references in the news with comments made in email by employees on a restricted trading list, financial firms are better able to proactively protect themselves from trading abuses and control room violations.
In the legal realm, quickly understanding the who, what, when, and where in the huge electronic document repositories provided during discovery can provide leads for the defense and prosecutors alike. In defense of our nation, being able to understand who is talking about what, even if the “what” is being intentionally concealed, makes all the difference to the intelligence analyst and helps to keep our nation safer.
Digital Reasoning’s customers rely on Synthesys because of the time, thought, and innovation we’ve put into how our platform understands the way people communicate and how it can reveal critical signals from vast amounts of information. Our innovation started more than a decade ago and is a never-ending process within our company. Every day the team at Digital Reasoning is looking at new ways to analyze data and bring actionable information to our customers. This innovation is in our culture and drives us to make great products while solving really complex problems.
For more information on Synthesys and how Digital Reasoning performs entity resolution, visit our resource library at: http://www.digitalreasoning.com/learn-more