Data Linkage and Model Knowledge Discovery through GraphDB

Project Description

Modern enterprise data systems, such as Data Warehouses, have one basic thing in common: complexity. Even a small DW can contain hundreds of tables, as many views and stored procedures, and thousands of columns related by foreign keys and carrying various attributes, like primary keys, indexes, constraints and so on. One of the biggest challenges in such scenarios is Data Governance: systems this complex are hard to maintain and keep under control.
Although many technology solutions exist to help understand and analyze huge data systems, their main limits are complexity, low versatility and price. Our approach consists of a multi-layer solution, focused on technology and domain independence, to keep it as versatile as possible. To reach this goal we designed a three-layer structure, composed of:

  • A first layer, responsible for extracting metadata from feeding systems, like SQL Server, MySQL, Oracle, … This layer is domain independent but technology dependent, and extracts the information that is loaded into the second layer (see the scanner sketch after this list)
  • A second layer, hosted in a GraphDB (Neo4j), responsible for the Knowledge Base itself, where data are stored, related and analyzed. This layer, the core of the entire system, is both domain and technology independent, because the technical interfaces are all confined to the previous one.
  • A third layer, responsible for analytical solutions like custom reports and query interfaces. This layer, at the top of the solution pyramid, is designed to help end users extract information quickly and reliably.
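As a minimal sketch of the first layer, the following Python script scans a SQL Server instance through its INFORMATION_SCHEMA views and dumps table and column metadata to a JSON file. The connection string, output path and extracted attribute set are illustrative assumptions, not the project’s actual scanner.

    # Minimal scanner sketch: extract table/column metadata from SQL Server
    # and dump it to JSON for the Knowledge Base layer. Connection string,
    # output path and attribute set are illustrative assumptions.
    import json
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=MyDW;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    cursor.execute("""
        SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
        FROM INFORMATION_SCHEMA.COLUMNS
        ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION
    """)

    metadata = [
        {"schema": row.TABLE_SCHEMA, "table": row.TABLE_NAME,
         "column": row.COLUMN_NAME, "type": row.DATA_TYPE}
        for row in cursor.fetchall()
    ]

    with open("scanner_output.json", "w") as f:
        json.dump(metadata, f, indent=2)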

These three layers are tied together by:

  • A common workflow manager, written in Python and responsible for the overall execution and coordination of the various tasks (a minimal coordinator is sketched after this list)
  • Data interfaces, like JSON and CSV files, that keep the layers independent and self-consistent.
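A minimal coordinator sketch, assuming hypothetical task functions that hand data off through JSON file paths; the task names and file names are placeholders, not the project’s real pipeline:

    # Workflow-manager sketch: run the three layers in order, passing
    # JSON file paths between them. Task bodies are stubbed placeholders.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("workflow")

    def run_scanner() -> str:
        log.info("Layer 1: extracting metadata")
        return "scanner_output.json"   # path handed to the next layer

    def load_graph(metadata_file: str) -> None:
        log.info("Layer 2: loading %s into Neo4j", metadata_file)

    def build_reports() -> str:
        log.info("Layer 3: exporting analysis results")
        return "report_data.json"

    if __name__ == "__main__":
        metadata = run_scanner()
        load_graph(metadata)
        report = build_reports()
        log.info("Pipeline finished, report data in %s", report)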

As we said, these interfaces connect the layers while still letting us treat them as uncorrelated and fully independent: as we’ll see later, both the first and the third layer can be handled and adapted as preferred, without particular constraints from the second one.
The Knowledge Base layer is loaded with the metadata extracted by the scanner layer, which consists of information regarding:

  • Nodes, like SQL columns, tables, views, stored procedures, schemas and databases
  • Attributes, i.e. node-type-dependent information, like name, contained SQL code, primary key flag, format and so on
  • Relations, like foreign keys, column-to-table membership or dependencies, such as a table read by a view or a stored procedure (a loading sketch follows this list)
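As a rough sketch of how this loading could look, assuming the JSON format produced above and the official neo4j Python driver; the labels, relationship type and credentials are illustrative assumptions:

    # Loading sketch: turn scanner JSON into nodes and relationships in
    # Neo4j. Labels (Table, Column), relationship type (BELONGS_TO) and
    # credentials are illustrative assumptions.
    import json
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with open("scanner_output.json") as f:
        metadata = json.load(f)

    with driver.session() as session:
        for item in metadata:
            # MERGE keeps the load idempotent: re-running the scanner
            # does not duplicate nodes or relationships.
            session.run(
                """
                MERGE (t:Table {schema: $schema, name: $table})
                MERGE (c:Column {name: $column, table: $table, type: $type})
                MERGE (c)-[:BELONGS_TO]->(t)
                """,
                schema=item["schema"], table=item["table"],
                column=item["column"], type=item["type"],
            )

    driver.close()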

This schema, with proper queries, can provide information about the data lineage and impact analysis of a column, relationship network diagrams and so on. It can also reveal SQL loops, which are hard to find using standard relational queries. These types of analysis are in fact based on relationships between objects, a context in which the GraphDB paradigm yields the best results.
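For instance, a sketch of an impact-analysis query: starting from one column, it follows dependency relationships up to a bounded depth to list the objects that a change would affect. The DEPENDS_ON relationship type and the Column label are assumptions about the graph model, not the project’s documented schema.

    # Impact-analysis sketch: find every object that transitively depends
    # on a given column. DEPENDS_ON and the labels are assumed names.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    query = """
    MATCH (c:Column {name: $column})<-[:DEPENDS_ON*1..10]-(obj)
    RETURN DISTINCT labels(obj) AS kind, obj.name AS name
    """

    with driver.session() as session:
        for record in session.run(query, column="CustomerId"):
            print(record["kind"], record["name"])

    driver.close()
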
All the extracted information, stored in the second layer, is queried using the Cypher language, and the output can be consumed through:

  • A standard, tabular approach, exported as CSV or JSON files (see the export sketch after this list)
  • A graphical approach, using Neo4j’s own web interface or, for deeper and more complex analysis, various reporting solutions
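A minimal export sketch, assuming the same graph model as above: it runs a Cypher query and persists the result both as CSV and as JSON, the interchange formats used between layers.

    # Export sketch: run a Cypher query and write the result as both CSV
    # and JSON. Query, labels and file names are illustrative assumptions.
    import csv
    import json
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        rows = [r.data() for r in session.run(
            "MATCH (c:Column)-[:BELONGS_TO]->(t:Table) "
            "RETURN t.name AS table, c.name AS column"
        )]

    with open("columns.json", "w") as f:
        json.dump(rows, f, indent=2)

    with open("columns.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["table", "column"])
        writer.writeheader()
        writer.writerows(rows)

    driver.close()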

Thanks to our split-layer approach, it’s possible, and quite simple, to switch from one analytical solution to another, simply by providing information through standard formats like JSON.

As an example, while designing and developing the third layer, responsible for data visualization, we started from typical commercial solutions, like QlikView, QlikSense and PowerBI. Soon enough, we realized that most of the meaningful information could be better analyzed using custom solutions, more complex to design and develop but more versatile and customizable. So we decided to switch a portion of our reports to a custom development approach, using d3js, a powerful and flexible open solution based on JavaScript. This approach gave us the most interesting deliverables. Since standard formats are used to share data with the third layer, this activity was simpler than one might think: JSON files were simply fed to custom d3js-powered web pages, letting us focus on simple format adaptation and front-end development.
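As an illustration of that format adaptation, the following sketch reshapes a Cypher result into the nodes/links JSON structure commonly consumed by d3js force-directed layouts; the relationship pattern and field names are assumptions.

    # Format-adaptation sketch: reshape graph query results into the
    # {"nodes": [...], "links": [...]} JSON commonly used by d3js force
    # layouts. Graph model and output file name are assumed.
    import json
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        links = [r.data() for r in session.run(
            "MATCH (a)-[:DEPENDS_ON]->(b) "
            "RETURN a.name AS source, b.name AS target"
        )]

    names = {n for link in links for n in (link["source"], link["target"])}
    payload = {
        "nodes": [{"id": name} for name in sorted(names)],
        "links": links,
    }

    with open("network.json", "w") as f:
        json.dump(payload, f, indent=2)

    driver.close()
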
To weigh the pros and cons of our approach, we point out three main scenarios:

  • Data Lineage and Impact Analysis of a DB object, like a column.
  • A process loop finder, to prevent data corruption caused by flawed development design (a detection sketch follows this list)
  • A technical dashboard, to highlight critical DW areas, for example big tables that are frequently read but not properly indexed
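A rough sketch of the loop finder, again assuming the DEPENDS_ON relationship type: the Cypher pattern looks for bounded dependency paths that lead an object back to itself.

    # Loop-finder sketch: detect cyclic dependency chains, i.e. objects
    # that transitively depend on themselves. DEPENDS_ON is an assumed
    # relationship type in the metadata graph.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    query = """
    MATCH path = (n)-[:DEPENDS_ON*1..10]->(n)
    RETURN n.name AS object, length(path) AS loop_length
    ORDER BY loop_length
    """

    with driver.session() as session:
        for record in session.run(query):
            print(f"Loop of length {record['loop_length']} "
                  f"through {record['object']}")

    driver.close()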

These scenarios, drawn from our daily work, can be really hard to analyze and solve, even more so when time is short. By resorting to our method and solution, it’s possible to analyze and assess a given system better and more simply.
Thinking about future evolutions of this framework, we are focusing on two main goals:

  • Evolve this system horizontally, to handle:
    • More feeding systems, developing metadata scanners and code parsers to extract as much information as possible
    • More technical areas, for example ETL workflows (like SSIS, Talend and so on), multidimensional processes and report generation
  • Evolve this system vertically, designing a new layer, technology independent and domain dependent, focused on the ontology of a given scenario; properly interfaced with the Knowledge Base layer, it could give meaningful information about the evolution and management of functional and business themes.

Project Details

Skills Needed: