Friday, October 6, 2017

Data Democratization - A Perspective

A couple of years ago, we had the following conversation in our IT office.

Me: Guys, what does it take to send all our application data to our Enterprise Big Data Lake?
TM1: But why do we need to send it? Nobody asked for it!
Me: That’s true, but does it hurt us? Is it too much work?
TM1: It is not too much work, but I am worried about who will see this data.
TM2: That’s my worry too. What if users access the data without understanding what it means?
TM1: Actually, we have a reporting team that develops reports, so traditionally users ask us when they want some data. We can give them canned reports anytime they need.

This brings a question to the forefront: why are we so protective of the data and its usage? I don’t blame the team members, as that is how things were traditionally run. Whenever business users want data, they need to approach IT for new report development. But we are now staring at a new paradigm. Users want democratized data that they can self-serve, without involving what they see as expensive, time-consuming IT development. But is it as fancy as it seems? Let’s examine the case for democratization of data versus the traditional data aristocracy.

Why do we need it?

Firstly, the business needs it because the traditional approach limits access to the data needed for making a decision. The only access users get to data is through a set of canned reports opened up to them. This approach suffers from serious limitations:
  1. Canned reports are developed to cater to static business requirements and do not have the flexibility to adapt to the dynamic nature of business decisions. If we develop a report that answers the question of top-performing business lines and a user wants to make a decision about non-performing business lines, the report does not help.
  2. While new reports can be developed, they take time and money to see the light of day.
  3. Canned reports could not break the barriers of the application-centric data silos that enterprises created. If a decision involves data present in two different systems, we are facing a multi-year, multi-million-dollar investment. Data that exists within the organization is not immediately consumable.
  4. Users relied on Excel extracts and manual number crunching to get the data needed for ad hoc business needs. Thoughtful business users always asked for an Excel export option on every report developed in BI systems.
  5. There were also limitations on the technology side when it came to handling huge amounts of data and translating it into business insights.
  6. Extracting value out of unstructured data is not easy to accomplish.

Panacea?

Business users are excited every time they hear “Self-Service” BI, as it promises them the information they need without having to go through IT. IT is skeptical about business users’ skill levels in playing with data, despite the fact that they are well versed in basic data querying. However, the advent of new technology enablers has made it possible to move forward progressively.

Technology Enablers:

Big Data: The advent of technologies like Hadoop, which can handle big data without users necessarily having to take care of the intrinsic complexities of parallelism and distributed systems, has come as a boon.

Data Lake: Data warehouses have gone out of fashion. The data lake is the new concept of hosting enterprise data in one place, accessible to the users who need it. Data can be ingested into the lake (which is powered by Big Data technologies) in raw format and made accessible to those who need it. Traditional Extract-Transform-Load (ETL) processes have been overtaken by the new Extract-Load-Transform (ELT) method. It is more than a jargon change. With ETL, the business was under pressure to explain to IT what kind of transformations it needed before data could be loaded into the warehouse. With ELT, the business can give a one-line answer: load everything as-is in the source data format, and worry about transformations when it needs them.
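
To make the ETL versus ELT difference concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: an in-memory SQLite table stands in for the lake, and the record fields (order_id, business_line, amount) and function names are hypothetical, not taken from any real system.

```python
import json
import sqlite3

# Hypothetical raw records from a source application (illustrative data only).
SOURCE_RECORDS = [
    {"order_id": 1, "business_line": "retail", "amount": "120.50"},
    {"order_id": 2, "business_line": "wholesale", "amount": "80.00"},
    {"order_id": 3, "business_line": "retail", "amount": "40.25"},
]


def load_raw(conn, records):
    """Extract + Load: store every record untouched, as a JSON string."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders (payload) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )


def revenue_by_business_line(conn):
    """Transform: applied later, only when a consumer asks this question."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_orders"):
        record = json.loads(payload)
        line = record["business_line"]
        # The amount arrives as text in the source format; convert it here.
        totals[line] = totals.get(line, 0.0) + float(record["amount"])
    return totals


conn = sqlite3.connect(":memory:")  # stand-in for the lake
load_raw(conn, SOURCE_RECORDS)
totals = revenue_by_business_line(conn)
print(totals)
```

Nothing about the question "revenue by business line" had to be known at load time; a different question tomorrow only needs a different transform function over the same raw table.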

Real-time: Real-time data exchange between applications is not new. Message queues have served the purpose for years. However, the need of the hour is to handle the massive amounts of data being generated in real time. Technologies like Apache Kafka help in solving this problem.

Cloud: For smaller organizations that cannot afford infrastructure at scale, cloud offerings provide the infrastructure and associated scalability without necessarily having to shell out big bucks upfront.

Analytical Tools: A variety of tools have come into existence that help users come up with insights. Features such as data mining, pattern discovery and machine learning algorithms built into these tools help users find valuable information without having to do a lot of technical work.
     
Challenges:
We cannot overlook the challenges of data democratization in this newfound excitement! The following aspects still need to be addressed.

Data Security:
While unfettered access to data for all is a real fancy thing to achieve, exposing sensitive data can pose threats to the organization. If this data falls into the hands of unauthorized users, it can be risky. So even in the new world of data democracy, Identity and Access Management protocols have to be put in place.
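
As a rough illustration of what such access rules could look like at the column level, here is a small Python sketch. The column names, classification levels and roles are all hypothetical, and a real deployment would rely on the organization's Identity and Access Management platform rather than hand-written dictionaries.

```python
# Hypothetical classification of columns; names are illustrative only.
COLUMN_CLASSIFICATION = {
    "customer_id": "internal",
    "region": "public",
    "revenue": "internal",
    "ssn": "restricted",
}

# Hypothetical roles and the classification levels each one may read.
ROLE_CLEARANCE = {
    "guest": {"public"},
    "analyst": {"public", "internal"},
    "compliance": {"public", "internal", "restricted"},
}


def visible_columns(role):
    """Return the columns a role may see; unknown roles see nothing."""
    allowed = ROLE_CLEARANCE.get(role, set())
    return [col for col, level in COLUMN_CLASSIFICATION.items() if level in allowed]


def filter_row(row, role):
    """Drop every field the role is not cleared to read."""
    return {col: row[col] for col in visible_columns(role) if col in row}


row = {"customer_id": 42, "region": "EMEA", "revenue": 1000, "ssn": "000-00-0000"}
print(filter_row(row, "analyst"))
```

The design choice worth noting is the default: a role that is not listed gets an empty result, so forgetting to configure a role fails closed instead of exposing data.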

Data Quality:
Data quality thresholds have to be defined and adhered to before the data can be consumed by wider groups within the organization. Crucial decisions made with poor-quality data can potentially impact the top line and bottom line.
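
One simple way to express such a threshold is a completeness check that runs before a data set is published. The sketch below is an assumption-laden example: the metric (share of non-empty values per column), the 0.9 thresholds and the sample rows are all made up for illustration, and real quality rules usually cover much more than completeness.

```python
def completeness(rows, column):
    """Share of rows where the column has a non-empty value."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def passes_quality_gate(rows, thresholds):
    """Check each column against its minimum completeness.

    Returns (ok, failures), where failures maps column -> observed score.
    """
    failures = {}
    for column, minimum in thresholds.items():
        score = completeness(rows, column)
        if score < minimum:
            failures[column] = score
    return (not failures), failures


# Hypothetical sample data: one of four rows is missing its amount.
rows = [
    {"business_line": "retail", "amount": 120.5},
    {"business_line": "retail", "amount": None},
    {"business_line": "wholesale", "amount": 80.0},
    {"business_line": "online", "amount": 15.0},
]
thresholds = {"business_line": 0.9, "amount": 0.9}  # assumed minimums

ok, failures = passes_quality_gate(rows, thresholds)
print(ok, failures)
```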

Overwhelming Data:
Having so much data available can be a double-edged sword. Users might get overwhelmed by the abundance at their disposal. Data Officer must be a full-time role that supports data ingestion protocols as well as consumption methods for the users.

Data Governance:
A data governance board must be formed with involvement from various stakeholders. The Data Governor chairing the board must lay out procedures for how data moves around the organization, quality metrics, how data must be consumed, how information should be classified, which information should be restricted from general users, and so on.

Data Dictionary:
Each data element should be tagged with metadata and the context it represents; in the absence of these, it becomes extremely difficult for end users to make sense of the data. A properly defined, robust and unambiguous semantic layer and data dictionary are essential before exposing the data to general users.
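
A data dictionary does not have to start as a heavyweight product. As a minimal sketch, assuming hypothetical column names, units and owners, it can begin as a structured lookup that also tells us which columns are not yet documented:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """Metadata describing one data element (all fields are illustrative)."""
    name: str
    description: str
    unit: str
    owner: str


# Hypothetical entries; a real dictionary would be maintained by data owners.
DATA_DICTIONARY = {
    "net_revenue": DictionaryEntry(
        name="net_revenue",
        description="Revenue after discounts and returns",
        unit="USD",
        owner="Finance",
    ),
    "business_line": DictionaryEntry(
        name="business_line",
        description="Line of business that booked the order",
        unit="n/a",
        owner="Sales Operations",
    ),
}


def describe(column):
    """Return a human-readable description, or flag the gap."""
    entry = DATA_DICTIONARY.get(column)
    if entry is None:
        return f"{column}: no dictionary entry yet"
    return f"{entry.name}: {entry.description} (unit: {entry.unit}, owner: {entry.owner})"


def undocumented(columns):
    """List the columns that still lack a dictionary entry."""
    return [col for col in columns if col not in DATA_DICTIONARY]


print(describe("net_revenue"))
print(undocumented(["net_revenue", "cust_seg_cd"]))
```

A check like undocumented() could be run before a data set is exposed to general users, so that cryptic columns are explained before anyone has to guess what they mean.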

Empowering the users:
Empowering users to make use of the data accessible to them is a critical success factor for this approach. Empowering also includes providing the necessary tools, training and skills so that users do not draw inaccurate conclusions from the data and, as a result, make wrong decisions.

Involving IT:
In this entire paradigm shift, IT has a crucial but rather unconventional role to play. The success of this new future state mostly depends on adapting to new design and architecture patterns. Some of the conventional jobs in IT might shrink, if not be eliminated. IT engineers have to redefine their roles to suit the new landscape and find newer opportunities to help the business.

In summary, while data democracy is good and achievable for the benefit of challenging business dynamics, there are a few things that need to be addressed at the framework level before it can become a reality.