A couple of years ago, we were having the following conversation in our IT office.
Me: Guys, what does it take to send all our application data to our Enterprise Big Data Lake?
TM1: But why do we need to send it? Nobody asked for it!
Me: That’s true, but does it hurt us? Is it too much work?
TM1: It is not too much work, but I am worried about who will see this data.
TM2: That’s my worry too. What if users access the data without understanding what it means?
TM1: Actually, we have a reporting team that develops reports, so traditionally, if users want some data, they ask us. We can give them canned reports anytime they need them.
This brings a question to the forefront: why are we so protective about our data and its usage? I don’t blame the team members, as that is how things were traditionally run: whenever a business user wants data, they approach IT for new report development. But we are now staring at a new paradigm. Users want democratized data that can be self-serviced without involving what they see as expensive, time-consuming IT development. But is it as good as it sounds? Let’s examine the case for democratization of data versus the traditional data aristocracy.
Why do we need it?
First, the business needs it: in the traditional approach, there are limits on accessing the data needed to make a decision. The only access users get to data is through a set of canned reports opened up to them. This approach suffers from serious limitations:
1. Canned reports are developed to cater to static business requirements and lack the flexibility to adapt to the dynamic nature of business decisions. If we developed a report that answers the question of top-performing business lines, and a user wants to make a decision about non-performing business lines, the report does not help.
2. While new reports can be developed, they take time and money to see the light of day.
3. Canned reports cannot break the barriers of the application-centric data silos enterprises have created. If a decision involves data held in two different systems, we are facing a multi-year, multi-million-dollar investment. Data that exists within the organization is not immediately consumable.
4. Users relied on Excel extracts and manual number crunching to meet ad-hoc business needs. Thoughtful business users always asked for an Excel export option on every report developed in BI systems.
5. It is also important to highlight that there were limitations on the technology side when it comes to handling huge amounts of data and translating them into business insights.
6. Extracting value out of unstructured data is not easy to accomplish.
Panacea?
Business users are excited every time they hear “self-service BI,” as it gives them the information they need without having to go through IT. IT is skeptical about business users’ skill levels in working with data, even though many are well-versed in basic data querying. However, the advent of new technology enablers has made it possible to make steady progress.
Technology Enablers:
Big Data: The advent of technologies like Hadoop, which can handle big data without forcing developers to manage the intrinsic complexities of parallelism and distributed systems, has come as a boon.
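To make that concrete, here is a minimal sketch using PySpark, one common engine in this ecosystem; the file path and column names are hypothetical examples, not from any real deployment:

```python
# The framework distributes the work; the developer writes only the logic.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-by-region").getOrCreate()

# Read a dataset that may span many files and many machines; the cluster
# handles partitioning and parallelism behind the scenes.
sales = spark.read.parquet("hdfs:///lake/raw/sales/")

# Plain declarative logic; no threads, locks, or node management in sight.
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_sales"))
totals.show()
```

The point is not the aggregation itself, but that nothing in this code acknowledges how many machines it runs on.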
Data Lake: Data warehouses have gone out of fashion. The data lake is the new concept of hosting enterprise data in one place, accessible to the users who need it. Data can be ingested into the lake (which is powered by big data technologies) in raw format and made accessible to those who need it. Traditional Extract-Transform-Load (ETL) processes have been taken over by the new Extract-Load-Transform (ELT) method. It is more than a change of jargon. With ETL, the business was under pressure to explain to IT what transformations it needed before data could be loaded into the warehouse. With ELT, the business can give a one-line answer: load everything as-is, in the source data format, and worry about transformations later, when the data is actually consumed.
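As a minimal illustration of the ELT split, here is a sketch assuming a simple file-based lake; the paths, batch layout, and field names are hypothetical:

```python
# ELT sketch: land data untouched first, shape it only when needed.
import json
import pathlib

RAW = pathlib.Path("lake/raw/orders")          # landing zone: data as-is
CURATED = pathlib.Path("lake/curated/orders")  # shaped views, built on demand

def load_raw(batch: list[dict], batch_id: str) -> None:
    """'EL' step: persist source records unchanged, no upfront modeling."""
    RAW.mkdir(parents=True, exist_ok=True)
    (RAW / f"{batch_id}.json").write_text(json.dumps(batch))

def transform(batch_id: str) -> None:
    """'T' step: runs whenever a consumer actually needs a shaped view."""
    records = json.loads((RAW / f"{batch_id}.json").read_text())
    shaped = [
        {"order_id": r["id"], "amount_usd": round(float(r["amt"]), 2)}
        for r in records
    ]
    CURATED.mkdir(parents=True, exist_ok=True)
    (CURATED / f"{batch_id}.json").write_text(json.dumps(shaped))

load_raw([{"id": 1, "amt": "19.995"}], batch_id="2024-01-01")
transform("2024-01-01")
```

Note that the transformation decisions are deferred to the consumption side, which is exactly what relieves the business of specifying everything upfront.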
Real-time: Real-time data exchange between applications is not new; message queues have served this purpose for years. However, the need of the hour is to handle the massive amount of data being generated in real time. Technologies like Apache Kafka help solve this problem.
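For a flavor of what publishing such a stream looks like, here is a minimal sketch using the kafka-python client; the broker address, topic name, and event fields are hypothetical:

```python
# Publish application events to a Kafka topic as they happen.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event becomes a message on a topic; downstream consumers
# (the lake, dashboards, alerts) subscribe to it independently.
producer.send("clickstream", {"user_id": 42, "page": "/pricing"})
producer.flush()
```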
Cloud: For smaller organizations that cannot afford infrastructure at scale, cloud offerings provide the infrastructure and associated scalability without a huge upfront outlay.
Analytical Tools: A variety of tools have come into existence that help users arrive at insights. Features such as mining useful information, discovering patterns, and built-in machine learning algorithms aid users in finding valuable information without having to do a lot of technical work.
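As a rough illustration of the kind of pattern discovery these tools automate behind a friendly interface, here is the same idea expressed in code with scikit-learn; the sample data is invented:

```python
# Discover customer segments without hand-written rules.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [monthly_spend, visits_per_month]
customers = np.array([[20, 1], [25, 2], [300, 12], [280, 10], [22, 1]])

# Group customers into segments based on similarity alone.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(segments)  # e.g. low-spend vs high-spend clusters
```

Self-service analytics tools wrap this sort of algorithm in point-and-click workflows so users never see the code.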
Challenges:
We cannot rule out the challenges of data democratization amid this new-found excitement! The following aspects still need to be addressed.
Data Security:
While unfettered access to data for everyone sounds wonderful, exposing sensitive data can pose threats to the organization. If this data falls into the hands of unauthorized users, the risk is real. So even in the new world of data democracy, identity and access management protocols have to be put in place.
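A minimal sketch of what such a protocol might reduce to in code, here as simple role-based checks; the roles, dataset names, and classifications are hypothetical:

```python
# Role-based access control over lake datasets.
RESTRICTED = {"hr/salaries", "finance/forecasts"}  # sensitive datasets

ROLE_GRANTS = {
    "analyst": {"sales/orders", "marketing/clicks"},
    "finance_lead": {"sales/orders", "finance/forecasts"},
}

def can_read(role: str, dataset: str) -> bool:
    """Open data is readable by anyone; restricted data needs a grant."""
    if dataset not in RESTRICTED:
        return True
    return dataset in ROLE_GRANTS.get(role, set())

assert can_read("analyst", "sales/orders")
assert not can_read("analyst", "finance/forecasts")
```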
Data Quality:
Data quality thresholds have to be defined and adhered to before the data can be consumed by wider groups within the organization. Any crucial decision made with data of poor quality can hurt the top line and the bottom line.
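For instance, a quality gate might look like the following sketch using pandas; the thresholds and column names are hypothetical and would be set by the governance process:

```python
# Pre-publication quality gate: only data that clears the thresholds
# gets exposed to wider groups of consumers.
import pandas as pd

MAX_NULL_RATE = 0.05  # at most 5% missing values per required column
REQUIRED_COLUMNS = {"order_id", "amount", "region"}

def passes_quality_gate(df: pd.DataFrame) -> bool:
    """Check required columns exist and missing-value rates are acceptable."""
    if not REQUIRED_COLUMNS.issubset(df.columns):
        return False
    null_rates = df[list(REQUIRED_COLUMNS)].isna().mean()
    return bool((null_rates <= MAX_NULL_RATE).all())

sample = pd.DataFrame(
    {"order_id": [1, 2], "amount": [9.5, None], "region": ["NA", "EU"]}
)
print(passes_quality_gate(sample))  # False: 50% of 'amount' is missing
```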
Overwhelming Data:
Having so much data available can be a double-edged sword. Users might get overwhelmed by the abundance at their disposal. Data officer must be a full-time role, shaping data ingestion protocols as well as consumption methods for the users.
Data Governance:
A data governance board must be formed with involvement from various stakeholders. The data governor chairing the board must lay out procedures around how data moves through the organization, quality metrics, how the data must be consumed, how information should be classified, and which information should be restricted from general users.
Data Dictionary:
Each data element should be tagged with metadata and the context it represents; in the absence of this, it becomes extremely difficult for end users to make sense of the data. A properly defined, robust, and unambiguous semantic layer, together with a data dictionary, is essential before exposing the data to general users.
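One lightweight way to picture such a dictionary is as machine-readable entries; the fields and the example element below are hypothetical illustrations:

```python
# A data dictionary entry: metadata and context for one data element.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DictionaryEntry:
    name: str             # canonical element name
    description: str      # business meaning, in plain language
    source_system: str    # where the element originates
    classification: str   # e.g. "public", "internal", "restricted"
    unit: Optional[str] = None

entry = DictionaryEntry(
    name="amount_usd",
    description="Order value converted to US dollars at booking time.",
    source_system="orders_service",
    classification="internal",
    unit="USD",
)
print(entry)
```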
Empowering the users:
Empowering users to make use of the data accessible to them is a critical success factor for this approach. Empowerment also includes providing the necessary tools, training, and skills so that users do not draw inaccurate conclusions from the data and, as a result, make wrong decisions.
Involving IT:
In this entire paradigm shift, IT has a crucial but rather unconventional role to play. The success of this new future state largely depends on adopting new design and architecture patterns. Some conventional IT jobs might shrink, if not disappear. IT engineers have to redefine their roles to suit the new landscape and find new opportunities to help the business.