LinkedIn Open Sources Promising WhereHows Data Technology
Companies including Facebook, Google and Netflix have proven to be strong contributors to the open source community, and now LinkedIn is making a significant new contribution.
Its WhereHows technology has been open sourced, which may be of considerable interest to organizations with large combinations of structured and unstructured data. Here are the details.
According to an announcement:
"In modern data-driven businesses, the complexity that arises from fast-paced analytics, data mining and ETL processes makes metadata increasingly important. In this blog post, we share our own journey and a new open source effort that aims to boost productivity and data provenance. WhereHows, a project of the LinkedIn Data team, works by creating a central repository and portal for the processes, people, and knowledge around the most important element of any big data system: the data itself. The repository has captured the status of 50 thousand datasets (with more than 15 petabytes storage footprint across multiple Hadoop, Teradata and other clusters), 14 thousand comments, 35 million job executions and related lineage information."
"At LinkedIn, WhereHows integrates with all our data processing environments and extracts coarse and fine grain metadata from them. Then, it surfaces this information through two interfaces: (1) a web application that enables navigation, search, lineage visualization, annotation, discussion, and community participation and (2) an API endpoint that empowers automation of other data processes and applications."
WhereHows is actively used at LinkedIn, not only as a knowledge-based application, but also as a metadata repository that drives several other projects, such as automated data purging for compliance, multi-colo database replication and more. It is likely to interest enterprises that manage disparate types of data sets.
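To make the idea of a central metadata repository more concrete, here is a minimal sketch in Python. It is not WhereHows code and does not reflect its actual schema or API: the `Dataset` and `MetadataRepository` classes, their fields, and the retention rule are all hypothetical, chosen only to illustrate how recorded job executions can yield lineage and how stored metadata could feed an automated task such as purging.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Set


@dataclass
class Dataset:
    """Hypothetical metadata record for one dataset."""
    name: str
    cluster: str                          # e.g. "hadoop" or "teradata"
    created: date
    retention_days: Optional[int] = None  # None means keep indefinitely
    comments: List[str] = field(default_factory=list)


class MetadataRepository:
    """Toy in-memory stand-in for a central metadata store."""

    def __init__(self) -> None:
        self.datasets: Dict[str, Dataset] = {}
        # Maps a dataset name to the names of its direct inputs.
        self.upstream: Dict[str, Set[str]] = {}

    def register(self, dataset: Dataset) -> None:
        self.datasets[dataset.name] = dataset
        self.upstream.setdefault(dataset.name, set())

    def record_job(self, inputs: List[str], outputs: List[str]) -> None:
        """Record one job execution: every output depends on every input."""
        for out in outputs:
            self.upstream.setdefault(out, set()).update(inputs)

    def lineage(self, name: str) -> Set[str]:
        """Return all direct and indirect upstream datasets of `name`."""
        seen: Set[str] = set()
        stack = [name]
        while stack:
            for parent in self.upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def due_for_purge(self, today: date) -> List[str]:
        """Names of datasets whose retention window has elapsed."""
        return sorted(
            d.name
            for d in self.datasets.values()
            if d.retention_days is not None
            and d.created + timedelta(days=d.retention_days) <= today
        )


repo = MetadataRepository()
repo.register(Dataset("raw_events", "hadoop", date(2016, 1, 1), retention_days=30))
repo.register(Dataset("sessions", "hadoop", date(2016, 1, 5)))
repo.register(Dataset("daily_report", "teradata", date(2016, 1, 6), retention_days=365))
repo.record_job(inputs=["raw_events"], outputs=["sessions"])
repo.record_job(inputs=["sessions"], outputs=["daily_report"])

print(sorted(repo.lineage("daily_report")))
print(repo.due_for_purge(date(2016, 3, 1)))
```

In this sketch a compliance job would simply ask the repository which datasets are past their retention window, rather than scanning each cluster itself. A production system would of course persist this metadata and expose it through a web application and an API, as the announcement describes.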
In addition, LinkedIn has open sourced a series of useful Hadoop tools, as we have covered previously.