What is a Data Catalog? Does your Business Need One?


Lucy Kelly

Marketing Manager

TL;DR: A data catalog is where an entire organization can store all its internal and external data assets. It’s also used for data governance and collaboration. If your company needs data for analytics, a data management tool, or an additional data monetization channel (or all of the above), a data catalog could be the solution.

What is a data catalog?

Anyone with a vague interest in the data industry or data-as-a-service will have come across the term ‘data catalog’. For the most part, data catalogs are a data management solution. A data catalog is an inventory of the data an organization collects. The catalog provides a comprehensive view of all available data assets. It serves as a centralized repository, storing metadata about each data asset, such as its source, format, and purpose. Employees across a company can then use this catalog to easily discover, understand, and access the data they need. This allows better collaboration across large organizations, more informed decision-making, and improved data governance.

Aside from data management, however, data catalogs are becoming increasingly important for data commerce. Innovative DaaS companies are using established data catalogs to reach new demand. As organizations that run data catalogs, such as KPMG, look to expand those catalogs with external data, DaaS companies can integrate their datasets with the catalog and charge clients for access. This is one of the most exciting developments in data commerce: internal data catalogs investing in external data assets to cater to their clients’ ever-growing analytics needs.

What kind of data is available in a data catalog?

The specific types of data you can obtain from these catalogs may vary, but here are some common categories of data that can be found:

  1. Data source information: Details about databases, tables, files, APIs, and other sources of data.
  2. Data lineage: Information on the origin, transformations, and flow of data across systems and processes.
  3. Data definitions: Definitions and descriptions of data elements, attributes, and fields.
  4. Data quality metrics: Metrics and statistics indicating the quality, completeness, and accuracy of data.
  5. Data access and permissions: Permissions and access controls for data assets.
  6. Data usage and popularity: Insights into the frequency, popularity, and usage patterns of data assets.
  7. Data relationships: Relationships and associations between different data assets and entities.
  8. Data classifications and tags: Categorizations, classifications, and tags assigned to data assets for organization and searchability.
  9. Business glossary: Business-specific terminology, definitions, and context for data assets.
  10. Data policies and governance: Information on data governance policies, standards, and compliance requirements.
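To make these categories more concrete, here is a minimal Python sketch of what a single catalog entry and a keyword search over entries might look like. The `CatalogEntry` class, its field names, the `search` helper, and the sample values are all hypothetical illustrations, not the schema of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data asset (illustrative only)."""
    name: str
    source: str          # data source information, e.g. a warehouse table
    description: str     # data definition in plain language
    owner: str           # team responsible for access and governance
    tags: list[str] = field(default_factory=list)      # classifications and tags
    upstream: list[str] = field(default_factory=list)  # lineage: assets this one derives from
    completeness: float = 1.0                           # a simple data quality metric (0 to 1)


def search(entries, keyword):
    """Return entries whose name, description, or tags mention the keyword."""
    kw = keyword.lower()
    return [
        e for e in entries
        if kw in e.name.lower()
        or kw in e.description.lower()
        or any(kw in t.lower() for t in e.tags)
    ]


# Made-up sample entries, for illustration only.
catalog = [
    CatalogEntry(
        name="revenue_reports",
        source="warehouse.sales.revenue",
        description="Monthly revenue by region",
        owner="finance-team",
        tags=["finance", "reporting"],
        upstream=["billing_events"],
        completeness=0.98,
    ),
    CatalogEntry(
        name="store_locations",
        source="s3://example-bucket/stores.csv",
        description="Latitude and longitude of retail stores",
        owner="operations-team",
        tags=["geospatial", "retail"],
    ),
]

matches = search(catalog, "finance")
print([e.name for e in matches])
```

A real catalog would typically add access controls, usage statistics, and a business glossary on top of a record like this, but the idea is the same: the catalog stores descriptions of data assets so that people can find them.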

The kind of data available in a data catalog also depends on the type of catalog it is. Broadly speaking, there are two kinds of data catalog: internal and external. However, we can also group catalogs according to the size of the organization using them (e.g. enterprise vs. startup), or according to the data category they deal with (e.g. geospatial data).

Internal data catalogs

Internal data catalogs are managed by an enterprise and are accessible to all employees across the organization. For example, the tax and audit advisory company KPMG has an internal data catalog. Anyone working for KPMG can leverage the datasets in the catalog for their projects and analytics. KPMG's catalog predominantly includes financial and alternative data.

External data catalogs

Organizations are increasingly looking to third-party data sources on top of their internal data. As a result, data catalogs are no longer limited to internal company use. External data can be integrated into the data catalog to enrich the organization's data assets and provide new insight. The catalog then provides metadata about the data source, provider, origin, recency, and lineage.
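As a rough illustration, the provider and recency metadata described above could be used to flag external datasets that are overdue for a refresh. The `ExternalDataset` record, its fields, the `stale_datasets` helper, and the sample listings below are hypothetical, and the sketch assumes each provider reports a last-updated date and an expected refresh interval.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ExternalDataset:
    """Hypothetical catalog metadata for a third-party dataset."""
    name: str
    provider: str        # who supplies the data
    origin: str          # how the data was originally collected
    last_updated: date   # recency: when the provider last refreshed it
    refresh_days: int    # assumed refresh interval agreed with the provider


def stale_datasets(datasets, today):
    """Return names of datasets not refreshed within their expected interval."""
    return [
        d.name for d in datasets
        if today - d.last_updated > timedelta(days=d.refresh_days)
    ]


# Made-up listings, for illustration only.
listings = [
    ExternalDataset("footfall_weekly", "Example Provider A",
                    "mobile location panel", date(2024, 1, 1), 7),
    ExternalDataset("company_financials", "Example Provider B",
                    "public filings", date(2024, 1, 10), 90),
]

print(stale_datasets(listings, today=date(2024, 1, 15)))
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the check easy to test and reproduce.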

Data catalogs by size, industry and category

Here are some of the best-known data catalogs in the industry today, as well as some newer companies which have taken an innovative approach to cataloging (and which we predict will become key data commerce players in years to come).

Enterprise data catalogs

Cloud data catalogs

Open-source data catalogs

Data cataloging-as-a-Service

Geospatial data catalogs

Let’s take a look at the three main reasons your business would benefit from working with a data catalog, whether it’s to use for company-wide analytics or to create a new revenue stream.

1. For collaboration

For chief data officers and data analysts working at companies of any size, from enterprises to startups, data catalogs are a must for data management. They’re the best place to collect, understand and collaborate on datasets across the company. For example, a company’s finance team may look into revenue reports over a certain period. This team could then collaborate with analysts at the company to create more powerful data visualizations.

This collaboration doesn’t come at the expense of data quality or data governance. Data catalogs are structured to ensure that all data assets retain their lineage and that they’re used in compliance with the relevant legislation. For this reason, end-to-end data governance is easier with a data catalog, even across large organizations with multiple stakeholders using the data.

2. For innovation

Data catalogs are at the forefront of AI innovation. They enable AI practitioners to easily discover and access relevant data sources, saving time and effort in the data discovery process. Data catalogs also enhance data quality and governance by providing metadata and lineage information, allowing data scientists to understand the origin, transformations, and quality of the data they work with. This leads to improved data preparation and feature engineering, which are crucial steps in AI model development.

As we’ve seen, data catalogs facilitate collaboration and knowledge sharing among AI teams by providing a common platform to document and share insights about data assets. This helps to avoid data silos and promotes reuse of existing data assets, fostering greater efficiency and innovation in AI projects. Essentially, data catalogs streamline the data discovery and preparation process, which accelerates AI innovation and improves the success rate of AI initiatives.

3. For monetization

We’ve covered the buy-side benefits of data cataloging. It’s fast becoming the norm to have a data catalog to improve your company’s analytics, collaboration and innovation capabilities.

But what about the sell-side benefits of the data catalog? The buy-side demand for external data has led to more companies becoming DaaS companies - that is, monetizing their data assets or founding a data provider business. With this comes a need for more channels through which to sell data and reach customers in need of intelligence.

Data catalogs are such a channel. Data providers can work with companies operating data catalogs, whether enterprise, open-source, or otherwise, to list their data products and reach new customers. As data commerce becomes mainstream, each data catalog becomes a potential additional sales channel for data providers.

The easiest way to integrate with data catalogs from global enterprises is by joining Data Commerce Cloud. With one DCC account, you’re able to sync to multiple data catalogs and marketplaces, including Alation. Find out more by getting in contact with our partnerships team.
