In this episode of Data Engineering Weekly Radio, we delve into modern data stacks under pressure and the potential consolidation of the data industry. We refer to Matt Turck's four-part article series, which explores the data infrastructure landscape and the Software as a Service (SaaS) products available in data engineering, machine learning, and artificial intelligence.
We discuss how the siloed nature of many data products has led to industry consolidation, which ultimately benefits customers. Throughout our discussion, we touch on how the Modern Data Stack (MDS) movement has produced a range of specialized tools in areas such as ingestion, cataloging, governance, and quality. However, we also acknowledge that as budgets tighten and CFOs become more cautious, the market is now experiencing a push toward bundling and consolidation.
In this consolidation, we explore the roles of large players like Snowflake, Databricks, and Microsoft and cloud companies like AWS and Google. We debate who will be the "control center" of the data workload, as many companies claim to be the central component in the data ecosystem. As hosts, we agree it's difficult to predict the industry's future, but we anticipate the market will mature and settle soon.
We discussed the potential consolidation of various tools and categories in the modern data stack, including ETL, reverse ETL, data quality, observability, and data catalogs. Consolidation is likely, as many of these tools share common ground and their users can benefit from a unified experience. We also explored how tools like dbt, Airflow, and Databricks could emit information about data lineage, potentially leading to a "catalog of catalogs" that centralizes the visualization and governance of data.
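To make the "catalog of catalogs" idea concrete, here is a minimal sketch in Python. It assumes each tool can report lineage as simple upstream/downstream dataset pairs. The `LineageEdge` type, the `merge_lineage` and `upstream_of` helpers, and the dataset names are all hypothetical and do not reflect any real tool's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One hypothetical lineage fact: `source` reported that `upstream` feeds `downstream`."""
    source: str      # tool that emitted the edge (illustrative label only)
    upstream: str
    downstream: str


def merge_lineage(edges):
    """Combine edges from all tools into one graph: downstream -> set of upstreams."""
    graph = defaultdict(set)
    for edge in edges:
        graph[edge.downstream].add(edge.upstream)
    return dict(graph)


def upstream_of(graph, dataset):
    """Return every dataset that `dataset` transitively depends on."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Made-up edges, as if emitted by three different tools.
edges = [
    LineageEdge("airflow", "raw.orders", "staging.orders"),
    LineageEdge("dbt", "staging.orders", "marts.revenue"),
    LineageEdge("databricks", "marts.revenue", "ml.revenue_forecast"),
]

graph = merge_lineage(edges)
print(sorted(upstream_of(graph, "ml.revenue_forecast")))
```

Once the edges from separate tools live in one graph, a question such as "what does this forecast depend on?" can be answered across tool boundaries, which is the kind of centralized view the discussion points toward.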
We suggested that the convergence of data quality, observability, and catalogs would revolve around ensuring clean, trusted data that is easily discoverable. We also touched on the role of data lineage and pondered whether the control of data lineage would translate to control over the entire data stack. We considered the possibility that orchestration engines might step into data quality, observability, and catalogs, leading to further consolidation in the industry.
We also acknowledged the shift in conversation within the data community from technology comparisons toward organizational landscapes and how data is produced and consumed. We agreed that there is still much room for innovation in this space and that tools consolidating their features serves users better than tools competing with one another.
We contemplated how tools like dbt might extend their capabilities by tackling other aspects of the data stack, such as ingestion. Additionally, we discussed the potential consolidation in the MLOps space, with various tools stepping on each other's territory as they address customer needs.
Overall, we emphasized the importance of unifying user experiences and blurring the lines between individual categories in the data infrastructure landscape. We also noted the parallels between feature stores and data products, suggesting that there may be further convergence between MLOps and data engineering practices in the future. Ultimately, customer delight and experience are the driving forces behind these developments.
We also discussed ETL's potential future, the rise of zero ETL, and its challenges. Additionally, we touched on the growing importance of data products and contracts, emphasizing the need for a contract-first approach in building successful data products.
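A contract-first approach can be sketched as follows: the producer and consumers agree on a schema before any pipeline is built, and records are checked against it at the boundary. This is a minimal illustration in Python. The `ORDERS_CONTRACT` fields and the `validate` helper are hypothetical, not taken from the episode or from any specific contract tool.

```python
# Hypothetical contract: field name -> (expected type, whether the field is required).
ORDERS_CONTRACT = {
    "order_id": (str, True),
    "amount": (float, True),
    "coupon_code": (str, False),
}


def validate(record, contract):
    """Return a list of contract violations for one record (empty list means it conforms)."""
    errors = []
    for field, (expected_type, required) in contract.items():
        if field not in record or record[field] is None:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields the contract does not know about are also treated as violations.
    for field in record:
        if field not in contract:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"order_id": "A-1", "amount": 19.99}
bad = {"order_id": 42, "discount": 0.1}

print(validate(good, ORDERS_CONTRACT))
print(validate(bad, ORDERS_CONTRACT))
```

Rejecting unexpected fields is a deliberately strict choice for this sketch; a real contract might instead allow additive changes and only flag removals or type changes.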
In conclusion, Matt Turck's blog provided us with an excellent opportunity to discuss and analyze the current trends in the data industry. We look forward to seeing how these trends continue to evolve and shape the future of data management and analytics. Until the next edition, take care, and see you all!
References
https://mattturck.com/mad2023/
https://mattturck.com/mad2023-part-iii/