What’s in store for data engineering in 2023? Seven Predictions

(Figure: number of search queries by profession)

What is the future of data engineering? In this article, I will share my predictions for 2023 and beyond.

Year-ahead prediction articles can feel trite, but they serve a purpose: they help us rise above the daily routine and think about what will bring benefits in the long run.

In addition, they are usually exercises in humility: we are trying to paint a coherent “big picture” of an industry that is rapidly evolving in many directions. Try to find an industry in which people have a greater need to keep their knowledge up to date!

These potential developments become even more important as data organizations begin to evaluate and reassess their priorities in light of the economic recession, and as data engineering investments determine a company’s ability to remain agile, innovative, and competitive.


But the good news is that necessity is the mother of invention, so I predict 2023 will be a big year for DataOps technologies that save teams time, money, and resources, letting engineers focus on building, scaling, and improving performance.

Below are my predictions for some of the most important trends that will continue into next year (in no particular order).

▍ Prediction #1: Data teams will spend more time on FinOps/data cloud cost optimization

As more and more data work moves to the cloud, I foresee that data will account for a growing share of company spend and will get more attention from finance departments.

It is no secret that the macroeconomic environment is shifting from a period of rapid growth and revenue expansion to a more restrained focus on optimizing operations and profitability. We are seeing more CFOs take on important roles in decisions about data teams, and it makes sense that this partnership includes tackling running costs.

Data teams will still need to contribute to the business by making other teams more efficient and by increasing revenue through data monetization; however, an increasingly important third mandate will be cost optimization.

There is still relatively little expertise in this area, because data engineering teams have historically emphasized speed and flexibility to meet the exceptional demands placed on them. Most of their time has been spent writing new queries or piping in new data, rather than optimizing heavy or inefficient queries.
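To make this concrete, here is a minimal sketch of what the first step of such FinOps work might look like: ranking queries by how much they scan and how long they run, using a query-history export. The file name and column names (query_fingerprint, gb_scanned, elapsed_seconds) are hypothetical stand-ins, not any particular warehouse's schema.

```python
import csv
from collections import defaultdict

# Aggregate cost drivers from an exported query-history report.
# "query_history.csv" and its column names are assumptions for illustration.
totals = defaultdict(lambda: {"runs": 0, "gb_scanned": 0.0, "hours": 0.0})

with open("query_history.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = row["query_fingerprint"]  # normalized query text or id
        totals[key]["runs"] += 1
        totals[key]["gb_scanned"] += float(row["gb_scanned"])
        totals[key]["hours"] += float(row["elapsed_seconds"]) / 3600

# The few queries at the top of this list are usually where optimization
# work (pruning, clustering, right-sizing warehouses) pays off first.
worst = sorted(totals.items(), key=lambda kv: kv[1]["gb_scanned"], reverse=True)[:10]
for fingerprint, stats in worst:
    print(fingerprint, stats)
```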

Optimizing data cloud costs is also a major concern for data warehouse and lakehouse providers. Yes, they need consumption to grow, but waste drives customers away. They would rather fuel consumption growth with products like data apps that deliver value to customers and improve retention. They are in this business for the long haul.

That is why there is more and more talk about total cost of ownership; I raised it in a recent conference session with Databricks CEO Ali Ghodsi. We also see all the major players (BigQuery, Redshift, Snowflake) focusing on best practices and features related to cost optimization.

This increase in time spent on cost optimization will likely be accompanied by additional headcount that is more directly tied to ROI, which will make it easier to justify, since hiring in this area is getting special attention (a FinOps Foundation survey predicts that the average number of FinOps employees at a company will grow from 5 to 7). Data teams' time allocation is also likely to shift as they adopt new processes and technologies that improve efficiency in other areas, such as data reliability.

▍ Prediction #2: Responsibilities within data teams will become increasingly specialized

Currently, responsibilities in teams are segmented mainly by data processing stage:

  • data engineers pipe the data in,
  • analytics engineers clean it up,
  • data analysts/data scientists visualize it and draw conclusions from it.

These positions are not going anywhere, but I believe there will be additional segmentation by business function:

  • data reliability engineers will ensure data quality,
  • data product managers will drive adoption and monetization,
  • DataOps engineers will focus on data governance and efficiency,
  • data architects will work on breaking down data silos and on long-term investments.

This mirrors changes in the neighboring field of software development, where the software engineer role began splitting into more niche positions such as DevOps engineer and site reliability engineer. It is a natural evolution as professions mature and become more complex.

▍ Prediction #3: Data will become more decentralized, but centralized data platforms will persist

Predicting that data teams will continue moving toward the data mesh (a concept first articulated by Zhamak Dehghani) is not a particularly bold statement. The data mesh has been one of the most discussed concepts among data teams for several years now.

However, I see more teams taking a break along the way, settling on a system that combines domain teams with a center of excellence or platform team. For many teams, this organizing principle provides the benefits of both systems: the flexibility and alignment of decentralized teams with the efficient standards of centralized teams.

I believe some teams will continue all the way to a data mesh, while for others this pause will be the end point. They will adopt data mesh principles such as domain-oriented architectures, self-service, and treating data as a product, but they will keep a powerful centralized platform with a “special forces” data engineering team.

▍ Prediction #4: Most machine learning models will make it to production (>51%)

I believe that, on average, organizations will be able to successfully deploy more machine learning models to production.

If you attended technology conferences in 2022, you might think we are all living in a machine learning nirvana, because successful projects are often high-impact and fun to talk about. This hides the fact that most ML projects fail before they ever see the light of day.

In October 2020, Gartner reported that only 53% of ML projects make it from prototype to production, and that is in organizations with some AI experience. Companies that are still building their data culture likely fare far worse: by some estimates, 80% or more of projects fail.

There are many complexities:

  • Inconsistency between business needs and machine learning goals.
  • Training machine learning models that cannot be generalized.
  • Problems with testing and validation.
  • Difficulties with deployment and maintenance.

I believe things are starting to change for ML engineering teams thanks to a combination of an increased focus on data quality and the economic pressure to make ML more operational (with more approachable interfaces like notebooks and data apps such as Streamlit playing a big part).
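For context, here is a minimal sketch of what such a data app can look like in Streamlit; the model output is faked with random numbers, and the app name and columns are purely illustrative:

```python
# A minimal Streamlit sketch that puts (stand-in) model output in front of
# business users. Run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demand forecast explorer")

horizon = st.slider("Forecast horizon (days)", 7, 90, 30)

# Stand-in for predictions served by a deployed model.
forecast = pd.DataFrame({
    "day": np.arange(horizon),
    "predicted_demand": 100 + 5 * np.arange(horizon) + np.random.randn(horizon) * 10,
})

st.line_chart(forecast.set_index("day"))
st.caption("Adjust the horizon to see how far ahead the model is asked to predict.")
```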

▍ Prediction #5: Early stages of data contract adoption

(Figure: example of a data contract architecture)

Anyone who follows LinkedIn data discussions knows that data contracts are one of the hottest topics of the year. And for good reason: they are associated with one of the biggest data quality issues facing data science teams.

Unexpected schema changes are the cause of most data quality problems. More often than not, they are the result of an unsuspecting software developer pushing a service update, not knowing that it is wreaking havoc on downstream data systems.
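As an illustration (the contract format and field names below are hypothetical, not a standard), a data contract can be thought of as a schema the producing service promises to uphold, with an automated check that fails when an update would break it:

```python
# Minimal data-contract sketch: the producer declares the schema it promises
# to emit, and this check fails a build when an event no longer matches it.
# The contract contents are hypothetical and for illustration only.
CONTRACT = {
    "name": "orders.order_created",
    "version": 2,
    "fields": {
        "order_id": str,
        "customer_id": str,
        "amount_cents": int,
        "created_at": str,  # ISO-8601 timestamp
    },
}

def validate(event: dict, contract: dict = CONTRACT) -> list:
    """Return a list of contract violations for a single event."""
    errors = []
    for field, expected_type in contract["fields"].items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors

# A service update renamed amount_cents to amount, silently breaking consumers:
event = {"order_id": "o-1", "customer_id": "c-9", "amount": 1299,
         "created_at": "2023-01-05T12:00:00Z"}
print(validate(event))  # ['missing field: amount_cents']
```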

However, it is important to note that despite the online hype, data contracts are still in their infancy. Pioneers of the practice such as Chad Sanderson and Andrew Jones have shown how it can move from theory to practice, while being very clear that the work is still in progress within their own organizations.

I predict that the prominence of this topic will accelerate early-stage adoption in 2023. This will set the stage for a tipping point in 2024, when data contracts either start going mainstream or slowly fade away.

▍ Prediction #6: Data warehouses and data lakes are starting to blur

Until recently, you could say that data lakes were better suited for streaming, AI, and other data science use cases, while data warehouses were better for analytics.

In 2023, however, that claim will meet a lot more pushback.

Over the past year, data warehouses have focused on streaming capabilities. Snowflake announced Snowpipe Streaming and refactored its Kafka connector so that data can be queried immediately upon landing in Snowflake, reducing latency tenfold. Google announced that Pub/Sub can now stream directly into BigQuery, which greatly simplifies connecting streams to the warehouse.

At the same time, data lake platforms like Databricks have added the ability to apply metadata and structure to stored data. Databricks announced Unity Catalog, a feature that makes it easier for teams to add structure, such as metadata, to their data assets.

Table formats have become another front in this arms race: Snowflake announced support for Apache Iceberg tables and, for transactional workloads, Unistore hybrid transactional-analytical processing (HTAP) tables, while Databricks has doubled down on its Delta table format, which combines ACID properties with rich metadata.
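As a small illustration of the lake-table idea (a sketch using the open-source deltalake Python package; the path and columns are made up), each write becomes a logged ACID transaction and the schema travels with the data:

```python
# Minimal sketch of a lake table with warehouse-like properties, using the
# open-source `deltalake` package. Path and columns are illustrative only.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "user_id": ["a", "b", "a"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Each write is an ACID transaction recorded in the table's transaction log.
write_deltalake("/tmp/events_delta", events, mode="append")

table = DeltaTable("/tmp/events_delta")
print(table.version())           # current transaction-log version
print(table.schema())            # schema/metadata stored alongside the data
print(table.to_pandas().head())  # read the table back as a DataFrame
```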

▍ Prediction #7: Teams will get better at fixing data anomalies faster

A 2022 Wakefield Research survey of more than 300 data professionals showed that they spend an average of 40% of their time on data quality. And that's a big number.

The equation for data downtime is: number of incidents × (average time to detection + average time to resolution). The Wakefield survey also showed that organizations experience an average of 61 incidents per month, each taking an average of 4 hours to detect and an additional 9 hours to resolve.
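Plugging those survey averages into the equation gives a sense of scale (a back-of-the-envelope calculation, not a figure reported by the survey itself):

```python
# Back-of-the-envelope data downtime using the Wakefield survey's averages.
incidents_per_month = 61
hours_to_detect = 4
hours_to_resolve = 9

downtime_hours = incidents_per_month * (hours_to_detect + hours_to_resolve)
print(downtime_hours)  # 61 * 13 = 793 hours of data downtime per month
```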

In my conversations with hundreds of data leaders this year, I have noticed that many have reduced detection time by moving from hard-coded static data testing to machine learning-based data monitoring.
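As a toy illustration of the difference (a simplified sketch, not any particular vendor's approach), a monitor can learn a table's normal daily row-count range from history and flag outliers, instead of relying on a hand-written fixed threshold:

```python
# Toy sketch of learned monitoring: derive a table's normal daily row-count
# range from history and flag outliers, instead of a hard-coded threshold.
import statistics

# Historical daily row counts for a table (illustrative numbers).
history = [10_120, 9_980, 10_340, 10_050, 9_870, 10_210, 10_090]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(todays_count: int, z_threshold: float = 3.0) -> bool:
    """Flag a day whose row count falls far outside the learned normal range."""
    z = abs(todays_count - mean) / stdev
    return z > z_threshold

print(is_anomalous(10_150))  # False: within the normal range
print(is_anomalous(2_300))   # True: likely a broken or partial load
```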

This is great, because it opens the door to innovation in automated root cause analysis. Capabilities such as segmentation analysis, query change detection, and data lineage help narrow down the possible causes of a data error, making it easier to tell whether the problem lies in the systems, in the code, or in the data itself.

▍ 2023: The year big data gets smaller and more manageable

As 2022 comes to a close, this is a unique moment for data engineering: the constraints of compute and storage have largely been removed, and big data can be as big as it needs to be. Of course, as always happens, the pendulum will eventually swing back the other way, but that is unlikely to happen next year.

Therefore, the biggest trends will not be about optimizing or scaling architecture, but about the processes that make this expanded universe more orderly, reliable, and accessible.
