Data Engineer or die: one developer's story

In early December, I made a fatal mistake, or rather a pivotal decision, in my life as a developer: I joined the Data Engineering (DE) team at my company. In this article, I will share some of the observations I made during my first two months on the DE team.


Why Data Engineering?

My journey to DE began in the summer of 2019, when Xneg and I went to the School of Distributed Computing, and there I attained enlightenment. I became interested in the topic, started studying algorithms and even writing about them. Then I thought about where all this could be applied and quickly found out that the practical application in our company is distributed databases.

What does our team do in general? Like all the fashionable boys and girls, we want to become a Data Driven Company. To make this possible, we need at the very least to build a reliable data warehouse on top of which any report the company needs can be built. Most importantly, the data in this warehouse must be trustworthy. Moreover, it must be possible to restore the state of the system at time t from this data. All of this is complicated by the fact that we live in a brave new world of microservices. This ideology implies that each service implements its own small piece of functionality and that its database is its own business: the service can wipe it every day if it likes, yet we still have to be able to receive and process the state of that service.
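To make "restore the state of the system at time t" concrete, here is a minimal sketch. It assumes the warehouse keeps an append-only log of timestamped events; the `Event` class, its fields and the sample orders are hypothetical illustrations, not our actual model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    # Hypothetical event shape, used only for this illustration.
    entity_id: str
    occurred_at: datetime
    changes: dict  # fields that this event sets on the entity


def state_at(events, t):
    """Rebuild the state of every entity as it was at time t."""
    state = {}
    # Replay events in chronological order and stop at the first one after t.
    for event in sorted(events, key=lambda e: e.occurred_at):
        if event.occurred_at > t:
            break
        state.setdefault(event.entity_id, {}).update(event.changes)
    return state


events = [
    Event("order-1", datetime(2020, 1, 10, 12, 0), {"status": "created", "total": 500}),
    Event("order-1", datetime(2020, 1, 10, 12, 30), {"status": "paid"}),
    Event("order-1", datetime(2020, 1, 10, 13, 15), {"status": "delivered"}),
]

snapshot = state_at(events, datetime(2020, 1, 10, 12, 45))
print(snapshot)  # {'order-1': {'status': 'paid', 'total': 500}}
```

The point of the sketch is that nothing is ever overwritten: as long as every change arrives as an event with a trustworthy timestamp, any past state can be replayed, even if the service itself has long since dropped its database.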

Want to be Data Driven? First become Event Driven

It is not that simple. Events come in different kinds, and a developer and a data engineer look at them differently. Events deserve a separate article, so I will not go into them here. Besides, such an article has already been written by a certain Martin Fowler. I will not take his laurels from him; let him become famous too.

In general, there is plenty to think about, and the area is attractive. It just so happens that in our company a Data Engineer has a much wider area of responsibility than just writing ETL / ELT pipelines. (If you do not know what these abbreviations mean, come to the meetup. Consider this contextual advertising.)

We work on the warehouse architecture, on data modeling, on data security issues and, of course, on the pipelines themselves. We also need to make sure that, on the one hand, product developers are not burdened too much by our presence and are distracted as little as possible by our requirements when building new features, and on the other hand, that analysts and BI teams get warehouse data laid out in a convenient way. That’s how we live.

Difficulties in moving from development

On the very first day of my work, I encountered a number of difficulties that I want to share with you.

1. The first thing I noticed was the lack of tooling and of some established practices. Take test coverage, for example. In development, we have hundreds of testing frameworks. When working with data, everything is more complicated. Yes, we can test ETL pipelines on test data, but we have to do all of this by hand and look for a solution for each specific case. As a result, test coverage is much worse. Fortunately, there is another layer of feedback in the form of monitoring and logs, but this forces us to be reactive rather than proactive, which is unnerving.
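As an illustration of what "testing an ETL pipeline on test data by hand" can look like, here is a small sketch. The `transform` function and its fields are hypothetical, not one of our real pipelines; the idea is only that a transform step written as a pure function can be checked against hand-made input with plain assertions.

```python
def transform(raw_rows):
    """Hypothetical transform step: drop rows without an id, normalise the rest."""
    result = []
    for row in raw_rows:
        if row.get("order_id") is None:
            continue  # rows without an id cannot be loaded into the warehouse
        result.append({
            "order_id": int(row["order_id"]),
            "city": row.get("city", "").strip().lower(),
            "total": round(float(row["total"]), 2),
        })
    return result


def test_transform():
    # Hand-made test data: one valid row and one row that must be dropped.
    raw = [
        {"order_id": "1", "city": " Moscow ", "total": "499.90"},
        {"order_id": None, "city": "Kazan", "total": "100"},
    ]
    expected = [{"order_id": 1, "city": "moscow", "total": 499.9}]
    assert transform(raw) == expected


test_transform()
```

This covers only the transform step. The extract and load steps depend on real sources and sinks, which is exactly where the per-case, do-it-yourself work begins.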

2. The world as seen from DE is not at all what it seems to an ordinary product developer (the reader, of course, is not like that and already knows everything, but I didn’t know, and now I am dealing with the consequences). As a developer, I looked after my microservice, put the data into [database of your choice], saved the service’s state there, fetched things by ID, and all was well. The service is running, orders are being placed, that’s all. If another service asks me to share my state, fine, I’ll publish an event to some RabbitMQ and that’s it. And here we are back at the question of events described above.

What a service needs for its operational work does not suit us for historical data, and so begins the work of hammering out service contracts and collaborating closely with the development teams. You can’t even imagine how many hours it took us to agree on what Event Driven actually means in our company.
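To give a feel for what a service contract might pin down, here is a rough sketch of a check that an incoming event carries everything the warehouse needs. The field names in `REQUIRED_FIELDS` are assumptions made up for this example, not the contract we actually agreed on.

```python
# Hypothetical contract: fields every event must carry to be useful for history.
REQUIRED_FIELDS = {
    "event_id": str,
    "event_type": str,
    "entity_id": str,
    "occurred_at": str,  # ISO 8601 timestamp set by the producing service
    "payload": dict,
}


def contract_violations(event):
    """Return a list of human-readable problems; an empty list means the event is fine."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems


good_event = {
    "event_id": "e-1",
    "event_type": "OrderPaid",
    "entity_id": "order-1",
    "occurred_at": "2020-01-10T12:30:00Z",
    "payload": {"total": 500},
}
# Enough for another service to react to, but useless for history: no timestamp.
bad_event = {"event_id": "e-2", "event_type": "OrderPaid", "entity_id": "order-1", "payload": {}}

print(contract_violations(good_event))  # []
print(contract_violations(bad_event))   # ['missing field: occurred_at']
```

An event without a timestamp is perfectly fine for a consumer that only needs to react right now, which is why developers and data engineers end up arguing about the same message.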

3. You need to think with your head. No, I don’t mean that developers don’t think (although who am I to speak for everyone). It’s just that in product development you very often already have some kind of architecture, and you build various things from the backlog. Of course, this requires planning and reflection, but it is assembly-line work where the main challenge is simply to do the job well.

It’s not so simple here, because moving system components from a warm and comfortable monolith into the wild microservice jungle is hard. When a service starts emitting events, you have to revise the logic that fills the warehouse, because the data now looks different. Here you need to think long and hard, not as a developer but as a data engineer. It’s perfectly normal to spend days with a notebook and a pen, or with a marker at the whiteboard. It’s very difficult: I don’t like to think, I like to bang something out and ship it straight to production.

4. Perhaps the most important difficulty is information. What do we do when we lack knowledge? Who said Stack Overflow? Take this person out of the room. We go and read the docs and books on the topic, and there is also the community, which organizes forums, meetups and conferences. Documentation is great, but unfortunately it is often incomplete. We use Cosmos DB in a number of projects. Good luck reading the documentation for that product. Books are the only salvation. Fortunately, they exist and can be found; they contain a lot of fundamental knowledge, and you have to read a lot and constantly. But with the community, things are bad.

Right now it is hard to find even one decent conference or meetup in our field. Of course, there are plenty of meetups with the word Data in the title, but strange abbreviations like ML or AI usually appear next to it. That is not for us: we talk about how to build data warehouses, not about how to play with neural networks. These hipsters have taken over everything. As a result, we have no community. By the way, if you are a Data Engineer and know good communities, please write in the comments.

Conclusions and meetup announcement

What do we have in the end? My first experience tells me that spending some time in a data engineer’s shoes is useful for every developer. It lets you look at things differently and stop being surprised that our eyes bleed when we see how developers treat their data. So if your company has a DE team, just talk to these people and you will learn a lot (about yourself).

And finally, the announcement. Since meetups on our topic are nowhere to be found, we decided to organize our own. Why not, are we any worse? Fortunately, we have the amazing Schvepsss and our friends from New professions lab, who, like us, feel that data engineers are unfairly deprived of attention.

I take this opportunity to invite everyone who cares to our first community meetup with the promising name “DE or DIE”, which will be held on 02.27.2020 in the Dodo Pizza office. Details are on Timepad.

If anything, I’ll be there, so you can tell me in person how wrong I am about developers.
