How to debug a program you don’t have access to


Photo: Intricate Explorer, Unsplash

Today I was reminded of one of my favorite “programming myths”, which may well be an urban legend, and of my own encounter with a “black box” that needed debugging.

The urban legend is about radioactive railway cars from Ukraine that caused bugs in a computer system; you can read it here.

Let’s deal with “black boxes” and what they are today

A black box is a common concept in programming: a system or component we interact with from the outside, without direct access to its code. This can happen for various reasons:

  • You work with third-party software whose developers simply do not disclose the code.
  • You are interacting with an API whose internal logic is abstracted.
  • You do not have the necessary permissions to access the Git repository.
  • Even a system you have full access to can become a de facto black box because of its complexity.
  • The employee who held all the keys and knowledge suddenly quits / disappears / dies.
  • The legacy system consists of a .dll that has “always worked” on the server and was never under version control. Just to look at the code, you have to decompile it, if that is even possible.

All these factors boil down to the same thing: we have a problem that we cannot fix right away, and we do not know what the error is. So we have to get to work.

Our own problem was a combination of all of the above

The list above probably does not cover every situation, because it is literally the list of factors that shaped ours. We had just fired a developer; we had a complex system distributed across multiple servers, with a core .dll that “did its job,” although no one knew quite how, or which job exactly.

It called third-party services, it had practically no logging, and the only backups we had were file-system copies of this .dll, without any source code. As you can imagine, none of this was maintainable; the plan was to rewrite the system in due time, but this error was urgent and had to be eliminated now.

Most of the time the system coped with its work, and we could have lived with a partial solution as long as it behaved consistently. However, roughly every hundredth data set produced incorrect results, and the only thing we could do was debug it from the outside (today we remember this with a smile).

This is one of those times when I felt grateful for the chaotic world I grew up in, because most companies never face problems this serious, and they certainly do not assign a recent graduate to solve them. I was lucky in that sense, and even luckier to have the support of experienced developers throughout the process. Here is how we solved it.

Reproduce the error (ideally on several test cases)

Once you know what the real bug is, the problem is almost solved, but getting there can be incredibly difficult if the bug occurs irregularly. Think back to the train example: if the pattern had not been identified, the cause would have been almost impossible to find. And that is exactly what we are looking for in such a situation: the pattern of when the error occurs.

In our case, we found it relatively easily: we knew the type of error and it occurred repeatedly, but so rarely that it took many iterations to narrow down the search space of cases and reveal the logic behind them.

Once you have identified enough test cases that reliably lead to failures, you can start the actual debugging. Here is an example from another incident, where we narrowed down the source of the problem by matching patterns: one of our systems hung every Thursday night, and there was nothing strange in the log files:

  • We knew that our system produced no error messages; it simply hung.
  • We compared cases until we were sure the problem occurred on Thursdays and on no other day.
  • We checked that no automatic updates were scheduled for that time and went through every automated task running on the same server: nothing.
  • We looked at what the system was doing at the meta level and narrowed it down to a timeout on a shared disk, which a disk cleanup task running on a completely different server accessed every two weeks while our service was working. Nobody had realized that both of them kicked off at three o’clock in the morning. (A small sketch of this kind of timing analysis follows the list.)
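
For timing patterns like this, even a trivial script can help. Below is a minimal sketch, with made-up timestamps, of how failure times could be grouped by weekday and hour so that an “every Thursday around 3 a.m.” pattern jumps out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical example data: timestamps of the observed hangs, pulled from
# monitoring or from the moments the service had to be restarted.
hang_timestamps = [
    "2015-03-05 03:02:11",
    "2015-03-19 03:04:47",
    "2015-04-02 03:01:05",
]

counts = Counter()
for ts in hang_timestamps:
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    # Group incidents by weekday and hour to make the timing pattern visible.
    counts[(dt.strftime("%A"), dt.hour)] += 1

for (weekday, hour), n in counts.most_common():
    print(f"{weekday} around {hour:02d}:00 -> {n} incident(s)")
```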

Create a test environment (even if it’s production)

Once you have a set of failing cases, you can start working on the true cause and the fix, and that requires a test system.

By this I mean that you need two guarantees:

  • The data used should not be changed by other systems
  • Possible damage should be limited as much as possible

In our case, we obviously could not reproduce the errors on the test server, since its copy of the .dll was in a completely different state from the one in production. Reverting to that old state was not an option either, because it broke other elements that were just as important.

So we got together and asked the question, “What is the worst that could happen if we screw this up?” Then we wrote a database script that would rewrite all affected results into an explicitly erroneous state, so that downstream systems would not process them.
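
The idea is simple enough to show in a few lines. Here is a minimal, self-contained sketch of the approach; the real script ran against the production database, and the table and column names below (results, status, batch_id) are hypothetical.

```python
import sqlite3

# Stand-in for the real production database, so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (batch_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [(101, "OK"), (102, "OK"), (205, "OK")],
)

affected_batches = [101, 102]  # the batches we were about to experiment on

with conn:
    # Mark the affected results as erroneous so downstream jobs skip them
    # instead of propagating whatever our experiments produce.
    conn.executemany(
        "UPDATE results SET status = 'ERROR' WHERE batch_id = ?",
        [(b,) for b in affected_batches],
    )

print(conn.execute("SELECT * FROM results").fetchall())
```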

Alternatively, we could have disconnected servers and tasks, removed the next logical step from the pipeline, and so on, but whether that is possible depends on the architecture of the particular system.

Compare input data to find similarities and differences with working datasets

Although with a “black box” we cannot find out exactly what the code does, we can carry out a kind of “reverse engineering” that often gives a good understanding of the causes of problems.

In our case, we had the luxury of reasonably well-formatted JSON files from the upstream system that our input depended on. Once everything was set up, all that remained was to literally compare text files in Notepad++ until we found what the failing files had in common, and then how they differed from the files that worked correctly.
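
Notepad++ did the job for us, but the same comparison is easy to script. Below is a minimal sketch, using made-up miniature inputs, that flattens two JSON documents and prints every path where their values differ.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"a.b.0": value} form for easy diffing."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{i}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items

def diff_json(a_text, b_text):
    """Print every path where the two JSON documents disagree."""
    a, b = flatten(json.loads(a_text)), flatten(json.loads(b_text))
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)!r} != {b.get(key)!r}")

# Made-up inputs; in practice these strings would be read from the failing
# file and from a correctly processed one.
failing = '{"customer": {"id": 1, "flags": {"a": true, "b": false}}}'
working = '{"customer": {"id": 2, "flags": {"a": true, "b": true}}}'
diff_json(failing, working)
```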

We got lucky here: we quickly figured out that the bug was caused by a specific combination of customer flags, and we could bypass it immediately, because the same case could be “imitated” with a similar but different set of flags. Since we already knew the system would be rewritten (in fact, we had no choice), we decided to work around the bug instead of decompiling and fixing it.

Modify the input to make sure your guess leads to the expected results (and limit any damage)

Obviously, changing live data in a production database and hoping everything works without real testing is a bad idea, but we had no other choice.

It worked out well because the number of cases was low; we ran the first few tests manually, and they turned out exactly as we hoped.

In the end we just wrote another automated task that fixed these inputs before they hit the system, and then embarked on a three-month project to rewrite the program from scratch, this time transparently, with version control and build pipelines.
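
The workaround itself amounted to a small pre-processing step. Here is a minimal sketch of the idea, with hypothetical flag names: it detects the problematic combination in an incoming record and replaces it with an equivalent combination the black-box .dll handled correctly.

```python
import json

# Hypothetical flag names; the real ones came from the upstream system.
BAD_COMBINATION = {"flag_a": True, "flag_b": False}
SAFE_REPLACEMENT = {"flag_a": True, "flag_b": True, "legacy_mode": True}

def fix_record(record: dict) -> dict:
    """Rewrite the flag combination that triggered the bug into an
    equivalent one that the system processed correctly."""
    flags = record.get("customer_flags", {})
    if all(flags.get(k) == v for k, v in BAD_COMBINATION.items()):
        flags.update(SAFE_REPLACEMENT)
    return record

if __name__ == "__main__":
    sample = {"customer_id": 42, "customer_flags": {"flag_a": True, "flag_b": False}}
    print(json.dumps(fix_record(sample)))
```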

What an adventure that was.

Conclusion: you can learn quite a lot about a system just by walking around it and poking it with a stick

I am delighted by this way of debugging and hunting for errors; I love it when programming comes with an adrenaline rush.

If you haven’t seen the video about SQL injection and reverse engineering a database from its error messages, I highly recommend watching it. The techniques used in that video are almost identical to the ones you can use for non-malicious debugging.

