How are files structured? A breakdown

Files… what could possibly be easier? We are all used to creating, deleting, editing, sharing files.

But can we look inside a file and understand how it is put together? Of course we can, so today we will dig a little into the binary data and get a feel for the metadata.

Along the way, we will find out why the iPhone hangs on an SMS and take PowerPoint apart.

Let's start with WAV. This is a simple audio format that usually contains uncompressed audio (PCM). Audio CDs store the same kind of uncompressed PCM data, although not as WAV files.

The first 44 bytes of a classic WAV file form a header with useful information:

number of audio channels;

sampling frequency;

bit depth;

and much more.

This data tells the player how to interpret the raw samples that follow, so the audio plays back correctly.
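To make the header less abstract, here is a minimal Python sketch that decodes those 44 bytes. It assumes the "classic" layout: a RIFF chunk, a 16-byte PCM `fmt ` chunk, and then the `data` chunk straight away. Real files may carry extra chunks, in which case the header is longer and this sketch will reject or misread it. The function names `parse_wav_header` and `make_demo_wav` are made up for this example.

```python
import io
import struct
import wave


def parse_wav_header(data: bytes) -> dict:
    """Decode a canonical 44-byte PCM WAV header (little-endian fields)."""
    if len(data) < 44:
        raise ValueError("need at least 44 bytes")
    (riff, _riff_size, wave_id, fmt_id, _fmt_size, audio_format,
     channels, sample_rate, byte_rate, block_align, bits_per_sample,
     data_id, data_size) = struct.unpack("<4sI4s4sIHHIIHH4sI", data[:44])
    if riff != b"RIFF" or wave_id != b"WAVE" or fmt_id != b"fmt ":
        raise ValueError("not a RIFF/WAVE file")
    if data_id != b"data":
        raise ValueError("extra chunks present; not the classic 44-byte layout")
    return {
        "audio_format": audio_format,      # 1 means uncompressed PCM
        "channels": channels,              # number of audio channels
        "sample_rate": sample_rate,        # sampling frequency in Hz
        "bits_per_sample": bits_per_sample,  # bit depth
        "byte_rate": byte_rate,
        "block_align": block_align,
        "data_size": data_size,            # size of the audio payload in bytes
    }


def make_demo_wav(frames: int = 100) -> bytes:
    """Build a tiny silent stereo 16-bit 44.1 kHz WAV in memory for the demo."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)       # 2 bytes per sample = 16 bits
        w.setframerate(44100)
        w.writeframes(b"\x00" * (frames * 2 * 2))
    return buf.getvalue()


header = parse_wav_header(make_demo_wav())
print(header)
```

For everyday work, Python's standard `wave` module already reads these fields for you; unpacking them by hand is only meant to show what sits in the header.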

Open and proprietary formats

The structure of WAV is well documented, so almost any player can read such a file. That is because WAV is an example of an open format.

There are other open formats that you use daily. For example:

web page markup language – HTML;

pictures – PNG;

audio format – OGG;

archive – ZIP;

video – MKV;

e-book – EPUB;

and others…

But there are also closed file formats, or rather proprietary ones. Opening and editing such files with third-party software is often either prohibited altogether or allowed only under a license.

Proprietary formats are all well and good, but in some cases they hinder competition in the software industry, because they tie users to a single supplier. There is even a term for this: vendor lock-in.

Old Office

For example, this used to be the situation with the Microsoft Office formats: DOC, XLS, PPT.

Not only were these proprietary Microsoft formats that worked only with proprietary software; Microsoft also kept changing the file structure from one version of MS Office to the next. As a result, when a new version of the office suite was released, files from the old version were not always readable by the new one, and opening new files in an old version was even less likely to work.

The European Union did not like this situation very much and raised the issue of restricted competition. As a result, the file formats were published, and everyone learned to at least read them, but writing to the old formats still requires a Microsoft license. In parallel, open formats began to be developed.

ODF and OOXML

ODF, which literally stands for an open document format for office applications, was approved as an OASIS standard on May 1, 2005, and became an ISO/IEC standard in 2006. It was developed within the OASIS consortium, based on the XML format that Sun Microsystems created for OpenOffice.org.

ODF – Open Document Format for Office Applications.

OASIS – Organization for the Advancement of Structured Information Standards.

The format is based on XML, the Extensible Markup Language. The ODF file itself is a ZIP archive with folders, XML files and all sorts of attachments such as pictures, videos and more. In other words, if we open such a file with an archiver, we can easily see all its insides. Now that is openness!
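You do not even need an archiver to peek inside; a few lines of Python with the standard `zipfile` module are enough. The sketch below builds a stand-in archive in memory so that it runs without any real document at hand. The entry names (`mimetype`, `content.xml`, `styles.xml`, `META-INF/manifest.xml`, `Pictures/...`) follow the usual ODF layout, but the XML bodies are empty stubs, so the result is not a valid document. `build_demo_odt` and `list_entries` are names invented for this example.

```python
import io
import zipfile


def build_demo_odt() -> bytes:
    """Create an ODF-like ZIP in memory (stub contents, for illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<stub/>")            # the document text lives here
        zf.writestr("styles.xml", "<stub/>")             # formatting
        zf.writestr("META-INF/manifest.xml", "<stub/>")  # list of package parts
        zf.writestr("Pictures/photo.png", b"not a real image")
    return buf.getvalue()


def list_entries(document: bytes) -> list[str]:
    """Return the names of all files stored inside the ZIP container."""
    with zipfile.ZipFile(io.BytesIO(document)) as zf:
        return zf.namelist()


for name in list_entries(build_demo_odt()):
    print(name)
```

To inspect a real file, read its bytes with `open(path, "rb").read()` and pass them to `list_entries`.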

Microsoft didn't sleep either. Under pressure from European regulators, it worked with a number of companies through the Ecma association and developed its own open format, Office Open XML, which was approved as a standard in December 2006.

OOXML is standardized by Ecma International (formerly the European Computer Manufacturers Association) as the ECMA-376 standard.

The letter X was appended to the familiar extensions, and we got DOCX, XLSX and PPTX.

OOXML – Office Open XML (DOCX, XLSX, PPTX)

OOXML is, in general, very similar to ODF. It is also based on XML markup and is also a ZIP archive, so you can look inside office files using any archiver. You can even pull out the pictures or replace them, which is especially convenient when working with presentations or when someone sends you a text document with pictures embedded in it.
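Pulling pictures out and swapping them can be sketched in Python as well. This example assumes the layout that Office typically produces, where embedded media sit in a `media` folder such as `word/media/` (or `ppt/media/` in presentations). The demo package is a stub built in memory, not a valid DOCX, and `build_demo_docx`, `extract_media` and `replace_entry` are hypothetical helper names.

```python
import io
import zipfile


def build_demo_docx() -> bytes:
    """Create a DOCX-like ZIP in memory (stub contents, for illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<stub/>")
        zf.writestr("word/document.xml", "<stub/>")
        zf.writestr("word/media/image1.png", b"old picture bytes")
    return buf.getvalue()


def extract_media(package: bytes) -> dict[str, bytes]:
    """Return {entry name: raw bytes} for every file under a media folder."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if "/media/" in name and not name.endswith("/")
        }


def replace_entry(package: bytes, name: str, new_data: bytes) -> bytes:
    """Copy the archive, substituting new_data for the entry called name."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for entry in src.namelist():
            data = new_data if entry == name else src.read(entry)
            dst.writestr(entry, data)
    return out.getvalue()


package = build_demo_docx()
print(sorted(extract_media(package)))
patched = replace_entry(package, "word/media/image1.png", b"new picture bytes")
print(extract_media(patched)["word/media/image1.png"])
```

A ZIP entry cannot be overwritten in place with `zipfile`, which is why `replace_entry` writes a fresh archive. When replacing a picture in a real document, keep the same file name and image format so that the references inside the XML stay valid.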

Despite its apparent simplicity, the format is really complex. The main documentation alone runs to about 5 thousand pages. And it has almost no pictures.

Nevertheless, someone still managed to read all this documentation, and so some cool office suites were born, for example MyOffice, which can work with ODF, with Office Open XML, and even with outdated formats such as DOC.

But there is an important caveat about the old formats. As a rule, modern software can only read them, not write them, because writing requires purchasing a Microsoft license. These days, however, there is, to put it mildly, little point in writing to them anyway.

Summary

What did we end up learning? Files come in several types:

The most basic ones are binary. Companies like to come up with such formats so that no one else understands how their programs store data.

A more open option is XML containers. Fortunately, most of the popular office formats are now like this. If you want to work with all these files, whether at home or on the go, download the MyOffice programs! That’s all we have for today.