Decoding protobuf by loading loadable-components chunks in NodeJS

I was given the task of scraping data from a website, aboutyou.de. A quick look at the pages showed that the site has little protection and all the necessary information is available in the HTML. At first glance everything seemed fine. But, jumping ahead, it was not.

Reverse Engineering

I had already mentally filed the task under "easy," but then I got to pagination. It was implemented in JS, and no URL parameter allowed opening a specific page. In the list of requests I found the one responsible for loading the next page, but it raised more questions than it answered.

Content-Type: application/grpc-web+proto. This was something new to me.

In short, gRPC is a remote procedure call (RPC) framework, and proto (Protocol Buffers) is a way of serializing structured data into a binary representation. During development, a special .proto file describing the data structures is created; at build time it is compiled into target-language code with encoding and decoding methods. If you want to get to know gRPC in more detail, I recommend this article.
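To give a feel for it, here is a minimal protobufjs sketch (a hypothetical Pagination message, not one of the site's schemas): the field names exist only in the code, while the wire format carries just field numbers and values.

import protobufjs from "protobufjs";

// Hypothetical message, roughly: message Pagination { uint32 page = 1; uint32 perPage = 2; }
const Pagination = new protobufjs.Type("Pagination")
  .add(new protobufjs.Field("page", 1, "uint32"))
  .add(new protobufjs.Field("perPage", 2, "uint32"));

const bytes = Pagination.encode({ page: 3, perPage: 24 }).finish();
console.log(Buffer.from(bytes).toString("hex")); // "08031018": field numbers only, no names
console.log(Pagination.decode(bytes));           // { page: 3, perPage: 24 }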

Working with this API would be much faster than downloading and parsing HTML, and other data, such as catalog and product information, could be retrieved through it as well. So I decided to use the API as much as possible.

My first thought was to use protobuf-decoder and guess the field names. After N hours it dawned on me that restoring all the necessary services would take a lot of time, because many of them have a variable structure and a lot of fields.

Category page request structure
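Without a schema, a decoder only sees field numbers and wire types, which is why guessing the names by hand takes so long. A minimal sketch of such blind decoding with protobufjs (my illustration, not protobuf-decoder's code):

import protobufjs from "protobufjs";

// Walk the top-level fields of an unknown protobuf payload.
function dumpFields(payload: Uint8Array) {
  const reader = protobufjs.Reader.create(payload);
  while (reader.pos < reader.len) {
    const tag = reader.uint32();
    console.log({ fieldNo: tag >>> 3, wireType: tag & 7 });
    reader.skipType(tag & 7); // skip the value; only the structure is visible
  }
}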

I hoped that the frontend bundler had not obfuscated the .proto definitions during the build and that at least the field names could be found in the source code. I set a breakpoint on XHR and…

Not only the field names turned up, but also the encoding and decoding methods: they are passed to the unary method of the grpc-web module. The field names can be worked out from the code, but restoring all the schemas by hand would still take a lot of time.

Request encoding code
se = (e, t) => {
    (0, s.CO)(e.uint32(10).fork(), t.config).ldelim(),
    (0, n.D3)(e.uint32(18).fork(), t.session).ldelim(),
    (0, i.WD)(e.uint32(130).fork(), t.category).ldelim(),
    (0, o.rb)(e.uint32(138).fork(), t.appliedFilters).ldelim(),
    (0, g.sj)(e.uint32(146).fork(), t.pagination).ldelim(),
    (0, m.a)(e.uint32(152), t.sortOptions),
    e.uint32(162).fork();
    for (const r of t.firstProductIds)
        e.int64(r);
    return e.ldelim(),
    ((e, t) => {
        e.int32
    })(e.uint32(168), t.automaticSizeFilter),
    e.uint32(176).bool(t.showSizeFinderBadges),
    e.uint32(184).bool(t.showSizeFinderProfileCompletionHints),
    ne(e.uint32(194).fork(), t.highlightedProducts).ldelim(),
    (0, k.Jl)(e.uint32(202).fork(), t.selectedDiscount).ldelim(),
    e
}
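To make sense of the numbers: e is a protobufjs-style writer, uint32(n) writes a field tag (n = fieldNumber << 3 | wireType), and fork()/ldelim() wrap a nested message in a length prefix. A small sketch with protobufjs (the field numbers are only illustrative):

import protobufjs from "protobufjs";

const w = new protobufjs.Writer();
w.uint32(18).fork();      // 18 = (2 << 3) | 2, i.e. field 2, length-delimited (cf. t.session above)
w.uint32(10).string("x"); // field 1 of the nested message
w.ldelim();               // close the nested message and prepend its byte length
console.log(w.finish());  // <Buffer 12 03 0a 01 78>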

The file containing this code is a chunk generated by loadable-components. The chunk code is quite simple: it pushes onto the __LOADABLE_LOADED_CHUNKS__ array an array containing the chunk id and an object whose module functions are stored under numeric indexes.
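Schematically, such a chunk looks roughly like this (a simplified sketch based on the ids from this article, not the site's actual file):

// Simplified sketch of a chunk file
(self.__LOADABLE_LOADED_CHUNKS__ = self.__LOADABLE_LOADED_CHUNKS__ || []).push([
  [5062],                // chunk id
  {
    91410: (e, t, r) => { /* module body: exports the service */ },
    // ...other modules of the chunk, keyed by numeric index
  }
]);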

The lambda ve, which the gRPC method GetProductStream returns, is declared near the beginning of the function with index 91410.

91410: (e,t,r)=>{
        "use strict";
        r.r(t),
        r.d(t, {
            CategoryStreamService_GetAdditionalProductStream: ()=>we,
            CategoryStreamService_GetGenderSwitch: ()=>ye,
            CategoryStreamService_GetProductStream: ()=>ve,
            CategoryStreamService_GetProductStreamPage: ()=>Te,
            CategoryStreamService_GetQuickFilters: ()=>me
        });
        var s = r(45121)
          , n = r(7782)
          , i = r(38214)
          , o = r(66931)
          , a = r(22648)
          , c = r(17436);
  //////////////////////////////////////////////////////////////////

A breakpoint at the beginning of the function finally convinced me that the modules are stored under numeric indexes. In this snippet, the module with index 91410 from chunk 5062 exports the service with all of its methods. It is also clear from the snippet that r is what loads a module by its index.

From the debugger it is clear that r is the class that handles modules. All that remains is to repeat what the client does.
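Judging by the debugger, the client-side loading boils down to roughly this (a sketch with assumed typings, not the site's code):

// r.e(chunkId) loads a chunk asynchronously; r(moduleId) resolves a module from it.
declare const r: ((moduleId: number) => any) & { e(chunkId: number): Promise<unknown> };

r.e(5062).then(() => {
  const serviceModule = r(91410);
  // serviceModule.CategoryStreamService_GetProductStream, ...
});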

Implementation

In order to load the services quickly, the plan is as follows:

  • Load the loadable-components runtime

  • Load the necessary chunks

  • Import the services

  • Restore working encoding/decoding functions

  • Type it all (or just use any)

The huge advantage of Node.js here is that we can execute the JS code without any problems, using the built-in vm module.

We download the runtime.*.js file and the necessary chunks from the site. Since debugging code running in vm is a dubious pleasure, I concatenated the runtime with all the chunks into a single string and executed it in a fresh vm context.

runtime.ts
import { readFileSync, readdirSync } from "fs";
import vm from "vm";

// Fresh, empty context in which the site's code will run
const context = vm.createContext({});

// Glue the downloaded runtime and all chunk files into a single script
const files = readdirSync("./assets/chunks/");
let code = "";
for (const file of files) {
  code += readFileSync("./assets/chunks/" + file).toString() + "\n";
}

// Execute it; the loader ends up as a property of the context
vm.runInContext(code, context);

export const runtime = context.runtime;

Now we can import the module with the service, but there is a problem: the method comes back to us as an RPC method handler. It would be possible to pull in grpc-web as a dependency and pass the function the correct arguments, but that would hurt performance, since we need literally a couple of dozen lines out of the whole library. protobufjs, however, did have to be pulled in. When calling the function that returns a method, I pass in a mock and wrap the resulting encodeRequest and decodeResponse with the necessary protobufjs logic. What does this give us? Originally encodeRequest had to be given a protobufjs.Writer plus the data; after the wrapper, only the data.

service.ts
import { runtime } from "./runtime.js";
import protobufjs from "protobufjs";

export type ServiceMethod<
  MethodName = string,
  ServiceName = string,
  RequestData = any,
  ResponseData = any
> = {
  methodName: MethodName;
  serviceName: ServiceName;
  encodeRequest: (data: RequestData) => Buffer;
  decodeResponse: (input: Buffer) => ResponseData;
};

export async function GetService(module_id: number, export_id: number) {
  const service_module = await runtime
    .e(module_id)
    .then(runtime.bind(runtime, export_id));

  const service = {} as any;

  for (const method in service_module) {
    if (Object.prototype.hasOwnProperty.call(service_module, method)) {
      // Pass a mock instead of the real grpc-web transport; we only need
      // the method metadata and its encode/decode functions
      const service_info = service_module[method]({
        unary: (e: any) => e,
        stream: (e: any) => e,
      });
      // Serialize the data with protobufjs, then add the 5-byte grpc-web
      // frame header (1 flag byte + 4-byte big-endian message length)
      const encodeRequest = (data: any) => {
        const rpc_writer = new protobufjs.Writer();
        const encoded_array = service_info
          .encodeRequest(rpc_writer, data)
          .finish();
        const encoded_buffer = Buffer.from([
          ...[0, 0, 0, 0, 0],
          ...encoded_array,
        ]);
        encoded_buffer.writeUInt32BE(encoded_buffer.length - 5, 1);
        return encoded_buffer;
      };

      // Strip the frame header and decode the protobuf payload
      const decodeResponse = (input: Buffer) => {
        const length = input.readUint32BE(1);
        const rpc_reader = new protobufjs.Reader(input.subarray(5, 5 + length));
        return service_info.decodeResponse(rpc_reader, rpc_reader.len);
      };

      service[service_info.methodName] = {
        ...service_info,
        encodeRequest,
        decodeResponse,
      };
    }
  }

  return service;
}

Now the entire service can literally be pulled out with a single function call:

export const CategoryStreamService = (await GetService(
  5062,
  91410
)) as CategoryStreamService;
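From there, calling the API boils down to framing a request and POSTing it with the right content type. A hedged sketch: the endpoint URL, the method name and the request fields below are assumptions, not values from the real service.

const method = CategoryStreamService.GetProductStreamPage; // hypothetical method pick

const body = method.encodeRequest({
  pagination: { page: 2, perPage: 24 }, // illustrative fields only
});

const response = await fetch(
  `https://grpc.example.com/${method.serviceName}/${method.methodName}`, // placeholder URL
  {
    method: "POST",
    headers: { "Content-Type": "application/grpc-web+proto" },
    body,
  }
);

console.log(method.decodeResponse(Buffer.from(await response.arrayBuffer())));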

Conclusion

I think this approach can be applied to other platforms as well. I'm sure something similar could be done with the compiled services of a mobile application, and so on.

The parser also puts much less load on the server, and Cloudflare does not rate-limit requests to the API. Thanks to this, the API-based parser is roughly 20 times faster than the HTML-parsing version.

That's all. If you are interested in the topic of scraping, I will be glad to see you in my Telegram channel.
