Fitter – API Stitcher

Good evening, everyone!

Perhaps I did not pick the best time to reach an audience, but what matters is that the product is good, not the article about it. For the past few weeks I have been writing an application that needs to collect a huge amount of information from the network (API requests and HTML parsing). By the end of the fourth integration I decided I should make this as easy as possible: rebuilding the application for every small integration change is not an option. This may not be the best preamble, but at least there was a real problem, and I wanted to show my solution to it and open it up.

So: Fitter = Stitcher. It is quite a loose, slangy translation, but it seems to me to fit best. I built it on the following assumptions:

  1. Data can change => an update mechanism is needed

  2. Data can live in several sources => it must be stitched together (map / reduce)

  3. Authorization is required => API key / OAuth / login + password

  4. Not everything is server-side rendered => data may appear only after the client side loads (I understand that sometimes you can emulate the request instead)

  5. Data may be invalid => a field may be missing or the markup may have changed

  6. We don’t know where it will be deployed => no abstraction needed

  7. The configuration should be more or less easy to change (not so in the current version)

  8. It must be possible to bring the data into a single form
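To make assumptions 2 and 8 concrete, here is a minimal sketch of stitching records from two sources into one form. It is not Fitter's code: the function name `stitch`, the record shapes, and the choice of the city name as the join key are all assumptions made for illustration, and the sample values are borrowed from the example later in this post.

```python
from typing import Dict, List, Optional

# Illustrative inputs: one source returns a list, the other a lookup by name.
cities: List[Dict[str, str]] = [{"name": "Exeter", "population": "107,729"}]
weather: Dict[str, Dict[str, str]] = {"Exeter": {"temp": "4°C"}}


def stitch(
    cities: List[Dict[str, str]],
    weather: Dict[str, Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Merge both sources into one record per city (hypothetical helper).

    A city with no matching weather entry keeps temperature=None, so a
    missing field stays visible instead of silently dropping the record.
    """
    merged = []
    for city in cities:
        extra = weather.get(city["name"], {})
        merged.append(
            {
                "name": city["name"],
                "population": city["population"],
                "temperature": extra.get("temp"),
            }
        )
    return merged


result = stitch(cities, weather)
```

Keeping the missing value as `None` rather than skipping the record leaves the decision about invalid data (assumption 5) to a later step.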

And so DEMO:

Let me tell you a little about what it can do:

  1. Fetch data over HTTP with header-based authorization

  2. Get data through a Chromium binary

  3. Get data through browsers running in Docker

  4. Get data through Playwright

  5. Parse HTML / JSON / XPath data

  6. Forward data or links between different sources

Plans for the future (Roadmap):

  1. Add scenarios: some sites require authorization, accepting cookies, and so on before they can be parsed, so there will be a set of commands that can run before and after parsing

  2. Add ways to return information, so that the project can send webhooks, queue messages, etc.

  3. Add ways to collect information: today the idea came up that anything can be a source of information, for example a Telegram channel, which would require a bot with access to the messages

  4. Add ways to launch: the same webhook / queue options

  5. Validation – weed out invalid data

  6. Configuration editor – see below
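Validation is still only a roadmap item, so the following is just one possible shape for it, not something Fitter does today. The required field list, the helper names `is_valid` and `weed_out`, and the rule "every required field is a non-empty string" are all assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical rule: these fields must be present and non-empty strings.
REQUIRED_FIELDS = ("name", "population", "temperature")


def is_valid(record: Dict[str, Any]) -> bool:
    """Return True when every required field is a non-blank string."""
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in REQUIRED_FIELDS
    )


def weed_out(records: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split records into (valid, rejected) without modifying them."""
    valid = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return valid, rejected


records = [
    {"name": "Exeter", "population": "107,729", "temperature": "4°C"},
    {"name": "Example City", "population": "", "temperature": None},  # made-up bad row
]
valid, rejected = weed_out(records)
```

Returning the rejected records as well, rather than discarding them, makes it easier to notice when a site's markup has changed.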

Current pain points:

So far there is only one: configuration. Simple cases are easy, but once you want to link several sources, it gets hard to follow.

{
  "limits": {
    "playwright_instance": 3
  },
  "item": {
    "connector_config": {
      "response_type": "HTML",
      "connector_type": "server",
      "server_config": {
        "method": "GET",
        "url": "http://www.citymayors.com/gratis/uk_topcities.html"
      }
    },
    "model": {
      "type": "array",
      "array_config": {
        "root_path": "table table tr:not(:first-child)",
        "item_config": {
          "fields": {
            "name": {
              "base_field": {
                "path": "td:nth-of-type(1) font",
                "type": "string"
              }
            },
            "population": {
              "base_field": {
                "path": "td:nth-of-type(2) font",
                "type": "string"
              }
            },
            "temperature": {
              "base_field": {
                "path": "td:first-child font",
                "type": "string",
                "generated": {
                  "model": {
                    "type": "string",
                    "path": "temp.temp",
                    "model": {
                      "type": "object",
                      "object_config": {
                        "fields": {
                          "temp": {
                            "base_field": {
                              "type": "string",
                              "path": "//div[@id='forecast_list_ul']//td/b/a/@href",
                              "generated": {
                                "model": {
                                  "type": "string",
                                  "model": {
                                    "type": "object",
                                    "object_config": {
                                      "fields": {
                                        "temp": {
                                          "base_field": {
                                            "type": "string",
                                            "path": "div.current-temp span.heading"
                                          }
                                        }
                                      }
                                    }
                                  },
                                  "connector_config": {
                                    "response_type": "HTML",
                                    "connector_type": "browser",
                                    "attempts": 4,
                                    "browser_config": {
                                      "url": "https://openweathermap.org{PL}",
                                      "playwright": {
                                        "timeout": 30,
                                        "wait": 30,
                                        "install": false,
                                        "browser": "FireFox",
                                        "type_of_wait": "networkidle"
                                      }
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    },
                    "connector_config": {
                      "response_type": "xpath",
                      "connector_type": "browser",
                      "attempts": 3,
                      "browser_config": {
                        "url": "https://openweathermap.org/find?q={PL}",
                        "playwright": {
                          "timeout": 30,
                          "wait": 30,
                          "install": false,
                          "browser": "Chromium"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
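One way to get oriented in a config like the one above is to list its connectors in nesting order. The sketch below is a reading aid, not part of Fitter: `walk_connectors` is a hypothetical helper, the config is trimmed to its connector settings, and the rule that a connector's settings live under `<connector_type>_config` is inferred only from this one example.

```python
from typing import Any, Iterator, Optional, Tuple

# Trimmed copy of the config above: only the nesting and connector URLs are kept.
city_page = {
    "connector_config": {
        "connector_type": "browser",
        "browser_config": {"url": "https://openweathermap.org{PL}"},
    }
}
search = {
    "model": {"object_config": {"fields": {"temp": {"base_field": {
        "generated": {"model": city_page}}}}}},
    "connector_config": {
        "connector_type": "browser",
        "browser_config": {"url": "https://openweathermap.org/find?q={PL}"},
    },
}
config = {"item": {
    "connector_config": {
        "connector_type": "server",
        "server_config": {
            "method": "GET",
            "url": "http://www.citymayors.com/gratis/uk_topcities.html",
        },
    },
    "model": {"array_config": {"item_config": {"fields": {"temperature": {
        "base_field": {"generated": {"model": search}}}}}}},
}}


def walk_connectors(node: Any, depth: int = 0) -> Iterator[Tuple[int, Optional[str], Optional[str]]]:
    """Yield (depth, connector_type, url) for each connector, outermost first."""
    if isinstance(node, dict):
        connector = node.get("connector_config")
        if isinstance(connector, dict):
            kind = connector.get("connector_type")
            settings = connector.get(f"{kind}_config", {})
            yield depth, kind, settings.get("url")
            depth += 1  # anything nested below depends on this connector
        for key, value in node.items():
            if key != "connector_config":
                yield from walk_connectors(value, depth)
    elif isinstance(node, list):
        for value in node:
            yield from walk_connectors(value, depth)


steps = list(walk_connectors(config))
for level, kind, url in steps:
    print("  " * level + f"{kind}: {url}")
```

Reporting a connector before descending into its sibling keys is what keeps the output in dependency order, even though `model` appears before `connector_config` inside the generated fields.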

As you can see, the config hurts to read, so let me explain what is going on in it:

  1. We limit Playwright to 3 parallel instances

  2. We fetch the HTML with a GET request to http://www.citymayors.com/gratis/uk_topcities.html

  3. As output, we expect an array of the following form:

Array<City>

City = {
  name: string;
  population: string;
  temperature: string
}

  4. The name and population fields are fairly straightforward: we take them from the table on the site above. temperature, in contrast, is a generated field

  5. We take the city name from the table and substitute it into https://openweathermap.org/find?q={PL}, where {PL} stands for the city name. We use Playwright here because the page is rendered on the client side

  6. From the search results we take the relative link to the city page, for example /city/2950159, and substitute it into https://openweathermap.org{PL}, where {PL} stands for that link. Again we use Playwright because of client-side rendering

  7. On that page, we pull out the temperature with the selector `div.current-temp span.heading`

  8. We then unwrap the field by reading the path `temp.temp` from the generated data

  9. An example array element:

{
  "name": "Exeter",
  "population": "107,729",
  "temperature": "4°C"
}

That, in essence, is how the result is produced.
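The two mechanical parts of that walkthrough, filling the {PL} placeholder and reading the `temp.temp` path, can be sketched in a few lines. This is an illustration under assumptions, not Fitter's implementation: `fill_placeholder` and `get_path` are hypothetical names, whether Fitter URL-encodes the substituted value is not stated here, and the nested shape of the generated object is inferred from the config.

```python
from typing import Any
from urllib.parse import quote


def fill_placeholder(template: str, value: str, encode: bool = False) -> str:
    """Replace {PL} in the template, optionally percent-encoding the value."""
    if encode:
        value = quote(value)
    return template.replace("{PL}", value)


def get_path(data: Any, path: str) -> Any:
    """Follow a dotted path such as 'temp.temp' through nested dicts.

    Raises KeyError when a segment is missing, which is one way to surface
    a changed markup or an absent field.
    """
    current = data
    for key in path.split("."):
        current = current[key]
    return current


# The city name goes into the search URL (encoded here as an assumption).
search_url = fill_placeholder("https://openweathermap.org/find?q={PL}", "Exeter", encode=True)

# The relative link from the search results goes in unchanged.
city_url = fill_placeholder("https://openweathermap.org{PL}", "/city/2950159")

# Assumed shape of the generated data once both nested requests have run.
generated = {"temp": {"temp": "4°C"}}
temperature = get_path(generated, "temp.temp")
```

The relative link is substituted without encoding because it is already a URL path; encoding it would turn the slashes into `%2F` only if `quote` were called with `safe=""`, but leaving it untouched avoids the question entirely.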

PS: I understand that the same result could have been reached more simply through an API, but I wanted to show a complex case.

PS2: I did not file this under "I'm promoting" because the commercial value is zero. The project will be open and free, but it would be nice to collect feedback.

Thank you very much for your attention. I would really like to hear criticism and ideas for the future!

Project and examples:
