Fitter – API Stitcher
Good evening everyone!
Perhaps this is not the best time to reach an audience, but what matters is that the product is good, not the article about it. For the past few weeks I have been writing an application that needs to collect a huge amount of information from the network (API requests and HTML parsing). By the end of the fourth integration I decided to make adding a source as easy as possible, since rebuilding the application for every little integration change is not an option. Perhaps this is not the best preamble, but there was a real problem behind it, and I want to show and open up its solution.
So, Fitter = Stitcher. It is a rather slangy translation, but it seems to me to fit best. I built it on the following assumptions:
Data may change => an update mechanism is needed
Data can live in several sources => it must be stitched together (map / reduce)
Authorization may be required => API key / OAuth / login + password
Not everything is server-side rendered => data may appear only after client-side code has loaded (I understand that sometimes the request can be emulated instead)
Data may be invalid => a field may be missing or the markup may have changed
We don’t know where it will be deployed => no abstraction needed
The configuration should be more or less easy to change (not the case in the current version)
It must be possible to bring the data to a single format
And so DEMO:
Let me tell you a little about what it can do:
Fetch data via HTTP requests with header-based authorization
Get data through a Chromium binary
Get data through browsers running in Docker
Get data through Playwright
Parse HTML / JSON / XPath data
Forward data and links between different sources
Plans for the future (Roadmap):
Add scenarios: some sites require authorization, accepting cookies, and so on before they can be parsed, so there will be a set of commands that can be run before and after parsing
Add ways to return information, so that the project can send a Webhook, Queue messages, etc.
Add ways to collect information: today I had the idea that anything can be a source of information, for example a Telegram channel, for which we would need a bot with access to messages
Add launch methods: the same Webhook / Queue
Validation: weed out invalid data
Configuration editor: see below
Current pain points:
So far there is only one: configuration. For simple situations it is easy, but once you want to link several sources it becomes hard to follow.
{
  "limits": {
    "playwright_instance": 3
  },
  "item": {
    "connector_config": {
      "response_type": "HTML",
      "connector_type": "server",
      "server_config": {
        "method": "GET",
        "url": "http://www.citymayors.com/gratis/uk_topcities.html"
      }
    },
    "model": {
      "type": "array",
      "array_config": {
        "root_path": "table table tr:not(:first-child)",
        "item_config": {
          "fields": {
            "name": {
              "base_field": {
                "path": "td:nth-of-type(1) font",
                "type": "string"
              }
            },
            "population": {
              "base_field": {
                "path": "td:nth-of-type(2) font",
                "type": "string"
              }
            },
            "temperature": {
              "base_field": {
                "path": "td:first-child font",
                "type": "string",
                "generated": {
                  "model": {
                    "type": "string",
                    "path": "temp.temp",
                    "model": {
                      "type": "object",
                      "object_config": {
                        "fields": {
                          "temp": {
                            "base_field": {
                              "type": "string",
                              "path": "//div[@id='forecast_list_ul']//td/b/a/@href",
                              "generated": {
                                "model": {
                                  "type": "string",
                                  "model": {
                                    "type": "object",
                                    "object_config": {
                                      "fields": {
                                        "temp": {
                                          "base_field": {
                                            "type": "string",
                                            "path": "div.current-temp span.heading"
                                          }
                                        }
                                      }
                                    }
                                  },
                                  "connector_config": {
                                    "response_type": "HTML",
                                    "connector_type": "browser",
                                    "attempts": 4,
                                    "browser_config": {
                                      "url": "https://openweathermap.org{PL}",
                                      "playwright": {
                                        "timeout": 30,
                                        "wait": 30,
                                        "install": false,
                                        "browser": "FireFox",
                                        "type_of_wait": "networkidle"
                                      }
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    },
                    "connector_config": {
                      "response_type": "xpath",
                      "connector_type": "browser",
                      "attempts": 3,
                      "browser_config": {
                        "url": "https://openweathermap.org/find?q={PL}",
                        "playwright": {
                          "timeout": 30,
                          "wait": 30,
                          "install": false,
                          "browser": "Chromium"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
As we can see, it hurts, so let me explain what is going on here:
We limit the number of Playwright instances running in parallel to 3
We fetch HTML with a GET request to the site: http://www.citymayors.com/gratis/uk_topcities.html
At the output, we expect an array of the following form:
Array<City>
City = {
  name: string;
  population: string;
  temperature: string
}
The name and population fields are more or less straightforward: we take them from the table on the site indicated above. temperature, however, is a generated field
We take the name of the city from the table and forward it to https://openweathermap.org/find?q={PL}, where {PL} is the name of the city. We use Playwright for this because the page is rendered client-side
From the search results we take the relative link to the city page, for example /city/2950159, and substitute it into https://openweathermap.org{PL}, where {PL} is that link. Again we use Playwright because the page is rendered client-side
Using the link above, we extract the temperature with the selector `div.current-temp span.heading`
And unwrap the field by reading the path `temp.temp` from the generated data
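To make the chain above easier to follow, here is a rough Python sketch of the same two-step lookup. It is not Fitter's code: the names `fill`, `city_temperature`, `find_link` and `read_temperature` are made up for illustration, the two callables stand in for the Playwright connectors, and URL-encoding the city name is my assumption about how a search query should be built, not something the config states.

```python
from typing import Callable
from urllib.parse import quote

PLACEHOLDER = "{PL}"  # the marker used in the config URLs above


def fill(template: str, value: str, encode: bool = True) -> str:
    """Put `value` in place of {PL}; optionally percent-encode it first."""
    return template.replace(PLACEHOLDER, quote(value) if encode else value)


def city_temperature(
    city: str,
    find_link: Callable[[str], str],         # hypothetical: search URL -> relative link
    read_temperature: Callable[[str], str],  # hypothetical: page URL -> temperature text
) -> str:
    # Step 1: city name -> search URL -> relative link to the city page
    search_url = fill("https://openweathermap.org/find?q={PL}", city)
    link = find_link(search_url)
    # Step 2: relative link -> city page URL -> temperature text
    # (the link is already a path, so it is inserted without encoding)
    page_url = fill("https://openweathermap.org{PL}", link, encode=False)
    return read_temperature(page_url)


if __name__ == "__main__":
    # Offline stand-ins for the two browser connectors, using the values
    # from the walkthrough instead of real network calls.
    links = {"https://openweathermap.org/find?q=Exeter": "/city/2950159"}
    temps = {"https://openweathermap.org/city/2950159": "4°C"}
    print(city_temperature("Exeter", links.__getitem__, temps.__getitem__))
```

Each nested `generated` block in the config plays the role of one step here: the value of the parent field becomes the {PL} of the child connector.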
Array element example
{
  "name": "Exeter",
  "population": "107,729",
  "temperature": "4°C"
},
And that, in essence, is how we get the result.
PS: I understand that the same result could have been achieved more simply by connecting to an API, but I wanted to show a complex case.
PS2: I did not file this under "I'm promoting", because the commercial value is zero. The project will be free and open, but it would be nice to collect feedback.
Thank you very much for your attention. I would really like to hear criticism and ideas for the future!
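Since every field in the example comes back as a string, a consumer will probably want to bring the record to one form on its own side. Below is a minimal sketch of such post-processing in Python. It is not part of Fitter: `NormalizedCity`, `to_int` and `normalize` are hypothetical names, and treating the temperature as degrees Celsius is an assumption based only on the "4°C" sample above.

```python
import re
from typing import Optional, TypedDict


class City(TypedDict):
    # Shape of one array element as returned in the example above
    name: str
    population: str
    temperature: str


class NormalizedCity(TypedDict):
    # Hypothetical target shape, chosen only for this illustration
    name: str
    population: Optional[int]
    temperature_c: Optional[int]


def to_int(text: str) -> Optional[int]:
    """Return the first integer found in `text`, ignoring thousands commas."""
    match = re.search(r"-?\d+", text.replace(",", ""))
    return int(match.group()) if match else None


def normalize(city: City) -> NormalizedCity:
    # A field that cannot be parsed becomes None instead of raising,
    # so a markup change shows up as missing data rather than a crash.
    return {
        "name": city["name"].strip(),
        "population": to_int(city["population"]),
        "temperature_c": to_int(city["temperature"]),
    }
```

For the Exeter element above, `normalize` would turn "107,729" into the integer 107729 and "4°C" into 4; a value with no digits at all would become `None`.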
Project and examples: