Scraping with Goutte (crawler). Parsing sites using the Goutte library
In this post, I will show you Goutte, a PHP library for scraping (parsing) sites. With it you can collect information from a third-party site, follow links, and submit forms automatically.
Connecting the Goutte library and creating a request to the site
I will use my own website as an example. The first step is a request to the main page; the elements in the following sections are taken from that page. The code below is therefore needed in every example, but I will not repeat it each time.
/* load the files installed via Composer */
require __DIR__ . "/vendor/autoload.php";
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;
/* create the client and request the Prog-Time home page */
$client = new Client();
$crawler = $client->request('GET', 'https://prog-time.ru/');
Getting text information with Goutte
The filter method takes a CSS selector and returns the matching elements. Since this page contains several elements with the home_heading_post class, we use the each method to run a callback on every one of them.
$crawler->filter('.bottom_list_last_posts .home_link_post .home_heading_post')->each(function ($node) {
    var_dump($node->text());
});
Getting the href attribute of a link
$crawler->filter('.bottom_list_last_posts .home_link_post')->each(function ($node) {
    var_dump($node->attr("href"));
});
Getting the src attribute of an image
$crawler->filter('.bottom_list_last_posts .home_link_post img')->each(function ($node) {
    var_dump($node->attr("src"));
});
Filtering a selection (selecting every other element)
Use the reduce method to pass a function that filters the selection: an element is kept when the function returns true. In my example the function keeps every second element (indexes 0, 2, 4, ...); the commented-out line would keep every tenth element instead.
$newListLinks = $crawler
    ->filter('.home_link_post .home_heading_post')
    ->reduce(function ($node, $i) {
        return ($i % 2) == 0;
        // return ($i % 10) == 0;
    })
    ->each(function ($node) {
        var_dump($node->text());
    });
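The index test used inside reduce can be checked without touching the network. The sketch below is a plain-PHP analogy, not Goutte code: it applies the same "index divisible by N" rule to an ordinary array. The helper name keepEveryNth and the sample titles are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: keeps the items whose zero-based index is a
 * multiple of $step, mirroring the predicate passed to reduce().
 */
function keepEveryNth(array $items, int $step): array
{
    $kept = [];
    foreach (array_values($items) as $i => $item) {
        if ($i % $step === 0) {
            $kept[] = $item;
        }
    }
    return $kept;
}

// Made-up stand-ins for the post headings on the page.
$titles = [];
for ($n = 0; $n < 12; $n++) {
    $titles[] = "Post $n";
}

print_r(keepEveryNth($titles, 2));  // keeps indexes 0, 2, 4, 6, 8, 10
print_r(keepEveryNth($titles, 10)); // keeps indexes 0 and 10
```

Note that index 0 always passes the test, so the first element is kept for any step.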
Getting an element by its index
The eq method selects an element by its position. Numbering starts from 0, so eq(3) in my example returns the fourth element with the class "home_heading_post".
$itemPost = $crawler->filter('.home_link_post .home_heading_post')->eq(3);
var_dump($itemPost->text());
Getting the first and last element
first() – returns the first element
last() – returns the last element
$firstItem = $crawler->filter('.home_link_post .home_heading_post')->first();
$lastItem = $crawler->filter('.home_link_post .home_heading_post')->last();
var_dump($firstItem->text());
var_dump($lastItem->text());
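Getting sibling elements at the same level of the DOM tree
siblings() – returns the sibling elements of the first matched node (the other elements that share its parent)
/* selector assumed here: the same one as in the previous examples */
$siblingsItem = $crawler->filter('.home_link_post .home_heading_post')->siblings();
var_dump($siblingsItem->text());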
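To make the meaning of "siblings" concrete, here is a minimal sketch that uses PHP's built-in DOM extension rather than Goutte. It is not how the library is implemented; the markup and the id "current" are made up for illustration. It collects the element nodes that share a parent with the starting node, excluding the node itself.

```php
<?php
// Made-up markup: three list items under one parent.
$html = '<ul><li>First</li><li id="current">Second</li><li>Third</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Find the starting node by its (hypothetical) id.
$xpath = new DOMXPath($doc);
$current = $xpath->query('//li[@id="current"]')->item(0);

$siblingTexts = [];
foreach ($current->parentNode->childNodes as $child) {
    // Keep element nodes only, and skip the node we started from.
    if ($child instanceof DOMElement && !$child->isSameNode($current)) {
        $siblingTexts[] = $child->textContent;
    }
}

print_r($siblingTexts);
```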
Getting a link by text and clicking on the link
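The selectLink() method finds a link by its text; the text inside the link is passed as the parameter.
The link() method returns a Link object for the found link. To actually follow the link and load the new page, pass that object to $client->click().
The getUri() method returns the URI of the link.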
$linkPost = $crawler->selectLink('Парсинг на PHP с формированием данных в Excel');
$link = $linkPost->link();
var_dump($link);
$uri = $link->getUri();
var_dump($uri);
Getting an Image Object
$imagesPost = $crawler->selectImage('Парсинг на PHP с формированием данных в Excel');
$image = $imagesPost->image();
var_dump($image);
Getting child elements
$childrenItems = $crawler->filter('.header_post_list')->children();
var_dump($childrenItems);
Submitting a form with Goutte
/* load the page that contains the form */
$crawler = $client->request('GET', 'https://prog-time.ru/test_form.php');
/* find the submit button (labelled 'Отправить') and get its form */
$form = $crawler->selectButton('Отправить')->form();
/* pass the form field values and send the request */
$crawler = $client->submit($form, [
    'name' => 'Илья',
    'phone' => '+7(999)999-99-99',
]);