The book "Elasticsearch, Kibana, Logstash and the new generation of search engines"

Hi, Habrozhiteli! We have published a book on the Elastic Stack, aimed at professionals who work with large amounts of data and want to reliably retrieve it from any source, in any format, and to search, analyze, and visualize it in real time. This book is for you if you need a fundamental understanding of how the Elastic Stack works in the realms of distributed computing and data processing.

Below you will find the structure of the book and an excerpt about aggregations.

Book structure

Chapter 1, “Introduction to Elastic Stack,” introduces the main components of the Elastic Stack, explains their role in the overall structure, and describes the purpose of each component. The chapter also discusses the need for distributed, scalable search and analytics, which Elasticsearch provides. At the end, a guide to downloading and installing Elasticsearch and Kibana is provided so that you can start working with these tools.

Chapter 2, “Getting Started with Elasticsearch,” introduces the key principles behind the Elasticsearch search engine, which is the foundation of the Elastic Stack. You will be introduced to concepts such as indexes, types, nodes, and clusters. You will also learn how to use the REST API to perform basic operations.

Chapter 3, “Search is what's important,” focuses on the search methods provided by Elasticsearch. You will learn about the basics of text analysis, tokenizers, analyzers, and the features of relevance-based search. The chapter also provides practical examples of relevance searches.

Chapter 4, “Analyzing Data with Elasticsearch,” discusses the different types of aggregations. It includes examples that will give you a better understanding of the principles of data analysis. You will learn how to use various types of aggregations, from the simplest to the most complex, to navigate through huge amounts of data. After reading this chapter, you will know when to use which type of aggregation.

Chapter 5, “Analyzing Log Data,” explains why Logstash is needed and covers its architecture, installation, and configuration. Elasticsearch 5 introduced the Ingest Node, which can replace a Logstash setup for some use cases. After reading this chapter, you will know how to build pipelines using Elastic's Ingest Node.

Chapter 6, “Building Data Pipelines with Logstash,” provides fundamental knowledge of Logstash, which lets you dynamically collect data from various sources and normalize it using the filters of your choice. You will learn how its wide range of filters puts Logstash on a par with other real-time and near-real-time stream processing frameworks, without requiring you to write any code. You will also learn about the Beats platform and the Filebeat component, which is used to ship log files from remote machines.

Chapter 7, “Visualizing Data in Kibana,” shows how effective Kibana is for visualizing and impressively presenting your data. Using a simple data set, it walks through creating visualizations in a couple of clicks.

Chapter 8, "Elastic X-Pack" talks about the expansion of Elasticsearch. By this time, you will have already studied Elasticsearch and its key components for creating data containers and will be able to connect extensions to solve specific problems. In this chapter, you will read how to install and configure X-Pack components in the Elastic Stack, learn the basics of security and monitoring, and learn how to add various notifications.

Chapter 9, “Running the Elastic Stack in Production,” gives recommendations for putting an Elastic Stack deployment into commercial operation. You will get advice on how to deploy your application and change the default settings to match your operational requirements. You will also learn how to use the Elastic Cloud hosted services.

Chapter 10, “Creating an Application for Analyzing Data from Sensors,” describes building an application for analyzing and processing data from various sources. You will learn how to model data in Elasticsearch, build data pipelines, and visualize the data in Kibana. You will also learn how to use the X-Pack components effectively to secure and monitor your pipelines and to receive notifications about various events.

Chapter 11, “Monitoring Server Infrastructure,” demonstrates how the Elastic Stack can be used to set up real-time monitoring of servers and applications, built entirely on the Elastic Stack. You will be introduced to another component of the Beats platform, Metricbeat, which is used to monitor servers and applications.

Sum, average, maximum, and minimum aggregations

Finding the sum of a field, its minimum or maximum value, or its average is a fairly common operation. In SQL, a query to calculate the sum looks like this:

SELECT sum(downloadTotal) FROM usageReport;

This calculates the sum of the downloadTotal field over all records in the table. It requires going through all the records in the table (or all the records in the chosen context) and adding up the values of the chosen field.

In Elasticsearch, you can write a similar query using the sum aggregation.

Sum aggregation

Here is how to write a simple sum aggregation:

GET bigginsight/_search
{
    "aggregations": {                  (1)
       "download_sum": {               (2)
           "sum": {                    (3)
              "field": "downloadTotal"     (4)
           }
       }
    },
    "size": 0                          (5)
}

1. The aggs or aggregations element at the top level serves as the wrapper for all aggregations.

2. Give the aggregation a name. Here we are summing the downloadTotal field, so we choose the matching name download_sum. You can name it whatever you like. The name is useful when we need to find this particular aggregation in the response.

3. We are performing a sum aggregation, hence the sum element.

4. We want to compute the sum over the downloadTotal field.

5. Specify size: 0 so that raw search results are not returned. We only need the aggregation results, not the search hits. Since we did not specify any top-level query element, the aggregation runs over all documents. We do not need the raw documents (the search hits) in the response.

The response should look like this:

{
   "took": 92,
   ...
   "hits": {
      "total": 242836,         (1)
      "max_score": 0,
      "hits": []
   },
   "aggregations": {           (2)
      "download_sum": {        (3)
         "value": 2197438700   (4)
      }
   }
}

Let's go over the key parts of the response.

1. The hits.total element shows the number of documents matching the query context. Since no additional query or filter was specified, all documents in the type or index are included.

2. Just as in the request, the result is wrapped inside an aggregations element.

3. We named the requested aggregation download_sum, so the result of our sum aggregation appears inside the element with the same name.

4. This is the actual value produced by the sum aggregation.
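Note that the aggregation runs in whatever query context you give it. As a minimal sketch of how to restrict that context, the request below combines the same sum aggregation with a term query on the username field (which appears later in this excerpt); the specific value "user1" is an illustrative assumption, not part of the example data set:

GET bigginsight/_search
{
    "query": {
        "term": {
            "username": "user1"
        }
    },
    "aggregations": {
        "download_sum": {
            "sum": {
                "field": "downloadTotal"
            }
        }
    },
    "size": 0
}

With a query element present, the sum is computed only over the matching documents, and hits.total in the response reflects that narrower context.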

The average, maximum, and minimum aggregations are very similar. Let's consider them briefly.

Average aggregation

The average aggregation finds the average of a field across all documents in the query context:

GET bigginsight/_search
{
    "aggregations": {
       "download_average": {    (1)
          "avg": {              (2)
             "field": "downloadTotal"
          }
       }
    },
    "size": 0
}

The notable differences from the sum aggregation are as follows.

1. We chose a different name, download_average, to make it clear that this aggregation computes an average value.

2. The type of aggregation being performed is avg, instead of the sum used in the previous example.

The structure of the response is identical to that of the previous subsection, but the value field now holds the average of the requested field.
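As a sketch, the relevant part of the response would look like the following; the number shown is simply the avg value from the stats response later in this section, since both compute the same average:

{
   ...,
   "aggregations": {
      "download_average": {
         "value": 9049.102065188297
      }
   }
}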

The minimum and maximum aggregations are similar.

Minimum value aggregation

Let's find the minimum value of the downloadTotal field across the entire index/type:

GET bigginsight/_search
{
    "aggregations": {
       "download_min": {
          "min": {
              "field": "downloadTotal"
          }
       }
    },
    "size": 0
}

Maximum value aggregation

Let's find the maximum value of the downloadTotal field across the entire index/type:

GET bigginsight/_search
{
     "aggregations": {
        "download_max": {
           "max": {
               "field": "downloadTotal"
           }
        }
     },
     "size": 0
}

These are very simple aggregations. Now let's look at the more advanced stats and extended stats aggregations.

Stats and extended stats aggregations

These aggregations compute a set of common statistical values within a single query, without issuing additional queries. Because the statistics are computed in a single pass rather than requested multiple times, Elasticsearch resources are saved. The client code also becomes simpler if you are interested in more than one of these statistics. Let's look at an example of the stats aggregation.

Stats aggregation

The stats aggregation computes the sum, average, maximum, and minimum values of a field and the total count of documents in a single pass:

GET bigginsight/_search
{
    "aggregations": {
       "download_stats": {
          "stats": {
             "field": "downloadTotal"
          }
       }
    },
    "size": 0
}

The structure of the stats request is similar to the other metric aggregations you have already seen; nothing special is happening here.

The response should look like this:

{
   "took": 4,
   ...,
   "hits": {
      "total": 242836,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "download_stats": {
         "count": 242835,
         "min": 0,
         "max": 241213,
         "avg": 9049.102065188297,
         "sum": 2197438700
      }
   }
}

As you can see, the response's download_stats element contains the count, minimum, maximum, average, and sum. Such output is very convenient, since it reduces the number of requests and simplifies the client code.
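For comparison, here is a sketch of what it would take to collect the same numbers without stats: four separately named metric aggregations bundled into one request (with the document count read from hits.total):

GET bigginsight/_search
{
    "aggregations": {
        "download_sum": { "sum": { "field": "downloadTotal" } },
        "download_avg": { "avg": { "field": "downloadTotal" } },
        "download_min": { "min": { "field": "downloadTotal" } },
        "download_max": { "max": { "field": "downloadTotal" } }
    },
    "size": 0
}

A single stats aggregation replaces all four and returns the count as part of the same result.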

Now let's take a look at the extended stats aggregation.

Extended stats aggregation

The extended_stats aggregation returns a few more statistics on top of the previous version:

GET bigginsight/_search
{
    "aggregations": {
       "download_estats": {
          "extended_stats": {
             "field": "downloadTotal"
          }
       }
    },
    "size": 0
}

The response will be as follows:

{
   "took": 15,
   "timed_out": false,
   ...,
   "hits": {
      "total": 242836,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "download_estats": {
         "count": 242835,
         "min": 0,
         "max": 241213,
         "avg": 9049.102065188297,
         "sum": 2197438700,
         "sum_of_squares": 133545882701698,
         "variance": 468058704.9782911,
         "std_deviation": 21634.664429528162,
         "std_deviation_bounds": {
            "upper": 52318.43092424462,
            "lower": -34220.22679386803
         }
      }
   }
}

In the response, you also get the sum of squares, the variance, the standard deviation, and the standard deviation bounds.
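By default, std_deviation_bounds gives the interval of the average plus or minus two standard deviations, which is exactly what the upper and lower values above work out to. If you need a different interval, the extended_stats aggregation accepts a sigma parameter; a minimal sketch with three standard deviations:

GET bigginsight/_search
{
    "aggregations": {
       "download_estats": {
          "extended_stats": {
             "field": "downloadTotal",
             "sigma": 3
          }
       }
    },
    "size": 0
}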

Cardinality aggregation

Counting the unique values of a field can be done using the cardinality aggregation. It is similar to finding the result of a query like the one below:

select count(*) from (select distinct username from usageReport) u;

Determining the cardinality, or the number of unique values, of a particular field is a fairly common task. For example, if you have a click stream from different visitors to your site, you may want to know how many unique visitors were on the site on a given day, week, or month.

Let's figure out how to find the number of unique visitors using the available network traffic data:

GET bigginsight/_search
{
     "aggregations": {
        "unique_visitors": {
            "cardinality": {
               "field": "username"
            }
        }
     },
     "size": 0
}

The cardinality aggregation's response looks just like that of the other metric aggregations:

{
   "took": 110,
   ...,
   "hits": {
      "total": 242836,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "unique_visitors": {
         "value": 79
      }
   }
}
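Keep in mind that the cardinality aggregation returns an approximate count: Elasticsearch implements it with the HyperLogLog++ algorithm, trading a little accuracy for bounded memory usage. The precision_threshold parameter lets you tune this trade-off; counts below the threshold are expected to be close to exact. A minimal sketch:

GET bigginsight/_search
{
     "aggregations": {
        "unique_visitors": {
            "cardinality": {
               "field": "username",
               "precision_threshold": 1000
            }
        }
     },
     "size": 0
}

Higher thresholds cost more memory per aggregation, so raise the value only as far as your accuracy requirements demand.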

Now that we understand the simplest types of aggregations, we can move on to some bucket aggregations.

About the authors

Pranav Shukla is the founder and CEO of Valens DataLabs, an engineer, a husband, and a father of two. He is a big data architect and professional programmer working with JVM-based languages. Pranav has been developing enterprise applications for Fortune 500 companies and startups for more than 14 years. His main focus is building scalable, data-driven applications based on the JVM, Java/Scala, the Hadoop ecosystem, Apache Spark, and NoSQL databases. He is actively working in areas related to big data engineering, analytics, and machine learning.

Pranav founded Valens DataLabs to help other companies use data to increase their competitiveness. Valens DataLabs specializes in building new-generation cloud applications for big data and web technologies. The company's work is based on agile practices, lean principles, test- and behavior-driven development, and continuous integration and continuous deployment of sustainable software systems.

Sharath Kumar M N earned a master's degree in computer science from the University of Texas at Dallas, USA. He has been working in the IT industry for more than ten years, currently works at Oracle as a solution developer for Elasticsearch, and is an advocate of the Elastic Stack. An avid speaker, he has presented at several science and technology conferences, including Oracle Code events. Sharath is an Elastic Certified Instructor, one of only a few technical experts in the world whom Elastic Inc. has officially authorized to deliver “training from the creators of Elastic.” He is also a machine learning and data science enthusiast.

About the science editor

Marcelo Ochoa works in the laboratory of the Faculty of Exact Sciences at the Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina. He is the technical director of Scotas (www.scotas.com), which specializes in pseudo-real-time solutions built on Apache Solr and Oracle technologies. Marcelo splits his time between university work and projects related to Oracle and big data technologies. He previously worked with database, web, and Java technologies. In the XML world, Marcelo is known as the developer of the DB Generator for the Apache Cocoon project. He has been involved in open source projects such as DBPrism, DBPrism CMS, and Restlet.org, where he worked on the Oracle XDB Restlet Adapter, an alternative for writing native REST web services inside the database's JVM.

Since 2006, he has been a member of the Oracle ACE program, and he recently joined the Docker Mentor project.

» More information about the book can be found on the publisher's website.
» Table of Contents
» Excerpt

Habrozhiteli get a 25% discount with the coupon code Elasticsearch.

When you purchase the paper version of the book, an electronic copy is sent to your e-mail.
