Analyzing Twitter data for the lazy with the Elastic Stack (Xbox vs PlayStation comparison)

In anticipation of the "ELK" intensive course, we have prepared a translation of a useful article for you.


Twitter data can be accessed in a variety of ways, but who wants to bother writing code? Especially code that has to run reliably, without failures or interruptions. With the Elastic Stack, you can easily collect and analyze data from Twitter: Logstash can collect tweets as an input. Kafka Connect, a tool covered in a recent article, also provides this capability, but Logstash can send data to many destinations (including Apache Kafka) and is easier to use.

In this article, we will cover the following issues:

  • Saving a stream of tweets in Elasticsearch via Logstash

  • Visualizations in Kibana (Xbox vs PlayStation)

  • Removing HTML tags from a keyword field using the normalization mechanism

Elastic Stack Environment

All the necessary components are in one Docker Compose. If you already have an Elasticsearch cluster, you only need Logstash.

version: '3.3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.2
    restart: unless-stopped
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - 9200:9200

  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.2
    restart: unless-stopped
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601

  logstash:
    image: docker.elastic.co/logstash/logstash:7.9.2
    volumes:
      - "./pipeline:/usr/share/logstash/pipeline"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"
    depends_on:
      - elasticsearch
    restart: unless-stopped

volumes:
  esdata:
    driver: local

Logstash Pipeline

input {
  twitter {
    consumer_key => "loremipsum"
    consumer_secret => "loremipsum"
    oauth_token => "loremipsum-loremipsum"
    oauth_token_secret => "loremipsum"
    keywords => ["XboxSeriesX", "PS5"]
    full_tweet => false
    codec => "json"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "tweets"
  }
}

To get the tokens and keys, you need a developer account and a Twitter app; registering them settles all the formalities.

The configuration of the pipeline itself is very simple. The stream of tweets is filtered by the words listed in keywords. If you need more metadata, simply set the full_tweet parameter to true.

Data

Some time after running docker-compose up -d, data starts to appear in the tweets index. At the time of this writing, my data had been collected for about two days; the entire index weighed about 430 MB, which is not that much. Perhaps a different API license would have allowed more data traffic. The visualizations in this article are based on the data collected over those two days.
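You can check how much data has accumulated without leaving Kibana; the _cat API reports document counts and index size (the column list here is just a convenient selection):

GET _cat/indices/tweets?v&h=index,docs.count,store.size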

No ILM here, just a simple index.

So we already have a tweets index. To be able to use the collected data in Kibana, you need to add an index pattern.
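Before doing so, it can be useful to inspect the mapping that Elasticsearch generated dynamically for the tweets, so you know which fields and subfields are available:

GET tweets/_mapping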

Sample document in the tweets index.

Tag Cloud – Xbox and PlayStation

A simple tag cloud built from a terms aggregation on hashtags.text.keyword. PS5 appears to be winning, but let's look at other visualizations as well.
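For reference, the aggregation behind the tag cloud can also be run directly in Dev Tools; this is a sketch assuming the dynamic mapping created the hashtags.text.keyword subfield:

GET tweets/_search
{
  "size": 0,
  "aggs": {
    "popular_hashtags": {
      "terms": {
        "field": "hashtags.text.keyword",
        "size": 25
      }
    }
  }
}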

Line Chart – Xbox and PlayStation

Here, too, I get the impression that PlayStation comes up more often than Xbox. To find out for sure, let's try to group the hashtags. Some people write PS5, others ps5, but it is one and the same product.

However, before moving on, let's note one point. Does the order of the buckets matter? Of course. This is what happens if you swap the order of the Terms and Date Histogram aggregations.

To group the hashtags, we can use the Filters aggregation. Let's add a few more hashtags, deliberately omitting the least popular ones. The Filter field accepts KQL syntax: like Lucene, only more powerful.

Using the filters hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation) and hashtags.text.keyword: (XboxSeriesX OR Xbox OR XboxSeriesS OR xbox), we now know for sure that the PlayStation is more popular on Twitter.
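The same comparison can be reproduced outside Kibana with two _count requests; query_string accepts the same Lucene-style expressions (shown here for the PlayStation side, the Xbox request is analogous):

GET tweets/_count
{
  "query": {
    "query_string": {
      "query": "hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)"
    }
  }
}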

Timelion

XBOX & PLAYSTATION

You can get an even more complete picture with Timelion. This interesting tool is designed for visualizing time series. Unlike the previous visualization, it can plot data from many sources at once.

You need to get used to the syntax first. Below is the code that generated this diagram.

.es(index=tweets, q='hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)').label("PS"),
.es(index=tweets, q='hashtags.text.keyword: (XboxSeriesX OR Xbox OR XboxSeriesS OR xbox)').label("XBOX")

Offset

Timelion lets you shift a series in time using the offset parameter. The example below compares the number of PlayStation tweets with the previous day. I don't have much data, so the effect is not particularly interesting.

.es(index=tweets, q='hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)').label("PS"),
.es(index=tweets, q='hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)', offset=-1d).label("PS -1 day")

Function variability (delta)

Using the same parameter together with the subtract method, we can calculate the delta of each series.

.es(index=tweets, q='hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)')
    .subtract(
        .es(index=tweets, q='hashtags.text.keyword: (PS5 OR ps5 OR PlayStation5 OR PlayStation)', offset=-1h)
    )
    .label("PS 1h delta"),
.es(index=tweets, q='hashtags.text.keyword: (XboxSeriesX OR Xbox OR XboxSeriesS OR xbox)')
    .subtract(
        .es(index=tweets, q='hashtags.text.keyword: (XboxSeriesX OR Xbox OR XboxSeriesS OR xbox)', offset=-1h)
    )
    .label("XBOX 1h delta")

Pie Chart – Client Types

A so-so chart

Now let's find out which clients people use to tweet. This, it turns out, is not so easy: the field with the client type contains an HTML tag, which reduces the clarity of the chart.

Nice chart

Elasticsearch offers many text-processing capabilities. The html_strip character filter, for instance, removes HTML tags. Unfortunately, it won't help us here, since analyzers can only be applied to fields of type text, while we are interested in a keyword field: that is the type aggregations can be used on.

For keyword fields, you can use normalizers. They work similarly to analyzers, but emit a single token at the output.

Below is the code that adds a normalizer to the tweets index. Since html_strip cannot be used here, I had to resort to a regular expression. To change the analysis settings of an index, you first need to close it. You can run the following snippets in Kibana's Dev Tools.

POST tweets/_close

PUT tweets/_settings
{
  "analysis": {
    "char_filter": {
      "client_extractor": {
        "type": "pattern_replace",
        "pattern": "<a[^>]+>([^<]+)</a>",
        "replacement": "$1"
      }
    },
    "normalizer": {
      "client_extractor_normalizer": {
        "type": "custom",
        "char_filter": [
          "client_extractor"
        ]
      }
    }
  }
}


POST tweets/_open
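It is worth verifying that the normalizer behaves as expected before touching any documents. The _analyze API accepts a normalizer parameter; the HTML below is only an illustrative client value, not one taken from the index:

GET tweets/_analyze
{
  "normalizer": "client_extractor_normalizer",
  "text": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
}

If everything is set up correctly, the response should contain a single token, Twitter for iPhone.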

With the normalizer added, we can update the mapping of the client property and add a new value subfield.

PUT tweets/_mapping
{
  "properties": {
    "client": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        },
        "value":{
          "type":"keyword",
          "normalizer":"client_extractor_normalizer"
        }
      }
    }
  }
}

Unfortunately, that's not all. Documents are analyzed only when they are added to the index (by the way, I wonder why it couldn't have been called a collection, as in MongoDB?). Existing documents can be re-indexed using the Update By Query mechanism.

POST tweets/_update_by_query?wait_for_completion=false&conflicts=proceed

This operation returns a task id. It may take a while if you have a lot of data. You can find the task with the GET _cat/tasks?v command.
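You can also query the Tasks API directly to watch the progress of the update; filtering on the byquery actions narrows the list to update-by-query tasks:

GET _tasks?detailed=true&actions=*byquery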

After refreshing the index pattern in Kibana, we get a significantly more readable chart. Here we can see that roughly the same number of users tweet from iPhones and from Android devices. I was particularly intrigued by the Bot Xbox Series X client.

What’s next?

I had plans to play with Spark NLP, but first I will probably tackle streaming the Twitter data. I am going to use out-of-the-box Spark NLP models to detect the language, sentiment and other parameters of the text using Spark Structured Streaming.

Repository

Link


You can learn more about the ELK intensive course here.
