Detection of DGA domains, or a test task for an ML-engineer intern position

In this article, we will look at a simple task that one company uses as a test task for ML-engineer interns: detecting DGA domains. It can be solved with basic machine-learning tools, and we will show how to tackle it using the simplest methods. Knowing complex algorithms matters, but to demonstrate your skills successfully it is far more important to understand the basic concepts and be able to apply them in practice.

DGA (Domain Generation Algorithm) is an algorithm that automatically generates domain names; attackers often use such domains to bypass blocklists and communicate with command-and-control servers.
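To make the idea concrete, here is a minimal toy sketch of how a DGA works: both the malware and its operator derive the same pseudo-random domains from a shared seed (here, the current date). The function name `toy_dga` and the hashing scheme are illustrative assumptions, not any real malware family's algorithm.

```python
import hashlib
from datetime import date

def toy_dga(seed_date, count=3, length=12, tld=".com"):
    """Derive deterministic pseudo-random domains from a date seed, as a classic DGA would."""
    domains = []
    for i in range(count):
        # Hash the seed plus a counter to get reproducible pseudo-randomness
        digest = hashlib.md5(f"{seed_date.isoformat()}-{i}".encode()).hexdigest()
        # Map hex digits onto lowercase letters to get a letters-only label
        label = "".join(chr(ord('a') + int(c, 16) % 26) for c in digest[:length])
        domains.append(label + tld)
    return domains

print(toy_dga(date(2024, 1, 1)))
```

Because both sides compute the same list, blocking yesterday's domains does not help: tomorrow's seed yields a fresh set.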

The technical specification included test data, for which predictions had to be made, and validation data, on which metrics had to be reported in the following format:

True Positive (TP):
False Positive (FP):
False Negative (FN):
True Negative (TN):
Accuracy:
Precision:
Recall:
F1 Score:
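These quantities are straightforward to compute by hand from true labels and predictions; the sketch below (function name `confusion_and_metrics` is my own, not from the task) shows exactly how each cell and metric is derived.

```python
def confusion_and_metrics(y_true, y_pred):
    """Compute confusion-matrix cells and the derived metrics for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, tn, accuracy, precision, recall, f1

# Toy example: 6 labels, 4 correct predictions
print(confusion_and_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

In practice you would get the same numbers from `sklearn.metrics.confusion_matrix` and friends, but writing them out once makes the report format above self-explanatory.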

Sometimes companies don't provide training data and want to assess how capable you are of finding solutions on your own. This includes:

  1. Understanding the problem: Clear formulation of the problem.

  2. Methodology: Developing an action plan and choosing methods.

  3. Critical Thinking: Data analysis and hypothesis generation.

  4. Practical skills: Applying basic machine learning concepts.

It is important to demonstrate initiative and the ability to work with limited information. In our case, domains of existing companies can be found on Kaggle, and we will need to generate the non-existent (DGA) domains ourselves.

High-quality and diverse data enables algorithms to identify patterns, make predictions, and reach informed decisions; without good data, it is impossible to achieve successful results in machine learning. So our first step is to create a high-quality training set:

  1. Let's write functions to generate random strings and domain names. The function generate_random_string produces a string of a given length made of letters and, optionally, digits. The function generate_domain_names creates a list of domain names with different patterns.

    import random
    import string

    def generate_random_string(length, use_digits=True):
        """
        Generates a random string of the given length, consisting of letters and, optionally, digits.

        :param length: length of the string
        :param use_digits: whether to include digits in the string
        :return: a random string
        """
        characters = string.ascii_lowercase
        if use_digits:
            characters += string.digits
        return ''.join(random.choice(characters) for _ in range(length))
    
    def generate_domain_names(count):
        """
        Generates a list of domain names with various patterns and TLDs.

        :param count: number of domain names to generate
        :return: list of generated domain names
        """
        tlds = ['.com', '.ru', '.net', '.org', '.de', '.edu', '.gov', '.io', '.shop', '.co', '.nl', '.fr', '.space', '.online', '.top', '.info']
    
        def generate_domain_name():
            tld = random.choice(tlds)
            patterns = [
                lambda: generate_random_string(random.randint(5, 10), use_digits=False) + '-' + generate_random_string(random.randint(5, 10), use_digits=False),
                lambda: generate_random_string(random.randint(8, 12), use_digits=False),
                lambda: generate_random_string(random.randint(5, 7), use_digits=False) + '-' + generate_random_string(random.randint(2, 4), use_digits=True),
                lambda: generate_random_string(random.randint(4, 6), use_digits=False) + generate_random_string(random.randint(3, 5), use_digits=False),
                lambda: generate_random_string(random.randint(3, 5), use_digits=False) + '-' + generate_random_string(random.randint(3, 5), use_digits=False),
            ]
            domain_pattern = random.choice(patterns)
            return domain_pattern() + tld
    
        domain_list = [generate_domain_name() for _ in range(count)]
        return domain_list
  2. The code loads three CSV files, cleans the data by dropping column '1' and adding an is_dga column set to 0, generates one million DGA domain names, concatenates them with part_df, and shuffles the resulting DataFrame.

    import logging

    import pandas as pd

    try:
        logging.info('Loading data')
        part_df = pd.read_csv('top-1m.csv')
        df_val = pd.read_csv('val_df.csv')
        df_test = pd.read_csv('test_df.csv')
        logging.info('Data loaded successfully.')
    except Exception as e:
        logging.error(f'Error while loading data: {e}')
        raise

    logging.info('Processing data')
    part_df = part_df.drop('1', axis=1)
    part_df.rename(columns={'google.com': 'domain'}, inplace=True)
    part_df['is_dga'] = 0
    generated_domains = generate_domain_names(1000000)
    part_df_dga = pd.DataFrame({
        'domain': generated_domains,
        'is_dga': [1] * len(generated_domains)
    })
    df = pd.concat([part_df, part_df_dga], ignore_index=True)
    df = df.sample(frac=1).reset_index(drop=True)
  3. We exclude from the training set any domains that also appear in the validation or test sets, then balance the classes by sampling 500,000 examples of each. The resulting balanced set is shuffled and its index is reset.

    # Exclude domains that appear in the validation and test sets
    train_set = set(df.domain.tolist())
    val_set = set(df_val.domain.tolist())
    test_set = set(df_test.domain.tolist())
    intersection_val = train_set.intersection(val_set)
    intersection_test = train_set.intersection(test_set)
    if intersection_val or intersection_test:
      df = df[~df['domain'].isin(intersection_val | intersection_test)]
    
    
    # Balance the classes to an equal number of examples
    logging.info('Balancing classes')
    df_train_0 = df[df['is_dga'] == 0]
    df_train_1 = df[df['is_dga'] == 1]
    num_samples_per_class = 500000
    df_train_0_sampled = df_train_0.sample(n=num_samples_per_class, random_state=42)
    df_train_1_sampled = df_train_1.sample(n=num_samples_per_class, random_state=42)
    df_balanced = pd.concat([df_train_0_sampled, df_train_1_sampled])
    df_train = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)
    
  4. We create and train the model using a pipeline that combines vectorization with TfidfVectorizer and logistic regression. After training, the model is saved to the file model_pipeline.pkl.

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    logging.info('Creating and training the model')

    model_pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer(tokenizer=n_grams, token_pattern=None)),
        ("model", LogisticRegression(solver="saga", n_jobs=-1, random_state=12345))
    ])

    model_pipeline.fit(df_train['domain'], df_train['is_dga'])
    logging.info('Saving the model')
    joblib_file = "model_pipeline.pkl"
    joblib.dump(model_pipeline, joblib_file)
    logging.info(f'Model saved to {joblib_file}')

Our whole task boils down to splitting domains into N-grams and vectorizing them with TF-IDF. An N-gram is a sequence of N elements (words or characters) of a text; in our task we apply character N-grams to a single word to isolate and analyze the "syllables" of a domain. TF-IDF (Term Frequency–Inverse Document Frequency) is a method for estimating how important a term is to a document relative to the other documents in a collection.

Thus, by combining N-grams and TF-IDF, we can effectively analyze domains and identify their key characteristics. Consider two existing domains, texosmotr-auto.ru and pokerdomru.ru, and break them into 4-grams, ignoring the top-level domain (.ru):

  • For texosmotr-auto.ru: “texo”, “exos”, “xosm”, “osmo”, “smot”, “motr”, “otr-“, “tr-a”, “r-au”, “-aut”, “auto”

  • For pokerdomru.ru: “poke”, “oker”, “kerd”, “erdo”, “rdom”, “domr”, “omru”
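The lists above are just a sliding window over the label. As a quick sanity check, here is a one-line sketch (the helper name `char_ngrams` is mine, not from the task):

```python
def char_ngrams(label, n):
    """All character n-grams of a domain label, via a sliding window."""
    return [label[i:i + n] for i in range(len(label) - n + 1)]

print(char_ngrams("pokerdomru", 4))
```

A label of length L yields L - n + 1 n-grams, so "pokerdomru" (10 characters) produces 7 four-grams.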

We've covered 4-grams, but do we need a fixed N for all domains? Of course not. For each domain we create 3-, 4-, and 5-grams to capture different language patterns and structural features. This approach captures context better and increases the chance of discovering unique features that may be useful for classification.

  • code that builds the 3-, 4-, and 5-grams for a domain

    def n_grams(domain):
        """
        Generates the n-grams of a domain name.

        :param domain: domain name
        :return: list of n-grams
        """
        grams_list = []
        # n-gram sizes
        n = [3, 4, 5]
        domain = domain.split('.')[0]
        for count_n in n:
            for i in range(len(domain)):
                if len(domain[i: count_n + i]) == count_n:
                    grams_list.append(domain[i: count_n + i])
        return grams_list

All the resulting N-grams need to be vectorized, and the aforementioned TF-IDF method helps us with this: it converts the text data into numerical form, weighting each N-gram by how frequently it occurs within a domain and how rare it is across the overall set.
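This is what plugging a custom tokenizer into TfidfVectorizer looks like end to end. The sketch below uses a compact re-implementation of the article's n_grams tokenizer, and the sample domains (including the made-up `xjq3kfz9.com`) are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def n_grams(domain):
    """3-, 4- and 5-grams of the domain label (TLD stripped)."""
    label = domain.split('.')[0]
    return [label[i:i + n] for n in (3, 4, 5) for i in range(len(label) - n + 1)]

domains = ["texosmotr-auto.ru", "pokerdomru.ru", "xjq3kfz9.com"]
# token_pattern=None silences the "tokenizer overrides token_pattern" warning
vectorizer = TfidfVectorizer(tokenizer=n_grams, token_pattern=None)
X = vectorizer.fit_transform(domains)
print(X.shape)  # one row per domain, one column per distinct n-gram
```

Each row of the resulting sparse matrix is a domain; each column is one n-gram from the learned vocabulary, weighted by its TF-IDF score.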

The final step is to train our model. You can use any algorithm that improves your metric, but I chose classic logistic regression (LR): it is easy to implement, easy to interpret, and often gives good results. For example, I got the following metrics on the validation dataset:

True Positive (TP): 4605
False Positive (FP): 479
False Negative (FN): 413
True Negative (TN): 4503
Accuracy: 0.9108
Precision: 0.9058
Recall: 0.9177
F1 Score: 0.9117

So understanding basic concepts such as N-grams and TF-IDF opens up opportunities to solve applied problems and lets you present yourself confidently when applying for internships. These skills provide a solid foundation for professional growth in machine learning and data analysis.

PS: The code submitted for review to the company that provided this test task can be found here.
