bytes package from the inside

Hello! In the last article we looked at bytes.Buffer from the inside. Now I want to turn to the bytes package itself. What is hidden behind it? Almost every Go developer has used it, whether in production or in local development. It is a fairly powerful package that provides functions for working with byte slices.

Let's go through the functions one by one, discuss what each is for, and, most importantly, look closely at the source code. The article can serve as a reference or as a refresher on the internal structure of the package. It is quite long, so not everyone will read it in one sitting. I tried to highlight the main points of each function and give examples that make the operating principle clear, since the package is broad and applies to many tasks. In particular, understanding how it works can help you improve the performance of parts of your services. Enjoy reading!

Equal

Let's start with the function that is defined first in the package. Let's look at its source code:

func Equal(a, b []byte) bool {
  return string(a) == string(b)
}

Note that these conversions do not allocate, as the developers themselves point out in the source. Let's consider why this function is needed.

Equal reports whether a and b have the same length and contain the same bytes.
A nil argument is equivalent to an empty slice.

This approach works because a Go string is an immutable sequence of bytes, and strings can be compared with the == operator. If both slices have the same length and contain the same bytes, the converted strings are equal as well.

func TryEqual() {
  a := []byte("hello")
  b := []byte("hello")
  c := []byte("world")

  fmt.Println(bytes.Equal(a, b)) // true
  fmt.Println(bytes.Equal(a, c)) // false
}

This function is an efficient way to compare byte slices, which is needed quite often.
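
The remark that a nil argument is equivalent to an empty slice can be checked with a short sketch. The helper name sameBytes is hypothetical and only wraps bytes.Equal for the demonstration.

```go
package main

import (
	"bytes"
	"fmt"
)

// sameBytes is a hypothetical wrapper around bytes.Equal used only in this demo.
func sameBytes(a, b []byte) bool {
	return bytes.Equal(a, b)
}

func main() {
	var nilSlice []byte // a nil slice
	empty := []byte{}   // an empty but non-nil slice

	fmt.Println(nilSlice == nil, empty == nil)         // true false: the values differ
	fmt.Println(sameBytes(nilSlice, empty))            // true: Equal treats them alike
	fmt.Println(sameBytes([]byte("go"), []byte("Go"))) // false: comparison is case-sensitive
}
```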

Compare

Compare takes two arguments of type []byte (byte slices), a and b, and returns an integer that compares them lexicographically: 0 if a == b, -1 if a < b, and +1 if a > b. Lexicographic order here means dictionary order, where the slices are compared byte by byte.

func TryCompare() {
  a := []byte("hello")
  b := []byte("hello")
  c := []byte("world")

  fmt.Println(bytes.Compare(a, b)) // 0
  fmt.Println(bytes.Compare(a, c)) // -1
  fmt.Println(bytes.Compare(c, a)) // 1
}
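
Because Compare returns a negative, zero, or positive value, it can be used directly as a comparison function for sorting. The sketch below assumes Go 1.21 or later (for the slices package); the helper name sortKeys is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"slices"
)

// sortKeys sorts byte-slice keys in place in lexicographic order.
// It is a hypothetical helper that passes bytes.Compare to slices.SortFunc.
func sortKeys(keys [][]byte) {
	slices.SortFunc(keys, bytes.Compare)
}

func main() {
	keys := [][]byte{[]byte("pear"), []byte("apple"), []byte("fig")}
	sortKeys(keys)
	for _, k := range keys {
		fmt.Println(string(k)) // apple, fig, pear (one per line)
	}
}
```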

Count

Next we will look at a more involved function. Count is a function from the bytes package of Go's standard library. It takes two arguments of type []byte (byte slices), s and sep, and returns the number of non-overlapping occurrences of sep in s.

func Count(s, sep []byte) int {
    // if sep is an empty slice,
    // return the number of UTF-8-encoded code points in s plus one
    if len(sep) == 0 {
        return utf8.RuneCount(s) + 1
    }

    // if sep is a single byte, use the optimized bytealg.Count
    if len(sep) == 1 {
        return bytealg.Count(s, sep[0])
    }

    // counter of occurrences of sep in s, starting at zero
    n := 0

    // loop over the occurrences of sep in s
    for {
        // find the index of the next occurrence of sep in s
        i := Index(s, sep)

        // if there are no more occurrences, return the counter
        if i == -1 {
            return n
        }

        // count this occurrence
        n++

        // narrow s to the subslice that starts right after the current occurrence of sep
        s = s[i+len(sep):]
    }
}

This is already more interesting. Let's look at the definition of Index:

// Index returns the index of the first occurrence of sep in s, or -1 if sep is not present in s.
func Index(s, sep []byte) int {
    // length of sep
    n := len(sep)

    switch {
    case n == 0:
        // if sep is empty, return 0
        return 0
    case n == 1:
        // if sep is a single byte, use the optimized IndexByte
        return IndexByte(s, sep[0])
    case n == len(s):
        // if sep has the same length as s, check whether they are equal with Equal,
        // which we looked at above
        if Equal(sep, s) {
            return 0
        }
        // if they are not equal, return -1
        return -1
    case n > len(s):
        // if sep is longer than s, return -1
        return -1
    case n <= bytealg.MaxLen:
        // if sep is small enough, the brute-force search
        // in bytealg.Index can be used

        // if s is also small enough, call bytealg.Index right away
        if len(s) <= bytealg.MaxBruteForce {
            return bytealg.Index(s, sep)
        }

        c0 := sep[0]
        c1 := sep[1]
        i := 0
        t := len(s) - n + 1
        fails := 0

        for i < t {
            // look for the first byte of sep in s
            if s[i] != c0 {
                // if the current byte does not match, use IndexByte to jump to the next candidate
                o := IndexByte(s[i+1:t], c0)
                if o < 0 {
                    return -1
                }
                i += o + 1
            }

            // check the second byte of sep and then the whole sequence
            if s[i+1] == c1 && Equal(s[i:i+n], sep) {
                return i
            }

            i++
            fails++

            // if IndexByte produces too many false positives,
            // switch to bytealg.Index
            if fails > bytealg.Cutover(i) {
                r := bytealg.Index(s[i:], sep)
                if r >= 0 {
                    return r + i
                }
                return -1
            }
        }

        // sep was not found, return -1
        return -1
    }

    // sep matches none of the cases above: run the same scan,
    // but fall back to the Rabin-Karp algorithm

    c0 := sep[0]
    c1 := sep[1]
    i := 0
    fails := 0

    t := len(s) - n + 1

    for i < t {
        // look for the first byte of sep in s
        if s[i] != c0 {
            // if the current byte does not match, use IndexByte to jump to the next candidate
            o := IndexByte(s[i+1:t], c0)
            if o < 0 {
                break
            }
            i += o + 1
        }

        // check the second byte of sep and then the whole sequence
        if s[i+1] == c1 && Equal(s[i:i+n], sep) {
            return i
        }

        i++
        fails++

        // if IndexByte produces too many false positives,
        // switch to bytealg.IndexRabinKarp
        if fails >= 4+i>>4 && i < t {
            j := bytealg.IndexRabinKarp(s[i:], sep)
            if j < 0 {
                return -1
            }
            return i + j
        }
    }

    // sep was not found, return -1
    return -1
}

What is the Rabin-Karp algorithm? It searches for a substring in a text using hashing. It relies on a rolling hash, which lets the hash of each successive window of the text be computed quickly from the previous one. This makes the algorithm effective for finding substrings in large inputs. As for IndexByte, it is just a wrapper:

func IndexByte(b []byte, c byte) int {
	return bytealg.IndexByte(b, c)
}

IndexByte returns the index of the first instance of c in b, or -1 if c is absent from b.

Example of using Count:
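
To make the Rabin-Karp idea mentioned above more concrete, here is a minimal sketch of a rolling-hash search. It is not the bytealg implementation: the function name indexRabinKarp, the constant primeRK, and its value are assumptions made for illustration.

```go
package main

import "fmt"

// primeRK is the multiplier of the polynomial hash.
// The value is an assumption of this sketch.
const primeRK = 16777619

// indexRabinKarp returns the index of the first occurrence of sep in s, or -1.
func indexRabinKarp(s, sep []byte) int {
	n := len(sep)
	if n == 0 {
		return 0
	}
	if n > len(s) {
		return -1
	}
	// Hash sep and the first window of s; pow ends up as primeRK^n.
	var hashSep, h, pow uint32 = 0, 0, 1
	for i := 0; i < n; i++ {
		hashSep = hashSep*primeRK + uint32(sep[i])
		h = h*primeRK + uint32(s[i])
		pow *= primeRK
	}
	for i := 0; ; i++ {
		// Compare the bytes only when the hashes match, to rule out collisions.
		if h == hashSep && string(s[i:i+n]) == string(sep) {
			return i
		}
		if i+n >= len(s) {
			return -1
		}
		// Roll the window: add the incoming byte, remove the outgoing one.
		h = h*primeRK + uint32(s[i+n]) - pow*uint32(s[i])
	}
}

func main() {
	fmt.Println(indexRabinKarp([]byte("hello world"), []byte("world"))) // 6
	fmt.Println(indexRabinKarp([]byte("hello world"), []byte("foo")))   // -1
}
```

Each step updates the hash in constant time, so the bytes are compared in full only at positions where the hashes agree.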

func TryCount() {
    // create the byte slices
    s := []byte("hello world hello")
    sep := []byte(" ")

    fmt.Println(bytes.Count(s, sep)) // 2
}
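
Two branches of Count are easy to miss: matches never overlap, and an empty separator counts code points plus one. A small sketch follows; the helper name countOccurrences is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// countOccurrences is a hypothetical convenience wrapper around bytes.Count.
func countOccurrences(s, sep string) int {
	return bytes.Count([]byte(s), []byte(sep))
}

func main() {
	// After each match the search resumes past the end of that match.
	fmt.Println(countOccurrences("aaaa", "aa")) // 2

	// Empty separator: utf8.RuneCount(s) + 1, here 6 runes plus one.
	fmt.Println(countOccurrences("привет", "")) // 7
}
```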

Contains

func Contains(b, subslice []byte) bool {
	return Index(b, subslice) != -1
}

If Index() returns -1, subslice was not found in b, and Contains() returns false. If Index() returns the index of an occurrence, subslice was found in b, and Contains() returns true. It is just a wrapper around Index, which we discussed above.

func TryContains() {
	b := []byte("hello world")
	subslice := []byte("world")
	fmt.Println(bytes.Contains(b, subslice)) // true

	subslice2 := []byte("foo")
	fmt.Println(bytes.Contains(b, subslice2)) // false
}

ContainsAny

func ContainsAny(b []byte, chars string) bool {
	return IndexAny(b, chars) >= 0
}

ContainsAny reports whether any of the UTF-8-encoded code points in chars are within b. This function is useful for checking whether a large byte slice contains any of the given characters, without having to create a new slice or copy the data.

But let's look at the definition of IndexAny:

func IndexAny(s []byte, chars string) int {
    if chars == "" {
        // if chars is an empty string, return -1
        return -1
    }
    if len(s) == 1 {
        // if s is a single byte, check it directly with the optimized IndexByteString
        r := rune(s[0])
        if r >= utf8.RuneSelf {
            // a lone non-ASCII byte can only match utf8.RuneError
            for _, r = range chars {
                if r == utf8.RuneError {
                    return 0
                }
            }
            return -1
        }
        if bytealg.IndexByteString(chars, s[0]) >= 0 {
            return 0
        }
        return -1
    }
    if len(chars) == 1 {
        // if chars is a single byte, use IndexRune to find it
        r := rune(chars[0])
        if r >= utf8.RuneSelf {
            r = utf8.RuneError
        }
        return IndexRune(s, r)
    }
    if len(s) > 8 {
        // if s is long enough, check whether chars is an ASCII-only string
        if as, isASCII := makeASCIISet(chars); isASCII {
            // if it is, makeASCIISet has built a bit set
            // representing the ASCII characters in chars
            for i, c := range s {
                if as.contains(c) {
                    return i
                }
            }
            return -1
        }
    }
    // loop over s looking for the first occurrence of any of the code points
    var width int
    for i := 0; i < len(s); i += width {
        r := rune(s[i])
        if r < utf8.RuneSelf {
            // if r is an ASCII character, use the optimized IndexByteString
            if bytealg.IndexByteString(chars, s[i]) >= 0 {
                return i
            }
            width = 1
            continue
        }
        // decode r from s with utf8.DecodeRune
        r, width = utf8.DecodeRune(s[i:])
        if r != utf8.RuneError {
            // check whether r is the only code point in chars
            if len(chars) == width {
                if chars == string(r) {
                    return i
                }
                continue
            }
            // use the optimized bytealg.IndexString where possible
            if bytealg.MaxLen >= width {
                if bytealg.IndexString(chars, string(r)) >= 0 {
                    return i
                }
                continue
            }
        }
        // check whether r is one of the code points in chars
        for _, ch := range chars {
            if r == ch {
                return i
            }
        }
    }
    // none of the code points was found in s, return -1
    return -1
}

IndexAny finds the first occurrence in the byte slice s of any of the UTF-8-encoded code points in the string chars. It returns the byte index of that occurrence, or -1 if none of the code points is found in s.

Example of using ContainsAny:
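
Note that the result is a byte offset, not a character position, which matters once multi-byte code points are involved. A small sketch follows; the helper name firstOf is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// firstOf is a hypothetical wrapper around bytes.IndexAny.
func firstOf(s, chars string) int {
	return bytes.IndexAny([]byte(s), chars)
}

func main() {
	// "go: " takes 4 bytes and each Cyrillic letter takes 2 bytes,
	// so 'в' (the 4th letter of "привет") starts at byte 4 + 3*2 = 10.
	fmt.Println(firstOf("go: привет", "яв")) // 10
	fmt.Println(firstOf("golang", "xyz"))    // -1
}
```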

func TryContainsAny() {
	b := []byte("hello world")
	chars := "aeiou"
	fmt.Println(bytes.ContainsAny(b, chars)) // true

	chars2 := "xyz"
	fmt.Println(bytes.ContainsAny(b, chars2)) // false
}

ContainsAny checks whether the byte slice b contains any of the UTF-8-encoded code points in the string chars.

The main difference between these functions is that Contains searches for an exact sequence of bytes, while ContainsAny searches for any one of the specified code points. This means that Contains may be more efficient when you need to find an exact substring, but it is less flexible than ContainsAny, which can search for any of the given characters or code points.

For example, Contains can check whether a byte slice contains a certain substring, such as []byte("hello"), while ContainsAny can check whether it contains any of the given characters, such as "aeiou".

ContainsRune

I think we can skip it, since it behaves the same as the examples above, but IndexRune is more interesting.

func ContainsRune(b []byte, r rune) bool {
	return IndexRune(b, r) >= 0
}

func IndexRune(s []byte, r rune) int {
    switch {
    case 0 <= r && r < utf8.RuneSelf:
        // if r is an ASCII character, use IndexByte to find it
        return IndexByte(s, byte(r))
    case r == utf8.RuneError:
        // if r is utf8.RuneError, look for the first invalid UTF-8 byte sequence
        for i := 0; i < len(s); {
            r1, n := utf8.DecodeRune(s[i:])
            if r1 == utf8.RuneError {
                return i
            }
            i += n
        }
        return -1
    case !utf8.ValidRune(r):
        // if r is not a valid code point, return -1
        return -1
    default:
        // encode r into the byte array b and use Index to find it
        var b [utf8.UTFMax]byte
        n := utf8.EncodeRune(b[:], r)
        return Index(s, b[:n])
    }
}

In UTF-8, each code point can be represented by 1 to 4 bytes. The number of bytes required to represent a code point depends on its value. Low-value code points, such as ASCII characters, are represented by a single byte, while higher-value code points, such as emoji, are represented by multiple bytes.

The constant utf8.UTFMax is 4, since a single code point needs at most 4 bytes in UTF-8. It is used to size a buffer that can hold any encoded code point, as with the array b in IndexRune above. We will discuss the utf8 package in more detail in future articles.

ContainsFunc

func ContainsFunc(b []byte, f func(rune) bool) bool {
	return IndexFunc(b, f) >= 0
}

Overall, there is nothing unusual or hard to understand here. Let's take a look at IndexFunc:

func IndexFunc(s []byte, f func(r rune) bool) int {
	return indexFunc(s, f, true)
}

func indexFunc(s []byte, f func(r rune) bool, truth bool) int {
    // index at which the search starts
    start := 0

    // iterate over the byte slice s
    for start < len(s) {

        // width of the code point in bytes
        wid := 1

        // read the code point r from s[start]
        r := rune(s[start])

        if r >= utf8.RuneSelf {
            // if r is the first byte of a multi-byte code point,
            // decode the full code point with utf8.DecodeRune()
            r, wid = utf8.DecodeRune(s[start:])
        }

        // check whether f(r) gives the expected result
        if f(r) == truth {
            // if it does, return the index of the first byte of this code point
            return start
        }


        // move on to the next code point
        start += wid
    }
    // no code point satisfying the condition was found, return -1
    return -1
}

Example of using ContainsFunc:

func TryContainsFunc() {
	b := []byte("hello world")
	f := func(r rune) bool {
		return unicode.IsUpper(r)
	}
	fmt.Println(bytes.ContainsFunc(b, f)) // false

	b2 := []byte("Hello World")
	fmt.Println(bytes.ContainsFunc(b2, f)) // true
}

LastIndex

func LastIndex(s, sep []byte) int {
	n := len(sep)
	switch {
	case n == 0:
		return len(s)
	case n == 1:
		return bytealg.LastIndexByte(s, sep[0])
	case n == len(s):
		if Equal(s, sep) {
			return 0
		}
		return -1
	case n > len(s):
		return -1
	}
	return bytealg.LastIndexRabinKarp(s, sep)
}

As we can see, Rabin-Karp is used here too. The function looks for the last occurrence of sep and returns its index. Example:

func TryLastIndex() {
	s := []byte("hello world")
	sep := []byte("o")
	fmt.Println(bytes.LastIndex(s, sep)) // 7

	sep2 := []byte("foo")
	fmt.Println(bytes.LastIndex(s, sep2)) // -1
}

In this example we look for the last occurrence of the letter “o” in the byte slice s and the last occurrence of the string “foo” in s. LastIndex() returns 7 in the first case and -1 in the second, since “foo” is not found in s. In practice, this function can be used when analyzing a file to find the last occurrence of some marker.

LastIndexByte

func LastIndexByte(s []byte, c byte) int {
	return bytealg.LastIndexByte(s, c)
}

Almost the same, but it searches for a single byte rather than a byte slice.
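
A typical use is cutting off everything up to the last separator, for example to get the final element of a path. This is a sketch under that assumption; the helper name baseName is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// baseName is a hypothetical helper that returns the part of path
// after the last '/', or the whole path if there is no '/'.
func baseName(path []byte) []byte {
	if i := bytes.LastIndexByte(path, '/'); i >= 0 {
		return path[i+1:]
	}
	return path
}

func main() {
	fmt.Println(string(baseName([]byte("/var/log/app.log")))) // app.log
	fmt.Println(string(baseName([]byte("app.log"))))          // app.log
}
```

The result is a subslice of the input, so no bytes are copied.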

SplitN

SplitN splits the byte slice s into subslices separated by sep and returns a slice of these subslices.

The function takes the following arguments:

s – the byte slice to split.
sep – the separator on which to split.
n – the maximum number of subslices to return. If n is less than 0, the function returns all subslices. If n is 0, it returns nil. If n is greater than 0, it returns at most n subslices, and the last one contains the unsplit remainder of s.

func SplitN(s, sep []byte, n int) [][]byte { return genSplit(s, sep, 0, n) }

genSplit() implements the basic splitting algorithm and takes an additional argument, sepSave, which specifies how many bytes of sep to keep at the end of each subslice. SplitN passes 0, so the separator is dropped from the result.

If sep is an empty slice, SplitN splits s after each UTF-8-encoded code point.

func TrySplitN() {
	s := []byte("hello,world,golang,bytes")
	sep := []byte(",")
	n := 3

	subslices := bytes.SplitN(s, sep, n)
	for _, subslice := range subslices {
		fmt.Println(string(subslice))
	}
}

hello
world
golang,bytes

This function can be useful for breaking large binary data into smaller pieces separated by a specific delimiter, such as for parsing text files or processing data packets on the network.
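
The effect of the n argument is easiest to see side by side. In the sketch below the helper name parts is hypothetical; it only converts the result of bytes.SplitN to strings for printing.

```go
package main

import (
	"bytes"
	"fmt"
)

// parts is a hypothetical helper that runs bytes.SplitN
// and converts the subslices to strings.
func parts(s, sep string, n int) []string {
	var out []string
	for _, p := range bytes.SplitN([]byte(s), []byte(sep), n) {
		out = append(out, string(p))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", parts("a,b,c", ",", 0))  // []: n == 0 yields nothing
	fmt.Printf("%q\n", parts("a,b,c", ",", 2))  // ["a" "b,c"]: the last part is the remainder
	fmt.Printf("%q\n", parts("a,b,c", ",", -1)) // ["a" "b" "c"]: n < 0 means no limit
	fmt.Printf("%q\n", parts("abc", "", 2))     // ["a" "bc"]: empty sep splits per code point
}
```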

Let's take a closer look at genSplit.

func genSplit(s, sep []byte, sepSave, n int) [][]byte {
    // if n is 0, return nil
    if n == 0 {
        return nil
    }

    // if sep is empty, call the helper explode(), which splits s after each UTF-8-encoded code point
    if len(sep) == 0 {
        return explode(s, n)
    }

    // if n is negative, compute the number of subslices to return from the number of separators in s
    if n < 0 {
        n = Count(s, sep) + 1
    }

    // if n is greater than len(s)+1, cap it at len(s)+1
    if n > len(s)+1 {
        n = len(s) + 1
    }

    // create the slice a to hold the subslices and the variable i to track the current subslice index
    a := make([][]byte, n)
    n--
    i := 0

    // the loop finds the index of the first occurrence of sep in s with Index()
    // if the separator is found, store the part of s before it in a and advance s past the separator
    // the loop runs until n subslices have been collected or the separator is no longer found
    for i < n {
        m := Index(s, sep)
        if m < 0 {
            break
        }
        a[i] = s[: m+sepSave : m+sepSave]
        s = s[m+len(sep):]
        i++
    }

    // store the remainder of s in a and return a up to the current subslice index plus one
    a[i] = s

    return a[:i+1]
}

func explode(s []byte, n int) [][]byte {
    // if n is less than or equal to 0 or greater than len(s), set n to len(s)
    if n <= 0 || n > len(s) {
        n = len(s)
    }

    // create the slice a to hold the subslices, plus size and na to track the current rune size and subslice index
    a := make([][]byte, n)
    var size int
    na := 0

    // the loop decodes the first UTF-8-encoded code point of s with utf8.DecodeRune()
    // if the maximum number of subslices is reached, store the remainder of s as the last subslice and stop
    // otherwise store the current code point in a and advance s past it
    for len(s) > 0 {
        if na+1 >= n {
            a[na] = s
            na++
            break
        }
        _, size = utf8.DecodeRune(s)
        a[na] = s[0:size:size]
        s = s[size:]
        na++
    }

    // return a up to the current subslice index
    return a[0:na]
}

Overall, it's a pretty useful function for splitting a slice of bytes.
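
One detail in genSplit deserves attention: the three-index slice expression s[: m+sepSave : m+sepSave] limits the capacity of each subslice to its length. As a result, appending to a field allocates a new array instead of overwriting the bytes that follow it in the original. A minimal sketch follows; the helper name appendToFirstField is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// appendToFirstField is a hypothetical helper: it splits s on commas
// and appends 'X' to the first field.
func appendToFirstField(s []byte) []byte {
	fields := bytes.Split(s, []byte(","))
	// cap(fields[0]) == len(fields[0]), so append must reallocate.
	return append(fields[0], 'X')
}

func main() {
	s := []byte("a,b")
	first := appendToFirstField(s)
	fmt.Println(string(first)) // aX
	fmt.Println(string(s))     // a,b: the original data is intact
}
```

With a plain two-index slice, the append would have written 'X' over the comma in s.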

SplitAfterN

The difference between these functions is what happens to the separator. SplitN cuts the slice before each separator and drops it, while SplitAfterN cuts after each separator, so the separator stays at the end of each subslice.

For example, given the byte slice s := []byte("a,b,c") and the separator sep := []byte(","), the call bytes.SplitN(s, sep, 2) returns [][]byte{[]byte("a"), []byte("b,c")}, and the call bytes.SplitAfterN(s, sep, 2) returns [][]byte{[]byte("a,"), []byte("b,c")}.

func SplitAfterN(s, sep []byte, n int) [][]byte {
	return genSplit(s, sep, len(sep), n)
}

func TrySplitAfterN() {
	s := []byte("hello,world,golang,bytes")
	sep := []byte(",")
	n := 3

	subslices := bytes.SplitAfterN(s, sep, n)
	for _, subslice := range subslices {
		fmt.Println(string(subslice))
	}
}

hello,
world,
golang,bytes

Split

func Split(s, sep []byte) [][]byte { return genSplit(s, sep, 0, -1) }

Split splits the byte slice s into subslices separated by sep and returns a slice of these subslices. Unlike SplitN, it places no limit on the number of subslices, and the separator itself is not included in the result.

func TrySplit() {
	s := "hello world from Go program"
	sep := " "
  
	words := bytes.Split([]byte(s), []byte(sep))
	for _, word := range words {
		fmt.Println(string(word))
	} 
}

hello
world
from
Go
program

Fields

Fields splits the byte slice s into subslices around runs of one or more whitespace characters, as defined by unicode.IsSpace. It returns a slice of subslices of s, or an empty slice if s contains only whitespace.

func Fields(s []byte) [][]byte {
    // first count the subslices separated by whitespace
    // the count is exact if s contains only ASCII characters, otherwise it is an approximation
    n := 0
    wasSpace := 1

    // setBits tracks which bits are set in the bytes of s
    setBits := uint8(0)
    for i := 0; i < len(s); i++ {
        r := s[i]

        // OR-ing every byte into setBits records which bits were set in any byte of s
        setBits |= r

        // look up in the table whether r is an ASCII whitespace character
        isSpace := int(asciiSpace[r])

        // add 1 to n if the previous character was whitespace and the current one is not,
        // using bitwise AND and NOT (unary ^) instead of a branch
        n += wasSpace & ^isSpace

        wasSpace = isSpace
    }

    // if the slice contains non-ASCII bytes, take the slower path
    if setBits >= utf8.RuneSelf {
        return FieldsFunc(s, unicode.IsSpace)
    }

    // create a slice of n subslices
    a := make([][]byte, n)

    // index of the current subslice
    na := 0

    // index at which the current field starts
    fieldStart := 0
    i := 0

    // skip whitespace at the start of the slice
    for i < len(s) && asciiSpace[s[i]] != 0 {
        i++
    }

    fieldStart = i

    // walk the slice and cut it into subslices separated by one or more whitespace characters
    for i < len(s) {
        if asciiSpace[s[i]] == 0 {
            i++
            continue
        }

        // store the current field in the result
        a[na] = s[fieldStart:i:i]
        na++
        i++

        // skip whitespace between fields
        for i < len(s) && asciiSpace[s[i]] != 0 {
            i++
        }

        fieldStart = i
    }

    // store the last field if it is not empty
    if fieldStart < len(s) {
        a[na] = s[fieldStart:len(s):len(s)]
    }

    return a
}

This function is useful for breaking text into individual words or tokens separated by whitespace. For example, the call bytes.Fields([]byte("hello world")) returns [][]byte{[]byte("hello"), []byte("world")}. It has two ways of splitting: a faster path for slices containing only ASCII characters and a slower path for slices containing non-ASCII characters. The function first counts the whitespace-separated subslices in the variable n. It then uses the variable setBits to check whether the slice contains non-ASCII bytes and selects the appropriate path.

Let's see what asciiSpace is.

var asciiSpace = [256]uint8{'\t': 1, '\n': 1, '\v': 1, '\f': 1, '\r': 1, ' ': 1}

The array contains 256 elements of type uint8, one for each possible byte value. Each element is zero or one: zero means the corresponding character is not whitespace, and one means it is. This avoids calling unicode.IsSpace for each character, which would be slower.

We can notice that the code uses bitwise optimizations to improve performance.
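
The counting pass can be isolated into a small sketch. The function name countASCIIFields is hypothetical, and the table is copied from the listing above. The expression wasSpace & ^isSpace is 1 only when the previous byte was whitespace and the current one is not, that is, at the start of each field.

```go
package main

import (
	"bytes"
	"fmt"
)

// asciiSpace mirrors the lookup table shown above.
var asciiSpace = [256]uint8{'\t': 1, '\n': 1, '\v': 1, '\f': 1, '\r': 1, ' ': 1}

// countASCIIFields is a hypothetical helper that counts whitespace-separated
// fields without branching. It is only correct for ASCII input.
func countASCIIFields(s []byte) int {
	n := 0
	wasSpace := 1 // treat the position before s as whitespace
	for _, c := range s {
		isSpace := int(asciiSpace[c])
		n += wasSpace & ^isSpace // +1 exactly at a space-to-non-space transition
		wasSpace = isSpace
	}
	return n
}

func main() {
	s := []byte("  hello   world ")
	fmt.Println(countASCIIFields(s))  // 2
	fmt.Println(len(bytes.Fields(s))) // 2
}
```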

FieldsFunc

FieldsFunc serves the same purpose, but splits at code points that satisfy a given condition f instead of at whitespace.

func FieldsFunc(s []byte, f func(rune) bool) [][]byte {
    // span holds the start and end indices of a subslice
    type span struct {
        start int
        end   int
    }

    // slice that collects the subslice indices
    spans := make([]span, 0, 32)

    // find the start and end indices of the subslices.
    // this is done in a separate pass (rather than slicing s and collecting
    // the resulting subslices right away), which is more efficient

    start := -1 // index at which the current subslice starts, if >= 0

    for i := 0; i < len(s); {
        // determine the size and value of the current code point
        size := 1

        r := rune(s[i])

        if r >= utf8.RuneSelf {
            r, size = utf8.DecodeRune(s[i:])
        }

        // check whether the code point satisfies the condition f
        if f(r) {
            // if a subslice is in progress, save its indices in spans
            // and reset the start index
            if start >= 0 {
                spans = append(spans, span{start, i})
                start = -1
            }

        } else {
            // if no subslice is in progress, start one here
            if start < 0 {
                start = i
            }
        }

        // move on to the next code point
        i += size
    }

    // if the last subslice is not empty, save its indices in spans
    if start >= 0 {
        spans = append(spans, span{start, len(s)})
    }

    // build the slice of subslices from the indices in spans
    a := make([][]byte, len(spans))
    for i, span := range spans {
        a[i] = s[span.start:span.end:span.end]
    }

    return a
}

func TryFieldsFunc() {
	s := []byte("hello, world, golang, bytes")
	f := func(r rune) bool {
		return unicode.IsSpace(r) || r == ','
	}
  
	subslices := bytes.FieldsFunc(s, f)
	for _, subslice := range subslices {
		fmt.Println(string(subslice))
	}
}

hello
world
golang
bytes

Join

Join concatenates the elements of s, placing the separator sep between them, and returns a new byte slice.

func Join(s [][]byte, sep []byte) []byte {
    // if s has no elements, return an empty byte slice
    if len(s) == 0 {
        return []byte{}
    }

    // if s has one element, just return a copy of it
    if len(s) == 1 {
        return append([]byte(nil), s[0]...)
    }

    // compute the total length of the resulting byte slice
    var n int

    // if sep is not empty, add its length multiplied by the number of elements in s
    // minus one (separators go only between elements)
    if len(sep) > 0 {
        if len(sep) >= maxInt/(len(s)-1) {
            panic("bytes: Join output length overflow")
        }

        n += len(sep) * (len(s) - 1)
    }

    // add the length of each element of s to the total
    for _, v := range s {
        if len(v) > maxInt-n {
            panic("bytes: Join output length overflow")
        }
        n += len(v)
    }

    // create a new byte slice of length n
    b := bytealg.MakeNoZero(n)[:n:n]

    // copy the first element of s into the result b
    bp := copy(b, s[0])

    // append the remaining elements of s, each preceded by sep
    for _, v := range s[1:] {
        bp += copy(b[bp:], sep)
        bp += copy(b[bp:], v)
    }

    return b
}

Example of using Join:

func TryJoin() {
	s := [][]byte{{'h', 'e', 'l', 'l', 'o'}, {'w', 'o', 'r', 'l', 'd'}, {'g', 'o', 'l', 'a', 'n', 'g'}}

	result := bytes.Join(s, []byte{',', ' '})

	fmt.Println(string(result)) // hello, world, golang
}

HasPrefix

func HasPrefix(s, prefix []byte) bool {
	return len(s) >= len(prefix) && Equal(s[0:len(prefix)], prefix)
}

Checks if a byte slice starts with a given prefix.

func TryPrefix() {
	s := []byte("hello, world")

	hasPrefix := bytes.HasPrefix(s, []byte("hello"))

	fmt.Println(hasPrefix) // true
}

HasSuffix

func HasSuffix(s, suffix []byte) bool {
	return len(s) >= len(suffix) && Equal(s[len(s)-len(suffix):], suffix)
}

The same thing, but for a suffix, so we don’t need an example.

Map

The mapping function must take one argument of type rune (a Unicode code point) and return a rune. Map applies it to each character in the byte slice s and returns a copy with the results. If mapping returns a negative value, the corresponding character is dropped from the result with no replacement. This is useful wherever characters in a byte slice need to be transformed: for example, converting all characters to upper or lower case, removing non-printable characters, or replacing certain characters with others.

func Map(mapping func(r rune) rune, s []byte) []byte {
	b := make([]byte, 0, len(s))
  
	for i := 0; i < len(s); {
		wid := 1
		r := rune(s[i])
      
		if r >= utf8.RuneSelf {
			r, wid = utf8.DecodeRune(s[i:])
		}
      
		r = mapping(r)
		if r >= 0 {
			b = utf8.AppendRune(b, r)
		}
      
		i += wid
	}
  
	return b
}

func TryMap() {
	s := []byte("hello, world")

	// convert all ASCII letters to upper case
	mapping := func(r rune) rune {
		if r >= 'a' && r <= 'z' {
			return r - 32
		}
		return r
	}

	result := bytes.Map(mapping, s)
	fmt.Println(string(result)) // HELLO, WORLD
}
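
The removal behaviour, where a negative return value drops the character, can be shown with a short sketch. The helper name keepDigits is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// keepDigits is a hypothetical helper that uses bytes.Map
// to drop every character that is not a digit.
func keepDigits(s []byte) []byte {
	return bytes.Map(func(r rune) rune {
		if unicode.IsDigit(r) {
			return r
		}
		return -1 // a negative value removes the character
	}, s)
}

func main() {
	fmt.Println(string(keepDigits([]byte("+1 (555) 010-9999")))) // 15550109999
}
```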

Repeat

func Repeat(b []byte, count int) []byte {
	if count == 0 {
		return []byte{}
	}

	if count < 0 {
		panic("bytes: negative Repeat count")
	}
  
	if len(b) > maxInt/count {
		panic("bytes: Repeat output length overflow")
	}
  
	n := len(b) * count

	if len(b) == 0 {
		return []byte{}
	}
  
	const chunkLimit = 8 * 1024
	chunkMax := n
	if chunkMax > chunkLimit {
		chunkMax = chunkLimit / len(b) * len(b)
		if chunkMax == 0 {
			chunkMax = len(b)
		}
	}
  
	nb := bytealg.MakeNoZero(n)[:n:n]
	bp := copy(nb, b)
	for bp < n {
		chunk := bp
		if chunk > chunkMax {
			chunk = chunkMax
		}
		bp += copy(nb[bp:], nb[:chunk])
	}
  
	return nb
}

If the length of the resulting slice exceeds a certain limit (8 KB), the function caps the size of each copied chunk so that the source of the copy stays within the processor cache.
Since Repeat has no way to return an error, it panics when the result length would overflow.
The 8 KB value was found empirically.
The developers warn that Repeat must not be removed and that its type signature must not be changed, since this could cause link errors in packages that access it through linkname.

Here is the verbatim comment:

// Despite being an exported symbol,
// Repeat is linknamed by widely used packages.
// Notable members of the hall of shame include:
//   - gitee.com/quant1x/num
//
// Do not remove or change the type signature.
// See go.dev/issue/67401.
//
// Note that this comment is not part of the doc comment.
//
//go:linkname Repeat
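
To make the copy loop easier to follow, here is a simplified sketch of the same doubling idea. The function repeatDoubling is hypothetical: it leaves out the overflow checks and the 8 KB chunk limit, uses a plain make instead of bytealg.MakeNoZero, and returns an empty slice for a negative count instead of panicking.

```go
package main

import (
	"bytes"
	"fmt"
)

// repeatDoubling is a simplified, hypothetical re-implementation of the
// copy loop: it doubles the filled prefix on each pass and omits the
// overflow checks and the chunk limit of the real function.
func repeatDoubling(b []byte, count int) []byte {
	if count <= 0 || len(b) == 0 {
		return []byte{}
	}
	n := len(b) * count
	nb := make([]byte, n)
	bp := copy(nb, b) // seed the buffer with one copy of b
	for bp < n {
		// copy stops at the end of nb, so the last pass may be partial
		bp += copy(nb[bp:], nb[:bp])
	}
	return nb
}

func main() {
	fmt.Println(string(bytes.Repeat([]byte("ab"), 3)))   // ababab
	fmt.Println(string(repeatDoubling([]byte("ab"), 3))) // ababab
}
```

The number of copy calls grows logarithmically with count, which is why the real function does not simply append b in a loop.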

ToUpper

Let's look at another interesting function that converts all characters to uppercase.

func ToUpper(s []byte) []byte {
	// check whether the slice contains only ASCII bytes
	isASCII, hasLower := true, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
		hasLower = hasLower || ('a' <= c && c <= 'z')
	}

	// for an ASCII-only slice, use the optimized path
	if isASCII {

		// if there are no lowercase letters, just return a copy
		if !hasLower {
			return append([]byte(""), s...)
		}

		// allocate a new byte slice
		b := bytealg.MakeNoZero(len(s))[:len(s):len(s)]

		// convert every lowercase letter to upper case
		for i := 0; i < len(s); i++ {
			c := s[i]
			if 'a' <= c && c <= 'z' {
				c -= 'a' - 'A'
			}
			b[i] = c
		}

		// return the new byte slice
		return b
	}

	// for a slice with non-ASCII bytes, fall back to Map
	return Map(unicode.ToUpper, s)
}

Well, here's a pretty trivial example:

func TryToUpper() {
    s := []byte("hello, world!")

	result := bytes.ToUpper(s)

	fmt.Println(string(result)) // HELLO, WORLD!
}

ToLower

func ToLower(s []byte) []byte {
	// check whether the byte slice s contains only ASCII characters
	isASCII, hasUpper := true, false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			isASCII = false
			break
		}
		hasUpper = hasUpper || ('A' <= c && c <= 'Z')
	}

	// for an ASCII-only slice the conversion can be optimized
	if isASCII {

		// if s has no uppercase letters, just return a copy of s
		if !hasUpper {
			return append([]byte(""), s...)
		}
		// allocate a new byte slice b with bytealg.MakeNoZero(), which skips zero-initialization of the memory
		b := bytealg.MakeNoZero(len(s))[:len(s):len(s)]

		// walk over s and convert every ASCII uppercase letter to lower case
		// by adding the difference between the codes of 'a' and 'A'
		for i := 0; i < len(s); i++ {
			c := s[i]
			if 'A' <= c && c <= 'Z' {
				c += 'a' - 'A'
			}
			b[i] = c
		}

		return b
	}

	// if s contains non-ASCII characters, use Map() to convert everything to lower case
	return Map(unicode.ToLower, s)
}

Example of use:

func TryTolower() {
    s := []byte("Hello, World!")
  
	lower := bytes.ToLower(s)
	fmt.Println(string(lower)) // hello, world!
}

ToTitle

func ToTitle(s []byte) []byte { return Map(unicode.ToTitle, s) }

ToTitle converts all characters in the byte slice s to title case using unicode.ToTitle. Like the previous functions, it relies on Map, which applies the given function to each character of s.

func TryToTitle() {
    s := []byte("hello, world!")
  
	title := bytes.ToTitle(s)
	fmt.Println(string(title)) // HELLO, WORLD!
}

The difference from ToUpper is subtle. ToUpper maps each character to its upper case form, while ToTitle maps it to its title case form, the form used for the first letter of a word in a heading. For most characters the two mappings return the same code point. They differ for a few digraph characters such as 'ǳ' (U+01F3), where the results are distinct code points that are hard to tell apart visually.

A clear example:

func TryDifference() {
    str := "aáäAÁÄbBcçÇCǳ"

	// For most characters it seems that ToTitle() and ToUpper() are the same
	fmt.Println(strings.ToTitle(str)) // AÁÄAÁÄBBCÇÇCǲ
	fmt.Println(strings.ToUpper(str)) // AÁÄAÁÄBBCÇÇCǱ

	// But let's compare the unicode points of the composite character 'ǳ'
	fmt.Println()
	str = "ǳ"
	fmt.Printf("%+q", str) // "\u01f3"
	fmt.Println()
	fmt.Printf("%+q", strings.ToTitle(str)) // "\u01f2"
	fmt.Println()
	fmt.Printf("%+q", strings.ToUpper(str)) // "\u01f1"
}

Example taken from here.

ToUpperSpecial

ToUpperSpecial also uses Map, which applies the given function to each character in the byte slice s. Here the function is c.ToUpper, which converts a character to upper case using the special mapping rules specified by the parameter c.

func ToUpperSpecial(c unicode.SpecialCase, s []byte) []byte {
	return Map(c.ToUpper, s)
}

So as not to invent one, the example is taken from the documentation, since this is a fairly special case:

func main() {
	fmt.Println(strings.ToUpperSpecial(unicode.TurkishCase, "örnek iş")) // ÖRNEK İŞ
}

ToLowerSpecial

Exactly the same, except that it is converted to lower case.

func ToLowerSpecial(c unicode.SpecialCase, s []byte) []byte {
	return Map(c.ToLower, s)
}
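
A short sketch of where the special rules actually matter for lower-casing. The helper lowerTurkish is a made-up name; unicode.TurkishCase is the same SpecialCase value used in the example above.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// lowerTurkish is a hypothetical helper that lower-cases s using the
// Turkish rules from the unicode package.
func lowerTurkish(s []byte) []byte {
	return bytes.ToLowerSpecial(unicode.TurkishCase, s)
}

func main() {
	// Under Turkish rules the capital I maps to the dotless ı (U+0131).
	fmt.Printf("%+q\n", lowerTurkish([]byte("I"))) // "\u0131"
	// The default mapping gives the ordinary ASCII i.
	fmt.Printf("%+q\n", bytes.ToLower([]byte("I"))) // "i"
}
```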

ToTitleSpecial

func ToTitleSpecial(c unicode.SpecialCase, s []byte) []byte {
	return Map(c.ToTitle, s)
}

One more function from the same family, this time built on c.ToTitle.

ToValidUTF8

func ToValidUTF8(s, replacement []byte) []byte {
	// create a new byte slice b with zero length and capacity len(s)+len(replacement)
	b := make([]byte, 0, len(s)+len(replacement))

	// invalid reports whether the previous byte was part of an invalid UTF-8 sequence
	invalid := false

	// walk over all bytes of s
	for i := 0; i < len(s); {

		// take the current byte
		c := s[i]

		// a valid ASCII byte is appended to b as is,
		// then we move on to the next byte
		if c < utf8.RuneSelf {
			i++
			invalid = false
			b = append(b, c)
			continue
		}

		// get the width of the UTF-8 sequence that starts at the current byte
		_, wid := utf8.DecodeRune(s[i:])

		// width 1 for a non-ASCII byte means the byte is invalid
		if wid == 1 {
			// move on to the next byte
			i++

			// append replacement only once per run of invalid bytes
			if !invalid {
				invalid = true
				b = append(b, replacement...)
			}
			continue
		}

		// the current byte starts a valid UTF-8 sequence:
		// append all of its bytes to b and move past it
		invalid = false
		b = append(b, s[i:i+wid]...)
		i += wid
	}

	// return b, which now contains only valid UTF-8 sequences and replacements
	return b
}

This function turns the byte slice s into valid UTF-8 by replacing each run of invalid bytes with the given replacement. It first creates a new byte slice b with zero length and capacity equal to the sum of the lengths of s and replacement. It then loops over the bytes of s, copies valid UTF-8 sequences into b unchanged, and writes replacement once for every run of invalid bytes.

func TryToValidUTF8() {
	s := []byte("\xff\x00hello\x80world")
	replacement := []byte("?")

	valid := bytes.ToValidUTF8(s, replacement)
	// \x00 is a valid ASCII byte, so it is kept; %q makes it visible
	fmt.Printf("%q\n", valid) // "?\x00hello?world"
}
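
One detail that is easy to miss: a whole run of invalid bytes yields a single replacement. A minimal sketch, with sanitize as a hypothetical helper name:

```go
package main

import (
	"bytes"
	"fmt"
)

// sanitize is a hypothetical helper: it replaces every run of invalid
// UTF-8 bytes in s with a single "?".
func sanitize(s []byte) []byte {
	return bytes.ToValidUTF8(s, []byte("?"))
}

func main() {
	// Three invalid bytes in a row produce one replacement, not three.
	fmt.Println(string(sanitize([]byte("a\xff\xfe\xfdb")))) // a?b
	// Valid multi-byte runes are copied through unchanged.
	fmt.Println(string(sanitize([]byte("h\xc3\xa9llo")))) // héllo
}
```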

Title

Title treats s as UTF-8 encoded text and returns a copy in which the first letter of every word is mapped to its title case.

func Title(s []byte) []byte {
	prev := ' '
	return Map(
		func(r rune) rune {
			if isSeparator(prev) {
				prev = r
				return unicode.ToTitle(r)
			}
			prev = r
			return r
		},
		s)
}

Note that Title is deprecated: the rule it uses for word boundaries does not handle Unicode punctuation properly. The documentation recommends the golang.org/x/text/cases package instead.
The function passed to Map is a closure: it remembers the previous character in prev and updates it after processing the current one.

The interesting part is isSeparator, which checks whether a character is a word separator.

func isSeparator(r rune) bool {
	// ASCII alphanumerics and the underscore are not separators
	if r <= 0x7F {
		switch {
		case '0' <= r && r <= '9':
			return false
		case 'a' <= r && r <= 'z':
			return false
		case 'A' <= r && r <= 'Z':
			return false
		case r == '_':
			return false
		}
		return true
	}
  
	// letters and digits are not separators either
	if unicode.IsLetter(r) || unicode.IsDigit(r) {
		return false
	}
  
	// otherwise treat spaces as separators
	return unicode.IsSpace(r)
}

Example of use:

func TryTitle() {
    s := []byte("hello, world!")
  
	title := bytes.Title(s)
	fmt.Println(string(title)) // Hello, World!
}
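
The separator rule has visible side effects, which is part of why the function is deprecated. A small sketch (titleOf is a hypothetical wrapper, used only for illustration):

```go
package main

import (
	"bytes"
	"fmt"
)

// titleOf is a hypothetical wrapper around the deprecated bytes.Title,
// used here only to show how word boundaries are detected.
func titleOf(s string) string {
	return string(bytes.Title([]byte(s)))
}

func main() {
	// The apostrophe is an ASCII non-alphanumeric, so it counts as a separator.
	fmt.Println(titleOf("don't stop")) // Don'T Stop
	// The underscore is explicitly not a separator; the hyphen is.
	fmt.Println(titleOf("snake_case kebab-case")) // Snake_case Kebab-Case
}
```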

TrimLeftFunc

TrimLeftFunc treats s as UTF-8 encoded text and returns a subslice of s with all leading characters that satisfy f removed. In other words, it cuts everything to the left of the first character for which f returns false.

func TrimLeftFunc(s []byte, f func(r rune) bool) []byte {
  
	i := indexFunc(s, f, false)

	// indexFunc returns -1 when every character satisfies f, so nothing is left
	if i == -1 {
		return nil
	}

	return s[i:]
}

func TryTrimLeftFunc() {
    s := []byte(")))hello, world!)))")
	trimmed := bytes.TrimLeftFunc(s, func(r rune) bool {
		return r == ')'
	})
  
	fmt.Println(string(trimmed)) // hello, world!))) 
}

In this example we create a byte slice s containing the string ")))hello, world!)))". Then we remove the closing brackets on the left, up to the first character that is not a bracket, using bytes.TrimLeftFunc(s, func(r rune) bool { return r == ')' }).

Now let's look at indexFunc itself.

func indexFunc(s []byte, f func(r rune) bool, truth bool) int {
	start := 0
	for start < len(s) {
		wid := 1
		r := rune(s[start])
		if r >= utf8.RuneSelf {
			r, wid = utf8.DecodeRune(s[start:])
		}
		if f(r) == truth {
			return start
		}
		start += wid
	}
	return -1
}

Nothing unusual here: it is the same as IndexFunc, except that when truth == false the meaning of the predicate is inverted, so the function returns the index of the first character for which f returns false.
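
The -1 branch is worth seeing in action: when every character satisfies the predicate, TrimLeftFunc returns nil. A minimal sketch, where trimLeadingSpace is a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode"
)

// trimLeadingSpace is a hypothetical helper built on bytes.TrimLeftFunc.
func trimLeadingSpace(s []byte) []byte {
	return bytes.TrimLeftFunc(s, unicode.IsSpace)
}

func main() {
	fmt.Printf("%q\n", trimLeadingSpace([]byte(" \t hello "))) // "hello "
	// Every rune satisfies the predicate, so indexFunc returns -1
	// and the result is nil rather than an empty non-nil slice.
	fmt.Println(trimLeadingSpace([]byte("   ")) == nil) // true
}
```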

TrimRightFunc

func TrimRightFunc(s []byte, f func(r rune) bool) []byte {
	i := lastIndexFunc(s, f, false)
	if i >= 0 && s[i] >= utf8.RuneSelf {
		_, wid := utf8.DecodeRune(s[i:])
		i += wid
	} else {
		i++
	}
	return s[0:i]
}

The same as TrimLeftFunc, but trimming from the right.

func TryTrimRightFunc() {
	s := []byte(")))hello, world!)))")
	trimmed := bytes.TrimRightFunc(s, func(r rune) bool {
		return r == ')'
	})

	fmt.Println(string(trimmed)) // )))hello, world!
}

Other Trim functions

To keep the article from getting too long, I will not go through the rest in detail. TrimFunc, TrimPrefix and TrimSuffix follow very similar logic. Trim, TrimLeft, TrimRight and TrimSpace are built on the same principle, each with its own particulars, such as the use of asciiSet.
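
Since these functions get no listing of their own, here is a brief usage sketch. The helper cleanLine and its input format are invented for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// cleanLine is a hypothetical helper that chains several trim functions:
// it drops surrounding whitespace, then a "> " prefix, then trailing dots.
func cleanLine(s []byte) []byte {
	s = bytes.TrimSpace(s)
	s = bytes.TrimPrefix(s, []byte("> "))
	return bytes.TrimRight(s, ".")
}

func main() {
	fmt.Printf("%q\n", cleanLine([]byte("  > hello...\n"))) // "hello"
	// Trim takes a cutset: any of the listed characters is removed from both ends.
	fmt.Printf("%q\n", bytes.Trim([]byte("xyhixy"), "xy")) // "hi"
}
```

Note the difference in the second argument: TrimPrefix and TrimSuffix take an exact byte sequence, while Trim, TrimLeft and TrimRight take a set of characters.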

Runes

func Runes(s []byte) []rune {
	t := make([]rune, utf8.RuneCount(s))
	i := 0
	for len(s) > 0 {
		r, l := utf8.DecodeRune(s)
		t[i] = r
		i++
		s = s[l:]
	}
	return t
}

Runes interprets s as UTF-8 and returns the decoded code points as a slice of runes, nothing special.
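
A quick sketch of why this matters: the number of runes can differ from the number of bytes. The helper runeLen is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
)

// runeLen is a hypothetical helper that counts code points by
// converting the bytes to runes first.
func runeLen(s []byte) int {
	return len(bytes.Runes(s))
}

func main() {
	s := []byte("héllo")
	// 'é' takes two bytes in UTF-8 but is a single rune.
	fmt.Println(len(s), runeLen(s)) // 6 5
}
```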

Replace

func Replace(s, old, new []byte, n int) []byte {
	// m is the number of occurrences of old in s
	m := 0

	// if n is not zero, count the occurrences
	if n != 0 {
		m = Count(s, old)
	}

	// if there are no occurrences, return a copy of the original slice
	if m == 0 {
		return append([]byte(nil), s...)
	}

	// if n is negative or m is less than n, replace all occurrences
	if n < 0 || m < n {
		n = m
	}

	// allocate a buffer t of len(s) + n*(len(new)-len(old)) bytes
	t := make([]byte, len(s)+n*(len(new)-len(old)))

	// w is the current write position in t
	w := 0

	// start is the current read position in s
	start := 0

	for i := 0; i < n; i++ {
		// j is the position where the next occurrence of old begins in s
		j := start

		// if old is empty, it matches at every rune boundary,
		// so advance by the width of one UTF-8 code point
		if len(old) == 0 {
			if i > 0 {
				_, wid := utf8.DecodeRune(s[start:])
				j += wid
			}

		// otherwise find where the next occurrence of old begins in s
		} else {
			j += Index(s[start:], old)
		}

		// copy the part of s from start to j into t
		w += copy(t[w:], s[start:j])

		// copy new into t
		w += copy(t[w:], new)

		// move start past the replaced occurrence
		start = j + len(old)
	}

	// copy the rest of s into t
	w += copy(t[w:], s[start:])

	// return the subslice t[0:w]
	return t[0:w]
}

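The function returns a copy of s in which the first n non-overlapping occurrences of old are replaced by new. If n < 0, there is no limit on the number of replacements.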

func TryReplace() {
    s := []byte("hello, world!")
	old := []byte("world")
	newB := []byte("Go")
	
	replaced := bytes.Replace(s, old, newB, 1)
	fmt.Println(string(replaced)) // hello, Go!
}
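
The branch for an empty old is the least obvious part of the listing, so here is a small sketch of what it does in practice. The helper dashes is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// dashes is a hypothetical helper that inserts "-" at up to n rune
// boundaries by replacing the empty slice.
func dashes(s string, n int) string {
	return string(bytes.Replace([]byte(s), []byte(""), []byte("-"), n))
}

func main() {
	fmt.Println(dashes("abc", -1)) // -a-b-c-
	fmt.Println(dashes("abc", 2))  // -a-bc
	fmt.Println(dashes("abc", 0))  // abc
}
```

An empty old matches before the first rune and after every rune, which is why "abc" has four matches rather than three.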

ReplaceAll

Replaces all occurrences.

func ReplaceAll(s, old, new []byte) []byte {
	return Replace(s, old, new, -1)
}

Just a wrapper over Replace.

EqualFold

func EqualFold(s, t []byte) bool {
	// ASCII fast path: this loop compares s and t byte by byte while both are ASCII
	i := 0

	for ; i < len(s) && i < len(t); i++ {
		sr := s[i]
		tr := t[i]
		// if either byte is not ASCII, jump to the hasUnicode label
		if sr|tr >= utf8.RuneSelf {
			goto hasUnicode
		}

		// if the bytes are equal, keep going
		if tr == sr {
			continue
		}

		// if they differ, swap them so that sr is less than tr
		if tr < sr {
			tr, sr = sr, tr
		}

		// both are ASCII: check whether sr is the upper case form of tr
		if 'A' <= sr && sr <= 'Z' && tr == sr+'a'-'A' {
			continue
		}

		// the bytes cannot be converted into each other, return false
		return false
	}

	// check whether both slices have been exhausted
	return len(s) == len(t)

hasUnicode:
	// we get here if s or t contains non-ASCII characters
	s = s[i:]
	t = t[i:]

	for len(s) != 0 && len(t) != 0 {
		// extract the first rune from each slice
		var sr, tr rune
		if s[0] < utf8.RuneSelf {
			sr, s = rune(s[0]), s[1:]
		} else {
			r, size := utf8.DecodeRune(s)
			sr, s = r, s[size:]
		}
		if t[0] < utf8.RuneSelf {
			tr, t = rune(t[0]), t[1:]
		} else {
			r, size := utf8.DecodeRune(t)
			tr, t = r, t[size:]
		}

		// if the runes match, keep going; if not, return false

		if tr == sr {
			continue
		}

		// if they differ, swap them so that sr is less than tr
		if tr < sr {
			tr, sr = sr, tr
		}

		// if tr is ASCII, check whether sr is the upper case form of tr
		if tr < utf8.RuneSelf {
			if 'A' <= sr && sr <= 'Z' && tr == sr+'a'-'A' {
				continue
			}
			return false
		}

		// general case: walk the chain of equivalent runes with unicode.SimpleFold
		// until it reaches tr, passes it, or wraps around to sr
		r := unicode.SimpleFold(sr)
		for r != sr && r < tr {
			r = unicode.SimpleFold(r)
		}
		if r == tr {
			continue
		}
		return false
	}

	// one of the slices is empty; check whether the other one is too
	return len(s) == len(t)
}

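EqualFold reports whether two slices, interpreted as UTF-8, are equal under simple Unicode case folding, a more general form of case insensitivity.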

func TryEqualFold() {
    s := []byte("Hello, World!")
	t := []byte("hello, world!")

	equal := bytes.EqualFold(s, t)
	fmt.Println(equal) // true
}
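
The non-ASCII branch is easier to appreciate with inputs that actually reach it. A short sketch, where sameFold is a hypothetical wrapper:

```go
package main

import (
	"bytes"
	"fmt"
)

// sameFold is a hypothetical wrapper that compares two strings with bytes.EqualFold.
func sameFold(a, b string) bool {
	return bytes.EqualFold([]byte(a), []byte(b))
}

func main() {
	// The Kelvin sign U+212A folds to the same characters as 'K' and 'k'.
	fmt.Println(sameFold("k", "\u212A")) // true
	// Simple folding maps rune to rune, so "ß" never matches "SS".
	fmt.Println(sameFold("Straße", "STRASSE")) // false
}
```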

As you can see, quite a lot is hidden behind a seemingly ordinary operation.

Cut

func Cut(s, sep []byte) (before, after []byte, found bool) {
	if i := Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, nil, false
}

Cut slices s around the first occurrence of the separator sep, returning the text before and after it. If sep is not found, Cut returns s, nil, false.

func TryCut() {
    s := []byte("hello, world!")
	sep := []byte(", ")
  
	before, after, found := bytes.Cut(s, sep)
	fmt.Println(string(before), string(after), found) // hello world! true
}
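
A typical use is splitting key=value pairs. The sketch below also shows the not-found case; splitKV is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitKV is a hypothetical helper that splits "key=value" on the first "=".
// ok is false when there is no "=" in the line.
func splitKV(line []byte) (key, value []byte, ok bool) {
	return bytes.Cut(line, []byte("="))
}

func main() {
	k, v, ok := splitKV([]byte("mode=a=b"))
	fmt.Printf("%q %q %v\n", k, v, ok) // "mode" "a=b" true

	k, v, ok = splitKV([]byte("novalue"))
	fmt.Printf("%q %v %v\n", k, v == nil, ok) // "novalue" true false
}
```

Keep in mind that before and after are subslices of s, not copies, so writing to them changes the original data.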

Clone

func Clone(b []byte) []byte {
	if b == nil {
		return nil
	}
	return append([]byte{}, b...)
}

Clones a slice of bytes.

func TryClone() {
	b := []byte("hello world!")
	c := bytes.Clone(b)
	fmt.Println(string(b), string(c), bytes.Equal(b, c)) // hello world! hello world! true
}
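
The point of Clone is that the copy has its own backing array. A minimal sketch, with cloneAndPatch as a hypothetical helper:

```go
package main

import (
	"bytes"
	"fmt"
)

// cloneAndPatch is a hypothetical helper: it clones b and overwrites
// the first byte of the copy, leaving b itself untouched.
func cloneAndPatch(b []byte) []byte {
	c := bytes.Clone(b)
	if len(c) > 0 {
		c[0] = 'H'
	}
	return c
}

func main() {
	b := []byte("hello")
	c := cloneAndPatch(b)
	fmt.Println(string(b), string(c)) // hello Hello
	// A nil slice stays nil, as the b == nil check in the source shows.
	fmt.Println(bytes.Clone(nil) == nil) // true
}
```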

CutPrefix

func CutPrefix(s, prefix []byte) (after []byte, found bool) {
	if !HasPrefix(s, prefix) {
		return s, false
	}
	return s[len(prefix):], true
}

Removes the given prefix.

func TryCutPrefix() {
    s := []byte("/path/to/file.txt")
	prefix := []byte("/path/to/")
  
	after, found := bytes.CutPrefix(s, prefix)
	fmt.Println(string(after), found) // file.txt true
}

CutSuffix

func CutSuffix(s, suffix []byte) (before []byte, found bool) {
	if !HasSuffix(s, suffix) {
		return s, false
	}
	return s[:len(s)-len(suffix)], true
}

Removes the given suffix.

func TryCutSuffix() {
    s := []byte("file.txt")
	suffix := []byte(".txt")
  
	before, found := bytes.CutSuffix(s, suffix)
	fmt.Println(string(before), found) // file true
}

Completion

Well, that's the end of the article. I hope it helped you understand a little better how the bytes package works. Overall, I think the article turned out to be quite informative and covers almost every function in the package. Thanks for reading!
