A New Path to Nonlinear Embedding Analysis

A bit of preamble

For semantic analysis, cosine similarity is the usual tool: familiar and easy to interpret. In modern NLP, LLM-based embeddings are often preferred over classic approaches such as CountVectorizer or, say, TF-IDF. And it would seem that is enough: we have a high-dimensional vector, we compute the similarity, and it works, because the embedding itself already encodes many semantic relationships. But sometimes it is worth looking at all this from the other side, from the side of non-linear and potentially more powerful algorithms, because we can improve either the quality of the embeddings themselves or the way we compare them. Since this article covers the second option, I will say right away that you could also use mutual information or Hoeffding's D coefficient. Here, however, we will talk about time series and, somewhat surprisingly, a bit of dynamical systems and chaos theory. I hope I have intrigued you; now I will explain the motivation for these approaches, describe the algorithm, and show what the whole point is.

The algorithm code is available on GitHub.

All embeddings will be created with the UAE-Large-V1 model.

Motivation

Some readers may have wondered how time series analysis techniques can be applied to embeddings at all.

So here it is:

Embeddings are essentially multi-dimensional vectors. Let's take, for example, a vector of length 1024. Nothing stops us from treating it as a time series, simply imagining that each index is a point in time. Why not, after all; the main thing is that it produces a result.

This is what our embedding looks like when plotted as a time series. It looks a bit like a signal, right?

Embedding of the sentence “A man is playing music.” plotted as a time series.

The red dots are our vector values; in fact, they can be treated as local extrema of the time series, meaning they carry the most valuable information about it.

I also suggest looking at these “signals”: some are semantically related and some are not. Try to guess which ones are similar and which are not.

If you guessed that 1 and 3 are the most similar, great, but it is not always that easy. So let's look at an algorithm that can handle exactly this.

The first vector was “A man is playing music.”

The second vector was “A panda is climbing.”

The third vector was “A man plays a guitar.”

Algorithm

I did some research, with accuracy and execution speed as the main criteria. Wavelet transforms, used in conjunction with the phase synchronization coefficient, performed best.

The algorithm itself looks like this:

Vector normalization

The first step of the algorithm is to normalize the vectors. This ensures that all vectors have the same norm, so the comparison depends on their shape rather than their magnitude.
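As a minimal sketch of this step (plain L2 normalization with NumPy; I am assuming this is what PyWaveSync does, the exact implementation may differ):

import numpy as np

def normalize(vec):
    # Scale the vector to unit L2 norm so only its "shape" matters for comparison
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec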

Convert to complex vectors

This step involves converting ordinary vectors to complex ones, which allows us to work not only with the amplitude of the values, but also with their phase. To do this, we divide each normalized vector into two parts: the first half becomes the real part of the complex number, and the second becomes the imaginary part. This process enriches our data, adding an additional layer of information that will be used in further analysis.

Here's an example of how it works:
Let's take a vector V = [ 1, 2, 3, 4].

Step 1: Vector Divide

The first thing we need to do is split the vector into two parts. Since our vector has four dimensions, we split it in half:

  • First half: [1, 2]

  • Second half: [3, 4]

Step 2: Create a complex vector

Now we will transform these two parts into a complex vector, where the first half will be the real part and the second half will be the imaginary part. Thus, each element of the first half is combined with the corresponding element of the second half to form complex numbers:

Result

So our original vector V = [1, 2, 3, 4] transforms into a complex vector:

Vcomplex = [1 + 3i, 2 + 4i].
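In code, this split-and-combine step might look like the following (a NumPy sketch; the helper name is mine and not part of PyWaveSync):

import numpy as np

def to_complex(vec):
    # First half -> real part, second half -> imaginary part
    half = len(vec) // 2
    return vec[:half] + 1j * vec[half:]

# Example: to_complex(np.array([1.0, 2.0, 3.0, 4.0])) -> array([1.+3.j, 2.+4.j])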

Wavelet Transform Computation

Next, we apply the wavelet transform to the resulting complex vectors, decomposing each vector into components that better describe the local features of the data. The wavelet transform lets us analyze embeddings at different scales, highlighting both high- and low-frequency features.
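Here is a sketch of this step using PyWavelets (pywt); I decompose the real and imaginary parts separately and recombine them, which is one simple way to handle complex input. The library choice and the db4 / level-4 settings (discussed below) are assumptions about how PyWaveSync is configured:

import pywt

def wavelet_coeffs(cvec, wavelet="db4", level=4):
    # Multilevel discrete wavelet decomposition of a complex vector:
    # transform real and imaginary parts separately, then recombine
    real_parts = pywt.wavedec(cvec.real, wavelet, level=level)
    imag_parts = pywt.wavedec(cvec.imag, wavelet, level=level)
    return [r + 1j * i for r, i in zip(real_parts, imag_parts)]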

Calculation of phase characteristics and average phase synchrony

After computing the wavelet coefficients, we move on to analyzing the phase characteristics. Phase information lets us estimate how synchronously different parts of the embeddings change. The average phase synchrony between a pair of embeddings gives an idea of the degree of their semantic relatedness.

It is calculated by the formula:

\hat{\rho} = \left|  \frac{1}{N} \sum_{n=1}^{N} e^{j(\phi_1(t_n) - \phi_2(t_n))} \right|

where \phi_1(t_n) and \phi_2(t_n) are the phase angles of the two signals at time t_n.

With exact phase coincidence, the coefficient is equal to one, and in the absence of synchronization, it is zero, which is very convenient for us.

Here is the same idea illustrated on time series, for clarity.

After calculating the phase locking coefficient, we move on to the improved measure of phase synchrony (P).

\begin{align*} V &= 1 - \hat{\rho} \\ \hat{P} &= (1 - V) \cdot \hat{\rho} \end{align*}

The improved phase synchrony measure P provides a more accurate estimate of how well the signals are synchronized, compensating for timing variations by incorporating the variability measure V. Note that substituting V into the second formula shows that the improved measure is simply the square of the phase locking value, which sharpens the contrast between synchronized and unsynchronized pairs.
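A sketch of both formulas applied to a pair of complex wavelet coefficient arrays (NumPy only; the function name is mine):

import numpy as np

def improved_phase_synchrony(c1, c2):
    # Phase locking value: magnitude of the mean of e^{j(phi1 - phi2)}
    phi1, phi2 = np.angle(c1), np.angle(c2)
    rho = np.abs(np.mean(np.exp(1j * (phi1 - phi2))))
    # Variability V = 1 - rho; improved measure P = (1 - V) * rho
    v = 1.0 - rho
    return (1.0 - v) * rho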

Final Result Calculation

After the vectors have been normalized and converted to complex numbers, and their wavelet transforms and phase characteristics have been computed, we move on to the key stage of the algorithm: calculating the final comparison result.

The final result for each vector in the list is obtained by combining phase synchrony and cosine similarity. This is done by multiplying the phase synchrony by a rescaled cosine similarity: ps * (0.5 * (cs + 1)), where ps is the improved phase synchrony measure (P) and cs is the cosine similarity. The rescaling 0.5 * (cs + 1) maps cosine similarity from [-1, 1] to [0, 1], so it contributes a non-negative factor to the final score.
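Putting the previous steps together, here is a rough end-to-end sketch of the comparison, reusing the helpers sketched above. This is my own reconstruction of the pipeline, not the PyWaveSync source; in particular, averaging the phase synchrony over the decomposition levels is an assumption:

import numpy as np

def wavesync_score(v1, v2, wavelet="db4", level=4):
    # Normalize, convert to complex, decompose, then combine phase synchrony
    # with a rescaled cosine similarity
    v1, v2 = normalize(v1), normalize(v2)
    c1, c2 = to_complex(v1), to_complex(v2)
    coeffs1 = wavelet_coeffs(c1, wavelet, level)
    coeffs2 = wavelet_coeffs(c2, wavelet, level)
    ps = np.mean([improved_phase_synchrony(a, b) for a, b in zip(coeffs1, coeffs2)])
    cs = float(np.dot(v1, v2))          # cosine similarity of unit vectors
    return ps * (0.5 * (cs + 1.0))      # map cs from [-1, 1] to [0, 1]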

Regarding hyperparameters in wavelet transforms

In general, Daubechies wavelets are best suited. They are denoted dbN, where N indicates the order of the wavelet. The order affects the wavelet's ability to capture signal and noise information in the data.

The default hyperparameters are db4 with a decomposition level of 4.

For more subtle semantic relationships:

Lower-order Daubechies wavelets (db2, db3) may be more suitable, as they provide better localization in time and allow extracting more precise, localized semantic relationships from the data.

For more global patterns:

Higher-order Daubechies wavelets (db4, db6 and above) are preferred for analyzing global patterns. Their smoother, longer filters help capture broader, smoother semantic structures in the data while ignoring fine details.

Decomposition level

The formula below determines the maximum possible level of decomposition N when performing the wavelet transform of a signal. It helps ensure that the decomposition is performed without loss of information and without exceeding the limits set by the signal length L and the length of the wavelet filter used, Lfilter (a small code sketch follows the list below).

N = \left\lfloor \log_2\left(\frac{L}{L_{\text{filter}}}\right) \right\rfloor

Where:

  • N — recommended level of decomposition.

  • L — signal length.

  • Lfilter — the length (number of coefficients) of the wavelet filter used for the analysis. This value depends on the selected wavelet and typically ranges from 2 to 20. For specific families such as Daubechies, Lfilter increases with the wavelet order.
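A small sketch of this formula in code (PyWavelets also ships a similar helper, pywt.dwt_max_level, though its exact formula may differ slightly from the one above):

import math
import pywt

def max_decomposition_level(signal_len, filter_len):
    # N = floor(log2(L / L_filter)), from the formula above
    return int(math.floor(math.log2(signal_len / filter_len)))

# Example: a 512-point complex vector (half of a 1024-dim embedding) with db4,
# whose decomposition filter has dec_len == 8
filter_len = pywt.Wavelet("db4").dec_len
print(max_decomposition_level(512, filter_len))  # 6 with this formula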

Algorithm Tests

Testing of the algorithm takes two forms: with synthetic vectors and with real embeddings.

Installing the package:

pip install PyWaveSync

Below is the code for testing on a synthetic data set.

from wavesync.wavesync import WaveSync
import numpy as np

ws = WaveSync()

np.random.seed(42)
vec = np.random.rand(1024)
# Vectors that are small perturbations of vec (should score high)
similar_vecs = [vec + np.random.normal(0, 0.01, len(vec)) for _ in range(5)]
# Independent random vectors (should score low)
dissimilar_vecs = [np.random.rand(len(vec)) for _ in range(5)]
vec_a = np.random.rand(1024)
vec_b = -vec_a  # Coordinate-wise opposite of vec_a

# Test with similar, dissimilar, and opposite vectors
wavesync_similar_scores = ws.compare(vec, similar_vecs)
wavesync_dissimilar_scores = ws.compare(vec, dissimilar_vecs)
wavesync_opposite_scores = ws.compare(vec_a, [vec_b])

print(wavesync_similar_scores, wavesync_dissimilar_scores, wavesync_opposite_scores)

Result:

Similar
[0.9946460671728045, 0.9875065629124861, 0.989072345923832, 0.991220540585618, 0.9954265121946134]

Different
[0.005015474802584571, 0.004847775250650383, 0.0012281423357652212, 0.006391884708080264, 0.0002667694383566343]

Opposite
[0.0]

Here's a comparison using cosine similarity:

Similar
[0.9998493328107453, 0.9998563262925324, 0.9998411950733593, 0.999847887188542, 0.9998466519512551]

Different
[0.7358156012914848, 0.7530882884696813, 0.7373955262095263, 0.7466333075853196, 0.7318980690691882]

Opposite
[-1.0]

The result shows that the algorithm handles vector similarity very well; moreover, integrating cosine similarity into it lets us take the direction of the vectors into account, which is why opposite vectors score 0.0.

Now on to real sentences:

First, install the embedding library:

python -m pip install -U angle-emb

from angle_emb import AnglE
from wavesync.wavesync import WaveSync

sentences = [
    "An animal is biting a person's finger.",
    "A woman is reading.",
    "A man is lifting weights in a garage.",
    "A man plays the violin.",
    "A man is eating food.",
    "A man plays the piano.",
    "A panda is climbing.",
    "A man plays a guitar.",
    "A woman is slicing meat.",
    "A men is playing music on piano on the street for cat.",
]

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
vecs = angle.encode(sentences, to_numpy=True)
vec = angle.encode(["A man is playing music."], to_numpy=True)

ws = WaveSync()
wavesync_scores = ws.compare(vec, vecs)

print(wavesync_scores)

Query: “A man is playing music.”

  1. “An animal is biting a person's finger.” – 0.04

  2. “A woman is reading.” – 0.02

  3. “A man is lifting weights in a garage.” – 0.06

  4. “A man plays the violin.” – 0.29

  5. “A man is eating food.” – 0.09

  6. “A man plays the piano.” – 0.35

  7. “A panda is climbing.” – 0.01

  8. “A man plays a guitar.” – 0.51

  9. “A woman is slicing meat.” – 0.01

  10. “A men is playing music on piano on the street for cat.” – 0.31

As you can see, the algorithm clearly distinguishes between the embeddings, and thanks to the modification of the phase synchronization coefficient it reacts more sensitively to the differences between them.

Speed

On one system, across 100,000 comparisons of 1024-dimensional vectors with the default hyperparameters, the average time per comparison was about 2e-4 seconds, while plain cosine similarity took about 8e-6 seconds.

Even though the difference in speed is large, given what WaveSync computes the speed is still quite high, and it performs well in most tasks that do not require processing an enormous number of embeddings. In any case, there is always the option to first filter out the most dissimilar vectors using cosine similarity and then apply WaveSync, as in the sketch below.
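Here is a sketch of that two-stage approach. The helper name and the keep_top cutoff are mine; only the ws.compare call is taken from the examples above:

import numpy as np
from wavesync.wavesync import WaveSync

def two_stage_search(query, candidates, keep_top=100):
    # Stage 1: cheap cosine-similarity prefilter over all candidates
    q = query / np.linalg.norm(query)
    mat = np.array([c / np.linalg.norm(c) for c in candidates])
    cos_scores = mat @ q
    top_idx = np.argsort(cos_scores)[::-1][:keep_top]
    # Stage 2: WaveSync only on the surviving candidates
    ws = WaveSync()
    ws_scores = ws.compare(query, [candidates[i] for i in top_idx])
    # Return (candidate index, WaveSync score) pairs, best first
    return sorted(zip(top_idx.tolist(), ws_scores), key=lambda t: t[1], reverse=True)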

Conclusion

Time series, wavelets and so on are great, but what is the main advantage of the algorithm?

The main advantage of the algorithm over using cosine similarity is its ability to distinguish between similar and dissimilar vectors more efficiently and accurately. This provides a clearer interpretation of the results for analytical purposes.

My goal was to demonstrate how a non-standard application of time series analysis can achieve results that are fully competitive with cosine similarity in speed and accuracy, thanks to the analysis of non-linear dependencies. This offers a new angle on working with embeddings in NLP and, more generally, on comparing high-dimensional vectors.

If you have ideas, questions or suggestions, I will be glad to see them in the comments.
