My experience of using a neural network to filter profiles in a dating application

Dream girl ("performance" YandexART)

Have you noticed how many news stories and articles mention neural networks and dating apps in the same breath? Can a neural network be taught to filter profiles in a dating service? Does it actually help? In this article I will try to answer these and a few other questions: how I arrived at this idea, why I took it up in the first place, and how I implemented such a system at home. I will also touch briefly on the ethical side of the question. Anyone interested is welcome to read on.

You have probably already heard stories about people finding their soulmate with the help of ChatGPT or other artificial neural networks. I have been interested in a similar idea (minus the GPT part) for quite a while, and I have been preparing this article for several months, so I decided to study the question on my own. I ended up writing a browser extension and a small local server that processes the data, and I trained several neural network models. But first things first.

Let me address the ethical part right away. I did not want to waste my time or anyone else's, so I refused to use a chatbot for automatic communication with girls. Instead, I decided to filter profiles by their photographs and text data.


And a small disclaimer: I am not a professional data scientist; I was simply exploring whether such a system could be built at all. I do not store the data.


Data collection

First, I needed a way to collect data, preferably without too much effort, and ideally with the data labeled automatically as it came in. So I decided to write a simple browser extension. But before that, I had to figure out how extensions work (in fact, I already had some experience, but I will explain anyway).

Extension development

Simply put, an extension is an application, or more precisely, a web application. It has a main HTML page, styles if needed, and scripts. There is also a manifest; at the moment (June 2024) the third version of the manifest format is in use. Its structure is simple, and it is best to look it up in Google's documentation.

There are several ways an extension can live on a page, for example a sidebar or an injected content script (you have probably seen the latter more often). I will say right away that I first tried the sidebar, for convenient management of the extension and for viewing server responses. It turned out that the sidebar has limitations that prevent it from working properly with CSP: it has no access to child frames on the page, unless you broadcast messages to every participant in the browser, which is not great. So I switched to the cruder option of injecting a regular script into the page. This gives direct access to the child frame, since the parent frame has access to it.
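For orientation, a minimal version-3 manifest for an extension that injects a content script could look roughly like this; the extension name, matched URL, and file names here are placeholders, not the ones from my project:

```json
{
  "manifest_version": 3,
  "name": "profile-filter",
  "version": "1.0",
  "content_scripts": [
    {
      "matches": ["https://example.com/*"],
      "js": ["inject.js"]
    }
  ],
  "host_permissions": ["http://localhost/*"]
}
```

The `content_scripts` entry is what injects the script into matching pages; `host_permissions` would be needed to talk to a local server.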

Immediately after injection, the script attaches hooks to the control buttons: like, dislike, and photo switching. There were some peculiarities here. The application is written in one of the front-end frameworks (I can guess which one, but I cannot prove it, so I won't), and the elements have neither classes with readable names nor IDs, at least not the control buttons. But I found one trick: while poking around, I discovered that many controls have a “data-testid” attribute, so I started from that and attached my hooks. Then an interesting feature surfaced: sometimes these elements are recreated entirely. I do not know whether this is a protection mechanism; I am more inclined to think the framework does it, but either way it had to be dealt with. I ended up with a timer that runs every second, walks the elements, and re-attaches the hooks if they are missing. To check whether a handler is already in place, I used my own “control” attribute.

// Walk all divs and (re)attach click hooks to the control buttons,
// using a custom 'control' attribute as an "already hooked" marker.
function setManageElements() {
  for (let element of document.getElementsByTagName('div')) {
    let attribute = element.getAttribute('data-testid');

    if (attribute == null || element.getAttribute('control') != null) continue;

    switch (attribute) {
      case 'like':
        element.addEventListener('click', () => sendData('liked'));
        element.setAttribute('control', 'extensionLiked');
        break;
      case 'dislike':
        element.addEventListener('click', () => sendData('disliked'));
        element.setAttribute('control', 'extensionDisliked');
        break;
      case 'next-story-switcher':
        element.addEventListener('click', nextStory);
        element.setAttribute('control', 'extensionStory');
        break;
    }
  }
}

// The app re-renders its controls, so re-attach the hooks every second.
setInterval(() => {
  setManageElements();
}, 1000);

Hmm, at this point you could just attach an auto-click to the like button and call it a day.

Why hooks, you ask? Simple: this way I could automatically label which profiles I liked and which I did not. To get a link to a photo, I found an image element with the readable class name “vkuiCustomScrollView__box”; while clicking through a profile, I simply read the source of this element, added it to an array, and sent the array to the server once the profile was rated.

// URLs of the current profile's photos, sent to the server on rating.
let photos = [];

// Remember the currently shown photo's URL when switching stories.
function nextStory() {
  let urlImg = document
    .getElementsByClassName('vkuiCustomScrollView__box')[0]
    .getElementsByTagName('img')[0].src;

  if (!photos.includes(urlImg))
    photos.push(urlImg);
}

Retrieving text data was a bit harder (the class names are unreadable and constantly change, remember?). Here I found a workaround: there are many elements with the “vkuiTypography” class, but not all of them contain the information I needed, so I worked out empirically which ones to keep. In the end the data still turned out to be a complete mess, but more on that later.

// Collect the profile's text fields; the first few 'vkuiTypography'
// elements belong to the page chrome, so skip them (found empirically).
function getData() {
  let data = [];

  Array.from(document.getElementsByClassName('vkuiTypography')).slice(6)
    .forEach(element => {
      data.push(element.textContent);
    });

  return data;
}

(Demo of the data collection process.)

Server

In a few words, this is a locally deployed service that processes the information coming from the extension. It is written with FastAPI.

import uuid

import orjson
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware

origins = ['*']  # in practice, restrict this to the extension's origin

app = FastAPI()

app.add_middleware(
  CORSMiddleware,
  allow_origins=origins,
  allow_credentials=True,
  allow_methods=["*"],
  allow_headers=["*"],
)


@app.post('/save-data')
async def save_data(request: Request) -> Response:
  uuid4 = uuid.uuid4().hex
  request_body: dict = orjson.loads(await request.body())

  info = request_body.get('info')
  photos = request_body.get('photos')
  event_type = request_body.get('event_type')

  # Each profile lands in a 'liked' or 'disliked' directory under a unique name.
  with open(f'{event_type}/{uuid4}.json', 'w', encoding='utf-8') as file:
    file.write(orjson.dumps({
      'info': info,
      'photos': photos
    }).decode())

  return Response(f'{event_type}, {len(photos)} photos saved, {uuid4}')


@app.get('/check')
def check() -> Response:
  return Response(status_code=200)

The save_data method receives and labels the information. Each profile is assigned a uuid4 and saved straight under its label (liked or disliked). In this way I collected a sufficient amount of data and moved on to processing it.
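For reference, the payload the extension POSTs to /save-data has the shape below. This is a sketch with invented values (the field names come from the handler above) that mimics how the server derives the storage path:

```python
import json
import uuid

# A hypothetical example of the payload the extension sends;
# the field names match the handler above, the values are invented.
payload = {
    'event_type': 'liked',
    'info': ['Anna, 23', '168 см', 'Travelling', 'Music'],
    'photos': ['https://example.com/photo1.jpg'],
}

# The server derives the storage path from the label and a fresh uuid4.
uuid4 = uuid.uuid4().hex
path = f"{payload['event_type']}/{uuid4}.json"

# Only 'info' and 'photos' are written to disk; the label lives in the path.
saved = json.dumps({'info': payload['info'], 'photos': payload['photos']},
                   ensure_ascii=False)
```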

Data processing

First, I had to sort out the resulting textual mess. I went into the application manually and collected the names of all the interests that can actually be entered there; this gave me a dictionary of more than a hundred elements. I also wrote several regular expressions to extract the name, height, and age. From there it was simple: some fields can be obtained with heuristics, some with regexes, some by plain string comparison. I also decided to download all the photos right away, more on that later.

import os
import re

import orjson
import pandas
import requests
import tqdm.notebook

# template_profile is a dict with all the profile fields pre-filled with None,
# built from the manually collected dictionary of interests.


def find_string(pattern: str, text: str) -> str | None:
    match = re.match(pattern, text)
    return match.string if match else None


profiles = []

for label in ['liked', 'disliked']:
    index_photo = 0

    for filename in tqdm.notebook.tqdm(os.listdir(f'../server/{label}')):
        with open(f'../server/{label}/{filename}', 'r', encoding='utf-8') as f:
            data = orjson.loads(f.read())

        data_profile = template_profile.copy()
        for index, datafield in enumerate(data['info']):
            if index < 15:
                # "Name, 23" — the name/age line
                if not data_profile['_Age'] or not data_profile['_Name']:
                    age_name = find_string(r'^\S*, \d{2}$', datafield)

                    if age_name:
                        age_name = age_name.split(', ')

                        data_profile['_Age'] = age_name[1]
                        data_profile['_Name'] = age_name[0]

                # long free text is treated as the profile description
                if len(datafield) > 50:
                    data_profile['_Description'] = datafield

                # "168 см" — the height line
                if not data_profile['Height']:
                    height = find_string(r'^\d{3} см$', datafield)

                    if height:
                        data_profile['Height'] = height.split(' ')[0]

            # interests are matched by exact string comparison
            for key, value in data_profile.items():
                if not value:
                    if key == datafield:
                        data_profile[key] = True
                        break

        data_profile['isLiked'] = label == 'liked'

        profiles.append(data_profile)

        for url_photo in data['photos']:
            with open(f'photos/{label}/{index_photo}.png', 'wb') as f:
                f.write(requests.get(url_photo).content)

            index_photo += 1

df = pandas.DataFrame.from_records(profiles)
df.to_csv('data_profiles.csv', index=None)

After processing, the result was a dataframe with 129 features and 1 target.

Next came image processing. The idea was simple: first detect the faces (I do not evaluate the whole photo). If there was no face, or more than one, the photograph was simply discarded. Fortunately, face detection is a solved problem, and a library called face_recognition handles it. I also had to enlarge the bounding box a little, since the library crops the hair very aggressively. At this stage I also measured the average size of the photographs, which I needed to configure the model.

import os

import PIL.Image
import face_recognition
import tqdm.notebook

for label in ['liked', 'disliked']:
    for filename in tqdm.notebook.tqdm(os.listdir(f'photos/{label}')):
        image = face_recognition.load_image_file(f'photos/{label}/{filename}')
        positions = face_recognition.face_locations(image)

        # keep only photos with exactly one face
        if not positions or len(positions) > 1:
            continue

        pos_t, pos_r, pos_b, pos_l = positions[0]

        # enlarge the box: the library crops the hair too aggressively
        _image = PIL.Image.open(f'photos/{label}/{filename}')
        _image = _image.crop((pos_l - 20, pos_t - 50, pos_r + 30, pos_b + 10))

        _image.save(f'dataset/{label}/{filename}')
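The average-size measurement itself is trivial; a sketch, assuming the (width, height) of every saved crop was collected into a list during the loop above (the numbers here are invented):

```python
from statistics import mean

# Hypothetical sizes of the saved face crops, collected as
# (width, height) tuples inside the cropping loop.
sizes = [(180, 240), (200, 260), (160, 220)]

avg_width = round(mean(w for w, _ in sizes))
avg_height = round(mean(h for _, h in sizes))
```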

Training neural networks

I will not dwell on this too long, although this part was very labor-intensive. The only conclusion I reached is that a neural network can learn your taste, but only weakly. My focus was on reducing the false-negative rate, and the result was poor (in my opinion). For some, an accuracy of around 60% may even be acceptable.
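To be concrete about the metric: a false negative here is a profile I actually liked that the model rejects. A toy sketch of the count I was trying to drive down:

```python
# Toy labels and predictions: 1 = liked, 0 = disliked.
y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 1]

# A false negative: a profile I liked that the model rejected.
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
```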

In general, I tried several augmentations, such as zoom and rotation (girls do love to twirl).

As a result, the model turned out like this:

model = Sequential([
  layers.Resizing(256, 256),
  data_augmentation,
  layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
  layers.Conv2D(32, 3, activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(128, 3, activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(256, 3, activation='relu'),
  layers.MaxPooling2D(),
  layers.Dropout(0.2),
  layers.Flatten(),
  layers.Dense(256, activation='relu'),
  layers.Dense(2)
])

The loss was BinaryCrossentropy, and during training I tracked the false-negative metric.

I do not recommend reusing this architecture: as far as I can tell, the structure of the model depends heavily on one's taste preferences. Besides, I really did not like the behavior of the loss function; it jumped around a lot over several dozen epochs, on both the training and validation samples. By the way, the model was not very large, about 13 million parameters. I also thought about adding a softmax after Dense(256), but it did not help.

For the text data I chose CatBoost; I had been meaning to try it for a long time. Since the feature matrix turned out to be essentially Boolean, the gaps were logically filled with False.

from catboost import CatBoostClassifier

df.fillna(False, inplace=True)

train_data = df.iloc[:, 0:128]
train_labels = df['isLiked']

model = CatBoostClassifier(
  iterations=10,
  depth=10,
  loss_function='Logloss',
  verbose=True
)

model.fit(train_data, train_labels)

Fit, predict, and that's fine. For the parameters I settled on 10 iterations with a tree depth of 10, the rest at defaults; this turned out to be enough. As an additional, empirical sanity check, you can simply look at the feature importances and either agree with the model or not (the results can be funny).
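CatBoost exposes the importances via model.get_feature_importance(), one value per feature column; pairing them with the column names and sorting gives the list to agree or disagree with. A sketch with made-up names and numbers:

```python
# Hypothetical importances, as returned by model.get_feature_importance():
# one value per feature column, in column order.
feature_names = ['_Age', 'Height', 'Travelling', 'Music']
importances = [12.5, 3.1, 40.2, 8.7]

# Rank the features from most to least important.
top = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
```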

I saved the models and moved them over to the server.

Server improvements

I will not linger here either. It is just one additional route that accepts data in the same format as save_data, runs it through the pipeline from the data-processing section, and feeds it to the models. The output is a pair of probabilities, from which a subjective coefficient is computed: whether the profile is liked or not. The threshold is adjustable. Likes and dislikes are placed automatically depending on the server's response; the click itself is a banal call to the click() method on a page element.
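The "subjective coefficient" can be as simple as a weighted blend of the two models' like-probabilities compared against the adjustable threshold. A sketch; the weight and threshold values here are placeholders of mine, not the ones actually used:

```python
def should_like(p_photo: float, p_text: float,
                w_photo: float = 0.6, threshold: float = 0.5) -> bool:
    """Blend the photo and text models' like-probabilities and compare
    the result with an adjustable threshold."""
    coefficient = w_photo * p_photo + (1 - w_photo) * p_text
    return coefficient >= threshold
```

The server's response to the extension then boils down to this single boolean, which the injected script turns into a click on the like or dislike button.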

A few more words on the ethics of such methods. I consider chatbots the most unethical way to communicate with people; as for automatic filtering of profiles, it is hard to object to it, since it simplifies the search and does not bother anyone else. Still, there is a nuance, and it is why I tried to optimize the false-negative rate: you might simply miss a pearl.

How ethical do you think it is to use such methods? Are you using them yourself?
