Start without machine learning

Applying machine learning effectively is challenging. You need data. You need a reliable pipeline to support your data flows. And most of all, you need high-quality labels. As a result, more often than not, the first iteration of my projects does not use machine learning at all.

What? Start off without machine learning?

I’m not the only one talking about this.

Guess what the first rule is in Google’s 43 Rules of Machine Learning?

Rule # 1: Don’t be afraid to launch a product without machine learning.

Machine learning is great, but it requires data. In theory, you can take data from another problem and tweak the model for a new product, but it will most likely underperform basic heuristics. If you expect machine learning to give you a 100% boost, a heuristic will get you 50% of the way there.

Many of the machine learning practitioners I interviewed for the Applying ML project gave similar answers to this question:

“Imagine you are given a new, unfamiliar problem to solve with machine learning. How would you go about solving it?”

“To begin with, I make a serious effort to see whether it can be solved without machine learning. I always try the less glamorous, simple things first before moving on to more complex solutions.” – Vicki Boykis, ML engineer at Tumblr

“I think it’s important to do the task without ML first. Solve the problem manually or with heuristics. This forces you to become deeply familiar with the problem and the data, which is the most important first step. Moreover, a non-ML baseline is worth having so that you can compare metrics honestly.” – Hamel Husain, ML Lead Engineer at GitHub

“First, try to solve the problem without machine learning. Everyone gives this advice because it is good. You can write a few rules or if/else heuristics to make simple decisions and act on them.” – Adam Laiacano, ML Platform Lead Engineer at Spotify

So where do you start?

Whether you are using simple rules or deep learning, a solid understanding of the data always helps. So take a sample of the data, compute some statistics, and visualize them! (Note: This mostly applies to tabular data. Other data, such as images, text, or audio, can be harder to summarize with statistics.)

Simple correlations help identify the relationship between each feature and the target variable. You can then select the strongest subset of features to visualize. This not only helps with understanding the data and the task (and therefore with applying machine learning more effectively), but also gives you a better grasp of the business domain. Note, however, that correlations and summary statistics can be misleading: sometimes variables with a strong relationship have zero correlation (more on this below).

Scatter plots are a favorite tool for visualizing numerical values. Plot the feature on the X-axis and the target variable on the Y-axis, and the relationship will manifest itself. In the example below, temperature has zero correlation with ice cream sales. However, we can see a clear connection in the scatter plot – as the temperature rises, people buy more ice cream, but above a certain temperature it gets too hot and they simply do not leave the house.


Scatter plot of the non-monotonic relationship between temperature and ice cream sales (source)
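The zero-correlation point can be checked numerically. Below is a minimal sketch that assumes made-up, perfectly symmetric data (sales peak at 20 degrees and fall off on both sides); the numbers and the `pearson` helper are illustrative and not taken from the plot above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic data (an assumption for illustration): sales rise until 20 degrees,
# then fall symmetrically as it gets too hot to go outside.
temps = list(range(0, 41))
sales = [400 - (t - 20) ** 2 for t in temps]

r_all = pearson(temps, sales)             # whole range: rise and fall cancel out
r_cool = pearson(temps[:21], sales[:21])  # 0..20 degrees: strongly positive
r_hot = pearson(temps[20:], sales[20:])   # 20..40 degrees: strongly negative

print(f"all: {r_all:.2f}, cool half: {r_cool:.2f}, hot half: {r_hot:.2f}")
```

Over the full range the coefficient is essentially zero, yet each half shows a strong linear trend. That is why a scatter plot is worth drawing before you discard a feature for "no correlation".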

I have found that when one of the variables is categorical, box-and-whisker plots work well. Let’s say you’re trying to predict a dog’s lifespan: how much does size matter?


Box-and-whisker plot of lifespan for dog breeds of different sizes (source)
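The numbers behind a box-and-whisker plot are easy to compute before drawing anything. The sketch below groups a numeric target by a categorical feature and returns the five-number summary per group. The rows are made up purely for illustration (they are not real lifespan data), and `five_number_summary` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up (size, lifespan in years) rows, for illustration only.
ROWS = [
    ("small", 12.0), ("small", 13.0), ("small", 14.0), ("small", 15.0), ("small", 16.0),
    ("medium", 10.0), ("medium", 11.0), ("medium", 12.0), ("medium", 13.0), ("medium", 14.0),
    ("large", 7.0), ("large", 8.0), ("large", 9.0), ("large", 10.0), ("large", 11.0),
]


def five_number_summary(rows):
    """Group values by category and return min, quartiles, and max per group."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)

    summary = {}
    for category, values in groups.items():
        # n=4 yields the three quartile cut points; each group needs >= 2 values.
        q1, med, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = {
            "min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values),
        }
    return summary


summary = five_number_summary(ROWS)
for category, stats in summary.items():
    print(category, stats)
```

If the medians differ clearly and the boxes barely overlap, the categorical feature alone may already make a usable rule.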

With an understanding of the data, we can start solving the problem with heuristics. Here are some examples of heuristics for common problems (you’ll be surprised how hard they are to beat):

  • Recommendations: recommend the best-performing items from the previous period; these can also be segmented by category (e.g., genre or brand). If you have user behavior data, you can compute aggregate statistics on co-interactions to derive item-to-item (i2i) similarity for recommendations (see the Swing algorithm by Alibaba).
  • Product classification: regex-based rules on product names. Here is an example from Walmart’s product classifier (Section 4.5): if the product name contains “ring”, “wedding band”, “diamond.*bridal”, etc., then classify it as rings.
  • Spam detection in reviews: rules based on the number of reviews from one IP address, the time the review was posted (for example, suspiciously at 3 a.m.), and the similarity (for example, edit distance) between the current review and other reviews posted on the same day.
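To make the product classification heuristic concrete, here is a minimal sketch of a regex rule table. The patterns, the second category, and the `classify` function are assumptions for illustration; this is not the Walmart implementation.

```python
import re

# Hypothetical rule table: (compiled pattern, category).
# The first matching rule wins, so order the rules from most to least specific.
RULES = [
    (re.compile(r"\brings?\b|\bwedding band\b|\bdiamond.*bridal\b", re.IGNORECASE), "rings"),
    (re.compile(r"\blaptops?\b|\bnotebook computer\b", re.IGNORECASE), "laptops"),
]


def classify(product_name, default="unknown"):
    """Return the category of the first rule whose pattern matches the name."""
    for pattern, category in RULES:
        if pattern.search(product_name):
            return category
    return default


for name in ["14K Gold Wedding Band", "Diamond Halo Bridal Set", "Spring Water 24-pack"]:
    print(name, "->", classify(name))
```

The word boundaries (`\b`) matter: without them, "Spring Water" would match "ring". Edge cases like this are where most of the maintenance effort for rule-based systems goes.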

Many people also replied to this tweet, suggesting regular expressions as a non-ML baseline, the interquartile range for identifying outliers, a moving average for forecasting, dictionaries for address matching, and so on.
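Two of these suggestions fit in a few lines each. The sketch below shows one possible version of interquartile-range outlier detection and a moving-average forecast. The function names, the conventional 1.5 multiplier, and the sample order counts are assumptions chosen for illustration.

```python
from statistics import mean, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("history is shorter than the window")
    return mean(history[-window:])


# Made-up daily order counts, with one obvious spike.
daily_orders = [10, 12, 11, 13, 12, 11, 95]

print(iqr_outliers(daily_orders))                   # flags the spike
print(moving_average_forecast(daily_orders[:-1]))   # forecast from the normal days
```

Both functions also make good baselines: if a trained model cannot beat the moving average on held-out data, it is not yet worth deploying.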

Do these heuristics really work? Yes! It often amazes me how effective they are, given the minimal effort required to implement them. Here’s an example of how one simple exclusion list was enough to stop scammers.

“We had a similar experience many years ago. Once blocked, the scammers quickly created new websites and slipped past the models, but they kept using the same images with the same filenames. Gotcha!” – Jack Hanlon (@JHanlon)

Here’s another example where regular expressions are better than deep learning.

“I’ve been criticized so much for that. One project I was working on was doing a string comparison, but the client was frustrated that I wasn’t using neural networks and hired someone else to do it. Guess which method was more accurate and cheaper?” – Mitch Hale (@bwahacker)

Yes, you could argue that the people who trained those machine learning models did not know what they were doing. Perhaps. The point, however, is that understanding the data and applying simple heuristics can easily outperform model.fit(), and it will take you half the time.

These heuristics also help with weak supervision. If you are starting from scratch and have no labels, weak supervision lets you obtain many labels quickly and efficiently, albeit of lower quality. The heuristics can be formalized as labeling functions that generate labels. Other sources of weak supervision include knowledge bases and pre-trained models. You can read more about weak supervision at Google and Apple.
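As a rough sketch of what "heuristics as labeling functions" can look like, the example below turns three review-spam rules into functions that either vote or abstain, then combines them by simple majority. The field names, thresholds, and the majority-vote combiner are all assumptions for illustration. Dedicated weak supervision tools usually learn how much to trust each function instead of counting votes.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1


def lf_many_reviews_from_ip(review):
    # Hypothetical threshold: more than 5 reviews from one IP looks suspicious.
    return SPAM if review["reviews_from_ip"] > 5 else ABSTAIN


def lf_posted_at_odd_hour(review):
    # Hypothetical rule: reviews posted between 2 and 4 a.m. look suspicious.
    return SPAM if 2 <= review["hour"] <= 4 else ABSTAIN


def lf_long_detailed_text(review):
    # Hypothetical rule: long, detailed reviews are more likely genuine.
    return HAM if len(review["text"].split()) >= 30 else ABSTAIN


LABELING_FUNCTIONS = [lf_many_reviews_from_ip, lf_posted_at_odd_hour, lf_long_detailed_text]


def weak_label(review):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = [lf(review) for lf in LABELING_FUNCTIONS]
    spam_votes = votes.count(SPAM)
    ham_votes = votes.count(HAM)
    if spam_votes == ham_votes:
        return ABSTAIN
    return SPAM if spam_votes > ham_votes else HAM


example = {"reviews_from_ip": 9, "hour": 3, "text": "great product buy now"}
print(weak_label(example))
```

Reviews where every function abstains simply stay unlabeled. The noisy labels on the rest can then be used to train a first model, or to decide which examples to send for manual labeling.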

When should you use machine learning?

Once you have a good enough non-ML baseline, and the effort of maintaining and improving that baseline outweighs the effort of building and deploying an ML-based system. It’s hard to pinpoint exactly when this happens, but when it becomes impossible to change your 195 handcrafted rules without breaking something, it’s worth considering machine learning. Here is Rule # 3 of Google’s Rules of Machine Learning.

Rule # 3: Choose machine learning over complex heuristics.

A simple heuristic can enable a product to be released. Complex heuristics cannot be maintained. Once you have accumulated data and a basic understanding of what you are trying to achieve, move on to machine learning … and you will find that the machine learning model is easier to update and maintain.

Having robust data pipelines and high-quality data labels is also a sign that you are ready for machine learning.

But before that happens, the data you have may not be good enough for machine learning. Let’s say you need to reduce fraud on your platform, but you don’t even know what fraudulent behavior looks like, let alone have labels for it.

Or you may have data, but in such poor condition that it cannot be used. For example, you may have product category data that could be used to train a product classifier. However, sellers deliberately miscategorize products to game the system (for example, to pay a lower commission in certain categories or to rank more easily in categories with few products).

Often, manual labeling is required to build a gold-standard, high-quality labeled dataset. Once you have one, training and validating machine learning models becomes much easier.

But what if I need to use ML for the sake of ML itself?

Hmm, then you are in a difficult position. How do you balance between solving a problem (meeting customer requirements) and spending a lot of time and effort just for the sake of ML itself? I have no answer to this question, but I will quote Brandon Rohrer’s ingenious advice.

“ML Strategy Advice

When you have a task, create two solutions: a deep learning Bayesian transformer running on multicloud Kubernetes, and a SQL query built on a stack of extremely simplistic assumptions. Put one on your resume and use the other in production. Everyone will be happy.” – Brandon Rohrer (@_brohrer_)
