$%#@! — The ML Guide to Service Integration

Eric Hagman
Instawork Engineering
4 min read · Sep 7, 2022


[Image: man sitting at a laptop on fire. Caption: What could possibly go wrong? (Stable Diffusion)]

Have you ever implemented an ML model that took down production systems? Ever wondered about the risks of calling your ML model in a for-loop? This guide shares a few rules to follow so that even when your ML service fails, production stays intact.

[Comic: Software Development — xkcd]

Assume the worst, then 10x it

As alluded to in the comic above, our code isn’t always going to be used in the way we intend. Software development is messy. Your default assumption when integrating anything should be:

Everything you think can’t possibly happen, will happen.
https://en.wikiquote.org/wiki/Murphy's_law

ALWAYS ASSUME THE WORST. Assume you will never know how, when, or why your service might get called. The rules below are easy to follow and will get you part of the way there…

Use try/except (try/catch, etc…)

try/except clauses are your best friend. Any request made to an external ML service should be wrapped in a try/except. At first, this may feel “gross”, like you’re not writing “clean code”. But in reality, the responses from most ML services are not mission-critical and should be treated as optional (at least at our company). For example, at Instawork we use an ML service to decide whether a user should get a certain notification. The world won’t end if the ML service call fails and we send an extra notification. But it might end if our code crashes and we don’t send any notifications at all!

try:
    my_awesome_prediction = call_to_ml_service(request, ...)
except Exception:
    # A failed prediction is logged, never raised to the caller.
    logger.error("Failed to make prediction", extra={...})

Always have a sane default/fallback

This goes hand-in-hand with try/except. If there is a possibility of our service failing, make sure the service caller defaults to a sane fallback. For classification, generally, that means defaulting to the “negative” classification. Take the notification ML service mentioned above. If a request to the service fails, we assume the notification should be sent. The generic pattern looks like this:

my_awesome_prediction = NO_DEFECT  # sane default value
try:
    my_awesome_prediction = call_to_ml_service(request, ...)
except Exception:
    # On failure we keep the default and log the error.
    logger.error("Failed to make prediction", extra={...})

When in doubt, use a small timeout

Always choose a low timeout for ML service requests. It’s better for the request to time out than to add excessive latency to the caller. At Instawork, we have two pre-made ML service clients for engineers to use: low-timeout-client and high-timeout-client. The low-timeout client is used by default for any ML integration, and the high-timeout client is reserved for exceptional cases.
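
As a rough sketch of what that split can look like, assuming a plain requests-based HTTP client (the function name, URL parameter, and timeout values here are illustrative, not Instawork’s actual configuration):

import requests

# Illustrative values only; the real clients and timeouts may differ.
LOW_TIMEOUT_SECONDS = 0.5   # default for ML integrations
HIGH_TIMEOUT_SECONDS = 5.0  # reserved for exceptional cases

def low_timeout_predict(url, payload):
    # requests raises requests.exceptions.Timeout when the service doesn't
    # respond in time; the caller's try/except and fallback handle that.
    response = requests.post(url, json=payload, timeout=LOW_TIMEOUT_SECONDS)
    response.raise_for_status()
    return response.json()

The point is that the low-timeout path is the default; reaching for the high-timeout client should be a deliberate exception.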

Always run “shadow mode” asynchronously

When we first deploy a new ML service, we run it in “shadow mode”. That means we make calls to the service, but we only log the returned value and don’t use it in the business logic. When testing an ML model in shadow mode, always, always, ALWAYS run the predictions in the background. The simplest way to do this is with async code or a background job runner. In Python, that means using asyncio or Celery tasks. By running shadow mode asynchronously, we get useful information about the performance of the model without accidentally bringing down production.

from celery import shared_task

@shared_task
def task_make_my_predictions(user_id, shift_group_ids):
    # Shadow mode: make each prediction so it can be logged and analyzed;
    # the result is never fed back into business logic.
    for shift_group_id in shift_group_ids:
        make_prediction_to_ml_service(user_id, shift_group_id)

...
# Queue the predictions in the background, then continue the normal code path.
task_make_my_predictions.delay(user_id, shift_group_ids)
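
If you’re not running Celery, the same fire-and-forget pattern can be sketched with asyncio. This is a minimal sketch, assuming the caller is already inside an async view or running event loop, and that an async client like the hypothetical async_call_to_ml_service below exists:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def shadow_predict(user_id, shift_group_ids):
    # Shadow mode: log each prediction, never feed it into business logic.
    for shift_group_id in shift_group_ids:
        # async_call_to_ml_service is a hypothetical async client.
        prediction = await async_call_to_ml_service(user_id, shift_group_id)
        logger.info(
            "Shadow prediction",
            extra={"user_id": user_id, "shift_group_id": shift_group_id, "prediction": prediction},
        )

# Schedule the coroutine without awaiting it so the caller's latency is
# unaffected (this requires a running event loop).
asyncio.create_task(shadow_predict(user_id, shift_group_ids))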

Bulk requests are your friend

If you’re ever wondering, “should I call my ML service from a for-loop, or make all the predictions up-front?”, always prefer the latter:

  1. Client timeouts apply per-request. Let’s say we have a 5s timeout to our ML service. If we make 15 requests in a for-loop and each request takes 2 seconds, that’s 30 seconds total. Each individual request was fast enough, but the total latency can be unacceptable.
  2. It’s more efficient. Querying for all the data once on the ML service and responding once will always be faster than establishing many connections, querying a bunch of data, and responding back individually 15 times.
  3. It creates one point of failure instead of N points of failure, which makes it easier to track down where something went wrong and to confirm that it really is the problem.
# DON'T DO THIS: one request per user, N chances to fail or time out
predictions = [predict(user) for user in users]

# INSTEAD, DO THIS: one bulk request, a single point of failure
prediction_map = predict(users)
predictions = [prediction_map[user.id] for user in users]

Conclusions

The guidelines in this post all stem from the central belief “everything you think can’t possibly happen, will happen”. You may know this as Murphy’s Law. With this in mind, it’s best to expect that our ML service will fail or be too slow. The code calling the ML service should use short timeouts and try/except with sane defaults, and rely on async and batch processing when possible.

These rules are good advice for integrating any service, but we have found them especially relevant for ML services as we learn about the unique characteristics of running ML models in production.

Do you have any hard-learned lessons from integrating ML services? Let us know your tips in the comments!
