TensorFlow on Google Cloud: a Scalable Workflow

The field of Data Science is so vast and developing so rapidly that it is simply impossible to study "everything" in it. But that should not demotivate you; the only way forward is to keep growing and not let yourself be paralyzed by the fear of how little you know.

Below is a project that harnesses the power of modern cloud machine learning platforms for the classic task of telling cats from dogs. The project is written so that you can adapt it to your own needs.


Let me just say it: I am overwhelmed and somewhat intimidated by the ever-growing breadth and depth of machine learning today.

  • Need to build a high-performance data pipeline? Learn Apache Beam or Spark and Protocol Buffers.
  • Need to scale model training? Learn AllReduce and multi-node distributed architectures.
  • Need to deploy your models? Learn Kubernetes, TF Serving, quantization, and API management.
  • Need to track your pipelines? Set up a metadata database, learn Docker, and become a DevOps engineer.

And this doesn't even touch the algorithm and modeling space, which makes me feel like an impostor without a research background. There must be an easier way!

I've spent the past few weeks pondering this dilemma and what I would recommend to a data scientist with a mindset like mine. Many of the topics above are important to learn, especially if you want to focus on the emerging field of MLOps, but are there tools and technologies that let you "stand on the shoulders of giants"? Below I list four tools that abstract away most of the complexity and let you design, track, and scale machine learning workflows more efficiently.

  • TFRecorder (via Dataflow): Easily convert data into TFRecords from a CSV file. For images, just provide JPEG URIs and labels in the CSV. Scale out to distributed workers with Dataflow without writing any Apache Beam code.
  • TensorFlow Cloud (via AI Platform Training): Scale TensorFlow model training to single- and multi-node GPU clusters on AI Platform Training with a simple API call.
  • AI Platform Predictions: Deploy your model as an API endpoint on an autoscaling service (backed by Kubernetes) with GPUs, just as Waze does.
  • Weights & Biases: Log datasets and models to track versions and lineage across your development pipeline. A tree of relationships between your experiments and artifacts is generated automatically.

Workflow overview

I use a typical cats-versus-dogs computer vision task to walk through each of these tools. The workflow consists of the following steps:

  • Save the raw JPEG images to object storage, with each image in a subfolder named after its label.
  • Create a CSV file with image URIs and labels in the required format.
  • Convert the images and labels to TFRecords.
  • Create a dataset from the TFRecords and train a CNN model with Keras.
  • Save the model as a SavedModel and deploy it as an API endpoint.
  • Store the JPEG images, TFRecords, and SavedModels in object storage.
  • Track experiments and artifact lineage with Weights & Biases.

Jupyter notebooks and supporting scripts are in this GitHub repo. Now let's dive into each tool.

TFRecorder

TFRecords still confuse me. I understand the performance benefits they provide, but I have always found them hard to work with when starting on a new dataset. Apparently I'm not the only one, and luckily the TFRecorder project was recently released. Working with TFRecords has never been easier; it only requires (1) organizing your images in a logical directory structure and (2) working with pandas dataframes and CSV files. Here are the steps I took:

  • Create a three-column CSV file that includes an image URI pointing to the storage location of each image (an example layout is sketched right after this list).

  • Read the CSV into a pandas dataframe and call the TFRecorder function to convert the image files to TFRecords on Dataflow, specifying the output directory.
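
For reference, here is a minimal sketch of what that CSV might look like. The column names (split, image_uri, label) follow the schema TFRecorder expects, and the file names below are purely illustrative; check the TFRecorder README for the exact format it currently requires.

split,image_uri,label
TRAIN,gs://mchrestkha-demo-env-ml-examples/catsdogs/train/cats/cat.1.jpg,cat
TRAIN,gs://mchrestkha-demo-env-ml-examples/catsdogs/train/dogs/dog.1.jpg,dog
VALIDATION,gs://mchrestkha-demo-env-ml-examples/catsdogs/train/cats/cat.2.jpg,cat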

import pandas as pd
import tfrecorder  # registers the .tensorflow accessor on pandas dataframes

# Read the image CSV and convert it to TFRecords on Dataflow
dfgcs = pd.read_csv(FILENAME)
dfgcs.tensorflow.to_tfr(
    output_dir=TFRECORD_OUTPUT,
    runner="DataFlowRunner",
    project=PROJECT,
    region=REGION,
    tfrecorder_wheel=TFRECORDER_WHEEL)

That's it! Less than 10 lines of code, and it scales to convert millions of images to TFRecord format. As a data scientist, you've just laid the foundation for high-performance training. Take a look at the Dataflow job graph and metrics in the Dataflow console if you're curious about the magic happening in the background.

After looking at the code on GitHub a bit, I figured out the tfrecorder schema:

tfr_format = {
            "image": tf.io.FixedLenFeature([], tf.string),
            "image_channels": tf.io.FixedLenFeature([], tf.int64),
            "image_height": tf.io.FixedLenFeature([], tf.int64),
            "image_name": tf.io.FixedLenFeature([], tf.string),
            "image_width": tf.io.FixedLenFeature([], tf.int64),
            "label": tf.io.FixedLenFeature([], tf.int64),
            "split": tf.io.FixedLenFeature([], tf.string),
        }

You can then read the TFRecords into a TFRecordDataset for the Keras training pipeline with code like this:

IMAGE_SIZE = [150, 150]
BATCH_SIZE = 5

def read_tfrecord(example):
    # Parse a single serialized example using the tfrecorder schema above
    image_features = tf.io.parse_single_example(example, tfr_format)
    image_channels = image_features['image_channels']
    image_width = image_features['image_width']
    image_height = image_features['image_height']
    label = image_features['label']
    image_b64_bytes = image_features['image']

    # Decode the base64-encoded raw bytes back into an image tensor and resize it
    image_decoded = tf.io.decode_base64(image_b64_bytes)
    image_raw = tf.io.decode_raw(image_decoded, out_type=tf.uint8)
    image = tf.reshape(image_raw, tf.stack([image_height, image_width, image_channels]))
    image_resized = tf.cast(tf.image.resize(image, size=[*IMAGE_SIZE]), tf.uint8)
    return image_resized, label

def get_dataset(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames, compression_type="GZIP")
    dataset = dataset.map(read_tfrecord)
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE)
    return dataset

train_dataset = get_dataset(TRAINING_FILENAMES)
valid_dataset = get_dataset(VALID_FILENAMES)
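
TRAINING_FILENAMES and VALID_FILENAMES are not defined above; one way to build them is to list the TFRecord shards that the Dataflow job wrote to the output directory. The snippet below is a minimal sketch that assumes GZIP-compressed shards whose file names contain the split name; inspect TFRECORD_OUTPUT in your bucket to confirm the actual naming pattern.

# List the TFRecord shards produced by TFRecorder (file-name pattern is an assumption)
all_shards = tf.io.gfile.glob(TFRECORD_OUTPUT + "/*.gz")
TRAINING_FILENAMES = [f for f in all_shards if "train" in f.lower()]
VALID_FILENAMES = [f for f in all_shards if "validation" in f.lower()]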

TensorFlow Cloud (AI Platform Training)

Now that we have a tf.data.Dataset, we can pass it into the model training calls. Below is a simple CNN built with the Keras Sequential API.

from tensorflow.keras.optimizers import RMSprop

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

model.compile(loss="binary_crossentropy",
              optimizer=RMSprop(lr=1e-4),
              metrics=['accuracy'])

model.fit(
    train_dataset,
    epochs=10,
    validation_data=valid_dataset,
    verbose=2
)

I first ran this code on a subset of images in my IDE (in my case, a Jupyter notebook), but wanted to scale it to the full dataset and make it faster. TensorFlow Cloud lets me do that with a single API call that packages my code into a container and runs it as a distributed GPU job.

import tensorflow_cloud as tfc

tfc.run(entry_point="model_training.ipynb",
        chief_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X'],
        requirements_txt="requirements.txt")

This is not an April Fools' joke: these three lines of code are a complete Python script that you place in the same directory as your Jupyter notebook. The hardest part is the setup, making sure you are properly authenticated against your Google Cloud Platform project. Let's dig into what happens. First, a Docker container is built with all the required libraries and the notebook, and is stored in Google Container Registry.
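
For reference, that setup mostly boils down to standard Google Cloud project configuration. A rough sketch with generic gcloud commands (not TensorFlow Cloud-specific instructions; the key path is a placeholder):

gcloud auth login
gcloud config set project $PROJECT
# Point client libraries at a service-account key (placeholder path)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json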

This container is then submitted to the fully managed, serverless AI Platform Training service. Without having to set up any infrastructure or install GPU libraries, I was able to train this model on a machine with 16 vCPUs, 60 GB of RAM, and 4 NVIDIA T4 GPUs. I only used those resources while I needed them (about 15 minutes) and then went back to developing in my local environment with my IDE or Jupyter notebook.

At the very end of the training script, the SavedModel is written to object storage:

import time

# Save the trained model to a timestamped path in Cloud Storage
MODEL_PATH = time.strftime("gs://mchrestkha-demo-env-ml-examples/catsdogs/models/model_%Y%m%d_%H%M%S")
model.save(MODEL_PATH)
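
With MODEL_PATH pointing at the exported SavedModel, you can also load it back locally for a quick smoke test before deploying. A minimal sketch (the random input only verifies the expected 150x150x3 shape; it is not a meaningful prediction):

import numpy as np
import tensorflow as tf

# Load the SavedModel directly from Cloud Storage and run one dummy prediction
reloaded = tf.keras.models.load_model(MODEL_PATH)
dummy_batch = np.random.rand(1, 150, 150, 3).astype("float32")
print(reloaded.predict(dummy_batch))  # sigmoid output: probability of the positive class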

AI Platform Predictions

With my SavedModel in object storage, I can load it into my development environment and make some rough predictions. But what if I want to let others use the model without having to set up a Python environment and learn TensorFlow? This is where AI Platform Predictions comes in. It lets you deploy model binaries as API endpoints that can be called via REST, the Google Cloud SDK (gcloud), or other client libraries. Users only need to know the required input (in our case, a JPEG image converted to a [150,150,3] JSON array) and can then embed the model into their own workflows. When you make changes (retrain on a new dataset, change the model architecture, maybe even switch frameworks), you simply publish a new version. The simple gcloud script below deploys the model to this Kubernetes-backed autoscaling service.

MODEL_VERSION="v1"
MODEL_NAME="cats_dogs_classifier"
REGION="us-central1"

gcloud ai-platform models create $MODEL_NAME \
    --regions $REGION

gcloud ai-platform versions create $MODEL_VERSION \
  --model $MODEL_NAME \
  --runtime-version 2.2 \
  --python-version 3.7 \
  --framework tensorflow \
  --origin $MODEL_PATH
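
Once the version is created, you can exercise the endpoint with the same SDK. This is a hedged sketch: instances.json is a hypothetical file with one [150,150,3] JSON array per line, and you may want to inspect the serving signature first to confirm the expected input name.

# Optional: inspect the serving signature of the exported model
saved_model_cli show --dir $MODEL_PATH --tag_set serve --signature_def serving_default

# Send a prediction request; instances.json holds one JSON instance per line
gcloud ai-platform predict \
  --model $MODEL_NAME \
  --version $MODEL_VERSION \
  --json-instances instances.json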

AI Platform Predictions is the service I'm especially excited about, because it removes much of the complexity of serving your model (internally and externally) so that you can start getting value from it. And although the purpose of this post is to show an experimental workflow, companies like Waze are using AI Platform Predictions to deploy and serve their models at industrial scale.

Weights & Biases

I have completed one experiment. But what about future experiments? I may need to:

  • Run many experiments this week and keep track of the work.
  • Come back a month later and try to remember all the inputs and outputs of each experiment.
  • Share the work with colleagues, who will hopefully be able to piece together the details of the workflow.

A lot of work is happening in the ML pipelines space. It is an exciting but nascent area, with best practices and industry standards still to be developed. Some great projects include MLflow, Kubeflow Pipelines, TFX, and Comet.ML. For my workflow, MLOps and continuous delivery were out of scope, and I wanted something simple. I chose Weights & Biases (WandB) because of its ease of use and lightweight integration for tracking experiments and artifacts.

Let's start with experiments. WandB offers many customization options, but if you're using any of the popular frameworks there isn't much you have to do. With the TensorFlow Keras API, I just (1) imported the wandb Python library, (2) initialized an experiment run, and (3) added a callback to the model training step.

import wandb
from wandb.keras import WandbCallback

# (1) import wandb, (2) initialize an experiment run, (3) add the callback to training
wandb.init(project="cats-dogs-keras")

model.fit(
    train_dataset,
    epochs=10,
    validation_data=valid_dataset,
    verbose=2,
    callbacks=[WandbCallback()]
)

The callback feeds the training metrics into a centralized experiment tracking service. Take a look at how people typically use WandB here.

WandB also provides an Artifacts API, which fits my needs far better than some of the heavyweight tools out there today. I added short snippets of code throughout the pipeline to capture four key things:

  1. Initializing a step in my pipeline
  2. Using an existing artifact (when available) within that step
  3. Logging the artifact produced by that step
  4. Marking the end of the step

import wandb

# Step 1: stage the raw JPEGs and log them as a dataset artifact
run = wandb.init(project="cats-dogs-keras", job_type="data", name="01_set_up_data")
<Code to set up initial JPEGs in the appropriate directory structure>
artifact = wandb.Artifact(name="training_images", type="dataset")
artifact.add_reference('gs://mchrestkha-demo-env-ml-examples/catsdogs/train/')
run.log_artifact(artifact)
run.finish()

# Step 2: consume the image artifact and log the generated TFRecords
run = wandb.init(project="cats-dogs-keras", job_type="data", name="02_generate_tfrecords")
artifact = run.use_artifact('training_images:latest')
<TFRecorder Code>
artifact = wandb.Artifact(name="tfrecords", type="dataset")
artifact.add_reference('gs://mchrestkha-demo-env-ml-examples/catsdogs/tfrecords/')
run.log_artifact(artifact)
run.finish()

# Step 3: consume the TFRecords artifact and log the trained model
run = wandb.init(project="cats-dogs-keras", job_type="train", name="03_train")
artifact = run.use_artifact('tfrecords:latest')
<TensorFlow Cloud Code>
artifact = wandb.Artifact(name="model", type="model")
artifact.add_reference(MODEL_PATH)
run.log_artifact(artifact)
run.finish()

This simple Artifacts API stores the metadata and lineage of every run and artifact, so you have full clarity into your workflow. The UI also includes a nice tree diagram for browsing run history and artifacts.

Conclusion

If the endless machine learning topics, tools, and technologies scare you, you are not alone. I feel imposter syndrome every day while talking with colleagues, partners, and customers about various aspects of data science, MLOps, hardware accelerators, data pipelines, AutoML, and so on. Just remember a few things:

  • Be practical: it is unrealistic for all of us to be full-stack DevOps, data, and machine learning engineers. Pick one or two areas that interest you and work with others to solve system-wide problems.
  • Focus on your problem: we all get excited about a new framework, a new research paper, a new tool. Start with your business problem, your dataset, and your end-user requirements. Not everyone needs 100 ML models in production, retrained daily and serving millions of users (at least not yet).
  • Settle on your tools: find a core set of tools that truly multiplies efficiency, provides scalability, and removes complexity. I walked through my TensorFlow toolbox (TFRecorder + TensorFlow Cloud + AI Platform Predictions + Weights & Biases), but find the toolkit that fits your problem and your workflow.

The Jupyter notebook examples from this post can be found on my GitHub.
