CameraX + ML Kit for card number recognition in action

Hi, my name is Vitaly Belyaev, I am an Android developer at red_mad_robot. In this article I will share our experience of integrating CameraX with ML Kit to replace the card.io library, and how it turned out in the end.

The application I’m working on has a screen for adding a bank card. There you can either fill in all the details by hand or tap “Scan” and have the card number recognized with the phone camera. For this we use the card.io library.

Why did we decide to replace card.io?

  • we wanted to replace a third-party library that has already been archived with something more up to date from a large company;

  • card.io uses a separate-activity approach, while we try to stick to a single-activity approach;

  • few options for customizing the UI in card.io;

  • it was interesting to try CameraX and ML Kit;

  • card.io pulls in a lot of native libraries. If you are not using App Bundles, cutting out card.io will reduce your APK size by 12 MB.

The size comparison was carried out on a sample project.

What is ML Kit?

Let me clarify right away what ML Kit is. In essence, it is a library that provides an API for applying ML to various tasks, such as image labeling, barcode scanning, recognition of text, faces and objects, text translation, text to speech, and so on.

All of this is done using trained models and can happen either locally (on-device) or remotely on a server (on-cloud).

Both Google and Huawei have their own ML Kit, and they are very similar. Google ML Kit depends on GMS, and Huawei ML Kit, accordingly, depends on HMS.

For the task of recognizing a bank card number, the part of ML Kit we need is the one related to text recognition. In both ML Kits it is called Text Recognition, and in both it can work locally (on-device).

By using on-device Text Recognition we get higher speed, independence from Internet availability and no usage fees, compared to the on-cloud solution.

As input, Text Recognition takes an image, which it processes, and then it outputs the recognized text as the result. To feed Text Recognition, we need to get these images (frames) from the device’s camera.

We get frames from the camera for analysis

For this task we need to work with the camera API in order to show the preview and send frames from it to ML Kit for analysis.

Google made CameraX, a library for working with the camera. It is part of Jetpack, encapsulates working with the Camera1 and Camera2 APIs, and provides a convenient lifecycle-aware interface for working with the camera.

In CameraX there are so-called use cases, and there are only three of them:

  • ImageAnalysis

  • Preview

  • ImageCapture

From the names it is easy to guess what each one is for. We are interested in Preview and ImageAnalysis.

Setting them up:

val preview = Preview.Builder()
   .setTargetRotation(Surface.ROTATION_0)
   .setTargetAspectRatio(screenAspectRatio)
   .build()
   .also { it.setSurfaceProvider(binding.cameraPreview.surfaceProvider) }

val imageAnalyzer = ImageAnalysis.Builder()
   .setTargetRotation(Surface.ROTATION_0)
   .setTargetAspectRatio(screenAspectRatio)
   .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
   .build()
   .also { it.setAnalyzer(cameraExecutor, framesAnalyzer) }
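
The snippet above uses a cameraExecutor and a screenAspectRatio that are defined elsewhere. A minimal sketch of what they might look like, following the CameraX codelab (the names and exact logic here are my assumptions, not necessarily the project’s code):

private val cameraExecutor: ExecutorService = Executors.newSingleThreadExecutor()

// Pick the standard ratio (4:3 or 16:9) closest to the screen dimensions,
// e.g. screenAspectRatio = aspectRatio(metrics.widthPixels, metrics.heightPixels)
private fun aspectRatio(width: Int, height: Int): Int {
    val previewRatio = max(width, height).toDouble() / min(width, height)
    return if (abs(previewRatio - 4.0 / 3.0) <= abs(previewRatio - 16.0 / 9.0)) {
        AspectRatio.RATIO_4_3
    } else {
        AspectRatio.RATIO_16_9
    }
}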

We will not go into the details of each line here; you can read about them in the good CameraX documentation or go through the codelab. For now we are just configuring the use cases, and it is worth noting that this looks quite convenient and compact.

Then we bind all of this to the lifecycle and launch it.

try {
   cameraProvider.unbindAll()
   camera = cameraProvider.bindToLifecycle(
       viewLifecycleOwner,
       CameraSelector.DEFAULT_BACK_CAMERA,
       useCaseGroup
   )
   setupCameraMenuIcons()
} catch (t: Throwable) {
   Timber.e(t, "Camera use cases binding failed")
}

Here we use the so-called cameraProvider, which is part of the CameraX API. Then we call bindToLifecycle once, and that’s it. When the application goes to the background, CameraX handles the situation itself and releases the camera, and when the application returns to the foreground, it relaunches our use cases. And this is very cool: anyone who has dealt with the Camera1/Camera2 API at least once will understand me.
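
For completeness, here is a minimal sketch of where cameraProvider and useCaseGroup might come from, assuming a Fragment and the use cases configured above (ProcessCameraProvider and UseCaseGroup are part of CameraX):

val cameraProviderFuture = ProcessCameraProvider.getInstance(requireContext())
cameraProviderFuture.addListener({
    val cameraProvider = cameraProviderFuture.get()

    // Group the use cases so they can be bound to the lifecycle in a single call
    val useCaseGroup = UseCaseGroup.Builder()
        .addUseCase(preview)
        .addUseCase(imageAnalyzer)
        .build()

    // ...then bindToLifecycle as shown above
}, ContextCompat.getMainExecutor(requireContext()))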

When creating the ImageAnalysis use case we passed it framesAnalyzer. This is also an entity from CameraX: in fact, it is just the SAM interface ImageAnalysis.Analyzer with a single analyze() method, in which we receive an image in the form of an ImageProxy.

private val framesAnalyzer: ImageAnalysis.Analyzer by lazy {
   ImageAnalysis.Analyzer(viewModel::onFrameReceived)
}

Now we have an image that can be passed to ML Kit for recognition.
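
A hypothetical onFrameReceived handler in the ViewModel might look like the sketch below: it pulls the media Image and its rotation out of the ImageProxy, hands them to a recognizer (its processFrame method is shown in the next section), and closes the proxy so that CameraX can deliver the next frame. The recognizer name is an assumption:

@androidx.camera.core.ExperimentalGetImage
fun onFrameReceived(imageProxy: ImageProxy) {
    // ImageProxy.image is experimental API, hence the opt-in annotation
    val mediaImage = imageProxy.image
    if (mediaImage == null) {
        imageProxy.close()
        return
    }
    val rotationDegrees = imageProxy.imageInfo.rotationDegrees

    recognizer.processFrame(mediaImage, rotationDegrees)
        .addOnCompleteListener {
            // Closing the proxy tells CameraX it may deliver the next frame
            imageProxy.close()
        }
}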

GMS ML Kit

Google used to have a library called ML Kit for Firebase, where all the ML-related things were collected: the ones that work on-device (barcode scanning, for example) and the ones that work on-cloud (Image Labeling, for example).

Then they moved all the parts that can run on-device into a separate artifact and named it simply ML Kit.

All the parts that use on-cloud processing were placed in the Firebase ML library.

It is this new ML Kit, which works on-device and is completely free, that we will use to recognize the card number.

The part responsible for text recognition in ML Kit is called Text Recognition, and it is added like this:

implementation 'com.google.android.gms:play-services-mlkit-text-recognition:16.1.3'

You need to add the following to the manifest inside the application tag:

<meta-data
   android:name="com.google.mlkit.vision.DEPENDENCIES"
   android:value="ocr" />

This is needed so that the ML Kit models are downloaded when your application is installed. If this is not done, they will be downloaded the first time recognition is used.

Then everything is quite simple: we follow the documentation and get the recognition result:

fun processFrame(frame: Image, rotationDegrees: Int): Task<List<RecognizedLine>> {
   val inputImage = InputImage.fromMediaImage(frame, rotationDegrees)

   return analyzer
       .process(inputImage)
        .continueWith { task ->
           task.result
               .textBlocks
               .flatMap { block -> block.lines }
               .map { line -> line.toRecognizedLine() }
       }
}
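
The analyzer here is the on-device text recognition client. With the dependency version above it can be obtained roughly like this (a sketch; newer versions of the library take a TextRecognizerOptions argument):

// com.google.mlkit.vision.text.TextRecognition / TextRecognizer
private val analyzer: TextRecognizer by lazy {
    TextRecognition.getClient()
}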

The library returns a fairly detailed result as a Text object that contains a list of TextBlock. Each TextBlock, in turn, contains a list of Line, and each Line contains a list of Element.

For our testing purposes it is enough, for now, to work with a list of strings, so we use RecognizedLine, which is simple:

data class RecognizedLine(val text: String)

We need a separate class in order to have a common entity that can be returned both from GMS and from HMS ML Kit.
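
The toRecognizedLine() calls in the snippets are simple mapping extensions. A sketch of what they might look like, assuming the GMS Text.Line.text and HMS MLText.TextLine.stringValue accessors (each extension lives in its own product flavor):

// GMS flavor: com.google.mlkit.vision.text.Text
fun Text.Line.toRecognizedLine(): RecognizedLine = RecognizedLine(text)

// HMS flavor: com.huawei.hms.mlsdk.text.MLText
fun MLText.TextLine.toRecognizedLine(): RecognizedLine = RecognizedLine(stringValue)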

HMS ML Kit

Since our application is also distributed in the Huawei AppGallery, we need to use the ML Kit from Huawei.

In general, HMS components have a GMS-like interface, and ML Kit is no exception in this regard.

However, Huawei did not split its ML libraries into on-device and on-cloud, so with this SDK you can run both on-device and on-cloud recognition.

We add the HMS ML Kit Text Recognition SDK according to the documentation:

implementation 'com.huawei.hms:ml-computer-vision-ocr:2.0.5.300'
implementation 'com.huawei.hms:ml-computer-vision-ocr-latin-model:2.0.5.300'

And, similarly to GMS ML Kit, we add to the manifest:

<meta-data
   android:name="com.huawei.hms.ml.DEPENDENCY"
   android:value="ocr" />

Guided by the documentation, we process the frame from the camera and get the result:

fun processFrame(frame: Image, rotationDegrees: Int): Task<List<RecognizedLine>> {
    val mlFrame = MLFrame.fromMediaImage(frame, getHmsQuadrant(rotationDegrees))

    return localAnalyzer
        .asyncAnalyseFrame(mlFrame)
        .continueWith { task ->
            task.result
                .blocks
                .flatMap { block -> block.contents }
                .map { line -> line.toRecognizedLine() }
        }
}
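
The localAnalyzer and getHmsQuadrant used above are not shown. A possible sketch, assuming the standard MLAnalyzerFactory API and the MLFrame quadrant constants (the rotation mapping should be double-checked against the HMS documentation):

// com.huawei.hms.mlsdk.MLAnalyzerFactory, com.huawei.hms.mlsdk.text.MLLocalTextSetting
private val localAnalyzer: MLTextAnalyzer by lazy {
    val setting = MLLocalTextSetting.Factory()
        .setOCRMode(MLLocalTextSetting.OCR_DETECT_MODE)
        .setLanguage("en")
        .create()
    MLAnalyzerFactory.getInstance().getLocalTextAnalyzer(setting)
}

// Map CameraX rotation degrees to the MLFrame quadrant constants
private fun getHmsQuadrant(rotationDegrees: Int): Int = when (rotationDegrees) {
    90 -> MLFrame.SCREEN_SECOND_QUADRANT
    180 -> MLFrame.SCREEN_THIRD_QUADRANT
    270 -> MLFrame.SCREEN_FOURTH_QUADRANT
    else -> MLFrame.SCREEN_FIRST_QUADRANT
}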

Recognition test results

The results surprised me: it turned out that recognition does not work as well and as stably as I expected.

In daylight I managed to get the 16-digit number of my VISA card recognized, but it took about a minute of twisting the card and moving it closer and farther away. Even then, one of the digits was wrong.

In artificial lighting, as well as in dim lighting with the flash on, I could not get anything even remotely resembling a card number.

At the same time, card.io, even in a very dark room with the flash on, recognizes the card number in 1-2 seconds on average.

Trying to use on-cloud recognition

Since on-device recognition produces unacceptable results, the idea came up to try on-cloud recognition.

It should be understood right away that this is paid, both in the case of GMS and in the case of HMS.

As I wrote earlier, Google split its libraries into on-device and on-cloud. Therefore, instead of ML Kit we need to use Firebase ML. But it is not that simple, since you can only use it if your Firebase project is on the Blaze plan.

So I decided it would be easier to test on-cloud recognition with HMS ML Kit. For this we need a project in AppGallery Connect.

We need to add the agconnect plugin:

classpath 'com.huawei.agconnect:agcp:1.4.1.300'

You also need to download agconnect-services.json and put it in the app folder of your project.
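
For completeness, the plugin also has to be applied in the app module’s build.gradle, as in the standard AGConnect setup:

apply plugin: 'com.huawei.agconnect'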

The Text Recognition SDK in this case is the same, but we need to use a different Analyzer, to which we pass the apiKey of our project from AppGallery Connect.

We create an MLTextAnalyzer according to the documentation:

private val remoteAnalyzer: MLTextAnalyzer by lazy {
    MLApplication.getInstance().apiKey = "Your apiToken here"

    val settings = MLRemoteTextSetting.Factory()
        .setTextDensityScene(MLRemoteTextSetting.OCR_COMPACT_SCENE)
        .create()

    MLAnalyzerFactory.getInstance().getRemoteTextAnalyzer(settings)
}

The frame processing is then very similar to on-device:

fun processFrame(bitmap: Bitmap, rotationDegrees: Int): Task<List<RecognizedLine>> {
    val mlFrame = MLFrame.fromBitmap(bitmap)

    return remoteAnalyzer
        .asyncAnalyseFrame(mlFrame)
        .continueWith { task ->
            task.result
                .blocks
                .flatMap { block -> block.contents }
                .map { line -> line.toRecognizedLine() }
        }
}

Note that here we use a Bitmap rather than an Image to create the MLFrame, although in the on-device case we saw that an MLFrame can be created from an Image. We do this because MLTextAnalyzer throws an NPE saying that the internal Bitmap is null if you pass it an MLFrame created from an Image. If you create it from a Bitmap, everything works.

Because on-cloud Text Recognition is paid (albeit with a free quota), I decided it would be safer to take a photo, that is, to use the ImageCapture use case instead of ImageAnalysis for on-cloud recognition.

imageCapture = ImageCapture.Builder()
    .setTargetRotation(Surface.ROTATION_0)
    .setTargetAspectRatio(screenAspectRatio)
    .setCaptureMode(ImageCapture.CAPTURE_MODE_MAXIMIZE_QUALITY)
    .build()
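
A sketch of how a captured photo might then be turned into a Bitmap for the remote analyzer (the callback wiring and the recognizer name are my assumptions):

imageCapture.takePicture(
    ContextCompat.getMainExecutor(requireContext()),
    object : ImageCapture.OnImageCapturedCallback() {
        override fun onCaptureSuccess(imageProxy: ImageProxy) {
            // ImageCapture delivers a single-plane JPEG; decode it into a Bitmap
            val buffer = imageProxy.planes[0].buffer
            val bytes = ByteArray(buffer.remaining()).also { buffer.get(it) }
            val bitmap = BitmapFactory.decodeByteArray(bytes, 0, bytes.size)
            val rotationDegrees = imageProxy.imageInfo.rotationDegrees
            imageProxy.close()

            recognizer.processFrame(bitmap, rotationDegrees)
        }

        override fun onError(exception: ImageCaptureException) {
            Timber.e(exception, "Photo capture failed")
        }
    }
)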

The recognition results in this case were also unsatisfactory: out of three excellent-quality photos taken in natural daylight (I saved them to the application’s storage and checked them after shooting), not a single card number was recognized correctly.

At the same time, it is worth noting that with paid on-cloud recognition we cannot use the same approach as with on-device recognition, that is, send camera frames as fast as we can and try to recognize the card number in each of them.

The frame rate differs from device to device: on a Pixel 3 XL it is about 5 fps on average, on a Huawei Y8p about 2 fps. The point is that on average it is more than one frame per second, and frames are sent for recognition as soon as the user opens the screen, even before they have pointed the camera at the card.

That adds up to a very large number of requests, and therefore a considerable amount of money.

The last chance

After the failures with on-device and on-cloud text recognition, I decided to check whether ML Kit has anything more specific, namely for recognizing card numbers. In GMS ML Kit I did not find anything of the kind, but in HMS ML Kit there is Bank Card Recognition.

But there are 3 problems:

  1. It works with the camera itself; you only need to pass it an Activity and a callback for the results. Accordingly, we cannot use CameraX.

  2. GMS ML Kit has nothing of the kind, so this would only work for the application distributed in the Huawei AppGallery, and we want it to work for everyone.

  3. The pricing of this feature is not very clear: for on-device it says “Free in the trial period”, and for on-cloud, “N/A”.

Show me the code

All code snippets in the article come from the sample application available in this repository. It works: you can run it on your device and check the recognition quality yourself. In addition to CameraX + ML Kit, card.io is also included so that you can compare.

Outcomes

I have described our experience of replacing card.io with the CameraX + ML Kit combination for card number recognition. ML Kit (both GMS and HMS) copes with this task much worse than card.io.

Because of this, we decided to keep card.io in the application and look towards reading the card number via NFC, since the vast majority of bank cards are now contactless.

All links

  1. Sample app for this article

  2. card.io

  3. CameraX

  4. CameraX codelab

  5. GMS Text Recognition

  6. GMS ML Kit Pricing

  7. Firebase ML

  8. HMS Text Recognition

  9. HMS ML Kit Pricing

  10. HMS Bank Card Recognition
