Building Music Recognition with ShazamKit and AVFoundation

ShazamKit is an Apple framework announced at WWDC 2021 that brings audio-matching capabilities to your app. You can make any prerecorded audio recognizable by building your own custom catalogue from podcasts, videos, and more, or match music against the millions of songs in Shazam’s vast catalogue.

Today, we are going to build a simple music-matching recognizer. The idea is to create a component that is independent of the UI framework being used (SwiftUI or UIKit).

We will create a Swift class, creatively named ShazamRecognizer, that has a few simple tasks to perform:

  1. Create the properties that will help us build our class
  2. Request Permission to record audio using the AVFoundation framework
  3. Start Recording and Send the recording to ShazamKit for recognition
  4. Handle the response from ShazamKit (success when a match is found, error when no match is found)
  5. Display our result in a UI (e.g. SwiftUI or UIKit)

Create the properties that will help us build our class

import AVFoundation
import Combine
import ShazamKit

final class ShazamRecognizer: NSObject, ObservableObject {
    // 1. Audio Engine
    private let audioEngine = AVAudioEngine()

    // 2. Shazam Engine
    private let shazamSession = SHSession()

    // 3. UI state purpose
    @Published private(set) var isRecording = false

    // 4. Success Case
    @Published private(set) var matchedTrack: ShazamTrack?

    // 5. Failure Case
    @Published var error: ErrorAlert? = nil
}

In the above declarations:

  1. We create the audioEngine, which is used to start and stop the recording.
  2. We create the shazamSession, which is used to perform the matching process.
  3. We use isRecording to track whether there is an ongoing recording operation. This value can be used, for example, to show a different UI for each state.
  4. We create a variable of the custom type ShazamTrack to store our result in the success case (when a match is found).
  5. In the failure case, we store the error in the error variable of type ErrorAlert, which can be used to display an alert in the UI (a minimal sketch of both custom types follows this list).
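
ShazamTrack and ErrorAlert are our own small custom types; their exact definitions aren't covered in this article, but a minimal sketch could look like this (the fields chosen below are just an assumption covering what we need):

import Foundation
import ShazamKit

// A minimal sketch of the custom types used above (fields are an assumption).
struct ShazamTrack: Identifiable {
    let id = UUID()
    let title: String
    let artist: String
    let artworkURL: URL?
    let appleMusicURL: URL?

    init(_ item: SHMatchedMediaItem) {
        self.title = item.title ?? "Unknown Title"
        self.artist = item.artist ?? "Unknown Artist"
        self.artworkURL = item.artworkURL
        self.appleMusicURL = item.appleMusicURL
    }
}

// Identifiable so it can drive an alert(item:) modifier in SwiftUI.
struct ErrorAlert: Identifiable {
    let id = UUID()
    let message: String

    init(_ message: String) {
        self.message = message
    }
}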

Request Permission to record audio using the AVFoundation framework

In our class, we proceed by adding the listenToMusic() function.

func listenToMusic() {
    // 1.
    let audioSession = AVAudioSession.sharedInstance()
    // 2. 
    audioSession.requestRecordPermission { status in
        if status {
            // 3.
            self.recordAudio()
        } else {
            // 4. Publish the error on the main thread
            DispatchQueue.main.async {
                self.error = ErrorAlert("Please Allow Microphone Access !!!")
            }
        }
    }
}

In our listenToMusic() function:

  1. We use an audioSession to communicate to the operating system the general nature of our app’s audio, without detailing the specific behaviour or required interactions with the audio hardware.
  2. Using the audioSession, we request the user's permission to record audio. At this point, we have to add a new key (NSMicrophoneUsageDescription) to our Info.plist with a message that tells the user why the app is requesting access to the device’s microphone; otherwise, our app will crash at runtime when it asks for access.
  3. If the user grants permission, we start the recording operation in our recordAudio() function, which we will build in the next section.
  4. If the user denies permission, we simply store the error in our error variable.

AVAudioSession: An audio session acts as an intermediary between your app and the operating system — and, in turn, the underlying audio hardware.
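
Note that the snippets above don't configure the session itself. Depending on your app, you may also want to give the session a record-capable category before starting the engine; here is a minimal sketch (the category, mode, and options below are assumptions, not part of the original code):

// Optional helper: configure the shared audio session for recording.
// The category, mode, and options here are assumptions for this sketch.
private func configureAudioSession() throws {
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.playAndRecord, mode: .default, options: [.defaultToSpeaker])
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
}

You could call this, for example, at the top of recordAudio() before preparing and starting the audioEngine.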

Start Recording and Send the recording to ShazamKit for recognition

Here is what our recordAudio() function looks like. Quite a function!

private func recordAudio() {
    // 1. If the `audioEngine` is running, stop it and return
    if audioEngine.isRunning {
        self.stopAudioRecording()
        return
    }

    // 2. Get the audio engine's inputNode to listen to
    let inputNode = audioEngine.inputNode

    // 3. Get the audio format to use for our inputNode
    // We are using bus .zero for this example
    let format = inputNode.outputFormat(forBus: .zero)

    // 4. Removes the tap if already installed on the node
    inputNode.removeTap(onBus: .zero)

    // 5. Install an audio tap on the bus using our inputNode
    // Record, monitor, and observe the output of the node.
    // This will listen to music continuously
    inputNode.installTap(onBus: .zero,
                         bufferSize: 1024,
                         format: format)
    { buffer, time in
        // 6. Start Shazam Matching Operation,
        // Converts the audio in the buffer to a signature and 
        // searches the reference signatures in the session catalog.
        self.shazamSession.matchStreamingBuffer(buffer, at: time)
    }

    // 7. Prepare the audio engine to start
    audioEngine.prepare()

    do {
        // 8. Start the audio engine
        try audioEngine.start()

        DispatchQueue.main.async {
            // 9. Set the recording state to true
            self.isRecording = true
        }
    } catch {
        // 10. Handle any error that may occur (publish on the main thread)
        DispatchQueue.main.async {
            self.error = ErrorAlert(error.localizedDescription)
        }
    }
}

After going through the above function, the question to ask is: how do we handle the response that the shazamSession returns?

No worries, that's the topic for our next exciting section.

In case you are wondering what the stopAudioRecording() function mentioned above looks like, here you go:

private func stopAudioRecording() {
    audioEngine.stop()
    isRecording = false
}

Handle the response from ShazamKit

First, we need to tell the shazamSession where to deliver its result!

As you can see below, we are using our ShazamRecognizer class as the delegate for the session so that we can be informed when there is a successful result or a failure.

override init() {
    super.init()
    // Sets delegate to be ShazamRecognizer class
    shazamSession.delegate = self
}

By doing the above, we are obliged to conform to the SHSessionDelegate protocol and implement its delegate methods. So we extend our class and add the following:

extension ShazamRecognizer: SHSessionDelegate {
    func session(_ session: SHSession, didFind match: SHMatch) {
        DispatchQueue.main.async {
            // 1. Get the first matched item and convert it into our own model
            if let firstItem = match.mediaItems.first {
                self.matchedTrack = ShazamTrack(firstItem)
                // 2. Stop the audio recording
                self.stopAudioRecording()
            }
        }
    }

    func session(_ session: SHSession, didNotFindMatchFor signature: SHSignature, error: Error?) {
        DispatchQueue.main.async {
            // 1. Store the error message
            self.error = ErrorAlert(error?.localizedDescription ?? "No Match found!")
            // 2. Stop the audio recording
            self.stopAudioRecording()
        }
    }
}

Our first delegate method is func session(_ session: SHSession, didFind match: SHMatch), which is called when a match is found. Here, we:

  1. Get the first item in the match's mediaItems (an array of the media items in the catalog that match the query signature, ordered by the quality of the match) and convert that firstItem, of type SHMatchedMediaItem, into our own custom model ShazamTrack.
  2. Stop the audio recording by calling stopAudioRecording(), which stops our audioEngine.

Our second delegate method is func session(_ session: SHSession, didNotFindMatchFor signature: SHSignature, error: Error?), which is called when no match is found. Here, we:

  1. Send the error message to our error variable.
  2. Stop the audio recording by calling stopAudioRecording(), which stops our audioEngine.

At this point, we are pretty much done with our audio recognition system! 👏🏻💪🏼 We are ready to use it in our application, no matter the UI framework.

For this example, I've used SwiftUI for a quick prototype, but you can use UIKit as well without any particular effort.
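
To give a rough idea, here is a minimal SwiftUI sketch that observes our ShazamRecognizer (the layout, labels, and the ShazamTrack/ErrorAlert fields are assumptions based on the models sketched earlier, not the article's actual demo UI):

import SwiftUI

// A minimal sketch of a SwiftUI view driving ShazamRecognizer.
struct ShazamView: View {
    @StateObject private var recognizer = ShazamRecognizer()

    var body: some View {
        VStack(spacing: 20) {
            if let track = recognizer.matchedTrack {
                Text(track.title)
                    .font(.headline)
                Text(track.artist)
                    .font(.subheadline)
            }

            Button(recognizer.isRecording ? "Listening…" : "Start Listening") {
                recognizer.listenToMusic()
            }
        }
        .padding()
        // Present an alert whenever an ErrorAlert value is published.
        .alert(item: $recognizer.error) { alert in
            Alert(title: Text("Oops"),
                  message: Text(alert.message),
                  dismissButton: .default(Text("OK")))
        }
    }
}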

You can find the full demo project

Conclusion

The ShazamKit framework has a lot to offer, but in this article we have only scratched the surface. I hope you have learned something today :)