Video Transcription App: Build AI Tool Yourself (Part 2)

A Step-by-Step Guide to Building Your Video Transcription Service with Next.js, Firebase, and OpenAI's Whisper API

Hey everyone! In the previous part of this series, we built the drag-and-drop element that lets users upload video files and integrated Firebase Storage into the project. Now we are going to finish building our video transcription application. The demo below will refresh your memory on how the final version of the application should work:

Today we are going to discover Firebase Cloud Functions and their integration with other Firebase services such as Firebase Storage and Realtime Database. Moreover, we will use the Whisper API provided by OpenAI to convert audio from the uploaded video into text. 🗣️

If you missed the beginning of this series, you can check it out here to make sure you're following along. And if you need the source code for this project, feel free to grab it from my GitHub repository.

Firebase: Cloud Functions

Before jumping into the development of Cloud Functions, we should once again review how our application is going to operate. This diagram from the first part describes the entire process of transcribing the uploaded video. If you feel overwhelmed, feel free to revisit the first part for the full explanation of this schema.

To start using Firebase Cloud Functions, we need to set up the Firebase CLI, which deploys functions to the Cloud Functions runtime. Open a terminal in your project folder and run the following command to install the Firebase CLI globally:

$ npm install -g firebase-tools

Now we can initialize Firebase SDK for Cloud Functions by executing the commands below. When prompted, we will choose to write the functions in TypeScript.

$ firebase login

$ firebase init functions

This will create the functions folder within our project directory structure. The source code for our cloud functions will live in the src/index.ts file inside this folder.
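After initialization, the relevant part of the project structure should look roughly like this (generated lint and config files omitted):

functions/
├── src/
│   └── index.ts
├── package.json
└── tsconfig.json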

Firstly, we need to initialize the Firebase Admin SDK to integrate multiple Firebase services through a unified admin system. To do so, navigate to the "Project settings" within the Firebase console and select "Service accounts".

Next, click "Generate new private key" and confirm. Save the downloaded key as serviceAccount.json at the same level as the index.ts file in the project's folder structure (that is the file name we import below).
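One small detail: importing a .json file from TypeScript, as we do with serviceAccount.json below, only compiles if JSON module resolution is enabled. The tsconfig.json generated by firebase init may not include it, so you might need to merge options like these into functions/tsconfig.json (a minimal sketch of the relevant options only; keep your existing settings):

// functions/tsconfig.json (relevant options only)
{
  "compilerOptions": {
    "resolveJsonModule": true,
    "esModuleInterop": true
  }
}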

Now we can set up Admin SDK within the index.ts file and retrieve access to the storage buckets that are going to be used throughout multiple functions.

// functions/src/index.ts

import {onObjectFinalized} from "firebase-functions/v2/storage";
import {getStorage} from "firebase-admin/storage";
import serviceAccount from "./serviceAccount.json";
import * as admin from "firebase-admin";

// Initializing the Admin SDK
admin.initializeApp({
    credential: admin.credential.cert(serviceAccount as admin.ServiceAccount)
});

// Retrieve an authenticated reference to our buckets
const videoBucket = getStorage().bucket("speech-transcriber-81b11.appspot.com");
const audioBucket = getStorage().bucket("audio-files-432e");

Extracting an Audio File from a Video

In the previous part of our series, we implemented uploading video files to Firebase Storage. The next step outlined in the diagram is a function that extracts the audio from the uploaded video file and places it in a separate bucket. This function will be triggered via onObjectFinalized whenever an object is uploaded to the video bucket.

// functions/src/index.ts

import {onObjectFinalized} from "firebase-functions/v2/storage";

// Triggering on uploading the file to the bucket 'speech-transcriber-81b11.appspot.com'
export const onVideoFileUploaded = onObjectFinalized(
    {bucket: "speech-transcriber-81b11.appspot.com"},
    async (event) => {
        // Extracting the file and its metadata
        const file = event.data;
        const metadata = file.metadata as { uploadId: string };

        // Extracting audio from the uploaded file...
    }
);

If you want to read more about triggering the Cloud Function on Cloud Storage events, I refer you to this detailed explanation in Firebase Docs.

This function runs every time someone uploads a file to the specified bucket. We can then extract the file itself and its metadata from the event object.
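As a small aside, this trigger fires for every object that lands in the bucket, not just videos. If you ever store other files there, a quick content-type guard at the top of the handler keeps them from being processed. A minimal sketch (not part of the original flow, and assuming the logger import from firebase-functions/logger that we add in a later snippet):

// Optional guard: skip anything that is not a video file
if (!event.data.contentType?.startsWith("video/")) {
    logger.log(`Skipping non-video object: ${event.data.name}`);
    return;
}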

Here comes the most sophisticated part of this function. We need to convert the retrieved video file into audio. The most straightforward way to achieve this is to use FFmpeg, a comprehensive multimedia framework capable of performing a variety of operations such as decoding, encoding, and transcoding.

We will specifically use fluent-ffmpeg, a Node.js module that makes it easy to drive FFmpeg from JavaScript by providing a fluent abstraction for building chains of FFmpeg instructions. We will pair it with @ffmpeg-installer/ffmpeg, which bundles a static FFmpeg binary for the runtime to use. Install both packages with the following command:

$ npm install fluent-ffmpeg @ffmpeg-installer/ffmpeg

Overall, the process of converting video into audio includes the following steps:

  • Opening a read stream from the bucket the original video was uploaded to

  • Opening a write stream to the bucket where we want to store the extracted audio file

  • Building a pipeline of FFmpeg commands that converts the video stream into an MP3 audio stream

Therefore, the code for this function will have the following form:

// functions/src/index.ts

import ffmpeg from "fluent-ffmpeg";
import ffmpegPath from "@ffmpeg-installer/ffmpeg";
import * as logger from "firebase-functions/logger";

// Setting full path to the ffmpeg binary
ffmpeg.setFfmpegPath(ffmpegPath.path);

export const onVideoFileUploaded = onObjectFinalized(
    {bucket: "speech-transcriber-81b11.appspot.com"},
    async (event) => {
        const file = event.data;
        const metadata = file.metadata as { uploadId: string };

        // Opening read stream to the bucket storing uploaded video
        const rs = videoBucket.file(file.name).createReadStream();
        // Setting the name of the audio file
        const audioFileName = file.name.replace(/\.[^/.]+$/, ".mp3");
        // Opening write stream to the bucket storing audio
        const ws = audioBucket.file(audioFileName).createWriteStream({
            metadata: {
                metadata: {
                    uploadId: metadata.uploadId,
                },
                contentType: "audio/mpeg",
            },
        });

        // Extracting audio from video file
        await new Promise((resolve, reject) => {
            ffmpeg()
                .input(rs)  // Passing the read stream as input
                .toFormat("mp3")  // Setting new file format
                .on("error", (err) => {
                    logger.error(err.message);
                    reject(err);
                })
                .on("end", () => {
                    logger.log(`File ${file.name} converted.`);
                    // Deleting uploaded video file when done with converting
                    videoBucket.file(file.name).delete();
                    resolve(null);
                })
                // Emitting 'end' event when readable stream ends
                .pipe(ws, {end: true}); 
        });
    });

Congrats! 🔥 We have finished the function responsible for converting video files into audio. As a result, the extracted audio file ends up in a separate bucket.

By the way, while working on this project, I stumbled upon this resource, which contains a thoroughly commented example of transcoding a video file inside a Cloud Function. You may find it helpful as well!

Transcribing the Audio

Following the flow shown in the diagram, we will create another function that transcribes the uploaded audio file into text. It will be triggered whenever the first function uploads a file to the audio bucket.

// functions/src/index.ts

export const onAudioFileUploaded = onObjectFinalized({bucket: "audio-files-432e"}, async (event) => {
    const file = event.data;
    const metadata = file.metadata as { uploadId: string; };

    // Transcribing audio...
});

As I previously mentioned, this function will use two key services: the Whisper API and the Firebase Realtime Database. We are going to send the uploaded audio file to the Whisper API, get the transcription back as a response, and store it in the Realtime Database.

Integrating OpenAI API

Firstly, we will install the OpenAI Node API package to easily access the OpenAI services in JavaScript. Run the following command to install the client:

$ npm install openai

To keep this article brief, I will skip the process of configuring the OpenAI API. If you don't have an account yet, sign up with OpenAI and create a new secret API key to get access to the OpenAI services. It is pretty straightforward.

After retrieving the key, we can add it to our new function. To avoid exposing it, we will create a .env file inside the functions folder and store the key there as an environment variable. Afterward, we can access it within our code in the following way:

// functions/src/index.ts

import OpenAI from "openai";

export const onAudioFileUploaded = onObjectFinalized({bucket: "audio-files-432e"}, async (event) => {
    const file = event.data;
    const metadata = file.metadata as { uploadId: string; };

    const openai = new OpenAI({
        apiKey: process.env.OPENAI_API_KEY,
    });
});
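For reference, the functions/.env file itself contains just a single line; the variable name has to match what we read in the code above, and the value below is only a placeholder:

# functions/.env
OPENAI_API_KEY=your-secret-key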

The Whisper API docs state that we need to pass the audio file object we want to transcribe. Here's the catch: we can't simply hand the file retrieved from the event object to the OpenAI API call the way we piped it into FFmpeg in the previous function.

Luckily, there is a workaround: download the uploaded audio file from the bucket into a temporary folder and then pass that local file to the API call.

To do so, we will make use of the tmp directory, a temporary directory available to every instance running a Cloud Function. You can read more about it here. To work with the files, we will also use Node.js's built-in filesystem module, fs.

// functions/src/index.ts

import OpenAI from "openai";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

export const onAudioFileUploaded = onObjectFinalized({bucket: "audio-files-432e"}, async (event) => {
    const file = event.data;
    const metadata = file.metadata as { uploadId: string; };

    // Initializing the OpenAI client
    const openai = new OpenAI({
        apiKey: process.env.OPENAI_API_KEY,
    });

    // Getting the path to the temporary file location in "tmp" folder
    const tmpFilePath = path.join(os.tmpdir(), file.name);

    // Downloading the file from audio bucket into "tmp"
    await audioBucket.file(file.name).download({destination: tmpFilePath});

    // Opening the read stream from the temporary file location
    const rs = fs.createReadStream(tmpFilePath);

    // Sending the API call to transcribe the audio file
    const transcription = await openai.audio.transcriptions.create({
        file: rs,
        model: "whisper-1",
        language: "en",
    });
});
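One practical caveat: the Whisper endpoint rejects uploads larger than 25 MB, so very long recordings may fail at this step. If you expect big files, you could bail out early with a size check right after the download. A minimal sketch (not part of the original flow, and again assuming the logger import used elsewhere in this file):

// Assumption: skip files that exceed Whisper's 25 MB upload limit
const {size} = fs.statSync(tmpFilePath);
if (size > 25 * 1024 * 1024) {
    logger.error(`File ${file.name} is too large for the Whisper API.`);
    fs.unlinkSync(tmpFilePath);
    return;
}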

After sending the provided audio file to the Whisper API and getting its transcript as a response, we need to store this transcript inside the Firebase Realtime Database, a cloud-hosted database in which data is synced across all clients in real time. To connect the Realtime Database to our project, go again to the Firebase console and select "Realtime Database". Then, select "Create Database". You can select the default configuration settings.

To incorporate the Realtime Database in our application, we need to slightly adjust the way we initialized the Firebase SDK both for our frontend and cloud functions. This small change requires adding the URL of the newly created database.

To find the database URL, open the "Realtime Database" section in the Firebase console.

Include this URL in the Firebase SDK initialization configuration both for our frontend and cloud functions:

// lib/firebase.ts

import { initializeApp } from "firebase/app";
import { getDatabase } from "firebase/database";
import { getStorage } from "firebase/storage";

const firebaseConfig = {
  // Rest of the parameters...
  databaseURL: "YOUR-DATABASE-URL",
};

const app = initializeApp(firebaseConfig);
export const storage = getStorage(app);
export const db = getDatabase(app);

// functions/src/index.ts

admin.initializeApp({
    credential: admin.credential.cert(serviceAccount as admin.ServiceAccount),
    databaseURL: "YOUR-DATABASE-URL",
});
const db = admin.database();

With the database ready, we will add a new entry to it containing the transcript of the audio. To do so, we'll get a reference to a dedicated 'transcripts' node and insert the new transcript under it.

As a reminder, when we initially uploaded the video on the frontend to the Firebase Storage, we assigned it a unique ID and passed it as a part of the file's metadata. We are going to use this ID as a unique identifier for the transcript in the database to access the transcript stored under the specific ID.
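For illustration, the data we are about to write gives the database roughly the following shape, expressed here as a TypeScript type (the keys are the upload IDs generated on the frontend):

// Illustrative shape of the 'transcripts' node in the Realtime Database
type TranscriptsNode = {
    [uploadId: string]: {
        transcript: string;
    };
};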

The final version of our code will look like this:

// functions/src/index.ts

import {onObjectFinalized} from "firebase-functions/v2/storage";
import * as logger from "firebase-functions/logger";
import OpenAI from "openai";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

export const onAudioFileUploaded = onObjectFinalized({bucket: "audio-files-432e"}, async (event) => {
    const openai = new OpenAI({
        apiKey: process.env.OPENAI_API_KEY,
    });
    const file = event.data;
    const metadata = file.metadata as { uploadId: string; };
    const tmpFilePath = path.join(os.tmpdir(), file.name);
    await audioBucket.file(file.name).download({destination: tmpFilePath});


    const transcription = await openai.audio.transcriptions.create({
        file: fs.createReadStream(tmpFilePath),
        model: "whisper-1",
        language: "en",
    });

    // Getting a reference to the 'transcripts' node in the database
    const transcriptRef = db.ref("transcripts");
    // Inserting the new transcript under its unique ID.
    // update() only touches this child, unlike set(), which would overwrite the whole node.
    await transcriptRef.update({
        [metadata.uploadId]: {
            transcript: transcription.text,
        },
    }).catch((err) => {
        logger.error(err.message);
    });

    // Removing the temporarily created file
    fs.unlinkSync(tmpFilePath);
});

Great! Now, whenever a video file is uploaded, this chain of cloud functions extracts its audio, transcribes it, and stores the transcript in the database. The last piece of the puzzle is to display this transcript to the client.
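Before testing the whole flow end to end, remember to deploy both functions:

$ firebase deploy --only functions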

Frontend: Displaying the Transcript

First of all, we will create a separate component called Transcript.tsx that receives the ID of the uploaded video as a prop. The component keeps the transcript of the video in local state and renders it on the page.

// components/Transcript.tsx

"use client";
import { useState } from "react";

type Props = {
    id?: string;
}

const Transcript = ({ id }: Props) => {
    const [transcript, setTranscript] = useState<string>();

    return (
        <p>{transcript}</p>
    );
};

export default Transcript;

Now we have to retrieve the transcript of the video from the database using the passed ID. For this, we will rely on the Realtime Database client SDK.

On the client, we will attach a listener with onValue that fires whenever the data at a specified database reference changes. The listener receives a snapshot containing all the data at that location.

If we listened to the entire 'transcripts' node, every change would hand us a snapshot of all transcripts. Our goal, however, is to display only the transcript of the currently uploaded video, so it is more efficient to listen just to the child node associated with that particular video.

// components/Transcript.tsx

"use client";
import { db } from "@/lib/firebase";
import { onValue, ref } from "firebase/database";
import { useEffect, useState } from "react";

type Props = {
    id?: string;
}

const Transcript = ({ id }: Props) => {
    const [transcript, setTranscript] = useState<string>();

    useEffect(() => {
        // Waiting until a video has actually been uploaded
        if (!id) return;
        // Listening only to the transcript stored under this upload ID
        const unsubscribe = onValue(ref(db, "transcripts/" + id), (snapshot) => {
            if (snapshot.exists()) {
                const data = snapshot.val() as { transcript: string };
                setTranscript(data.transcript);
            }
        });
        // Detaching the listener when the component unmounts or the ID changes
        return unsubscribe;
    }, [id]);

    return (
        <p>{transcript}</p>
    );
};

export default Transcript;

Having this component developed, don't forget to add it to the main page.

// app/page.tsx

"use client";
import ImageDropzone from "@/components/ImageDropzone";
import Transcript from "@/components/Transcript";
import { useCallback, useState } from "react";
// ...plus the upload-related imports from Part 1 (storage, ref, uploadBytes, uuid, path)

export default function Home() {
    const [id, setId] = useState<string>();

    const onDrop = useCallback(async (acceptedFiles: File[]) => {
        if (acceptedFiles.length > 0) {
            setLoading(true); // The loading state is introduced in the "Frontend: Extras" section below
            const file = acceptedFiles[0];
            const fileName = file.name.replace(/\.[^/.]+$/, "");
            const fileRef = ref(storage, `${fileName}-${Date.now()}${path.extname(file.name)}`);
            const uploadId = uuid();
            await uploadBytes(fileRef, file, {
                customMetadata: {
                    uploadId
                }
            });
            setId(uploadId);
        }
    }, []);

    return (
        <>
            <ImageDropzone onDrop={onDrop} />
            <Transcript id={id} />
        </>
    );
}

Frontend: Extras

The application is finally complete. However, I encourage you to add a few optional features to make the app more functional and appealing to the user. 🤩

Firstly, I would prefer to reveal the transcript with a smooth typing animation rather than having it appear on the screen all at once. We can add such an animation using the react-type-animation NPM package.

All we need to do is install the library and use the TypeAnimation component it provides. First, install the package:
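$ npm install react-type-animation

Then we wrap the transcript text in the TypeAnimation component: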

// components/Transcript.tsx

"use client";
import { db } from "@/lib/firebase";
import { onValue, ref } from "firebase/database";
import { useEffect, useState } from "react";
import { TypeAnimation } from "react-type-animation";

type Props = {
    id?: string;
}

const Transcript = ({ id }: Props) => {
    const [transcript, setTranscript] = useState<string>();

    useEffect(() => {
        if (!id) return;
        const unsubscribe = onValue(ref(db, "transcripts/" + id), (snapshot) => {
            if (snapshot.exists()) {
                const data = snapshot.val() as { transcript: string };
                setTranscript(data.transcript);
            }
        });
        return unsubscribe;
    }, [id]);

    return (
        <p className="mt-10 place-self-start">
            {transcript &&
                <TypeAnimation
                    sequence={[transcript]}
                    wrapper="span"
                    cursor={false}
                    speed={75}
                    repeat={0}
                />
            }
        </p>
    );
};

export default Transcript;

Secondly, I would add a spinner to inform the user that the application is currently processing the uploaded video. To do so, we'll create a separate component called Spinner.tsx within the components folder. How you implement the spinner is up to you: use your preferred styling approach or even a separate library. I've created a spinner using Tailwind CSS.

// components/Spinner.tsx

const Spinner = () => {
    return (
        <div className="w-10 h-10 border-b-2 border-neutral-600 rounded-full animate-spin" />
    );
};

export default Spinner;

We will then add a separate piece of state to the page.tsx file that indicates whether the application is currently processing an uploaded file. While it is, the spinner will be displayed instead of the dropzone.

// app/page.tsx

"use client";
import ImageDropzone from "@/components/ImageDropzone";
import Transcript from "@/components/Transcript";
import Spinner from "@/components/Spinner";
import { useCallback, useState } from "react";

export default function Home() {
    const [loading, setLoading] = useState<boolean>(false);

    // Rest of the code...

    return (
        <>
            {loading ? <Spinner /> : <ImageDropzone onDrop={onDrop} />}
            <Transcript id={id} setLoading={setLoading} />
        </>
    );
}

Finally, once the new transcript is added to the database, we will switch the loading state back to false and show the generated transcript. To accomplish this, we modify the Transcript component so that it receives the loading state's setter as a new prop.

// components/Transcript.tsx

"use client";
import { db } from "@/lib/firebase";
import { onValue, ref } from "firebase/database";
import { Dispatch, SetStateAction, useEffect, useState } from "react";
import { TypeAnimation } from "react-type-animation";

type Props = {
    id?: string;
    setLoading: Dispatch<SetStateAction<boolean>>;
}

const Transcript = ({ id, setLoading }: Props) => {
    const [transcript, setTranscript] = useState<string>();

    useEffect(() => {
        if (!id) return;
        const unsubscribe = onValue(ref(db, "transcripts/" + id), (snapshot) => {
            if (snapshot.exists()) {
                const data = snapshot.val() as { transcript: string };
                setTranscript(data.transcript);
                // Hiding the spinner once the transcript has arrived
                setLoading(false);
            }
        });
        return unsubscribe;
    }, [id, setLoading]);

    return (
        <p className="mt-10 place-self-start">
            {transcript &&
                <TypeAnimation
                    sequence={[transcript]}
                    wrapper="span"
                    cursor={false}
                    speed={75}
                    repeat={0}
                />
            }
        </p>
    );
};

export default Transcript;

Great! 🎉 Our application now works just like the demo shown at the beginning of the article.

Conclusion

This series showed you how to build a complete AI-powered video transcription service using Next.js, Firebase, FFmpeg, and the OpenAI API!

To be honest, writing this series took a lot of effort and was a real challenge for me. Every sentence was a struggle 😅 However, the result is worthwhile, and I hope you enjoyed it!

As usual, if you encounter any problems along the way, you can refer to the complete source code available on my GitHub repository. If you have any questions or need assistance, don't hesitate to reach out to me.

Remember to follow me on Twitter/X, where I share daily updates. And feel free to connect with me on LinkedIn.
