Video Transcription Service: Build AI Tool Yourself

A Step-by-Step Guide to Building Your Video Transcription Service with Next.js, Firebase, and OpenAI's Whisper API

Nov 15, 2023 · 12 min read

With the emergence of numerous AI startups, I find it captivating to embark on the journey of building an AI application ourselves. Today, I will guide you through the entire process of building a video transcription service. The application will display a transcript of the uploaded video to the user, similar to the demo provided below. 👀👇

Although we'll be creating a prototype, consider this a foundation upon which you can expand to develop your own AI tool! The project uses Next.js and Firebase, which play a key role in constructing the backend logic for extracting audio from video and converting speech into text. The application will involve various Firebase services like Cloud Storage, Realtime Database, Cloud Functions, and more!

Since the project is quite lengthy, I have divided it into two parts. In this part, we will build a basic frontend that allows users to upload videos to the server. Then we will set up the Firebase project and integrate Firebase Cloud Storage. Without further ado, let's start the development!

Prerequisites

It is perfectly fine if you haven't used Firebase before; I'll walk you through the entire setup process, so there's no need to worry! Given that our frontend page will be basic, advanced knowledge of Next.js isn't necessary, though a basic familiarity with it would be beneficial. If you prefer, you can build a similar application using plain React - the choice of frontend technology doesn't significantly impact the process here.

If you ever need the source code for this project, feel free to access it on my GitHub repository.

Functionality Overview

Before delving into the development process, I want to highlight how our application will operate. I have created a diagram illustrating the key components of the app's logic. Let's break these elements down step by step:

  1. The user initiates the process by uploading a video to a Firebase Cloud Storage bucket. This particular bucket stores only the video files submitted by the user. To prevent unnecessary storage costs, we will remove these files after extracting the audio content.

  2. When the video gets uploaded to the bucket, it triggers a Firebase Cloud Function. This function is responsible for extracting the audio part from the video and shipping it off to another bucket. Then, it deletes the corresponding video file from the storage.

  3. Similarly, the upload to this second bucket triggers another Cloud Function, which uses the newly uploaded audio file to generate a transcript of the speech.

  4. Within the function, we send a request to the OpenAI Whisper API, which converts the audio to text. Although we could have sent the video directly to the API, it has a maximum upload size of 25 MB. Since the application is about uploading videos, the 25 MB limit may not accommodate longer videos. Thus, the logical approach is to convert the video into a much smaller audio file first.

  5. Once the transcription is finished, we store the result in the Realtime Database, which holds all generated transcripts.

  6. The creation of a new element in the Realtime Database immediately triggers a change in the user interface. We will set up a special observer that updates the UI state whenever there is any change in the database.

This is the complete pipeline for the project. If any of the details leave you scratching your head, there's no need to stress; everything will fall into place once we begin building the app.
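
To make steps 2 and 3 a little more concrete, here is a rough sketch of the shape the first storage-triggered Cloud Function could take. We will actually build it in the next part; the bucket name and the extractAudio helper below are placeholders, not final code.

// functions/src/index.ts (preview of part two - names are placeholders)

import { initializeApp } from "firebase-admin/app";
import { getStorage } from "firebase-admin/storage";
import { onObjectFinalized } from "firebase-functions/v2/storage";

initializeApp();

export const onVideoUploaded = onObjectFinalized(
    { bucket: "your-video-bucket" }, // placeholder: the bucket receiving video uploads
    async (event) => {
        const filePath = event.data.name;
        // 1. Extract the audio track (e.g. with FFmpeg) and upload it to the audio bucket.
        // await extractAudio(filePath); // hypothetical helper, implemented in part two
        // 2. Delete the original video so it doesn't accumulate storage costs.
        await getStorage().bucket(event.data.bucket).file(filePath).delete();
    }
);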

Frontend: Uploading Video File

Let's kick things off by setting up our frontend! As mentioned earlier, we're going to build a Next.js application. To get started, run the following command and choose the configuration outlined below:

npx create-next-app@latest

✔ What is your project named? … speech-transcriber
✔ Would you like to use TypeScript? … Yes
✔ Would you like to use ESLint? … Yes
✔ Would you like to use Tailwind CSS? … Yes
✔ Would you like to use `src/` directory? … Yes
✔ Would you like to use App Router? (recommended) … Yes
✔ Would you like to customize the default import alias (@/*)? … No

I have cleared out the default code and unnecessary styles from globals.css, leaving only the essential Tailwind directives. Next up, let's navigate to layout.tsx, where we'll create the basic view of our page.

// app/layout.tsx

import type { Metadata } from "next";
import { Poppins } from "next/font/google";
import "./globals.css";
import Link from "next/link";

const poppins = Poppins({
    weight: ["400", "700"],
    style: "normal",
    subsets: ["latin"],
});

export const metadata: Metadata = {
    title: "Speech Transcriber",
    description: "Upload an audio file and get a transcript of the speech."
};

export default function RootLayout({
    children,
}: {
  children: React.ReactNode
}) {
    return (
        <html lang="en">
            <body className={poppins.className}>
                <div className="flex flex-col max-w-[90%] mx-auto items-center justify-center py-7">
                    <h1 className="font-bold text-4xl mb-3">
                        <Link href="/">Speech Transcriber</Link>
                    </h1>
                    <p className="mb-10">
              Upload an audio file and get a transcript of the speech.
                    </p>
                    {children}
                </div>
            </body>
        </html>
    );
}

The initial layout contains only the title and subtitle, but later we will incorporate components that are necessary for uploading video files and displaying the results of speech transcription.

Now, we can focus on providing users with a way to upload the video file they want to transcribe. To ensure a seamless user experience, we can build a drag-and-drop component using the React Dropzone library, which allows us to swiftly create a dropzone for our application.

Before we dive in, I want to note that I've shared a thread about this on Twitter/X. There, you can find more details on creating such a component using the React Dropzone library. Check it out below 👀👇

To install the library, type the following command in the terminal:

npm install react-dropzone

First of all, in the page.tsx file, we'll define the function responsible for handling the "drop" file event and wrap it in the useCallback hook. The callback receives the array of uploaded files as a parameter.

// app/page.tsx

"use client";
import { useCallback } from "react";

export default function Home() {
    const onDrop = useCallback(async (acceptedFiles: File[]) => {
        if (acceptedFiles.length > 0) {
            // Do something with the files...
        }
    }, []);

    return null;
}

Next, we'll create a separate component ImageDropzone.tsx within the new components folder. We will use the useDropzone hook offered by the library, which accepts the onDrop function passed to the component as a prop. This hook provides us with two dropzone property getters - getRootProps and getInputProps. These are simply two functions that return objects with properties essential for building the drag 'n' drop zone.

Moreover, we can restrict how many files can be uploaded (via maxFiles) and which file types are accepted (via accept). In our case, we want the user to upload a single video file.

// components/ImageDropzone.tsx

"use client";
import { useDropzone } from "react-dropzone";

type Props = {
    onDrop: (files: File[]) => void;
}

const ImageDropzone = ({ onDrop }: Props) => {
    // isDragActive lets us switch the helper text while a file is dragged over the zone
    const { getRootProps, getInputProps, isDragActive } = useDropzone({
        onDrop,
        maxFiles: 1,
        accept: { "video/*": [".mp4", ".ogv", ".mpeg"] },
    });

    return (
        <div {...getRootProps()}>
            <input {...getInputProps()} />
            {isDragActive ? (
                <p>Drop the video here ...</p>
            ) : (
                <p>Drag and drop video here, or click to select files</p>
            )}
        </div>
    );
};

export default ImageDropzone;

Alright, we can make our dropzone more user-friendly by adding some styling! Let's create styles corresponding to different states: when the component is idle, focused, or when an uploaded file is either accepted or rejected. Then we'll manipulate these states using the properties isFocused, isDragAccept, and isDragReject provided by the useDropzone hook.

// components/ImageDropzone.tsx

"use client";
import { useDropzone } from "react-dropzone";
import { CSSProperties } from "react";

const baseStyle: CSSProperties = {
    padding: 30,
    borderWidth: 2,
    borderRadius: 6,
    borderColor: "#eeeeee",
    borderStyle: "dashed",
    backgroundColor: "#fff",
    color: "#bdbdbd",
    outline: "none",
    transition: "border .2s ease-in-out",
    display: "flex",
    flexDirection: "column",
    alignItems: "center",
    cursor: "pointer",
};

const focusedStyle = {
    borderColor: "#2196f3",
};

const acceptStyle = {
    borderColor: "#00e676",
};

const rejectStyle = {
    borderColor: "#ff1744",
};

type Props = {
    onDrop: (files: File[]) => void;
}

const ImageDropzone = ({ onDrop }: Props) => {
    const { getRootProps, getInputProps, isDragActive, isFocused, isDragAccept, isDragReject } = useDropzone({
        onDrop,
        maxFiles: 1,
        accept: { "video/*": [".mp4", ".ogv", ".mpeg"] },
    });

    const style = {
        ...baseStyle,
        ...(isFocused ? focusedStyle : {}),
        ...(isDragAccept ? acceptStyle : {}),
        ...(isDragReject ? rejectStyle : {}),
    };

    return (
        <div {...getRootProps({ style })}>
            <input {...getInputProps()} />
            {isDragActive ? (
                <p>Drop the video here ...</p>
            ) : (
                <p>Drag and drop video here, or click to select files</p>
            )}
        </div>
    );
};

export default ImageDropzone;

Now that our dropzone component is ready, don't forget to render it on the home page.

// app/page.tsx

"use client";
import { useCallback } from "react";
// Adjust the import path to wherever you placed the component
import ImageDropzone from "@/components/ImageDropzone";

export default function Home() {
    const onDrop = useCallback(async (acceptedFiles: File[]) => {
        if (acceptedFiles.length > 0) {
            // Do something with the files...
        }
    }, []);

    return <ImageDropzone onDrop={onDrop} />;
}

The dropzone should look similar to the one below. Currently, the hook isn't doing anything, but we'll handle that later as we integrate Firebase into our application.

Firebase: Creating New Project

As I previously mentioned, I'll break down the entire process of incorporating Firebase step by step. Therefore, you don't need to worry if you've never worked with Firebase before.

Start by visiting the Firebase website and selecting the option "Create a project".

Name your project, choose your preferences for Google Analytics, and hit "Create Project."

Now, we need to add the web app to our Firebase project. Once you are redirected to the console for your new project, click on the web icon. Simply enter your app's name and click "Register app."

You'll then receive a set of instructions on installing and initializing the Firebase SDK in your project 👇

Firstly, we'll install the Firebase SDK using the following command:

npm install firebase

Now, we need to create a folder called lib. Inside it, we create a file named firebase.ts and paste in the configuration code Firebase provides, which looks like this:

// lib/firebase.ts

import { initializeApp } from "firebase/app";

const firebaseConfig = {
  apiKey: "YOUR-API-KEY",
  authDomain: "YOUR-AUTH-DOMAIN",
  projectId: "YOUR-PROJECT-ID",
  storageBucket: "YOUR-STORAGE-BUCKET",
  messagingSenderId: "YOUR-SENDER-ID",
  appId: "YOUR-APP-ID"
};

const app = initializeApp(firebaseConfig);

Great! 🔥 With the Firebase SDK ready, we can add a few essential Firebase services that are responsible for different features within our application.

If you recall the diagram I showed earlier, you can notice that we use the following services: Cloud Storage, Realtime Database, and Cloud Functions.

Firebase: Setting Up Cloud Storage

Let's tackle this diagram step by step, starting with Firebase Cloud Storage. Head to the left sidebar in the console and find the "Storage" option, then click "Get Started". You can set up the storage in test mode. Once the setup is done, Firebase creates a default storage bucket where we can upload any files.

To make it more convenient for us, we will separate the storage into two buckets: one for uploaded video files and another for audio files extracted from those videos.

However, if you click "Add bucket", Firebase will prompt you to upgrade your project to the "Blaze Plan".

Although it involves entering your payment details, there's no need to worry. The pricing table below makes it clear that charging only starts when you store over 5 GB of data or download more than 1 GB per day. You are highly unlikely to reach these limits unless your project is used by a large number of users. Furthermore, you can create a budget alert, specifying the maximum amount you're willing to spend.

Therefore, all you have to do is configure your billing details in your Google account, and then you can freely explore more services within Firebase.

Once the Blaze plan is in place, we're good to go: hit "Add bucket" to create the extra bucket that will store the extracted audio files.

Next up, we can start uploading files to the newly created buckets from our application. If any part of the process seems unclear, you can read the Firebase docs for Cloud Storage - they are incredibly helpful.

First, we are going to initialize Cloud Storage and get a reference to the service. To do so, head to lib/firebase.ts and add the following bit of code:

import { FirebaseOptions, initializeApp } from "firebase/app";
import { getStorage } from "firebase/storage";

const firebaseConfig: FirebaseOptions = {
    // Same code...
};

const app = initializeApp(firebaseConfig);
export const storage = getStorage(app);
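
As a side note, the same getStorage call also accepts a gs:// URL if you ever need to address a non-default bucket from the web client. We won't need this here, since only the Cloud Functions will touch the audio bucket, but it's worth knowing; the bucket name below is a placeholder:

// lib/firebase.ts (optional) - reference a specific, non-default bucket
// `app` is the Firebase app initialized earlier in this file
export const audioStorage = getStorage(app, "gs://your-project-audio-bucket");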

Moving on, we will head to app/page.tsx and handle the logic of uploading the file. Inside the previously defined onDrop function, we will verify that a file has indeed been provided. Afterward, we will rename the file by appending a timestamp to its name and upload it using the uploadBytes method.

// app/page.tsx

"use client";
import { useCallback, useState } from "react";
import { ref, uploadBytes } from "firebase/storage";
import { v4 as uuid } from "uuid";
import path from "path";
import { storage } from "@/lib/firebase";
import ImageDropzone from "@/components/ImageDropzone";

export default function Home() {
    const [id, setId] = useState<string>();

    const onDrop = useCallback(async (acceptedFiles: File[]) => {
        if (acceptedFiles.length > 0) {
            const file = acceptedFiles[0];
            // Strip the extension, then rebuild the name with a timestamp appended
            const fileName = file.name.replace(/\.[^/.]+$/, "");
            const fileRef = ref(storage, `${fileName}-${Date.now()}${path.extname(file.name)}`);
            // Unique ID stored in the metadata so we can match the transcript later
            const uploadId = uuid();
            await uploadBytes(fileRef, file, {
                customMetadata: {
                    uploadId,
                },
            });
            setId(uploadId);
        }
    }, []);

    return <ImageDropzone onDrop={onDrop} />;
}

To complete the setup, there's one more step: installing the "uuid" package. We'll use it to generate a unique ID for each uploaded file and pass it along in the metadata. Later, whenever the Realtime Database changes, it will notify the client and return all the stored data. Since we only want to display the transcript for the newly uploaded file, rather than every previously added transcript, we will use this ID to pick out the required entry, roughly as sketched below.
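
To give you an idea of what that will look like, here is a hedged sketch of the kind of listener we might attach later on. The transcripts node name, its text field, and the exported app are assumptions for illustration; the real implementation comes in part two.

// lib/listenForTranscript.ts (sketch for a later step - names are assumptions)

import { getDatabase, ref as dbRef, onValue, Unsubscribe } from "firebase/database";
import { app } from "@/lib/firebase"; // assumes firebase.ts also exports `app`

export function listenForTranscript(
    uploadId: string,
    onTranscript: (text: string) => void
): Unsubscribe {
    const db = getDatabase(app);
    // Assumed structure: transcripts/<uploadId> -> { text: "..." }
    const transcriptRef = dbRef(db, `transcripts/${uploadId}`);
    // onValue fires immediately and again on every change to this node
    return onValue(transcriptRef, (snapshot) => {
        if (snapshot.exists()) {
            onTranscript(snapshot.val().text);
        }
    });
}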

I hope I've not been overly confusing. In any case, once you see this in action, it should all become clear! To install the package, use the following command:

npm install uuid
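
Depending on which major version of uuid gets installed, TypeScript may also need the separately published type definitions:

npm install --save-dev @types/uuid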

Give it a test run, and you'll see the uploaded video showing up in the bucket! Fabulous! 💫

Conclusion

We now have our basic setup in place. You've learned how to build a drag-and-drop component, set up the Firebase project, and use one of its services - Firebase Cloud Storage.

However, the next blog post is where the real fun begins! I will guide you through the process of combining Firebase Cloud Functions, FFmpeg, and the OpenAI Whisper API to build a complete speech transcription service! 🗣️

If you enjoyed this blog post and would like to see more like it, just let me know! 👍 As usual, if you encounter any problems along the way, you can refer to the complete source code available on my GitHub repository. If you have any questions or need assistance, don't hesitate to reach out to me.

Remember to follow me on Twitter/X, where I share daily updates. And feel free to connect with me on LinkedIn.
