Example of the First Request: Interacting with a Real-time Digital Human
NavTalk offers real-time digital human capabilities based on WebSocket + WebRTC, supporting voice recognition, function calling, and video lip-syncing. Below is the complete integration flow with explanations of the key code.
Overall Process
1. Establish a WebSocket connection (with license + characterName).
2. Configure session parameters (voice, audio format, context, etc.).
3. Capture audio from the browser's microphone, convert it to PCM, and send the audio stream to the server.
4. Receive server responses (text, audio, function calls).
5. Use WebRTC to display the digital human as real-time video.
🔹 Step 1: Establish a WebSocket Real-time Voice Connection
You need to create a WebSocket connection using the license we provide and pass the characterName to select the digital human's appearance.
```javascript
const license = "YOUR_LICENSE_KEY";
const characterName = "girl2";

const socket = new WebSocket(
  `wss://api.navtalk.ai/api/realtime-api?license=${encodeURIComponent(license)}&characterName=${characterName}`
);
socket.binaryType = 'arraybuffer';

socket.onopen = () => {
  console.log("WebSocket connection established successfully.");
};

socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    const data = JSON.parse(event.data);
    handleReceivedMessage(data);              // Process JSON message
  } else if (event.data instanceof ArrayBuffer) {
    handleReceivedBinaryMessage(event.data);  // Process audio stream
  }
};
```
🔹 Step 2: Configure the Session (session.update)
After session.created is returned, send session.update to configure the AI's behavior style, language model, audio parameters, transcription method, and so on.
Extensible: the tools field supports custom function-calling capabilities.
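The exact session.update schema is not reproduced here, so the sketch below only illustrates the idea: the field names (instructions, voice, input_audio_format, input_audio_transcription, tools) follow common Realtime-API conventions and are assumptions, not NavTalk's confirmed schema. Use the payload from the official Sample project as the reference.

```javascript
// Sketch only: field names are assumptions based on common Realtime-API
// conventions; check the official Sample project for the exact schema.
function sendSessionUpdate() {
  const sessionConfig = {
    type: "session.update",
    session: {
      instructions: "You are a friendly assistant.",      // behavior style / context
      voice: "alloy",                                      // voice selection (example value)
      input_audio_format: "pcm16",                         // matches the PCM16 audio we send
      output_audio_format: "pcm16",
      input_audio_transcription: { model: "whisper-1" },   // transcription method (example)
      tools: []                                            // optional: custom function-calling tools
    }
  };
  socket.send(JSON.stringify(sessionConfig));
}
```

Sending this immediately after session.created (as the handleReceivedMessage excerpt at the end of this guide does) ensures the session is configured before any audio flows.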
🔹 Step 3: Capture and Push User Audio
Access the microphone through the browser, record the user's voice in real time, convert it to PCM16, and send it to the server as base64-encoded audio.
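One way to do this with the Web Audio API is sketched below. The outgoing message type ("input_audio_buffer.append") and the 24 kHz sample rate are assumptions; confirm the exact event name and format against the official Sample project.

```javascript
let audioContext;

async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  audioContext = new AudioContext({ sampleRate: 24000 }); // sample rate is an example value
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);

    // Convert Float32 samples in [-1, 1] to 16-bit PCM.
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }

    // Base64-encode the PCM bytes and push them to the server.
    // The "input_audio_buffer.append" message type is an assumed name.
    const bytes = new Uint8Array(pcm16.buffer);
    let binary = "";
    for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
    socket.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: btoa(binary)
    }));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}
```

ScriptProcessorNode is deprecated but keeps the sketch short; an AudioWorklet is the preferred option in production code.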
🔹 Step 4: Handle AI Response Events
The platform will return multiple events, mainly including:
| Event Type | Explanation |
| --- | --- |
| session.created | The session was created successfully; send the session configuration (session.update) immediately. |
| session.updated | The configuration has been applied; you can start sending audio. |
| response.audio_transcript.delta | Real-time speech-recognition text (transcript). |
| response.audio.delta | A chunk of AI audio data to play back. |
| response.function_call_arguments.done | A function call has been triggered. |
| response.audio.done | The audio response has finished. |
Example: see the handleReceivedMessage function in the code excerpt at the end of this guide, which dispatches on these event types.
🔹 Step 5: Establish WebRTC Video Stream Connection (Display Digital Human)
WebRTC carries the digital human's real-time expressiveness (lip movement, facial expressions, gaze, etc.), so create the WebRTC video channel at the same time as you establish the WebSocket real-time voice connection.
You will need:
- An HTML <video> tag to which the video stream will be bound.
- Your license (i.e., the userId).
- A target sessionId (which must be associated with the real-time WebSocket session).
1️⃣ Bind the Video Element
Reserve a <video> tag in your HTML to display the digital human's appearance:
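For example (the remoteVideo id is an illustrative choice; any id works as long as the JavaScript below references the same element):

```html
<video id="remoteVideo" autoplay playsinline></video>
```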
Note that the autoplay and playsinline attributes are required for the video to play on mobile browsers.
Then bind the element in JavaScript:
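A minimal sketch (the element id matches the hypothetical one chosen above):

```javascript
// Grab the <video> element reserved in the HTML; the id must match your markup.
const remoteVideo = document.getElementById("remoteVideo");
```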
2️⃣ Establish WebRTC Signaling Connection
Create a WebSocket signaling channel using your license:
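A sketch of the signaling channel is shown below. The signaling URL, query parameters, and message dispatch are placeholders (assumptions); take the exact endpoint and message format from the official Sample project.

```javascript
// Placeholder endpoint: replace with the signaling URL used in the official Sample project.
// sessionId is assumed to be available and associated with the real-time WebSocket session.
const signalingSocket = new WebSocket(
  `wss://api.navtalk.ai/api/webrtc-signaling?license=${encodeURIComponent(license)}&sessionId=${sessionId}`
);

signalingSocket.onopen = () => {
  console.log("Signaling channel established.");
};

signalingSocket.onmessage = (event) => {
  const message = JSON.parse(event.data);
  handleSignalingMessage(message); // dispatch offer / answer / candidate (see below)
};
```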
3️⃣ Receive Offer / Answer / ICE Candidates
The server will return, in sequence:
- Offer (SDP request)
- Answer (SDP response)
- ICE Candidate (addresses for NAT hole punching)
Handle these messages as follows:
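A sketch of the dispatcher follows. The field names ("type", "sdp", "candidate") are assumptions to be aligned with what the signaling server actually sends; the candidate case reuses the handleIceCandidate function shown at the end of this guide, and peerConnectionA is created in the next sub-step.

```javascript
// Sketch only: message field names are assumptions.
async function handleSignalingMessage(message) {
  switch (message.type) {
    case "offer": {
      // The server sent an offer; set it as the remote description and reply with an answer.
      await peerConnectionA.setRemoteDescription(new RTCSessionDescription(message));
      const answer = await peerConnectionA.createAnswer();
      await peerConnectionA.setLocalDescription(answer);
      signalingSocket.send(JSON.stringify({ type: "answer", sdp: answer.sdp }));
      break;
    }
    case "answer": {
      // Used when the client side created the offer instead.
      await peerConnectionA.setRemoteDescription(new RTCSessionDescription(message));
      break;
    }
    case "candidate":
      handleIceCandidate(message); // shown in the excerpt at the end of this guide
      break;
  }
}
```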
4️⃣ Create RTCPeerConnection and Play Video
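A sketch of the peer connection setup (the STUN server is an example value; peerConnectionA matches the name used in handleIceCandidate at the end of this guide, and remoteVideo is the element bound in step 1️⃣):

```javascript
let peerConnectionA;

function createPeerConnection() {
  // The STUN server is an example; try a different one if ICE fails in your network.
  peerConnectionA = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }]
  });

  // Attach the incoming audio/video track to the <video> element bound earlier.
  peerConnectionA.ontrack = (event) => {
    remoteVideo.srcObject = event.streams[0];
    remoteVideo.play().catch(console.error);
  };
}
```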
5️⃣ Receive ICE Reverse Channel
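Candidates flow in both directions: candidates sent by the server are added via handleIceCandidate (see the excerpt at the end of this guide), and locally gathered candidates should be pushed back over the signaling channel. A sketch, assuming the same message shape as above:

```javascript
// Run after createPeerConnection(); the outgoing message shape is an assumption.
peerConnectionA.onicecandidate = (event) => {
  if (event.candidate) {
    signalingSocket.send(JSON.stringify({
      type: "candidate",
      candidate: event.candidate
    }));
  }
};
```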
Common Issues and Debugging Suggestions
| Issue | Suggestion |
| --- | --- |
| No audio is returned | Check that session.update was sent and that the audio format is correct. |
| The video does not display | Check that the WebRTC connection succeeded and that the video DOM element has been bound. |
| The AI does not respond | Check that the audio stream is actually being sent and that its format is correct. |
| ICE failed | Check the network environment and try a different STUN server. |
Complete Example Project
We recommend using the official DEMO project to quickly verify that your connection works: https://github.com/navtalk/Sample. The example covers recording, character selection, video rendering, and function calls, demonstrating the entire flow end to end.
Two key excerpts from the example, corresponding to Steps 4 and 5:

```javascript
function handleReceivedMessage(data) {
    switch (data.type) {
        case "session.created":
            sendSessionUpdate();       // Step 2: send the session configuration
            break;
        case "session.updated":
            startRecording();          // Step 3: start capturing and sending audio
            break;
        case "response.audio_transcript.delta":
            console.log("AI says:", data.delta.text);
            break;
        case "response.audio.delta":
            // Play data.delta audio content
            break;
        case "response.function_call_arguments.done":
            handleFunctionCall(data);  // Trigger your custom function call
            break;
    }
}

// Step 5: add ICE candidates received from the signaling channel.
function handleIceCandidate(message) {
    const candidate = new RTCIceCandidate(message.candidate);
    if (peerConnectionA) {
        peerConnectionA.addIceCandidate(candidate).catch(console.error);
    }
}
```