Real-time Digital Human Interaction

NavTalk provides real-time digital human capabilities through a WebSocket + WebRTC integration, supporting speech recognition, function calling, and lip-synced video. The steps below walk through the full implementation with key code examples.

Step 1: Establish a WebSocket Real-time Voice Connection

Create a WebSocket connection using your license key and the character you want to display:

const license = "YOUR_LICENSE_KEY";
const characterName = "navtalk.Leo";

const socket = new WebSocket(
  `wss://api.navtalk.ai/api/realtime-api?license=${encodeURIComponent(license)}&characterName=${encodeURIComponent(characterName)}`
);
socket.binaryType = 'arraybuffer';

// Connection event handlers
socket.onopen = () => console.log("WebSocket connection established");
socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    handleReceivedMessage(JSON.parse(event.data)); // JSON control events (see Step 4)
  } else {
    handleAudioStream(event.data);                 // Binary audio frames
  }
};
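
Binary frames carry the digital human's audio. Below is a minimal playback sketch for the handleAudioStream helper referenced above, assuming the frames are raw PCM16 mono at 24 kHz (matching the session's pcm16 output format):

let playbackContext = null;
let playbackTime = 0;  // When the next chunk should start playing

function handleAudioStream(arrayBuffer) {
  playbackContext = playbackContext || new AudioContext({ sampleRate: 24000 });

  // Convert 16-bit signed PCM samples to Float32 in [-1, 1]
  const pcm16 = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(pcm16.length);
  for (let i = 0; i < pcm16.length; i++) float32[i] = pcm16[i] / 32768;

  const buffer = playbackContext.createBuffer(1, float32.length, 24000);
  buffer.copyToChannel(float32, 0);

  const sourceNode = playbackContext.createBufferSource();
  sourceNode.buffer = buffer;
  sourceNode.connect(playbackContext.destination);

  // Schedule chunks back-to-back for gapless playback
  playbackTime = Math.max(playbackTime, playbackContext.currentTime);
  sourceNode.start(playbackTime);
  playbackTime += buffer.duration;
}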

Step 2: Configure Session Parameters

Send the session configuration after receiving the session.created event:

const sessionConfig = {
  type: "session.update",
  session: {
    instructions: "You are a friendly digital assistant",
    voice: "alloy",
    temperature: 0.7,
    max_response_output_tokens: 1024,
    modalities: ["text", "audio"],
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    input_audio_transcription: { model: "whisper-1" },
    tools: [...]  // Optional: function-calling tools (see the sketch below)
  }
};
socket.send(JSON.stringify(sessionConfig));
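
If you enable function calling, each entry in the tools array describes one callable function. A minimal sketch, assuming NavTalk follows the OpenAI Realtime tool schema (get_weather is a hypothetical example tool):

const tools = [
  {
    type: "function",
    name: "get_weather",  // Hypothetical example tool
    description: "Look up the current weather for a city",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name" }
      },
      required: ["city"]
    }
  }
];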

Step 3: Capture and Send Audio Stream

Capture microphone audio in real time and stream it to the server:

navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
  // 24 kHz mono to match the session's pcm16 input format
  const audioContext = new AudioContext({ sampleRate: 24000 });
  const source = audioContext.createMediaStreamSource(stream);
  // ScriptProcessorNode is deprecated but widely supported; AudioWorklet is the modern alternative
  const processor = audioContext.createScriptProcessor(8192, 1, 1);

  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0);
    const buffer = floatTo16BitPCM(input);  // Float32 samples -> 16-bit PCM
    const base64Audio = base64EncodeAudio(new Uint8Array(buffer));

    // Send the encoded audio in fixed-size chunks
    const chunkSize = 4096;
    for (let i = 0; i < base64Audio.length; i += chunkSize) {
      const chunk = base64Audio.slice(i, i + chunkSize);
      socket.send(JSON.stringify({ type: "input_audio_buffer.append", audio: chunk }));
    }
  };

  source.connect(processor);
  processor.connect(audioContext.destination); // Keeps the processing graph running
});
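
The floatTo16BitPCM and base64EncodeAudio helpers referenced above are not part of any browser API; a minimal sketch of one possible implementation:

function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range (little-endian)
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return buffer;
}

function base64EncodeAudio(uint8Array) {
  // btoa expects a binary string; build it in chunks to avoid call-stack limits
  let binary = '';
  const chunkSize = 0x8000;
  for (let i = 0; i < uint8Array.length; i += chunkSize) {
    binary += String.fromCharCode(...uint8Array.subarray(i, i + chunkSize));
  }
  return btoa(binary);
}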

Step 4: Handle AI Response Events

Process the various event types sent by the server:

function handleReceivedMessage(data) {
  switch (data.type) {
    case "session.created":
      sendSessionUpdate();  // Send the configuration from Step 2
      break;
    case "session.updated":
      startRecording();     // Start the audio capture from Step 3
      break;
    case "response.audio_transcript.delta":
      console.log("AI says:", data.delta.text);
      break;
    case "response.audio.delta":
      // Play the audio chunk carried in data.delta
      break;
    case "response.function_call_arguments.done":
      handleFunctionCall(data);  // See the sketch after the event table below
      break;
  }
}

Event type reference:

  • session.created - The session was created successfully; send the session configuration immediately.

  • session.updated - The configuration was applied; you can start sending audio.

  • response.audio_transcript.delta - Incremental transcript text of the AI's spoken response.

  • response.audio.delta - A chunk of AI audio data to play.

  • response.function_call_arguments.done - A function call was triggered and its arguments are complete.

  • response.audio.done - The audio response has finished.
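
When response.function_call_arguments.done fires, run the requested tool and return its result to the model. A minimal sketch, assuming NavTalk follows the OpenAI Realtime convention of function_call_output items (executeTool is a hypothetical local dispatcher):

function handleFunctionCall(data) {
  const args = JSON.parse(data.arguments);
  const result = executeTool(data.name, args);  // Hypothetical dispatcher: tool name -> implementation

  // Return the tool result, then ask the model to continue the response
  socket.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: data.call_id,
      output: JSON.stringify(result)
    }
  }));
  socket.send(JSON.stringify({ type: "response.create" }));
}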

Step 5: Establish WebRTC Video Connection

Receive and render the digital human's real-time video stream:

HTML Setup:

<video id="avatar-video" autoplay muted playsinline 
       style="width: 320px; height: 400px; object-fit: cover;">
</video>

WebRTC Connection:

const peerConnection = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Handle incoming video stream
peerConnection.ontrack = (event) => {
  document.getElementById('avatar-video').srcObject = event.streams[0];
};

// Signaling channel for exchanging SDP and ICE candidates
const signalingSocket = new WebSocket(`wss://api.navtalk.ai/iwebrtc?userId=${license}`);
signalingSocket.onmessage = async (event) => {
  const message = JSON.parse(event.data);
  switch (message.type) {
    case "offer": {
      // Apply the server's offer, then create and return an answer
      await peerConnection.setRemoteDescription(message.sdp);
      const answer = await peerConnection.createAnswer();
      await peerConnection.setLocalDescription(answer);
      signalingSocket.send(JSON.stringify({
        type: "answer",
        targetSessionId: message.targetSessionId,
        sdp: answer
      }));
      break;
    }
    case "iceCandidate":
      await peerConnection.addIceCandidate(message.candidate);
      break;
  }
};
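
The handler above only consumes candidates sent by the server. For trickle ICE to work in both directions, the client's own candidates typically need to be forwarded over the same signaling channel. A sketch, assuming the outbound message mirrors the inbound iceCandidate shape:

peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    // Outbound message shape is an assumption mirroring the inbound "iceCandidate" messages
    signalingSocket.send(JSON.stringify({
      type: "iceCandidate",
      candidate: event.candidate
    }));
  }
};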

Complete Example

For quick validation, we recommend starting from the official sample project: https://github.com/navtalk/Sample

The sample covers the complete feature set:

  • Audio Capture & Processing - Real-time microphone input and audio stream handling

  • Digital Human Configuration - Character appearance, voice, and behavior settings

  • Real-time Video Rendering - WebRTC-based video streaming with lip synchronization

  • Function Call Integration - Custom tool execution and API interactions

  • Conversation History - Session recording and historical dialogue management
