Real-time Digital Human Interaction

NavTalk provides real-time digital human capabilities through WebSocket + WebRTC integration, supporting speech recognition, function calling, and synchronized video lip-sync. Below is the complete implementation process with key code examples.

Step 1: Establish a WebSocket Real-time Voice Connection

Create a WebSocket connection using your license key and the name of the character you want to use:

const license = "YOUR_LICENSE_KEY";
const characterName = "navtalk.Leo";

const socket = new WebSocket(
  `wss://api.navtalk.ai/api/realtime-api?license=${encodeURIComponent(license)}&characterName=${encodeURIComponent(characterName)}`
);
socket.binaryType = 'arraybuffer';

// Connection event handlers
socket.onopen = () => console.log("WebSocket connection established");
socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    handleJSONMessage(JSON.parse(event.data));
  } else {
    handleAudioStream(event.data);
  }
};
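
The handler above passes binary frames to handleAudioStream, which the snippet does not define. A minimal sketch of one possible implementation follows. It assumes binary frames carry raw 16-bit little-endian mono PCM at 24 kHz; that format, the sample rate, and the helper names pcm16ToFloat32 and createAudioPlayer are assumptions for illustration, not confirmed NavTalk behavior.

```javascript
// Convert raw PCM16 (little-endian) bytes into Web Audio float samples in [-1, 1).
// ASSUMPTION: the server's binary frames are PCM16 mono.
function pcm16ToFloat32(buffer) {
  const view = new DataView(buffer);
  const out = new Float32Array(Math.floor(buffer.byteLength / 2));
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return out;
}

// Hypothetical factory: returns a handleAudioStream function that schedules
// each chunk right after the previous one so playback has no gaps.
// ASSUMPTION: 24000 Hz sample rate.
function createAudioPlayer(audioContext, sampleRate = 24000) {
  let nextStartTime = 0;
  return function handleAudioStream(buffer) {
    const samples = pcm16ToFloat32(buffer);
    if (samples.length === 0) return;
    const audioBuffer = audioContext.createBuffer(1, samples.length, sampleRate);
    audioBuffer.copyToChannel(samples, 0);
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);
    // Never schedule in the past; otherwise queue behind the previous chunk.
    nextStartTime = Math.max(nextStartTime, audioContext.currentTime);
    source.start(nextStartTime);
    nextStartTime += audioBuffer.duration;
  };
}
```

In a browser, `const handleAudioStream = createAudioPlayer(new AudioContext());` would supply the function the WebSocket handler expects.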

Step 2: Configure Session Parameters

Send the session configuration after receiving the session.created event:
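
The sketch below shows one way to build and send that configuration message. The "session.update" message type and every field inside session (instructions, voice, audio formats, turn_detection, tools) are assumptions modeled on common realtime voice APIs; check the NavTalk reference for the actual schema. The helper names buildSessionUpdate and sendSessionConfig are hypothetical.

```javascript
// Hypothetical helper: builds the configuration message.
// ASSUMPTION: the message type and all session fields are illustrative only.
function buildSessionUpdate({ instructions = "", voice = "alloy", tools = [] } = {}) {
  return {
    type: "session.update",
    session: {
      instructions,
      voice,
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: { type: "server_vad" },
      tools,
    },
  };
}

// Serializes the configuration and sends it over an open WebSocket.
function sendSessionConfig(socket, options) {
  socket.send(JSON.stringify(buildSessionUpdate(options)));
}
```

A call such as `sendSessionConfig(socket, { instructions: "You are a helpful assistant." })` would typically go in the branch of handleJSONMessage that handles session.created.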

Step 3: Capture and Send Audio Stream

Capture microphone audio and stream it to the server in real time:
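
A minimal sketch of the capture path follows. It assumes the server accepts base64-encoded PCM16 audio in a JSON message of type "input_audio_buffer.append" at a 24 kHz sample rate; the message type, the audio field, and the sample rate are assumptions, and the helper names are hypothetical. ScriptProcessorNode is used for brevity even though it is deprecated; an AudioWorklet is the usual replacement in production code.

```javascript
// Convert float samples in [-1, 1] to 16-bit signed integers, clamping overshoot.
function floatTo16BitPCM(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Base64-encode a Uint8Array in chunks to avoid exceeding argument limits.
function toBase64(bytes) {
  let binary = "";
  const chunk = 0x8000;
  for (let i = 0; i < bytes.length; i += chunk) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunk));
  }
  return btoa(binary);
}

// ASSUMPTION: message type and field name are illustrative only.
function buildAudioAppend(samples) {
  const pcm = floatTo16BitPCM(samples);
  return { type: "input_audio_buffer.append", audio: toBase64(new Uint8Array(pcm.buffer)) };
}

// Browser-only wiring: microphone -> PCM16 -> WebSocket. Returns a stop function.
async function startAudioCapture(socket) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext({ sampleRate: 24000 }); // assumed rate
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    if (socket.readyState !== WebSocket.OPEN) return;
    socket.send(JSON.stringify(buildAudioAppend(event.inputBuffer.getChannelData(0))));
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
  return () => {
    processor.disconnect();
    source.disconnect();
    stream.getTracks().forEach((track) => track.stop());
    return audioContext.close();
  };
}
```

Per the event table in Step 4, capture should start only after session.updated arrives.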

Step 4: Handle AI Response Events

Process the event types sent by the server:

| Event Type | Explanation |
| --- | --- |
| session.created | The session was created successfully; send the configuration immediately. |
| session.updated | The configuration was accepted; you can start sending audio. |
| response.audio_transcript.delta | Speech recognition text, returned incrementally in real time. |
| response.audio.delta | AI audio data to be played. |
| response.function_call_arguments.done | Triggers a function call. |
| response.audio.done | The audio response has finished. |
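
The event types listed above can be routed with a small dispatcher, sketched below. Only the event type strings come from this page; the callback names and the createJSONMessageHandler helper are hypothetical, and each callback receives the whole parsed message because the payload fields are not documented here.

```javascript
// Maps each documented event type to a (hypothetical) callback name.
const EVENT_CALLBACKS = new Map([
  ["session.created", "onSessionCreated"],
  ["session.updated", "onSessionUpdated"],
  ["response.audio_transcript.delta", "onTranscriptDelta"],
  ["response.audio.delta", "onAudioDelta"],
  ["response.function_call_arguments.done", "onFunctionCall"],
  ["response.audio.done", "onAudioDone"],
]);

// Returns a handleJSONMessage function that forwards each parsed message to the
// matching callback. Returns true when a callback handled the message.
function createJSONMessageHandler(callbacks) {
  return function handleJSONMessage(message) {
    const name = EVENT_CALLBACKS.get(message.type);
    const callback = name ? callbacks[name] : undefined;
    if (typeof callback === "function") {
      callback(message);
      return true;
    }
    // Unknown or unhandled event types go to an optional fallback.
    if (typeof callbacks.onUnhandled === "function") callbacks.onUnhandled(message);
    return false;
  };
}
```

For example, `const handleJSONMessage = createJSONMessageHandler({ onSessionCreated: () => { /* send configuration */ } });` produces the function that the Step 1 onmessage handler calls.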

Step 5: Establish WebRTC Video Connection

Display the digital human's video in real time:

HTML Setup:
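
The page needs a video element to render the incoming stream. As a sketch, an equivalent element can be created from script; the element id "character-video" and the helper name createCharacterVideo are assumptions for illustration.

```javascript
// Creates a <video> element suitable for a live WebRTC stream.
// ASSUMPTION: the id is arbitrary; use whatever your page layout expects.
function createCharacterVideo(doc = globalThis.document, id = "character-video") {
  const video = doc.createElement("video");
  video.id = id;
  video.autoplay = true; // start rendering as soon as a stream is attached
  video.playsInline = true; // keep playback inline on mobile Safari
  return video;
}
```

In a browser, `document.body.appendChild(createCharacterVideo());` adds the element to the page.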

WebRTC Connection:
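
A minimal sketch of the peer connection follows, using the standard RTCPeerConnection API. The signaling flow is an assumption: it supposes the server sends an offer and ICE candidates as JSON messages shaped like { type: "offer", description } and { type: "candidate", candidate }, and that sendSignal delivers replies over whatever signaling channel NavTalk provides. Those message shapes, sendSignal, and createVideoPeer are hypothetical; consult the official sample for the actual protocol.

```javascript
// Creates a peer connection that renders the remote stream in videoElement.
// PeerConnection is injectable so the logic can be exercised outside a browser.
function createVideoPeer({
  videoElement,
  sendSignal,
  iceServers = [],
  PeerConnection = globalThis.RTCPeerConnection,
}) {
  const pc = new PeerConnection({ iceServers });

  // Attach the first remote stream to the <video> element.
  pc.ontrack = (event) => {
    videoElement.srcObject = event.streams[0];
  };

  // Forward local ICE candidates to the remote side.
  pc.onicecandidate = (event) => {
    if (event.candidate) sendSignal({ type: "candidate", candidate: event.candidate });
  };

  // ASSUMPTION: the server is the offerer and uses these message shapes.
  async function handleSignal(message) {
    if (message.type === "offer") {
      await pc.setRemoteDescription(message.description);
      const answer = await pc.createAnswer();
      await pc.setLocalDescription(answer);
      sendSignal({ type: "answer", description: pc.localDescription });
    } else if (message.type === "candidate") {
      await pc.addIceCandidate(message.candidate);
    }
  }

  return { pc, handleSignal };
}
```

Calling `pc.close()` when the conversation ends releases the connection.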

Complete Example

We recommend referring to the official sample project for quick validation: https://github.com/navtalk/Sample

The example includes complete functionality:

  • Audio Capture & Processing - Real-time microphone input and audio stream handling

  • Digital Human Configuration - Character appearance, voice, and behavior settings

  • Real-time Video Rendering - WebRTC-based video streaming with lip synchronization

  • Function Call Integration - Custom tool execution and API interactions

  • Conversation History - Session recording and historical dialogue management
