Real-time Digital Human Interaction
NavTalk provides real-time digital human capabilities through a combined WebSocket + WebRTC integration, supporting speech recognition, function calling, and video with synchronized lip-sync. The steps below walk through the complete implementation with key code examples.
Step 1: Establish a WebSocket Real-time Voice Connection
Create a WebSocket connection using your license key and the character you want to display:
const license = "YOUR_LICENSE_KEY";
const characterName = "navtalk.Leo";
const socket = new WebSocket(
  `wss://api.navtalk.ai/api/realtime-api?license=${encodeURIComponent(license)}&characterName=${encodeURIComponent(characterName)}`
);
socket.binaryType = 'arraybuffer';

// Connection event handlers
socket.onopen = () => console.log("WebSocket connection established");
socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    handleJSONMessage(JSON.parse(event.data));  // JSON control events
  } else {
    handleAudioStream(event.data);              // binary audio frames
  }
};
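The two handlers referenced above are yours to define. A minimal sketch, assuming handleReceivedMessage is the event dispatcher from Step 4 and that binary frames carry raw PCM16 audio:
function handleJSONMessage(message) {
  handleReceivedMessage(message); // dispatcher defined in Step 4
}

function handleAudioStream(arrayBuffer) {
  // Binary frames carry raw PCM16 audio; queue them for playback.
  console.log(`Received ${arrayBuffer.byteLength} bytes of audio`);
}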
Step 2: Configure Session Parameters
Send the session configuration after receiving the session.created event:
const sessionConfig = {
  type: "session.update",
  session: {
    instructions: "You are a friendly digital assistant",
    voice: "alloy",
    temperature: 0.7,
    max_response_output_tokens: 1024,
    modalities: ["text", "audio"],
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    input_audio_transcription: { model: "whisper-1" },
    tools: [...] // Optional: function-calling configuration (see the sketch below)
  }
};
socket.send(JSON.stringify(sessionConfig));
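The session fields appear to mirror the OpenAI Realtime API's session.update schema, so a tools entry presumably follows the same function-calling format. A hypothetical get_weather tool, for illustration only:
const tools = [{
  type: "function",
  name: "get_weather",  // hypothetical tool name
  description: "Look up the current weather for a city",
  parameters: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name" }
    },
    required: ["city"]
  }
}];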
Step 3: Capture and Send Audio Stream
Capture microphone audio in real time and stream it to the server:
navigator.mediaDevices.getUserMedia({ audio: true }).then(stream => {
  const audioContext = new AudioContext({ sampleRate: 24000 });
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(8192, 1, 1);

  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0);
    const buffer = floatTo16BitPCM(input);
    const base64Audio = base64EncodeAudio(new Uint8Array(buffer));

    // Send audio chunks
    const chunkSize = 4096;
    for (let i = 0; i < base64Audio.length; i += chunkSize) {
      const chunk = base64Audio.slice(i, i + chunkSize);
      socket.send(JSON.stringify({ type: "input_audio_buffer.append", audio: chunk }));
    }
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
});
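floatTo16BitPCM and base64EncodeAudio are referenced above but left to you to define; a minimal sketch of both:
// Convert Float32 samples (-1..1) to little-endian 16-bit PCM.
function floatTo16BitPCM(float32Array) {
  const buffer = new ArrayBuffer(float32Array.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32Array.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return buffer;
}

// Base64-encode bytes in chunks to avoid call-stack limits on large buffers.
function base64EncodeAudio(uint8Array) {
  let binary = "";
  const step = 0x8000;
  for (let i = 0; i < uint8Array.length; i += step) {
    binary += String.fromCharCode(...uint8Array.subarray(i, i + step));
  }
  return btoa(binary);
}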
Step 4: Handle AI Response Events
Process the event types returned by the server:
function handleReceivedMessage(data) {
  switch (data.type) {
    case "session.created":
      sendSessionUpdate();  // send the session.update payload from Step 2
      break;
    case "session.updated":
      startRecording();     // start the audio capture from Step 3
      break;
    case "response.audio_transcript.delta":
      console.log("AI says:", data.delta.text);
      break;
    case "response.audio.delta":
      // Play the audio chunk in data.delta (see the playback sketch below)
      break;
    case "response.function_call_arguments.done":
      handleFunctionCall(data);
      break;
  }
}
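A minimal playback sketch for response.audio.delta, assuming the chunk is base64-encoded 24 kHz mono PCM16 matching the session's output_audio_format:
const playbackContext = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

function playAudioDelta(base64Chunk) {
  // Decode base64 -> Int16 samples -> Float32 AudioBuffer.
  const bytes = Uint8Array.from(atob(base64Chunk), c => c.charCodeAt(0));
  const samples = new Int16Array(bytes.buffer);
  const audioBuffer = playbackContext.createBuffer(1, samples.length, 24000);
  const channel = audioBuffer.getChannelData(0);
  for (let i = 0; i < samples.length; i++) channel[i] = samples[i] / 0x8000;

  // Schedule chunks back-to-back so consecutive deltas play gaplessly.
  const sourceNode = playbackContext.createBufferSource();
  sourceNode.buffer = audioBuffer;
  sourceNode.connect(playbackContext.destination);
  nextStartTime = Math.max(nextStartTime, playbackContext.currentTime);
  sourceNode.start(nextStartTime);
  nextStartTime += audioBuffer.duration;
}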
Event reference:
session.created - the session was created; send the session configuration immediately.
session.updated - the configuration was applied; you can start sending audio.
response.audio_transcript.delta - incremental transcript of the AI's spoken response.
response.audio.delta - a chunk of AI audio data to play.
response.function_call_arguments.done - a function call is ready to execute.
response.audio.done - the audio for the current response has finished.
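A sketch of the handleFunctionCall referenced in Step 4, assuming NavTalk follows the OpenAI Realtime API's function-calling flow (return the result as a function_call_output item, then request a new response); runLocalTool is a hypothetical dispatcher for your own tools:
function handleFunctionCall(data) {
  const args = JSON.parse(data.arguments);
  const result = runLocalTool(data.name, args); // hypothetical local tool dispatch

  // Return the tool result to the model, then ask it to continue speaking.
  socket.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: data.call_id,
      output: JSON.stringify(result)
    }
  }));
  socket.send(JSON.stringify({ type: "response.create" }));
}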
Step 5: Establish WebRTC Video Connection
Display the digital human's lip-synced video in real time:
HTML Setup:
<video id="avatar-video" autoplay muted playsinline
       style="width: 320px; height: 400px; object-fit: cover;">
</video>
WebRTC Connection:
const peerConnection = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Handle incoming video stream
peerConnection.ontrack = (event) => {
  document.getElementById('avatar-video').srcObject = event.streams[0];
};

// Signaling channel
const signalingSocket = new WebSocket(`wss://api.navtalk.ai/iwebrtc?userId=${license}`);
signalingSocket.onmessage = async (event) => {
  const message = JSON.parse(event.data);
  switch (message.type) {
    case "offer": {
      await peerConnection.setRemoteDescription(message.sdp);
      const answer = await peerConnection.createAnswer();
      await peerConnection.setLocalDescription(answer);
      signalingSocket.send(JSON.stringify({
        type: "answer",
        targetSessionId: message.targetSessionId,
        sdp: answer
      }));
      break;
    }
    case "iceCandidate":
      await peerConnection.addIceCandidate(message.candidate);
      break;
  }
};
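The snippet above only consumes remote ICE candidates. In a typical WebRTC setup you also forward your local candidates over the signaling channel; a sketch, assuming the server accepts the same iceCandidate message shape it sends (verify against the sample project):
peerConnection.onicecandidate = (event) => {
  if (event.candidate) {
    signalingSocket.send(JSON.stringify({
      type: "iceCandidate",
      candidate: event.candidate
    }));
  }
};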
Complete Example
We recommend referring to the official sample project for quick validation: https://github.com/navtalk/Sample
The sample project covers the full feature set:
Audio Capture & Processing - Real-time microphone input and audio stream handling
Digital Human Configuration - Character appearance, voice, and behavior settings
Real-time Video Rendering - WebRTC-based video streaming with lip synchronization
Function Call Integration - Custom tool execution and API interactions
Conversation History - Session recording and historical dialogue management