Example of the First Request: Interacting with a Real-time Digital Human
NavTalk offers real-time digital human capabilities based on WebSocket + WebRTC, supporting voice recognition, function calling, and video lip-syncing. Below is the complete integration flow with explanations of the key code.
Overall Process
1. Establish a WebSocket connection (with license + characterName).
2. Configure session parameters (voice, audio format, context, etc.).
3. Capture audio from the browser's microphone, convert it to PCM, and send the audio stream to the server.
4. Receive server responses (text, audio, function calls).
5. Use WebRTC to display the digital human as real-time video.
🔹 Step 1: Establish a WebSocket Real-time Voice Connection
You need to create a WebSocket connection using the license we provide and pass the characterName to select the digital human's appearance.
```javascript
const license = "YOUR_LICENSE_KEY";
const characterName = "girl2";

const socket = new WebSocket(
  `wss://api.navtalk.ai/api/realtime-api?license=${encodeURIComponent(license)}&characterName=${characterName}`
);
socket.binaryType = 'arraybuffer';

socket.onopen = () => {
  console.log("WebSocket connection established successfully.");
};

socket.onmessage = (event) => {
  if (typeof event.data === 'string') {
    const data = JSON.parse(event.data);
    handleReceivedMessage(data);              // Process JSON message
  } else if (event.data instanceof ArrayBuffer) {
    handleReceivedBinaryMessage(event.data);  // Process audio stream
  }
};
```
🔹 Step 2: Configure the Session (session.update)
After session.created is returned, send session.update to configure the AI's behavior style, language model, audio parameters, transcription method, and so on.
Extensible: the tools field supports custom function-calling capabilities.
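The exact session.update schema is not reproduced here, so the sketch below only illustrates the idea: the field names (instructions, voice, input_audio_format, input_audio_transcription, tools) follow common Realtime-API conventions and are assumptions, not NavTalk's confirmed schema. Use the payload from the official Sample project as the reference.

```javascript
// Sketch only: field names are assumptions based on common Realtime-API
// conventions; check the official Sample project for the exact schema.
function sendSessionUpdate() {
  const sessionConfig = {
    type: "session.update",
    session: {
      instructions: "You are a friendly assistant.",      // behavior style / context
      voice: "alloy",                                      // voice selection (example value)
      input_audio_format: "pcm16",                         // matches the PCM16 audio we send
      output_audio_format: "pcm16",
      input_audio_transcription: { model: "whisper-1" },   // transcription method (example)
      tools: []                                            // optional: custom function-calling tools
    }
  };
  socket.send(JSON.stringify(sessionConfig));
}
```

Sending this immediately after session.created (as the handleReceivedMessage excerpt at the end of this guide does) ensures the session is configured before any audio flows.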
🔹 Step 3: Capture and Push User Audio
Access the microphone through the browser, record the user's voice in real time, convert it to PCM16, and send it to the server as base64-encoded audio.
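One way to do this with the Web Audio API is sketched below. The outgoing message type ("input_audio_buffer.append") and the 24 kHz sample rate are assumptions; confirm the exact event name and format against the official Sample project.

```javascript
let audioContext;

async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  audioContext = new AudioContext({ sampleRate: 24000 }); // sample rate is an example value
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);

    // Convert Float32 samples in [-1, 1] to 16-bit PCM.
    const pcm16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }

    // Base64-encode the PCM bytes and push them to the server.
    // The "input_audio_buffer.append" message type is an assumed name.
    const bytes = new Uint8Array(pcm16.buffer);
    let binary = "";
    for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
    socket.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: btoa(binary)
    }));
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}
```

ScriptProcessorNode is deprecated but keeps the sketch short; an AudioWorklet is the preferred option in production code.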
🔹 Step 4: Handle AI Response Events
The platform will return multiple events, mainly including:
| Event Type | Explanation |
| --- | --- |
| session.created | The session was created successfully; send the session configuration (session.update) immediately. |
| session.updated | The configuration has been applied; you can start sending audio. |
| response.audio_transcript.delta | Real-time speech-recognition text (transcript). |
| response.audio.delta | A chunk of AI audio data to play back. |
| response.function_call_arguments.done | A function call has been triggered. |
| response.audio.done | The audio response has finished. |
Example: see the handleReceivedMessage function in the code excerpt at the end of this guide, which dispatches on these event types.
🔹 Step 5: Establish WebRTC Video Stream Connection (Display Digital Human)
WebRTC carries the digital human's real-time expressiveness (lip movement, facial expressions, gaze, etc.), so create the WebRTC video channel at the same time as you establish the WebSocket real-time voice connection.
You will need:
- An HTML <video> tag to which the video stream will be bound.
- Your license (i.e., the userId).
- A target sessionId (which must be associated with the real-time WebSocket session).
1️⃣ Bind the Video Element
Reserve a <video> tag in your HTML to display the digital human's appearance:
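For example (the remoteVideo id is an illustrative choice; any id works as long as the JavaScript below references the same element):

```html
<video id="remoteVideo" autoplay playsinline></video>
```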
Note that the autoplay and playsinline attributes are required for the video to play on mobile browsers.
Then bind the element in JavaScript:
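A minimal sketch (the element id matches the hypothetical one chosen above):

```javascript
// Grab the <video> element reserved in the HTML; the id must match your markup.
const remoteVideo = document.getElementById("remoteVideo");
```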
2️⃣ Establish WebRTC Signaling Connection
Create a WebSocket signaling channel using your license:
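A sketch of the signaling channel is shown below. The signaling URL, query parameters, and message dispatch are placeholders (assumptions); take the exact endpoint and message format from the official Sample project.

```javascript
// Placeholder endpoint: replace with the signaling URL used in the official Sample project.
// sessionId is assumed to be available and associated with the real-time WebSocket session.
const signalingSocket = new WebSocket(
  `wss://api.navtalk.ai/api/webrtc-signaling?license=${encodeURIComponent(license)}&sessionId=${sessionId}`
);

signalingSocket.onopen = () => {
  console.log("Signaling channel established.");
};

signalingSocket.onmessage = (event) => {
  const message = JSON.parse(event.data);
  handleSignalingMessage(message); // dispatch offer / answer / candidate (see below)
};
```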
3️⃣ Receive Offer / Answer / ICE Candidates
The server will return, in sequence:
- Offer (SDP request)
- Answer (SDP response)
- ICE Candidate (addresses for NAT hole punching)
Handle these messages as follows:
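A sketch of the dispatcher follows. The field names ("type", "sdp", "candidate") are assumptions to be aligned with what the signaling server actually sends; the candidate case reuses the handleIceCandidate function shown at the end of this guide, and peerConnectionA is created in the next sub-step.

```javascript
// Sketch only: message field names are assumptions.
async function handleSignalingMessage(message) {
  switch (message.type) {
    case "offer": {
      // The server sent an offer; set it as the remote description and reply with an answer.
      await peerConnectionA.setRemoteDescription(new RTCSessionDescription(message));
      const answer = await peerConnectionA.createAnswer();
      await peerConnectionA.setLocalDescription(answer);
      signalingSocket.send(JSON.stringify({ type: "answer", sdp: answer.sdp }));
      break;
    }
    case "answer": {
      // Used when the client side created the offer instead.
      await peerConnectionA.setRemoteDescription(new RTCSessionDescription(message));
      break;
    }
    case "candidate":
      handleIceCandidate(message); // shown in the excerpt at the end of this guide
      break;
  }
}
```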
4️⃣ Create RTCPeerConnection and Play Video
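A sketch of the peer connection setup (the STUN server is an example value; peerConnectionA matches the name used in handleIceCandidate at the end of this guide, and remoteVideo is the element bound in step 1️⃣):

```javascript
let peerConnectionA;

function createPeerConnection() {
  // The STUN server is an example; try a different one if ICE fails in your network.
  peerConnectionA = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }]
  });

  // Attach the incoming audio/video track to the <video> element bound earlier.
  peerConnectionA.ontrack = (event) => {
    remoteVideo.srcObject = event.streams[0];
    remoteVideo.play().catch(console.error);
  };
}
```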
5️⃣ Receive ICE Reverse Channel
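Candidates flow in both directions: candidates sent by the server are added via handleIceCandidate (see the excerpt at the end of this guide), and locally gathered candidates should be pushed back over the signaling channel. A sketch, assuming the same message shape as above:

```javascript
// Run after createPeerConnection(); the outgoing message shape is an assumption.
peerConnectionA.onicecandidate = (event) => {
  if (event.candidate) {
    signalingSocket.send(JSON.stringify({
      type: "candidate",
      candidate: event.candidate
    }));
  }
};
```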
Common Issues and Debugging Suggestions
| Issue | Suggestion |
| --- | --- |
| No audio is returned | Check that session.update was sent and that the audio format is correct. |
| The video does not display | Check that the WebRTC connection succeeded and that the video DOM element has been bound. |
| The AI does not respond | Check that the audio stream is actually being sent and that its format is correct. |
| ICE failed | Check the network environment and try a different STUN server. |
Complete Example Project
We recommend using the official DEMO project to quickly verify that your connection works: https://github.com/navtalk/Sample. The example covers recording, character selection, video rendering, and function calls, demonstrating the entire flow end to end.
Two key excerpts from the example, corresponding to Steps 4 and 5:

```javascript
function handleReceivedMessage(data) {
    switch (data.type) {
        case "session.created":
            sendSessionUpdate();       // Step 2: send the session configuration
            break;
        case "session.updated":
            startRecording();          // Step 3: start capturing and sending audio
            break;
        case "response.audio_transcript.delta":
            console.log("AI says:", data.delta.text);
            break;
        case "response.audio.delta":
            // Play data.delta audio content
            break;
        case "response.function_call_arguments.done":
            handleFunctionCall(data);  // Trigger your custom function call
            break;
    }
}

// Step 5: add ICE candidates received from the signaling channel.
function handleIceCandidate(message) {
    const candidate = new RTCIceCandidate(message.candidate);
    if (peerConnectionA) {
        peerConnectionA.addIceCandidate(candidate).catch(console.error);
    }
}
```