Simli WebRTC API basics

This is a quick, practical guide to using the Simli WebRTC API. It won’t explain how WebRTC works or go into the details of the audio pipeline; you can check the following articles for that: WebRTC, Audio. In this article, we’re assuming you have a basic understanding of JavaScript, Python, HTTP requests, and how to use APIs. The full code is included at the end of this document.

What’s the Simli API?

The Simli API transforms audio from any source into a talking-head video (a humanoid character that speaks the audio) with realistic motion and low latency. It lets anyone add more life to their chat bots, virtual assistants, or other faceless characters. As an example, this article gives a face to a radio station that provides an audio stream over HTTP; however, the exact same principles apply to any audio source.

The main input to the API is the audio stream. The audio must be in PCM Int16 format with a sampling rate of 16,000 Hz and a single channel. The stream should preferably be sent in chunks of 6,000 bytes; there is no minimum chunk size, and the maximum is 65,536 bytes. The API initiates a WebRTC connection which is handled by the browser (or your WebRTC library of choice), so you don’t have to worry about playback details.
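
For instance, if you already have correctly formatted audio in memory, chunking it for the data channel is straightforward. Here’s a minimal sketch, not part of the demo code: dc stands in for the open RTCDataChannel created later in this article, and pcmBytes for a Uint8Array of 16 kHz mono PCM Int16 audio.

// Sketch: split PCM Int16 audio into chunks for the Simli data channel.
// `dc` (an open RTCDataChannel) and `pcmBytes` (a Uint8Array of
// 16 kHz mono PCM Int16 audio) are assumed inputs, not demo code.
function sendPCMChunks(dc, pcmBytes) {
  const CHUNK_SIZE = 6000; // preferred chunk size; the maximum is 65536 bytes
  for (let offset = 0; offset < pcmBytes.length; offset += CHUNK_SIZE) {
    dc.send(pcmBytes.subarray(offset, offset + CHUNK_SIZE));
  }
}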

The HTML and JavaScript components of this demo are adapted from the aiortc examples.

What do we need to write for this to work?

  • A basic HTML page with video and audio elements (to play back the WebRTC stream).
  • A JavaScript file to handle the WebRTC connection and send the audio stream to the Simli API.
  • A small Python server to decode the MP3 audio stream to PCM Int16 and send it to the WebRTC connection (not mandatory if you have a different way to get a correctly formatted audio stream).
  • An audio stream source (for this example, we’re using a radio station stream).

index.html

The HTML file is pretty simple. It has a video element to display the talking-head video and an audio element to play the audio stream. The media elements are hidden by default and are only shown once you start the connection.

filename="index.html"
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>WebRTC demo</title>
  </head>
  <body>
    <button id="start" onclick="start()">Start</button>
    <button id="stop" style="display: none" onclick="stop()">Stop</button>

    <h2>State</h2>
    <p>ICE gathering state: <span id="ice-gathering-state"></span></p>
    <p>ICE connection state: <span id="ice-connection-state"></span></p>
    <p>Signaling state: <span id="signaling-state"></span></p>

    <div id="media" style="display: none">
      <h2>Media</h2>

      <audio id="audio" autoplay="true"></audio>
      <video id="video" autoplay="true" playsinline="true"></video>
    </div>

    <h2>Data channel</h2>
    <pre id="data-channel" style="height: 200px;"></pre>

    <h2>SDP</h2>

    <h3>Offer</h3>
    <pre id="offer-sdp"></pre>

    <h3>Answer</h3>
    <pre id="answer-sdp"></pre>

    <script src="client.js"></script>
  </body>
</html>

The HTML file doesn’t do anything special: it has buttons to start and stop the connection, some text elements that show the state of the connection, and video and audio elements to play the video and audio streams respectively. It also loads a client.js file, which contains the JavaScript code that handles the WebRTC connection.

client.js

The JavaScript file handles the WebRTC connection, sends the audio stream to the Simli API, and plays back the video and audio it gets in return.

// get DOM elements
var dataChannelLog = document.getElementById("data-channel"),
  iceConnectionLog = document.getElementById("ice-connection-state"),
  iceGatheringLog = document.getElementById("ice-gathering-state"),
  signalingLog = document.getElementById("signaling-state");

This block of code gets the DOM elements that will be used to display the state of the connection and the data channel messages.

// peer connection
var pc = null;

// data channel
var dc = null,
  dcInterval = null;

function createPeerConnection() {
  var config = {
    sdpSemantics: "unified-plan",
  };

  config.iceServers = [{ urls: ["stun:stun.l.google.com:19302"] }];

  pc = new RTCPeerConnection(config);
  // register some listeners to help debugging
  pc.addEventListener(
    "icegatheringstatechange",
    () => {
      iceGatheringLog.textContent += " -> " + pc.iceGatheringState;
    },
    false
  );
  iceGatheringLog.textContent = pc.iceGatheringState;

  pc.addEventListener(
    "iceconnectionstatechange",
    () => {
      iceConnectionLog.textContent += " -> " + pc.iceConnectionState;
    },
    false
  );
  iceConnectionLog.textContent = pc.iceConnectionState;

  pc.addEventListener(
    "signalingstatechange",
    () => {
      signalingLog.textContent += " -> " + pc.signalingState;
    },
    false
  );
  signalingLog.textContent = pc.signalingState;

  // connect audio / video
  pc.addEventListener("track", (evt) => {
    if (evt.track.kind == "video")
      document.getElementById("video").srcObject = evt.streams[0];
    else document.getElementById("audio").srcObject = evt.streams[0];
  });

  pc.onicecandidate = (event) => {
    if (event.candidate === null) {
      console.log(JSON.stringify(pc.localDescription));
    } else {
      console.log(event.candidate);
      candidateCount += 1;
    }
  };

  return pc;
}

This block of code defines the function that creates the peer connection and registers some listeners to display the state of the connection. It also attaches the incoming video and audio tracks to the video and audio elements respectively.

let candidateCount = 0;
let prevCandidateCount = -1;
function CheckIceCandidates() {
  if (
    pc.iceGatheringState === "complete" ||
    candidateCount === prevCandidateCount
  ) {
    console.log(pc.iceGatheringState, candidateCount);
    connectToRemotePeer();
  } else {
    prevCandidateCount = candidateCount;
    setTimeout(CheckIceCandidates, 250);
  }
}

function negotiate() {
  return pc
    .createOffer()
    .then((offer) => {
      return pc.setLocalDescription(offer);
    })
    .then(() => {
      prevCandidateCount = candidateCount;
      setTimeout(CheckIceCandidates, 250);
    });
}

This block of code defines the function that initiates the negotiation process. It creates an offer and sets the local description. It doesn’t block until ICE gathering is finished; instead, it polls every 250 ms and connects to the remote peer once the ICE gathering state is complete or no new candidates have arrived since the last check.
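
If you’d rather not poll, an alternative is to wrap the icegatheringstatechange event in a promise. This is just a sketch of that approach; the demo sticks with the polling version above.

// Sketch of a promise-based alternative to CheckIceCandidates:
// resolves once the browser reports ICE gathering as complete.
function waitForIceGathering(pc) {
  if (pc.iceGatheringState === "complete") {
    return Promise.resolve();
  }
  return new Promise((resolve) => {
    pc.addEventListener("icegatheringstatechange", function onChange() {
      if (pc.iceGatheringState === "complete") {
        pc.removeEventListener("icegatheringstatechange", onChange);
        resolve();
      }
    });
  });
}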

function connectToRemotePeer() {
  var offer = pc.localDescription;
  document.getElementById("offer-sdp").textContent = offer.sdp;

  return fetch("https://api.simli.ai/StartWebRTCSession", {
    body: JSON.stringify({
      sdp: offer.sdp,
      type: offer.type,
    }),
    headers: {
      "Content-Type": "application/json",
    },
    method: "POST",
  })
    .then((response) => {
      return response.json();
    })
    .then((answer) => {
      document.getElementById("answer-sdp").textContent = answer.sdp;
      return pc.setRemoteDescription(answer);
    })
    .then((out) => {
      return out;
    })
    .catch((e) => {
      alert(e);
    });
}

This block of code defines the function that connects to the remote peer. It sends the local description (the offer) to the Simli API, receives the remote description (the answer), and applies it to the peer connection.
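
The demo code assumes the request succeeds. In your own code you may want to check the HTTP status before treating the body as an SDP answer; here’s a minimal sketch (connectToRemotePeerChecked is a hypothetical variant, not part of the demo):

// Hypothetical, more defensive variant of connectToRemotePeer:
// fail loudly if the API returns a non-2xx status.
async function connectToRemotePeerChecked(pc, offer) {
  const response = await fetch("https://api.simli.ai/StartWebRTCSession", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sdp: offer.sdp, type: offer.type }),
  });
  if (!response.ok) {
    throw new Error("StartWebRTCSession returned HTTP " + response.status);
  }
  const answer = await response.json();
  await pc.setRemoteDescription(answer);
}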

function start() {
  document.getElementById("start").style.display = "none";

  pc = createPeerConnection();

  var time_start = null;

  const current_stamp = () => {
    if (time_start === null) {
      time_start = new Date().getTime();
      return 0;
    } else {
      return new Date().getTime() - time_start;
    }
  };

  var parameters = { ordered: true };
  dc = pc.createDataChannel("datachannel", parameters);

  dc.addEventListener("error", (err) => {
    console.error(err);
  });

  dc.addEventListener("close", () => {
    clearInterval(dcInterval);
    dataChannelLog.textContent += "- close\n";
  });

  dc.addEventListener("open", async () => {
    console.log(dc.id);
    const metadata = {
      faceId: "tmp9i8bbq7c", // Simli face ID
      isJPG: false,
      apiKey: "SIMLI_API_KEY", // Simli API key
      syncAudio: true,
    };

    const response = await fetch(
      "https://api.simli.ai/startAudioToVideoSession",
      {
        method: "POST",
        body: JSON.stringify(metadata),
        headers: {
          "Content-Type": "application/json",
        },
      }
    );
    const resJSON = await response.json();
    dc.send(resJSON.session_token);

    dataChannelLog.textContent += "- open\n";
    dcInterval = setInterval(() => {
      var message = "ping " + current_stamp();
      dataChannelLog.textContent += "> " + message + "\n";
      dc.send(message);
    }, 1000);
    // Wait briefly, then send 0.5 s of silence to kick-start the audio (optional).
    await new Promise((resolve) => setTimeout(resolve, 100));
    var message = new Uint8Array(16000); // 16,000 bytes = 0.5 s of 16 kHz mono PCM Int16 silence
    dc.send(message);
    initializeWebsocketDecoder();
  });
  dc.addEventListener("message", (evt) => {
    dataChannelLog.textContent += "< " + evt.data + "\n";

    if (evt.data.substring(0, 4) === "pong") {
      var elapsed_ms = current_stamp() - parseInt(evt.data.substring(5), 10);
      dataChannelLog.textContent += " RTT " + elapsed_ms + " ms\n";
    }
  });

  // Build media constraints.

  const constraints = {
    audio: true,
    video: true,
  };

  // Acquire media and start negotiation.

  document.getElementById("media").style.display = "block";
  navigator.mediaDevices.getUserMedia(constraints).then(
    (stream) => {
      stream.getTracks().forEach((track) => {
        pc.addTrack(track, stream);
      });
      return negotiate();
    },
    (err) => {
      alert("Could not acquire media: " + err);
    }
  );
  document.getElementById("stop").style.display = "inline-block";
}

function stop() {
  document.getElementById("stop").style.display = "none";

  // close data channel
  if (dc) {
    dc.close();
  }

  // close transceivers
  if (pc.getTransceivers) {
    pc.getTransceivers().forEach((transceiver) => {
      if (transceiver.stop) {
        transceiver.stop();
      }
    });
  }

  // close local audio / video
  pc.getSenders().forEach((sender) => {
    if (sender.track) {
      sender.track.stop();
    }
  });

  // close peer connection
  setTimeout(() => {
    pc.close();
  }, 500);
}

This block of code defines the functions that start and stop the connection. The start function creates the peer connection and a data channel, starts an audio-to-video session with the Simli API, and kicks off the negotiation process. The stop function closes the data channel, the transceivers, the local audio and video tracks, and finally the peer connection.


async function initializeWebsocketDecoder() {
  const ws = new WebSocket("ws://localhost:8080/");
  ws.binaryType = "arraybuffer";
  ws.onopen = (event) => {
    console.log("connected");
  };
  // Forward each decoded PCM chunk from the server to the data channel.
  ws.onmessage = (event) => {
    dc.send(event.data);
  };
  // Wait until both the data channel and the websocket are open.
  while (dc.readyState !== "open" || ws.readyState !== WebSocket.OPEN) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  const response = await fetch("https://radio.talksport.com/stream");
  if (!response.ok) {
    throw new Error("Network response was not ok");
  }
  const reader = response.body.getReader();
  function read() {
    return reader.read().then(({ done, value }) => {
      if (done) {
        console.log("Stream complete");
        return;
      }
      // Send the raw MP3 bytes to the server for decoding.
      ws.send(value);
      return read();
    });
  }

  return read();
}

The last block of code defines the function that initializes the websocket decoder. It opens a websocket connection to the Python server, streams the radio station’s MP3 audio to it, and forwards the decoded PCM audio it gets back to the data channel.

And here’s everything together:

filename="client.js"
// get DOM elements
var dataChannelLog = document.getElementById("data-channel"),
  iceConnectionLog = document.getElementById("ice-connection-state"),
  iceGatheringLog = document.getElementById("ice-gathering-state"),
  signalingLog = document.getElementById("signaling-state");

// peer connection
var pc = null;

// data channel
var dc = null,
  dcInterval = null;

function createPeerConnection() {
  var config = {
    sdpSemantics: "unified-plan",
  };

  config.iceServers = [{ urls: ["stun:stun.l.google.com:19302"] }];

  pc = new RTCPeerConnection(config);
  // register some listeners to help debugging
  pc.addEventListener(
    "icegatheringstatechange",
    () => {
      iceGatheringLog.textContent += " -> " + pc.iceGatheringState;
    },
    false
  );
  iceGatheringLog.textContent = pc.iceGatheringState;

  pc.addEventListener(
    "iceconnectionstatechange",
    () => {
      iceConnectionLog.textContent += " -> " + pc.iceConnectionState;
    },
    false
  );
  iceConnectionLog.textContent = pc.iceConnectionState;

  pc.addEventListener(
    "signalingstatechange",
    () => {
      signalingLog.textContent += " -> " + pc.signalingState;
    },
    false
  );
  signalingLog.textContent = pc.signalingState;

  // connect audio / video
  pc.addEventListener("track", (evt) => {
    if (evt.track.kind == "video")
      document.getElementById("video").srcObject = evt.streams[0];
    else document.getElementById("audio").srcObject = evt.streams[0];
  });

  pc.onicecandidate = (event) => {
    if (event.candidate === null) {
      console.log(JSON.stringify(pc.localDescription));
    } else {
      console.log(event.candidate);
      candidateCount += 1;
    }
  };

  return pc;
}

let candidateCount = 0;
let prevCandidateCount = -1;
function CheckIceCandidates() {
  if (
    pc.iceGatheringState === "complete" ||
    candidateCount === prevCandidateCount
  ) {
    console.log(pc.iceGatheringState, candidateCount);
    connectToRemotePeer();
  } else {
    prevCandidateCount = candidateCount;
    setTimeout(CheckIceCandidates, 250);
  }
}

function negotiate() {
  return pc
    .createOffer()
    .then((offer) => {
      return pc.setLocalDescription(offer);
    })
    .then(() => {
      prevCandidateCount = candidateCount;
      setTimeout(CheckIceCandidates, 250);
    });
}

function connectToRemotePeer() {
  var offer = pc.localDescription;
  document.getElementById("offer-sdp").textContent = offer.sdp;

  return fetch("https://api.simli.ai/StartWebRTCSession", {
    body: JSON.stringify({
      sdp: offer.sdp,
      type: offer.type,
    }),
    headers: {
      "Content-Type": "application/json",
    },
    method: "POST",
  })
    .then((response) => {
      return response.json();
    })
    .then((answer) => {
      document.getElementById("answer-sdp").textContent = answer.sdp;
      return pc.setRemoteDescription(answer);
    })
    .then((out) => {
      return out;
    })
    .catch((e) => {
      alert(e);
    });
}

function start() {
  document.getElementById("start").style.display = "none";

  pc = createPeerConnection();

  var time_start = null;

  const current_stamp = () => {
    if (time_start === null) {
      time_start = new Date().getTime();
      return 0;
    } else {
      return new Date().getTime() - time_start;
    }
  };

  var parameters = { ordered: true };
  dc = pc.createDataChannel("datachannel", parameters);

  dc.addEventListener("error", (err) => {
    console.error(err);
  });

  dc.addEventListener("close", () => {
    clearInterval(dcInterval);
    dataChannelLog.textContent += "- close\n";
  });

  dc.addEventListener("open", async () => {
    console.log(dc.id);
    const metadata = {
      faceId: "tmp9i8bbq7c", // Simli face ID
      isJPG: false,
      apiKey: "SIMLI_API_KEY", // Simli API key
      syncAudio: true,
    };

    const response = await fetch(
      "https://api.simli.ai/startAudioToVideoSession",
      {
        method: "POST",
        body: JSON.stringify(metadata),
        headers: {
          "Content-Type": "application/json",
        },
      }
    );
    const resJSON = await response.json();
    dc.send(resJSON.session_token);

    dataChannelLog.textContent += "- open\n";
    dcInterval = setInterval(() => {
      var message = "ping " + current_stamp();
      dataChannelLog.textContent += "> " + message + "\n";
      dc.send(message);
    }, 1000);
    // Wait briefly, then send 0.5 s of silence to kick-start the audio (optional).
    await new Promise((resolve) => setTimeout(resolve, 100));
    var message = new Uint8Array(16000); // 16,000 bytes = 0.5 s of 16 kHz mono PCM Int16 silence
    dc.send(message);
    initializeWebsocketDecoder();
  });
  
  dc.addEventListener("message", (evt) => {
    dataChannelLog.textContent += "< " + evt.data + "\n";

    if (evt.data.substring(0, 4) === "pong") {
      var elapsed_ms = current_stamp() - parseInt(evt.data.substring(5), 10);
      dataChannelLog.textContent += " RTT " + elapsed_ms + " ms\n";
    }
  });

  // Build media constraints.

  const constraints = {
    audio: true,
    video: true,
  };

  // Acquire media and start negotiation.

  document.getElementById("media").style.display = "block";
  navigator.mediaDevices.getUserMedia(constraints).then(
    (stream) => {
      stream.getTracks().forEach((track) => {
        pc.addTrack(track, stream);
      });
      return negotiate();
    },
    (err) => {
      alert("Could not acquire media: " + err);
    }
  );
  document.getElementById("stop").style.display = "inline-block";
}

function stop() {
  document.getElementById("stop").style.display = "none";

  // close data channel
  if (dc) {
    dc.close();
  }

  // close transceivers
  if (pc.getTransceivers) {
    pc.getTransceivers().forEach((transceiver) => {
      if (transceiver.stop) {
        transceiver.stop();
      }
    });
  }

  // close local audio / video
  pc.getSenders().forEach((sender) => {
    if (sender.track) {
      sender.track.stop();
    }
  });

  // close peer connection
  setTimeout(() => {
    pc.close();
  }, 500);
}

async function initializeWebsocketDecoder() {
  const ws = new WebSocket("ws://localhost:8080/");
  ws.binaryType = "arraybuffer";
  ws.onopen = (event) => {
    console.log("connected");
  };
  // Forward each decoded PCM chunk from the server to the data channel.
  ws.onmessage = (event) => {
    dc.send(event.data);
  };
  // Wait until both the data channel and the websocket are open.
  while (dc.readyState !== "open" || ws.readyState !== WebSocket.OPEN) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  const response = await fetch("https://radio.talksport.com/stream");
  if (!response.ok) {
    throw new Error("Network response was not ok");
  }
  const reader = response.body.getReader();
  function read() {
    return reader.read().then(({ done, value }) => {
      if (done) {
        console.log("Stream complete");
        return;
      }
      // Send the raw MP3 bytes to the server for decoding.
      ws.send(value);
      return read();
    });
  }

  return read();
}

server.py

This one is also relatively simple: it receives the MP3 audio stream over a websocket, decodes it to PCM Int16 with ffmpeg, and sends the decoded bytes back.

filename="server.py"
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from starlette.websockets import WebSocketState
import asyncio
import uvicorn

app = FastAPI()


async def GetDecodeOutput(
    websocket: WebSocket, decodeProcess: asyncio.subprocess.Process
):
    while True:
        data = await decodeProcess.stdout.read(6000)
        if not data:
            break
        await websocket.send_bytes(data)


@app.websocket("/")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    try:
        decodeTask = await asyncio.subprocess.create_subprocess_exec(
            *[
                "ffmpeg",
                "-i",
                "pipe:0",
                "-f",
                "s16le",
                "-ar",
                "16000",
                "-ac",
                "1",
                "-acodec",
                "pcm_s16le",
                "-",
            ],
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
        sendTask = asyncio.create_task(GetDecodeOutput(websocket, decodeTask))
        while (
            websocket.client_state == WebSocketState.CONNECTED
            and websocket.application_state == WebSocketState.CONNECTED
        ):
            data = await websocket.receive_bytes()
            decodeTask.stdin.write(data)
            await decodeTask.stdin.drain()
    except WebSocketDisconnect:
        pass
    finally:
        decodeTask.stdin.close()
        await decodeTask.wait()
        await sendTask
        await websocket.close()


if __name__ == "__main__":
    uvicorn.run(app, port=8080)

To run the server, just type python server.py (you’ll need fastapi and uvicorn installed, plus ffmpeg available on your PATH). Then serve index.html and client.js from any static file server and open the page in your browser.

Tada! That’s everything!