An Interview with Vivox's Monty Sharma at OGDC 2007

May 16, 2007 - We met with Monty Sharma in early 2007 at the Game Developers Conference (check out the GDC 2007 Vivox article, if you like) and discovered some very interesting things about Vivox's premier integrated in-game voicechat technology. At the time, Vivox had just been integrated into EVE Online (official site), and we chatted about the advantages of integrating voice. Some surprising statistics have materialized since March, namely that players not using the Vivox technology spend an average of an hour and a half simply organizing their third-party voicechat technology prior to an in-game action.

Voice fonts and the ability to accommodate gender flips were also discussed during our last meeting with the ever-affable Mr. Sharma. His examples of Vivox’s ability to morph voice on the fly had improved since GDC, including the “injected” frequencies necessary to convincingly convert male-to-female voice (female to male is much more straightforward). “It’s like art, it sort of evolves into what you want it to be… it’s a process because the brain is tuned to hear the differences,” Monty explained. He shared a discussion with a biologist who believed that this exact tendency to key in on what’s wrong about sensory input (rather than focus on what’s right, a kind of cerebral pessimism) was part and parcel of predatory and survival instincts. This reinforces why voicechat technology, especially when morphing a voice to approximate the sound of a fairy or a giant, has to be spot-on.

Helping people to trust a complete voicechat solution for a game, free of quality concerns, mechanistic-sounding artifacts, and even poor user-interface integration (such as missing visual cues as to who’s talking), is Vivox’s mission. Monty put it simply: “We want to get it to a point where it doesn’t bug you.” When that happens, voice breaks into the mainstream.


And the mainstream doesn’t know what it’s missing. Monty illustrated how lacking our present online experience is in terms of true collaboration by comparing a visit to a brick-and-mortar bookstore with a visit to a popular online book e-tailer. The latter experience is like looking through a catalogue, while shopping in person allows our inquisitive eyes to roam. “We’re nosey by nature,” Monty said with a smile. Hearing the discussion in a category, and perhaps letting that meld into pleasing background white noise when you converse with others looking over the same product, all plays into that dynamic. Voice could do for internet shopping what MUDs and early MMOs did for tabletop RPGs. While you might presently call a friend or relative to browse around on the Internet, you could soon discuss a product you’re browsing with a like-minded stranger somewhere around the world.

Security and safety are always of concern when voicechat comes to the forefront, being slightly more invasive and lifelike yet still nearly as anonymous as textual communication. Audio is harder to moderate because it’s far more expensive to store than text-based chatlogs, and what is recorded must be reviewed sequentially. In other words, it’s easier to skim through a book than, say, an audiobook. Yet inroads are being made with voice-transcription technologies, which would make massive-scale moderation and complaint-handling far more feasible. Monty cited the real-time transcription technology being used by US forces in Iraq. “The way that works is voice is transcribed into text, the text is translated, then the translated text is turned into audio,” Monty explained. Outside of military budgets, this kind of transcription technology isn’t accurate or fast enough for gaming, quite yet. With constrained subject sets, the recognition rate (or how often the point gets across) is about 80%. With games, where obscure raid tactics, on-the-fly acronyms, and Chuck Norris jokes are standard fare, the subject sets are anything but constrained and the technology isn’t quite mature. But, rest assured, the money, grants, and desire are in place and speech transcription will be here before you know it (and the journalists rejoice).

One side of the Vivox formula that isn’t just coming (it’s practically here) is text-to-speech. Useful for those with high levels of ambient noise (like noisy kids or a summertime city environment in the background) or players uncomfortable with the sound of their voice (perhaps due to a heavy accent), text-to-speech at OGDC was eons better than what we heard even at GDC 2007, even to the point of tonal inflection. Text-to-speech is a critical component of making voice and audio communication the norm in games instead of text chat, and Monty seems confident that it will be implemented in the third quarter of this year.

Positional voicechat, like the proximity-dependent scheme mentioned in the bookstore example above, is one of the newer advances in Vivox technology. With 80+ people in a single EVE Online channel, and short of resorting to the requisite half-duplex solutions (e.g. muting everyone except the leader when he or she speaks), giving voices a sense of position and making them subject to a decay curve would make integrated voicechat not only less chaotic, but more organic as well. Monty described some of the physics behind positional audio and sound decay (k-curves, attenuation distance, and much more), but from what we could pick out, this sort of technology is available for any developers wise and savvy enough to implement it.
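For the curious, here is a rough idea of what a distance-based decay curve looks like in code. This is only a sketch of a generic clamped inverse-distance model of the kind common in 3D audio libraries, not Vivox's actual method; the function names and the parameters ref_distance, max_distance, and rolloff are our own assumptions for illustration.

```python
import math

def distance_gain(distance, ref_distance=1.0, max_distance=50.0, rolloff=1.0):
    """Return a volume multiplier in (0, 1] for a voice at the given distance.

    Generic clamped inverse-distance model (an assumption, not Vivox's curve):
    full volume inside ref_distance, no further decay beyond max_distance.
    """
    d = min(max(distance, ref_distance), max_distance)
    return ref_distance / (ref_distance + rolloff * (d - ref_distance))

def mix_gains(listener, speakers, **kwargs):
    """Map each speaker name to a gain based on its distance from the listener.

    listener is an (x, y, z) tuple; speakers maps a name to an (x, y, z) tuple.
    """
    return {
        name: distance_gain(math.dist(listener, position), **kwargs)
        for name, position in speakers.items()
    }
```

In a crowded channel, the fleet commander standing next to you would come through at full volume while pilots farther away fade toward a murmur, and raising rolloff makes that fade steeper.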

Vivox has always been an advocate of voice fonts as a design area; that is, working from a set of data when morphing voices and attempting to re-create the essence of an appropriate character voice. You and I have different voices, so when Vivox makes us sound like a fairy or an orc, it shouldn’t be simply a matter of pushing each “slider” up or down a set amount for you and that same amount for me. Some sliders will have to be more drastically adjusted for me than for you (like the one labeled “adjust his voice so it’s not the frequency of fluorescent lighting”), and vice versa, in order to find a set of variables that a human brain would associate with not just the look of an avatar, but the size of its chest cavity, throat, palate, and nose. As a parting shot, Monty explained that the crew was busy at work trying to find those vocal qualities which the brain associates with female attractiveness, or “hot-ness” as Monty put it. Assuring us that this was done entirely in the name of research, Monty said Vivox has been mapping the vocal characteristics of beautiful women to home in on what makes a voice “hot.” Setting up a voice font to turn a plain vanilla accent into a cute Southern drawl or some BBC British is out (damn!), since accents primarily alter phonemes, the smallest contrastive elements in the sound system of a language, rather than tinkering with the more mathematically friendly elements of voice. But hotness may still have an algorithmic quality to it, and Monty Sharma and Vivox are definitely on the case… for the betterment of science, of course.
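To make the “slider” point concrete, here is a toy sketch of why two speakers need different adjustments to land on the same character voice. Everything in it is hypothetical: the slider names (pitch_hz, formant_scale), the numbers, and the idea that a voice font is a simple target profile are our own illustrative assumptions, not a description of how Vivox builds voice fonts.

```python
def slider_adjustments(speaker, target):
    """Return how far each slider must move to take a speaker to the target.

    speaker and target are dicts of hypothetical voice measurements. The
    adjustment is per speaker: target minus that speaker's own baseline.
    """
    return {name: target[name] - speaker[name] for name in target}

# Made-up numbers purely for illustration.
fairy = {"pitch_hz": 300, "formant_scale": 1.5}
you = {"pitch_hz": 200, "formant_scale": 1.25}
me = {"pitch_hz": 100, "formant_scale": 1.0}

your_sliders = slider_adjustments(you, fairy)
my_sliders = slider_adjustments(me, fairy)
```

Both of us end up at the same fairy profile, but the deeper voice needs every slider pushed further, which is exactly why one fixed offset for everyone does not work.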

Between voice fonts, positional voicing, text-to-speech, and accurate real-time transcription, the future continues to be bright for Monty Sharma, Vivox, and integrated gaming voicechat. Many thanks to Monty and the Vivox crew for sitting down with us at OGDC 2007!

