Startup Shootaround: Making Sense of Voice Technology

Editor’s note: The NextView team periodically holds internal “shootarounds,” where we discuss a startup topic or trend and try to make sense of it for everyone involved: entrepreneurs, investors, and consumers. Below is a lightly edited transcript of our latest shootaround on voice technology.


ROB GO: Okay, so as you know, I’m passionate about this space, and I think there are three main things to think about around voice.

In no particular order: The first one is the international opportunity, especially in countries where it's actually more laborious to input text via keyboard because of the language and its characters. And that portion of overall internet users is bigger now than ever before.

The second is probably the most crucial, at least in America, and that's changing social barriers. Apple et al. are basically training humans to be comfortable speaking to technology. It's not as weird to say something to a phone or some other tech.

DAVID BEISEL: Do you think it’s also just more socially acceptable?

ROB: Yes. It’s acceptable to just talk to your phone around other people — you don’t feel crazy.

You know, a good example is when we were talking about Snapchat here the other day. I was like, it's so hard to consume snaps because you need audio on — but the younger demo always has audio on. People are just listening to stuff on speakerphone all the time. I think that's a change. In an older demo, to listen, you feel like you have to put on headphones or go to a quiet room.

That’s changed.

And also the tolerance for failure around voice has gone way up. People are using Alexa and Siri and other technologies, and they’re just more willing to try and tolerate a certain level of failure.

So that’s all under this second idea of voice becoming more socially acceptable.

The third factor here is that the technology barrier is coming down. We’re seeing more instances of computers beating humans at various tasks and competitions, and so we’re just getting better at actual AI. Plus, the hardware components are getting better. For instance, with better batteries, it’s a lot easier to support always-on features like audio rather than having to hold down a button to access a voice command. You can do the “OK Google” or “Hey Siri” type commands as one example. Then there are the processing speeds getting faster, and so on.

We’re just closer to a tipping point in terms of the actual tech than we have been previously.

LEE HOWER: My question with voice has been around the social barriers. I don’t think it’s actually, “I’m uncomfortable talking to a machine.” Some might say that, but people will get over that, just as they did typing into a machine. And maybe this is a demo issue and younger generations have a different mentality, but there are times when it’s awkward to use voice regardless of your demo.

For instance, typing can be private. It’s a non-interruptive interface, whereas speech by definition is a public or interruptive form of interface. If I say, “Give me the number for So-and-So Barbershop,” then everyone knows I’m about to get a haircut. It’s not a big deal with that example, but it does place limits on what you’d say out loud rather than type, whereas you’d type just about anything because it’s private.

But just to think about this from all angles, there are definitely other instances, too, where you’ve seen speech interfaces slowly creep in when people are trying to do something else, like drive.

Will speech work perfectly? I think so, and it’s just a matter of when and how, or to what degree it happens this month versus next month versus one year versus five. We’ll get there. But I still wonder about that interruptive-slash-public nature of speech as an interface.

TIM DEVANE: That’s true, but for me, speech as a command is more efficient. When you order something at a restaurant, you’re not typing into a phone — you’re letting everyone hear your order. So at the most base form, speaking a command is the fastest and actually most natural way to do it.

With regard to speech versus text, I think using spoken messages instead of written ones breaks down a lot of social and emotional barriers. With texting and SMS and chat, for instance, there’s less vested interest in the conversation and how the other person feels. It can invite a lot of bad behavior.

So I don’t know that voice commands overtake text messages or chat, especially given the chat apps that have surfaced in the past eight years or so, but maybe there’s a trend developing where it brings more human interaction back to messaging, the way it was when we could only pick up a phone and call someone. It’s easier to argue over text because you’re saying things you wouldn’t say out loud, as one example.

LEE: Agreed, but to be clear, it’s not just text as an SMS medium but literally typing or touch as an input to make something happen or to get something done.

DAVID: Right, there’s typing a text message, which can then become a voice command dictated into text or a voice message sent as audio, and that makes sense. But the challenge with voice more broadly is that you almost need this pre-voice context. With texting, you type and touch your phone, and that evolves easily into voice as the substitute. That’s the context.

But for other things, you need to have an idea of exactly what you’re asking a machine before you say it out loud. For example, being able to touch your finger to a screen and think, “I want to open the Uber app first, then THIS is my current spot, then THIS is my destination” — that’s all a given, and that’s different than saying, “Get me a ride to XYZ destination.” The interface may not know if that means grab you a taxi, an Uber, a friend nearby…

LEE: I think the two things will converge. Humans will get better at interfacing with computers and giving the right commands and vice versa. So it’s like, rather than saying, “I need to get to Fenway Park,” you’ll say, “Get me an Uber to Fenway Park.”

ROB: I think the challenge really is in the input/interface interplay: Does the app get what I’m trying to say, and do I know what to say in the first place? But then can it answer back to me, and how? I think that part is super hard. For instance, I can’t ask Alexa very many factual questions because it’s hard to pose a search question to Alexa, but that’s the sweet spot for Siri. It’s not that one is smarter or dumber; it’s that Siri can answer with five different links with blurbs — it’s on your phone, it’s connected to the internet. But Alexa can only speak back to me — there’s no visual interface — so my input has to be more precise and sort of anticipate the interface’s ability to communicate back to me.

DAVID: Isn’t that more about context again?

ROB: Yeah, to a certain degree, yes.

DAVID: Because with Siri, you can recognize the context of a page that gets spit back. You’ve seen a search result or list of links before.

ROB: Right, but Google has trained us NOT to put tons of context in our search, because we can figure it out on our own based on the first few results. So we’re now doing this in reverse. More context is needed in the search through voice command so the computer can spit back something specific.

LEE: The same is true if you think about voice commands in the car, though. You can try to tell it the address or whatever, and you have to be very specific about it because you’re not interfacing with a display. It’s just voice. Same with Alexa. So I think a lot of the products will wind up as a display AND voice, both in AND out. The power of it is when it really is an interface thing, not just a search query or prompt for an app. It’s the entire way you interact at that point.

And, you know, it would definitely be faster to just say, “Get me an Uber to Fenway,” and not take 20 seconds to click in, set location, set destination, and order the Uber.

Honestly, think about this with the top five to 10 apps you use. There’s a ton you can do with speech. What if you opened Facebook and it dictated to you, “Here are your three friends that have birthdays today. Would you like to write on their wall?”

Email, same thing, you can dictate, and there are already programs that do this.

So in my opinion you have to think about speech as an interface, not just as the super narrow versions like Alexa and Siri. In the short term, you still have to physically prompt your apps. Speech is not yet an always-on thing on these devices or for those apps. By definition, that would mean your phone is recording you and listening to you all the time.

TIM: So are there speech patterns or shortcuts to get the right answer through a voice-enabled app? That would aid the adoption until the entire interface knows us better.

ROB: It’s an interesting product question for AI and voice entrepreneurs. I mean, one way to minimize the bugginess early is to pre-program it with shortcuts like you’re saying — these would work for lots of apps or at least specific, popular apps until a system learns about you.

But another interesting thing this all leads to: Does this take over our use of keyboards? I think there’s a little bit of this voice stuff that ends up as FOUND time — found computing time. My analogy would be that casual games didn’t steal share from hardcore gaming so much as create a new block of time to play games, where previously you needed meaningful time to sit down and play on a console or computer.

LEE: I’m more of the opinion that it’s about ubiquitous computing. It won’t steal massive share from keyboard usage — maybe 10-20% of your keyboard time — but there will still be lots of things where you’re using a keyboard and touch.

ROB: It’s easier to multitask with voice than anything else, though.

LEE: True. But with something like Alexa, it isn’t that you don’t need a keyboard; it’s that it provides enough utility that you can use speech only. With the Echo and Alexa, if you’re sitting in front of a web browser, you can buy from Amazon in the browser, and they don’t care. But Alexa creates that ubiquity because now every time you’re NOT sitting in front of a keyboard or computer, you can still order from Amazon. So it’s about ubiquitous computing and interactions.

DAVID: Right, because I won’t have to say, “Oh, I need more Tide, so let me put it on my list for shopping later.” I can just instantly say through Alexa, “Buy this.” So it’s capturing me in that moment and converting me faster or more.

But I think about the tech versus the consumer. I think it IS a matter of when, not if, voice tech works, but if current companies like Amazon or others experience all kinds of bugginess, consumers are gonna be turned off.

LEE: That’s right. Four or five years ago, we thought haptic feedback would be the next big thing, and all these gesture-based commands and interfaces.

ROB: But a lot of people are using voice happily already. If there are a couple anchor use cases where it actually does work for one product, then you’ll tolerate all the errors.

DAVID: If it actually works, then yes. You’ll learn how to use it just like you learned how to use Google. We aren’t just freeform typing into Google — we’ve learned how to direct it and get the right results, despite its early flaws and the early issues we had finding the right search queries.

LEE: Okay, so let’s talk about starting companies and investment strategy. I think we’re organically seeing more companies come across our inboxes, and we’re taking more meetings. And it’s clear we’re stringing dots together to conclude that something’s happening here.

This is what we see in any trend, by the way. This is how I personally prefer to react to trends, too, rather than say, “Oh, the next big thing is X,” then go boil the ocean to find good companies.

So we’ve seen more companies using voice as an input and output to applications — and that’s been over the last three to four years, by the way.

ROB: That’s right. I actually started getting interested in this thinking about cars and in-car entertainment and cars as found computing time. So maybe three years ago, we looked at a pitch doing a trivia game for an in-car console.

DAVID: Oh yeah, I remember that. It was a game, but the interactive part was all voice.

At a past firm, I saw a founder who was 10 years too early thinking about the UI in the car, too, so this is not an entirely new discussion. It’s starting to tip though.

ROB: Yep, and I don’t think we know what the tipping point is on a technical basis yet, but there’s a blurry line we’re approaching — I think we’re pretty darn close.

TIM: I think the fact that everyone owns and wears earbuds is NOT insignificant here. Everyone has these all the time. For something like Anchor or Stitcher or Overcast or any podcast, as well as straight voice communication, everyone is walking around with them on or with them in their pocket. Without that, I don’t think there’s any of this trend on the consumer side.

(Laughs) And I guess that takes us to the 10-year anniversary of this shootaround, which is going to be — what? — THINK-to-text? Thought-based interfaces?

DAVID: I’m pretty sure it’s Google that has a patent on doing heads-up displays and augmented reality along those lines. It’s coming, don’t worry.

What are your thoughts on voice technology? Leave a comment or tweet the team.