WASHINGTON (THE WASHINGTON POST) – OpenAI’s ChatGPT chatbot can now talk with users by voice, the company announced, threatening tech giants Google, Apple and Amazon in the battle to create smarter voice assistants.
ChatGPT can respond to queries from users with one of five “personas,” in a tone that sounds generally more conversational compared with popular voice assistants such as Alexa and Siri. OpenAI said adding voice was a key way for it to get more people to interact with and use ChatGPT.
“That’s our challenge here,” Peter Deng, OpenAI’s vice president of consumer products, said in an interview. “One of the hardest jobs is taking that amazing technology and translating it to the simplicity that the next 300-400 million people are looking for.”
OpenAI’s announcement highlights how Amazon, one of the leaders in voice assistants with Alexa, has in recent months fallen behind the curve in launching new AI tools for the general public. On Monday, the same day as OpenAI’s announcement, Amazon said it had signed a deal to invest up to USD4 billion in another AI start-up, Anthropic. The deal is the largest in the AI space since Microsoft signed its landmark investment in OpenAI at the beginning of the year and reflects how tech giants are placing their bets on hot AI start-ups. Microsoft’s investment in OpenAI, which has led to many product partnerships, has helped it rocket ahead in the AI race.
The developments follow several AI launches last week from such companies as Google, Amazon and OpenAI, a frenzied pace that shows the rush to beat the competition. The companies are trying a variety of approaches to getting people to use – and pay for – the bots, and putting them in existing speakers is one of the key avenues they are exploring. Last week, Amazon announced it was adding a chatbot “conversation” feature to its Alexa home speakers, which are set up in millions of homes. Over the summer, Google told staff that it was looking at putting the tech behind its Bard chatbot into its own voice assistants.
Up to now, people could ask ChatGPT questions by speaking them out loud on its mobile app, but the bot would respond only with text. OpenAI also said people can now upload images as part of their questions to the bot, such as showing a photo of the ingredients in a fridge and asking ChatGPT to come up with recipe suggestions. Adding voice and image capabilities also moves ChatGPT further toward becoming a true “multimodal” model – a chatbot that can “see” and “hear” the world, and respond with voice and images, in addition to being fed text-only prompts. AI researchers and analysts say multimodal models are the next stage of competition in the industry, and companies are racing to create the most capable one.
Voice assistants have been in cars, smartphones, TVs and home speakers for years, with millions of people using them daily. But for the most part, their use is confined to a small set of rote interactions, such as being asked to turn off the lights or give a weather report. The “large language model” technology behind chatbots opens up the possibility that voice assistants could become much more capable of having longer, natural conversations and answering more complex questions.
Investors and analysts have accused Amazon of reacting sluggishly to the competition in “generative” AI, such as chatbots and image generators, and the Anthropic deal will give the company access to the start-up’s researchers and technology. Anthropic was founded by former OpenAI employees and had previously taken investment from Google.
OpenAI set off the chatbot boom in November when it made ChatGPT public. Since then, the tech giants have scrambled to develop their own chatbots, with Microsoft partnering with OpenAI to use its tech and Google putting out its Bard chatbot.
AI researchers have warned that people are likely to anthropomorphize chatbots, especially since their answers usually seem humanlike. That could give users a false sense of trust in the bot’s intelligence or capabilities. All chatbots still routinely make up information and pass it off as real, a problem that AI researchers refer to as “hallucinating.”
The new personas for ChatGPT are named Sky, Ember, Breeze, Juniper and Cove. Each of the personas has a different tone and accent. “Sky” sounds somewhat similar to Scarlett Johansson, the actor who voiced the AI that Joaquin Phoenix’s character falls in love with in the movie “Her.” Deng, the OpenAI executive, said the voice personas were not meant to sound like any specific person.
In a demo, he showed how the bot could understand rambling and open-ended voice questions. Because users no longer have to think about how exactly to phrase a question, the new features make conversations easier and more free-flowing, he said.
“With this feature, you can just talk,” Deng said. “My kids now request bedtime stories by ChatGPT.”
OpenAI tested the voice and image features and added guardrails to ensure the bot responds appropriately to sensitive topics, such as suggesting the user consult a professional if they ask questions related to mental health, Sandhini Agarwal, a policy researcher at OpenAI, said in an interview. There will be more to do, though, she said. “The work isn’t ending tomorrow.”