|
November 27, 2002
Shinjuku-Mitsui Building 31F
2-1-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo, 163-0431
http://www.ntt-at.com/
NTT-AT to launch "LiveChao", a high-quality speech synthesis system that reads text aloud accurately in a voice approaching that of a real human.
| NTT Advanced Technology Corporation (hereafter NTT-AT; headquartered in Shinjuku-ku, Tokyo, Japan; President: Kimio Tazaki) will begin selling "LiveChao", a high-quality speech synthesis system that reads text aloud accurately, in February 2003. |
|
"LiveChao" is a speech synthesis system based on CHATR, a speech synthesis technology developed at the Advanced Telecommunications Research Institute International (hereafter ATR; headquartered in Seika-cho, Soraku-gun, Kyoto, Japan; President: Nobuyoshi Fugono), with improvements added by NTT-AT. It also uses -Tagger*1, a high-speed, high-accuracy morphological analysis engine, to analyze text. As a result, "LiveChao" can synthesize natural-sounding speech at a quality approaching that of a real human voice. "LiveChao" generates a speech database from a voice recording (a minimum of one hour of speech). Unlike a system that edits pre-recorded speech clips, it does not require voices to be re-recorded when the spoken messages are updated. It is therefore suitable for high-quality audio response systems in which messages are updated frequently, for audio services such as reading a variety of Web texts aloud and audio portal sites, and for voice chat systems.
NTT-AT plans to sell the basic system for one million yen and aims to sell 100 systems in the first year.
[Background and history of development]
Speech synthesis technology for reading text aloud is already mature enough for practical use. In fact, text readout software is bundled with some PCs. However, conventional synthesis technology uses a rule-based method that combines clips of phonemes or syllables to read out arbitrary text. The resulting voice is not natural, and sounds mechanical and monotonous. Another method, used in audio response and other services, combines pre-recorded voice clips of words or phrases. Its drawback is that it applies only where the types of message are fixed. It is not suitable for reading out arbitrary text such as that found on websites.
To solve these problems, ATR developed CHATR, a generic speech synthesis technology that selects optimal waveforms from the phonemes in a speech synthesis database and synthesizes continuous speech while retaining the speech quality of those waveforms. The technology has made it possible to synthesize speech whose quality approaches that of human speech, although some issues remain, such as the difficulty of making the optimal selection of phonemes and the occasional hard-to-comprehend passage.
Building on CHATR, NTT-AT has produced a continuous waveform linking method that can join long strings of phonemes, covering units of speech ranging from a single word to a whole sentence. Furthermore, the text analysis part uses -Tagger, a high-speed, high-accuracy Japanese morphological analysis engine sold by NTT-AT, so that text analysis is based on a large dictionary containing about 350 thousand words. As a result, the accuracy in the choice of pronunciation and stress for each Kanji character has improved dramatically.
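The waveform selection described above can be illustrated with a minimal sketch. It is not CHATR's or LiveChao's actual algorithm, which this announcement does not specify. It assumes a common textbook formulation in which each candidate unit has a target cost and each pair of adjacent units has a join cost, and it finds the cheapest sequence by dynamic programming. The `Unit` fields, both cost functions, and the penalty value of 10.0 are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded phoneme in a (hypothetical) speech database."""
    phoneme: str
    pitch_hz: float
    source_index: int  # position of the unit in the original recording


def target_cost(unit, desired_pitch_hz):
    # Assumed cost: how far the recorded pitch is from the desired pitch.
    return abs(unit.pitch_hz - desired_pitch_hz)


def join_cost(prev, nxt):
    # Assumed cost: units that were adjacent in the recording join for free,
    # so long contiguous stretches of recorded speech are preferred.
    if nxt.source_index == prev.source_index + 1:
        return 0.0
    return 10.0 + abs(prev.pitch_hz - nxt.pitch_hz)


def select_units(targets, database):
    """Return the minimum-cost unit sequence for a list of
    (phoneme, desired_pitch_hz) targets."""
    if not targets:
        return []
    # Candidate units for each target position.
    layers = []
    for phoneme, _ in targets:
        candidates = [u for u in database if u.phoneme == phoneme]
        if not candidates:
            raise ValueError(f"no recorded unit for phoneme {phoneme!r}")
        layers.append(candidates)

    # costs[j] = cheapest total cost of any path ending at layers[i][j].
    costs = [target_cost(u, targets[0][1]) for u in layers[0]]
    back = []  # back[i - 1][j] = best predecessor index for layers[i][j]
    for i in range(1, len(layers)):
        new_costs, pointers = [], []
        for u in layers[i]:
            prev_layer = layers[i - 1]
            best_j = min(
                range(len(prev_layer)),
                key=lambda j: costs[j] + join_cost(prev_layer[j], u),
            )
            total = (costs[best_j] + join_cost(prev_layer[best_j], u)
                     + target_cost(u, targets[i][1]))
            new_costs.append(total)
            pointers.append(best_j)
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest path backwards from the final layer.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [layers[i][j] for i, j in enumerate(path)]


database = [
    Unit("k", 120.0, 0), Unit("a", 118.0, 1),
    Unit("k", 200.0, 5), Unit("a", 120.0, 9),
]
chosen = select_units([("k", 120.0), ("a", 120.0)], database)
print([u.source_index for u in chosen])
```

In this toy database the second "a" matches the desired pitch exactly, but the first "a" directly follows the chosen "k" in the recording, so the sketch prefers the contiguous pair. This is the sense in which joining longer recorded stretches can preserve the quality of the original waveforms.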
"LiveChao" is suitable for information services such as weather forecasts and stock market information services, in which information is updated frequently, for high-quality audio response systems, and for audio services, including services for reading aloud a variety of Web texts, audio portal sites, and voice chat systems. In the future, it will be possible for it to be built into welfare-related systems to enable a patient who is in danger of losing their voice to record his or her voice in advance, so that after eventual loss of voice, he or she can still speak in his or her own voice.
Prior to the launch of "LiveChao" in February 2003, NTT-AT will exhibit it at the NTT Group Exhibition 2002, to be held at the Tokyo International Forum on December 6, 2002.
[Main features of the product]
| - |
It can synthesize speech in a variety of voices. |
| |
By generating a speech database from the voice of a TV personality or voice-over artist, that person's voice can be reproduced for new text. Guidance information can also be delivered in the voice of a particular cartoon character. |
| - |
It reads text aloud accurately and generates almost natural-sounding speech. |
| |
Combining high-speed, high-accuracy text analysis with continuous waveform linking allows textual information to be read out accurately and produces high-quality speech approaching the quality of natural speech. |
| - |
Because voices need not be re-recorded each time messages are updated, the product suits systems whose messages are updated frequently. |
| |
Once a speech database has been generated from a minimum of one hour of recorded speech, new messages can be synthesized from text without re-recording the speaker. |
[Basic specifications]
| Input data | Japanese | Text combining Kanji characters and kana, or representation in Roman letters with cadence information |
| Input data | Others | Representation in Roman letters with cadence information |
| Language processing | Basic dictionary | About 350 thousand words, plus various dictionaries for technical terms |
| Language processing | User dictionary | |
| Speech synthesis | Method | Continuous waveform linking (using phoneme/morpheme boundary as a unit for processing) |
| Speech synthesis | Speaker | Male/female (construction of speaker's speech database) |
| Speech synthesis | Language | Japanese |
| Speech synthesis | Output speech | Monaural, up to 16 kHz |
| Speech synthesis | HDD space | Program: 1.4 MB; female speech database: 300 MB. (Consult with us for male speech database.) |
| System requirements | OS | Microsoft Windows NT4.0/Windows 2000* |
| System requirements | CPU | PentiumII 400 MHz or higher (recommended) |
| System requirements | Memory | 64 MB or more |
| System requirements | HDD space | About 1 GB or more (recommended) |
| |
- |
To synthesize speech using the voice of a particular person, it is necessary to construct a database by recording the voice of that person reading a text we have prepared. The reading will take about one hour. |
| |
- |
The quality of the synthesized speech depends on the quality of the speech data in the database. |
[Sales price]
| |
Basic system: 1 million yen (provisional price, not including tax). |
[Start of sales]
| |
February 2003 (provisional). |
<Glossary>
| *1: |
-Tagger |
| |
A high-speed, high-accuracy morphological analysis engine, capable of analyzing 15,000 characters per second (on a 2.4 GHz Pentium4** CPU) with a word extraction accuracy of 98%. The system dictionary contains morphemes for about 350 thousand words. Various dictionaries for technical terms can be added as options. |
For inquiries, please contact
| |
Ms. Hiroko Miyakawa
Manager, LiveChao
Media Applications Technology Department
Media Integration Division
NTT Advanced Technology Corporation
TEL: +81-422-36-5051; FAX: +81-422-37-8457
LiveChao URL: http://livechao.ntt-at.co.jp/ (Japanese-language website) |
| * |
Microsoft Windows NT4.0/Windows 2000 are registered trademarks of Microsoft Corporation, U.S.A, in the U.S. and other countries. |
| ** |
Pentium4 is a registered trademark of Intel Corporation. |
|