Episode: 21
Title: HPR0021: The Festival Speech Synthesis System
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0021/hpr0021.mp3
Transcribed: 2025-10-07 10:22:48

---

Hello, this is Hacker Public Radio. My name is Dave and I'll be your host for today, and I am going to talk about Festival.

Festival is a multilingual speech synthesizer package. It was developed at the Centre for Speech Technology Research at the University of Edinburgh and at Carnegie Mellon University, amongst other places; there are a couple of other sites that probably had something to do with it as well. And it's licensed under a BSD-style license.

There are other pieces of the puzzle, not just Festival, that get lumped into conversations or topics like this. Festvox is another suite of tools that makes the building of synthetic voices for Festival more systematic and better documented. And there are, I guess, voice packages that come with Festival or that are designed for Festival, and additional voices developed by different entities that work with Festival. Two come to mind, the MBROLA project and the HTS project. These can be used as back ends for Festival, although both of them require a separate engine to work.

And then, I guess, we have some front ends for Festival. Flite, F-L-I-T-E, I think is how that's spelled, is a small-footprint speech synthesizer based on Festival, developed at Carnegie Mellon University and built with Festvox. And then there are some GUI front ends: TkFestival, built of course with Tcl/Tk, and Carnival, which is one I have not used, but the screenshots are really nice. It is a really nice-looking graphical front end for Festival.

So that's sort of the family of applications I'm going to be talking about.

Now, most people, when they think of computerized speech synthesis, think of one of two or three things. They will think of accessibility and sightless computer users, people who need screen readers and need to have text read to them because they can't see it. Or they will think of the 1968 masterpiece 2001 and HAL 9000. Or, as some of them probably did, they will think of the 1961 IBM 7-something series computer, I don't know exactly which it was, singing "Daisy Bell (Bicycle Built for Two)", which was what HAL 9000 sang.

I am not a sightless computer user, but I am one who enjoys having his computer talk to him. The very first computer I ever bought and paid for was a Windows computer. My dad bought me a TRS-80 when I was an early adolescent, but in the early 90s I bought a PC that came with Windows 3.0 or 3.1 or something, and there was an application that would read text. I just remember thinking it was really funny to hear a computerized voice say the word "banana". That was the end of it there, but I find it hard to believe that I am so unlike most people that I would be alone in enjoying hearing my computer speak to me. If you are anything like me, you probably have a personal relationship with your computers anyway. I have been known to have pictures of my computers hanging up in my office at work, snapshots, so to speak. Anyway, it's something that geeks do, I think: talk to their computers, or especially have their computers talk back to them. It's just another neat thing we can do with our computers, and a very useful thing too.

The reason I started using Festival on a routine basis was the podcast I do from my car. I am recording this from my car. Recording audio for stuff like this from my car is most convenient for me. What's not convenient about it is that I don't have a computer in front of me, so invariably I will forget something. I will have a minimal amount of notes in front of me, and I will leave something out or get something wrong. Instead of re-recording it all or having to record something extra when I get home, I can type something up and then have Festival translate the text to a WAV file that I can import into Audacity and include in the podcast. In addition to corrections and stuff, I can pre-prepare for the podcast: I can pre-prepare the show notes, the opening and closing, that kind of thing, or I can correct something, like I said, or make an addendum to the podcast after I have recorded it. Using Festival has been a real time saver for me.

Before I started doing any kind of podcast, which wasn't that long ago anyway, when I first started using Linux, I remember there was a program called saydate, and there is one called saytime, which is the one I use, S-A-Y-T-I-M-E. You would type in saytime and it would tell you what time it was, you know, audibly. I remember thinking that was pretty neat.

But before getting into what you can do with Festival, other than some of the things I already alluded to, I guess a question a lot of people would have is how do you install it. Well, it's really easy to install. It comes with your distribution anyway, more than likely. Ubuntu comes with version 1.4.3. I think there is a beta version, not available in the repos that I know of for Debian or Ubuntu, but it's version 1.95, a release candidate or beta version of what the Festival people are calling version 2, and that's what is available from source as a tarball. But as far as Debian and Ubuntu go, version 1.4.3 is what's currently in the repos. And I know for a fact it's in the Fedora repos and the SUSE repos, and Mandrake/Mandriva, Gentoo; pick your distribution, it's there. If it's not, like I said, you can install from source. It's not that big a deal.

But the packages I have installed on my Ubuntu laptop are festival; festlex-cmu, which is the dictionary file developed at Carnegie Mellon University; festlex-poslex, which is a Festival lexicon file for part of speech (POSLEX, part-of-speech lexicon); and festvox-kdlpc16k, which is one of two American male voice packages that you can install. Well, it's one of two different types, and there are different sample rates, depending on whether you want 8 kilohertz or 16 kilohertz, for both versions. So I think there's a kdlpc16k as well as a kallpc16k. Anyway, those are just the standard voices, and of course there are voices for lots of languages. It's a multilingual program, so more than likely you can find your language. Be aware that out of most of the repos you're only going to get male voices, American English or UK English. I know there's an Italian female voice, but there are not a lot of female voices in the Debian and Ubuntu repos at least.

Anyway, so what can you do with Festival? Well, on an Ubuntu system, and probably a Debian system too, not much unless you add a line or two to the .festivalrc file in your home directory. This will be in the show notes, and I'm hesitant to read it, but it's two parenthetical statements that start with Parameter.set followed by a space. The first one is, in parentheses, Parameter.set, space, single quote, Audio_Command, space, then, in quotes, the command that you want Audacity, excuse me, Festival, to use to play the audio. So if you want to use ALSA, which I recommend doing if you want to be able to have a music player open and use Festival at the same time, you'd put in something like aplay, I guess it's the ALSA player, followed by the appropriate parameters. You can go to the Gentoo wiki and do a search for the speechd HOWTO, and you will find the parameters to put in. I will put them in the show notes so you will know how to get this to work. It's not good radio to read it.
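
For readers following in text rather than audio, the two .festivalrc statements described in this part of the episode can be sketched as below. This is a sketch under assumptions, not the episode's actual show notes: the aplay flags are the ones commonly circulated in Festival-with-ALSA guides, and the write_festivalrc helper name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: append the two Festival settings for ALSA playback
# to an rc file (defaults to ~/.festivalrc).
write_festivalrc() {
    rc_file="${1:-$HOME/.festivalrc}"
    # The quoted 'EOF' keeps $SR and $FILE literal; Festival substitutes the
    # sample rate and a temporary audio file name at playback time.
    # The aplay flags are an assumption taken from commonly shared guides.
    cat >> "$rc_file" <<'EOF'
(Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
EOF
}

# Usage (not run automatically): write_festivalrc
```

The helper appends rather than overwrites, so an existing .festivalrc keeps whatever settings it already has.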

In the second parenthetical statement, it would just be Parameter.set, space, single quote, Audio_Method, space, single quote, Audio_Command, which is the parameter you set in the previous parentheses, then close parentheses, I guess I should have said. Not the best radio, but it's necessary if you want to be able to use Festival with ALSA.

Now that you've got it working, what can you do with it? You can, for instance, make Festival read instant messages to you from within Gaim. I am not sure if there is a Pidgin plugin for this; I'm sure there is, but it's not in the Ubuntu repos, I know that. There is a Gaim plugin for Festival, probably called something like gaim-festival, and it is literally a five-minute setup once you install it. You just go into Gaim and enable the plugin. Or, under KDE, there is the KDE text-to-speech manager, which includes a little parrot that sits down in your status bar, your system tray, and anything you copy to the clipboard it will read back to you. There is also an app called KSayIt, and one called KMouth that I have never played with.

Then of course there is the command line, festival. You type festival and you get presented with a Festival prompt, where you can set the default voice or set the speed and the volume and then have it echo back commands or say things. That's one way to use it. I don't often use it; I haven't used it that way in years, so I'm not going to talk about that. But from the command line you can pipe the output of a command to Festival. The command and switch we'll use is festival, space, dash dash tts, where TTS stands for text to speech. You could type echo and then, in double quotes, some clever text, pipe, festival --tts, and whatever clever text you put inside the double quotes will get spoken by Festival over your speakers. You can cat a file that way. I think you need the capital A switch: cat, space, dash capital A, file.txt, pipe, again festival --tts, will read to you the text in the file.txt file.

A really useful thing to do: I know most of you have read a man page or two, but maybe you would like the man page to be read to you while you do something else, while you multitask, while you actually listen to the man page tell you what to do. You have your fingers at the ready in the terminal, ready to type what you hear. For instance, you could type man, space, cron, pipe, festival --tts again. So you can more or less pipe any command to this that's going to output stuff to standard out. You could pipe the date command with the appropriate parameters to have it tell you the date. Not much need to do that when you have programs like saydate or saytime. I think the saydate program will even tell you your uptime in addition to the date. So that's pretty handy. Well, I think it's pretty neat.
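
The piping examples from this stretch of the episode can be collected into a few small shell functions. This is only a sketch: the function names (say, say_file, say_man, say_date) are made up for illustration, and the date format is one arbitrary choice. It uses plain cat rather than cat -A, on the assumption that the -A markers GNU cat adds for line ends and non-printing characters are not wanted in speech.

```shell
#!/bin/sh
# Speak a literal string, as in: echo "some clever text" | festival --tts
say() {
    printf '%s\n' "$*" | festival --tts
}

# Read a text file aloud by piping it to Festival on standard input.
say_file() {
    cat "$1" | festival --tts
}

# Read a man page aloud, e.g. say_man cron
say_man() {
    man "$1" | festival --tts
}

# Speak today's date; the format string is just one possible choice.
say_date() {
    date '+%A, %B %d, %Y' | festival --tts
}
```

Anything that writes text to standard output can be spoken the same way, which is the point Dave makes about piping commands in general.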

What else can you do? You could create a shortcut, if you use a desktop manager that allows you to create desktop shortcuts. I don't, but if you do, you could create an icon on your desktop that points to the festival --tts command, and you could drag a text file to it, conceivably drag an email to it with some tweaking, maybe an HTML page. I don't know; some of this stuff I've not tried, but that's the kind of stuff you can do with it.

One thing I have done with Festival, other than using it as a time-saving device for my podcast, is I took a book, a book that was in the public domain, or that is considered to be public domain. It's literally freeware. The book is called Underground, written by Suelette Dreyfus and researched by Julian Assange. I think he works at Harvard, I think. Anyway, it's a book about some hackers in Australia in, I guess, the late 80s, early 90s. It's a very good book. It's a true story, a documentary-style book, and I enjoyed reading it. They made the text of the book available online in text files, and I divided it up into chapters. And I used text2wave, which is another portion of Festival, to create an audiobook, one file for every chapter. I did it in MP3, and it was very time consuming. I actually did it on vacation in the Dominican Republic the year before last, I think.

I would take the text of a chapter and I would use text2wave. The command, off the top of my head, for text2wave, I'm thinking it's something like... I've forgotten. You know, text2wave --help will tell you, but it is something along the lines of text2wave, the name of the text file, a dash o for output, and the name of the WAV file that you're going to create. And then you follow this with a dash eval and then, inside double quotes, inside parentheses, you put the name of the voice you want to use. For my podcast and for the reading of the book, I use one of the HTS voices; cmu_us_slt_arctic_hts is the voice I used.

Hi, this is voice cmu_us_slt_arctic_hts. Just in case Dave was in error trying to quote the command that created this audio, here is the command once again: text2wave, myfile.txt, dash o, myfile.wav, dash eval, open double quote, open parenthesis, voice_cmu_us_slt_arctic_hts, close parenthesis, close double quote.

On my podcast I call her One, I've named that voice One, but that's just me; that's not the official name of the voice. But that's the command I used to create an audiobook from a bunch of text files.

This is something you want to be sort of careful with if you decide to create an audiobook using Festival. It's best if you have the book broken up into chapters, because text2wave is going to load the entire text file into memory and create a WAV file from it, and it's going to use a lot of memory if the file is too big. I was lucky that Underground is a book that was written in chapters. Had it been the whole book, and had I tried to load, you know, six or seven hundred pages' worth of text, it would have probably taken up all two gig of RAM on my system. I'm just guessing.
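
The text2wave command quoted in the episode, and the advice to convert one chapter at a time, might look like this as a script. It is a sketch, not Dave's actual workflow: the function names and the chapterNN.txt naming are assumptions, and the voice selector is the one read out in the episode, so it only works where that voice is installed.

```shell
#!/bin/sh
# Voice selector as quoted in the episode; any installed voice would do.
VOICE="voice_cmu_us_slt_arctic_hts"

# Convert one text file to a WAV file with the chosen voice.
text_to_wav() {
    text2wave "$1" -o "$2" -eval "($VOICE)"
}

# Convert each chapter in its own text2wave run, so no single run has to
# hold the whole book: chapter01.txt becomes chapter01.wav, and so on.
convert_chapters() {
    for txt in "$@"; do
        text_to_wav "$txt" "${txt%.txt}.wav" || return 1
    done
}

# Usage (not run automatically): convert_chapters chapter*.txt
```

Encoding the resulting WAV files to MP3 is a separate step that the episode mentions but does not describe, so it is left out here.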

But between just the simple text-to-speech command-line switch and the text2wave program that comes with Festival, the sky's the limit as to what you can see doing with this, especially the text-to-speech part. I mean, there are demo pages on the web, on the Festival site and some of these other sites, where you can demo some of the voices. There's a text box you type text into, you pick a voice from the drop-down list, and you can hear that text spoken in a different voice. That's just PHP. I stumbled upon a site the other day, the regular way, not using StumbleUpon, that would read books to you. I think it would take a book and turn it into an audio file that you could subscribe to with an RSS feed. It's a pretty neat site; I forgot the name of it. But you could listen to Project Gutenberg books translated from text to speech, and you could subscribe to the audio files with an RSS feed reader, so like a podcast. That was pretty neat. I'm sure they use something like Festival; it's not Festival, it was a commercial text-to-speech package. So there's a lot that you can do with it.

One of the questions I get asked a lot is how do you change voices. Festival comes with a handful of voices and, like I said, most of the American voices are male. Most people are probably like me: they find it somewhat intriguing to have their computer talk in the first place. But if you're married like me, there's something, what's the word, there's something really nice about having your computer talk in a female voice, knowing that you can tell her exactly what to do and she'll do it. She's a computer, but she's talking in a female voice. It's something that married men don't get to do very often. Anyway, I'll fill you in on how you change voices in a bit.

Another thing you can do, and I've never tried this but I imagine it would be pretty easy: the command-line text-based browser Lynx, L-Y-N-X, has a dump feature, D-U-M-P. So you could open up a web page with Lynx using the dump argument, lynx, space, dash dump, the name of the HTML file, and then redirect that with the greater-than sign to a file. What that's supposed to do is take the HTML file and translate it to ASCII text, which I'm pretty sure Festival can handle pretty well, except for maybe some non-standard characters. So that's a way, maybe, to convert a web page that is in HTML into something Festival could read to you. I like text-based weather forecasts because they open up better in Lynx in a console window, and that's something that could easily be parsed, and I could have Festival read me the weather forecast.
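
The Lynx idea, which Dave says he has not tried, could be sketched along these lines. The function names are hypothetical. The -nolist option is an addition not mentioned in the episode; it is included on the assumption that the numbered list of links Lynx normally appends to a dump is not worth hearing.

```shell
#!/bin/sh
# Render a page (URL or local HTML file) as plain text on standard output.
# -dump prints the formatted page; -nolist omits the trailing list of links.
page_to_text() {
    lynx -dump -nolist "$1"
}

# Save the text to a file, as described in the episode.
save_page() {
    page_to_text "$1" > "$2"
}

# Or skip the file and pipe the text straight to Festival.
read_page() {
    page_to_text "$1" | festival --tts
}

# Usage (not run automatically): read_page forecast.html
```

Saving to a file first leaves room to clean up the text before it is spoken.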

There are lots of things you can do with this, and I'm pretty sure Asterisk has an option to use Festival instead of a recorded voice. Speaking of Asterisk, there is a commercial text-to-speech package available for Linux that I think Asterisk uses, and I've forgotten the name of it; it begins with a C, I think. Allison Smith, the woman who does the voices for Asterisk, is in there, her voice too. I think it's called a diphone; I mean, they've synthesized her voice, so her voice, synthesized, is a computer-like voice that sounds like her, which you can use with this commercial package. And I've forgotten the name of it.

Sorry for the interruption. What Dave is struggling to remember is the commercial text-to-speech system called Cepstral, C-E-P-S-T-R-A-L.

But there are two or three commercial alternatives to Festival, and they're not really expensive. I guess there's TTSynth, which used to be IBM text-to-speech, it used to be ViaVoice, for around $40. There are a couple more that I don't remember, but they're in about the $29 price range as well. But that was a tangent; it threw me off.

Oh yeah, before I get to changing voices, there are other things you can envision doing with sed and awk and friends. You know, you could remove non-standard characters using sed, or you could take the output of a log file, say, and use awk to print, you know, like one column of it. If you want to know who was online, you could use awk to figure that out and have it read to you. So that's pretty neat.
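
The sed and awk ideas mentioned here are not spelled out in the episode, so the following is only one possible reading of them. The function names are invented, and treating "non-standard characters" as anything outside printable ASCII and whitespace is an assumption.

```shell
#!/bin/sh
# Drop every byte that is not printable ASCII or whitespace, so the
# synthesizer is not handed characters it may stumble over.
strip_nonstandard() {
    LC_ALL=C sed 's/[^[:print:][:space:]]//g'
}

# Print only the first whitespace-separated column of each input line.
first_column() {
    awk '{ print $1 }'
}

# Announce who is logged in (not run automatically):
#   who | first_column | sort -u | festival --tts
```

The first column of who output is the login name, which is why a single awk print is enough for the "who was online" case.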

But like I said, one of the main questions I get frequently is how do you get the voice of One, as it's called on my podcast, which is the CMU US SLT Arctic HTS voice. That's cmu_us_slt_arctic_hts, and that's Arctic as in tundra. This voice file can be found at hts.sp.nitech.ac.jp. Once you're at that web page, you want to look for the release archive, because what's on their main page are voice files for the latest version of Audacity, I keep saying that, the latest version of Festival, which is version 2.0 beta or 1.95. What comes with most distributions, I think, is 1.4.3. So you go to the release archive at that web page and you'll find the cmu_us_slt_arctic_hts file. The SLT file is the US female voice. There's another one, but that's the one I think sounds the best. It's about a two-meg download, and it will include the HTS engine that you'll need to run this voice, as well as the voice files themselves.

Installing this isn't that easy, but it's not hard. It's not hard, right? When you download the tarball and unzip it or untar it and look at it, you'll see that there's a, I forget the directory structure off the top of my head, but you go down a couple of levels and you've got, like, it may be voices/us/cmu_us_slt_arctic_hts. So it won't be the top-level directory; the first two levels will be empty, but three levels down or so, at least two levels down, you'll see that directory, the cmu blah blah blah. That one is the one you will want to copy, if you have Festival installed, into, I think it's /usr/share/festival/voices/english. So that's slash usr slash share slash festival slash voices slash english. That's where you'll want to put that directory. Once you do that, you'll be able, using text2wave, to select that voice with the command-line switch, that's dash eval and, in parentheses, the name of that voice, and that may have to be in quotes. I mean, forgive me, but that's how you do that.
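
Since Dave is unsure how deep the voice directory sits inside the tarball, one way to sketch the install step is to search for it by name instead of hard-coding the path. The install_voice helper is hypothetical, the destination is the path as recalled in the episode, and the archive name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: find the voice directory somewhere inside an
# unpacked tarball and copy it into Festival's English voices directory.
install_voice() {
    unpacked="$1"                                    # where the tarball was extracted
    voice="${2:-cmu_us_slt_arctic_hts}"              # directory name to look for
    dest="${3:-/usr/share/festival/voices/english}"  # path as recalled in the episode
    src=$(find "$unpacked" -type d -name "$voice" | head -n 1)
    [ -n "$src" ] || { echo "voice directory $voice not found" >&2; return 1; }
    cp -R "$src" "$dest"/
}

# Usage (not run automatically; writing under /usr usually needs root):
#   mkdir -p /tmp/voice && tar xf downloaded-voice.tar.bz2 -C /tmp/voice
#   install_voice /tmp/voice
```

After the copy, the voice would be selected with the same -eval switch described above.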

I think what maybe makes that voice sound better is the way that it's built. It's built with a different engine; it's built with what is called HMM, which stands for hidden Markov model. It's a statistical method that, I've read, is like the simplest form of a dynamic Bayesian network. But it's a statistical system where, I won't try to explain it because I'm not completely up on it, but it's often used in creating voice files. What it's called for, the hidden part of it: you know, there are inputs or outputs, but there are states that can change that you're not aware of. I can't explain it. Like I said, I'm in the car, it's been a while since I faced statistics, and my wife is calling.

Hey. Uh, I have to give the name of this road... Yeah. No. Okay. Okay. Okay.

What invariably happens when you record audio in a car on your way home from work is your wife calls you, and I've sort of forgotten where I was. But yeah, the hidden Markov model: I don't know if it's superior or not, but I know this SLT voice sounds really good. And again, you can find that at hts.sp.nitech.ac.jp, I think. Nitech is like the Nagoya Institute of Technology; it's a college in Japan, evidently.

Anyway, I guess something I've got left to say is that using that cmu_us_slt_arctic_hts voice with text2wave, I get the result I want, an audio file that's been translated from text, but I also get a seg fault and a core dump every time. I don't know why; I've not investigated. I just clean up the core dumps afterwards, so that I'm not accumulating a bunch of core dumps and wasting disk space or anything. It's not doing any damage at all; it's still creating my file for me and everything works, so I guess your mileage may vary.

Anyway, I've rambled on long enough, and that's going to wrap it up for this episode of HPR. Thanks, have a good day.

Hello, this is voice cmu_us_bdl_arctic_hts.

Hello, this is voice us1_mbrola. Dave forgot to mention it, but the MBROLA engine binary and voice files can be found at the following URL: tcts.fpms.ac.be/synthesis/mbrola.html

Hello, I am the default Festival voice. Thanks for downloading Hacker Public Radio. Have a nice day.

Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to C-A-R-O dot N-E-T for all of those of you