Episode: 1599
Title: HPR1599: Interview with Ingmar Steiner from the MaryTTS project
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1599/hpr1599.mp3
Transcribed: 2025-10-18 05:42:48
---
It's Thursday the 18th of September 2014.
This is HPR episode 1599 entitled, Interview with Ingmar Steiner from the MaryTTS project.
It is hosted by Ken Fallon and is about 86 minutes long.
Feedback can be sent to ken at fallon.ie or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the MaryTTS text-to-speech project.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hi everybody, my name is Ken Fallon and today I'm talking to Ingmar Steiner.
Is that correct?
Hi, and you're from the MaryTTS project.
Could you tell us a little bit about what MaryTTS is?
It is actually a text-to-speech platform written in Java that is open source and has been around for nearly 15 years now.
It started out as a project at the Institute of Phonetics at Saarland University in Saarbrücken in Germany
and was developed mostly at DFKI.
That's the German Research Center for Artificial Intelligence, which has a branch here on campus.
I'm in Saarbrücken as well.
There's a language technology lab there and that's where the main architect, designer, programmer and basically everything, the god of the project, Marc Schröder, spent a lot of his time.
It was conceived as an in-house text-to-speech synthesis tool for various projects.
DFKI has a number of public and industrial projects and they needed some sort of solution and they decided to just roll their own.
It just so happens that Marc, Marc Schröder, was working on his PhD at the time in phonetics, and he was very interested in emotional speech, and especially in synthesizing emotional speech.
So that was his main research interest, and he basically designed Mary from the ground up to be modular, so that you can plug different things together to achieve different ends.
But also always with the application side of things in clear view, because it was always meant to be a component that would actually be used as middleware by other projects like spoken dialogue systems, that kind of thing.
And also the personal research interest of Marc was, as I said, emotional synthesis.
So that was always something that was kind of built in or one of the capabilities that was supposed to be developed and was to a certain extent also achieved and that's kind of how it started back in 2000 or so.
Could you tell us what emotional synthesis is?
Yeah, so essentially when I use regular off the shelf TTS systems, they sound pretty boring.
They're essentially just very monotonous, or maybe even if they have a little bit of lively speech built in, then it's typically also kind of artificial and flat.
And emotional synthesis is a way to convey human emotion or expressivity through parameters such as pitch or duration or voice quality.
And basically I can sound happy and then I have a different pitch range or angry and then I sound with, I don't know, my voice is more harsh or whatever, you know.
And if I'm a little depressed or something, then I guess I sound quieter and I speak more slowly, things like that.
So those were kind of the things that Marc was interested in: how to model these things, and to model them basically using synthesis as a tool to discover what it is exactly that makes people, and by extension artificial agents, convey emotion or express emotions in this way.
So that was kind of always an implicit deliverable, but not something that it is typically used for, although it can be.
What are the main applications for it?
There is actually quite a number of different applications that it's been used for in the past.
So I mentioned spoken dialogue systems or what we like to call intelligent virtual agents.
So some 3D avatar that has actual spoken audio.
So you see a virtual character on the screen and when he talks, it's actually synthetic speech.
Or perhaps a spoken dialogue system that actually talks. Well, we have a couple of dummy toy projects in house actually, like a speaking elevator that greets you when you walk in and you can tell it which floor to go to.
Or whose office you're looking for and it will bring you to the right floor, that kind of thing.
Or speech-to-speech translation systems.
There was actually a project that we used Mary for a couple of years ago.
That was in Dublin actually where we wired up a little toy project using a speech recognition, then automatic translation and then spoken output.
And we used a webcam to detect the emotion from the speaker.
So if he's smiling, we would assume he's happy.
If he knits his brow, then he must be angry, that kind of thing.
And we mapped that onto different expressive output styles in the synthetic speech.
So if you look angry and say it, then maybe it would say... well, actually, it was English to German only.
So you look angry at the camera and you say, good day to you.
And the system would translate it as an angry "Guten Tag", as opposed to a cheerful "Guten Tag" if you're smiling while you're saying it.
So I mean,
those are a couple of the things that Mary has been used for.
I believe there's a school for blind children in Tibet that actually uses a very early version of Mary where Tibetan support was developed in a student software project that was back in 2005 or so.
And I think they're, I'm not sure, but they might still be using it today.
So that's actually something that's very useful for a lot of people.
And of course, there's an increasing demand for screen readers, although I have to say that there are a couple of issues with using MaryTTS for screen readers, where it just falls behind other solutions.
Those would be, well, there have been a couple of problems with the long startup times and the latency,
so before it actually starts speaking. And those are some things that we can address, but unfortunately, we don't really have a project at the moment that focuses on these things.
So we don't have anyone to actually address those things directly.
And that'd be nice. Yeah, I just want to slip in here, just for our own community, to tell our own listeners:
I think we've started using eSpeak as the introductory text to speech to give a synopsis of what the particular episodes are going to be.
I think for this episode, I'm going to use Mary TTS to do that.
I'll do both, I'll do the eSpeak and I'll do the MaryTTS one, so people will get an idea of the differences of what we were talking about.
My main reason for having you on and interviewing you is because many of our listeners will know Jonathan Nadeau, who is a blind listener himself and has a project from the Accessible Computing Foundation.
And they use Mary text to speech as their tool for integrating with Speech Dispatcher, which interested me greatly.
So I will absolutely get Jonathan on to you to see if there's anything that he can do to assist you.
Okay, sorry, that was a slight digression. Yeah, yeah, no. So, jumping over a couple of years of history, I need to mention that MaryTTS today is very much an open source and community-driven project.
We've been on GitHub for a couple of years now, and that makes it much, much easier to accept contributions from other users than it was before. It's written in Java.
Yeah, and it's under the GPL version?
It's actually under the LGPL, so that makes it very friendly to businesses as well.
So there are a couple of voice portals and other companies that are, that are using it for commercial purposes, and that's fine.
So yeah, that makes it, that makes it more accessible in an enterprise setting as well.
Now, I will admit that I downloaded and tried to use it myself.
My main use case, I have to tell you, is: my daughter is dyslexic, and she benefits from having people read stuff to her, because she can understand it quite easily.
Yes.
And when I had the computer read text to speech in the Dutch voice using eSpeak, she just came out of the room in tears; she thought that the computer was angry with her and stuff.
So my goal is to have my wife create a voice based on her soft-spoken voice, so that as she's reading her text,
it's the sound of her mother, or at least a synthesized sound of her mother, coming back.
And this seemed to be something that MaryTTS could do. So what could you tell me about the components? Because I found it...
Well, you mentioned the latency and startup, and I would also throw in resource hogging in there as well.
Yes, yes, it depends on what you use as an engine.
So could you just give us a bit of a technical background as the architecture, what you absolutely need and what you don't need for that sort of scenario?
Okay, so to start out, what MaryTTS does is it ties in a couple of different modules.
The basic building blocks are these so-called modules that are Mary-internal.
So we'll call them Mary modules, or just modules for short for now, and they pass around data.
And each module has an input type and an output type, and these different types are variants of a certain kind of data.
That data itself is actually an XML format, we call it MaryXML, which is a container for the information, and then at the end also the actual audio.
So what happens is that you have a certain input type, let's say text, and you want a certain output type, let's say audio.
And the system itself, the so-called module registry, determines what the optimal path is through all the modules that are needed to process it from text to audio and all the intermediate stages.
And every one of those stages will be created by a different module that takes the input, processes it and enriches it, and passes on the enriched output to the next module in the pipeline.
And that's pretty much how that works.
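As a minimal sketch of driving that pipeline from Java, the snippet below uses the MaryTTS 5.x MaryInterface API as I understand it; the class names (marytts.LocalMaryInterface) should match the 5.x releases, but the intermediate output type name "PHONEMES" is an assumption, and getAvailableOutputTypes() lists the real ones.

    import java.util.Locale;
    import javax.sound.sampled.AudioInputStream;
    import marytts.LocalMaryInterface;
    import marytts.MaryInterface;
    import org.w3c.dom.Document;

    public class PipelineDemo {
        public static void main(String[] args) throws Exception {
            MaryInterface mary = new LocalMaryInterface(); // runs in-process, no server
            mary.setLocale(Locale.US);

            // Full pipeline: plain text in, audio out. The module registry
            // works out the chain of modules needed to get from TEXT to AUDIO.
            AudioInputStream audio = mary.generateAudio("Hello world");
            System.out.println("Frames of audio: " + audio.getFrameLength());

            // Ask for an intermediate stage instead: the MaryXML document
            // after pronunciation prediction ("PHONEMES" is an assumed type name).
            mary.setOutputType("PHONEMES");
            Document maryXml = mary.generateXML("Hello world");
            System.out.println("Root element: " + maryXml.getDocumentElement().getNodeName());
        }
    }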
Why is that so complex when something like eSpeak, you know, you say this and it says it?
It's not as complex as you think, because the modules themselves are, well, okay, let me back up.
There are a couple of different things that need to be done to text before it can be actually read out by some sort of synthesis engine.
Now eSpeak uses, I believe, a formant synthesizer, which means that it generates low-level acoustic parameters and then just renders them as acoustics, as audio, as wave samples or whatever.
Yeah.
And there are a couple of other engines out there as well.
There's the unit selection engine, which is the resource hog you spoke of, because that actually loads into memory, or, well,
you can read it from the disk, but it needs a lot of memory to organize a huge database of spoken units.
So you record, I don't know, hours of audio.
It needs to be processed and annotated offline.
And then once you have all that, you can load it into memory and use it as a TTS voice.
And what it does is it looks for units in that database and just concatenates them together in the best possible way.
And then the output is hopefully something that sounds almost as natural as human speech.
And it might have a few glitches depending on whether or not it found the right units or whether it needed to back off to suboptimal ones.
And that's the thing that's still kind of used in server side applications where naturalness is paramount.
But the trend is strongly heading towards something that is more similar to formant synthesis, but uses a more high-level, machine-learning-based approach.
So HMMs are trained for each acoustic unit, if you want.
And then it uses a vocoding technique to regenerate the audio. Well, OK, the approach results in something that is a little bit buzzy and a little bit too smooth.
But it's much nicer than just raw formant synthesis, and you can actually recognize the identity of the speaker on which these models were trained.
And so those are some of the popular techniques right now.
There's also something that is similar to unit selection that was used in the 90s, if you will, where basically you only have one instance of each unit.
That's called a diphone synthesizer. MBROLA is one that was quite popular and is actually still used because it gives you very, very precise control over the way in which things are pronounced, if you need that.
But other than that, it doesn't sound very natural. And there are a couple of experimental things that people use.
There is a harmonics plus noise system that basically tries to parameterize the periodic aspects of the acoustic signal and to store the stochastic parts.
So basically it stores the noise, but tries to parameterize the harmonics of your voice, if you will, and tries to manipulate those things in such a way that you can actually change the way in which something is spoken without it sounding unnatural, but that's more an experimental system.
And there's something else entirely, which is more like a sound simulation, a speech simulation, called, sorry, articulatory synthesis.
And there we have essentially a vocal tract model, a three-dimensional vocal tract model.
And we simulate how sound propagates from the glottis, or from the larynx, through the oral cavity and through the nasal cavity to be rendered as, well, as acoustic audio.
Sorry, my screensaver is kicking in there. Let me just disable it.
There we go. Okay. And so that's actually something that is very experimental, where a student of mine just very recently integrated it with Mary. So that is another technique that we can use, but I wouldn't call it production ready.
So what happens in production or just the user side of things is you enter some text or you, I don't know, whatever you provide some text input.
This text input needs to be pronounced somehow. And the intermediate stages are to look up the pronunciation of all those words. And of course, it's not as simple as just looking them up word by word in some sort of pronunciation dictionary, because you're not going to find all of these words in any given dictionary.
Particularly names, or other different word forms depending on which language it is. English is actually quite simple because there isn't that much morphology in the English language.
The words don't really change that much, but other languages would have much more complexity in their linguistic morphology.
So you need some sophisticated processing there of those words to find out how to look them up in the first place.
But more to the point at some point, you're going to have to back off and use some sort of either pronunciation rules or some sort of statistical model to predict the pronunciation of words that you don't know how they're pronounced.
And that produces some sort of phone sequence, or a sequence of phonetic units, we'll call them phones. And those can then be passed on to the synthesis back end, but there is more. There is also the tone of voice, the actual duration of each phone that needs to be predicted, because if you say every sound in exactly the same way, with exactly the same duration, it's going to sound like nothing that you've ever considered speech.
So the segmental durations, the intonation, all those things have to be predicted in such a way that what comes out is actually understandable and hopefully even more or less natural.
And beyond that, if you have certain things that need to be stressed or accented in a certain way to focus on certain things, maybe you're trying to articulate some sort of contrast, a contrastive accent on what you're talking about.
It wasn't John, it was Mary who went to the pub yesterday, or something like that. You want the word Mary to be pronounced differently than if you just say, out of the blue, Mary went to the pub yesterday. So you have a completely different intonation, and things like that depend in a certain way on knowledge that is not actually in the text.
So it's very much an interpretation of the text when we read it out, and the computer doesn't really know how to do that, because it has no knowledge of any of these entities that we're referring to.
So it kind of goes in the direction of natural language understanding before you can even tackle some of these more high level problems in predicting the pronunciation of something.
But anyway, once you have the acoustic parameters that you want, then the next challenge is to actually render them into acoustics.
And that's what you do with one of these engines, so formant synthesis, HMM, unit selection, whatever.
So it's actually a two-stage process, at a very rough level. You would have the text preprocessing up to the point where you know what the acoustics should be like.
And then you want to actually generate them based on your predicted parameters.
And I haven't looked inside, sorry, go ahead.
Wow, that's complicated.
Yeah, sorry, it's a bit of a mouthful, but anyway, stop me if I'm if I'm rambling.
No, no, no, I'm following along.
Okay, think this is fascinating. Keep going.
Okay, yeah, I have to admit I haven't looked inside of eSpeak, but I would be very surprised if it doesn't at some level follow the same pipeline.
Except maybe it's not breaking it up into as many different sub-steps and modules that actually process things explicitly, but kind of just doing it all in one class and then passing it on to the acoustics or whatever.
I don't know.
But at the end of the day, text-to-speech synthesis is always about those things that I mentioned.
So you have to predict the pronunciation and predict acoustic parameters from text in some way, and then render it in some other way.
Okay, so wow. I suppose the first part, the predicting of the waveforms, is the difference between having a good reader,
somebody who is really a good narrator of a story, versus a bad narration of a story.
When I read a storybook to the kids, it's just boring and flat, but the moment
the wife reads it, it's really exciting and thrilling and stuff.
I don't even know how you would do that on the computer.
Actually, it's terribly simple.
And some people have tried to create expressive voices simply by just reading out a really boring text in an exciting way.
And whatever they get out, I mean, you can read the phone book in an exciting way and that's what it sounds like.
So with some of these voice building techniques, like unit selection or HMM-based ones, if you put in one kind of data, the whole thing is just flavored in a happy way or in a more engaged way.
And then it implicitly sounds like that for everything it says.
So that's that's one way to get around that.
So when people try to do some sort of happy, sad, angry, expressive voices, one rather simple approach is to actually partition your data into those different kinds of recordings.
And to just say, OK, well, this is the happy part of the voice. I'm going to record lots of stuff that sounds happy, build a voice out of it.
I'm going to record an angry voice with the same data, but make it sound angry.
And then you can switch these voices at runtime, depending on whether you need a happy or angry voice.
So that works.
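Switching between such voices at runtime is then just a matter of selecting a different installed voice before synthesizing. A minimal sketch, assuming two hypothetical voices named "my-voice-happy" and "my-voice-angry" have been built from the partitioned recordings and installed (the voice names are made up; the MaryInterface API is from the 5.x releases):

    import javax.sound.sampled.AudioInputStream;
    import marytts.LocalMaryInterface;
    import marytts.MaryInterface;

    public class ExpressiveSwitchDemo {
        public static void main(String[] args) throws Exception {
            MaryInterface mary = new LocalMaryInterface();

            mary.setVoice("my-voice-happy");            // hypothetical happy voice
            AudioInputStream cheerful = mary.generateAudio("Good morning!");

            mary.setVoice("my-voice-angry");            // hypothetical angry voice
            AudioInputStream grumpy = mary.generateAudio("Good morning!");

            System.out.println(cheerful.getFrameLength() + " / " + grumpy.getFrameLength() + " frames");
        }
    }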
Yeah.
And OK, yeah, if you like, I can kind of back up a little and go back to the history of MaryTTS, because some of those things were projects in its history.
Yeah, no, fire ahead.
Yeah.
We've all the time in the world here in the interview.
Oh, good.
OK, so I mentioned that Marc Schröder was the mastermind behind Mary for many years.
And for the community,
it's rather sad that he left, because he went to work for Google a couple of years ago.
And so he kind of had to remove himself from the project.
And it's been a bit slow since then.
But before he left, so he left in 2012, I believe.
So the dozen or so years before he left, there were some ups and downs as well.
So mostly it was him working on it by himself.
It was kind of his pet project, if you will.
And at DFKI, he also had other things to work on.
So he didn't always have time to develop Mary further, although he had very clear ideas on some of the things that would be cool to do.
And many of those he actually did manage to get done by applying for funding for projects that specifically did these things.
So the first rather interesting, larger funding that he received was a project called PAVOQUE, which was about the parameterization of voice quality and prosody for synthesis.
And that was a project funded for three years by the German Research Council.
I think it was from actually was split up into two years and then another year tacked on a bit after a bit of a hiatus.
And that was interesting, because for the first time he was able to actually hire researchers to work with him on Mary, where previously it was basically just him.
And occasionally the odd student project or student assistant who would be able to implement a little bit of something or other, but no continuity in the development really.
And that's kind of reflected also in the code style, unfortunately, because different people over the years worked on different parts and they all had their little...
I mean, some of these people were, so,
hardly anyone who worked on Mary was actually a thoroughbred programmer.
So a lot of them were just basically students in computational linguistics who were learning Java as they went.
Hello? Yes, hello, can you hear me?
You seemed to drop out after "learning Java as they went".
Oh, sorry.
OK, I forget what exactly I said, but anyway, so the students.
Yeah, OK, so there were some students over the years who were working on it, either as a software project that they needed for one of their courses, or maybe
there was a little bit of money left over for student assistants who would be able to work on one aspect or one feature for a couple of months or something.
So there was no real continuity besides Marc working on it.
But that changed with this DFG project, PAVOQUE, because Marc was able to hire a postdoctoral researcher to work with him for three years on the system.
And that was fantastic.
Unfortunately, what happened after one year was that the postdoc decided to take a job in industry as well.
He preferred to leave for a better-paid job elsewhere.
And so he had to hire another postdoc.
And this other postdoc of course came with a completely different background and different expectations.
So they were working on different things but within the scope of that project.
But again, after one year, the funding was, well, there was a bit of a hiccup in the funding.
And so that person didn't want to hold his breath and also left for another industry job.
And by the time that the third year of this project got funded, there was another postdoc who had to come in and kind of step up and do something else.
And that happened to be me. So this was in 2010.
So I had just finished my PhD at the time.
And I had been working with articulatory synthesis.
I didn't have much of a Java background myself.
But I really, really learned a lot during that year.
And that was, I mean, that was just an amazing time.
It was really, really great to be working with Marc.
And it wasn't just the two of us.
There was also a European grant,
a project called SEMAINE.
That was even bigger than the PAVOQUE project.
This one had, I think half a dozen partners in different European countries working on components for a,
for an emotional multi-modal dialogue system.
So it's actually pretty neat.
They had a virtual character on screen that would react to your own mood or the way that you were talking.
So it also integrated the webcam.
It integrated various feature detection algorithms for facial features, voice features, all kinds of things.
And it would try and gauge the emotional state of the user.
And it would try and manipulate that state as well.
So if you, there were a number of different virtual characters that you could talk to.
One of them was permanently moody, permanently happy or permanently angry, whatever.
And they would try and kind of prime you emotionally.
And that's where these voices that you might have heard, the British voices, come from.
So there's the Spike character, who's always angry and aggressive.
There's this other character called Obadiah, who's always kind of depressed and gloomy.
And there are two female characters.
Poppy was always very cheerful and obnoxiously happy.
And Prudence, which is a weird name.
But anyway, so she's kind of the matter-of-fact, neutral persona.
And these four characters, they actually had different...
I mean, they were virtual, like CG characters, just standing there on screen and, well, just talking to you really.
And, and the output, the speech output was generated by Mary.
And that was, so that project was coordinated by Marc.
And it allowed him to do all kinds of more interesting things with the emotional speech.
And he had a PhD student working with him for a number of years in that project:
Sathish, Sathish Pammi.
And also, a number of other people were working on it.
Most importantly, Marcela Charfuelan, who was at DFKI for many years.
But very recently, left for a job in academia.
And so this was a small team, but a small, steady team of about three to four people working on Mary over a couple of years.
And so when I joined in 2010, it was really exciting to be working so closely with these excellent colleagues.
And, yeah, so that really gave me a great boost, in both doing something worthwhile and working on a really good application.
And trying to make it better and implementing new features, such as the manipulation of prosody, and working on these voice building tools.
And that is something that I, yeah, didn't want to let go of again.
So I kept in touch with Mary development over the years since then.
And a couple of years later, in late 2012, I returned to Saarbrücken in a different capacity.
But as Marc had left, it seemed natural to kind of take over maintaining the project.
And at the same time, unfortunately, I also had other responsibilities. So I'm currently working on completely unrelated things in my, well, if you want, in my main project.
Yeah.
But the great news is that we have a number of projects running, and one of them is starting actually next month, where I've also had the opportunity to hire more people that are going to be working basically full time on MaryTTS.
Oh, super. Good news.
Yeah. So we have, we have solid funding for the next four years to improve and continue working on Mary TTS.
And we have very, very exciting ideas on how to make it better and easier and more efficient and all those things while adding functionality that will hopefully actually have a research impact as well, especially regarding trying to predict or to infer the way that things should be read from text.
Basically, so using measures or, well, using techniques from information theory to model how it maps onto prosody, and then how to use that to improve the quality of the synthesis output.
Well, that's actually very good news to hear.
And it also means now that if we get all our bug reports in, you'll get to them first.
Yeah, there's a huge backlog, unfortunately, but there are a lot of really interesting things in the pipeline.
I can go on if you want.
Yeah, fire ahead, please.
Okay.
So the voice building itself is something that basically anyone can do, but up to now they've had to have a certain, you know, well, let's say a tolerance for pain or frustration.
So what happens when you do this voice building thing?
So there's like a voice SDK, if you will, included with Mary that allows you to record your own voice.
In fact, you can even go a step further back: you can do your own language, if you want.
So if you happen to be knowledgeable, or at least have a cursory familiarity with, some language that's not yet supported by Mary, you can in fact create, or bootstrap, the bare minimum modules required
to predict the acoustic parameters for that new language, and that's basically done in a separate step.
We currently have a project doing that for Luxembourgish. Interestingly, Luxembourgish does not have a TTS system, apparently.
And so that's something we're working on at the moment.
You're actually heading into territory that I'm quite interested in now, because I'd like to,
as I said before, have a Dutch language version.
Yes, so what do I need to do?
Well, the vanilla way to create really kind of baseline support for a new language is to take a large text corpus.
Let's say the Wikipedia in that language and to harvest that text corpus for words to create a new dictionary, a new pronunciation dictionary.
And also to create the inventory, or the phone set, the inventory of the separate sounds in that language.
And that will allow you to go from text to pronunciation.
Unfortunately, the process there is even more painful than the voice building process, because, I mean, it's fantastic for what they did back in 2005 to 2006.
But it hasn't really been improved since and there are a couple of things that were never really that solid.
And it was a research tool at the time and it still is the same basically.
So one of the things that we're that we're dealing with is just reducing the technical overhead in these things.
So for instance, processing the Wikipedia in a given language takes a huge amount of memory.
Now, unfortunately, it was implemented using...
I don't know why, but they decided to implement it as an SQL database.
So you have a certain technical overhead in creating or even storing and processing the raw data, because you need to install whatever SQL server you need to run all these queries.
So you can't just do it on any random windows box.
But that's something that we're trying to make simpler, simply by moving on to slightly more modern mass processing techniques.
But what you end up with is a dictionary that will give you the pronunciation for all the words that you know how to transcribe.
And it will use that knowledge to predict the pronunciation of unknown words.
So, just to stop you there.
So you would need to supply it:
here's the text that's been read and here's a WAV file, I guess, of that text being read by a human? That comes a bit later, or basically the next step.
So the first step is: here is a Wikipedia page. And what do I need to do to that Wikipedia page?
So the Wikipedia serves two things.
First, it gives you a reasonably good coverage of what the sounds of that language are.
Yeah, and what the most frequent words are.
And if you transcribe those most frequent words and then train models to predict the pronunciation of the rest of the words, then you're good to go for the pronunciation stage.
Yes, can you just... sorry, I didn't quite get that.
Okay, so what you need is some sort of pronunciation model.
If you throw text at the system, it's supposed to give you the phonetic units, well, the sequence that you need to read it out.
So, I mean, in some languages it's super straightforward, but in other languages it's completely different.
So English is kind of difficult because there are a lot of words that are spelled completely differently than they're spoken.
Yeah, Irish would be a nice example of how horrible it can be to have really, really bad correspondence between the written and the spoken form.
Whereas Spanish, for instance, would be straightforward.
You could do a Spanish synthesizer based on a handful of rules, because the correspondence between the spoken and written form is so direct.
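To illustrate that point, here is a toy letter-to-sound sketch for Spanish in Java. It is not MaryTTS code, and the phone symbols are made up; it only shows how far a handful of context rules gets you for a language with transparent orthography, where English would need a dictionary plus a statistical model.

    /** Toy illustration of rule-based grapheme-to-phoneme conversion for Spanish.
     *  Not MaryTTS code; phone symbols are ad hoc. A handful of context rules
     *  covers most cases because Spanish spelling is close to the pronunciation. */
    public class ToySpanishG2P {

        public static String toPhones(String word) {
            String w = word.toLowerCase();
            StringBuilder phones = new StringBuilder();
            for (int i = 0; i < w.length(); i++) {
                char c = w.charAt(i);
                char next = (i + 1 < w.length()) ? w.charAt(i + 1) : '\0';
                switch (c) {
                    case 'c':
                        if (next == 'h') { phones.append("tS "); i++; }           // "ch" -> affricate
                        else if (next == 'e' || next == 'i') phones.append("T "); // Castilian "ce"/"ci"
                        else phones.append("k ");
                        break;
                    case 'q': phones.append("k "); if (next == 'u') i++; break;   // "qu" -> /k/
                    case 'h': break;                                              // silent
                    case 'l':
                        if (next == 'l') { phones.append("L "); i++; }            // "ll"
                        else phones.append("l ");
                        break;
                    case 'ñ': phones.append("J "); break;
                    default:  phones.append(c).append(' ');                       // most letters map directly
                }
            }
            return phones.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(toPhones("chico"));  // tS i k o
            System.out.println(toPhones("queso"));  // k e s o
            System.out.println(toPhones("calle"));  // k a L e
        }
    }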
Okay, so with the Wikipedia article, I presume you're using Wikipedia because of the license.
Yeah, that you can use it.
So that's a simple text file.
Would you then need to manipulate the text file in some way, or is it just enough to, you know, say that is Wikipedia text?
Well, it would be a large text file.
So yeah, yeah.
You're basically you want very, very large coverage.
So you would be using, I don't know, hundreds or thousands of pages in kind of an XML dump or a markup dump or whatever.
Okay.
And processing all that.
And what it does, besides setting up the pronunciation of words, is to give you a list of sentences.
And there is a greedy algorithm that tries to maximize the phonetic coverage of your phone set over this large text corpus, to give you, I don't know, a couple of hundred sentences that, if you read those out and record them, will give you a good database for a voice.
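A rough sketch of that greedy selection idea, not the actual MaryTTS corpus-selection code: each candidate sentence is represented by the set of phonetic units it contains, and we repeatedly pick the sentence that adds the most units not yet covered.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    /** Greedy coverage selection sketch: pick sentences that add the most
     *  not-yet-covered phonetic units (phones, or diphones in practice). */
    public class GreedySelection {

        public static List<String> select(Map<String, Set<String>> unitsBySentence, int maxSentences) {
            Set<String> covered = new HashSet<>();
            List<String> chosen = new ArrayList<>();
            Map<String, Set<String>> remaining = new HashMap<>(unitsBySentence);

            while (chosen.size() < maxSentences && !remaining.isEmpty()) {
                String best = null;
                int bestGain = 0;
                for (Map.Entry<String, Set<String>> e : remaining.entrySet()) {
                    Set<String> gain = new HashSet<>(e.getValue());
                    gain.removeAll(covered);
                    if (gain.size() > bestGain) {
                        bestGain = gain.size();
                        best = e.getKey();
                    }
                }
                if (best == null) {
                    break; // nothing left adds new coverage
                }
                covered.addAll(remaining.remove(best));
                chosen.add(best);
            }
            return chosen;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> corpus = new HashMap<>();
            corpus.put("sentence A", new HashSet<>(List.of("a", "b", "k")));
            corpus.put("sentence B", new HashSet<>(List.of("a", "b")));
            corpus.put("sentence C", new HashSet<>(List.of("s", "t")));
            System.out.println(select(corpus, 2)); // picks A, then C
        }
    }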
Yes, I understand that that's the next step.
Okay.
But so, you dump in all this Wikipedia stuff, but then what about something that would be geared towards talking to a child? If you summarize Wikipedia, the essence of Wikipedia is kind of very formal and stuffy language, wouldn't you say?
Would it not be better to feed it a whole load of children's books or something of that level?
Or would you end up then with a language that would be, you know, kindergarten Dutch or middle-school Dutch, that sort of stuff?
Well, they're, I mean, they're not completely separate, but there are two different aspects, if you will.
So one is basically the text domain.
And the text domain is not as critical, because some of these... I mean, sure, if you want to say something like, I don't know, colonoscopy, and you have that word in your database, then of course it's easy to make it sound good, because you're just going to lift all the units in one block and have it read out.
But that word is probably not going to be in there, or not going to...
So that's basically just the text domain, but the way that you read it is, of course, then going to affect the way that the voice, the TTS voice, sounds at the end.
So you can kind of split these things up and you can say, okay, I'm going to take the Wikipedia or, I don't know, whatever, the
fairy tale one and read it nice and light, and then read the Wikipedia in a nice sombre voice.
You could do, yeah. Of course, the problem is getting Creative Commons or freely licensed material from people.
Yeah, well, I mean, another way to do it would be to use, I don't know, public domain works.
Yeah, we were thinking about that and we had a look at some of the books, but even then, everything that's gone out of copyright is so old that the language has moved on. You know, it's very, very old,
I don't know, old Dutch wouldn't be the right word, but more your grandmother's type of Dutch.
Yeah, yeah, but incidentally, that might be ideal for fairy tales, but that's a certain text type that you might not want for, I don't know, reading your text messages for you or reading out your email.
But you could obviously feed it, like, Wikipedia and these books as well, and then you would get... that corpus, that sort of distillation process, will give you a list of texts that it wants read. Is that correct?
Pretty much, yes. You're not tied into synthesizing only that, but if you have a domain-specific corpus and then build a voice from it, then things that are within domain are probably going to sound better, because it's going to find more
or larger chunks of the desired output in that corpus than if you try and synthesize something else.
So an extreme example would be... there's a, I'm sure you've come across the Festival synthesis system.
I have, another text to speech engine that is incredibly difficult to work with, just like MaryTTS, I have to be honest.
Yes, yes, although... anyway, more on that later.
Yeah, yeah.
So the extreme example is there's a website called festvox.org that provides resources, among other things, for creating voices for Festival.
And one of the demo voices, or one of the demo recording databases, is a really small data set written, sorry, read by Alan Black, one of the original programmers on Festival, which is nothing but numbers, to create a speaking clock.
Actually, I think there is even a, whatever, a package for Linux distributions out there that would just have that speaking clock as a, as an application.
But the point is if you use that to synthesize the time, it works wonderfully, even though it has a very strong Scottish accent, or maybe because of it, I don't know.
But it will not, it will fail miserably if it, if it's supposed to read out, I don't know, a poem or something like that.
I guess not. Yeah, this is very clear. Cool.
Carry on. So, okay.
And so the voice building process itself takes the list of sentences that you've grabbed using this greedy algorithm.
Or you can just use whatever other prompt list that you want.
I mean, it should have good phonetic coverage, and it should give you good coverage early.
If you're interrupted, or if you only have 50 sentences or something, they should be the right 50 sentences to give you good coverage, but it would be better to have several hours of data if possible.
Now, if you do that, you then sit down, you know, and you record, hopefully under good acoustic conditions, maybe not while the neighbor is mowing his lawn outside.
You record this and then you have the text that you read, and you have the corresponding WAV files, or whatever audio format you choose.
And then you start.
Would you not need to keep them in sync somehow? Well, they would be, in the next step, basically.
So let's just say you have 500 utterances of random Wikipedia text and you have the corresponding recordings that you just spoke right out.
And then the next step is to segment those into phonetic units, and that can be done using, well, you can do it by hand.
That typically gives you slightly better quality, but of course it's hugely, hugely time consuming and you have to know what you're doing.
So the obvious solution is to use automatic speech recognition, specifically in a kind of a guided way, which is called forced alignment.
So you know what's being said, you know what the units are, you just have to place the boundaries correctly. And so there are a couple of different ways to do that.
And the one that Mary tended to use for a long time is something actually lifted straight out of the FestVox tools, which are also open source, called EHMM.
And that worked reasonably well.
It worked reasonably well, but more often than not, it would also fail just completely on some sentences, which is to say, OK, well, maybe the first 500 frames line up with the audio and then we have 30 seconds of nothing where it doesn't match up with the audio at all.
And just weird glitches like that would happen more often than not.
There are a couple of different ways to do it. There's also something that's widely used called HTK, which is a hidden Markov model toolkit developed at Cambridge University, which, well, comes with a license burden.
So it's open source, but you may not redistribute it.
So that makes it a little problematic to use. You have to go to them, register, and then download the source, then you can compile it yourself, and then you can use it for forced alignment.
And, yeah, I mean, it works, but...
No, carry on. Maybe you're going to cover it. No, go ahead.
I came across a website, and I cannot find it now, where you would feed it a line.
It would show a line of text and then you would press play and, sort of like a karaoke machine, a line would go across, and you were supposed to speak as the line was going across the piece of text.
And you could speed up or slow down the speed of the line going across, so that it would know where it was in the text.
Any idea of that?
It was linked on the MaryTTS site during my googling of it.
And I have to admit, I'm not exactly sure.
There is a recording tool in Mary TTS.
And I'm not sure if maybe that's part of what you're referring to. I don't know.
But yeah, I mean, not to worry, carry on.
Yeah, it comes down to how you're going to instruct your speaker, or how you're going to actually
stage the whole recording session.
I mean, one approach would be just to sit them down and say here read those sentences and that's it.
Or you could actually go and try and direct your voice talent in a certain way, to make sure that he does speak or produce those utterances in a way that's more conducive to what the overall domain goal is.
So we had the case of the expressive synthesis.
You're telling him, okay, well, maybe you could read these sentences in a happy voice.
And then he'll go and do that, if he's good, or not.
Or if you want, you can try different, like, speech rates, different, well, I guess, registers, or something, or different accents if you want.
So things like that can all basically be done to give a certain flavor to the overall results.
But the TTS system is not really aware of that.
That's kind of a side effect of your recording.
So it's kind of a sub channel.
Yeah.
And it will be something you pick.
You would end up with a different voice, the happy voice, the sad voice, the thingy voice.
And you would select that... if he said something angrily, you'd use the angry voice, and then...
Yes.
Yeah, okay.
So that's later.
So in my particular, very specific use case, it's related directly to... I want a nice, soft, calm voice reading the whole time.
So that's what we will be going for.
So I interrupted you.
Keep going.
No, no.
Okay.
So then after you process it with one of these forced alignment ASR engines, you then have a timeline.
So you know where, in which file... basically, you have the audio and you have a set of timestamps that correspond to the boundaries between the phonetic units.
And those are basically the basic building blocks of the unit selection voice, for instance.
If you're going to go for the traditional unit selection synthesis, which sounds the most natural in domain, then that's how you do it.
And you then go and add some features, based on this phonetic annotation.
Those units are then enriched with some features that are predicted by, essentially, the text processing components.
And at the end of the day, you have, well, I guess, an enriched form of your recorded database that is suitable for synthesis.
So then you can basically go and predict the pronunciation for a given text.
And based on the pronunciation and some other features, look for the appropriate, or the best possible, units matching those features in your database, using a cost function really.
Yeah.
And so then you take them, you concatenate them, and then you get your new utterance, your synthetic utterance, for the given input text, using the voice that you recorded, or using the voice data.
And depending on how you recorded it, it will sound like that person in a certain way.
And that's basically the process.
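For a feel of what a cost function in that step might look like, here is an illustrative sketch in Java. The feature names and weights are made up for the example and are not MaryTTS's actual features; the point is only that each candidate unit gets a target cost (how well it matches the predicted specification) plus a join cost (how smoothly it connects to its neighbour), and the search keeps the sequence with the lowest total.

    /** Illustrative unit selection cost sketch (not MaryTTS code).
     *  Each candidate unit is scored by a target cost against the predicted
     *  specification plus a join cost against its neighbour; the synthesizer
     *  searches for the candidate sequence with the lowest total cost. */
    public class UnitSelectionCost {

        static class TargetSpec {            // what the text processing predicted
            double pitchHz;
            double durationMs;
        }

        static class CandidateUnit {         // a stored unit from the recorded database
            double pitchHz;
            double durationMs;
            double[] edgeSpectrum;           // spectral frame at the unit boundary
        }

        /** How far the stored unit is from the predicted target (weights are arbitrary). */
        static double targetCost(TargetSpec t, CandidateUnit u) {
            return Math.abs(t.pitchHz - u.pitchHz) / 50.0
                 + Math.abs(t.durationMs - u.durationMs) / 20.0;
        }

        /** How badly two neighbouring units join, as a spectral distance. */
        static double joinCost(CandidateUnit left, CandidateUnit right) {
            double sum = 0;
            for (int i = 0; i < left.edgeSpectrum.length; i++) {
                double d = left.edgeSpectrum[i] - right.edgeSpectrum[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        /** Total cost of one candidate sequence for a given target sequence. */
        static double totalCost(TargetSpec[] targets, CandidateUnit[] units) {
            double cost = 0;
            for (int i = 0; i < targets.length; i++) {
                cost += targetCost(targets[i], units[i]);
                if (i > 0) {
                    cost += joinCost(units[i - 1], units[i]);
                }
            }
            return cost;
        }
    }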
Okay.
Perfectly clear.
Very, very complex.
A lot more complex than I thought it would be.
Well, I've been going into a lot of detail.
It's not that terribly complex once you've done it a few times.
Actually, I was teaching a seminar earlier this year where the participants were going into our recording studio
and recording one and a half hours of a standard prompt list.
And then just using Mary to create a new TTS voice from those recordings.
So they basically had their own recordings and their own
TTS voice that was usable with Mary.
They did that both for the unit selection and for the HMM synthesis.
And that worked.
I would say about a week or so.
So basically they went from zero to working voices in a couple days.
Okay.
Do you happen to have how to's on that that are available to me?
Again, the state of the documentation is sadly a bit out of date.
So that's what I meant when I said you have to have a certain tolerance for frustration.
It can be done. It certainly can be done.
And it may work out of the box, or you may run into one or another problem, and maybe it's a little clunkier.
But I'm happy to fix the documentation as I go.
That would be quite simple if you're willing to help me along.
Yeah.
So I mentioned that MaryTTS is on GitHub.
And so basically there's an issue tracker crammed full of open bugs.
And there is a wiki that is sadly out of date.
But essentially that's where everything happens.
And if you find something on one of the wiki pages that, for instance, does not match up with what is in the current code,
then you're more than welcome to edit that.
All you need for that is a GitHub account.
Okay. Super.
The process of using it, the installation, and something as simple as, you know,
piping, you know, echo hello world to MaryTTS.
Why is that so complicated?
It's not necessarily complicated.
I'll go back into a little bit of history.
So up to version four, Mary was designed as a pure client server system.
So the idea was that you would run it as an HTTP server.
And you could connect to it using a client either on the same machine or a different machine.
And all you needed to do was to send a request to the server.
And some of the parameters of the request was the input text and then whatever voice you wanted or something like that.
And the server would respond with essentially the output of whatever you wanted.
So if you wanted audio, it would send you back a WAV file.
And that persevered into, well, basically into the current code.
However, in version five, Marc designed something that is slightly more lightweight, which he calls the MaryInterface.
And that can be something that sits basically just within an application.
Now, the most important change in Mary 5, which is the current version,
or the era, if you will, or family,
is the fact that it now uses Maven for dependency management and deployment and all those things.
It's modular in a different way.
So it now uses Maven modules for some of the things.
And it gets a little confusing because some of the Mary modules are in different Maven modules and so forth.
So there's a slight mismatch in how the modularization works within the MaryTTS components and within the Maven project.
What's Maven?
Oh, sorry. Maven is a Java-centric build automation system,
which uses a set of conventions that are actually very straightforward to follow:
Where you place your source code, where you place your unit tests, where you place your resources.
You run basically a simple command and it automatically puts everything together.
Compiles it, tests it and deploys it into a jar or actually it assembles it into a jar.
And then you can, you can upload that jar to an artifact repository where other people can find it.
And one of the really cool things about maven is that it has a concept of remote repositories for commonly used artifacts.
So if you have, let's say you want to use some common third-party library in your code, all you have to do is write a certain instruction into your project model.
It's really just an XML file that tells Maven to use that particular library.
And it goes and looks for that library at that version in a public central repository, automatically downloads it to your local machine, puts it in a local repository, puts it on the classpath.
And if you want, also bundles it with your application.
So that makes it super, super straightforward to say that maybe you want to use, I don't know, MaryTTS in your project.
So all you say is, I have a dependency on MaryTTS version 5.1.
You just use it, and then suddenly, automatically, you have all of the stuff on your classpath, and you can even have content assist in Eclipse or something.
So you can just type new MaryInterface and then that's it, and then you can synthesize stuff.
So using Maven, really all you need is a little block that declares a dependency on MaryTTS, and then maybe three or four lines of Java, and then you're done.
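As a sketch of what those three or four lines might look like: the Maven coordinates in the comment (de.dfki.mary:marytts-runtime:5.1) are my assumption of the artifact name, so check the project README, and a language and voice artifact also needs to be on the classpath.

    // Assumed Maven dependency (verify against the MaryTTS README):
    //   groupId: de.dfki.mary, artifactId: marytts-runtime, version: 5.1
    //   plus a language/voice artifact for the language you want to speak.
    import javax.sound.sampled.AudioInputStream;
    import marytts.LocalMaryInterface;
    import marytts.MaryInterface;

    public class HelloMary {
        public static void main(String[] args) throws Exception {
            MaryInterface mary = new LocalMaryInterface(); // in-process, no server needed
            AudioInputStream audio = mary.generateAudio("Hello world");
            System.out.println("Synthesized " + audio.getFrameLength() + " frames of audio");
        }
    }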
So I get it, it's sort of like CPAN in the Perl world.
Yeah, exactly, exactly.
And so that brings me to the newer concepts that I was mentioning earlier about voice building, how to simplify it and make it more performant.
So Maven is fantastic and is pretty much, I mean, I would say in the Java world it's probably by now an industry standard.
For a long time it was Ant, but Ant, well, it gives you a lot of power, but you also have to write a lot of XML before anything happens.
And Maven just kind of throws all that out and says, it's just going to work.
If you follow the convention, if you want to customize, go ahead, but then you do have to write XML.
Okay, I'm familiar with that. So good.
Okay.
Yeah. Okay.
Clear.
And there is a newer build system that is even more flexible.
So the nice thing about Maven is the concept of a build lifecycle.
So implicitly, you already have a standard model where you compile, or you initialize and make sure that certain things are valid.
For instance, you can assert that you're using the correct version of the JDK or something.
Then it compiles your classes.
It moves all the resources into place, runs your unit tests, assembles everything and then installs it or uploads it.
But if you want to deviate from that or you have some other layout, then you're in trouble because then you do have to write a lot of customization XML.
And there's something that has learned and taken the best out of Maven and Ant and various other things, which is called Gradle.
And Gradle is a build automation system that was originally focused on Java, but has become a very strong build tool for C++ and other languages as well.
And actually, as of a few months ago,
it is now the standard build tool for Android.
So it's become very mainstream at this point.
And I've been kind of watching and using it for the last two, three years or so.
And I really, really like what I saw. So I'm at this point, I'm deeply convinced that I've bet on the right horse here.
The thing about Gradle is that it allows you to apply conventions if you want, but you don't have to and you can very easily redefine your workflow.
And at this point, we're using Gradle for everything in my group.
So we're using it for compiling and publishing papers, doing some other stuff, like tying in native binaries into a certain workflow.
Ultimately, what you do in a Gradle build is you formalize a set of tasks and the dependency between those tasks.
So you end up with a task dependency graph, which tells you this thing is a task that does this, and it depends on some other task.
And it gives you all this stuff like caching and dependency management and testing and deployments and all that for free.
You can get parallelization for free. So if you have, I don't know, several unrelated tasks that could run in parallel, that can run in parallel, you don't have to write a single line of code to do that.
And it just allows you the freedom of defining your own workflows.
And so I'll back up a little bit. The voice building process as it's implemented in Mary or as it was designed almost eight years ago, is implicitly following a model like that.
Now this was before Maven was mainstream before Gradle existed.
There's essentially a number of steps that you have to accomplish and some of these steps depend on previous steps.
And at the end, you have this big assembly task that kind of creates the package that you then install with Mary to get that voice running.
And there's a lot of checking going on. There are some external tools that need to be called, and all that stuff.
So that's all implemented in a rather, unfortunately, a bit clunky way.
And it's also tied up with some GUI code, which makes the whole thing rather messy and a bit hard to manage.
But at the end of the day, you have data, you process the data through a number of steps.
And at the end, you have a product and this product is essentially a zip file containing your voice in the Mary format.
And you can send that to, I don't know, you can put it on a server, and you can point your MaryTTS component installer, which is the little GUI that allows you to install more voices of the ones that are already out there.
You know, point it there, download and install this voice.
And then suddenly you have that voice in your Mary installation. If you run the Mary server, then it will pick up that voice and have it available.
And so these concepts are all very, very overlapping with the concepts of remote repositories, artifacts, resolving these artifacts in remote repositories, installing them locally or putting them in a cache and installing them from the cache.
And those are all things that are done in Maven and also in Gradle.
So it seemed obvious to kind of migrate this kind of self-made, or home-baked, task workflow of voice building to something more modern and well maintained, like Gradle.
So that's actually what we're doing.
So we have a voice building plugin that I'm currently working on that allows you to go from your raw data all the way through to the final voice, basically, in just a few steps.
You get a GUI for free, Gradle comes with a GUI. You get parallel processing for free. It's much easier to log stuff, to test things, to ensure that the output of one task is actually valid before the next task tries to consume it.
And you get external process handling. So there are a number of different external tools that we need to use, but it's very, very straightforward to just execute them with Gradle, as opposed to the current state of the Mary code, where you have to write around 10 lines of Java code to manage these external processes.
It's up to you whether or not you capture their output, and what happens with their output if they fail, and so forth. So it's much easier to do these things with Gradle, and so I'm very happy to see that this is actually working.
And so at the end of the day, you'll have open source voice building. So there are a couple of databases out there that are open source that you can tie into a project using the new voice building.
And all you have to do is git clone it and run Gradle and, well, it'll probably take a couple of minutes, but then you'll have your TTS voice, and you can deploy it to a remote repository.
You can find it there and then use it on some other computer simply by resolving it and installing it locally.
Yeah, good. Yeah, I'm just thinking from a Linux point of view, you know, you put it into the distribution repositories then and you do an apt-get install something-or-other.
You could do. I mean, right now the binaries, sorry, the actual voices themselves, some of which are actually rather large packages,
yeah, they are not as easy to host. So we have kind of a half-assed cloud hosting solution where we have a Google Drive, we have a Dropbox, and those things have some of the voices shared.
So if, for instance, the main Mary web server, which is also hosting these voices, is somehow down, the installer will try to fall back to one of those other cloud storage services.
But again, it's a little bit home-baked and doesn't work as well as advertised, unfortunately.
By offloading it to Gradle, it becomes very, very straightforward to just define a number of different places to look for these things, different patterns to find them and to resolve them in the path.
And there are also more advanced or more flexible hosting solutions now, specifically for binaries, such as Bintray.
I don't know if you've heard of Bintray, but... no? It's a service offered by JFrog, the company behind Artifactory, which is one of the most widely used Maven repository manager applications.
And you could think of it as a little bit like GitHub, but not for code, for binaries.
So if you sign up for an account on Bintray, you get a number of different repositories, for instance one for Maven, and Gradle and Apache Ivy and things like that can use it as well.
So you put your binaries there, and then people can just resolve them by pointing their Maven at that particular repository, or if you're using Gradle, there's a little function called jcenter() that applies that automatically, and then you can resolve these artifacts from Bintray.
Okay, perfect.
The Accessible Computing Foundation, Jonathan Nadeau, who I spoke about before, has got a Linux distribution called Sonar, and they are using the text-to-speech engine.
They've replaced the speech-dispatcher engine to use MaryTTS.
They're running that on a Raspberry Pi.
When I run it here on my quad core, blah, blah, blah machine with so many giga-quads of RAM, it runs like a hog.
How are they able to run it so fast and on such a small device?
Maybe it's a question for Jonathan, but I guess most of the people here would just like to pick a nice voice that they want, and they want to be able to
put in some text and hear the output.
Something as simple as that, with a nice voice that they like. What do I need to do?
It's actually very simple.
So the unit selection voices use a lot of memory and those are hogs.
If you're constrained for processing power or storage or both, then you'll want to use one of the HMM voices.
They're built from the same data, but using a completely different technique and what comes out of it is something that doesn't actually have the actual data in it.
So it's much smaller, but something that's trained on the data and just gives the parameters that you need to make it sound like that.
And those are the voices that are much smaller in footprint and also more efficient to run.
So I'm assuming that they're using one of those voices.
And HMM stands for human?
Hidden Markov model.
Okay.
It's actually a hidden semi-Markov model, but that's nitpicking at this point.
But yes, if you use one of those voices, you can run it on an Android phone or a Raspberry Pi.
Okay.
Do I still need to run the client and server for it to do that?
I doubt it.
So the client server stuff has a bit of overhead, and I mean, there are also some problems where I suspect that, for instance, a Windows firewall is getting in the way, things like that.
If you use the Mary 5 code, you can actually wrap it directly into an application and call all the synthesis routines internally.
So you don't have to worry about all that.
But do you have something yourself that's just a very simple wrapper on that, like a MaryTTS say command?
We don't actually have that.
It would be reasonably easy to build though.
There were a couple of... hold on.
I think there is actually an example project somewhere in the code that has something like that.
So there's an example project for the remote interface where you would have the interface running on a remote server that actually goes back to using this networking protocol.
And there's a second example project that runs the internal or the so-called local Mary interface, which does it within the app with no network involved.
I think you can find those on GitHub.
Let me just double check.
Because I can definitely tell you that is something that has been a major barrier to me just getting into this and even trying the voices.
Ideally, like with eSpeak and some of those, there's a command, espeak.
You can specify a text file.
You can specify a WAV file.
But if you don't, it goes standard in, standard out.
And then you have the option of dash V for the voice.
Even something like that where you can run it locally and test out your voices.
It would be just so nice.
Well, I think the biggest constraint here is that it has to be Java.
No, no, Java will be fine.
We can wrap it in a bash script or something.
OK, yeah, in that case, I see no problem.
So there's, yeah, I just checked.
You heard it here, folks.
Well, you see no problems.
Four hours later, I'm throwing the keyboard out the window.
No, seriously, if you look at the GitHub source page, there's a Maven module called user-examples, which has a submodule called example-embedded that consists of two files.
One is the POM, the project object model that Maven uses to build this project.
And the important things are the dependencies.
So you have a dependency on the voice, really.
You only need one dependency, because the others are actually transitive.
So the voice, in this case, in this example, is a voice called cmu-slt-hsmm, which is the HMM version of an ARCTIC database provided by CMU.
For speaker SLT.
So there is this, which is actually the example voice in the Mary code base.
Now this, this voice is the dependency for that project.
It has transitive dependencies, so that voice in turn depends on the English language component, which in turn depends on the MaryTTS core runtime library.
OK.
So you have transitive dependencies from the voice to all the runtime stuff that you need in Mary.
So all you need is a single dependency called, well, voice-cmu-slt-hsmm, which is available in various ways.
So you can either get it from Bintray or you can get it by locally installing Mary from source if you want.
But if you have that one dependency, then you can go to the other file, which is called
MaryTTSEmbedded.java, and consists of, well, just under 30 lines.
And the main class is only like, I don't know, ten lines, which just instantiates a LocalMaryInterface, which is already on the classpath because of your dependency.
And it actually says, OK, so you can load a voice or just use the default voice.
And there's a little method that says, generate audio, and you pass it some text.
And it creates an audio input stream, which in turn you can then either play or save as a WAV file, or do what you want with.
So the actual logic or the actual code that you need to write to use this is, I don't know, three or four lines.
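(For reference, here is a minimal sketch of those few lines, based on the description above. It assumes the voice-cmu-slt-hsmm artifact and its transitive MaryTTS dependencies are already on the classpath; the class name and output file name are just illustrative.)

    import java.io.File;
    import javax.sound.sampled.AudioFileFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import marytts.LocalMaryInterface;

    public class MaryHello {
        public static void main(String[] args) throws Exception {
            LocalMaryInterface mary = new LocalMaryInterface();   // picks up the default voice on the classpath
            AudioInputStream audio = mary.generateAudio("Hello from MaryTTS.");
            AudioSystem.write(audio, AudioFileFormat.Type.WAVE, new File("hello.wav"));
        }
    }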
Now, if you want to wrap it into something that runs on the console, then you can do that too.
I'm sure there are people who follow that.
I unfortunately wasn't one of them. However, that said, I think it does highlight an issue with all these projects, all of your projects, which is that I think you guys are so far into it,
you're forgetting the humans behind it.
Yeah, sorry. I really think there should be an example project.
Something you download, and you type in some text here and you get some voice out, would definitely help.
I do know some Java programmers are listening to this who hopefully will understand what you've just described and be able to produce that for us.
That would be nice.
Yeah, yeah, I apologize for being a little too...
No, no, no, I agree, I agree. It was a bit weird, but we can document this.
We will document this in a better way to make it very clear that it actually is very simple to use this.
And if we have the small project that basically just has a say utility or something that you can just type stuff or pipe stuff into, then that would be probably even easier.
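(One way such a say utility might look, sketched here rather than taken from the MaryTTS code: it reads text from standard input, optionally takes a voice name as its first argument, a bit like espeak's -v, and writes a WAV file. The class name, voice name and output path are assumptions for illustration.)

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;
    import javax.sound.sampled.AudioFileFormat;
    import javax.sound.sampled.AudioInputStream;
    import javax.sound.sampled.AudioSystem;
    import marytts.LocalMaryInterface;

    public class MarySay {
        public static void main(String[] args) throws Exception {
            // Collect everything typed or piped in on stdin into one string.
            StringBuilder text = new StringBuilder();
            Scanner in = new Scanner(System.in, StandardCharsets.UTF_8.name());
            while (in.hasNextLine()) {
                text.append(in.nextLine()).append(' ');
            }
            LocalMaryInterface mary = new LocalMaryInterface();
            if (args.length > 0) {
                mary.setVoice(args[0]);   // e.g. "cmu-slt-hsmm", if that voice is on the classpath
            }
            AudioInputStream audio = mary.generateAudio(text.toString());
            AudioSystem.write(audio, AudioFileFormat.Type.WAVE, new File("out.wav"));
        }
    }

(So something along the lines of piping "Hello Hacker Public Radio" into this class on the command line would leave you with an out.wav you can play.)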
Cool. Let's do this offline, actually.
And if there are people listening to this who can assist, there's a few people jump to mind who I know are Java programmers as well.
So it would be nice to have this done.
Just that when people hit the main page, you know, new to Mary TTS, try this.
Excellent.
Okay, cool.
Speaking of help that you need with the project, is there anything that you think, you know, how can people contribute, what sort of helpers are you looking for?
So let's see, I think one of the main problems is that we currently don't have a focus on integrating at the OS level.
I mean, Windows, macOS, Linux or various distributions, they all have their own ways of hooking into the speech APIs.
And that's something that I know not too much about and that has been on the to do list on Mary for a long, long time, but it hasn't been done.
So for instance, there's a Java Speech API that might or might not be easy to integrate, to create something that would simply run as a TTS voice in the existing frameworks that are out there for macOS and Windows. I don't know, it would be cool to have.
But it's unfortunately, it's not a priority because we don't have a project that depends on it right now.
And likewise for Linux, I know that people have tried to integrate Mary, and there's been a bit of, well, criticism regarding certain aspects of how Mary works. It would be nicer to work directly with people who are using it and who know how to develop for this.
But again, unfortunately, that was limited due to manpower at our end.
So yeah, I think those are the things where we would really benefit from user contributions.
Contributions in the sense of somebody who knows how to code for these frameworks.
And I know that some people have already done this.
But unfortunately, there's no direct dialogue. I don't think they've really gotten involved on the GitHub level, which is the way to do this most efficiently, I think.
But you're definitely open to contributions coming back for that sort of thing.
Absolutely, absolutely. That's what it's for.
And then we have also mentioned people helping out with the wiki and things.
And in my own way, when I try and sort of get all this stuff clear in my head, maybe I can get a better
overview myself of the documentation aspects, and we'll work through updating the documents for recording a new voice, trying to make it as easy as possible,
if we do have some voice talent, as you say, that is willing to give up their time.
Super. I need to let this settle down. Is there any chance that maybe later on I could interview you again, a follow-up, perhaps with somebody else who knows what they're talking about?
Absolutely. Yeah, I think that would be wonderful, because right now we're kind of at the dawn of a new stage or a new phase in Mary development.
So there were several different phases in the past. One was where it was not open source yet. Then there was one where it was open source.
And we had an increasing number of projects working on it and developing on it. And then a bit of a hiatus, like I said; after those projects ended, it was just Mark working on it.
Even he was too busy to focus on it. And then he basically just pushed it onto GitHub in preparation for leaving for Google. And so it was basically handed off to the community.
And now that we have funding coming in for people who are actually going to work on this full time, this is the beginning of a new phase of, I don't know, three or four years, where we can really focus on stuff and really get a big push towards more interesting features, improving the existing features,
improving documentation, all that stuff. So it's wonderful, at the onset of this particular new phase, to have this interview. So that's great that we did that.
But having said that, it would be great, maybe in a year or two, to sit down again and talk about some of the things that were on the roadmap
and that may or may not have been accomplished, and perhaps other things that are not on the roadmap that will have been accomplished by then.
So it would be great to have a kind of a follow-up and see in one or two years how it's developed since today.
Have you considered going to FOSDEM, presenting at FOSDEM at all, or other developer conferences like that?
I haven't, haven't actually. Yeah, okay. It's certainly food for thought.
I, well, I see Mary as, well, so far, Mary has been mainly a research tool that happens to be usable for certain applications that make it useful to, I don't know, enterprise developers or enthusiasts and users.
But it's not really as mainstream as some other open source tools. And I guess there are a handful of people who have contributed more or less steadily over the years in the community.
But because it was originally an in-house tool, it's kind of still in that, in that little nest, in that little basket waiting to be released or to be adopted by the wide world.
And so I don't know how far people in the open source community will have become aware of Mary.
Well, one way to definitely get developers' attention, in the Linux world at least, and the FreeBSD world, is to pop over to FOSDEM, the largest developer conference, held in Brussels, which shouldn't be too far away from you.
And they're currently looking for papers. So if you wanted to give a talk, I guarantee you that it would be a fun-filled weekend.
I, of course, will be there as well. And I can then give you a beer if you're into beers. If not, a nice cup of coffee.
Oh, Belgian beer is good.
It is indeed, sir.
Let's see now, the 15th? I don't think we can do it this time around.
31st of January and the 1st of February, 2015.
So the call... go on.
Sorry, I assume there is a deadline, though, for proposing talks?
Yeah, that would be the 1st of October.
Okay, yeah, that would be a little bit exciting, because the 1st of October is also the date on which our project actually begins.
And I'm actually going to various conferences before then this month.
So you can fire them off an email. I mean, how difficult is this?
Okay.
Yeah, it's certainly food for thought. And it's something where I would like to keep it on the radar.
Even if we don't manage to attend next year's, then the year after that should certainly be doable.
Yeah, okay, very good.
You should attend, anyway, to get a flavor of the thing and talk to some people.
Is there anything else that I haven't covered that you think you would like to share with us?
Not sure. Yeah, maybe not.
I can't think of anything at this point, but it doesn't matter; if you do, you can always record a show yourself or just send it to me and we can do a follow-up. Plenty, plenty, plenty of slots here.
Okay, I would like to thank you very much for taking the time, explaining this to me, and giving me a very good understanding of why I can't just simply echo text files.
And tune in tomorrow, folks, for another exciting episode of Hacker Public Radio.
Thanks a lot, Ken.
No problem.
Hello there. I'm Prudence, and I am a very matter-of-fact HSMM voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long.
Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello there. I'm Prudence, and I am a very matter-of-fact unit selection voice.
It's Thursday 18th of September 2014. This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long.
Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
I'm Spike, an HSMM voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long.
Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hi, I'm Spike, a unit selection voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hi, I'm Poppy, an HSMM voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long.
Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hi, I'm Poppy, a unit selection voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello, my name is Obadiah, an HSMM voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello, my name is Obadiah, the unit selection voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
This is the cmu-slt-hsmm voice. It's Thursday 18th of September 2014. This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello, I am the cmu-slt-hsmm voice. It's Thursday 18th of September 2014. This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project.
It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello, I am the cmu-slt-hsmm voice. It's Thursday 18th of September 2014.
This is HPR episode 1,599 entitled Interview with Ingmar Steiner from the MaryTTS Project. It is hosted by Ken Fallon and is about 86 minutes long. Feedback can be sent to Ken Fallon or by leaving a comment on this episode.
The summary is: Ken interviews Ingmar Steiner from the Mary Text-to-Speech project.
Hello, I am the default eSpeak voice, and you're stuck with me until one of you can make an easy command line application from the MaryTTS voices. Ha ha ha. This is me laughing.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our Contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.