Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr3446.txt (new file, 225 lines)
Episode: 3446
Title: HPR3446: Speech To Text
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3446/hpr3446.mp3
Transcribed: 2025-10-24 23:35:03

---
This is Hacker Public Radio Episode 3446 for Monday, the 18th of October 2021. Today's show is entitled Speech To Text. It is hosted by operator, is about 23 minutes long, and carries a clean flag. The summary is: I talk about converting HPR audio to text, and tagging. This episode of HPR is brought to you by archive.org. Support universal access to all knowledge by heading over to archive.org forward slash donate.
Hello, everyone. Welcome to another episode of Hacker Public Radio with your host, operator. Today I'm going to be talking about analyzing audio, extracting text out of the audio, and creating sort of a keyword database. The idea behind this is to have transcribed audio from HPR episodes, for the hearing impaired for example, and then also for metadata and/or tagging. The idea here is to eventually use natural language processing, this sort of AI-driven way, whatever you want to call it, to analyze audio and pull key terms out of it. So I'll go over some examples and my approach.
I probably spent maybe an hour, an hour and thirty minutes, piecing this all together as a quick little hacked-up batch job. The first step is obviously downloading the audio file itself. The second step is pulling that audio into VOSK, and from what I understand these are all open source tools. I don't know what their licenses are, but this is a proof of concept; we could use almost anything, and there are probably better tools for other parts of this, which I'll get into a little more. I'm just merely scratching the surface of all this. So the second step is, after you've downloaded the file, I'm using youtube-dl, converting the audio to WAV and letting the FFmpeg that youtube-dl invokes, which I think is embedded inside of it or some magic, do the conversion to a WAV file.
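A minimal sketch of those first two steps, assuming youtube-dl and FFmpeg are installed and on the PATH (the 16 kHz mono 16-bit PCM settings are what VOSK's models expect):

import subprocess

EPISODE_URL = "https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3446/hpr3446.mp3"

def fetch_and_convert(url: str, mp3_path: str = "episode.mp3",
                      wav_path: str = "episode.wav") -> str:
    # youtube-dl handles plain media URLs as well as video pages.
    subprocess.run(["youtube-dl", "-o", mp3_path, url], check=True)
    # FFmpeg does the WAV conversion: 16 kHz sample rate, one channel,
    # 16-bit PCM samples.
    subprocess.run(["ffmpeg", "-y", "-i", mp3_path,
                    "-ar", "16000", "-ac", "1", "-acodec", "pcm_s16le",
                    wav_path], check=True)
    return wav_path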
Now, after the MP3 is downloaded and converted to a WAV file, it passes through the Python sample script that comes with VOSK. And I'm just using the defaults for now.
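For reference, a minimal sketch of that VOSK pass, close to the test script that ships with the vosk Python package (assuming pip install vosk and a pretrained English model unpacked into a "model" directory):

import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(wav_path: str, model_dir: str = "model") -> str:
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    pieces = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # AcceptWaveform returns True at utterance boundaries; Result()
        # and FinalResult() both return JSON with a "text" field.
        if rec.AcceptWaveform(data):
            pieces.append(json.loads(rec.Result()).get("text", ""))
    pieces.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(p for p in pieces if p)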
From what I can tell, it's built on an input word list. So if a word is not in the word list, it's not going to pull it out as a word. For example, we got "fight covert viruses" instead of "fight COVID viruses." What I'd like to see is it actually recognizing words not based on a word list, because we can't possibly put in every single word that's ever going to be spoken in the English language. We need to pull out words that are made up, for example Carolina Con, or pwned. And there should be minimal effort around training the speech recognition to pick up new words, or at least get really close to those words. I don't know what that approach is going to look like, and I'll take any suggestions in that space. But for this example, it looks like the way VOSK works is that it's using an input word list, so if a word is not in the list, it gets whatever is closest to it. The closest word to COVID is covert.
Now, with the test sample Python script, you could probably just have it open-ended, not import the word list, and have it kind of make up its own words, phonetically typing words out. I'm not sure exactly how speech to text works in the event that the word doesn't exist, you know, when it's a made-up word or a word that's not a common dictionary word. So that's the first struggle, or the first point I'd like to make: I need a way to extract the words out of the audio, and those words might be made up in some cases. So for example, I don't know what the match threshold is, but you could say, okay, if the match doesn't reach 60% or greater, then use the word you think is phonetically being said. COVID might be spelled incorrectly, but it might be at least close enough that you could run it through a spell check program, and then maybe it would auto-correct it to COVID. So it doesn't necessarily have to be perfect. From the standpoint of words that don't exist, if you ran that through, for example, a cloud-based Google word list or spell check algorithm that has every single word in it, then we could take the word that looks like COVID phonetically and it would auto-correct to COVID. So that's how we could get around the words-that-aren't-words situation.
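A sketch of that auto-correct idea, with the offline pyspellchecker package (pip install pyspellchecker) standing in for the cloud spell-check API described above:

from spellchecker import SpellChecker

def autocorrect(words: list[str]) -> list[str]:
    spell = SpellChecker()
    words = [w.lower() for w in words]
    # unknown() returns the words not found in the dictionary; those are
    # the ones the recognizer likely got "close enough" phonetically.
    unknown = spell.unknown(words)
    # correction() snaps each unknown word to the nearest dictionary word,
    # or returns None when it has no candidate, in which case we keep it.
    return [(spell.correction(w) or w) if w in unknown else w
            for w in words]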
So the next step is normalizing the output and removing common beginning phrases. In the beginning of every episode, you have the prologue: the "this episode," "tonight's show is entitled," "brought to you by" part. We want to trim and filter those out, because we don't want those keywords popping up when we do the analysis later. So we normalize this output; it's in something like JSON or XML.
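A sketch of that normalization step, stripping the stock intro phrases so they don't dominate the keyword analysis later; the phrase list here is illustrative, not exhaustive:

import re

BOILERPLATE = [
    r"this is hacker public radio episode \d+[^.]*\.",
    r"today'?s show is entitled[^.]*\.",
    r"this episode of hpr is brought to you by[^.]*\.",
]

def normalize(text: str) -> str:
    text = text.lower()
    for pattern in BOILERPLATE:
        # Remove only the first occurrence, at the top of the episode.
        text = re.sub(pattern, " ", text, count=1)
    return re.sub(r"\s+", " ", text).strip()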
Once that output is normalized, it's then passed through YAKE, Yet Another Keyword Extractor. For that part, you can set the number of terms it replies back with, and I set it to 100. It defaults to, I want to say, 10.
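A sketch of that YAKE pass (pip install yake); the default number of terms is small, and here it's raised to 100 as described, with n=3 allowing phrases of up to three words:

import yake

def extract_keywords(text: str, top: int = 100) -> list[str]:
    extractor = yake.KeywordExtractor(lan="en", n=3, top=top)
    # extract_keywords returns (phrase, score) pairs; lower scores rank
    # better, so the list comes back best-first.
    return [phrase for phrase, score in extractor.extract_keywords(text)]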
So I'll give you some example words here. Let's see if I can find one. The latest episode as of today is about layer modes, erase, merge, and split, a look at layers and modes in GIMP. This is by Ahuka, I think. Who am I looking at? Yeah, Ahuka is going over some GIMP stuff, and this is episode 3420. So we can go in here, find our 3420 YAKE output. And for our words, we get 82 words, or 82 phrases. Some of these phrases are: toy image layer, top layer transparent, remaining layers, modes, undocumented layer modes, layer mode set, normal layer mask, open font license, layer mask effect, layer mask worked, layer group move, layer mask situations, layer group put, layer completely transparent, layer windows maker. So you're sort of building a multi-word keyword list here. It's not an exact transcription; it's pulling out what they call keyword extraction, which may include one word or multiple words, and I'm not sure how to break that up either. So there's the detecting-the-actual-words part that needs to be ironed out.
And then there's the keyword extraction, where I'm thinking I need to move away from keyword extraction using this method and move to something like text classification, which, I'm thinking, means categorizing text into key thematic things, the idea being that those would become the tags for the show. So we would get the text, run it through maybe a spell checker to create a nice clean output, a transcription of the episode. Then that gets run through this AI thing and it spits out tags, one-word tags or maybe multiple-word tags, whatever we decide on. But the key there is getting this keyword extraction right, or, as I'm thinking, we actually need text classification. Given a group of words or sentences or text, we want to pull out the classification. What is the theme? What are we talking about? What is the classification? So I'm thinking text classification might be where we want to go.
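The episode doesn't name a classifier, so as one concrete possibility: a sketch using the pretrained zero-shot classification pipeline from the Hugging Face transformers package, which needs no training of our own; the candidate tag list is made up for illustration:

from transformers import pipeline

def classify(text: str, candidate_tags: list[str]) -> list[str]:
    # Downloads a pretrained model on first use; no training required.
    classifier = pipeline("zero-shot-classification")
    result = classifier(text, candidate_labels=candidate_tags,
                        multi_label=True)
    # Keep tags the model scores above an (arbitrary) 0.5 threshold.
    return [label for label, score in zip(result["labels"],
                                          result["scores"])
            if score > 0.5]

Something like classify(transcript, ["gimp", "linux", "security", "hardware"]) would then return the subset of tags that seem to fit the episode.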
But again, I'm just barely scratching the surface of this stuff. There are a couple of pieces of open-source software that I've been looking at, and they're fairly complicated. When you start getting into machine learning and all this AI and training, well, I don't want to have to train it; I want to use existing models or whatever. But at the end of the day, I'd like to have a group of words that basically gives the episode its own tags. For this episode, it doesn't have the word GIMP in it, but it does have layer group out, layer completely transparent; you've got stuff like layer masks; you've got words like picked full saturation, full saturation spectrum, normal layer, image layer, layer transparent, layer opaque, layer completely. So there are keywords in here that would aim you towards the theme of the episode. But this doesn't tell me that this episode is about GIMP, or whatever.
Now, that's what the title does, right? So we have, at a high level, what the title is. We have a high-level view from the notes, because usually people put something in the show notes section that indicates what's going on, but usually it's links or whatever. For me, at least, I don't put a whole lot in there. To be honest, my episodes don't stand on their own, because they're generally time-sensitive. In 15 years, an episode about whatever program is not really going to matter all that much as far as the usefulness of that information. So the idea there is, I feel like if we can get it to a point where it's giving us a set of 10 words or key phrases, that will kind of automate the show-notes part of it. So where show notes don't exist, maybe we add kind of a metadata field and attach that to each episode. We could have more of a search index where, okay, you've got the output, and you can search the straight-up transcribed output of the episode. Again, that can be for the hearing impaired or whatever; you can download the text, and yeah, it's not going to be 100%, but it'll get you, you know, 99% of the way there.
you know, 99% the way there. And then the second piece is, you know, if the show notes are lacking or if,
|
||||
you know, you want tags, we can use some AI and some stuff like that to bring it into, you know,
|
||||
maybe a sort of a hodgepodge of show notes that would just have a bunch of phrases of words.
|
||||
And that would help you identify, okay, well, yes, this is about Gimp, but it's about, you know,
|
||||
compiling your own Gimp and you'll see the word compile and you'll see the word download and you'll
|
||||
see the word Linux and you'll see the word, you know, cross compile and you'll see the word make,
|
||||
right? And you'll be able to look at those words and understand that it's not about using Gimp. It's
|
||||
about programming in Gimp or compiling Gimp or writing Gimp plugins, right? So, the idea there is
|
||||
that, you know, you have just transcribe pull transcribe. Once we go from transcribe, we're going to
|
||||
Once we go from the transcription, we're going to marinate that and basically have a list of 10 words. What I want, what I imagine as the future state, is a list of 10 words that by themselves describe the podcast and give you an idea of what that podcast is about. So that's the second layer, and then the third layer is automating the creation of tags.
So once we get those keywords or key phrases into the mix, we can take them and maybe compare them to Google searches, or compare them to something else, and then the results based on that search would potentially give us single keywords to go with. For example, if we feed it a list of 10 key phrases like layer mask worked, layer group move, layer mask situation, saturation, layer group put, layers window make, and we put all of that in quotes and send it to Google, maybe we pull the top 10 results from that. Then we use keyword extraction on those results to say, okay, out of all these search results, the keyword that comes out is the word GIMP. So this must be about GIMP. And maybe we can use some other search pattern recognition: here's a list of things, what am I talking about? That's what we want to answer programmatically: okay, here's a list of phrases or a list of words, what am I actually talking about? Without having to read the whole transcription, the reader or the listener should be able to look at the keywords or tags and tell what it's about. So, again, oh, I'm going off on a tangent here.
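A sketch of that search-based disambiguation idea. There's no official API here: web_search() is a hypothetical stand-in for whatever search backend you'd actually wire up. The logic is the part described above: quote each key phrase, collect the top results, and see which words keep turning up across all of them.

from collections import Counter

def web_search(query: str, limit: int) -> list[str]:
    # Hypothetical stand-in: connect this to a real search backend.
    # It should return the top `limit` result titles/snippets as strings.
    raise NotImplementedError

def infer_topic(key_phrases: list[str], top_results: int = 10) -> list[str]:
    counts: Counter[str] = Counter()
    for phrase in key_phrases:
        for snippet in web_search(f'"{phrase}"', limit=top_results):
            counts.update(snippet.lower().split())
    # Words common to many results ("gimp", say) float to the top.
    return [word for word, _ in counts.most_common(10)]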
So once we've got the keywords and those tags, then where people haven't provided tags, we can provide our own tags and our own keywords or key phrases. At the very least, I should be able to convert all HPR episodes and transcribe them pretty successfully, probably to, you know, 95% or 98% accuracy. I can do that right now with something else if this VOSK thing turns out not to be the right tool for it, or I can figure out how to configure it differently. But at the very least, I can get us transcriptions. What I want to do is, again, add keyword phrases, then eventually add tags, and have that all automated. So if someone doesn't provide tags, they'd have the option of automated tags: they click next, and then maybe they get an email that says, please review your tags; if you think these are great, click yes, or click no, or something. Or you could just have it done offline somehow, and I could provide a single binary. I don't know; there are ways to do it.
But the idea there is, with a lot of these, especially these deep learning apparatuses and artificial intelligence things, you can use a GPU, so that's also an option. The reason I say that is an hour of audio takes about 20 to 30 minutes to be analyzed; or sorry, 10 minutes of audio takes about two minutes to analyze. So there is a time factor there: each episode is going to take about two minutes per 10 minutes of audio to analyze, and then you have your longer episodes that are an hour, or even these giant six-hour or two-hour ones. There's a time element involved, but we can use a GPU for that if we need to later. I don't think that's going to be a problem, though, because episodes trickle in, right? We only have one every day, and if it takes two hours to scan one file, who really cares? So from the speed perspective, I'm not too concerned. But if we need speed, we can utilize a GPU, we can utilize multithreaded downloads, and we can utilize the server itself for running this, or maybe remotely mount the audio files so that we don't have to download them from, you know, archive.org and the HPR site itself.
So that's one thing. I already talked about the input word list stuff; you know, it said the word COVID was covert, and that's something I can work through. But I just wanted to get you all's thoughts. What do you think? Has anybody approached this? I know I've talked about it before and brought it up before, and I don't know if I followed the thread or if anybody answered. But really, for me, at the very least, I would like to help provide transcriptions for episodes. That's the biggest thing for me. The second thing would be to help with tagging and keywords through automation, right? So here's an example; I'll provide links in the show notes to episode text and examples. And again, this is proof-of-concept stuff. This is just an example, and all of this can be tuned; we can use different software. If you all like it, feel free to let me know. It's freeload101 at Yahoo.com, that's free, L-O-A-D, 101. And you can also reach me at 404-647-425-0 if you want to hit me up.
Again, this is a very quick and dirty proof of concept for what I'm doing. What I'd like is, again, to have these episodes with tagging automated as much as possible, and then have the keyword stuff, so you can search the full text, or you can search keywords and key phrases, which would be a not-as-deep search, or you can search just based on the automated tags. So there are kind of three levels of search there.
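A sketch of those three levels of search: full transcript text (deepest), extracted key phrases (shallower), and curated tags (shallowest). Episode records are assumed to be simple dicts with "transcript", "keywords", and "tags" fields; that schema is illustrative, not an existing one.

def search(episodes: list[dict], query: str,
           level: str = "tags") -> list[dict]:
    q = query.lower()
    if level == "transcript":
        # Deepest level: substring match over the full transcribed text.
        return [e for e in episodes if q in e["transcript"].lower()]
    if level == "keywords":
        # Middle level: match against the extracted key phrases.
        return [e for e in episodes
                if any(q in k.lower() for k in e["keywords"])]
    # Shallowest level: match against the curated/automated tags.
    return [e for e in episodes if any(q in t.lower() for t in e["tags"])]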
Now, at the end of the day, maybe we don't need any of that. Maybe we just need the transcription, and that might be good enough for the listeners. It might even be good enough for people trying to write show notes and stuff like that. We might even be able to use the output of some of these scripts to help generate show notes and tags. So instead of having to listen to the entire episode, we use machine learning to pull phrases out, pull keywords out, and we review those keywords and say, okay, well, obviously this episode is about GIMP, but I don't want the word banana in there just because the whole episode is about photoshopping, or excuse me, gimping, a banana, right, using GIMP on a picture of a banana. So if the speaker uses a keyword or key phrase a bunch of times and it's not what the episode's about, we can filter that out. So that might be some way we could use machine learning, whatever, to help with show notes and tagging, or at least get it halfway there. Maybe, and this is some more feature thinking just off the top of my head, the transcript comes in, the transcript gets analyzed, and it creates keywords and tags. And then those go to, what do we call them, janitors, to approve or disapprove, right?
That way, the janitors don't have to listen to every single episode. I mean, as much as I want to hear other people's voices and my own voice, it can be kind of hard to get through some of these episodes that are hours long, or 30 minutes long, when they're on a topic you're not really passionate about. So having to mentally listen, even at 2x, having to mentally listen and retain that knowledge... Yeah, you can probably listen to the first five minutes and get an idea of what the episode's about. And maybe we use machine learning for that piece too. Maybe we analyze the first five minutes, pull out the key phrases, and those get pushed into a queue for janitors to approve, kind of automated show notes and/or automated tagging. And they can say, okay, based on the transcription, yeah, I'll approve these four keywords or these four tags, and these other three are just stupid and need to be filtered out. And we can kind of have a whitelist and a blacklist of keywords and phrases so that, instead of having to uncheck "free software" every time, you can highlight that and filter it in or out, based on things like that. So just basic filtering, basic stuff like that.
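A sketch of that whitelist/blacklist triage for the janitor queue. The list entries are made up for illustration: machine-suggested tags pass straight through if they're on the allow list, get dropped if they're on the deny list (so nobody has to uncheck "free software" every single time), and anything else waits for human review.

ALLOW = {"gimp", "linux", "compile"}    # illustrative entries
DENY = {"free software", "banana"}      # illustrative entries

def triage_tags(suggested: list[str]) -> tuple[list[str], list[str]]:
    # Auto-approve anything on the allow list.
    approved = [t for t in suggested if t.lower() in ALLOW]
    # Anything not explicitly allowed or denied goes to the review queue.
    pending = [t for t in suggested
               if t.lower() not in ALLOW and t.lower() not in DENY]
    return approved, pending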
Other than that, I can't really think of anything else off the top of my head. If you all have any comments, suggestions, ideas... you know, there are 10,000 ways to skin this cat. But at the end of the day, given enough time and maybe a little bit of help from some people, I could probably automate tagging. Which is kind of scary with machine learning, so just keep that in mind, if we're against the robots and we think that the human touch is important, whatever. I'm a big scripter, I'm a big automator, and I definitely don't want you guys getting burned out. And if I can help with manual tagging, I probably should, instead of doing episodes about automating something that doesn't need to be automated. But anyways, hope that helps you guys out. Feel free to pass off any other ideas or provide comments. I've really thought about this for several years and finally broke down and said, you know what? I hear so many people talking about tagging and having to do this and having to do that. Well, if we can, at the very least, transcribe these episodes, then that can cut down on a lot of the work that has to be done. Anyways, you all have a good one, and peace out.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contributing link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.