Initial commit: HPR Knowledge Base MCP Server

- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr4026.txt

Episode: 4026
Title: HPR4026: Using NLP to get better answer options for language learning
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4026/hpr4026.mp3
Transcribed: 2025-10-25 18:49:12
---
This is Hacker Public Radio Episode 4026, from Monday, 8 January 2024.
Today's show is entitled "Using NLP to get better answer options for language learning".
It is the first show by new host Tom P. S. G. J. and is about 17 minutes long.
It carries a clean flag.
The summary is: Levenshtein distance may help language learning apps improve answer options
for better learning.
Hello, my name is Greg Thompson.
This is my first Hacker Public Radio broadcast.
Many people enjoy learning a language in their free time.
They don't have much spare time, though, so they want their practice to be fun and efficient.
So I'd like to talk a little bit about how natural language processing can help make
language learning practice more effective.
There are lots of apps to choose from to learn a language.
They usually let you choose from many target languages, the languages that you'd like to
learn.
Usually words are grouped into themes, like animals, workplace, or hobbies, and the apps
offer different activities.
For example, you might hear the audio of a target word and select that word from a list
of options, or you might see a word that you're learning and select its translation into
your native language.
You might fill in a sentence that requires that word, typing it or selecting the right
option from a list, or you might spell the word.
So say I'm learning Korean.
The app might show goyangi and the options it displays could be dog, cat, mouse and
fish.
And I pick cat because that's the right translation.
Goyangi is cat in English.
As you progress, new themes come up, and it's common for review words to come back and
mix with the vocabulary from the current theme you're working on.
Sometimes these answer options aren't very useful though.
So let's say we moved on from the animal topic earlier with goyangi and cat and now we're
working on weather.
The target word might be chua, which means cold, and the options the app displays
are cold, hot, humid and cat.
Well, cat is obviously wrong.
It's completely off topic, and when we see it there, we'll probably know it's not one of
the current vocabulary words we're learning, so it's pretty easy to discard.
So the problem is that the answer choices aren't very challenging, and review words in
particular may come with answer options that are easy to rule out.
We might remember only the general topic and not the actual meaning of the word.
This problem gets worse as more review words are brought up and the jumble of unrelated
topics increases.
So a quick fix might be to draw answer options only from words related to the topic the
target word belongs to.
That helps, but there are other things we can do, and some of them relate to natural
language processing.
So instead of drawing answer options from topics, we could use the characteristics of the
words themselves to come up with answer options.
Imagine getting a set of answer options that look similar: gohyang is hometown,
goyangi is cat, gyoyuk is education, and goun means fine or beautiful.
The words come from very different topics, but they are harder to tell apart because they
look alike.
So we're forced to consider the spelling and remember more of the word than just its
general topic.
So finding words that are similar may enhance vocabulary practice.
And that's where natural language processing comes in because there's a lot of techniques
to use the characteristics of the words to come up with better answer options.
One technique we've just hinted at is word similarity.
There are a lot of ways to measure the similarity or difference between words.
One is the Levenshtein distance, which measures difference in terms of substitutions,
insertions, and deletions.
Substitution means replacing one character of a word with another.
Insertion means adding a new character between two existing characters.
And deletion means taking a character out.
With the Levenshtein distance, we transform one word into the other and count the number
of steps needed to make that happen.
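
In code, the three operations look something like this (the example words here are just
for illustration, not from the show):

    # Illustrating the three edit operations with plain string slicing.
    word = "cat"
    substituted = word[:1] + "o" + word[2:]   # substitution: 'a' -> 'o', giving "cot"
    inserted = word[:2] + "r" + word[2:]      # insertion: add 'r', giving "cart"
    deleted = word[:2] + word[3:]             # deletion: drop 't', giving "ca"
    print(substituted, inserted, deleted)     # cot cart ca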
Another one is the Hamming distance.
It's very similar to the Levenshtein distance in that it calculates the difference between
two strings, but it allows only substitutions.
And so it's mainly useful for strings of the same length.
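
A Hamming distance is simple enough to sketch by hand; this small function (the name
hamming_distance is my own, not from a library) counts the positions that differ:

    def hamming_distance(a: str, b: str) -> int:
        # Hamming distance is only defined for equal-length strings.
        if len(a) != len(b):
            raise ValueError("strings must be the same length")
        return sum(x != y for x, y in zip(a, b))

    print(hamming_distance("cat", "can"))  # 1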
There are other techniques we could use, like Jaccard similarity, cosine similarity, or
n-grams, and each has particular use cases where it might be really useful.
This podcast will look at Levenshtein distance and get some experience using it and
playing around with it.
So in Python, there's a package you can use to calculate Levenshtein distance.
To get started, install it with pip install levenshtein.
Levenshtein is spelled L-e-v-e-n-s-h-t-e-i-n.
Then import it with import Levenshtein; note that this time the name starts with a
capital L.
So we've installed the Levenshtein package and imported it.
Now we can use the simple distance function it provides to calculate the distance between
two words.
Type Levenshtein.distance, with a capital L, then open and close parentheses, and inside
provide two strings.
In my case, I'm going to use the words cat and dog.
Make sure your two strings are separated by a comma.
When you run Levenshtein.distance on cat and dog, or whatever words you've chosen, you'll
see the result 3.
That shows the words are quite different: each string is three characters long, and it
takes three edits to change cat into dog.
We need to change c to d, a to o, and t to g.
If we run it again with cat and can, c-a-n, we get a result of 1, because only one
character needs to change to turn cat into can: the t becomes an n.
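
Put together, the whole exchange looks like this:

    # pip install levenshtein
    import Levenshtein

    print(Levenshtein.distance("cat", "dog"))  # 3: every character changes
    print(Levenshtein.distance("cat", "can"))  # 1: only 't' changes to 'n'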
There is some mathematics behind the Levenshtein distance, and it can be useful to review
it and see the matrix that drives the calculation.
Ethan Nam has a great blog post about it that breaks down the mathematics and works step
by step through the Levenshtein distance for a couple of longer words.
I'll put a link to it alongside this podcast, so check it out if you'd like to know more
about how the distance is calculated.
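
If you want to see the matrix for yourself, here is a minimal pure-Python sketch of the
standard dynamic programming calculation; it's study code, not a replacement for the
optimized package:

    def levenshtein(a: str, b: str) -> int:
        # dist[i][j] holds the edits needed to turn a[:i] into b[:j].
        dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dist[i][0] = i  # delete everything in a[:i]
        for j in range(len(b) + 1):
            dist[0][j] = j  # insert everything in b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(
                    dist[i - 1][j] + 1,         # deletion
                    dist[i][j - 1] + 1,         # insertion
                    dist[i - 1][j - 1] + cost,  # substitution or match
                )
        return dist[len(a)][len(b)]

    print(levenshtein("cat", "dog"))  # 3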
For our language learning purposes, though, we could take a list of words a student knows,
find similar words, and come up with more difficult answer options to enhance practice.
You'd need to write custom functions to apply the Levenshtein distance calculation to
your app and its needs, though.
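
As a rough sketch of what such a custom function might look like (the function name and
word list here are hypothetical, not from any particular app):

    import Levenshtein

    def similar_options(target, known_words, count=3):
        # Rank the student's known words by edit distance to the target.
        candidates = [w for w in known_words if w != target]
        candidates.sort(key=lambda w: Levenshtein.distance(target, w))
        return candidates[:count]

    vocab = ["gohyang", "gyoyuk", "goun", "dog", "hot", "humid"]
    print(similar_options("goyangi", vocab))  # look-alike words rank first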
There is another Python package that does provide such functions built on the Levenshtein
distance, and it even uses the Levenshtein package we've been playing with.
That package is called thefuzz.
It uses Levenshtein distance but offers different functions for the different situations
in which you might want to apply it.
For example, it has a simple ratio, which is built on the same edit distance we've been
calculating so far.
It's not much different, but as the name implies, it provides a ratio: instead of
returning the number of edits, it returns the percentage of similarity between the two
words.
Partial ratio is another tool it provides; it still applies the Levenshtein distance, but
it scores the best matching portion when one string is much longer than the other.
The token sort ratio tool compares strings without word order as a factor, so it's
particularly useful for longer, multi-word strings.
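
Once the package is installed (covered in a moment), those two behave roughly like this:

    from thefuzz import fuzz

    # partial_ratio scores the best-matching piece of the longer string.
    print(fuzz.partial_ratio("cat", "cat food"))  # 100: "cat" appears intact

    # token_sort_ratio ignores word order.
    print(fuzz.token_sort_ratio("hello world", "world hello"))  # 100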
There are other tools that are exposed, but for now, we'll just go ahead and get started.
You can install thefuzz by entering pip install thefuzz.
Thefuzz is one word, t-h-e-f-u-z-z.
To use it in an application, type from thefuzz import fuzz.
Again, thefuzz is one word: from thefuzz import fuzz.
We can try thefuzz on the same examples as before.
Type fuzz.ratio, then enter the two words, cat comma dog.
Make sure cat and dog are both in quotation marks, because they are strings.
When you run fuzz.ratio on cat and dog, it returns 0.
This might seem surprising, because the last package returned 3, but remember that those
three edits meant all three letters had to change to transform cat into dog.
The 0 here is saying there is zero similarity.
If we run fuzz.ratio on cat and can, thefuzz returns 67.
That indicates the two strings are about two thirds similar, which makes sense: only the
t needed to change to an n, so about 33 percent, one third, of the two words differs.
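
Again, the whole exchange in one place:

    # pip install thefuzz
    from thefuzz import fuzz

    print(fuzz.ratio("cat", "dog"))  # 0: no similarity
    print(fuzz.ratio("cat", "can"))  # 67: about two thirds similar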
Thefuzz also has a useful helper called process.
It takes a string to match against, a list of options, and a maximum number of matches,
and it gives you back a list of tuples, each holding a matched string and its similarity
score.
We'll make a variable called options: type options equals, open bracket, dog comma can,
close bracket.
We're using the same words as before; make sure dog and can are in quotation marks, so
that they're strings.
Then we import something else from thefuzz: from thefuzz import process.
Then we'll use process.extract with three arguments inside the parentheses: cat in
quotation marks, then a comma, then options, then comma, limit equals 2.
Cat is the word we're checking, options is the list of words to score against, and limit
equals 2 restricts the results to the two highest scores.
When we run this, we get dog and can back with their similarity scores, as a list of
tuples; we only have two items in options, so of course we get both back.
If we change the limit parameter to 1, we only get can, because dog doesn't have the
highest score.
process.extract has another parameter we can set: the scorer parameter.
We could say scorer equals fuzz.ratio, or one of the other scoring tools we talked about
before.
That lets you pick the Levenshtein-based calculation that suits the kind of strings
you're working with.
For our purposes fuzz.ratio was fine, but you could change it if you wanted to.
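
Here is the complete process example as described, including the scorer parameter:

    from thefuzz import fuzz, process

    options = ["dog", "can"]

    # Top two matches for "cat", returned as (string, score) tuples.
    print(process.extract("cat", options, limit=2))
    # e.g. [('can', 67), ('dog', 0)]

    # Only the single best match, with an explicit scorer.
    print(process.extract("cat", options, limit=1, scorer=fuzz.ratio))
    # e.g. [('can', 67)]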
Thefuzz shows how you might wrap the Levenshtein distance calculation in a set of useful
functions tailored to your preferences.
Maybe you prefer a ratio, or maybe you'd like to keep the raw edit distance in your
results.
Returning to our language learning example, these tools would be useful for finding
similar answer options from the vocabulary a student has been exposed to.
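
For instance, a review exercise could lean on process.extract to pick distractors (the
names and word list here are illustrative, not from any particular app):

    from thefuzz import process

    def build_answer_options(target, seen_vocab, count=3):
        # Choose distractors that look like the target word.
        pool = [w for w in seen_vocab if w != target]
        return [word for word, score in process.extract(target, pool, limit=count)]

    seen = ["gohyang", "gyoyuk", "goun", "dog", "hot", "humid"]
    print(build_answer_options("goyangi", seen))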
So now you know how to calculate word similarities using two different Python packages.
With complicated LLMs and machine learning models everywhere, it's easy to forget that
there are relatively simple techniques for experimenting with natural language processing
that can still achieve powerful results and give your apps some interesting capabilities.
In the new year, you might find some value or joy in learning a new language.
But if you come across some answer options that just aren't challenging you, you may
find there are better ways to draw out those answer options to make your practice more effective.
Thank you for listening to this podcast and have a happy new year.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, click on our contribute link to find out how
easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and
rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution
4.0 International license.