Episode: 4026
Title: HPR4026: Using NLP to get better answer options for language learning
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4026/hpr4026.mp3
Transcribed: 2025-10-25 18:49:12

---
This is Hacker Public Radio Episode 4026 for Monday, the 8th of January 2024.

Today's show is entitled "Using NLP to get better answer options for language learning". It is the first show by new host Tom P. S. G. J. and is about 17 minutes long. It carries a clean flag. The summary is: Levenshtein distance may help language learning apps improve answer options for better learning.
Hello, my name is Greg Thompson. This is my first Hacker Public Radio broadcast.

Many people enjoy learning a language in their free time. People don't have much time, though, and they want fun and efficient practice when they're learning a language. So I'd like to talk a little bit about how natural language processing can help make language learning practice more effective.
There are lots of apps to choose from to learn a language. They usually let you choose from many target languages, the languages that you'd like to learn. Usually words are grouped into themes like animals, workplace, or hobbies, and the apps have different activities. For example, you might hear the audio of a target word and select that word from a list of options, or you might see a word that you're learning and select the translation of that word into your native language. You might fill in a sentence that requires that word. You might type it, select the right option again from a list, or spell the word.
So say I'm learning Korean. The app might show goyangi, and the options it displays could be dog, cat, mouse, and fish. And I pick cat, because that's the right translation. Goyangi is cat in English.
As you progress, new themes usually come up in these apps, and it's common, even when you're learning new words, for review words to come back up and mix with the vocabulary from the current theme that you're working on. Sometimes these answer options aren't very useful, though.
So let's say we moved on from the animal topic earlier with goyangi and cat, and now we're working on weather. The target word might be chuwo, which means cold, and the options that the app displays are cold, hot, humid, and cat. Well, cat is obviously wrong. It's completely off topic, and when we see it in there, we'll probably know that it's just not one of the current vocabulary words we're learning. And so it's pretty easy to discard.
So the problem is that the answer choices aren't very challenging, and the review words themselves, when we're reviewing them, might get answer options that aren't very challenging either. We might just remember the general topic and not really remember the actual meaning of the word. This problem gets worse as more review words are brought up and the jumble of unrelated topics increases.
So a quick fix for that might just be to draw answer options from words related to the topic that the word is in. That's great, but there are other things we can do, and some of them relate to natural language processing.
So instead of drawing answer options from topics, we could use the characteristics of the words themselves to come up with answer options. Imagine getting a set of answer options that look similar: gohyang is hometown, goyangi is cat, gyoyuk is education, and goun means fine or beautiful. They still come from very different topics, but they're harder to distinguish because they look similar. So we're forced to consider the spelling and remember more of the word than just the general topic. Finding words that are similar may enhance vocabulary practice.
And that's where natural language processing comes in, because there are a lot of techniques that use the characteristics of the words to come up with better answer options. One characteristic we just mentioned is the similarity of words. There are a lot of ways to measure the similarity or difference of words.
One is the Levenshtein distance, which calculates difference based on substitutions, insertions, and deletions. Substitution means that we replace one character of a word. Insertion means we add a new character between two existing characters. And deletion means that we take out a character. With the Levenshtein distance, we're trying to transform one word into the other and counting the number of steps it takes to make that happen.
Another one is the Hamming distance. It's very similar to the Levenshtein distance in that it calculates the difference between two strings, but it's based only on substitutions, so it's mainly useful for strings of the same length.
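As a quick illustration (this snippet is just a sketch and isn't from the episode), a Hamming distance for equal-length strings can be written in a few lines of Python:

    # Count the positions where two equal-length strings differ
    def hamming(a, b):
        if len(a) != len(b):
            raise ValueError("Hamming distance needs strings of the same length")
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    print(hamming("cat", "can"))  # 1: only the last character differs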
There are other techniques that we could use, like Jaccard or cosine similarity or n-grams, and each has particular use cases where it might be really useful. This podcast will look at Levenshtein, and we'll get some experience using it and playing around with it.
So in Python, there's a package that you can use to calculate the Levenshtein distance. To get started, you would need to download the Levenshtein package with pip install Levenshtein. Levenshtein is spelled l-e-v-e-n-s-h-t-e-i-n. Then we would import Levenshtein, and when we spell Levenshtein this time, we need to use a capital L: capital L, then e-v-e-n-s-h-t-e-i-n. So we've downloaded the Levenshtein package and imported it, and now we can use the simple function provided by the Levenshtein package to calculate the distance between two words. You can type in Levenshtein, with a capital L, dot distance, then open and close parentheses, and inside we need to provide two strings.
In my case, I'm going to use the words cat and dog. Make sure that your two strings are separated by a comma. After you press enter, for Levenshtein dot distance with cat and dog, or whatever words you've chosen, you would see the result of three. This shows that the words are fairly different, because each of these strings is three characters long and it takes three edits to change cat into dog: we would need to change c to d, a to o, and t to g. If we ran this again using cat and can, c-a-n, then we would get a result of one. The Levenshtein distance is one because only one character needs to be changed to turn cat into can: we change the t to an n.
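Put together, the whole example looks something like this (a minimal sketch of the calls described above):

    import Levenshtein  # installed with: pip install Levenshtein

    print(Levenshtein.distance("cat", "dog"))  # 3: every character changes
    print(Levenshtein.distance("cat", "can"))  # 1: only t -> n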
The Levenshtein distance has some mathematics involved, and it can be useful to review that mathematics and see the visual matrix of how it works. Ethan Nam has a great blog post about it that breaks down the mathematics and shows a step-by-step example of how to calculate the Levenshtein distance for a couple of longer words. I'm going to put a link to that along with this podcast, so you can check it out if you'd like to know more about how to calculate the Levenshtein distance.
For our purposes in language learning applications, though, we could take in a list of words a student knows, find similar words, and come up with more difficult answer options to enhance practice. You would need to write custom functions to apply the Levenshtein distance calculation to your apps and needs, though.
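For example, a custom helper might look something like this. It is only a hypothetical sketch (the function name and the word list are made up), but it shows the idea of picking the most confusable distractors from vocabulary a student already knows:

    import Levenshtein

    def pick_distractors(target, known_words, count=3):
        # Smaller distance = more similar, so sort ascending and skip the target itself
        candidates = [w for w in known_words if w != target]
        candidates.sort(key=lambda w: Levenshtein.distance(target, w))
        return candidates[:count]

    vocab = ["goyangi", "gohyang", "gyoyuk", "goun", "gae", "chuwo"]
    print(pick_distractors("goyangi", vocab))  # the three closest-looking words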
There is another package in Python that does provide these kinds of functions using the Levenshtein distance, and it even uses the Levenshtein calculation that we've been playing with. That package is called thefuzz. It uses the Levenshtein distance but has different functions to apply to different situations, depending on how you want to use it.
For example, it has a simple ratio, which is just the edit distance that we've been calculating so far. It's not much different, but as the name implies, it's going to provide a ratio: instead of returning the number of edits, it returns the percent of similarity between the two words. Partial ratio is another tool it has; it still applies the Levenshtein distance, but it looks for the best matching part of a longer string, so it's useful when the strings are different lengths. The token sort ratio tool will find similar strings, whether they're long or short, without word order as a factor, so it's particularly useful for longer strings.
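A rough sketch of how those three ratios differ (the example strings here are just illustrative):

    from thefuzz import fuzz

    print(fuzz.ratio("new york mets", "new york yankees"))           # plain edit-distance ratio
    print(fuzz.partial_ratio("new york mets", "the new york mets"))  # 100: best matching substring
    print(fuzz.token_sort_ratio("mets new york", "new york mets"))   # 100: word order ignored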
There are other tools that it exposes, but for now we'll just go ahead and get started. You can install the thefuzz package by entering pip install thefuzz. Thefuzz is one word, t-h-e-f-u-z-z. To use thefuzz in an application, you would type from thefuzz import fuzz. Again, thefuzz is one word: so, from thefuzz import fuzz. You can run the thefuzz package using the same examples that we did before. To use it, type in fuzz dot ratio, and then enter the two words, cat comma dog. Make sure that cat and dog are both in quotation marks, because they are strings.
When you run fuzz dot ratio on cat and dog, it will return zero. This might seem surprising, because the last package returned three, but remember that those three edits indicated that all three letters needed to change in order to transform cat into dog. So the zero here is saying that there is zero similarity. If we run fuzz dot ratio on cat and can, then thefuzz package returns 67. This indicates that about two thirds of the string is shared between the two strings, which makes sense because only the t needed to change to an n, so about 33 percent, one third, of these two words is different.
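Here is the same comparison as a minimal snippet:

    from thefuzz import fuzz  # installed with: pip install thefuzz

    print(fuzz.ratio("cat", "dog"))  # 0: no similarity
    print(fuzz.ratio("cat", "can"))  # 67: about two thirds similar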
Thefuzz package also has a useful module called process. It takes in a list of options, a string to match against, and the maximum number of matches, and once you set all of that up, you get back a list of tuples, each with a string match and a similarity score. So we're going to make a variable called options, o-p-t-i-o-n-s, equals, open bracket, dog comma can, close bracket. We're using the same words as before; make sure dog and can are in quotation marks so that they're strings. Then we're going to import something else from thefuzz: we're going to import process. So, from thefuzz import process; make sure thefuzz is one word.
Then we'll use process.extract, and we're going to put three parameters into process.extract. Inside the open and close parentheses, type cat in quotation marks, then a comma, then options, then a comma, then limit equals 2. Cat is the word that we're going to check, options is the list of words that we want to score for similarity against it, and limit equals 2 means that we're going to limit the results to the two highest scores. So when we do this, we're going to get dog and can back again, along with their similarity scores. They'll be in a list of tuples, and since we only have two items in our options, of course we're going to get both back. If we change the limit parameter to one, then we would only get can, because dog does not have the highest score. Process.extract has another parameter that we can set.
We can set the scorer parameter. So we could say scorer equals fuzz dot ratio, or one of the other scoring tools that we talked about before. That lets you run the Levenshtein distance calculation that fits the type of string you're working with. For our purposes, fuzz ratio was fine, but if you wanted to, you could change it.
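The full process example, as a minimal sketch:

    from thefuzz import fuzz, process

    options = ["dog", "can"]
    print(process.extract("cat", options, limit=2))                     # both options with their scores, e.g. [('can', 67), ('dog', 0)]
    print(process.extract("cat", options, limit=1, scorer=fuzz.ratio))  # [('can', 67)]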
Thefuzz package shows how you might wrap the Levenshtein distance calculation up into a set of useful functions tailored to your preferences. Maybe you prefer the ratio, or maybe you'd like to preserve the edit distance in your results. For our language learning example, though, these tools would be useful for finding similar answer options from vocabulary a student has been exposed to.
So now you know how to calculate word similarities using two different Python packages. With complicated LLMs and machine learning models around, it's easy to forget that there are relatively simple techniques you can use to experiment with natural language processing, and that they can still achieve some powerful results and give you some interesting capabilities for your apps.
In the new year, you might find some value or joy in learning a new language. But if you come across some answer options that just aren't challenging you, you may find there are better ways to draw out those answer options to make your practice more effective. Thank you for listening to this podcast, and have a happy new year.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.