Episode: 4026
Title: HPR4026: Using NLP to get better answer options for language learning
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4026/hpr4026.mp3
Transcribed: 2025-10-25 18:49:12

---
This is Hacker Public Radio Episode 4026 for Monday, the 8th of January 2024.

Today's show is entitled "Using NLP to get better answer options for language learning". It is the first show by new host Tom P. S. G. J. and is about 17 minutes long. It carries a clean flag. The summary is: Levenshtein distance may help language learning apps improve answer options for better learning.
Hello, my name is Greg Thompson. This is my first Hacker Public Radio broadcast.

Many people enjoy learning a language in their free time. People don't have much time, though, and they want fun and efficient practice when they're learning a language. So I'd like to talk a little bit about how natural language processing can help make language learning practice more effective.
There are lots of apps to choose from to learn a language. They usually let you choose from many target languages, the languages that you'd like to learn. Usually words are grouped into themes like animals, workplace, or hobbies, and the apps have different activities. For example, you might hear the audio of a target word and select that word from a list of options, or you might see a word that you're learning and select the translation of that word into your native language. You might fill in a sentence that requires that word. You might type it, select the right option again from a list, or spell the word.
So say I'm learning Korean. The app might show goyangi, and the options it displays could be dog, cat, mouse, and fish. And I pick cat, because that's the right translation. Goyangi is cat in English.
As you progress, new themes usually come up in these apps, and it's common, even when you're learning new words, for review words to come back up and mix with the vocabulary from the current theme that you're working on. Sometimes these answer options aren't very useful, though.
So let's say we moved on from the animal topic earlier with goyangi and cat, and now we're working on weather. The target word might be chuwo, which means cold, and the options that the app displays are cold, hot, humid, and cat. Well, cat is obviously wrong. It's completely off topic, and when we see it in there, we'll probably know that it's just not one of the current vocabulary words we're learning. And so it's pretty easy to discard.
So the problem is that the answer choices aren't very challenging, and the review words themselves, when we're reviewing them, might get answer options that aren't very challenging either. We might just remember the general topic and not really remember the actual meaning of the word. This problem gets worse as more review words are brought up and the jumble of unrelated topics increases.
So a quick fix for that might just be to draw answer options from words related to the topic that the word is in. That's great, but there are other things we can do, and some of them relate to natural language processing.
So instead of drawing answer options from topics, we could use the characteristics of the words themselves to come up with answer options. Imagine getting a set of answer options that look similar: gohyang is hometown, goyangi is cat, gyoyuk is education, and goun means fine or beautiful. They still come from very different topics, but they're harder to distinguish because they look similar. So we're forced to consider the spelling and remember more of the word than just the general topic. Finding words that are similar may enhance vocabulary practice.
And that's where natural language processing comes in, because there are a lot of techniques that use the characteristics of the words to come up with better answer options. One characteristic we just mentioned is the similarity of words. There are a lot of ways to measure the similarity or difference of words.
One is the Levenshtein distance, which calculates difference based on substitutions, insertions, and deletions. Substitution means that we replace one character of a word. Insertion means we add a new character between two existing characters. And deletion means that we take out a character. With the Levenshtein distance, we're trying to transform one word into the other and counting the number of steps it takes to make that happen.
Another one is the Hamming distance. It's very similar to the Levenshtein distance in that it calculates the difference between two strings, but it's based only on substitutions, so it's mainly useful for strings of the same length.
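As a quick illustration (this snippet is just a sketch and isn't from the episode), a Hamming distance for equal-length strings can be written in a few lines of Python:

    # Count the positions where two equal-length strings differ
    def hamming(a, b):
        if len(a) != len(b):
            raise ValueError("Hamming distance needs strings of the same length")
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    print(hamming("cat", "can"))  # 1: only the last character differs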
There are other techniques that we could use, like Jaccard or cosine similarity or n-grams, and each has particular use cases where it might be really useful. This podcast will look at Levenshtein, and we'll get some experience using it and playing around with it.
So in Python, there's a package that you can use to calculate the Levenshtein distance. To get started, you would need to download the Levenshtein package with pip install Levenshtein. Levenshtein is spelled l-e-v-e-n-s-h-t-e-i-n. Then we would import Levenshtein, and when we spell Levenshtein this time, we need to use a capital L: capital L, then e-v-e-n-s-h-t-e-i-n. So we've downloaded the Levenshtein package and imported it, and now we can use the simple function provided by the Levenshtein package to calculate the distance between two words. You can type in Levenshtein, with a capital L, dot distance, then open and close parentheses, and inside we need to provide two strings.
In my case, I'm going to use the words cat and dog. Make sure that your two strings are separated by a comma. After you press enter, for Levenshtein dot distance with cat and dog, or whatever words you've chosen, you would see the result of three. This shows that the words are fairly different, because each of these strings is three characters long and it takes three edits to change cat into dog: we would need to change c to d, a to o, and t to g. If we ran this again using cat and can, c-a-n, then we would get a result of one. The Levenshtein distance is one because only one character needs to be changed to turn cat into can: we change the t to an n.
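Put together, the whole example looks something like this (a minimal sketch of the calls described above):

    import Levenshtein  # installed with: pip install Levenshtein

    print(Levenshtein.distance("cat", "dog"))  # 3: every character changes
    print(Levenshtein.distance("cat", "can"))  # 1: only t -> n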
The Levenshtein distance has some mathematics involved, and it can be useful to review that mathematics and see the visual matrix of how it works. Ethan Nam has a great blog post about it that breaks down the mathematics and shows a step-by-step example of how to calculate the Levenshtein distance for a couple of longer words. I'm going to put a link to that along with this podcast, so you can check it out if you'd like to know more about how to calculate the Levenshtein distance.
For our purposes in language learning applications, though, we could take in a list of words a student knows, find similar words, and come up with more difficult answer options to enhance practice. You would need to write custom functions to apply the Levenshtein distance calculation to your apps and needs, though.
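For example, a custom helper might look something like this. It is only a hypothetical sketch (the function name and the word list are made up), but it shows the idea of picking the most confusable distractors from vocabulary a student already knows:

    import Levenshtein

    def pick_distractors(target, known_words, count=3):
        # Smaller distance = more similar, so sort ascending and skip the target itself
        candidates = [w for w in known_words if w != target]
        candidates.sort(key=lambda w: Levenshtein.distance(target, w))
        return candidates[:count]

    vocab = ["goyangi", "gohyang", "gyoyuk", "goun", "gae", "chuwo"]
    print(pick_distractors("goyangi", vocab))  # the three closest-looking words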
There is another package in Python that does provide these kinds of functions using the Levenshtein distance, and it even uses the Levenshtein calculation that we've been playing with. That package is called thefuzz. It uses the Levenshtein distance but has different functions to apply to different situations, depending on how you want to use it.
For example, it has a simple ratio, which is just the edit distance that we've been calculating so far. It's not much different, but as the name implies, it's going to provide a ratio: instead of returning the number of edits, it returns the percent of similarity between the two words. Partial ratio is another tool it has; it still applies the Levenshtein distance, but it looks for the best matching part of a longer string, so it's useful when the strings are different lengths. The token sort ratio tool will find similar strings, whether they're long or short, without word order as a factor, so it's particularly useful for longer strings.
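A rough sketch of how those three ratios differ (the example strings here are just illustrative):

    from thefuzz import fuzz

    print(fuzz.ratio("new york mets", "new york yankees"))           # plain edit-distance ratio
    print(fuzz.partial_ratio("new york mets", "the new york mets"))  # 100: best matching substring
    print(fuzz.token_sort_ratio("mets new york", "new york mets"))   # 100: word order ignored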
There are other tools that it exposes, but for now we'll just go ahead and get started. You can install the thefuzz package by entering pip install thefuzz. Thefuzz is one word, t-h-e-f-u-z-z. To use thefuzz in an application, you would type from thefuzz import fuzz. Again, thefuzz is one word: so, from thefuzz import fuzz. You can run the thefuzz package using the same examples that we did before. To use it, type in fuzz dot ratio, and then enter the two words, cat comma dog. Make sure that cat and dog are both in quotation marks, because they are strings.
When you run fuzz dot ratio on cat and dog, it will return zero. This might seem surprising, because the last package returned three, but remember that those three edits indicated that all three letters needed to change in order to transform cat into dog. So the zero here is saying that there is zero similarity. If we run fuzz dot ratio on cat and can, then thefuzz package returns 67. This indicates that about two thirds of the string is shared between the two strings, which makes sense because only the t needed to change to an n, so about 33 percent, one third, of these two words is different.
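Here is the same comparison as a minimal snippet:

    from thefuzz import fuzz  # installed with: pip install thefuzz

    print(fuzz.ratio("cat", "dog"))  # 0: no similarity
    print(fuzz.ratio("cat", "can"))  # 67: about two thirds similar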
Thefuzz package also has a useful module called process. It takes in a list of options, a string to match against, and the maximum number of matches, and once you set all of that up, you get back a list of tuples, each with a string match and a similarity score. So we're going to make a variable called options, o-p-t-i-o-n-s, equals, open bracket, dog comma can, close bracket. We're using the same words as before; make sure dog and can are in quotation marks so that they're strings. Then we're going to import something else from thefuzz: we're going to import process. So, from thefuzz import process; make sure thefuzz is one word.
Then we'll use process.extract, and we're going to put three parameters into process.extract. Inside the open and close parentheses, type cat in quotation marks, then a comma, then options, then a comma, then limit equals 2. Cat is the word that we're going to check, options is the list of words that we want to score for similarity against it, and limit equals 2 means that we're going to limit the results to the two highest scores. So when we do this, we're going to get dog and can back again, along with their similarity scores. They'll be in a list of tuples, and since we only have two items in our options, of course we're going to get both back. If we change the limit parameter to one, then we would only get can, because dog does not have the highest score. Process.extract has another parameter that we can set.
We can set the scorer parameter. So we could say scorer equals fuzz dot ratio, or one of the other scoring tools that we talked about before. That lets you run the Levenshtein distance calculation that fits the type of string you're working with. For our purposes, fuzz ratio was fine, but if you wanted to, you could change it.
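The full process example, as a minimal sketch:

    from thefuzz import fuzz, process

    options = ["dog", "can"]
    print(process.extract("cat", options, limit=2))                     # both options with their scores, e.g. [('can', 67), ('dog', 0)]
    print(process.extract("cat", options, limit=1, scorer=fuzz.ratio))  # [('can', 67)]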
Thefuzz package shows how you might wrap the Levenshtein distance calculation up into a set of useful functions tailored to your preferences. Maybe you prefer the ratio, or maybe you'd like to preserve the edit distance in your results. For our language learning example, though, these tools would be useful for finding similar answer options from vocabulary a student has been exposed to.
So now you know how to calculate word similarities using two different Python packages. With complicated LLMs and machine learning models around, it's easy to forget that there are relatively simple techniques you can use to experiment with natural language processing, and that they can still achieve some powerful results and give you some interesting capabilities for your apps.
In the new year, you might find some value or joy in learning a new language. But if you come across some answer options that just aren't challenging you, you may find there are better ways to draw out those answer options to make your practice more effective. Thank you for listening to this podcast, and have a happy new year.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.