Episode: 22
Title: HPR0022: Chunk Parsing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0022/hpr0022.mp3
Transcribed: 2025-10-07 10:23:11

---

...
Hello everyone, welcome to another episode of Hacker Public Radio. I am your host, Plexi, and I'll be talking about parsing, specifically about chunk parsing. I will describe classical chunk parsing, then I will tell you about a modification suggested to make it more efficient.

Chunk parsing is also called partial parsing. It's a robust parsing strategy proposed for natural language processing. An example of a chunked sentence is "I like foreign languages", where "I like" is taken as one chunk and "foreign languages" as the second chunk. The chunking corresponds to the phonology of the sentence: a chunk edge is created where there is a pause, and there is one major stress per chunk. The stress pattern and pause pattern of spoken language is called prosody.
Syntactically, a chunk contains one head or content word, like a verb or a noun, and the function words surrounding it. The form of chunks follows fixed templates. The relationships between chunks are not templated, but are governed more by lexical interactions. Chunks can move around each other as a whole, but items within a chunk cannot move within the chunk.
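To make the template idea concrete, here is a minimal sketch (my own illustration, not from the episode) of one hypothetical noun-phrase template, "optional determiner, any adjectives, one or more nouns", matched over hand-written Penn Treebank POS tags for the episode's example sentence:

    # One fixed chunk template: NP -> DT? JJ* NN+
    # (optional determiner, any adjectives, at least one noun head).
    def match_np(tags, start):
        """Return the end index of an NP template match at `start`, or None."""
        i = start
        if i < len(tags) and tags[i] == "DT":               # optional determiner
            i += 1
        while i < len(tags) and tags[i] == "JJ":            # any adjectives
            i += 1
        head = i
        while i < len(tags) and tags[i] in ("NN", "NNS"):   # noun head(s)
            i += 1
        return i if i > head else None                      # need at least one noun

    words = ["I", "like", "foreign", "languages"]
    tags = ["PRP", "VBP", "JJ", "NNS"]

    i = 0
    while i < len(words):
        end = match_np(tags, i)
        if end is not None:
            print("NP chunk:", words[i:end])                # -> ['foreign', 'languages']
            i = end
        else:
            i += 1

Only "foreign languages" fits the template; "I" and "like" are left outside any chunk.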
The chunking process is made of two main tasks. The first is segmentation: tokens are identified, and chunks are created based on the criteria described earlier. The second main task is labeling: labeling identifies the types of the words, and then the types of the chunks, such as noun phrase, etc.
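Both tasks fit in a short sketch using NLTK's RegexpParser (assuming NLTK is installed; the part-of-speech tags are hand-supplied so the example stays self-contained):

    import nltk

    # Labeling the words: each token gets a part-of-speech tag
    # (hand-supplied here; normally a POS tagger does this step).
    tagged = [("I", "PRP"), ("like", "VBP"),
              ("foreign", "JJ"), ("languages", "NNS")]

    # Segmentation plus labeling the chunks: the template decides the
    # chunk boundaries, and each matched span gets a chunk type (NP).
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    parser = nltk.RegexpParser(grammar)
    print(parser.parse(tagged))
    # (S I/PRP like/VBP (NP foreign/JJ languages/NNS))

The brackets in the printed tree show the segmentation, and the NP label is the result of labeling.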
A chunk parser groups related tokens into chunks, then combines the chunks with the dominant tokens, forming a two-level tree that covers the whole sentence. This tree is called a chunk structure. Chunking rules are applied in turn until all the rules have been used, and the resulting structure is returned. When a chunking rule is applied to a hypothesis, it only creates new chunks that don't overlap with any previous ones, so if we apply two non-identical rules in reverse order, we can get two different results. There are rules for chunking, rules for unchunking, rules for merging, rules for splitting, etc.
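NLTK's regular-expression chunker offers close analogues of these rule kinds, with chinking playing the part of unchunking; here is a sketch of a multi-rule grammar, again over a hand-tagged sentence:

    import nltk

    # The rules in a stage are applied in turn, and each rule only creates
    # chunks that don't overlap chunks made earlier, so reordering
    # non-identical rules can change the result.
    grammar = r"""
    NP: {<.*>+}          # chunk: start by chunking the whole sentence
        }<VB.*|IN>{      # chink (unchunk): take verbs and prepositions back out
        <NN.*>}{<DT>     # split: split a chunk between a noun and a determiner
        <NN.*>{}<NN.*>   # merge: join adjacent chunks ending and starting in nouns
    """
    parser = nltk.RegexpParser(grammar)
    tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
              ("on", "IN"), ("the", "DT"), ("mat", "NN")]
    print(parser.parse(tagged))
    # (S (NP the/DT cat/NN) sat/VBD on/IN (NP the/DT mat/NN))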
So as I said earlier, chunk parsing is also called partial parsing. What's the difference between chunk parsing and full parsing? Well, each one has its benefits and its downsides. Full parsing is a polynomial algorithm of degree three in the sentence length, whereas chunk parsing is a linear algorithm. Chunk parsing has a hierarchy of limited depth, whereas full parsing doesn't. But chunk parsing is not as awesome as it sounds, because it can give less than perfect results.
Two researchers from Tokyo suggested an alteration to chunk parsing to make it more efficient. They suggested using the classical sliding-window technique instead of tagging, so that all subsequences are considered rather than overlaps being avoided completely. They also suggested using a machine learning algorithm to filter the sequences that are in a context-free grammar, and they suggested a maximum entropy classifier for the filtering. For more detail, look at that paper; it's one of the links in the show notes.
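As a rough sketch of that suggestion (my reading of the episode, not the paper's actual code), the sliding window enumerates every subsequence as a candidate chunk, and a classifier then filters the candidates; the `score` function below is a hypothetical stand-in for the trained maximum entropy classifier:

    # Sliding-window candidate generation plus classifier filtering.
    def score(words, start, end):
        """Hypothetical chunk probability; a trained maximum entropy
        classifier would supply this in the real system."""
        return 0.9 if words[start:end] == ["foreign", "languages"] else 0.1

    def candidate_chunks(words, max_len=4):
        """Sliding window: every subsequence up to max_len, overlaps allowed."""
        for start in range(len(words)):
            for end in range(start + 1, min(start + max_len, len(words)) + 1):
                yield start, end, score(words, start, end)

    def filter_chunks(words, threshold=0.5):
        """Keep high-scoring candidates, best first, dropping overlaps."""
        kept, used = [], set()
        for start, end, p in sorted(candidate_chunks(words), key=lambda c: -c[2]):
            if p >= threshold and not used & set(range(start, end)):
                kept.append(words[start:end])
                used |= set(range(start, end))
        return kept

    print(filter_chunks(["I", "like", "foreign", "languages"]))
    # [['foreign', 'languages']]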
That's all for tonight. Thank you for listening. This was Plexi with Hacker Public Radio.
Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to C-A-R-O dot N-E-T for all of your hosting needs.

Thank you very much.