Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
47  hpr_transcripts/hpr0022.txt  Normal file
@@ -0,0 +1,47 @@
Episode: 22
Title: HPR0022: Chunk Parsing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0022/hpr0022.mp3
Transcribed: 2025-10-07 10:23:11

---

...

Hello everyone, welcome to another episode of Hacker Public Radio. I am your host, Plexi, and I'll be talking about parsing, specifically about chunk parsing. I will describe classical chunk parsing, then I will tell you about a modification suggested to make it more efficient. Chunk parsing is also called partial parsing. It's a robust parsing strategy proposed for natural language processing. An example of a chunked sentence is "I like foreign languages", where "I like" is taken as one chunk and "foreign languages" as the second chunk. The chunking corresponds to the phonology of the sentence: a chunk edge is created where there is a pause, and each chunk carries one major stress burst. The stress pattern and pause pattern of spoken language is called prosody.
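
A minimal sketch (not from the episode) of that example in plain Python, just to make the chunk boundaries concrete:

```python
# Hypothetical illustration of the chunking described above: the sentence is
# split into two non-overlapping chunks, and tokens keep their order inside
# each chunk.
sentence = ["I", "like", "foreign", "languages"]
chunks = [
    ["I", "like"],              # first chunk: function word + head verb
    ["foreign", "languages"],   # second chunk: modifier + head noun
]

# A chunk edge falls between "like" and "foreign", where a pause would occur
# in speech; each chunk carries one major stress (on "like" and "languages").
```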

Syntactically, a chunk contains one head or content word, like a verb or a noun, and the function words surrounding it. The form of chunks follows given fixed templates; the relationships between chunks are not templated, but are governed more by lexical interactions. Chunks can move around each other as a whole, but items within a chunk cannot move around within the chunk. The chunking process is made up of two main tasks. The first is segmentation: in segmentation, tokens are identified and chunks are created based on the criteria described earlier for chunks. The second main task is labeling: labeling identifies the types of the words, and then the types of the chunks, such as noun phrase and so on. A chunk parser groups related tokens into chunks, then combines the chunks with the dominant tokens, forming a two-level tree that covers the whole sentence. This tree is called a chunk structure.
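
As one concrete illustration of these two tasks, here is a sketch using NLTK's RegexpParser; the episode does not name a specific tool, and the part-of-speech tags, chunk labels, and grammar below are assumptions chosen for the example sentence.

```python
# A sketch (not from the episode) of the two tasks: labeling word types
# (part-of-speech tags, supplied here by hand) and grouping labeled tokens
# into chunks, which yields a two-level chunk structure.
import nltk

# Tokens with hand-assigned part-of-speech tags (the "labeling" of word types).
tagged = [("I", "PRP"), ("like", "VBP"), ("foreign", "JJ"), ("languages", "NNS")]

# Chunking rules written as fixed templates over tag sequences.
grammar = r"""
  VC: {<PRP><VBP>}   # "I like" as one chunk around the head verb
  NP: {<JJ>*<NNS>}   # "foreign languages" as a noun-phrase chunk
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)
# roughly: (S (VC I/PRP like/VBP) (NP foreign/JJ languages/NNS))
```

The printed result is a two-level tree: the sentence node on top, with the chunks (plus any tokens left outside a chunk) directly beneath it.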

Chunking rules are applied in turn until all the rules have been applied, and the resulting structure is returned. When a chunking rule is applied to a hypothesis, it only creates new chunks that don't overlap with any previous ones. So if we apply two non-identical rules in the reverse order, we get two different results. There are rules for chunking, rules for unchunking, rules for merging, rules for splitting, and so on.
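
A small sketch of the order dependence described above, again assuming NLTK's RegexpParser rather than anything named in the episode:

```python
# The same two rules applied in different orders give different chunk
# structures, because a rule never creates a chunk that overlaps an
# existing one.
import nltk

tagged = [("foreign", "JJ"), ("languages", "NNS")]

wide_first = nltk.RegexpParser(r"""
  NP:
    {<JJ><NNS>}   # rule 1: adjective + plural noun
    {<NNS>}       # rule 2: bare plural noun
""")
narrow_first = nltk.RegexpParser(r"""
  NP:
    {<NNS>}       # rule 2 first: chunks "languages" alone...
    {<JJ><NNS>}   # ...so rule 1 can no longer chunk "foreign languages"
""")

print(wide_first.parse(tagged))    # (S (NP foreign/JJ languages/NNS))
print(narrow_first.parse(tagged))  # (S foreign/JJ (NP languages/NNS))

# The same rule notation also covers the other rule types mentioned:
#   }<...>{         unchunking ("chinking")
#   <...>{}<...>    merging two adjacent chunks
#   <...>}{<...>    splitting one chunk in two
```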

So, as I said earlier, chunk parsing is also called partial parsing. What's the difference between chunk parsing and full parsing? Well, each one has its benefits and its downsides. Full parsing is a polynomial algorithm of degree three, whereas chunk parsing is a linear algorithm. Chunk parsing has a hierarchy of limited depth, whereas full parsing doesn't. But chunk parsing is not as awesome as it sounds, because it can have less than perfect results.

Two researchers from Tokyo suggested an alteration to chunk parsing to make it more efficient. They suggested using a classical sliding-window technique instead of tagging, so as to consider all subsequences rather than avoiding overlapping completely. They also suggested using a machine learning algorithm to filter the sequences that are in the context-free grammar; specifically, they suggested using a maximum entropy classifier for the filtering.
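
A rough, hypothetical sketch of that sliding-window filtering idea: the candidate enumeration follows the description above, while the scoring function is a made-up stand-in for the trained maximum entropy classifier the researchers used.

```python
# Enumerate every contiguous subsequence of the sentence as a candidate chunk
# (overlaps allowed), then let a classifier keep or discard each candidate.
# toy_score below is a placeholder; a real system would use a maximum entropy
# model (e.g. multinomial logistic regression) trained on labeled data.
from typing import Callable, List, Tuple

Token = Tuple[str, str]        # (word, part-of-speech tag)
Candidate = Tuple[int, int]    # (start, end) indices into the sentence

def sliding_window_candidates(tokens: List[Token], max_len: int = 4) -> List[Candidate]:
    """Every contiguous subsequence up to max_len tokens is a candidate chunk."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans

def filter_candidates(tokens: List[Token],
                      spans: List[Candidate],
                      score: Callable[[List[Token]], float],
                      threshold: float = 0.5) -> List[Candidate]:
    """Keep the candidates the classifier scores above the threshold."""
    return [s for s in spans if score(tokens[s[0]:s[1]]) >= threshold]

def toy_score(chunk: List[Token]) -> float:
    """Stand-in scorer: pretend candidates ending in a noun are likely chunks."""
    return 1.0 if chunk[-1][1].startswith("NN") else 0.0

tagged = [("I", "PRP"), ("like", "VBP"), ("foreign", "JJ"), ("languages", "NNS")]
kept = filter_candidates(tagged, sliding_window_candidates(tagged), toy_score)
print(kept)   # surviving spans: [(0, 4), (1, 4), (2, 4), (3, 4)]
```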

For more detail, look at the paper; it's one of the links in the show notes. That's all for tonight. Thank you for listening. This was Plexi with Hacker Public Radio.

Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to C-A-R-O dot N-E-T for all of the team. Thank you very much.