Episode: 1372
Title: HPR1372: Rootstrikers.org and federal election commission data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1372/hpr1372.mp3
Transcribed: 2025-10-18 00:22:52

---

🎵
Welcome to my second Hacker Public Radio show. Today I'm going to make this real short. There was a call for new shows to be published, so in my attempt I will tell you about the Rootstrikers project from Lawrence Lessig. Rootstrikers is a project to reform campaign contributions in federal elections.

Lawrence Lessig started the Creative Commons project, which is very successful, and he realized that there was no way to implement the changes needed in our society to allow for Creative Commons with the current political situation. So he decided to strike at the root of the problem, which is campaign financing, and started the group Rootstrikers. Rootstrikers is basically campaigning for campaign finance reform, but it is also working on different software and data projects to do so.
I originally started helping out as a volunteer on one website, and then I was recruited to help with Wikipedia projects: basically cleaning up and editing different Wikipedia pages and putting the campaign finance data onto the Wikipedia pages of the various politicians. This proved to be very difficult; people were just deleting it or calling me a vandal, etc. So then I started to look into where this campaign finance data comes from, which is the Federal Election Commission, and I discovered that it is a huge mess of data formats. Basically, people submit data to the Federal Election Commission using forms, and there is also an electronic record for them to submit.
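As a concrete illustration of what such an electronic record looks like: filings carry a header record that declares the file-format version. The `HDR,FEC,<version>,...` layout below is an assumption based on the shape of newer versions; older versions differ, and some use a different delimiter entirely, so treat this as a sketch rather than a parser.

```python
import csv
import io

def detect_fec_version(text):
    """Read the HDR record at the top of an electronic FEC filing
    and return the declared format version (e.g. "8.0").

    Assumes the modern HDR,FEC,<version>,... layout; older
    versions differ, which is exactly the problem described.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if len(header) >= 3 and header[0].strip().upper() == "HDR":
        return header[2].strip()
    return None  # unrecognized layout, likely a very old version

# A made-up header line in the shape used by recent versions:
sample = 'HDR,FEC,8.0,"Vendor Software","1.2.3"\n'
print(detect_fec_version(sample))  # prints 8.0
```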
This is basically a CSV, a comma-separated file, and this file format now has eight versions. They started in 2000, so there has basically been a version per year of the file format being changed, and there are, I guess, 20 different software vendors producing these files, all of which are a little bit different. So there has been no good software for it; the best software for reading it that I found was Fech from the New York Times, but it only supports the last three or four versions. So I started a project to re-implement Fech, which is written in Ruby, in Python, and you'll find my code on GitHub.

I'm working now on converting all of the CSV files into YAML files, which should be easier to process. Unfortunately, the YAML files that I'm producing are a bit bloated: they still include the column numbers, and they also include the original data. I'm putting these files onto GitHub and Bitbucket and publishing them in Git format, and the YAML files still need to be cleaned up, but it's a start. I'm hoping that other people will be interested in helping me process these files; I've started on a C++ program to read them.
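A minimal sketch of the kind of CSV-to-YAML conversion described, stdlib-only and with hypothetical column names. Like the files the host describes, it deliberately keeps the column number and the original raw value alongside each field, which is why the output is bloated:

```python
import csv
import io

# Hypothetical column names for one record type; the real names
# vary across the eight format versions, which is the whole problem.
COLUMNS = ["form_type", "committee_id", "amount", "date"]

def row_to_yaml(row):
    """Emit one CSV row as a YAML list of field mappings, keeping
    the column number and the original raw value for each field."""
    lines = ["- record:"]
    for i, (name, value) in enumerate(zip(COLUMNS, row)):
        lines.append(f"    - column: {i}")
        lines.append(f"      name: {name}")
        lines.append(f"      original: {value!r}")
    return "\n".join(lines)

reader = csv.reader(io.StringIO("SA17A,C00123456,250.00,20100315\n"))
print(row_to_yaml(next(reader)))
```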
Basically, it's a huge amount of data. Using Git is also a bit bloated because it copies the files as well, and using YAML is bloated in that the data gets copied a couple of times, so these files get expanded. For one year, we're talking about gigabytes of data; one year's worth could be between 10 and 20 gigabytes. Here I'm just looking at the file system: 2002, checked out, is 18 gigabytes of data. So it gets pretty heavy, and I'm hoping to work on optimizing the data formats. The real problem with the data formats is that we need to get the records cleaned up, and then we need to build a better data model that is unified across all eight versions of the format, and describe all the fields and how they interrelate with each other. It's going to be a big project, and I'm expecting it to take years to do.

Now, the reason why I started on this is because, you know, you say, well, why isn't this data out there already? Well, there are different repositories of data out there that provide aggregates, like MapLight, and originally I was using the MapLight data, but they don't publish their raw data, and they don't share it, so basically you just have to trust their numbers. I'm someone who wants to actually have open access; MapLight doesn't publish their software either, or make their data available. So for me, I want to make this data open access and usable with open-source software, and that's the long-term goal:
eventually we'll have small repositories of data split up across topics. For example, instead of just one repository for 2010 and one for 2011, we'll split those across states, so you could say, "I want to check out Kansas for 2010," and have just that data, and even split per politician or per political group, to make the data easily accessible and usable.

Another thing we need to do is document all of the different election committees. There are thousands of them, and we'll need to document them as individual entities. There are people on Wikipedia and different websites that document them, but I would like to get them into a YAML format. This is where I can segue into the work I was doing around Open Congress; you'll see my repository on GitHub, congress-legislators. Basically, I helped clean up the congress-legislators project from unitedstates/congress-legislators, and there they have all the current legislators in YAML format with various fields, including the FEC committee IDs and their addresses and contact details. I would like to create a similar format describing the different election committees. Basically, you're treating the data in YAML as kind of a wiki page, and people edit it directly or automatically, and instead of committing it to a wiki, you push a commit to GitHub, and those get merged. So it's kind of an elite workflow.
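A sketch of what one committee entity file might look like, loosely modeled on the unitedstates/congress-legislators layout. Every field name and value here is hypothetical, and the tiny emitter is hand-rolled only to keep the example self-contained:

```python
# Hypothetical committee record in the style of congress-legislators.
committee = {
    "id": {"fec": "C00123456"},
    "name": "Example Campaign Committee",
    "address": {"city": "Topeka", "state": "KS"},
}

def to_yaml(obj, indent=0):
    """Tiny YAML emitter for nested dicts of strings (stdlib only)."""
    pad = "  " * indent
    out = []
    for key, value in obj.items():
        if isinstance(value, dict):
            out.append(f"{pad}{key}:")
            out.append(to_yaml(value, indent + 1))
        else:
            out.append(f"{pad}{key}: {value}")
    return "\n".join(out)

print(to_yaml(committee))
```

Because each committee lives in its own small text file, edits arrive as ordinary Git commits rather than wiki revisions, which is the workflow described above.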
I mean, it's not for the average person. I have some hosting: someone donated a Dreamhost host, but it only has 10 megabytes of memory, so I can't even check out the GitHub repository with the memory that's there; I have to actually get the zip files and unpack them instead, so I'm not going to publish that host yet. Eventually we will create a website allowing people to actually interact with the data directly and post changes, which will then be committed to Git as well. So I hope to use Git as a repository for the data; we just have to make the Git repos small enough that they can actually be managed easily. And then, I mean, there's a lot of work to do; just processing all this data is a lot of work.
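The "small repos" idea above can be sketched as a simple partition by year and state, so each bucket is small enough to clone on its own; the record shape here is hypothetical:

```python
from collections import defaultdict

# Hypothetical records: (year, state, committee_id, amount)
records = [
    (2010, "KS", "C001", 250),
    (2010, "KS", "C002", 100),
    (2010, "MO", "C003", 500),
    (2011, "KS", "C001", 75),
]

def partition(records):
    """Group records by (year, state) so each bucket could live in
    its own small Git repository, checked out independently."""
    buckets = defaultdict(list)
    for year, state, cid, amount in records:
        buckets[(year, state)].append((cid, amount))
    return dict(buckets)

buckets = partition(records)
print(sorted(buckets))  # [(2010, 'KS'), (2010, 'MO'), (2011, 'KS')]
```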
I'm basically posting this show to Hacker Public Radio so that if anyone's interested, they can contact me; see my contact details in the show notes. If you want to help with the code, any help is welcome, and if you want to help host the data, that would be great. Ideally we would find a big machine that could actually host the processing of this data, because it's just so intensive. Right now I'm using Python, which is inefficient, and YAML, which is inefficient. I mean, sure, we could turn the data into a PostgreSQL database, for example, which would be more efficient, or process the data using C++; I've started on that. But if we create a PostgreSQL database, then we need to host it, whereas the idea of using YAML and Git is that people can just check out little bits and process them. If you really want the whole database, then it's going to be huge, and we'll need a host for that. So there are different ways to look at it, and I'm hoping it gains some interest here, and we'll see if I can find people to help out.

So yeah, well, thanks for listening, and bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored by Lunar Pages. From shared hosting to custom private clouds, go to LunarPages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike license.