Episode: 1372
Title: HPR1372: Rootstrikers.org and federal election commission data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1372/hpr1372.mp3
Transcribed: 2025-10-18 00:22:52

---

🎵
Welcome to my second Hacker Public Radio show. Today I'm going to make this real short. There was a call for new shows to be published, so in my attempt I will tell you about the Rootstrikers project from Lawrence Lessig. Rootstrikers is a project to reform campaign contributions in federal elections.

Lawrence Lessig started the Creative Commons project, which is very successful, and he realized that there was no way to implement the changes needed in our society to allow for Creative Commons with the current political situation. So he decided to strike at the root of the problem, which is campaign financing, and started the group Rootstrikers. Rootstrikers is basically campaigning for campaign finance reform, but it is also working on different software and data projects to do so.
I originally started helping out as a volunteer on one website, and then I was recruited to help with Wikipedia projects: basically cleaning up and editing different Wikipedia pages and putting the campaign finance data onto the Wikipedia pages of the various politicians. This proved to be very difficult; people were just deleting it or calling me a vandal, etc. So then I started to look into where this campaign finance data comes from, which is the Federal Election Commission, and I discovered that it is a huge mess of data formats. Basically, people submit data to the Federal Election Commission using forms, and there is also an electronic record for them to submit.
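As a concrete illustration of what such an electronic record looks like: filings carry a header record that declares the file-format version. The `HDR,FEC,<version>,...` layout below is an assumption based on the shape of newer versions; older versions differ, and some use a different delimiter entirely, so treat this as a sketch rather than a parser.

```python
import csv
import io

def detect_fec_version(text):
    """Read the HDR record at the top of an electronic FEC filing
    and return the declared format version (e.g. "8.0").

    Assumes the modern HDR,FEC,<version>,... layout; older
    versions differ, which is exactly the problem described.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if len(header) >= 3 and header[0].strip().upper() == "HDR":
        return header[2].strip()
    return None  # unrecognized layout, likely a very old version

# A made-up header line in the shape used by recent versions:
sample = 'HDR,FEC,8.0,"Vendor Software","1.2.3"\n'
print(detect_fec_version(sample))  # prints 8.0
```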
This is basically a CSV, a comma-separated file, and this file format now has eight versions. They started in 2000, so there has basically been a version per year of the file format being changed, and there are, I guess, 20 different software vendors producing these files, all of which are a little bit different. So there has been no good software for it; the best software for reading it that I found was Fech from the New York Times, but it only supports the last three or four versions. So I started a project to re-implement Fech, which is written in Ruby, in Python, and you'll find my code on GitHub.

I'm working now on converting all of the CSV files into YAML files, which should be easier to process. Unfortunately, the YAML files that I'm producing are a bit bloated: they still include the column numbers, and they also include the original data. I'm putting these files onto GitHub and Bitbucket and publishing them in Git format, and the YAML files still need to be cleaned up, but it's a start. I'm hoping that other people will be interested in helping me process these files; I've started on a C++ program to read them.
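A minimal sketch of the kind of CSV-to-YAML conversion described, stdlib-only and with hypothetical column names. Like the files the host describes, it deliberately keeps the column number and the original raw value alongside each field, which is why the output is bloated:

```python
import csv
import io

# Hypothetical column names for one record type; the real names
# vary across the eight format versions, which is the whole problem.
COLUMNS = ["form_type", "committee_id", "amount", "date"]

def row_to_yaml(row):
    """Emit one CSV row as a YAML list of field mappings, keeping
    the column number and the original raw value for each field."""
    lines = ["- record:"]
    for i, (name, value) in enumerate(zip(COLUMNS, row)):
        lines.append(f"    - column: {i}")
        lines.append(f"      name: {name}")
        lines.append(f"      original: {value!r}")
    return "\n".join(lines)

reader = csv.reader(io.StringIO("SA17A,C00123456,250.00,20100315\n"))
print(row_to_yaml(next(reader)))
```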
Basically, it's a huge amount of data. Using Git is also a bit bloated because it copies the files as well, and using YAML is bloated in that the data gets copied a couple of times, so these files get expanded. For one year, we're talking about gigabytes of data; one year's worth could be between 10 and 20 gigabytes. Here I'm just looking at the file system: 2002, checked out, is 18 gigabytes of data. So it gets pretty heavy, and I'm hoping to work on optimizing the data formats. The real problem with the data formats is that we need to get the records cleaned up, and then we need to build a better data model that is unified across all eight versions of the format, and describe all the fields and how they interrelate with each other. It's going to be a big project, and I'm expecting it to take years to do.

Now, the reason why I started on this is because, you know, you say, well, why isn't this data out there already? Well, there are different repositories of data out there that provide aggregates, like MapLight, and originally I was using the MapLight data, but they don't publish their raw data, and they don't share it, so basically you just have to trust their numbers. I'm someone who wants to actually have open access; MapLight doesn't publish their software either, or make their data available. So for me, I want to make this data open access and usable with open-source software, and that's the long-term goal:
eventually we'll have small repositories of data split up across topics. For example, instead of just one repository for 2010 and one for 2011, we'll split those across states, so you could say, "I want to check out Kansas for 2010," and have just that data, and even split per politician or per political group, to make the data easily accessible and usable.

Another thing we need to do is document all of the different election committees. There are thousands of them, and we'll need to document them as individual entities. There are people on Wikipedia and different websites that document them, but I would like to get them into a YAML format. This is where I can segue into the work I was doing around Open Congress; you'll see my repository on GitHub, congress-legislators. Basically, I helped clean up the congress-legislators project from unitedstates/congress-legislators, and there they have all the current legislators in YAML format with various fields, including the FEC committee IDs and their addresses and contact details. I would like to create a similar format describing the different election committees. Basically, you're treating the data in YAML as kind of a wiki page, and people edit it directly or automatically, and instead of committing it to a wiki, you push a commit to GitHub, and those get merged. So it's kind of an elite workflow.
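A sketch of what one committee entity file might look like, loosely modeled on the unitedstates/congress-legislators layout. Every field name and value here is hypothetical, and the tiny emitter is hand-rolled only to keep the example self-contained:

```python
# Hypothetical committee record in the style of congress-legislators.
committee = {
    "id": {"fec": "C00123456"},
    "name": "Example Campaign Committee",
    "address": {"city": "Topeka", "state": "KS"},
}

def to_yaml(obj, indent=0):
    """Tiny YAML emitter for nested dicts of strings (stdlib only)."""
    pad = "  " * indent
    out = []
    for key, value in obj.items():
        if isinstance(value, dict):
            out.append(f"{pad}{key}:")
            out.append(to_yaml(value, indent + 1))
        else:
            out.append(f"{pad}{key}: {value}")
    return "\n".join(out)

print(to_yaml(committee))
```

Because each committee lives in its own small text file, edits arrive as ordinary Git commits rather than wiki revisions, which is the workflow described above.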
I mean, it's not for the average person. I have some hosting: someone donated a Dreamhost host, but it only has 10 megabytes of memory, so I can't even check out the GitHub repository with the memory that's there; I have to actually get the zip files and unpack them instead, so I'm not going to publish that host yet. Eventually we will create a website allowing people to actually interact with the data directly and post changes, which will then be committed to Git as well. So I hope to use Git as a repository for the data; we just have to make the Git repos small enough that they can actually be managed easily. And then, I mean, there's a lot of work to do; just processing all this data is a lot of work.
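The "small repos" idea above can be sketched as a simple partition by year and state, so each bucket is small enough to clone on its own; the record shape here is hypothetical:

```python
from collections import defaultdict

# Hypothetical records: (year, state, committee_id, amount)
records = [
    (2010, "KS", "C001", 250),
    (2010, "KS", "C002", 100),
    (2010, "MO", "C003", 500),
    (2011, "KS", "C001", 75),
]

def partition(records):
    """Group records by (year, state) so each bucket could live in
    its own small Git repository, checked out independently."""
    buckets = defaultdict(list)
    for year, state, cid, amount in records:
        buckets[(year, state)].append((cid, amount))
    return dict(buckets)

buckets = partition(records)
print(sorted(buckets))  # [(2010, 'KS'), (2010, 'MO'), (2011, 'KS')]
```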
I'm basically posting this show to Hacker Public Radio so that if anyone's interested, they can contact me; see my contact details in the show notes. If you want to help with the code, any help is welcome, and if you want to help host the data, that would be great. Ideally we would find a big machine that could actually host the processing of this data, because it's just so intensive. Right now I'm using Python, which is inefficient, and YAML, which is inefficient. I mean, sure, we could turn the data into a PostgreSQL database, for example, which would be more efficient, or process the data using C++; I've started on that. But if we create a PostgreSQL database, then we need to host it, whereas the idea of using YAML and Git is that people can just check out little bits and process them. If you really want the whole database, then it's going to be huge, and we'll need a host for that. So there are different ways to look at it, and I'm hoping it gains some interest here, and we'll see if I can find people to help out.

So yeah, well, thanks for listening, and bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored by Lunar Pages. From shared hosting to custom private clouds, go to LunarPages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike license.