Episode: 1372
Title: HPR1372: Rootstrikers.org and federal election commission data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1372/hpr1372.mp3
Transcribed: 2025-10-18 00:22:52
---
🎵
Welcome to my second Hacker Public Radio show.
Today I'm going to make this real short.
There was a call for new shows, so in this episode I will tell you about the Rootstrikers project from Lawrence Lessig. Rootstrikers is a project to reform campaign contributions in federal elections.
Lawrence Lessig started the Creative Commons project, which has been very successful, and he realized that with the current political situation there was no way to implement the changes our society needs to allow for a creative commons. So he decided to strike at the root of the problem, which is campaign financing, and started the group Rootstrikers.
Rootstrikers is basically campaigning for campaign finance reform, but it is also working on various software and data projects to support that.
I originally started helping out as a volunteer on one website, and then I was recruited to work on Wikipedia: basically cleaning up and editing different Wikipedia pages and adding campaign finance data to the pages of various politicians.
That proved to be very difficult: people were just deleting the data or calling me a vandal, and so on. So I started to look into where this campaign finance data comes from, which is the Federal Election Commission (FEC), and I discovered that it is a huge mess of data formats. Basically, people submit data to the FEC using forms, and there is also an electronic format for them to submit.
The electronic filing is basically a CSV, a comma-separated file, and the file format now has eight versions. They started in 2000, so there has been roughly one version change per year. Then there are, I guess, twenty different software vendors producing these files, all a little bit different, so there has been no good software for reading them. The best software I found was Fech from The New York Times, but it only supports the last three or four versions. So I started a project to re-implement Fech, which is written in Ruby, in Python; you'll find my code on GitHub.
I'm now working on converting all of the CSV files into YAML files, which should be easier to process. Unfortunately, the YAML files I'm producing are a bit bloated: they still include the columns and column numbers, and they also include the original data.
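As a rough illustration of the conversion being described (the column names below are invented, not the real FEC schema), one CSV row could become a YAML record like this:

```python
import csv
import io

# Hypothetical column names for a single filing-format version; the
# real FEC schemas differ across all eight versions.
COLUMNS = ["form_type", "committee_id", "report_type"]

def row_to_yaml(line):
    """Turn one CSV line into a YAML fragment that, like the files
    described above, keeps the column numbers and the original data."""
    row = next(csv.reader(io.StringIO(line)))
    out = [f"- original: {line!r}", "  fields:"]
    for i, (name, value) in enumerate(zip(COLUMNS, row)):
        out += [f"    - column: {i}",
                f"      name: {name}",
                f"      value: {value}"]
    return "\n".join(out)

print(row_to_yaml("F3,C00123456,Q1"))
```

Keeping the original line and the column numbers is what makes the files bloated, but it also makes the conversion auditable.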
I'm putting these files onto GitHub and Bitbucket, publishing them in Git format. The YAML files still need to be cleaned up, but it's a start, and I'm hoping that other people will be interested in helping me process them. I've started on a C++ program to read them.
It's a huge amount of data. Using Git is also a bit bloated, because it keeps copies of the files, and the YAML is bloated too because the data gets copied a couple of times, so these files expand. For one year we're talking about gigabytes of data; one year's worth could be between 10 and 20 gigabytes. I'm just looking at the file system here: 2002, checked out, is 18 gigabytes of data. So it gets pretty heavy, and I'm hoping to work on optimizing the data formats.
The real problem with the data formats is that we need to get them cleaned up, we need to get the records cleaned up, and then we need to build a better data model that is unified across all eight versions of the format and describes all the fields and how they interrelate.
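A unified model could start as a simple per-version mapping from column position to a canonical field name; the version labels, columns, and field names below are made up for illustration:

```python
# Hypothetical per-version layouts: each filing-format version puts
# the same logical fields in different columns.
FIELD_MAP = {
    "v6": {0: "form_type", 1: "filer_id", 2: "amount"},
    "v8": {0: "form_type", 1: "filer_id", 3: "amount"},
}

def normalize(version, row):
    """Translate a raw row into one canonical record, regardless of
    which file-format version produced it."""
    return {field: row[col] for col, field in FIELD_MAP[version].items()}

# The same logical record, filed under two different format versions:
old = normalize("v6", ["SA11", "C00123456", "500.00"])
new = normalize("v8", ["SA11", "C00123456", "", "500.00"])
print(old == new)  # → True
```

The real work, of course, is filling in those mappings for every form type in every version and documenting how the fields interrelate.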
It's going to be a big project, and I expect it to take years.
Now, the reason I started on this: you might ask, why isn't this data already out there? Well, there are different repositories that provide aggregates, like MapLight, and originally I was using the MapLight data. But they don't publish or share their raw data, so basically you just have to trust their numbers. I'm someone who wants actual open access, and MapLight doesn't publish its software or make its data available either. So for me, the goal is to make this data open access and usable with open-source software.
The long-term goal is that eventually we'll have small repositories of data split up by topic. For example, not only one repository for 2010 and one for 2011, but also split by state, so you could say "I want to check out Kansas for 2010" and have just that data, or even split it per politician or per political group. We'll split them up and make the data easily accessible and usable.
Another thing we need to do is document all the different election committees, and there are thousands of them; we'll need to document each as an individual entity. There are people on Wikipedia and various websites that document them, but I would like to get them into a YAML format. Now this is where I can segue
into the work I was doing on congress-legislators; you'll see the repository on GitHub. I helped clean up the congress-legislators project from the unitedstates organization (unitedstates/congress-legislators), which has all the current legislators in YAML format with various fields, including their FEC committee IDs, addresses, and contact details. I would like to create a similar format describing the different election committees. Basically, you treat the YAML data as a kind of wiki page: people edit it directly or automatically, and instead of committing it to a wiki, you push a commit to GitHub and those changes get merged. So it's kind of an elite workflow.
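A committee entry in the spirit of the congress-legislators files might be generated like this (the ID and fields are invented for illustration, not an actual schema):

```python
def committee_to_yaml(committee):
    """Render one committee record as a YAML list entry, modelled
    loosely on the congress-legislators file layout."""
    return "\n".join([
        "- id:",
        f"    fec: {committee['fec']}",
        "  name: " + committee["name"],
        "  address: " + committee["address"],
    ])

entry = committee_to_yaml({
    "fec": "C00999999",  # invented FEC committee ID
    "name": "Example Victory Fund",
    "address": "123 Main St, Topeka, KS",
})
print(entry)
```

Records like this are plain text, so they can be edited by hand or by scripts and reviewed as ordinary Git commits.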
I mean, it's not for the average person. As for hosting, someone donated a DreamHost account, but it only has 10 megabytes of memory, so I can't even clone the Git repository there; I have to download the zip files and unpack them instead. So I'm not going to publish that host yet. Eventually we will create a website that lets people interact with the data directly and post changes, which will then be committed to Git as well. So I hope to use Git as the repository for the data; we just have to make the Git repos small enough that they can actually be managed easily. There's a lot of work to do; just processing all of this data is a lot of work.
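Splitting the data into small, manageable buckets, as described earlier, could be sketched like this (the record fields are illustrative):

```python
from collections import defaultdict

def partition(records):
    """Group records into per-year, per-state buckets; each bucket
    could become its own small Git repository."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec["year"], rec["state"])].append(rec)
    return buckets

records = [
    {"year": 2010, "state": "KS", "committee": "C001"},
    {"year": 2010, "state": "KS", "committee": "C002"},
    {"year": 2011, "state": "NY", "committee": "C003"},
]
buckets = partition(records)
# "Check out Kansas for 2010" then means fetching just one bucket:
print(len(buckets[(2010, "KS")]))  # → 2
```

Each bucket stays small enough to clone and process on modest hardware, which is the point of the split.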
I'm basically posting this show to Hacker Public Radio so that anyone who's interested can contact me; my details are in the show notes. If you want to help with the code, any help is welcome, and if you want to help host the data, that would be great. Ideally we would find a big machine that could host the processing of this data, because it's just so intensive. Right now I'm using Python, which is inefficient, and YAML, which is inefficient. Sure, we could load the data into a Postgres database, for example, which would be more efficient, or process it with the C++ reader I've started on. But then again, if we create a Postgres database, we need to host it. The idea of using YAML and Git is that people can just check out little bits and process them; if you really want the whole database, it's going to be huge, and we'll need a host for that. So there are different ways to look at it, and I'm hoping this gains some interest and I can find people to help out.
So yeah, well, thanks for listening, and bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever consider recording a podcast, then visit our website to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club.
HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored
by Lunar Pages. From shared hosting to custom private clouds, go to lunarpages.com for all your
hosting needs. Unless otherwise stated, today's show is released under a Creative Commons
Attribution-ShareAlike license. Details are on our website.