Episode: 1372
Title: HPR1372: Rootstrikers.org and federal election commission data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1372/hpr1372.mp3
Transcribed: 2025-10-18 00:22:52
---
🎵
Welcome to my second Hacker Public Radio show.
Today I'm going to keep this really short. There was a call for new shows, so as my contribution I will tell you about the Rootstrikers project from Lawrence Lessig. Rootstrikers is a project to reform campaign contributions in federal elections.
Lawrence Lessig started the Creative Commons project, which has been very successful, and he realized that under the current political situation there was no way to implement the changes our society needs to allow for Creative Commons. So he decided to strike at the root of the problem, which is campaign financing, and started the group Rootstrikers. Rootstrikers is basically campaigning for campaign finance reform, but it is also working on different software and data projects to support that.
I originally started helping out as a volunteer on one website, and then I was recruited to work on Wikipedia projects: basically cleaning up and editing different Wikipedia pages and adding campaign finance data to the pages of various politicians. This proved to be very difficult; people were just deleting it or calling me a vandal, and so on. So then I started to look into where this campaign finance data comes from, which is the Federal Election Commission, and I discovered that it is a huge mess of data formats. Basically, people submit data to the Federal Election Commission using forms, and there is also an electronic format in which to submit them.
This is basically a CSV, a comma-separated file, and the file format now has eight versions. They started in 2000, so the format has changed roughly once a year, and there are, I guess, 20 different software vendors producing these files, all of which differ slightly. So no single tool handles them all; the best software for reading them that I found was Fech from the New York Times, but it only supports the last three or four versions. So I started a project to re-implement Fech, which is written in Ruby, in Python, and you'll find my code on GitHub.
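To give a feel for what reading one of these filings involves, here is a minimal Python sketch. It assumes an older, comma-delimited filing whose first record is a header carrying the format version; the file name is made up, and real filings vary by version, so treat this as an illustration rather than the project's actual code.

    import csv

    def read_filing(path):
        """Read one electronic FEC filing: a header record, then itemized rows."""
        with open(path, newline="", encoding="latin-1") as fh:
            reader = csv.reader(fh)
            header = next(reader)            # e.g. HDR,FEC,<version>,...
            # Position of the version field is an assumption for older versions.
            version = header[2] if len(header) > 2 else "unknown"
            rows = list(reader)              # one itemized record per row
        return version, rows

    version, rows = read_filing("123456.fec")   # hypothetical file name
    print(f"format version {version}: {len(rows)} records")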
I'm now working on converting all of the CSV files into YAML files, which should be easier to process. Unfortunately, the YAML files I'm producing are a bit bloated: they still include the columns and the column numbers, and they also include the original data. I'm putting these files onto GitHub and Bitbucket and publishing them in Git format; the YAML files still need to be cleaned up, but it's a start. I'm hoping that other people will be interested in helping me process these files; I've started on a C++ program to read them.
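As a rough illustration of that conversion (the output layout here is an assumption, not the project's actual format), a converter that keeps the column numbers and the raw row, as the bloated output just described does, might look like this:

    import csv
    import yaml  # PyYAML

    def csv_to_yaml(csv_path, yaml_path):
        """Convert one FEC CSV file to YAML, keeping column numbers and raw data."""
        records = []
        with open(csv_path, newline="", encoding="latin-1") as fh:
            for row in csv.reader(fh):
                records.append({
                    "raw": ",".join(row),                          # original line: one source of bloat
                    "columns": {i: v for i, v in enumerate(row)},  # column number -> value
                })
        with open(yaml_path, "w") as out:
            yaml.safe_dump(records, out, sort_keys=False)

Dropping the raw line and replacing column numbers with real field names is essentially the cleanup step mentioned above.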
Basically, it's a huge amount of data. Using Git is also a bit bloated, because it copies the files as well, and using YAML is bloated in that the data gets copied a couple of times, so the files expand. For one year, we're talking about gigabytes of data; one year's worth can be between 10 and 20 gigabytes. Here I'm just looking at the file system: 2002, checked out, is 18 gigabytes of data. So it gets pretty heavy, and I'm hoping to work on optimizing the data formats.
The real problem with the data formats is that we need to get them cleaned up, we need to get the records cleaned up, and then we need to build a better unified data model that works across all eight versions of the format and describes all the fields and how they interrelate. So it's going to be a big project, and I expect it to take years.
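A minimal sketch of what such a unified model could look like: one canonical record type plus a per-version map from canonical fields to column positions. The field names and column indices below are invented for illustration; the real ones would come from the FEC's format documentation.

    from dataclasses import dataclass

    @dataclass
    class Contribution:
        """Canonical record shared by all format versions (fields invented)."""
        committee_id: str
        contributor_name: str
        amount: float
        date: str

    # Hypothetical per-version maps from canonical field to column index.
    COLUMN_MAPS = {
        "3.00": {"committee_id": 0, "contributor_name": 7, "amount": 14, "date": 13},
        "8.0":  {"committee_id": 0, "contributor_name": 8, "amount": 20, "date": 19},
    }

    def normalize(row, version):
        """Map one version-specific CSV row onto the canonical record."""
        m = COLUMN_MAPS[version]
        return Contribution(
            committee_id=row[m["committee_id"]],
            contributor_name=row[m["contributor_name"]],
            amount=float(row[m["amount"]] or 0),
            date=row[m["date"]],
        )

The hard part, as noted, is filling in those per-version maps and documenting how the fields interrelate.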
Now, the reason I started on this: you might ask, why isn't this data already out there? There are different repositories out there that provide aggregates, like MapLight, and originally I was using the MapLight data, but they don't publish their raw data and they don't share it, so basically you just have to trust their numbers. I'm someone who wants actual open access, and MapLight doesn't publish their software either, or make their data available. So I want to make this data open access and usable with open-source software. The long-term goal is that eventually we'll have small repositories of data split up by topic.
For example, there won't just be one repository for 2010 and one for 2011; we'll also split those by state, so you could say, I want to check out Kansas for 2010, and get exactly that data. We may even split them up per politician or per political group, to make the data easily accessible and usable.
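As a sketch of that partitioning (the directory layout and field names are assumptions), normalized records could be appended to per-year, per-state files, each of which could then become its own small repository:

    import os
    import yaml  # PyYAML

    def partition(records, root="fec-data"):
        """Append each record to a per-year, per-state YAML file under root."""
        for rec in records:
            year = rec["date"][:4]               # assumes an ISO-style date field
            state = rec.get("state", "unknown")
            path = os.path.join(root, year, state)
            os.makedirs(path, exist_ok=True)
            with open(os.path.join(path, "records.yaml"), "a") as out:
                # Each append adds one list item, so the file stays valid YAML.
                out.write(yaml.safe_dump([rec], default_flow_style=True))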
Another thing we need to do is document all of the different election committees. There are thousands of them, and we'll need to document each one as an individual entity. Some are on Wikipedia, and there are different websites that document them, but I would like to get them into a YAML format.

Now, this is where I can segue into the work I was doing around Open Congress; you'll see the congress-legislators repository on my GitHub. Basically, I helped clean up the congress-legislators project from unitedstates/congress-legislators. There they have all the current legislators in YAML format with various fields, including the FEC committee IDs and their addresses and contact details, so I would like to create a similar format describing the different election committees; a sketch follows below. Basically, you treat the data in YAML as a kind of wiki page, which people edit directly or automatically; instead of committing to a wiki, you push a commit to GitHub, and the changes get merged.
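To make that concrete, here is a hypothetical committee entry dumped through PyYAML, loosely modeled on the congress-legislators style; every field name and value is invented for illustration.

    import yaml  # PyYAML

    # Hypothetical committee entity; none of these fields come from a real schema.
    committee = {
        "id": {"fec": "C00000000"},              # placeholder FEC committee ID
        "name": "Example Victory Fund",
        "type": "pac",
        "state": "KS",
        "contact": {"address": "123 Example St", "city": "Topeka"},
    }

    # One committee per list item, so a file of these can be edited like a wiki page.
    print(yaml.safe_dump([committee], sort_keys=False))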
So it's kind of elite; I mean, it's not for the average person. I have some hosting here, on a donated Dreamhost account, but it only allows 10 megabytes of memory, so I can't even check out the GitHub repository there; I have to fetch the zip files and unpack them instead. So I'm not going to publish that host yet. Eventually we will create a website that lets people interact with the data directly and post changes, which will then be committed to Git as well. So I hope to use Git as the repository for the data; we just have to make the Git repos small enough that they can be managed easily. And then, I mean, there's a lot of work to do; just processing all this data is a lot of work.
So I'm basically posting this show to Hacker Public Radio; if anyone's interested, they can contact me, and my details are in the show notes. If you want to help with the code, any help is welcome, and if you want to help host the data, that would be great. Ideally we would find a big machine that could host the processing of this data, because it's just so intensive. Right now I'm using Python, which is inefficient, and YAML, which is inefficient. Sure, we could load the data into a Postgres database, for example, which would be more efficient, or process it with the C++ program I've started on. But if we create a Postgres database, then we need to host it, whereas the idea of using YAML and Git is that people can check out little bits and process them; if you really want the whole database, it's going to be huge, and we'll need a host for that. So there are different ways to look at it, and I'm hoping this gains some interest and that I can find people to help out. So yeah, well, thanks for listening, and bye.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored by Lunar Pages. From shared hosting to custom private clouds, go to lunarpages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike license.