Episode: 1372
Title: HPR1372: Rootstrikers.org and federal election commission data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1372/hpr1372.mp3
Transcribed: 2025-10-18 00:22:52

---

🎵

Welcome to my second Hacker Public Radio show. Today I'm going to make this real short. There was a call for new shows to be published, so as my attempt I will tell you about the Rootstrikers project from Lawrence Lessig.

Rootstrikers is a project to reform campaign contributions in federal elections. Lawrence Lessig started the Creative Commons project, which is very successful, and he realized that, with the current political situation, there was no way to implement the changes needed in our society to allow for Creative Commons. So he decided to strike at the root of the problem, which is campaign financing, and started the group Rootstrikers. Rootstrikers is basically campaigning for campaign finance reform, but it is also working on different software projects and data projects to do so.

I originally started helping out as a volunteer on one website, and then I was recruited to help work on Wikipedia projects: basically cleaning up and editing different Wikipedia pages and putting the campaign finance data onto the Wikipedia pages of the various politicians. This proved to be very difficult, because people were just deleting it or calling me a vandal, etc. So then I started to look into where this campaign finance data comes from, which is the Federal Election Commission, and I discovered that it is a huge mess of data formats. Basically, people submit data to the Federal Election Commission using forms, and there is also an electronic record format for them to submit.
This is basically a CSV, a comma-separated file, and this file format now has 8 versions. They started in 2000, so there has been basically a version per year of the file format being changed. And then there are, I guess, 20 different software vendors that are producing these files, all of which are a little bit different. The best software for reading it that I found was Fech from the New York Times, but it only supports the last three or four versions. So I started a project to re-implement Fech, which is written in Ruby, in Python, and you'll find my code on GitHub.

I'm working now on converting all of the CSV files into YAML files, which should be easier to process. Unfortunately, the YAML files that I'm producing are a bit bloated: they still include the columns and the column numbers, and they also include the original data. I'm putting these files onto GitHub and Bitbucket and publishing them in Git format. The YAML files still need to be cleaned up, but it's a start, so I'm hoping that other people will be interested in helping me process these files. I've started on a C++ program to read them.

Basically it's a huge amount of data. Using Git is also a bit bloated because it keeps a copy of the files as well, and using YAML is also bloated, so the data is copied a couple of times and these files get expanded. For one year, we're talking about gigabytes of data: one year's worth of data could be between 10 and 20 gigabytes. Here I'm just looking at the file system: 2002, checked out, is 18 gigabytes of data. So it gets pretty heavy, and I'm hoping to work on optimizing the data formats.
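As an illustration of the CSV-to-YAML conversion described above, here is a minimal Python sketch. It is not the speaker's code. The `LAYOUTS` table, its field names, and the sample row are made up for the example; the real FEC formats have many more columns per version. Unlike the files described in the episode, this output does not keep column numbers or the original raw row.

```python
import csv
import io
import json

# Hypothetical column layouts keyed by format version (assumption:
# the real FEC electronic filing formats differ and are much wider).
LAYOUTS = {
    "5.0": ["form_type", "committee_id", "contributor_name", "amount"],
    "8.0": ["form_type", "committee_id", "last_name", "first_name", "amount"],
}

def parse_rows(text, version):
    """Map each CSV row to a dict using the layout for the given version."""
    columns = LAYOUTS[version]
    for row in csv.reader(io.StringIO(text)):
        # zip stops at the shorter sequence, so extra trailing cells are dropped.
        yield dict(zip(columns, row))

def to_yaml(records):
    """Emit a YAML list of flat mappings.

    Values are written with json.dumps, because JSON-quoted scalars
    are also valid YAML. Keys are assumed to be plain identifiers.
    """
    lines = []
    for record in records:
        prefix = "- "
        for key, value in record.items():
            lines.append(f"{prefix}{key}: {json.dumps(value)}")
            prefix = "  "
    return "\n".join(lines) + "\n"

# Made-up sample row in the hypothetical "5.0" layout.
sample = 'SA11AI,C00000001,"Doe, Jane",250.00\n'
records = list(parse_rows(sample, "5.0"))
print(to_yaml(records))
```

Keeping one layout per version in a lookup table is one way to cope with a format that changes roughly once a year: supporting an older version means adding an entry rather than writing a new parser.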
The problem really with the data formats is that we need to get them cleaned up, we need to get the records cleaned up, and then we need to build a better data model that is unified across all eight versions of the format and describes all the fields and how they interrelate with each other. So it's going to be a big project, and I'm expecting it to take years to do.

Now, the reason why I started on this: you might say, well, why isn't this data out there already? Well, there are different repositories of data out there that provide aggregates, like MapLight, and originally I was using the MapLight data. But they don't publish their raw data and they don't share it, so basically you just have to trust their numbers. I'm someone who wants to actually have open access to the software, and MapLight doesn't publish their software either, or make their data available. So for me, I want to make this data open access and usable with open-source software.

That's the long-term goal: eventually we'll have small repositories of data split up across topics. For example, not only will you have one repository for 2010 and one repository for 2011, but we'll split those by state, so you could say "I want to check out Kansas for 2010" and have the data in there. Even per politician or per political group, we'll split them up and make the data easily accessible and usable.

Another thing we need to do is actually document all of the different election committees. There are thousands of them, and we'll need to document those as individual entities. There are some on Wikipedia, and there are different websites that document them, but I would like to get them into a YAML format.
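The idea of splitting the data into small per-year, per-state repositories can be sketched in Python as a simple grouping step. This is only an illustration under assumptions: the `date` and `state` field names, the sample records, and the `fec-YEAR-STATE` naming scheme are invented for the example and do not come from the episode.

```python
from collections import defaultdict

def partition_key(record):
    """Return the (year, state) pair that decides which repository a record goes to.

    Assumes a "date" field in YYYY-MM-DD form and a two-letter "state" field.
    """
    return record["date"][:4], record["state"]

def partition(records):
    """Group records by (year, state)."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_key(record)].append(record)
    return dict(groups)

# Made-up sample records.
records = [
    {"date": "2010-03-01", "state": "KS", "amount": "250.00"},
    {"date": "2010-07-15", "state": "KS", "amount": "100.00"},
    {"date": "2011-01-20", "state": "NY", "amount": "500.00"},
]

groups = partition(records)
for (year, state), items in sorted(groups.items()):
    # Hypothetical repository naming scheme.
    print(f"fec-{year}-{state}: {len(items)} records")
```

With a grouping like this, someone interested only in Kansas for 2010 would need to check out just the one small repository for that key.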
Now this is where I can segue into the work that I was doing on Open Congress. You'll see my repository on GitHub, congress-legislators. Basically I helped clean up the congress-legislators project from unitedstates/congress-legislators, and there they have all the current legislators in YAML format with various fields, including the FEC committee IDs and their addresses and contact details. I would like to create a similar format for the different election committees, describing them. Basically you're treating the data in YAML as kind of a wiki page: people edit that directly or automatically, and instead of committing it to a wiki, you post a commit to GitHub, and those get merged. So it's kind of elite, I mean it's not for the average person.

I have some hosting here: someone donated a Dreamhost host, and it's only 10 megabytes of memory, so I can't even check out the GitHub repository on there. I have to actually get the zip files and check them out, so I'm not going to publish that host yet. Eventually we will create a website that allows people to interact with the data directly and post changes, which will then be committed to Git as well. So I hope to use Git as a repository for the data. We just have to make the Git repos small enough that they can actually be managed easily. And there's a lot of work to do; just the processing of all this data is a lot of work.

So I'm basically posting this video to Hacker Public Radio, and if anyone's interested, they can contact me; see my details in the show notes. If you want to help with the code, any help is welcome. If you want to help host the data, that would be great. Ideally we would find a big machine that could actually host the processing of this data, because it's just so intensive.
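To make the idea of a wiki-like YAML record per election committee more concrete, here is a small Python sketch that writes one such record. The record layout is loosely modeled on the nested style of the congress-legislators files, but every field name and value below is a placeholder assumption, not a format the speaker or that project defines.

```python
import json

def yaml_lines(value, indent=0):
    """Yield YAML lines for nested dicts and lists.

    Scalars, and anything nested inside a list, are written with
    json.dumps, which produces valid YAML flow syntax. Keys are
    assumed to be plain identifiers that need no quoting.
    """
    pad = "  " * indent
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)) and item:
                yield f"{pad}{key}:"
                yield from yaml_lines(item, indent + 1)
            else:
                yield f"{pad}{key}: {json.dumps(item)}"
    elif isinstance(value, list):
        for item in value:
            yield f"{pad}- {json.dumps(item)}"
    else:
        yield f"{pad}{json.dumps(value)}"

# Hypothetical committee record; all names and IDs are placeholders.
committee = {
    "id": {"fec": "C00000001"},
    "name": "Example Committee",
    "address": {"city": "Topeka", "state": "KS"},
    "linked_candidates": ["H0KS00000"],
}

text = "\n".join(yaml_lines(committee)) + "\n"
print(text)
```

A plain-text record like this diffs cleanly line by line, which is what makes the "edit the file, post a commit, get it merged" workflow practical.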
Now I'm using Python, which is inefficient, and using YAML, which is inefficient. Sure, we could turn the data into a Postgres database, for example, which would be more efficient, or process the data using C++, and I've started on that. But again, if we create a Postgres database, then we need to host it. The idea of using YAML and Git is that people could just check out little bits and process them. If you really want the whole database, then it's going to be huge, and we'll need to have a host for that. So there are different ways to look at it, and I'm hoping it gains some interest here, and I'll see if I can find people to help out. So yeah, well, thanks for listening, and bye.

You have been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you have ever considered recording a podcast, then visit our website to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are proudly sponsored by Lunar Pages. From shared hosting to custom private clouds, go to LunarPages.com for all your hosting needs. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.