Episode: 2016
Title: HPR2016: Echoprint
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2016/hpr2016.mp3
Transcribed: 2025-10-18 13:20:59
---
This is HPR episode 2016 entitled Echoprint.
It is hosted by India and is about 13 minutes long.
The summary is: I share what I learned about the Echoprint music identification system.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
This is Leandere for Hacker Public Radio, recording Friday, April 25th, 2016.
I'm going to be talking today about the Echoprint music fingerprinting system.
Earlier this year, a message went out to the mailing list asking about possible ways of
identifying whether recordings uploaded to Hacker Public Radio had had the intro prepended
and outro appended or not, and if it was possible to automate this.
This got me thinking about how audio is identified, and specifically music, and I
remembered reading a while back about the Echo Nest and Echoprint projects.
So I threw that out there as a suggestion of something to look into, and wouldn't you
know it?
You suggest doing something and pretty soon you end up doing it yourself.
Now, I never was quite able to get good detection of the HPR intro and outro, but I did learn
a lot about the system and found it pretty interesting, so I thought I would record an
episode and share it in case someone else is interested.
So first of all, the Echoprint project has a website, echoprint.me, and there's
a pretty high-level overview there of how it works under the "How it works" link;
/how is the URL.
Now, when I first started, I assumed this was going to be like most other fingerprinting
systems: just sort of a hash that I could do direct comparisons on.
So I downloaded a few audio files, including the Hacker Public Radio intro and outro,
the FLAC versions, then I re-encoded those to Ogg with the idea that I could run the
fingerprinting over both and compare the two, and hopefully we'd get a match.
So I downloaded the code-generating part of Echoprint; they have the source code
for it on GitHub, at github.com/echonest/echoprint-codegen.
The code is MIT licensed, and I was able to compile it with no problems on my Raspberry
Pi, so I ran the fingerprinting on both audio files. It uses FFmpeg to do the audio decoding,
so there was no problem handling both formats.
I learned pretty quickly that it's not just a matter of using the code generator to get
hashes and then comparing them.
First of all, the output of the code generator was JSON, and that included what looked like
a base64 chunk, which was the hash itself. The two fingerprints didn't look the same at all,
and they weren't even the same length, so I needed to dig a little deeper.
So the first thing I did was try decoding the base64, which turned out to be
a URL-safe variant of base64, where the plus and slash are replaced by a hyphen and underscore.
I just got this big binary blob, and at that point I pretty much knew I was in over my
head, but I decided to keep going.
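As a rough sketch of that first step, here's how the decoding might look in
Python. The "code" field name matches what echoprint-codegen prints, though the
exact JSON shape here is an assumption; the re-padding is needed because the
encoder strips the trailing equals signs.

    import base64
    import json

    def decode_code_field(codegen_output):
        # codegen emits a JSON array of results; "code" holds the fingerprint
        # (assumed field name; check your codegen version's output)
        code = json.loads(codegen_output)[0]["code"]
        code += "=" * (-len(code) % 4)  # restore the stripped base64 padding
        # urlsafe_b64decode handles the - and _ substitutions for + and /
        return base64.urlsafe_b64decode(code)
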
So I did a little more searching around on their website,
found a mention of a white paper, and through a little googling was able to find it.
That gave a little more information about how the algorithm worked, and
why I wasn't able to just directly compare these things. So I'll give a link to the white paper
in the show notes. It's at www.ee.columbia.edu/~dpwe/pubs/EllisWP11-echoprint.pdf.
And it's co-authored by Daniel P. W. Ellis, who is with the Laboratory for the Recognition and
Organization of Speech and Audio at Columbia University, Brian Whitman from The Echo Nest,
and Alastair Porter, who is from the Centre for Interdisciplinary Research in Music Media
and Technology at McGill University. So I'll try to summarize for you what I learned about the
fingerprinting algorithm from the paper. You start with the audio; they do a little bit of
pre-processing to get everything in the same format, and then they chop the signal up into eight
frequency bands from zero hertz up to about five and a half kilohertz. I assume that's something
sort of like a Fourier transform, just to get the signal into those eight buckets.
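To make that concrete, here's a rough Python illustration of binning a spectrum
into eight bands up to about 5.5 kHz. This is not the paper's actual front end;
the 11025 Hz sample rate and equal band spacing are my assumptions for the sketch.

    import numpy as np

    def eight_bands(signal, sr=11025, n_fft=512, hop=256):
        # short-time FFT magnitudes, frame by frame
        frames = range(0, len(signal) - n_fft, hop)
        spec = np.abs(np.array([np.fft.rfft(signal[i:i + n_fft])
                                for i in frames]))
        freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)  # 0 Hz up to sr/2 = 5512.5 Hz
        edges = np.linspace(0, sr / 2, 9)         # eight equal-width bands
        # sum the FFT bins falling into each band, per frame
        return np.stack([spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                         for lo, hi in zip(edges[:-1], edges[1:])], axis=1)
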
Next, in each bucket, they go through and detect what they call onset events.
From what I could figure out, that's basically just places where the volume increases;
they're trying to find the beginnings of notes or beats.
They take those onset events in groups of four, and then for each pair from that group,
they generate a hash. So all possible pairs from a group of four,
keeping the pairs in order, gives six pairs, so six hashes.
And then those hashes are stored along with the time they occur.
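Here's a small sketch of that pairing step in Python. The hash function below
is just a stand-in (the real codegen hashes quantized time differences per band,
with a different hash), but it shows the six in-order pairs you get from each
group of four onsets.

    import hashlib
    from itertools import combinations

    def pair_hashes(onset_times, band):
        # onset_times: detected onsets for one frequency band, in order
        results = []
        for i in range(len(onset_times) - 3):
            group = onset_times[i:i + 4]           # four consecutive onsets
            for t1, t2 in combinations(group, 2):  # the six in-order pairs
                # stand-in hash over the band and the pair's time difference
                key = "{}:{}".format(band, t2 - t1).encode()
                code = int(hashlib.md5(key).hexdigest()[:5], 16)  # 5 hex digits
                results.append((code, group[0]))   # stored with a time offset
        return results
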
So at this point, I at least knew what the fingerprint represented; now I just needed to deal
with the format. What was very helpful for that was actually looking at the server code for the
Echoprint project. That's at github.com/echonest/echoprint-server,
and it is licensed under the Apache 2 license.
So looking through that code, I found that the
JSON payload that gets sent to the server was base64 encoded, and the binary blob that I was
getting after decoding that was actually gzip compressed, but with no gzip header.
So I took the samples I generated, prepended a gzip header, and decompressed them.
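If you're following along in Python, a simpler route is to let the zlib module
inflate the blob directly; the server code appears to use zlib.compress and
zlib.decompress, which wrap the same deflate data that gzip does, so no header
needs to be faked.

    import zlib

    def inflate_code(blob):
        # blob: the bytes from the base64 decoding step above
        return zlib.decompress(blob).decode("ascii")  # a long hex string
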
And what I found was essentially just a long string of hex digits.
In the early part, it was pretty obvious that values were getting repeated, and they were
five-digit hexadecimal numbers. Each five-digit number was repeated about six times
throughout the first part of the file. And then in the second half of the file, there
didn't seem to be any repetition at all; the numbers seemed much more random.
And the reason for that is that all of the time offsets
are listed first, and then all of the hashes are listed at the end of the file. The advantage
that gives is better compression: each offset is repeated six times,
once for each of the six pairs in its group. And since those are all together,
gzip is able to find those repetitions and compress them very well.
So I needed to take the file, split it in half, split the first half into five-digit
time offsets and the second half into five-digit hashes, and then pair those up.
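In Python, that unpacking might look something like this, assuming (as described
above) that the first half holds the offsets and the second half holds the
hashes in matching order:

    def split_codes(hex_string):
        # first half: 5-hex-digit time offsets; second half: matching hashes
        half = len(hex_string) // 2
        offsets = [int(hex_string[i:i + 5], 16) for i in range(0, half, 5)]
        hashes = [int(hex_string[i:i + 5], 16)
                  for i in range(half, len(hex_string), 5)]
        return list(zip(hashes, offsets))  # (hash, time offset) pairs
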
The next thing the server does to compare two fingerprints is that it takes all of the hashes
from sample A and matches them up against hashes in sample B wherever they're equal.
Then it calculates the difference in time offset between each pair of matching hashes,
and does a histogram; that is, it counts how many times each time distance between two hashes
occurs. Then it takes the most common time distance and uses its count, the number of
occurrences of that time distance, as a score. And that essentially allows finding two matching
pieces of music that are just shifted by a time offset. So rather than looking for occurrences
at exact times, they just check to make sure that the time difference between
two hashes is pretty constant throughout the sample.
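Here's a condensed sketch of that scoring idea, my own simplification rather
than the server's exact code, using the (hash, offset) pairs from the previous
step:

    from collections import Counter, defaultdict

    def match_score(codes_a, codes_b):
        # index sample B's time offsets by hash value
        by_hash = defaultdict(list)
        for h, t in codes_b:
            by_hash[h].append(t)
        # histogram of offset differences over all matching hashes
        diffs = Counter(tb - ta
                        for ha, ta in codes_a
                        for tb in by_hash.get(ha, ()))
        # the count of the most common difference is the score
        return max(diffs.values()) if diffs else 0
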
So I wrote some conversion scripts to go from the JSON
through the base64, prepend the gzip header, decompress, and split the text up into pairs of
five-digit hex numbers, and then wrote an awk script to try to score the difference between the
two files, essentially trying to duplicate what the server was doing. I never did quite get it
working well enough to be able to consistently identify the HPR intro and outro.
And I think in the process I learned a major reason for that: this kind of fuzzy matching
works pretty well for audio identification. Say you have a large
database of pre-computed fingerprints, like the Echo Nest project does; they use, I think,
the Million Song Dataset. It's pretty easy to say which of those songs a new sample matches
most closely, and that gives very good results. But it's harder to say, given any two samples,
whether they are the same song. It's pretty easy to say no, and false negatives are pretty rare,
but it's rather hard to say yes; you know, how close a match is close enough?
So I didn't really get what I was after, but I did learn quite a bit, and hopefully it was of
interest to you as well. Tune in tomorrow for another exciting episode of Hacker Public Radio.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast
network that releases shows every weekday Monday through Friday. Today's show, like all our shows,
was contributed by an HPR listener like yourself. If you ever thought of recording a podcast,
then click on our contribute link to find out how easy it really is. Hacker Public Radio was
founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary
revolution at binrev.com. If you have comments on today's show, please email the host directly,
leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated,
today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.