Episode: 3328
Title: HPR3328: Pandas Part 2
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3328/hpr3328.mp3
Transcribed: 2025-10-24 20:52:28
---
This is Hacker Public Radio Episode 3328 for Wednesday, the 5th of May 2021.
Today's show is entitled Pandas Part 2. It is part of the series A Little Bit
of Python, it is hosted by Enigma, it is about 12 minutes long, and it carries a clean flag.
The summary is: Enigma continues his discussion about his favorite Python module, Pandas.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
It's Wednesday, and you know what that means.
I am your host Enigma, and this is another episode of Hacker Public Radio.
Today I'm going to be talking about Pandas Part 2, and this is going to be part of a series
that I'm kind of renaming on the fly here. It used to be For the Love of Python;
I'm going to say it's For the Love of Data. I'm planning on doing more data science,
data analytics type things in this particular series, and I want it to be all-encompassing
and not tied to a particular language, because we might do some SQL, we might do some
other things as part of the series. So I did an intro to Pandas back in January of this year, 2021,
and this is a follow-up. For those that didn't listen to the first episode, it's 3253, I believe.
I'll leave a link to it in the show notes, but Pandas is a Python module that basically allows you
to create two-dimensional data structures in memory, and it allows you to do some manipulation
and any type of data cleaning or data wrangling that you want to do. You can write to an
output file, you can write to a database, you can do a lot of cool things with Pandas, and I use it
every day. So I wanted to talk about a couple more topics. We're going to talk about
another way to apply a conditional field. We're going to create a data frame from a dictionary.
We're going to append a data frame to another data frame, so basically concatenating two
data frames with the same column names together, so we can get into more advanced topics at another time;
I'm just giving you a kind of high-level basics here. We're going to talk about joining data frames with
merges and joins; they're different, I use one more than the other, and we'll talk about that.
And then we're going to write an output file using to_csv, and this is a one-liner; we'll briefly cover
that at the end of the show. So first I wanted to talk about the ways to apply a condition to a field
based on other values in the data frame. I talked in my last show about using numpy select
for this, and you can go back and review that and review the code. I'll also have a working example
in these show notes so you can compare the differences. This approach is defining a function and then applying
that function to the data frame. So say we were going to hypothetically create a data frame that had
an integer column with values one through, let's say, twenty, and we wanted to create basically a good/bad
flag, or a true/false flag, in the data set based on the values in that column. We could
do that using a function. Basically, you would define your function name
and pass in the row, and then you would do an if statement: let's say
if the score was greater than 10, return good; else (or an elif for less than 10, but you
could just do an else) return bad. Then, outside of the function, you
could say df, let's say status, and you would put that in brackets, equals
df.apply, and in the parentheses your function, comma, axis=1, and then you end your parentheses.
This would create a status field that would be good or bad based on the data in the
other column. I like this approach a little bit more than the numpy select, only because it looks
cleaner. If someone has a plus/minus on it, if they're using both and they have found pros and cons to
this approach, I'd love to hear from you. Shoot me an email, leave me a comment, or get with me on Twitter.
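The define-and-apply approach described above can be sketched like this (the column names and the one-to-twenty range are just the hypothetical example from the episode):

```python
import pandas as pd

# Hypothetical data frame: an integer column with values 1 through 20
df = pd.DataFrame({"score": range(1, 21)})

# Define a function that receives one row and returns a flag
def flag_score(row):
    if row["score"] > 10:
        return "good"
    else:
        return "bad"

# axis=1 applies the function row by row, creating the status field
df["status"] = df.apply(flag_score, axis=1)
```

Compared with numpy.select, the condition reads like plain Python, which is why it can look cleaner.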
I'll have all that in the show notes. So the next thing I wanted to talk about is
creating a data frame from a dictionary, and this is pretty easy. As long as you keep the
dictionary labels and the data frame labels the same, it's basically a one-line statement. You're
going to create your data frame, let's say df2 in this case, equal to pd.DataFrame, remembering
to capitalize the D and the F (I've tripped up on that many times), and then in the
parentheses your dictionary name, so my_dictionary, and then end your parentheses, obviously.
This is going to create a data frame based on the dictionary. Pretty easy.
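A minimal sketch of that one-liner (my_dictionary and its contents are invented for illustration):

```python
import pandas as pd

# The dictionary keys become the column labels
my_dictionary = {"name": ["anvil", "press"], "score": [12, 3]}

# Note the capital D and capital F in pd.DataFrame
df2 = pd.DataFrame(my_dictionary)
```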
The next thing is merges and joins. There are two approaches to joining two data frames together,
and this would be basically like a SQL join, for those who are more familiar with SQL.
If you use .join, so basically df = df.join and then, in parentheses,
your other data frame, you're going to be joining those two objects based on their indexes,
and this is assuming they have similar indexes. The .join is, I believe, the first
function that they introduced, and merge was basically a replacement for it.
I don't know that for sure, but I use merge way more than I use join, and if someone has a
good use case for join, reach out to me; I'd love to know, because I use merge way more than
I'll ever use join.
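Here is a quick sketch of the index-based .join (the frames, their columns, and their shared index are made up for illustration):

```python
import pandas as pd

# Two small frames with matching indexes
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"b": [3, 4]}, index=["x", "y"])

# .join matches rows on the index, not on a column
joined = left.join(right)
```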
So merge gives you the ability to control how you merge the two data frames
together. You can do an inner join, a left join, or a right join. What that means is
basically: if you're doing a left join, whatever data frame you define first, so whatever is in front,
so it would be df.merge and then, let's say we were doing df2 as the merge item,
the df would be the left portion of the join. You're essentially keeping everything in
the first data frame irrespective of the second data frame. So if you're creating, like, a df3 for
example, you would get everything in the first data frame, joined to any matching elements
in the second data frame. If you do an inner, you basically get the cross-section of both,
so the rows have to exist in both, based on the columns you define. If you do a right, obviously, it
would be whatever the second element is, on the right of the join. So this has the power of giving
cross-sections of data frames, and I use it a lot when I'm trying to compare two
datasets, or I'm trying to append to a dataset based on another dataset. As a real-life
example, I work for a heavy equipment manufacturer, and we were appending serial numbers
based on an equipment number, an internal equipment number, or prices based on an equipment number,
something like that, where your datasets might not be completely aligned. So I do the left
join to see the differences, or I'll do an inner to see what matches.
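The left-versus-inner behavior can be sketched with invented equipment data along the lines of the example above (all names and values are made up):

```python
import pandas as pd

# Invented sample data: prices and serial numbers keyed by equipment number
df = pd.DataFrame({"equipment_number": [1, 2, 3], "price": [100, 200, 300]})
df2 = pd.DataFrame({"equipment_number": [2, 3, 4], "serial": ["B2", "C3", "D4"]})

# Left join: every row of df survives; unmatched rows get NaN for serial
df3_left = df.merge(df2, how="left", on="equipment_number")

# Inner join: only equipment numbers present in both frames survive
df3_inner = df.merge(df2, how="inner", on="equipment_number")
```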
You can also have different column names. By default, if you're just doing it and you pass no elements, it assumes
that the column names are the same. For the other scenario, and I'll leave a note in the show notes
for this, if you do left_on=, you can define the column name, and you put that in
brackets just like you define any other column in pandas; you can define what column you're joining
on, and then the same way with right_on. There are two elements you can define
there, so that's a little bit more complicated, and I'll leave a detailed explanation in the show
notes for that.
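A sketch of the different-column-names case, with left_on and right_on in brackets as described (the column names are invented for illustration):

```python
import pandas as pd

# The key column is named differently in each frame
df = pd.DataFrame({"equipment_number": [1, 2], "price": [100, 200]})
df2 = pd.DataFrame({"equip_no": [2, 3], "serial": ["B2", "C3"]})

# left_on names the key column in df, right_on the key column in df2
merged = df.merge(df2, how="left",
                  left_on=["equipment_number"], right_on=["equip_no"])
```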
Append is basically another one-liner, pretty much, as long as your data frames line
up from a field-by-field perspective. In other words, if you're reading in two files that have
the exact same columns, it's pretty easy and pretty straightforward. In this instance it would be
df = df.append, and then you pass in the second data frame, so df2. Pretty straightforward.
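One caveat: DataFrame.append worked exactly as described at the time, but it was deprecated and then removed in pandas 2.0, so on current versions the equivalent one-liner is pd.concat:

```python
import pandas as pd

# Two frames with the exact same columns
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "b": [6]})

# df = df.append(df2) on older pandas; pd.concat is the current spelling
df = pd.concat([df, df2], ignore_index=True)
```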
The last thing I'm going to talk about is basically writing to an output file. There are
multiple ways you can write output in pandas, but I'm just going to cover a simple one:
this one is .to_csv, so to underscore csv, and this requires no parameters. This is
pretty much what I do when I'm just exploring data and I want to look at it in Excel, or I want
to look at the output file; I pretty much just do a df.to_csv just to
get me an output. Pretty simple. And I'm bad at just naming my output output.csv, and then
if I have it open, it'll error, you know, whatever.
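A sketch of the to_csv one-liner: with no arguments it returns the CSV as a string, and with a filename it writes the file (output.csv being the throwaway name from the episode):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# No arguments: returns the CSV text instead of writing a file
csv_text = df.to_csv()

# With a filename: writes the file for a quick look in Excel
df.to_csv("output.csv")
```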
So, long and short, this was a pretty short episode. I wanted to do another follow-up with pandas,
and there will be a third, at least one more, in this pandas series, or pandas sub-series of my
For the Love of Data. We'll be talking about groupbys, and groupby gives you powerful Excel-like
pivot capabilities within pandas, so stay tuned for that. As always, I'll leave a detailed explanation
in the show notes for the purposes of you following along with me rambling about pandas, and all my
contact information will be in the show notes as well. Otherwise, have a great day guys, and take care of yourselves.
You've been listening to Hacker Public Radio at hackerpublicradio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever
thought of recording a podcast, then click on our Contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club,
and it's part of the binary revolution at binrev.com. If you have comments on today's show,
please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike
3.0 license.