hpr-knowledge-base/hpr_transcripts/hpr3253.txt

Episode: 3253
Title: HPR3253: Pandas Intro
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3253/hpr3253.mp3
Transcribed: 2025-10-24 19:45:01

---

This is Haka Public Radio Episode 3253 for Wednesday 20th of January 2021, today's show is entitled Pandan Intro.
It is hosted by Anigma and is about 21 minutes long and carries a clean flag.
The summer is, Anigma introduces one of his favorite iPhone modules Pandan.
This episode of HPR is brought to you by archive.org.
Support universal access to all knowledge by heading over to archive.org forward slash donate.
Welcome to another episode of HPR. I'm your host Nigma and today we're going to be talking about one of my favorite Python modules Pandas, as well as NumPy and maybe PioDBC to some extent because they have some interrelation to the module that I'm using.
This will be my first episode in a series that I'm naming for the love of Python and what we're going to be talking about in this series is some data analytics, data science, as well as just some good old fashioned Python that I enjoy.
The goal is to give you some basic tutorials as well as to give you some real-world examples and some code that I use on a day-to-day basis, as well as get some community feedback
because I will have a GitHub repository with all of this code available that if you want to suggest check out the branch or create a branch off of my code and give your suggestions, I wouldn't mind it.
So when we're talking about Pandas, the first thing we have to do is install the module. So we're going to do a PIP-3 install Pandas, this will also install NumPy by the way if you don't have it already installed.
You can also do this through your code editor of choice, whether that be PyCharm or whatever you're using. Typically they have a graphical user interface that you can install packages with.
So Pandas is all about data frames and first we're going to define what a data frame is. A data frame is a two-dimensional data structure, it's aligned basically data that is aligned in tabular fashion in rows and columns.
So think of it as a spreadsheet type object in memory and you can do a lot of fancy things with this so you can do filters, you can do group eyes, basically anything you can do in SQL.
You can do in Pandas in memory. So this is great for spreadsheet manipulation or any smaller task that you have a few hundred rows or a few thousand rows that you're trying to churn over.
As you get bigger data sets, obviously this is going to take a little more memory. There's some things you can do for larger data sets and we'll talk about those in future episodes.
So today we're going to talk about importing data, accessing that data, setting column types and some basic filtering and manipulation of that data.
So the first thing we'll talk about is importing that data and you can do it from a lot of different sources.
CSV, Excel, SQL, JSON, there's some other more advanced topics and we're going to talk about the first three, the CSV, Excel and SQL, since that's typically what folks use the most.
JSON is a little bit more complex since it's, you know, you could have embedded objects. You can also do data frames from lists and also from dictionaries.
Not going to talk about those today but if you're interested, let me know in the comments and I might do a topic about those more advanced topics.
So the syntax for this is you basically create a variable, my variable for all these examples is going to be df for data frame.
And then you're going to do for all of them, you're going to do pd. and then whatever type you're going to be reading.
So whether that's read CSV, read Excel or read SQL and there's an underscore between the read part and then the object part, so CSV, underscore CSV, underscore Excel, underscore SQL.
For the two types that are file based, you're your next, you're going to put it in parentheses, you're going to put the file name in quotations, you're going to then specify for the CSV, you're going to set specified the limiter if you don't, this is an optional component.
If you don't, it's going to assume that it's a comma delimited file.
And then you also have an optional where you can skip rows. This comes in handy if you have some rows ahead.
If it's, if you're reading in a file that has additional rows at the top of the file or it has a header that you really don't care about and you're going to define your own header, you can do that.
For the read Excel component, it's the same thing in parentheses, it's the, the Excel file name and then comma, the sheet name, so you can actually, there's, there's a trick where if you don't specify sheet name and you put sheet name equal none.
It will read in all of the sheets in succession and put them in a dictionary and then you can loop through that essentially dictionary of data frames to get to a, the one you want or you can combine them all and that's a little bit more advanced topic.
If anyone's interested, leave me a comment and I'll talk about that in more detail.
The next option would be the read SQL and this is a little bit different syntax, so you have in the parentheses, you have your query, comma, your connection and I'm going to leave an example of this in the show notes.
Basically, the query is just a string variable that has your select statement in it and then the connection varies by odbc connection.
I'm going to leave a SQL server example in the show notes so you can take a look at that and then if anyone has any trouble connecting to another source, leave me a comment or catch me on IRC and we can, we can work that out.
So the next thing I want to do, once you've gotten your file red in or your SQL connection executed, then how do you access that data?
So there's a couple of things you can do from a quick visual look at your data.
So you can use the head and tail functions, so it'll be your data frame name dot either head or tail and this will either print out the first five rows or first and throws because within the parentheses of the function, you can specify the number of rows you're printing out.
So for my example, it's five, you're printing out all columns, but either the first five rows at the top or the last five rows at the bottom of the file.
The next thing, the next function is df shape or dot shape and this will give you the number of rows and columns in your data frame.
You can also set this to a variable. So if you wanted to, you would know the number of rows and columns if you want to, if you had a need to know that.
So there's a use case that I do where I basically spit out a file and then I read the file back in with a different Excel based Python package to do some conditional formatting.
And I may leave, probably do another show about that because that's a lot more complicated and uses a couple of different modules to do that.
So shape is a handy one. If you want to visually look at what number of rows and columns you have, typically I'm going to use shape and the next one, which is dot columns to kind of get a visual feel for what my data frame looks like.
So dot columns, just print your column names and there's, I've had some trouble with column names in the past, I've had to redefine some column names because of trailing or leading spaces.
And you can do that with a strip and probably not going to talk about that a whole lot this episode.
So if you want to do a singular column, you can either refer to that column by name or by position. So I've got two examples of that and basically how you do that is you do the data frame and then brackets either positionally or within the brackets put the column name.
You can redefine your your data frame or set a new data frame by doing the data frame equal and then either the same data frame or another data frame and you can essentially chop columns off of the data frame.
So if I wanted to look at the first 26 columns, if I had 100 columns, take the first 26, I could do that by putting it in brackets with a semicolon in between so 0 through 26, this would give me 26 columns.
I can also do it by column name and the syntax is a little wonky on this. So I'll leave this in the show notes, but it's the data frame and then you've got to do a sub select of the data frame. So it's, you know, the data frame in brackets, the data frame again and brackets, then the columns associated that you want to keep.
A little confusing, I'll leave that one in the show notes. So the next thing I wanted to talk about was setting the column type. So the three main types of columns I'm using are int strings and basically date time. There's also a float component if you need decimal points.
But the syntax is the same for int string and float. So it would be the field name. So df bracket the field name in quotes dot as type and then the type you're setting. So int string float.
For the date time, there's actually a function for that. So it's PD dot to underscore date time and then within, within parentheses, the data, the data frame, and then you've got in that your brackets, the column name.
This would set a date time function and this is important when you're doing filtering to know what type of column it is. So you can do different operations based on the column type.
So and it will give you an error if it's the wrong type. So if it's an int and you're trying to do a string comparison, obviously, that's not going to work. So you need to know what type of column it is.
So some basic filtering. So for string columns, you have basically anything that Python does by nature. So your title case, your upper or your lower, you can also do a split by string.
So that syntax will be in the show notes, but it's that's str dot split and then in parentheses, your, your character, you're splitting on or characters. And then there's some optional things that you can do where we're basically you set the number of instances that you want to split on.
So if you have a big string that contains multiple dots and you split by the dot, you could just take the first one or take the first three or so and and so forth.
So in my example, I'm splitting email address by the at symbol, something that I do a lot with my contacts when I'm inspecting the user name versus the main name and so and so forth.
So I've got an example of that where I'm where I'm taking and splitting by the at and then setting to new columns and the data frame to the user and the domain.
So the thing that I do a lot is I'll take like the first three characters in a in a particular field and set it to a new column and then drop the the bigger column.
You can do that for things like if you want to inspect only the first three characters of the last three characters, that's handy.
And that's basically, you know, setting either the new column or the existing column and then using the dot str and then in brackets, colon three would give you the first three positions in the field that you're referencing.
That is again, if it's a string column type, so if you're doing something like an inspecting an integer column type, then you can do your less than greater than equal and not equal.
And I've got an example of a data frame where I'm only keeping rows that the field name doesn't contain a zero.
You can also fill a particular column if it's got a bunch of null values. Now what pandas will do is it'll convert the null values to NAND values.
So if you see, if you print out a bunch of columns and you see NAN in your display, that's basically just saying it's a null value.
The next thing I wanted to talk about was a list example where we're basically I'm taking one column, I'm making sure that column has values in it.
So I'm filling the nulls with just basically pound pound symbols, hashtags, whatever you want to call it.
And then I'm pushing that to a list and then I'm using that list filter another another data frame.
So in a real world example, what I use that for is let's say I have a bunch of email addresses that I don't want to include in a list.
I'm going to consume that in, so if I have it basically as an Excel document, I'll basically consume that as a data frame, push it to a list, and then read that list in and filter my other data frame that contains the values I want to ship out in a marketing campaign or whatever.
And I'm either going to include those or exclude those and I've got examples for both.
Moving on to loops, so I've got an example in the show notes where I'm looping through a data frame and I'm doing some type of filter.
So this would be a greater than, so it would be a new column and I'm essentially saying a particular column is greater than my filter and that's just a variable that I set above.
So you can do that where if you have like a year component and you're you're saying it must be greater than let's say 2009 or 2010, whatever, you could use that as that as a component if you wanted to do it as a loop.
Now when you do that and you're doing a data append, which I've got in the the show notes, you're basically creating a Python or a panda series and then you got to convert that back as a data frame at the end of your loop once you've appended all the rows.
So there's more elegant ways to do this, but I wanted to show an example or real world example where you could essentially do a for loop and then have an if statement in that loop append the rows and then bring it back as a as a data frame.
And this is more of a Python way to do it. There's easier ways to do it in pandas.
Next, I wanted to talk about conditional ways to set a particular column. So this uses numpy. So we'll talk about numpy for just a sec. So essentially what I'm doing in this example is I've got conditions that are in a list.
So and this is showing essentially the column a particular column is equal to a condition and then the list is obviously separated by commas and you could you could print that list out.
And I show essentially you can use one panda's column or multiple panda's column using an ensemble and essentially here I'm building the conditions and then I build another list of choices and then I'm doing a numpy select dot select to say
if I have this condition set this choice otherwise set a default value. So if I don't need any of those conditions, I would basically set my default.
Now this comes in handy when you're when you're inspecting multiple columns in pandas. It's easy to do a filter on one column, but I haven't found a cleaner way to do multi columns.
And if there's a better way, feel free to leave me a comment or shoot me an email or come talk to me on IRC, but there's this is a clean way that I've found to do multi column conditions and kind of combined multiple things.
So so if you think of it in a SQL context, it would be multiple basically and statement within a SQL select statement or where statement.
So at this point, I think I've talked about all of the high level things I wanted to talk about in this episode.
So as I said, we're going to have multiple shows related to this pandas module as well as I'm showing a real world example of a little context scrubber I built.
And I think that's going to be our next episode. So until then, this has been an episode of HPR.
Appreciate everybody's time. Feel free to reach out to me on freeno.net, IRC dot freeno.net, either in the hashtag hacker exchange channel or hashtag augcast planet channel.
My email is ETH zero enigma at gmail.com. Otherwise, I'll talk to you guys in the next episode.
You've been listening to Hacker Public Radio at Hacker Public Radio dot org.
We are a community podcast network that releases shows every weekday Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contributing to find out how easy it really is.
Hacker Public Radio was founded by the digital dog pound and the infonomicon computer club.
And it's part of the binary revolution at binwreff.com.
If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself.
Unless otherwise status, today's show is released on the creative comments, attribution, share a life, 3.0 license.
www.hacker.com