Episode: 2906
Title: HPR2906: Feature Engineering for Data-Driven Decision Making
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2906/hpr2906.mp3
Transcribed: 2025-10-24 13:02:15

---

This is HPR episode 2906 entitled Feature Engineering for Data-Driven Decision Making.
It is hosted by b-yeezi, is about 17 minutes long, and carries a clean flag. The summary is:
In this episode, I explain feature engineering and how it can be used to make decisions.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello Hacker Public Radio fans, this is b-yeezi once again with another episode, this time talking about a topic that's very important to me and the work that I do, and it's something that I've been promising to talk about on HPR for almost the entire year.
It has to do with a talk that I've been giving at a couple of different conferences regarding data-driven process development. Becoming a data-driven company or a data-driven person requires creating processes that use data as the backbone of your thought process.
And I was trying to figure out what part of the talk would be of interest to the Hacker Public Radio audience, and I think I found it.
It has to do with a topic called feature engineering. Feature engineering in data science or machine learning is, as Wikipedia defines it, the process of using domain knowledge to create features that make machine learning algorithms work.
And in particular I think of it as a way of hacking data to turn it into information.
Hacking in the sense that we are all used to, which is taking something that has an original use and repurposing it to do something else.
And if you've ever had to do any type of data analysis... I'm not going to go into machine learning, nor did I go into machine learning in the talks that I've been giving, because it wasn't appropriate for the audience at the time.
They would have probably fallen asleep or just simply walked out if I started going on and on about the ins and outs of support vector machines or something like that.
So instead I focus on things that I think are more universal, things that are applicable outside of just the context of, you know, oncology research and medical diagnostics.
And so a feature, if you think about it, is a piece of information that you use in a machine learning context as an input to help you predict an output.
In more of a business intelligence or data analysis sense, it is information that can help you make a decision.
And what's important in my mind for data-driven process development and data-driven decision making is that, one, you use measurable outcomes as your input, and two, you actually have in your processes, you know, thresholds: lines in the sand that, if you cross them, you make a decision one way or another.
And when you have that, it doesn't matter which person is making the decision; you're making a consistent decision over time.
And if you want to monitor that threshold and say, well, let's look at the last 85 times we made the decision, how many times did we find out that it was the wrong decision? You'll have data to back up any modifications to that threshold.
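
A minimal sketch of that kind of threshold review, assuming every past decision was logged together with whether it later turned out to be wrong. The `DecisionRecord` structure, its field names, and `wrong_decision_rate` are hypothetical names made up for illustration, and the log entries are invented.

```python
from dataclasses import dataclass


@dataclass
class DecisionRecord:
    """One logged decision (hypothetical structure, not from the talk)."""
    threshold: float  # threshold in force when the decision was made
    value: float      # measured value that was compared to the threshold
    was_wrong: bool   # filled in later, once the outcome is known


def wrong_decision_rate(log, last_n):
    """Fraction of the most recent `last_n` decisions that turned out wrong."""
    if last_n <= 0:
        return 0.0
    recent = log[-last_n:]
    if not recent:
        return 0.0
    return sum(r.was_wrong for r in recent) / len(recent)


# Invented log: four decisions, one of which turned out to be wrong.
log = [
    DecisionRecord(threshold=10, value=12, was_wrong=False),
    DecisionRecord(threshold=10, value=15, was_wrong=False),
    DecisionRecord(threshold=10, value=11, was_wrong=True),
    DecisionRecord(threshold=10, value=30, was_wrong=False),
]
print(wrong_decision_rate(log, last_n=85))
```

If that rate creeps up over time, the log itself is the evidence for moving the threshold.
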
So that's something that's really important that I want to make sure the audience here understands, and I made sure that it was really apparent in the talk.
And in the talk I gave, I did a live coding demo with a couple of different examples, but the one I'm going to focus on here has to do with something that I think most people can understand.
And that's the idea of, say, you are a doctor and you have patients, obviously, and you order a certain test for your patients when they have a certain disease or when you think that they might have a certain disease.
And you choose to use, I don't know, my laboratory to send your patients to, to get the test done.
Well, at my laboratory I have really poor laboratory information systems, and the only data that I capture, because I'm not being compliant with regulations in this hypothetical, the only data I'm gathering is the name of the ordering client. Or maybe, more likely, as a data analyst, they don't want to give me more data than I should have, because it might not be compliant with privacy and HIPAA compliance matters.
The only data that I have is a line for every time a client ordered any test, and all I have is the client's name and the date that they ordered.
Now obviously I can do some basic things, like group by the client and the date, or group by the client and the week, and see how many orders each client has sent every week.
All right, so that's first-order, you know, data aggregation, and that can tell you, one, who are your most active clients, you know, the people who send you the most volume, and it will tell you how that volume changes over time. You can put it in a graph, you can put each client in a different color, and you can watch them go up and down as time goes on, every week or every month or however else you want to monitor it.
But in this case let's just talk about every week.
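
A minimal sketch of that group-by-client-and-week step, assuming the order log is just a list of (client name, order date) pairs. The client names, dates, and the helper names `week_start` and `weekly_order_counts` are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta

# Invented order log: one (client, order_date) row per order.
orders = [
    ("Clinic A", date(2019, 9, 2)),
    ("Clinic A", date(2019, 9, 4)),
    ("Clinic A", date(2019, 9, 10)),
    ("Clinic B", date(2019, 9, 3)),
]


def week_start(d):
    """Monday of the week that contains d."""
    return d - timedelta(days=d.weekday())


def weekly_order_counts(rows):
    """Count orders per (client, week), keyed by the Monday of each week."""
    return Counter((client, week_start(d)) for client, d in rows)


weekly = weekly_order_counts(orders)
for (client, week), count in sorted(weekly.items()):
    print(client, week.isoformat(), count)
```

The same counts are what a spreadsheet pivot table on client and week would give.
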
All right, so one thing that happens sometimes is that, for whatever reason, a client or several clients may no longer be happy with your services, and they don't call you up and say, hey, I hate what your test does, or I hate you, it's not effective, I'm going somewhere else. What they do instead is they just leave.
They just stop sending you patients for tests.
Now if you have hundreds of clients, how do you manage that? How do you know when someone has just silently left? And you can think about this in another context, you know, any type of business scenario where you have continuous orders and all of a sudden your volume decreases.
Well, as an organization or as an individual you have limited resources, so you can't go chasing down every single client. Maybe you're small enough that you can, but if you're not, you have to focus on the ones that mean the most to your business.
And in this case let's say the ones that mean the most to our business are the ones that have the highest volume. That's usually the case but not always; sometimes there could be other strategic reasons why certain clients might be important.
But in this case let's think about it in the context of, well, who's sending me the most, who's giving me the most money.
Well, just using those two data points we've already done volume, and so we can say, well, let's focus on our top, I don't know, hundred or twenty clients, and those are the people that we're going to go after, because we have five salespeople, or five hours in my day that I can go reach out.
There's a lot of people, and I can spend an hour or a couple of minutes on each one, and those are the people we're going to focus on. Well, how do I know which of those are an issue?
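
A minimal sketch of picking the highest-volume clients, again assuming a list of (client name, order date) pairs. The function name `top_clients` and the sample data are hypothetical; dates are left out of the sample rows because only the client name matters here.

```python
from collections import Counter


def top_clients(rows, n):
    """Return the n clients with the most orders, highest volume first."""
    volume = Counter(client for client, _ in rows)
    return [client for client, _ in volume.most_common(n)]


# Invented data: the second element of each row (the date) is unused here.
sample = [
    ("Clinic A", None), ("Clinic A", None), ("Clinic A", None),
    ("Clinic B", None),
    ("Clinic C", None), ("Clinic C", None),
]
print(top_clients(sample, 2))
```
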
Well, a couple of data points that I would want to see, you know, this is the domain knowledge part: I would want to know when they first ordered any sample, the last time they ordered, and how many days it has been since the last time they ordered, along with that volume. Having all those other pieces of information, I can learn a lot more.
So if the first time that they ordered was, you know, three or four years ago, I know that this is a client that has been with the organization for a long time and, you know, there's probably something that happened that made them lose confidence.
If it's someone that's new, maybe, you know, they sent a bunch of stuff right at the beginning because they had a bunch of patients with that illness, maybe there's an outbreak in their area or whatever, and then they stopped. And so looking at these different pieces of data we can put different thresholds down and we can make a little if-then statement. But for simplicity's sake I'm going to make an if-then statement that just says: if it has been more than ten days since the last time they ordered, and their volume is more than 50 samples a week, then we will put them on the list of people to reach out to.
And you do that, you know, and that's the decision that we're going to make. And so just from having those first two items, I'm using a concept called, you know, just doing a date difference: putting all of them in order and then doing a date difference between any two lines that come from the same client.
And that's it. Just by doing that we have all this more information. We can do a minimum date to find the first order, you can do a maximum date to find the last order, and that's, you know, really something you can do in any programming language, even in most spreadsheet software.
And now we've brought that list of thousands of clients down to a manageable number, and if we don't like that number over time we can set the threshold at a different place, or we can say we will use the volume and the days since last ordered and also the first ordered, if it's been more than, you know, six months since the first order. Having this data means that you can put new thresholds in place.
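
A minimal sketch of those derived features and the if-then rule, assuming the same (client name, order date) log. The ten-day and 50-per-week thresholds come from the episode; everything else is an assumption made for illustration, including the function names, the sample data, and the choice to compute weekly volume as total orders divided by the weeks between first and last order.

```python
from datetime import date, timedelta


def order_gaps(dates):
    """Days between consecutive orders, after putting the dates in order."""
    ordered = sorted(dates)
    return [(b - a).days for a, b in zip(ordered, ordered[1:])]


def client_features(rows, today):
    """Per client: first order, last order, days since last order, weekly volume."""
    dates_by_client = {}
    for client, d in rows:
        dates_by_client.setdefault(client, []).append(d)
    features = {}
    for client, dates in dates_by_client.items():
        first, last = min(dates), max(dates)
        # Assumption: volume is averaged over the active span, at least one week.
        weeks_active = max((last - first).days / 7, 1)
        features[client] = {
            "first_order": first,
            "last_order": last,
            "days_since_last": (today - last).days,
            "orders_per_week": len(dates) / weeks_active,
        }
    return features


def needs_outreach(f, max_days=10, min_weekly_volume=50):
    """Quiet for more than max_days and normally above min_weekly_volume."""
    return f["days_since_last"] > max_days and f["orders_per_week"] > min_weekly_volume


# Invented data: Clinic A sent 10 orders a day for 14 days, then went silent.
orders = [("Clinic A", date(2019, 8, 1) + timedelta(days=i))
          for i in range(14) for _ in range(10)]
orders += [("Clinic B", date(2019, 8, 29)), ("Clinic B", date(2019, 8, 30))]
orders += [("Clinic C", date(2019, 8, 1)), ("Clinic C", date(2019, 8, 5))]

features = client_features(orders, today=date(2019, 9, 1))
flagged = [c for c, f in features.items() if needs_outreach(f)]
print(flagged)
```

Moving either threshold is a one-argument change, which is what makes it practical to revisit the rule later.
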
Well, sometimes you might not want to wait until the people have completely left; you might want to do something before then.
A lot of times what happens in my industry, and I think it happens a lot, is that people don't just leave you cold turkey.
They will maybe start to send some of their patients to a competitor and some to you, and kind of wait, and start to slowly trickle over to that new service.
And there are many reasons why they do that. One is because they don't want to jeopardize all their patients all at once, so if that new person is really not good, they're, you know, not jeopardizing more than their lowest-risk patients at once. But as a person who wants to maintain their business, I want to be able to see if the volume is starting to decrease.
And a lot of times you get into a situation, I've been in a lot of work environments where all they talk about is, well, it's decreasing. Well, what does decreasing mean?
You know, decreasing to the salesperson means this, to the CEO means that, to the guys doing the work every day means something else.
And so having a concrete definition of what decreasing means is the first step. In this case I chose decreasing meaning: if for two consecutive weeks the volume has gone down, I consider that decreasing.
And so what did I say, two, or did I say three? One, two, three. I said three: three consecutive weeks of volume going down, we consider that decreasing.
So now we have a concrete number, and we have a process developed around that number that says three weeks in a row means we do something. And using those same two data elements that we had at the beginning, we can do a first-order lag. A lag is the difference between the most recent number and the one before. Then you can do a second-order lag, which is, you know, going back the next week, and then you can do a third-order lag, which is going back three weeks. You can look at the difference in the volume of three weeks, two weeks, one week, and now.
And you can make an if-then statement that says if that is less than that, and less than that, and less than that, then flag this client. You know, you can understand that logic in simple language, and you can see how it could easily be written in a programming language or even in Excel.
Or, what I like to use, a LibreOffice sheet and sometimes Google Sheets.
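
A minimal sketch of that three-weeks-in-a-row check, assuming each client's weekly volumes are already in a list ordered oldest to newest, with quiet weeks recorded as 0 rather than skipped. The names `lagged` and `is_decreasing` and the sample numbers are made up for illustration.

```python
def lagged(volumes, k):
    """Volume k weeks before the most recent week (k=0 is the most recent)."""
    return volumes[-1 - k]


def is_decreasing(volumes, weeks=3):
    """True if volume fell in each of the last `weeks` week-over-week steps."""
    if len(volumes) < weeks + 1:
        return False  # not enough history to decide
    recent = volumes[-(weeks + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Invented weekly volumes, oldest first.
steady_decline = [60, 55, 48, 40]
one_rebound = [60, 55, 58, 40]

print(is_decreasing(steady_decline))
print(is_decreasing(one_rebound))
```

With `weeks=3` the check compares four weekly values (three weeks ago, two weeks ago, one week ago, and now), which is the "less than that, and less than that, and less than that" chain.
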
But so that's the concept. Well, one, if you can help it, design your information systems with the quality and performance metrics in mind so that you don't have to go through a lot of these hoops.
And if you can do that, that's important. The second part is design your information systems with data transformation in mind, so that it makes it easier for a person like me to do these feature engineering tasks in the future. So, you know, having discrete data for the names of clients, not just free text where people can type in a client, so that you end up having Joe's Barbershop spelled with an apostrophe-s, with no apostrophe, just s, and sometimes it's Joe's BB Shop.
Having a drop-down list in that system where they can only choose from, you know, a list of qualified names makes it so that my job at the end of the day is easier. And, you know, I do have ways to hack around that, you know, that's what being a data hacker is about, but, you know, that time is money, and the time I'm doing that you're paying me to do that and not to give you the business insights that you're looking for.
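
A minimal sketch of one such workaround for free-text client names, assuming the variants differ only in case, spacing, and punctuation. The helper names `name_key` and `group_variants` are hypothetical. This approach does not catch abbreviations such as "BB Shop"; those would still need a hand-maintained mapping or fuzzy matching.

```python
import re
from collections import defaultdict


def name_key(raw):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def group_variants(names):
    """Group raw spellings that collapse to the same normalized key."""
    groups = defaultdict(set)
    for n in names:
        groups[name_key(n)].add(n)
    return dict(groups)


raw_names = ["Joe's Barbershop", "Joes Barbershop", "joe's barbershop ", "Joe's BB Shop"]
for key, variants in group_variants(raw_names).items():
    print(key, sorted(variants))
```
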
And then the other part is, you know, setting your thresholds and limits, having a plan once those limits are crossed, and recording the decisions that you make and the outcomes of those decisions.
So in the business sense those are really great properties that have proven very valuable to a lot of the customers and organizations I've worked with, but also in my personal life, taking a lot of the same ideas, whether it be, you know, regarding my health and fitness or, you know, my education or my kids' education.
Or, you know, my personal relationships. Sometimes it gets a little crazy with personal relationships, you don't want to keep all types of metrics, but, you know, the idea of having written down somewhere lines that, if people cross them, you make a decision about them.
It's kind of an out-there topic, but that way you could go back to it and say, hey, I told myself if this person did this, then I would propose to them, and here I am, am I too chicken to do it? Or, you know, whatever the situation might be.
But that's my talk in a nutshell, and that's the portion that I hope that you can take away and gain some value from.
So that's it from b-yeezi, and as I always say, keep hacking.
You've been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com.
If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself.
Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.