Kaggle – Part 3/N, Cleaning Data – Learning Something New Every Day

This will be a relatively short post, but I wanted to make sure I put something up after a long weekend of hiking at Acadia National Park with friends (& pup) and training at work.

When I last left off here, I was stuck on how to clean data whose values were encoded as text. Rather than getting a series of numbers and a dictionary to translate them to their true values (e.g. 0 = black, 1 = brown, etc.), we got the textual data itself and needed to encode it numerically. This was the first time I’d tackled a problem like this, so I was at a bit of a loss initially, but I think I found an approach that works (though I’m sure it can be vastly improved upon).

I’ll summarize the approach as a guide to the code attached at the end. The breed data is in a column ‘Breed’, and different breeds present in the same animal are separated by forward slashes ‘/’. In addition, some are designated mixes, so a typical entry might read ‘Australian Shepherd/Labrador Mix’.

The first thing I did was create a new column named ‘Mix’ and iterate through each animal (entry), checking whether the last 3 characters of the ‘Breed’ string were ‘Mix’. If so, I set the entry in the ‘Mix’ column to 1 (for positive) and reset the ‘Breed’ value to its first n-4 characters, where n is the length of the string (removing ‘ Mix’, including the space); if not, I set ‘Mix’ to 0.
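As a minimal sketch of that step (not the attached code itself), the check can be written as a small function on one string. The name split_mix and the sample entries are made up for illustration, and plain Python lists stand in for the actual columns.

```python
def split_mix(breed):
    """Return (breed_without_mix, mix_flag) for one raw 'Breed' string."""
    if breed[-3:] == "Mix":
        # Keep the first n-4 characters, dropping ' Mix' and its leading space.
        return breed[:-4], 1
    return breed, 0


# Hypothetical sample entries, for illustration only.
raw_breeds = ["Australian Shepherd/Labrador Mix", "Pug"]
cleaned = [split_mix(b) for b in raw_breeds]
print(cleaned)
```

Returning the flag together with the trimmed string keeps the two new values in step, so they can be written back to the ‘Breed’ and ‘Mix’ columns in one pass.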

Next I created an empty list ‘Breeds_List’ and iterated through the entries once again, splitting each string value on the forward slash using the .split() method. I then checked each element of the resulting list to see if it already existed in ‘Breeds_List’, and if not I appended it.
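Roughly, building the list of unique breeds could look like the sketch below. It assumes ‘ Mix’ has already been stripped from each value; the function name build_breeds_list and the sample values are hypothetical.

```python
def build_breeds_list(breed_values):
    """Collect each distinct breed once, in order of first appearance."""
    breeds_list = []
    for value in breed_values:
        # One entry may hold several breeds separated by '/'.
        for breed in value.split("/"):
            if breed not in breeds_list:
                breeds_list.append(breed)
    return breeds_list


# Hypothetical sample values, for illustration only.
sample = ["Australian Shepherd/Labrador", "Pug", "Labrador"]
breeds_list = build_breeds_list(sample)
print(breeds_list)
```

Keeping a list (rather than a set) preserves a fixed order, which matters later when each position in the list becomes a column.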

Finally I created a new column ‘Breeds’ of empty lists, and for each animal (entry) I checked every breed in ‘Breeds_List’ against that animal’s own breeds (its ‘Breed’ string split on ‘/’). If the breed was present in the animal, I appended ‘1’ to the Breeds list for that animal, and if not I appended ‘0’. In this way the ‘Breeds’ column represents a matrix of breed presence in each animal, where rows are different animals and columns are different unique breeds (corresponding directly to ‘Breeds_List’).
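The encoding step might be sketched as follows. This is an illustration under assumptions: encode_breeds is a hypothetical name, the indicators are stored as integers rather than the strings ‘1’ and ‘0’, and the sample values are invented.

```python
def encode_breeds(breed_value, breeds_list):
    """Return one row of 0/1 flags, one flag per breed in breeds_list."""
    # A set makes each membership test cheap, whatever the number of breeds.
    present = set(breed_value.split("/"))
    return [1 if breed in present else 0 for breed in breeds_list]


# Hypothetical inputs, for illustration only.
breeds_list = ["Australian Shepherd", "Labrador", "Pug"]
sample = ["Australian Shepherd/Labrador", "Pug", "Labrador"]
matrix = [encode_breeds(value, breeds_list) for value in sample]
print(matrix)
```

Each inner list is one animal’s row, and position i in every row corresponds to breeds_list[i].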

Next up was to tackle Colors, as previously described here. Now that Breeds was finished, I employed a stripped-down version of basically the same approach, minus the ‘Mix’ considerations. I’ve reprinted the code so far below, so you can see the progress.

Features to consider adding in the future:

Some breeds are more specific versions of others; e.g. ‘Bulldog’ vs. ‘English Bulldog’. Another column could be added to encode for specific sub-breeds.
Separate out colors from shading; e.g. one field for color, like black/blue/brown, and one optional field for presence of shading, like brindle/smoke/point/tiger.
- This can be tackled with an approach similar to the one used for ‘Breeds’, where ‘Mix’ was separated out, except that it has to handle multiple possible values (rather than just ‘Mix’, it needs to find ‘Brindle’ or ‘Smoke’, etc.).
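One way the color/shading split could work is sketched below. It assumes each value holds a single color (already split on ‘/’) with words separated by spaces; the SHADINGS tuple only contains the examples named above and is not exhaustive, and split_color is a hypothetical name.

```python
# Shading terms taken from the examples above; the real data may have more.
SHADINGS = ("Brindle", "Smoke", "Point", "Tiger")


def split_color(color):
    """Return (base_color, shadings_found) for one color string."""
    words = color.split(" ")
    shadings = [w for w in words if w in SHADINGS]
    base = [w for w in words if w not in SHADINGS]
    return " ".join(base), shadings


# Illustrative values only.
print(split_color("Brown Brindle"))
print(split_color("Black"))
```

Unlike the ‘Mix’ check, this looks for any of several keywords anywhere in the string rather than one fixed suffix.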

Note, both of these are similar types of features; what would really be useful is a rough count of the animals in each sub-type, as a gut-check on whether each one has enough examples to be statistically meaningful (as a general rule I use 20-25 on the low end, and prefer 35+).
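A quick gut-check along those lines could use collections.Counter from the standard library. The helper name rare_values is hypothetical, the default threshold of 20 comes from the rule of thumb above, and the sample data is invented.

```python
from collections import Counter


def rare_values(entries, min_count=20):
    """Return the values that occur fewer than min_count times."""
    # Count every '/'-separated value across all entries.
    counts = Counter(v for entry in entries for v in entry.split("/"))
    return {value: n for value, n in counts.items() if n < min_count}


# Tiny invented sample with a low threshold, for illustration only.
sample = ["Bulldog/Pug", "Bulldog", "Bulldog"]
print(rare_values(sample, min_count=2))
```

Values that come back from this check are candidates for merging into a broader category rather than getting their own column.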

Final note: I’ve been running the code in stages to confirm it works, with much of it commented out, so I’m not entirely sure how long it would take to run on the full training data set at this point (each run of 3-4 sections has taken roughly 3-10 min). Any recommendations on improving performance would be hugely appreciated.


Code:

Animal Shelter Outcomes v1-4
