Data Science – Hojoong Chung

While working on a project for the Udacity Nanodegree program I’m currently attending, I had to count the null values in each row and display the counts in a histogram. However, the Pandas DataFrame contained 891221 rows, so iterating through the rows with the following code took quite a long time:

df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)

Although this post suggested that using apply() is much faster than using iterrows(), it was still too slow to finish the project efficiently. After some searching, I found this discussion. In Icyblade’s answer, he mentioned this:

When using pandas, try to avoid performing operations in a loop, including apply, map, applymap, etc. That’s slow!

Icyblade’s suggestion was to use the following code:

df.isnull().sum(axis=1)

I applied it to my code, and boom! It worked like a charm. The long wait was gone and the result appeared in the blink of an eye. A good lesson learned.
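
To make the difference concrete, here is a minimal sketch on a tiny made-up DataFrame (the column names and values are my own assumptions, not the project data). Since my original sum_of_nulls_in_row helper isn’t shown here, the apply() version below uses row.isnull().sum() as a stand-in. Both approaches should give the same per-row counts; the vectorized one just avoids calling a Python function once per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real 891221-row dataset.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": ["x", None, "z"],
})

# Row-wise apply: the lambda runs once per row, which gets slow on large frames.
# row.isnull().sum() is a stand-in for the sum_of_nulls_in_row helper.
nulls_apply = df.apply(lambda row: row.isnull().sum(), axis=1)

# Vectorized: build one boolean mask for the whole frame, then sum across columns.
nulls_vectorized = df.isnull().sum(axis=1)

# Both approaches agree on the number of nulls per row.
assert nulls_apply.tolist() == nulls_vectorized.tolist()

# Frequency of each null count, i.e. the data behind the histogram.
null_count_freq = nulls_vectorized.value_counts().sort_index()
```

From here, something like nulls_vectorized.plot.hist() could draw the histogram, assuming matplotlib is installed.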


Efficient way of collecting sum of missing values per row in Pandas