While working on a project for the Udacity Nanodegree program I’m currently attending, I had to count the null values in each row and display the counts in a histogram. However, the pandas DataFrame contained 891221 rows, and iterating through them with the following code took quite a long time:
df.apply(lambda row: sum_of_nulls_in_row(row), axis=1)
Although it was suggested in this post that using apply() is much faster than using iterrows(), it was still too slow to finish the project efficiently. After some searching, I found this discussion. In Icyblade’s answer, he mentioned this:
When using pandas, try to avoid performing operations in a loop, including apply, map, applymap etc. That’s slow!
Icyblade’s suggestion was to use the following code:
df.isnull().sum(axis=1)
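To make the difference concrete, here is a minimal sketch that runs both approaches on a tiny made-up DataFrame and checks that they give the same counts. The column names and values are invented for illustration, and sum_of_nulls_in_row is only my guess at what such a helper might look like, not the original function from the project.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column names and values are made up.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [np.nan, np.nan, 2.5, 4.0],
})

def sum_of_nulls_in_row(row):
    # Hypothetical stand-in for the helper used in the post.
    return int(row.isnull().sum())

# Row-by-row version: the Python function is called once per row.
slow = df.apply(sum_of_nulls_in_row, axis=1)

# Vectorized version: build one boolean mask, then sum across columns.
fast = df.isnull().sum(axis=1)

print(slow.tolist())  # [1, 3, 0, 1]
print(fast.tolist())  # [1, 3, 0, 1]
```

Both versions return one count per row. The vectorized one avoids calling a Python function for every row, which is where most of the time goes on a dataset with hundreds of thousands of rows.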
I applied it to my code, and boom! It worked like a charm. The long wait was gone and the result was there in a blink. A good lesson learned.
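For the histogram step, one possible approach is to tally how many rows share each null count. The sketch below uses a small invented DataFrame and value_counts() to get the numbers behind the histogram; it is an illustration under those assumptions, not the code from the project.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [np.nan, np.nan, 1.0, 2.0, 3.0],
})

# Number of nulls in each row, computed without a Python-level loop.
null_counts = df.isnull().sum(axis=1)

# How many rows have 0 nulls, 1 null, 2 nulls, and so on.
histogram = null_counts.value_counts().sort_index()
print(histogram.to_dict())  # {0: 2, 1: 2, 2: 1}

# With matplotlib installed, null_counts.plot.hist() would draw the chart.
```

The dictionary maps each null count to the number of rows that have it, which is exactly what the bars of the histogram show.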