Udacity Data Wrangling Project Done!

I am currently enrolled in the Udacity Data Analyst Nanodegree program. Yesterday, I submitted my third project, Data Wrangling, and it passed review. The project involved downloading a dataset from the web via HTTP, auditing it for irrelevant or invalid values, cleaning it with a Python script, importing it into a database (I chose MongoDB), and drawing some insights from it with database queries. In summary, I’ve learned:

  • Data wrangling techniques
  • Data auditing process using Python
  • MongoDB

Although I already knew how to work with Python and MongoDB pretty well, it was good to learn the overall process of dealing with data from the web. Have a look at my exercise if you are interested: you can find the report and the source code here! I will keep posting about my progress in the Udacity Data Analyst Nanodegree program on this blog.

Pandas – Lambda function

After finishing the first assignment of the Udacity Data Analyst Nanodegree, I decided to summarize and record the most commonly used Pandas commands here. The first topic I would like to cover is the lambda function.

A lambda function can be used in Pandas via the apply() method, as shown below:

import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df.apply(lambda x : x + 1)

The code above adds 1 to every value in the DataFrame named ‘df’. The result looks like this:

__| A B
0 | 2 4
1 | 3 5
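
To see why this works, it may help to note that DataFrame.apply() hands the function one whole column (a Series) at a time by default, not one cell at a time. Below is a minimal sketch reusing the same ‘df’; the names col_ranges and row_sums are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# By default (axis=0) the lambda receives each column as a Series,
# so a function that reduces a Series yields one value per column.
col_ranges = df.apply(lambda col: col.max() - col.min())

# With axis=1 the lambda receives each row as a Series instead.
row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(col_ranges)
print(row_sums)
```

Because x + 1 works on a whole Series at once, the earlier example returns a DataFrame of the same shape as the input.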

A lambda function can be applied to a single column with the code below:

df['A'].apply(lambda x : x + 1)

But the result is a Series, so it shows only the index and values, without the column name as a header:

0 | 2
1 | 3

To include the column name, the following code can be used:

df.apply({'A': lambda x : x + 1})

And the result will look like this:

__| A
0 | 2
1 | 3
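
Another way to keep the column name, offered here as a sketch rather than the only approach, is to select the column with double brackets so that the selection stays a DataFrame, or to convert the Series back with to_frame(). The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# df[['A']] is a one-column DataFrame, so the result keeps the 'A' header.
kept_header = df[['A']].apply(lambda x: x + 1)

# Alternatively, apply to the Series and turn the result back into a DataFrame.
from_series = df['A'].apply(lambda x: x + 1).to_frame()

print(kept_header)
print(from_series)
```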

A lambda function can also be used in combination with the groupby() function.

df = pd.DataFrame({'Year':[2016,2016,2017,2017],'Student':['Paul','Jack', 'Paul', 'Jack'], 'Score':[90,80,100,70]})
df.groupby('Student').apply(lambda x : x[x.Score == x.Score.max()])

The code above applies the lambda function after the data is grouped by the values in the ‘Student’ column; in this case, it forms one group for ‘Paul’ and one for ‘Jack’. The lambda function keeps only the rows whose Score matches the maximum value in that group. The result looks like this:

__________| Score Student Year
Student
Jack    1 | 80    Jack    2016
Paul    2 | 100   Paul    2017
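
The same rows can also be selected without apply(), which avoids the extra ‘Student’ index level. This is a sketch of one alternative, using idxmax() to find the index label of each group’s highest Score. Unlike the lambda version, it keeps only one row per student if there is a tie. The names best_rows and top_scores are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017, 2017],
                   'Student': ['Paul', 'Jack', 'Paul', 'Jack'],
                   'Score': [90, 80, 100, 70]})

# idxmax() returns, for each student, the row label of the highest Score.
best_rows = df.groupby('Student')['Score'].idxmax()

# Use those labels to pull the full rows from the original DataFrame.
top_scores = df.loc[best_rows]
print(top_scores)
```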

A similar summary can be produced by using the agg() function after groupby():

df.groupby('Student').agg([lambda x : x.max()])

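This code gives a more compact result than the previous code. Note that agg() takes the maximum of each column independently, so Year here is each student’s latest year, not necessarily the year of the top score:

          | Score   Year
__________|(lambda) (lambda)
Student
Jack      | 80      2017
Paul      | 100     2017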
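
If the ‘(lambda)’ column labels are not wanted, named aggregation (available in pandas 0.25 and later) lets each output column be given an explicit name. Below is a minimal sketch; the names max_score and latest_year are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017, 2017],
                   'Student': ['Paul', 'Jack', 'Paul', 'Jack'],
                   'Score': [90, 80, 100, 70]})

# Each keyword is an output column name; the tuple is (source column, function).
summary = df.groupby('Student').agg(
    max_score=('Score', 'max'),
    latest_year=('Year', 'max'),
)
print(summary)
```

Each column is still aggregated on its own, so latest_year is the student’s most recent year, not the year in which max_score was achieved.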