- Basic terminology
- What is tidy data?
- Five most common problems with messy data-sets:
- Useful Links
Data cleaning is a very important part of any data science project as data scientist spend 80% of their time is this step of the project. But not very much attentions is given to the cleaning process and not much research efforts are put to create any sort of framework recently I came across an amazing paper titled as Tidy data by Hadley Wickham in Journal of Statistical Software in which he talks about common problems one might encounter in data cleaning and what a Tidy data looks like I couldn’t agree more to him, he has also created a R package reshape and reshape2 for data cleaning, but the problem was the paper had very little to no code I also found the code version of the paper but it was in _R_, while most of my data cleaning work is done in pandas, I had to translate all those R solutions to pandas equivalent, so in this post the I will summarize all the main idea of the paper that the author suggests in the paper and also how we can solve it in pandas.