Day 6 – Data Wrangling & Cleaning with Python & R


Introduction

In 2025, data scientists still spend an estimated 80% of their time on data wrangling and cleaning. Raw data is messy: missing values, duplicates, inconsistent formats, and errors are everywhere. Without proper cleaning, even the most advanced ML models fail.

At curiositytech.in (Nagpur: 1st Floor, Plot No 81, Wardha Rd, Gajanan Nagar), we emphasize hands-on practice in both Python and R, because mastering data wrangling sets the foundation for insights, machine learning, and business impact.

This blog is a comprehensive guide to cleaning and wrangling data, comparing Python vs R approaches, and providing actionable workflows with examples.


Section 1 – Understanding Dirty Data

Common Data Issues:

  1. Missing Values – Null or NaN entries
  2. Duplicates – Repeated rows that distort analysis
  3. Incorrect Data Types – Strings instead of numeric, dates in wrong formats
  4. Outliers – Extreme values that skew results
  5. Inconsistent Labels – Typos, different naming conventions

Real-World Example:
An e-commerce dataset may have “USA”, “U.S.A.”, and “United States” in the country column. If not corrected, sales aggregation will be inaccurate.
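To see why this matters, here is a minimal sketch in pandas (the column names and sales figures are hypothetical) showing how inconsistent labels split one country's sales across three groups until the labels are standardized:

```python
import pandas as pd

# Hypothetical order data with three spellings of the same country
df = pd.DataFrame({
    "Country": ["USA", "U.S.A.", "United States", "USA"],
    "Sales": [100, 200, 150, 50],
})

# Naive aggregation treats each spelling as a separate country
raw_totals = df.groupby("Country")["Sales"].sum()
print(len(raw_totals))  # 3 "countries" instead of 1

# After standardizing labels, the totals are correct
df["Country"] = df["Country"].replace({"U.S.A.": "USA", "United States": "USA"})
clean_totals = df.groupby("Country")["Sales"].sum()
print(clean_totals["USA"])  # 500
```
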

Infographic Description:
A flowchart showing data issues: Raw Data → Missing Values → Duplicates → Inconsistent Labels → Outliers → Cleaned Data. Each node shows an example of the error.


Section 2 – Data Wrangling with Python

Key Libraries:

  • Pandas
  • NumPy

Step-by-Step Python Workflow

  1. Import Libraries & Data:

import pandas as pd

import numpy as np

df = pd.read_csv('sales_data.csv')

  2. Inspect Data:

df.head()

df.info()

df.describe()

  3. Handle Missing Values:

df['Revenue'] = df['Revenue'].fillna(df['Revenue'].mean())

df.dropna(subset=['Customer_ID'], inplace=True)

  4. Remove Duplicates:

df.drop_duplicates(inplace=True)

  5. Correct Data Types:

df['Order_Date'] = pd.to_datetime(df['Order_Date'])

  6. Detect & Handle Outliers:

Q1 = df['Revenue'].quantile(0.25)

Q3 = df['Revenue'].quantile(0.75)

IQR = Q3 - Q1

df = df[(df['Revenue'] >= Q1 - 1.5*IQR) & (df['Revenue'] <= Q3 + 1.5*IQR)]

  7. Standardize Labels:

df['Country'] = df['Country'].replace({'U.S.A.': 'USA', 'United States': 'USA'})

Outcome: Cleaned dataset ready for analysis or ML modeling.


Section 3 – Data Wrangling with R

Key Libraries:

  • dplyr
  • tidyr

Step-by-Step R Workflow

  1. Import Libraries & Data:

library(dplyr)

library(tidyr)

df <- read.csv("sales_data.csv")

  2. Inspect Data:

head(df)

summary(df)

str(df)

  3. Handle Missing Values:

df$Revenue[is.na(df$Revenue)] <- mean(df$Revenue, na.rm = TRUE)

df <- drop_na(df, Customer_ID)

  4. Remove Duplicates:

df <- distinct(df)

  5. Correct Data Types:

df$Order_Date <- as.Date(df$Order_Date, format = "%Y-%m-%d")

  6. Detect & Handle Outliers:

Q1 <- quantile(df$Revenue, 0.25)

Q3 <- quantile(df$Revenue, 0.75)

IQR <- Q3 - Q1

df <- df[df$Revenue >= Q1 - 1.5*IQR & df$Revenue <= Q3 + 1.5*IQR, ]

  7. Standardize Labels:

df$Country <- recode(df$Country, "U.S.A." = "USA", "United States" = "USA")

Outcome: A clean, structured dataset ready for EDA and modeling.


Section 4 – Python vs R Comparison Table

| Feature | Python (Pandas + NumPy) | R (dplyr + tidyr) |
| --- | --- | --- |
| Ease of Learning | Moderate | Medium |
| Handling Large Datasets | Excellent | Good |
| Syntax Simplicity | Intuitive, Pythonic | Verb-based, readable |
| Visualization Integration | Matplotlib/Seaborn | ggplot2 |
| Community & Packages | Extensive | Strong in statistical analysis |
| Best For | General data science & ML | Statistics-heavy projects, research |

Section 5 – Real-World Case Study

Scenario: A retail company wants to analyze monthly sales trends.

  • Python Approach:
    • Clean missing values in Revenue
    • Standardize Product_Category
    • Remove duplicate transactions
    • Outcome: Dataset ready for predictive analysis
  • R Approach:
    • Clean Revenue using dplyr::mutate()
    • Group by Region and summarize revenue
    • Visualize trends using ggplot2
    • Outcome: Report-ready dataset with charts
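The Python side of this case study can be sketched as a single cleaning function. This is a minimal illustration, assuming hypothetical column names (Transaction_ID, Revenue, Product_Category):

```python
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the case-study cleaning steps; column names are illustrative."""
    df = df.copy()
    # Fill missing revenue with the column mean
    df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())
    # Standardize category labels: trim whitespace, use consistent casing
    df["Product_Category"] = df["Product_Category"].str.strip().str.title()
    # Remove duplicate transactions, keeping the first occurrence
    df = df.drop_duplicates(subset=["Transaction_ID"])
    return df

sample = pd.DataFrame({
    "Transaction_ID": [1, 1, 2, 3],
    "Revenue": [100.0, 100.0, None, 300.0],
    "Product_Category": [" electronics", "electronics", "TOYS ", "Toys"],
})
cleaned = clean_sales(sample)
print(len(cleaned))                      # 3 rows after de-duplication
print(cleaned["Revenue"].isna().sum())   # 0 missing values
```

Keeping the steps inside one function makes the pipeline reusable on each month's raw export.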

Impact: In this scenario, the cleaned data improved forecast accuracy by 25% and enabled automated dashboards for managers.


Section 6 – Best Practices for Data Wrangling

  1. Always inspect raw data first (head, summary, info)
  2. Document every cleaning step – reproducibility is key
  3. Use vectorized operations for speed
  4. Keep backups of raw data before cleaning
  5. Automate repetitive cleaning tasks with scripts
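As a small sketch of practice 3, vectorized operations apply a computation to a whole column at once instead of looping row by row (the tax rate and tier threshold here are made-up examples):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Revenue": [100.0, 250.0, 400.0]})

# Slow, row-by-row approach (avoid):
# taxed = [r * 1.18 for r in df["Revenue"]]

# Vectorized: one operation over the entire column
df["Revenue_with_tax"] = df["Revenue"] * 1.18

# np.where evaluates a condition across the column in a single pass
df["Tier"] = np.where(df["Revenue"] >= 250, "high", "low")
print(df["Tier"].tolist())  # ['low', 'high', 'high']
```
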

CuriosityTech Insight: Our mentors train learners on real-world messy datasets, preparing them for industry-level challenges in Python and R.


Section 7 – How to Become an Expert

  • Work on multiple datasets from different domains (finance, retail, healthcare)
  • Practice Python + R side by side to compare efficiency
  • Build a portfolio showing before vs after cleaning
  • Learn advanced techniques like regular expressions, merging datasets, pivot tables, and feature engineering
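As a taste of the regular-expression techniques mentioned above, here is a minimal sketch (the messy country aliases are hypothetical) that strips punctuation and whitespace before mapping known variants to one label:

```python
import pandas as pd

# Hypothetical messy country labels
s = pd.Series(["u.s.a.", "USA ", " U.S.A", "united states"])

# Regex cleanup: trim, drop dots and internal spaces, normalize case,
# then map the remaining known alias to the canonical label
cleaned = (
    s.str.strip()
     .str.replace(r"[.\s]", "", regex=True)
     .str.upper()
     .replace({"UNITEDSTATES": "USA"})
)
print(cleaned.tolist())  # ['USA', 'USA', 'USA', 'USA']
```
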

Contact curiositytech.in

  • Phone: +91-9860555369
  • Email: contact@curiositytech.in
  • Social: Instagram: CuriosityTech Park, LinkedIn: Curiosity Tech

Hands-on guidance accelerates mastery and prepares learners for data science interviews and real-world projects.


Conclusion

Data wrangling and cleaning are the foundation of all data science work in 2025. Mastering Python and R workflows ensures your data is reliable, structured, and ready for analysis or machine learning.

At curiositytech.in (Nagpur), learners gain practical experience, mentorship, and portfolio-ready projects, making them job-ready data scientists capable of handling any dataset.

