Dataset Cleaning Tips

Cleaning Large Datasets with Mixed Missing Data

This prompt helps data science teams develop strategies for cleaning large datasets with a mix of missing data patterns. It focuses on identifying the nature of missingness, handling gaps appropriately, and ensuring data integrity for downstream tasks.

Responsible:

Data Science

Accountable, Informed or Consulted:

Data Science, Engineering

THE PREP

Creating effective prompts involves tailoring them with detailed, relevant information and uploading documents that provide the best context. Prompts act as a framework to guide the response, but specificity and customization ensure the most accurate and helpful results. Use these prep tips to get the most out of this prompt:

Analyze the dataset to identify the extent and distribution of missing data.
Determine the importance of features with missing values to the analysis or model.
Review the dataset size and computing resources available for processing.
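
The prep steps above can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed workflow: the helper name missingness_profile and the toy columns (age, blood_pressure, smoker) are hypothetical stand-ins for a real dataset. It reports how much is missing per column, whether columns tend to be missing together, and roughly how much memory the frame occupies.

```python
import numpy as np
import pandas as pd

def missingness_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column, worst column first."""
    profile = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
        "dtype": df.dtypes.astype(str),
    })
    return profile.sort_values("pct_missing", ascending=False)

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 29, np.nan],
    "blood_pressure": [120, 135, np.nan, 118, 142],
    "smoker": ["no", None, "yes", "no", "no"],
})

profile = missingness_profile(df)

# Correlation between missingness indicators: values well above zero
# suggest columns that tend to be missing in the same rows.
co_missing = df.isna().astype(int).corr()

# Rough in-memory footprint, useful when sizing compute resources.
memory_mb = df.memory_usage(deep=True).sum() / 1e6

print(profile)
print(co_missing.round(2))
print(f"approx. memory: {memory_mb:.4f} MB")
```

On a real dataset, the profile helps decide which features deserve attention first, and the co-missingness table gives an early hint about whether gaps are related to each other.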

THE PROMPT

Help create a cleaning plan for handling mixed missing data patterns in a large dataset from [specific domain, e.g., healthcare records]. Focus on:

Classifying Missingness: Recommending steps, such as, ‘Identify whether missing data is MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random) and tailor handling strategies accordingly.’
Handling Missing Numerical Data: Suggesting imputation methods, like, ‘Use mean or median imputation only when data is plausibly MCAR and missingness is low, regression imputation for MAR data, or advanced techniques like KNN or multiple imputation for more complex patterns.’
Handling Missing Categorical Data: Including replacement strategies, such as, ‘Replace missing values with the mode, a new category (e.g., "Unknown"), or predictions from classification models.’
Dropping vs. Imputing: Proposing decision criteria, like, ‘Drop rows or columns whose share of missing values exceeds a threshold (e.g., >50%) unless they are critical to the analysis.’
Validation: Recommending post-cleaning checks, such as, ‘Assess the impact of imputation on statistical distributions or model performance to ensure cleaning decisions are effective.’

Provide actionable tips for cleaning large datasets with mixed missing data to ensure consistency and reliability for downstream analysis. If additional details about the dataset or specific missing data patterns are needed, ask clarifying questions to refine the plan.
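
As a rough illustration of the kind of plan this prompt might produce, the sketch below combines three of the focus areas: a drop threshold, simple numeric and categorical imputation, and a post-cleaning check. Everything here is an assumption made for demonstration: the function name clean_missing, its parameters drop_threshold and critical, and the toy columns are hypothetical, and median or "Unknown" filling is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

def clean_missing(df, drop_threshold=0.5, critical=()):
    """Drop sparse columns, then apply simple imputation to the rest."""
    out = df.copy()
    frac_missing = out.isna().mean()
    # Drop columns above the threshold unless they are flagged as critical.
    to_drop = [c for c in out.columns
               if frac_missing[c] > drop_threshold and c not in critical]
    out = out.drop(columns=to_drop)
    for col in list(out.columns):
        if not out[col].isna().any():
            continue
        # Keep a flag so downstream models can see which values were imputed.
        out[col + "_was_missing"] = out[col].isna()
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out, to_drop

# Hypothetical toy data standing in for a real dataset.
raw = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 29.0, 40.0, np.nan],
    "lab_result": [np.nan, np.nan, np.nan, 7.1, np.nan, 6.8],
    "smoker": ["no", None, "yes", "no", "no", "yes"],
})

cleaned, dropped = clean_missing(raw, drop_threshold=0.5)

# Validation: compare summary statistics before and after imputation.
comparison = pd.DataFrame({
    "before": raw["age"].describe(),
    "after": cleaned["age"].describe(),
})
print("dropped columns:", dropped)
print(comparison)
```

In this toy run the standard deviation of age shrinks after median imputation, which is the kind of distribution shift the validation step is meant to surface.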

Bonus Add-On Prompts

Propose strategies for visualizing missing data patterns to inform cleaning decisions.

Suggest methods for evaluating imputation accuracy using cross-validation or sensitivity analysis.

Highlight techniques for scaling imputation workflows for large datasets using distributed frameworks like PySpark.
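
One simple way to approach the imputation-accuracy add-on is a masking check: hide values that are actually known, impute them, and measure the error, repeating over several random seeds. The sketch below is a stand-in for fuller cross-validation, not a replacement for it. The helper masked_rmse, the synthetic series, and the three candidate strategies are all hypothetical choices for illustration, and interpolation only makes sense here because the synthetic data is ordered.

```python
import numpy as np
import pandas as pd

def masked_rmse(series, impute_fn, frac=0.2, seed=0):
    """Hide a random share of observed values, impute, and score the result."""
    rng = np.random.default_rng(seed)
    observed = series.dropna()
    n_hide = max(1, int(len(observed) * frac))
    hidden_idx = rng.choice(observed.index.to_numpy(), size=n_hide, replace=False)
    masked = series.copy()
    masked.loc[hidden_idx] = np.nan
    imputed = impute_fn(masked)
    errors = imputed.loc[hidden_idx] - series.loc[hidden_idx]
    return float(np.sqrt((errors ** 2).mean()))

# Synthetic stand-in: a linear trend plus noise, so the truth is known.
rng = np.random.default_rng(42)
values = pd.Series(np.linspace(0, 10, 200) + rng.normal(0, 0.3, 200))

strategies = {
    "mean": lambda s: s.fillna(s.mean()),
    "median": lambda s: s.fillna(s.median()),
    "linear_interp": lambda s: s.interpolate(limit_direction="both"),
}

# Average the error over several seeds to reduce dependence on one mask.
scores = {
    name: float(np.mean([masked_rmse(values, fn, seed=k) for k in range(5)]))
    for name, fn in strategies.items()
}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Varying frac or the masking pattern (for example, hiding values preferentially in one subgroup) turns the same loop into a basic sensitivity analysis.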

Use AI responsibly by verifying its outputs, as it may occasionally generate inaccurate or incomplete information. Treat AI as a tool to support your decision-making, ensuring human oversight and professional judgment for critical or sensitive use cases.

SUGGESTIONS TO IMPROVE

Focus on datasets with domain-specific missing data patterns, like patient health records or survey results.
Include tips for handling missing time-series data alongside static attributes.
Propose ways to automate missing data detection and cleaning using Python libraries like missingno (for visualizing missingness) or DataWig (for imputation).
Highlight tools like KNIME or Alteryx for visually managing data cleaning workflows.
Add suggestions for documenting cleaning decisions for reproducibility and team collaboration.
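
Two of the suggestions above, time-series gaps and documenting decisions, can be illustrated together. The sketch below assumes a daily series with a DatetimeIndex; the helper fill_short_gaps, the max_gap parameter, the column name "reading", and the log fields are hypothetical. It makes implicit gaps explicit, interpolates only short gaps, and records what was done in a small JSON log.

```python
import json
import pandas as pd

def fill_short_gaps(s: pd.Series, max_gap: int = 2) -> pd.Series:
    """Time-interpolate runs of at most max_gap missing values; leave longer runs."""
    is_na = s.isna()
    # Label each run of consecutive missing or non-missing values.
    run_id = (is_na != is_na.shift()).cumsum()
    # Length of each missing run (zero for runs of observed values).
    run_len = is_na.groupby(run_id).transform("sum")
    short = is_na & (run_len <= max_gap)
    # Interpolate only between observed points, never past the ends.
    candidate = s.interpolate(method="time", limit_area="inside")
    return s.where(~short, candidate)

# Hypothetical readings with some days absent from the index entirely.
obs = pd.Series(
    [1.0, 3.0, 7.0, 8.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-07", "2024-01-08"]),
)

regular = obs.asfreq("D")  # implicit gaps become explicit NaN rows
filled = fill_short_gaps(regular, max_gap=2)

# A minimal decision log that can be versioned alongside the cleaning code.
decision_log = [{
    "column": "reading",
    "rule": "time interpolation for gaps of at most 2 steps",
    "n_filled": int(regular.isna().sum() - filled.isna().sum()),
    "n_left_missing": int(filled.isna().sum()),
}]
log_json = json.dumps(decision_log, indent=2)

print(filled)
print(log_json)
```

Leaving long gaps unfilled is a deliberate choice in this sketch: interpolating across them would invent values with little support, and the log makes the remaining gaps visible to the rest of the team.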

WHEN TO USE

During the preprocessing of large datasets with mixed data types and missing values.
To standardize cleaning practices across projects and ensure reliable imputation strategies.
When preparing datasets for machine learning or statistical modeling.

WHEN NOT TO USE

For small datasets where manual inspection and cleaning are feasible.
If missing data is minimal or has already been addressed.