Dataset Cleaning Tips
Cleaning Imbalanced Datasets for Better Analysis
This prompt helps data science teams address imbalances in datasets, ensuring that underrepresented classes or categories are handled appropriately to improve analytical and modeling outcomes.
Responsible:
Data Science
Accountable, Informed or Consulted:
Data Science, Engineering
THE PREP
Creating effective prompts involves tailoring them with detailed, relevant information and uploading documents that provide the best context. Prompts act as a framework to guide the response, but specificity and customization ensure the most accurate and helpful results. Use these prep tips to get the most out of this prompt:
Review the dataset to identify class distributions and underrepresented categories.
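Define the analysis or modeling goals, such as improving minority-class recall or precision. Overall classification accuracy can look high even when rare classes are missed, so it is rarely enough on its own.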
Identify the tools and libraries available for handling class imbalances, such as imbalanced-learn for resampling or the class-weight options in scikit-learn and PyTorch.
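As a starting point for the first prep tip, the sketch below summarizes class counts and shares before any cleaning is planned. It is a minimal, dependency-free illustration: the function names and the fraud/legit labels are hypothetical, and a real dataset would normally be loaded from a file or data frame.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)} ordered from rarest to most common."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in sorted(counts.items(), key=lambda item: item[1])
    }


def imbalance_ratio(labels):
    """Ratio of the most common class count to the rarest class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical labels, for illustration only.
labels = ["legit"] * 95 + ["fraud"] * 5

distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)

for label, (count, share) in distribution.items():
    print(f"{label}: {count} rows ({share:.1%})")
print(f"imbalance ratio: {ratio:.1f}")
```

The same counts can be fed to a bar plot once a plotting library is chosen. The ratio is one rough way to decide whether the imbalance is large enough to need treatment.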
THE PROMPT
Help create a cleaning plan for addressing imbalances in [specific dataset, e.g., fraud detection transactions]. Focus on:
Class Imbalance Identification: Recommending analysis steps, such as, ‘Visualize class distributions to detect underrepresented categories using bar plots or histograms.’
Resampling Techniques: Suggesting strategies, like, ‘Apply oversampling methods such as SMOTE or undersampling techniques to balance class distributions.’
Synthetic Data Generation: Proposing augmentation methods, such as, ‘Generate synthetic data points for minority classes using GANs or domain-specific simulations.’
Weight Adjustments: Including model preparation, such as, ‘Adjust class weights during model training so that errors on underrepresented classes are penalized more heavily.’
Validation Strategy: Recommending evaluation checks, such as, ‘Apply resampling only to the training data, never to validation or test sets, and use stratified splits so that each split keeps the original class proportions.’
Provide actionable guidance for cleaning and preparing imbalanced datasets to ensure fair and accurate analysis or model performance. If additional details about the dataset or its application are needed, ask clarifying questions to refine the cleaning plan.
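To make the resampling, weighting, and validation points in the prompt concrete, here is a minimal sketch using only the Python standard library. It assumes rows are simple (id, label) tuples and uses plain random oversampling as a stand-in for SMOTE, which creates new synthetic points rather than duplicates. All function names and the sample data are hypothetical. The weight formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic.

```python
import random
from collections import Counter, defaultdict


def group_by_class(rows, label_of):
    """Group rows into {label: [rows]}."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_class(rows, label_of).values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def random_oversample(rows, label_of, seed=0):
    """Duplicate minority rows at random until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(rows, label_of)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    return {label: n_samples / (len(counts) * c) for label, c in counts.items()}


def label_of(row):
    return row[1]


# Hypothetical data: 10 fraud rows and 190 legitimate rows.
rows = [(i, "fraud" if i < 10 else "legit") for i in range(200)]

# Split first, then resample the training part only.
train, test = stratified_split(rows, label_of)
train_balanced = random_oversample(train, label_of)
weights = balanced_class_weights([label_of(r) for r in train])

print(Counter(label_of(r) for r in test))
print(Counter(label_of(r) for r in train_balanced))
print(weights)
```

Oversampling and class weights are usually alternatives rather than steps to stack. The sketch computes both only to show each one. The test split is left untouched, so it still reflects the original class proportions.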
Bonus Add-On Prompts
Propose strategies for combining domain knowledge with synthetic data generation to address rare cases.
Suggest methods for monitoring the impact of resampling techniques on model performance.
Highlight techniques for handling imbalances in multi-class datasets with hierarchical categories.
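For the add-on about monitoring the impact of resampling, one simple approach is to compare per-class precision and recall before and after the change instead of relying on accuracy alone. The sketch below is a dependency-free illustration. The predictions are made-up numbers, not results from any real model, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision": p, "recall": r}} for every label seen."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    metrics = {}
    for label in set(true_counts) | set(pred_counts):
        tp = true_positives[label]
        precision = tp / pred_counts[label] if pred_counts[label] else 0.0
        recall = tp / true_counts[label] if true_counts[label] else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics


# Hypothetical test-set labels: 5 fraud rows followed by 95 legitimate rows.
y_true = ["fraud"] * 5 + ["legit"] * 95

# Made-up predictions from a model that always picks the majority class.
pred_before = ["legit"] * 100

# Made-up predictions after resampling: 4 of 5 fraud rows caught,
# at the cost of 5 legitimate rows flagged as fraud.
pred_after = ["fraud"] * 4 + ["legit"] * 1 + ["fraud"] * 5 + ["legit"] * 90

before = per_class_metrics(y_true, pred_before)
after = per_class_metrics(y_true, pred_after)

print("accuracy:", accuracy(y_true, pred_before), "->", accuracy(y_true, pred_after))
print("fraud recall:", before["fraud"]["recall"], "->", after["fraud"]["recall"])
print("fraud precision:", before["fraud"]["precision"], "->", after["fraud"]["precision"])
```

In this made-up case accuracy falls slightly while fraud recall rises from zero, which is the kind of trade-off that accuracy alone would hide.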
Use AI responsibly by verifying its outputs, as it may occasionally generate inaccurate or incomplete information. Treat AI as a tool to support your decision-making, ensuring human oversight and professional judgment for critical or sensitive use cases.
SUGGESTIONS TO IMPROVE
Focus on handling imbalances for specific tasks, like medical diagnosis or credit scoring.
Include tips for documenting changes to class distributions during the cleaning process.
Propose ways to combine manual validation with automated balancing techniques.
Highlight tools like Alteryx or RapidMiner for balancing datasets without coding.
Add suggestions for visualizing changes in class distributions post-cleaning.
WHEN TO USE
During the preprocessing phase for classification tasks with imbalanced data.
To improve model performance when predicting rare or minority classes.
When preparing datasets for industries with high-stakes decisions, like healthcare or finance.
WHEN NOT TO USE
For datasets with naturally balanced class distributions.
If the imbalance does not significantly affect analysis or modeling outcomes.