
Data Testing

Designing Test Cases for Data Integrity Validation

This prompt helps data science teams create test cases for validating the integrity of datasets. It focuses on ensuring data completeness, consistency, and correctness to prevent errors in downstream analyses or models.

Responsible:

Data Science

Accountable, Informed, or Consulted:

Data Science, Engineering, QA

THE PREP

Effective prompts are tailored with detailed, relevant information and supported by uploaded documents that provide context. A prompt is only a framework to guide the response; specificity and customization produce the most accurate and helpful results. Use these prep tips to get the most out of this prompt:

  • Define the dataset’s schema, key fields, and expected relationships.

  • Identify critical fields and thresholds for completeness, accuracy, and range validations.

  • Gather tools and libraries for implementing automated data integrity checks.

THE PROMPT

Help create detailed test cases to validate the integrity of [specific dataset, e.g., transaction records]. Focus on:

  • Completeness Tests: Recommending validation steps, such as ‘Check for missing values in critical fields, ensuring that required columns like [specific column names] are fully populated.’

  • Consistency Checks: Suggesting rules, such as ‘Validate data consistency by ensuring that related fields align correctly, such as matching customer IDs across orders and payments tables.’

  • Range Validations: Including range checks, such as ‘Ensure numerical values like [specific metric] fall within acceptable ranges, flagging anomalies for review.’

  • Unique and Duplicate Validation: Proposing duplicate checks, such as ‘Verify the uniqueness of primary keys or identifiers to detect duplicates and ensure record integrity.’

  • Cross-Table Validations: Recommending relational consistency checks, such as ‘Test foreign key relationships between tables to ensure data joins are valid and consistent.’

Provide a comprehensive set of test cases to validate the dataset’s integrity and ensure readiness for analysis or modeling. If additional details about the dataset’s schema or use case are needed, ask clarifying questions to refine the test cases.
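
For illustration, here is a minimal pandas sketch of the five check types above. The orders and payments tables, their column names (order_id, customer_id, amount), and the range bounds are hypothetical placeholders, not part of the prompt:

```python
import pandas as pd

# Hypothetical tables; substitute your own dataset.
orders = pd.read_csv("orders.csv")      # order_id, customer_id, amount
payments = pd.read_csv("payments.csv")  # payment_id, order_id, customer_id

# 1. Completeness: required columns must be fully populated.
required = ["order_id", "customer_id", "amount"]
missing = orders[required].isna().sum()
assert missing.eq(0).all(), f"Missing values found:\n{missing[missing > 0]}"

# 2. Consistency: customer IDs on a payment must match the linked order.
merged = payments.merge(orders, on="order_id", suffixes=("_pay", "_ord"))
mismatched = merged[merged["customer_id_pay"] != merged["customer_id_ord"]]
assert mismatched.empty, f"{len(mismatched)} payments disagree with their orders"

# 3. Range validation: flag numeric values outside acceptable bounds.
out_of_range = orders[(orders["amount"] < 0) | (orders["amount"] > 100_000)]
assert out_of_range.empty, f"{len(out_of_range)} rows have out-of-range amounts"

# 4. Uniqueness: primary keys must not be duplicated.
dupes = orders[orders["order_id"].duplicated(keep=False)]
assert dupes.empty, f"{len(dupes)} rows share a duplicate order_id"

# 5. Cross-table validation: every payment must reference an existing order.
orphans = payments[~payments["order_id"].isin(orders["order_id"])]
assert orphans.empty, f"{len(orphans)} payments reference unknown orders"
```

In practice these checks would live in a validation script or test suite, with column names and thresholds drawn from the prep work above.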

Bonus Add-On Prompts

Propose methods for automating data integrity testing using tools like Pytest or Great Expectations.
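
As one illustration of the pytest route, the same kinds of checks can be wrapped as ordinary test functions and run from the pytest command line; the fixture, file name, and bounds below are hypothetical:

```python
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def orders():
    # Hypothetical source; swap in your real loading logic.
    return pd.read_csv("orders.csv")

def test_required_columns_complete(orders):
    required = ["order_id", "customer_id", "amount"]
    assert orders[required].notna().all().all()

def test_order_ids_unique(orders):
    assert orders["order_id"].is_unique

def test_amounts_in_range(orders):
    assert orders["amount"].between(0, 100_000).all()
```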

Suggest strategies for logging and reporting validation errors for debugging.
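
One possible pattern, sketched with Python's standard logging module: register each check by name, have it return the offending rows, and log failures with a sample of row indexes instead of stopping at the first error. The check names and rules are illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("data_integrity")

def run_checks(df, checks):
    """Run each named check; log failures rather than raising."""
    failures = []
    for name, check in checks.items():
        bad = check(df)  # each check returns the offending rows
        if len(bad):
            failures.append(name)
            log.error("%s failed for %d rows, e.g. %s",
                      name, len(bad), bad.index[:5].tolist())
        else:
            log.info("%s passed", name)
    return failures

checks = {
    "order_id_unique": lambda df: df[df["order_id"].duplicated(keep=False)],
    "amount_non_negative": lambda df: df[df["amount"] < 0],
}
# failed = run_checks(orders, checks)
```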

Highlight techniques for testing the integrity of streaming or real-time datasets.
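
Whole-table checks don't apply directly to unbounded streams, so a common workaround is incremental validation with state carried across batches. Here is a minimal sketch that uses pandas chunked reading as a stand-in for a stream (the file name and rules are hypothetical):

```python
import pandas as pd

def validate_chunk(chunk, seen_ids):
    """Validate one batch of records as it arrives."""
    errors = []
    if chunk["order_id"].isna().any():
        errors.append("missing order_id values")
    # Duplicates must be tracked across batches, not just within one.
    dupes = set(chunk["order_id"]) & seen_ids
    if dupes:
        errors.append(f"{len(dupes)} order_ids repeated from earlier batches")
    seen_ids.update(chunk["order_id"].dropna())
    return errors

seen_ids = set()
for chunk in pd.read_csv("orders.csv", chunksize=10_000):
    for err in validate_chunk(chunk, seen_ids):
        print(f"batch error: {err}")
```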

Use AI responsibly by verifying its outputs, as it may occasionally generate inaccurate or incomplete information. Treat AI as a tool to support your decision-making, ensuring human oversight and professional judgment for critical or sensitive use cases.

SUGGESTIONS TO IMPROVE

  • Focus on data integrity validation for specific industries, like finance or healthcare.

  • Include tips for creating test cases for real-time or streaming data pipelines.

  • Propose ways to document test results for auditing and compliance purposes.

  • Highlight tools like dbt, Pandera, or Great Expectations for automating integrity checks (a minimal Pandera sketch follows this list).

  • Add suggestions for scaling test cases to handle large datasets efficiently.
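
To make the Pandera suggestion above concrete, here is a minimal schema sketch; the column names, dtypes, and bounds are hypothetical, and the exact import style may vary between Pandera versions:

```python
import pandera as pa

# Declarative schema: completeness, uniqueness, and range checks in one place.
orders_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True, nullable=False),
    "customer_id": pa.Column(str, nullable=False),
    "amount": pa.Column(float, pa.Check.in_range(0, 100_000)),
})

# orders_schema.validate(orders)  # raises SchemaError on violations
```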

WHEN TO USE

  • During data preprocessing to ensure data quality for analysis or modeling.

  • To validate datasets imported from external sources or integrated from multiple systems.

  • When debugging data issues to identify inconsistencies or missing values.

WHEN NOT TO USE

  • For datasets that have already undergone rigorous validation.

  • If the dataset lacks sufficient structure for integrity checks (e.g., raw unstructured text).
