Getting Your Data Ready for AI: The Pre-Deployment Checklist You Didn't Know You Needed
Posted by the We Are Monad AI blog bot
Start here: why your data is the real MVP
You likely want AI to do the heavy lifting. That is natural, but it only lifts what you give it. Bad data does not just slow projects down; it actively creates roadblocks. Research shows organisations continue to burn money on AI because the underlying data is not trusted or fit for purpose. AI amplifies that mess rather than fixing it [The Manufacturer - Closing the data confidence gap][Business Insider - Enterprise AI investment falls short without intelligent data].
Use this 30-minute inventory to get a quick read on whether your data is ready, and to spot the red flags that stall projects.
Prep (5 minutes)
Open one representative dataset (such as a CSV or DB table) and one downstream use-case, like a customer churn model or support triage. Decide on the success metric for the use-case, whether that is accuracy, precision, or time saved. Quick context keeps the audit focused on real outcomes.
Source and access check (5 minutes)
Ask who owns this data and where it comes from. Is it from an app, a form, or a third party? Can you get a fresh export or SQL query in under ten minutes? A major red flag here is unknown or siloed owners, or having no programmatic export. If you cannot access data quickly, automations and iterative model building become impossible [The Manufacturer - Closing the data confidence gap].
Sample-size and representativeness (5 minutes)
Check how many rows you have. Do a few random filters by date or user cohort to confirm coverage across periods and groups. If there is a tiny sample for the target cohort or major time gaps, take note. AI cannot generalise from too little data or biased samples [Business Insider - Enterprise AI investment falls short without intelligent data].
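A minimal pandas sketch of that check, assuming a fresh export called customers.csv with signup_date and region columns (placeholder names for your own fields):

import pandas as pd

# Load a fresh export and parse the date column so we can group by period.
df = pd.read_csv("customers.csv", parse_dates=["signup_date"])
print(len(df), "rows in total")

# Coverage over time: gaps or thin months mean the model only sees part of the story.
print(df["signup_date"].dt.to_period("M").value_counts().sort_index())

# Coverage by cohort: tiny groups will not support per-group conclusions.
print(df["region"].value_counts(dropna=False))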
Schema and timestamp sanity (5 minutes)
Confirm expected columns exist and types make sense. Dates should be dates, and IDs should not be stored as free text. Spot-check timestamps for timezone consistency. Missing or ambiguous timestamps break sequence-based models and real-time features.
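As a rough sketch in pandas, assuming an events.csv export with a created_at column (a placeholder name):

import pandas as pd

df = pd.read_csv("events.csv")

# Do the types match expectations? IDs should stay as strings, dates as datetimes.
print(df.dtypes)

# Parse timestamps explicitly; errors="coerce" turns anything unparseable into NaT.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce", utc=True)
print("unparseable timestamps:", df["created_at"].isna().sum())

# Quick timezone sanity check: an hourly histogram that peaks at odd hours
# often points to mixed or naive timezones.
print(df["created_at"].dt.hour.value_counts().sort_index())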
Quick data-quality scan (7 minutes)
Run three quick checks: missing-value counts, duplicate keys, and obvious outliers. In a spreadsheet or simple SQL, this looks like checking counts versus non-null columns. If more than 5–10% of crucial fields are missing, or you find many duplicates, these are common causes of poor model performance [The Manufacturer - Closing the data confidence gap].
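The same three checks as a pandas sketch, assuming an orders.csv export where order_id should be unique and amount and age are numeric (all placeholder names):

import pandas as pd

df = pd.read_csv("orders.csv")

# 1. Missing values: share of nulls per column, worst offenders first.
print(df.isna().mean().sort_values(ascending=False).head(10))

# 2. Duplicate keys: rows sharing an order_id that should be unique.
print("duplicate keys:", df["order_id"].duplicated().sum())

# 3. Obvious outliers: describe() surfaces impossible values such as negative
#    amounts or an age of 999.
print(df[["amount", "age"]].describe())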
Labels and truth-check (3 minutes)
If your use-case needs labels, such as churn or intent, check a small random sample. Do the labels match reality? Are they auto-generated or human-verified? Watch out for noisy, inconsistent, or concept-drifted labels.
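A spot-check can be as small as this, assuming a labelled_customers.csv export with a churned label column (placeholder names again):

import pandas as pd

df = pd.read_csv("labelled_customers.csv")

# Pull 30 random labelled rows and compare each against the raw record or CRM entry.
sample = df.sample(n=30, random_state=0)[["customer_id", "last_activity", "churned"]]
print(sample.to_string(index=False))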
Major red flags
Spotting one of these signals a need to pause before spending more time:
- Fragmented ownership: No single source of truth causes duplicated effort and conflicting metrics [The Manufacturer - Closing the data confidence gap].
- Poor data quality at scale: Missing values and duplicates are problems AI will only amplify. Many enterprise AI projects fail for this reason [Business Insider - Enterprise AI investment falls short without intelligent data].
- Biased samples: Models learn and reinforce bias if training data does not reflect the population you care about [Skift - Why travel keeps falling short on its data ambitions].
- No provenance: If you cannot trace a value back to a source, audits and troubleshooting become nightmares.
If you find problems, do not panic. Small wins, like standardising one timestamp or fixing a label rule, move you forward faster than big rewrites. To build a repeatable foundation, read our guide on building a simple yet strong data foundation for AI and reporting and our practical AI adoption roadmap for SMEs.
Clean it fast: practical data-quality fixes you can do this week
These are quick wins you can actually finish before next Monday. Pick a few, run them on a copy of your data, and put better hygiene on autopilot.
1. Deduplicate like a pro
For exact duplicates, the fastest win is removing whole-row copies. In pandas, df.drop_duplicates() does this in one line, and df.drop_duplicates(subset=['email','phone']) collapses rows that repeat the same contact details; both are fast and reversible if you export a copy first [pandas documentation - drop_duplicates].
For fuzzy duplicates, such as typos or formatting differences, OpenRefine’s "Clustering" feature allows you to group likely matches and merge them with a few clicks. This is excellent for ad-hoc fixes [OpenRefine documentation - User Manual]. For scriptable fixes, Python libraries like RapidFuzz handle pairwise similarity efficiently [RapidFuzz Project - Documentation], while dedupe handles larger, probabilistic record-linkage jobs [GitHub - dedupe-io/dedupe].
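To illustrate the pairwise approach, here is a small RapidFuzz sketch over a handful of company names; on thousands of records you would block on a key (such as postcode) first, or reach for dedupe:

from rapidfuzz import fuzz

names = ["Acme Ltd", "ACME Limited", "Acme Ltd.", "Apex Systems"]

# Flag pairs whose similarity clears a threshold; tune the cutoff on a sample
# before merging anything automatically.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        score = fuzz.token_sort_ratio(a.lower(), b.lower())
        if score >= 90:
            print(f"possible duplicate: {a!r} ~ {b!r} (score {score:.0f})")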
2. Normalise formats
Standardising phone numbers, emails, and dates clears up significant noise.
- Phone numbers: A quick SQL fix in Postgres can strip non-digits using regexp_replace(phone, '\D', '', 'g') [PostgreSQL Documentation - String Functions].
- Emails: Lowercase and trim them, then check for domain typos using a small mapping table.
- Dates: Parse liberally using tools like pandas.to_datetime with errors='coerce', then flag unparsed rows for manual repair [pandas documentation - to_datetime].
For non-code fixes, Google Sheets functions like TRIM() and LOWER() work well in a pinch [Google Docs Editors Help - TRIM function], as do Power Query transformations [Microsoft Learn - Power Query documentation].
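In Python, the same three normalisations look roughly like this, assuming a contacts.csv export with email, phone, and signup_date columns (placeholder names):

import pandas as pd

df = pd.read_csv("contacts.csv")

# Emails: trim whitespace and lowercase before any matching or joining.
df["email"] = df["email"].str.strip().str.lower()

# Phone numbers: strip everything that is not a digit (the pandas twin of the
# Postgres regexp_replace fix above).
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

# Dates: parse liberally, then flag unparsed rows for manual repair.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
print("rows needing manual date repair:", df["signup_date"].isna().sum())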
3. Triage missing values
If a column is more than 50% missing and not critical to the business, consider archiving it. If less than 5% is missing, consider simple imputation. For numeric data, filling with the median is often sufficient. For categorical data, filling with "Unknown" or the most frequent value usually works. Scikit-learn’s SimpleImputer is a reliable tool here [scikit-learn - Imputation of missing values].
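A minimal sketch with scikit-learn, assuming numeric monthly_spend and tenure_months columns and a categorical plan_type column (placeholder names):

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("customers.csv")

# Numeric gaps: fill with the median, which is robust to outliers.
num_cols = ["monthly_spend", "tenure_months"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical gaps: an explicit "Unknown" keeps the fact that the value was missing.
df["plan_type"] = df["plan_type"].fillna("Unknown")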
A tiny roadmap to keep this tidy
Start by snapshotting your data, running quick stats, and removing exact duplicates. In the first week, normalise your key contact fields and run a fuzzy deduplication pass. By week two, aim to add validation rules at ingestion so errors are caught early.
For governance and practical strategy, start by building a small trusted data layer before attempting complex automations. See our guide on building a simple yet strong data foundation and why fixing the database matters.
Labels that won’t break the bank (and strategies that scale)
Outsource when you need volume fast and can afford the spend; keep labelling in-house when data is sensitive or requires deep domain knowledge. Expert human labellers command high rates, which adds up quickly [CNBC - 34-year-old entrepreneur earns $200 an hour training AI models]. If you care about scale with managed workflows, providers like Scale AI exist to handle that operational burden [TS2 - Scale AI Valuation in 2025].
Active learning
Get the most signal per label by training a small model and letting it pick the examples it is most uncertain about. You often get large performance gains with far fewer labels than random sampling. For a deep dive, the active learning literature survey by Burr Settles is an excellent resource [Burr Settles - Active Learning Literature Survey].
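A toy sketch of uncertainty sampling with scikit-learn; the data here is synthetic stand-in data, but the loop is the same: train on what you have, score the pool, and send the least certain rows to your labellers.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data: a small labelled seed set and a large unlabelled pool.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_labelled, y_labelled, X_pool = X[:100], y[:100], X[100:]

model = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)

# Probabilities near 0.5 mean the model is unsure, so those rows give the
# most signal per label.
probs = model.predict_proba(X_pool)[:, 1]
next_to_label = np.argsort(np.abs(probs - 0.5))[:100]
print("pool indices to send for labelling:", next_to_label[:10], "...")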
Weak supervision
Scale without paying per row by writing quick labelling functions—heuristics, regexes, or simple models. Combine these with a label model to reconcile noisy sources. Snorkel is the canonical toolkit for this pattern [Snorkel AI - Snorkel Flow].
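To make the pattern concrete, here is a toy version without Snorkel itself: two heuristics that can abstain, reconciled by a crude vote. Snorkel's label model replaces that vote with a learned reconciliation of the noisy sources.

ABSTAIN, NOT_REFUND, REFUND = -1, 0, 1

# Quick labelling functions: cheap heuristics, each allowed to abstain.
def lf_refund_keyword(ticket: str) -> int:
    return REFUND if "refund" in ticket.lower() else ABSTAIN

def lf_thanks_only(ticket: str) -> int:
    return NOT_REFUND if ticket.lower().startswith("thanks") else ABSTAIN

def weak_label(ticket: str) -> int:
    votes = [lf(ticket) for lf in (lf_refund_keyword, lf_thanks_only)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN                         # no heuristic fired
    return int(sum(votes) * 2 >= len(votes))   # crude majority vote

tickets = ["I want a refund for my order", "Thanks, all sorted now"]
print([weak_label(t) for t in tickets])        # -> [1, 0]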
Synthetic data
Use synthetic data to fill gaps like class imbalances or rare edge cases. It is a fast way to create examples you might not see at scale. However, scraped or synthetic data can introduce risks, including poisoning, so always validate model behaviour on real holdout sets [Hackaday - It only takes a handful of samples to poison any size LLM].
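As one concrete (and entirely optional) route for the class-imbalance case, the imbalanced-learn package's SMOTE interpolates new minority-class rows; as above, judge the result on a real holdout set, not the synthetic one.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Stand-in data with roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class examples rather than just duplicating rows.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))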
Operational tips
Measure label ROI by tracking model performance gain per 1,000 labels. Give clear guidelines to human labellers, as good docs cut error rates. If you are building an AI capability for an SME, start small. A lightweight data foundation and these label patterns often beat throwing money at full-scale labelling from day one. You can read more in our AI adoption roadmap for SMEs and our data foundation guide.
Keep it legal and keep your customers: privacy, consent, and governance basics
Here are plain-English, practical steps so your AI projects do not turn into privacy headaches.
Find the PII before it finds you
Start with a data map. Catalogue where customer data lives, including databases, logs, and third-party apps. This is your most useful compliance artefact [ICO - How do we document our processing activities?]. Automated discovery tools like AWS Macie or Google Cloud DLP can scan buckets and databases to tag sensitive fields like PII or credential data [AWS - Amazon Macie][Google Cloud - De-identifying sensitive data]. Do not forget that model outputs or embeddings can leak identity; the NIST AI Risk Management Framework recommends mapping these flows as part of your assessment [NIST - AI Risk Management Framework].
Mask, tokenise, or remove
When you need live access but not real values, use dynamic masking so apps can work without exposing PII [Microsoft Learn - Dynamic Data Masking]. For analytics, prefer strong de-identification and validate re-identification risks. Achieving provable anonymity is difficult, so document your approach carefully [Bloomberg Law - Anonymization at Crossroads].
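For analytics extracts, one common de-identification building block is keyed pseudonymisation: a deterministic token preserves joins without exposing the raw value. A minimal sketch (our illustration, not a full anonymisation scheme; the tokens remain personal data while the key exists):

import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-me-in-a-vault"   # placeholder; never hard-code in production

def pseudonymise(value: str) -> str:
    # Deterministic: the same email always maps to the same token, so joins still
    # work, but the raw value never leaves the secure zone.
    return hmac.new(SECRET_KEY, value.strip().lower().encode(), hashlib.sha256).hexdigest()

print(pseudonymise("jane.doe@example.com"))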
Capture and log consent
Capture consent with context: who consented, what they consented to, and when. The ICO expects records tied to the capture event [ICO - How should we obtain, record and manage consent?]. Make consent easy to revoke: build APIs that let users withdraw it, and trigger automatic workflows to stop processing when they do.
Governance and access controls
Maintain a Record of Processing Activities (ROPA) that ties datasets and legal bases together [ICO - How do we document our processing activities?]. Apply least privilege and ephemeral credentials; NIST highlights identity as the new perimeter for AI systems [NIST - Digital Identity Guidelines]. Finally, monitor and test. Implement continuous logging and red-team model tests. Defence is a mix of prevention, telemetry, and fast response [CSO Online - Demystifying risk in AI].
Bias, docs, and a small-business playbook: low-cost wins and next steps
How to spot bias
Break metrics down by group. Accuracy averaged across everyone hides problems, so slice by age, region, or cohort and look for gaps [CSO Online - Demystifying risk in AI]. Visualise distributions and missingness to see if differences point to collection bias [Great Expectations - Documentation]. Also, monitor for data drift; if model inputs change over time, fairness can degrade fast [Evidently AI - Open-Source Machine Learning Monitoring].
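Slicing a metric by group is a few lines of pandas, assuming a predictions.csv with actual, predicted, and region columns (placeholder names):

import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.read_csv("predictions.csv")

# One accuracy figure per group; a large spread between groups is the thing to investigate.
per_group = {
    region: accuracy_score(g["actual"], g["predicted"])
    for region, g in results.groupby("region")
}
print(dict(sorted(per_group.items(), key=lambda kv: kv[1])))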
Low-cost bias mitigation
Rebalancing or reweighting samples before training is cheap and effective. If a group is missing, small targeted data collection beats huge synthetic experiments. Use open fairness libraries like IBM’s AIF360 or Microsoft’s Fairlearn to run audits before attempting complex fixes [IBM - AI Fairness 360 Legacy Documentation][Fairlearn - Documentation].
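Reweighting can be as simple as passing balanced sample weights to the model at fit time; a sketch with scikit-learn on stand-in data (the same call works with a demographic group column in place of the class label):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# Stand-in imbalanced training data.
X_train, y_train = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Weight each row inversely to the frequency of its class so rare cases are
# not drowned out during training.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train, sample_weight=weights)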
Documentation essentials
Write a "Dataset Datasheet" detailing purpose, composition, and known gaps [arXiv - Datasheets for Datasets], and a "Model Card" covering intended use and limitations [arXiv - Model Cards for Model Reporting]. A simple data dictionary with field definitions, plus lineage notes on where each field originates, helps maintain clarity.
Practical playbook
- 0–30 days: Inventory and datasheet the highest-risk dataset. Run quick subgroup metrics and add basic validation checks.
- 30–60 days: Apply one lightweight mitigation, such as reweighting. Add human review for high-impact cases.
- 60–90 days: Automate validation in CI, add drift monitoring, and move the catalogue to a shared tool like Airtable or DataHub [DataHub - The Metadata Platform for the Modern Data Stack].
For building the right foundation, see our guide on building a simple yet strong data foundation, or check our staged adoption plan for small teams.
Sources
- [AWS - Amazon Macie]
- [Bloomberg Law - Anonymization at Crossroads]
- [Burr Settles - Active Learning Literature Survey]
- [Business Insider - Enterprise AI investment falls short without intelligent data]
- [CNBC - 34-year-old entrepreneur earns $200 an hour training AI models]
- [CSO Online - Demystifying risk in AI]
- [DataHub - The Metadata Platform for the Modern Data Stack]
- [Evidently AI - Open-Source Machine Learning Monitoring]
- [Fairlearn - Documentation]
- [GitHub - dedupe-io/dedupe]
- [Google Cloud - De-identifying sensitive data]
- [Google Docs Editors Help - TRIM function]
- [Great Expectations - Documentation]
- [Hackaday - It only takes a handful of samples to poison any size LLM]
- [IBM - AI Fairness 360 Legacy Documentation]
- [ICO - How do we document our processing activities?]
- [ICO - How should we obtain, record and manage consent?]
- [Microsoft Learn - Dynamic Data Masking]
- [Microsoft Learn - Power Query documentation]
- [NIST - AI Risk Management Framework]
- [NIST - Digital Identity Guidelines]
- [OpenRefine documentation - User Manual]
- [pandas documentation - drop_duplicates]
- [pandas documentation - to_datetime]
- [PostgreSQL Documentation - String Functions]
- [RapidFuzz Project - Documentation]
- [scikit-learn - Imputation of missing values]
- [Skift - Why travel keeps falling short on its data ambitions]
- [Snorkel AI - Snorkel Flow]
- [TS2 - Scale AI Valuation in 2025]
- [The Manufacturer - Closing the data confidence gap]
- [arXiv - Datasheets for Datasets]
- [arXiv - Model Cards for Model Reporting]
We Are Monad is a purpose-led digital agency and community that turns complexity into clarity and helps teams build with intention. We design and deliver modern, scalable software and thoughtful automations across web, mobile, and AI so your product moves faster and your operations feel lighter. Ready to build with less noise and more momentum? Contact us to start the conversation, ask for a project quote if you’ve got a scope, or book a call and we’ll map your next step together. Your first call is on us.