How to use — Quick Guide
- Describe your source, record boundary, and sensitivity.
- Design your schema (field, type, required, regex/enums, examples).
- Pick optional presets (Instruction, Rules, Output, Closure).
- Set transforms, validation, and missing-data behavior.
- Click Generate Prompts → use Copy / Copy+Open / Download / Print buttons.
v2 note: meters now start at 0% when the form is empty (no false baselines).
A. Source & Context
B. Schema Designer
Types: string, number, integer, boolean, date, datetime, currency, enum, regex, array, object.
| Field Name | Type | Required | Regex (validation) | Allowed Values (enum) | Example | Notes / Transform hint | Del |
|---|---|---|---|---|---|---|---|
C. Presets (Instruction, Rules, Output, Closure)
Presets combine with your schema, validation, and transforms for a complete, robust prompt.
D. Transforms & Validation
Based on source type, boundary strategy, and delimiter pattern.
Score: 0%
Counts fields, types, required flags, and validations (regex/enums/examples).
Score: 0%
Overall Robustness (avg): 0%
Preview
ChatGPT Prompt
Gemini Prompt
Heads-up: For currency/unit conversions, models estimate — validate with your systems.
FAQ — Complex Data Extraction & Transformation
▶Who needs this generator?
- Operations & business analysts: Wrangle unstructured reports, emails, and logs into tidy tables for weekly KPIs, root-cause reviews, and vendor scorecards.
- Finance & procurement teams: Standardize invoices/POs across suppliers, convert currencies, normalize tax fields, and reconcile duplicates before import to ERP.
- E-commerce & product ops: Extract product specs from PDFs/HTML, map messy attributes to a canonical taxonomy, and generate clean CSV/JSON for catalog updates.
- Sales & marketing ops: Parse leads from emails/webhooks, validate UTM parameters, de-dupe contact lists, and enrich fields for CRM hygiene.
- Legal & compliance: Pull clauses, parties, dates, and obligations from contracts; flag missing required fields; mask PII; output audit-friendly tables.
- HR & talent teams: Normalize resumes or job posts into consistent schemas (skills, years, tools, certifications), ready for ATS ranking.
- Security & IT: Extract vulnerability IDs, severities, and patch status from advisories/tickets; output structured JSON for SIEM/CMDB.
- Research & data teams: Lift quantitative results from PDFs, lab notes, and scraped pages; convert units; validate types; export analysis-ready data.
- Healthcare & public sector (no PHI): Standardize codes, dates, and measurements from forms/reports while redacting identifiers and documenting uncertainty.
▶What can you do with this prompt?
- Extract from messy sources
- PDFs, HTML pages, emails, logs, CSV, JSON/NDJSON, mixed text.
- Record boundary strategies (per line, per paragraph, table row, or custom delimiter/regex).
- Enforce a custom schema
- Define fields, types (string/number/boolean/date/currency/enum/regex/array/object), and required flags.
- Add regex and allowed values (enums) to stop junk before it enters your systems.
- Include examples so the model understands format expectations.
- Transform & normalize
- Convert dates (e.g., ISO 8601 datetime → YYYY-MM-DD), currencies (to USD/EUR/GBP/INR/JPY), and units (kg/cm/L).
- Compute derived fields (e.g., warranty months, subtotal + tax = total).
- Map synonyms/aliases to your canonical taxonomy (e.g., "Grey", "Gray" → "Gray").
- Validate, deduplicate, and reconcile
- Drop or mark records that fail validation.
- Deduplicate by a primary key (invoice_number, SKU, ticket_id).
- Keep the last/most complete instance and note conflicts.
- Control sensitivity & compliance
- Redact or mask PII (emails, phones) unless explicitly required.
- Add an `extraction_notes` field so the model logs uncertainty or edge cases.
- Choose your output and precision
- JSON array, CSV/Markdown table, or YAML—pretty-printed if you like.
- Strict types mode to prevent coercion (e.g., "N/A" won’t sneak into a number field).
- Build repeatable, auditable flows
- Use presets (Instruction/Rules/Output/Closure) to standardize prompts across teams.
- Keep an “Input Document” for each run: source description, schema, rules, and guardrails.
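Model output still benefits from a check in your own code before import. The sketch below illustrates the schema, strict-type, and dedupe ideas from the list above. It is a minimal example under stated assumptions, not the generator's implementation: the `SCHEMA` layout, the `INV-\d{4}` pattern, and the `validate`, `completeness`, and `dedupe` helpers are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical schema: field name -> rules. The rule keys ("type", "required",
# "regex", "enum") are illustrative, not the generator's actual format.
SCHEMA = {
    "invoice_number": {"type": str, "required": True, "regex": r"INV-\d{4}"},
    "total": {"type": float, "required": True},
    "currency": {"type": str, "required": False,
                 "enum": {"USD", "EUR", "GBP", "INR", "JPY"}},
}


def validate(record, schema=SCHEMA):
    """Return a list of validation errors for one record (empty list = valid)."""
    errors = []
    for field, rules in schema.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: missing required field")
            continue
        # Strict types: no coercion, so the string "N/A" fails a float field.
        if type(value) is not rules["type"]:
            errors.append(
                f"{field}: expected {rules['type'].__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if "regex" in rules and not re.fullmatch(rules["regex"], value):
            errors.append(f"{field}: does not match {rules['regex']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} is not an allowed value")
    return errors


def completeness(record):
    """Count the fields that carry a non-null value."""
    return sum(v is not None for v in record.values())


def dedupe(records, key="invoice_number"):
    """Keep one record per key: the most complete, later records winning ties."""
    best = {}
    for rec in records:
        k = rec[key]
        if k not in best or completeness(rec) >= completeness(best[k]):
            best[k] = rec
    return list(best.values())


records = [
    {"invoice_number": "INV-0001", "total": 120.5, "currency": None},
    {"invoice_number": "INV-0001", "total": 120.5, "currency": "USD"},
    {"invoice_number": "INV-0002", "total": "N/A", "currency": "USD"},
]

valid = [r for r in records if not validate(r)]
rejected = [(r["invoice_number"], validate(r)) for r in records if validate(r)]
unique = dedupe(valid)
```

Here the third record is rejected because `"N/A"` is not a number, and the two `INV-0001` rows collapse into the one that carries a currency. Whether to drop rejected records or keep them with a note is a separate policy decision.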
▶Practical examples (copy-ready ideas)
- Invoices (Finance): Extract
invoice_number,invoice_date,supplier_name, line items,subtotal,tax_amount,total, and normalizecurrencyto USD with 2-dp rounding. - Product catalogs (E-commerce): From spec sheets/HTML, pull
sku,brand,model,dimensions_cm,weight_kg,materials[], andwarranty_months. - Leads (Marketing): From inbound emails, capture
full_name,company,email(masked or retained),interest_area(enum),utm_campaign, andcountry(ISO-2). - Contracts (Legal): Extract
party_a,party_b,effective_date,term_months,termination_clause_present(bool),governing_law(enum), and clause references. - Vulnerabilities (Security): From advisories, pull
cve_id,severity(P1–P3),affected_versions[],fix_available(bool), andrelease_date(YYYY-MM-DD).
▶Why this matters
- Consistency at scale: Every run uses the same schema and rules—no more ad-hoc copy/paste.
- Lower cleanup cost: Regex, enums, and required fields stop bad data early.
- Faster time-to-insight: Clean JSON/CSV/YAML drops directly into BI tools, CRMs, ERPs, or data pipelines.
- Safer handling: Built-in PII controls and optional confidence notes help with audits and compliance.
▶How to get the best results (quick tips)
- Start narrow: Define the few fields you truly need; expand later.
- Be explicit: Add regex/enums and examples for tricky fields.
- Set boundaries: Choose the right record strategy (per line/row/table/delimiter).
- Decide missing-data behavior: `null` vs drop vs add an `error_note`.
- Keep types strict for imports into databases/ERPs.
- Test on a small sample (use the Self-Test), then run on the full set.
- Log uncertainty: Turn on `extraction_notes` for edge cases and QA.
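The three missing-data options from the tips above behave quite differently downstream, so it can help to see them side by side. This is a minimal sketch: the `apply_missing_policy` function, its policy names, and the sample rows are assumptions made for illustration, not part of the generator.

```python
def apply_missing_policy(records, required, policy="null"):
    """Apply one missing-data policy to a list of record dicts.

    "null":       keep the record and set absent required fields to None.
    "drop":       discard any record that lacks a required field.
    "error_note": keep the record, set gaps to None, and list them
                  under an "error_note" key.
    """
    if policy not in {"null", "drop", "error_note"}:
        raise ValueError(f"unknown policy: {policy}")
    result = []
    for rec in records:
        missing = [f for f in required if rec.get(f) is None]
        if missing and policy == "drop":
            continue
        out = dict(rec)  # copy so the input records are left untouched
        for field in missing:
            out[field] = None
        if missing and policy == "error_note":
            out["error_note"] = "missing: " + ", ".join(missing)
        result.append(out)
    return result


rows = [{"sku": "A-100", "brand": "Acme"}, {"sku": "A-101"}]
required = ["sku", "brand"]

as_null = apply_missing_policy(rows, required, "null")
as_drop = apply_missing_policy(rows, required, "drop")
as_note = apply_missing_policy(rows, required, "error_note")
```

With these sample rows, `"null"` keeps both records, `"drop"` keeps only the complete one, and `"error_note"` keeps both while flagging the second for review.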
