
Meet Darcy.
Your AI-native data analyst.

Drop in messy source files, tell Darcy what you need, and get clean, production-ready CSV and XLSX outputs back on your machine. For complex operations, Darcy writes and runs Python scripts automatically — saving them to scripts/ for later reuse. No pipelines. No dashboards. Just clean data, fast.

15+ categories
Capability coverage
100% local
Nothing pushed without asking
File in, file out
CSV & XLSX outputs
Claude Code
Extendable to Cursor & Copilot

Data flows one way. You stay in control.

Source files are never touched. Work happens in a staging workspace. Outputs are only confirmed after your review.

📁
Step 1
data/
Source files live here. Never modified. Organized by project subfolder.
⚙️
Step 2
workspace/
Files are copied here for active processing. Staging area for intermediate work.
👁️
Step 3
Review
You inspect the workspace output before anything is elevated.
Step 4
outputs/
Confirmed, reviewed files only. Every output tagged with source version and date.
Drop in sources

Add files to data/

Put raw CSVs, exports, or code lists under data/{project}/. They stay untouched as the permanent source of truth.

Give a plain request

Describe the prep goal

Say what you need — clean this, dedupe that, build a master list, map these codes. No forms, no config files. For complex operations, Darcy checks scripts/ for an existing script first — tweaking or forking it if relevant, writing from scratch only if nothing fits.

Inspect workspace

Review before confirming

All work lands in workspace/ first. A sanity check report is saved automatically so you know exactly what changed.

Approve outputs

Elevate to outputs/

Once you're happy, outputs are saved with a run timestamp so anyone can trace exactly when and from what data they were produced.

Everything you need for data prep

The full capability list lives in capabilities/capabilities.csv. Here's what's covered.

🧹
Source Cleaning
Fix encodings, normalize case, remove duplicates, clean diacritics, handle missing values, and standardize inconsistent labels across files.
cleaning · file_ops
🔗
Schema Alignment
Rename, reorder, split, and combine columns. Align multiple related sources to a common structure for clean downstream joins.
columns · combine
🏷️
Entity Resolution
Fuzzy matching, canonicalization, alias mapping, record linkage, deduplication across sources, and unique ID assignment with provenance tracking.
entity_resolution
📋
Master Lists
Build canonical customer, supplier, product, language, or location master lists from multiple related sources with all source IDs preserved.
combine · output
🗺️
Code & Alias Mapping
Build mapping tables, alias lists, and link tables for codes, identifiers, and entity names. State all match rules explicitly.
entity_resolution · text
🔍
Data Quality Review
Null checks, uniqueness counts, range validation, referential integrity, consistency checks, and outlier detection — saved as QA output files.
validation · filtering
📅
Date & Time Prep
Parse mixed date formats, standardize to ISO, align timezones, extract date parts, fill missing intervals, and sort by time.
datetime
📊
Viz-ready Output
Reshape wide-to-long, enforce one row per element, align granularity, standardize labels, pre-aggregate for charts, and compute percentages and rankings.
viz_prep · transform
🔤
Text Processing
Tokenization, fuzzy string matching, regex pattern extraction, special character removal, and encoding artifact repair.
text · cleaning
Need something not listed?
Darcy's capabilities are driven by capabilities/capabilities.csv. Add a new row and Darcy will pick it up on the next job — no code required.
See folder structure →
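For instance, a new row might look like this. The header and the example capability are illustrative assumptions; match the columns the real capabilities/capabilities.csv already uses:

```csv
capability,tags,description
code_enrichment,"entity_resolution,text","Join canonical language codes onto name columns"
```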

Every project can bring its own rules.

Generic capabilities apply to any dataset. But each project can define its own prep instructions — naming conventions, code standards, match rules, source notes — stored in context/instructions/.

⚙️
Generic capabilities
Cleaning, alignment, entity resolution, validation, and all other prep operations live in capabilities/capabilities.csv. These apply to every project and never change per-project.
capabilities/
📌
Project instructions
Each project adds its own prep rules as a Markdown file under context/instructions/. These define domain-specific logic — which codes are canonical, how entities should be matched, what standards apply.
context/instructions/
📁
Your project
Add a Markdown file under context/instructions/ for any project. Define which codes are canonical, how entities match, and what standards apply. The agent reads it before every job in that domain.
context/instructions/{project}.md
capabilities/capabilities.csv + standards.md
+
context/instructions/{project}.md (one file per project; add as many projects as needed)
=
The agent reads both layers before every job — generic capabilities tell it what it can do, project instructions tell it how to do it for this domain.
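As an illustration, a hypothetical context/instructions/customers.md might look like this (every rule below is invented for the example, not shipped with Darcy):

```markdown
# Customers: prep rules

- Canonical ID is `cust_id` from the CRM export; legacy `CUSTNO` values map via the alias table.
- Match entities on exact email first, then fuzzy name match; flag anything below threshold for review.
- Source dates arrive as DD/MM/YYYY; standardize to ISO 8601 on output.
```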

Files, not documents.

Every job produces a usable file saved to a typed output folder. A sanity check report is always saved alongside it.

Request | Output | Saved to
Clean one source | Cleaned CSV or XLSX | outputs/cleaned/
Build a master list | Canonical linked file | outputs/master-lists/
Create code or alias mappings | Mapping table CSV | outputs/mappings/
Review unresolved records | QA file or review sheet | outputs/qa/
Any job | Dropped records file | outputs/discarded/
Any job | Sanity check report | outputs/reports/

Always know which data produced which output.

When a source file changes, the old copy is archived automatically. Every output is stamped with the exact run timestamp so the lineage is traceable at a glance.

1

Change detected automatically

When a source file arrives, Darcy compares it against the existing copy using a file hash. Unchanged files are copied as-is — no version bump. Changed files trigger the archive step.
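Hash-based change detection can be sketched like this. This is a minimal illustration of the idea, not Darcy's actual implementation; the function names are made up for the example:

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash a file in fixed-size chunks so large sources never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(incoming: Path, existing: Path) -> bool:
    """True if the incoming file differs from the existing copy, or no copy exists yet."""
    if not existing.exists():
        return True
    return file_sha256(incoming) != file_sha256(existing)
```

An unchanged file compares equal and is copied as-is; any byte-level difference triggers the archive step.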

2

Prior copy archived with timestamp

The existing copy is renamed with a version number and the import timestamp: customers_v1_2026-04-15_143022.csv. The new file takes the clean name: customers.csv — always the current version.

3

Version log updated

A _versions.md file in the same folder records each archived version with its timestamp and a change note.

4

Outputs stamped with run timestamp

Every output file carries the exact timestamp of the run that produced it: customers-master_2026-04-15_143022.csv. Each run is distinct and traceable.
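The naming scheme for archived copies and stamped outputs can be sketched as two small helpers. The function names are illustrative; only the filename patterns come from the examples above:

```python
from datetime import datetime
from pathlib import Path

STAMP_FORMAT = "%Y-%m-%d_%H%M%S"  # e.g. 2026-04-15_143022

def archive_name(path: Path, version: int, when: datetime) -> str:
    """Name for an archived prior copy: customers.csv -> customers_v1_2026-04-15_143022.csv."""
    return f"{path.stem}_v{version}_{when.strftime(STAMP_FORMAT)}{path.suffix}"

def stamped_output_name(base: str, when: datetime, ext: str = ".csv") -> str:
    """Output name carrying the run timestamp: customers-master_2026-04-15_143022.csv."""
    return f"{base}_{when.strftime(STAMP_FORMAT)}{ext}"
```

The current version always keeps the clean name; only archives and outputs carry timestamps, so lineage is readable at a glance.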

Reuse before you rebuild.

Before writing any new script, Darcy inspects scripts/ and picks the right action based on fit.

↻ Tweak

Same domain, small change

Different source column, new filter, adjusted output field — Darcy edits the existing script in place and states exactly what changed and why.

⎇ Fork

Same domain, different job

Different grain, matching logic, or output shape — Darcy copies the script under a new descriptive name and adapts it. The original stays untouched.

✦ New

Nothing relevant exists

No existing script fits the job. Darcy writes one from scratch, saves it to scripts/, and it becomes available for reuse on future jobs.

Simple and intentional.

Generic capabilities live at the root. Project-specific context and instructions are scoped inside context/.

your-repo/
├── capabilities/           ← generic, reusable across any project
│   └── capabilities.csv
├── context/                ← project-specific only
│   └── instructions/       ← prep rules for this project
├── data/                   ← source files, never modified
│   └── {project}/
│       └── source_file.csv
├── workspace/              ← Darcy copies data/ files here automatically
│   ├── working/            ← intermediate outputs, inspectable before confirmation
│   └── reference/          ← source lookups; prior versions archived here with timestamp
│       ├── source_file.csv
│       ├── source_file_v1_2026-04-15_143022.csv
│       └── _versions.md
├── outputs/                ← confirmed, reviewed files only
│   ├── cleaned/            ← created when a cleaning job runs
│   ├── master-lists/       ← created when a master list job runs
│   ├── mappings/           ← created when a mapping job runs
│   ├── qa/                 ← created when a QA job runs
│   ├── discarded/          ← dropped records, every job, never silent
│   └── reports/            ← sanity check report, every job
├── scripts/                ← scripts reused, tweaked, or forked before new ones are written
├── standards.md            ← naming and code standards
├── CLAUDE.md               ← agent instructions
└── preferences.json        ← behaviour toggles
        
Folder | What goes in it
data/ | Original source files, organised by project subfolder. Never modified directly.
workspace/working/ | Intermediate outputs written during processing — inspectable before anything is elevated to outputs/.
workspace/reference/ | Source files copied from data/ and used as lookups to inform the job — not being cleaned themselves.
scripts/ | Processing scripts written by Darcy. Before writing a new one, Darcy inspects this folder and tweaks, forks, or writes from scratch depending on fit. Controlled by commitScripts in preferences.
context/instructions/ | One Markdown file per project defining prep rules, match logic, and standards for that domain.
capabilities/ | The master list of supported prep operations. Generic and reusable across all projects.
outputs/ | Confirmed, reviewed files only. Subfolders are created by the agent when a job runs.
outputs/discarded/ | Records dropped during every job — no identifier, failed validation, or unresolvable conflict. Never silently lost.
standards.md | Cross-project defaults for naming, codes, entity matching, and date formats.
preferences.json | Behavioural toggles — controls confirmations, commits, sanity checks, and language.
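A one-off script to scaffold this layout could look like the sketch below. The folder list is taken from the tree above; output subfolders are deliberately left out, since the agent creates those when a job runs:

```python
from pathlib import Path

# Top-level layout from the folder tree; outputs/ subfolders are created per job.
FOLDERS = [
    "capabilities", "context/instructions", "data",
    "workspace/working", "workspace/reference",
    "outputs", "scripts",
]

def scaffold(root: str = ".") -> None:
    """Create the folder layout; existing folders are left untouched (idempotent)."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)
```

Running scaffold() twice is safe: exist_ok=True means existing folders and their contents are never disturbed.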

All behaviour is configurable.

Every toggle lives in preferences.json. Darcy reads it at the start of every session. Change a setting and the behaviour changes — no code required.

Setting | Default | What it does
commitOutputs | false | Allow files in outputs/ to be committed to git.
commitWorkspace | false | Allow files in workspace/ to be committed to git.
commitContext | false | Allow files in context/ to be committed to git.
commitScripts | false | Allow files in scripts/ to be committed to git.
pushAfterCommit | false | Push to remote automatically after every commit.
confirmBeforeSave | true | Ask before writing any output file.
confirmBeforeCommit | true | Ask before committing.
confirmBeforeGenerate | true | Ask for confirmation when the request is ambiguous.
runEvidenceCheck | true | Inspect data/, workspace/, and context files before starting any job.
includeSanityCheck | true | Write a sanity check report to outputs/reports/ after every job.
updateSourceRegistry | true | Offer to update standards.md when new source rules appear.
language | "en-US" | Writing language for outputs. Supports "en-US" and "en-GB".
When a commit* setting is changed to true, Darcy updates .gitignore automatically to unignore that folder. data/ is always ignored regardless of any setting.
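Assuming flat keys named exactly as in the table, a preferences.json with every default would read:

```json
{
  "commitOutputs": false,
  "commitWorkspace": false,
  "commitContext": false,
  "commitScripts": false,
  "pushAfterCommit": false,
  "confirmBeforeSave": true,
  "confirmBeforeCommit": true,
  "confirmBeforeGenerate": true,
  "runEvidenceCheck": true,
  "includeSanityCheck": true,
  "updateSourceRegistry": true,
  "language": "en-US"
}
```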

Built for Claude Code.

The instruction file is CLAUDE.md — Claude Code picks it up automatically. It can be extended to Cursor or GitHub Copilot by copying the contents into that tool's instruction file.

Claude Code

CLAUDE.md

🖱️

Cursor

.cursor/rules/agent-da.mdc

🐙

GitHub Copilot

.github/copilot-instructions.md