OpenRefine
Clean, transform, and reconcile messy data with reversible operations.
What should journalists know about OpenRefine?
OpenRefine is the duct tape of data journalism. Messy CSV from a FOIA request full of inconsistent names, duplicate entries, and broken formatting? OpenRefine fixes it in minutes, not hours. Every operation is logged and reversible, so your data cleaning is reproducible and auditable, which matters when an editor or lawyer asks how you got from raw data to published numbers.
It began as Freebase Gridworks, built by Metaweb in 2010. Google acquired Metaweb that same year and renamed the tool Google Refine, then released it to the community as OpenRefine in 2012. Current version is 3.10.0, which added geospatial functions, new compression format support (XZ, LZMA, 7zip, ZStandard), and better error handling for Excel imports. The 3.9 series averaged 20,000 downloads per month.
The killer feature is clustering: it identifies 'John Smith', 'JOHN SMITH', and 'Smith, John' as the same entity without you writing a single regex. Reconciliation against Wikidata and OpenCorporates lets you link messy local data to canonical identifiers.
Compared to Excel, OpenRefine keeps a full operation history (Excel doesn't), handles faceting and clustering natively, and won't silently corrupt your data types. Compared to Python/pandas, it requires zero code and has a gentler learning curve, but can't match Python for automation or for datasets above ~500K rows.
ProPublica used it for their Dollars for Docs investigation. It runs entirely locally: your data never leaves your machine unless you explicitly query reconciliation services.
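To make the clustering claim above concrete, here is a minimal Python sketch of a fingerprint-style key: normalize each value, strip punctuation, then sort the unique tokens so that case and word order stop mattering. It approximates the general key-collision idea and is not OpenRefine's own code; the function names `fingerprint` and `cluster` are illustrative assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: normalize, tokenize, dedupe, sort."""
    # Fold accented characters to plain ASCII where possible ("é" -> "e").
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and drop punctuation so "Smith, John" looks like "smith john".
    cleaned = ascii_text.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Sort the unique tokens so word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with 2+ distinct spellings."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {k: members for k, members in groups.items() if len(set(members)) > 1}


names = ["John Smith", "JOHN SMITH", "Smith, John", "Jane Doe"]
clusters = cluster(names)
print(clusters)
```

In this sketch the three 'John Smith' variants share the key `john smith`, while 'Jane Doe' has no variants and is left out. A key like this can also merge people who are genuinely different, which is why a human should review each cluster before merging.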
Best for: cleaning dirty datasets from FOIA responses, government databases, or scraped data; standardizing names, addresses, and categorical data; reconciling records against Wikidata, OpenCorporates, or custom SPARQL endpoints; deduplicating entries across large spreadsheets; auditable data transformations where you need to show your work.
Not suited for: datasets above ~500K rows (performance degrades significantly); statistical analysis or modeling (use R or Python); visualization (use Datawrapper or Flourish); fully automated pipelines (Python/pandas is better for repeatable batch processing).
Security & Privacy
Privacy policy summary
No data collection. No telemetry. No network requests unless you explicitly invoke reconciliation services (Wikidata, OpenCorporates, custom endpoints) or database imports. Project data, history, and preferences are stored locally. OpenRefine developers cannot access your data.
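To illustrate what a reconciliation request can carry off your machine, the sketch below builds (but does not send) a batch query body in the general shape used by reconciliation services: a `queries` form field holding a JSON object of numbered queries. The exact fields a given service accepts may differ, and the example names and the helper functions `build_queries` and `encode_body` are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_queries(names, entity_type=None, limit=3):
    """Map each cell value to a numbered query object (q0, q1, ...)."""
    queries = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if entity_type is not None:
            # Optional type hint; valid values depend on the service.
            query["type"] = entity_type
        queries[f"q{i}"] = query
    return queries


def encode_body(queries):
    """Form-encode the batch as a single 'queries' field. Nothing is sent."""
    return urlencode({"queries": json.dumps(queries)})


# Hypothetical column values: every one of them appears in the request body.
queries = build_queries(["Acme Holdings Ltd", "Jane Q. Source"])
body = encode_body(queries)
print(json.dumps(queries, indent=2))
print(body)
```

The point of the sketch is that each cell value in the reconciled column travels in the request as plain text, so the column you choose determines what the external service sees.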
How to protect yourself:
Runs entirely on your machine — no cloud exposure. Be aware that reconciliation queries send entity names to external services (Wikidata, OpenCorporates), so don't reconcile columns containing source names or sensitive identifiers. Export your operation history JSON for reproducibility and audit trails. OpenRefine binds to localhost by default but has no built-in authentication — if you change the bind address to make it network-accessible, anyone on that network can access your instance. Keep OpenRefine updated: versions before 3.8.3 had serious vulnerabilities including remote code execution.
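As a rough sketch of how an exported operation history can support an audit trail, the Python below turns a history export into a numbered, human-readable list of steps. It assumes the export is a JSON array whose entries carry `op` and `description` fields; the sample data and the helper name `summarize_history` are hypothetical, so check them against your own export.

```python
import json

# Hypothetical excerpt of an exported history; real exports carry more fields.
EXPORTED_HISTORY = """
[
  {"op": "core/text-transform",
   "columnName": "name",
   "expression": "value.trim()",
   "description": "Text transform on cells in column name using expression value.trim()"},
  {"op": "core/mass-edit",
   "columnName": "name",
   "description": "Mass edit cells in column name"}
]
"""


def summarize_history(history_json: str):
    """Return one numbered line per operation, in the order it was applied."""
    operations = json.loads(history_json)
    return [
        f"{step}. [{op.get('op', 'unknown')}] "
        f"{op.get('description', '(no description)')}"
        for step, op in enumerate(operations, start=1)
    ]


lines = summarize_history(EXPORTED_HISTORY)
for line in lines:
    print(line)
```

A summary like this can be filed alongside the raw data and the published figures, so an editor can see each cleaning step without opening the project.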
Runs entirely locally with no cloud dependency. Open-source with transparent operation logging. Data never leaves your machine unless you use external reconciliation services. Historical CVEs are serious but all patched in 3.8.3+. The lack of authentication is a non-issue for default localhost usage but becomes a real risk if you change the bind address. Keep it updated.
Who Owns This
Known issues
Serious CVE history, all patched in recent versions. CVE-2024-47881: SQLite integration allowed remote code execution via malicious extension loading (fixed in 3.8.3). CVE-2024-23833: JDBC vulnerability let attackers read host filesystem files (fixed in 3.7.9). Pre-3.7.5 versions had unauthenticated remote code execution. Pre-3.8.3 versions lacked CSRF protection on expression preview. A Log4j vulnerability (CVE-2025-68161) was reported in 2025 with a patch request pending.
No built-in authentication: if exposed beyond localhost, anyone with network access can control the instance.
The CZI EOSS grant that funded most development ended December 2025. The project's 2025 fundraising campaign raised under $1,500 total. Long-term sustainability is an open question.
Pricing
Free
This is an editorial assessment based on publicly available information as of 2026-04-02, using our published methodology. Independent security review is pending. Security posture can change at any time. This is not a guarantee of safety.