Data Laundering | How To and Tools
Data Laundering vs Data Debt
Data you are given, data you produce, data that evolves – it will either already be in debt, or it will slide into debt.
Paying Back Data Debt
Data debt is like software debt – sooner or later, you’ll pay. Here is a small and simple data laundering backbone that isn’t just a tool, it is also a method. Data processing isn’t new, but the way it is done here is!
This data laundering suite comes with a dedicated library of cleanly separated data processors from which you pick and mix, and a small engine that pumps your data through each pipeline stage, laundering and injecting quality, sustainability and intelligence. It has a pay-as-you-go feel: you learn a little, use a little, learn some more, and use some more. Refactoring data is here, not just for data professionals, but for anyone who needs to get data things done.
At its heart, data editing, binding and transformation is a cost-effective way of paying back data debt. You can clean, categorize and add sparkle to your data with a hassle-free chain of editors that you network together to form editing routes.
This page is about data processing, enterprise integration, data cleansing, data editing, and data filtration.
The Data Editing Pipeline
You want and need
- to edit (transform) data using a number of filters
- a reusable library of common data cleansing filters
- pre-prepared editing routes that work straight up with little or no config
- automatic binding to common data types like names, gender, phone numbers, e-mail addresses, dates and times, Boolean data (yes/no, true/false), addresses, country codes, geo locators like latitude and longitude, zip codes, postcodes, passport numbers, NI numbers, driving license numbers, car registrations, web URLs, local file URLs, config data like database URLs, Twitter handles, Facebook accounts and so on
- to route the data being processed to the next stage
- to make local routing decisions depending on the data at hand
- to separate editing concerns from routing concerns
Pipes and Filters Pattern
If you want the above then you want the enterprise integration pattern called pipes and filters.
The Hello World for Pipes and Filters
You don’t need to write routing and queue management software. You just need to drop in a JAR (or add a Maven dependency) and then get on with writing domain-specific software.
The “Hello World” for the framework requires you to do two things.
- add an implements DataParcel clause to the data bundle classes
- create a class extending Editor, and override the editParcel() method
The default router will do for now – it just links together your editors and passes the parcel down the line. That’s it!
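The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the page does not show the framework's real signatures, so DataParcel and Editor are declared here as local stand-ins, and CustomerRecord, the two editors and DefaultRouter are hypothetical names invented for the example.

```java
import java.util.List;
import java.util.Locale;

// Stand-in for the framework's DataParcel interface (shape assumed).
interface DataParcel { }

// Stand-in for the framework's Editor base class (signature assumed).
abstract class Editor {
    abstract void editParcel(DataParcel parcel);
}

// Step 1: the data bundle class gains an "implements DataParcel" clause.
class CustomerRecord implements DataParcel {
    String name;

    CustomerRecord(String name) {
        this.name = name;
    }
}

// Step 2: extend Editor and override editParcel().
class TrimNameEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        CustomerRecord record = (CustomerRecord) parcel;
        record.name = record.name.trim();
    }
}

class UpperCaseNameEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        CustomerRecord record = (CustomerRecord) parcel;
        record.name = record.name.toUpperCase(Locale.ROOT);
    }
}

// Hypothetical default router: links the editors together and
// passes the parcel down the line, one editor after another.
class DefaultRouter {
    private final List<Editor> editors;

    DefaultRouter(List<Editor> editors) {
        this.editors = editors;
    }

    void send(DataParcel parcel) {
        for (Editor editor : editors) {
            editor.editParcel(parcel);
        }
    }
}

class HelloPipeline {
    public static void main(String[] args) {
        CustomerRecord record = new CustomerRecord("  ada lovelace ");
        new DefaultRouter(List.of(new TrimNameEditor(), new UpperCaseNameEditor())).send(record);
        System.out.println(record.name); // prints "ADA LOVELACE"
    }
}
```

Note that neither editor knows which editor comes next; that knowledge lives only in the router, which is the separation of editing and routing concerns the pattern is after.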
Who Decides the Route
Patterns like MVC (Model View Controller) give power to a controller that examines requests and decides on routing. Here there are two types of class that can make routing decisions.
Who Decides the Route – You or Your Car’s GPS
When you use GPS to travel, who decides the actual route taken by the car: you or your car’s GPS system? The answer is both.
The GPS decides the “big picture” route: which motorway to take and which type of roads to use. But in an area you know very well, you override the GPS. You know this road is quicker, so you take it.
Also, if you discover half-way through that the journey should be aborted, you turn the car round and tell the GPS to take you back.
This data pipeline processing package has the concept of
- a global route, decided by an EditingRoute
- local routes, decided by the Editor
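A minimal sketch of how the two levels might fit together is given here. The page names EditingRoute and Editor but not their methods, so everything else is an assumption made for illustration: the LocalDecision enum, the decide() hook, the travel() method and the three phone-number editors are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

interface DataParcel { }

// Hypothetical local decisions an editor can hand back to the route.
enum LocalDecision { CONTINUE, SKIP_NEXT, ABORT }

abstract class Editor {
    abstract void editParcel(DataParcel parcel);

    // Hypothetical hook: by default the editor follows the global route.
    LocalDecision decide(DataParcel parcel) {
        return LocalDecision.CONTINUE;
    }
}

class PhoneParcel implements DataParcel {
    String text;
    boolean plausible;

    PhoneParcel(String text) {
        this.text = text;
    }
}

class TrimEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        PhoneParcel phone = (PhoneParcel) parcel;
        phone.text = phone.text.trim();
    }

    // Local knowledge: abort on empty input, take the shortcut on clean input.
    @Override
    LocalDecision decide(DataParcel parcel) {
        String text = ((PhoneParcel) parcel).text;
        if (text.isEmpty()) return LocalDecision.ABORT;
        if (text.matches("\\d+")) return LocalDecision.SKIP_NEXT;
        return LocalDecision.CONTINUE;
    }
}

class DigitsOnlyEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        PhoneParcel phone = (PhoneParcel) parcel;
        phone.text = phone.text.replaceAll("\\D", "");
    }
}

class LengthCheckEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        PhoneParcel phone = (PhoneParcel) parcel;
        phone.plausible = phone.text.length() >= 7;
    }
}

// The global route: an ordered list of stops, like the GPS plan.
class EditingRoute {
    private final List<Editor> stops;

    EditingRoute(List<Editor> stops) {
        this.stops = stops;
    }

    // Returns the names of the editors actually visited.
    List<String> travel(DataParcel parcel) {
        List<String> visited = new ArrayList<>();
        int i = 0;
        while (i < stops.size()) {
            Editor editor = stops.get(i);
            editor.editParcel(parcel);
            visited.add(editor.getClass().getSimpleName());
            LocalDecision decision = editor.decide(parcel);
            if (decision == LocalDecision.ABORT) break;
            i += (decision == LocalDecision.SKIP_NEXT) ? 2 : 1;
        }
        return visited;
    }
}
```

With a route of TrimEditor, DigitsOnlyEditor and LengthCheckEditor, a parcel holding " 020 7946 0018 " visits all three stops, a parcel that is already digits only skips DigitsOnlyEditor, and a blank parcel is aborted after TrimEditor.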
Editing Detours for Pre and Post Processing
If an editor looks at the parcel and decides that it would be better for another editor to clean it first – it can temporarily place a detour and send the data to the “pre-processing” editor.
By placing a “Return to Sender” address on the parcel, the editor tells the other editor to return the processed parcel to where it came from.
Local post-processing allows an editor to send a parcel to one (and even two or more) editors before forwarding it on to the “globally planned” next editor in line.
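Pre- and post-processing detours could look something like the sketch below. It is an illustration only: the preDetour() and postDetour() hooks, DetourRouter and the name editors are hypothetical, and the “Return to Sender” address is modelled in the simplest possible way, by control returning to the sending editor's slot in the loop once the detour editor has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

interface DataParcel { }

abstract class Editor {
    abstract void editParcel(DataParcel parcel);

    // Hypothetical hooks: editors to detour through before and after this one.
    List<Editor> preDetour(DataParcel parcel) { return List.of(); }
    List<Editor> postDetour(DataParcel parcel) { return List.of(); }
}

class NameParcel implements DataParcel {
    String raw;
    String first = "";
    String last = "";

    NameParcel(String raw) {
        this.raw = raw;
    }
}

class TrimEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        NameParcel name = (NameParcel) parcel;
        name.raw = name.raw.trim();
    }
}

class CapitaliseEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        NameParcel name = (NameParcel) parcel;
        name.first = capitalise(name.first);
        name.last = capitalise(name.last);
    }

    private static String capitalise(String word) {
        if (word.isEmpty()) return word;
        return word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
    }
}

class SplitNameEditor extends Editor {
    // Local decision: untrimmed input is sent to TrimEditor first.
    @Override
    List<Editor> preDetour(DataParcel parcel) {
        String raw = ((NameParcel) parcel).raw;
        if (raw.equals(raw.trim())) return List.of();
        return List.of(new TrimEditor());
    }

    @Override
    void editParcel(DataParcel parcel) {
        NameParcel name = (NameParcel) parcel;
        int space = name.raw.indexOf(' ');
        if (space < 0) {
            name.first = name.raw;
            return;
        }
        name.first = name.raw.substring(0, space);
        name.last = name.raw.substring(space + 1).trim();
    }

    // Local post-processing before the globally planned next editor.
    @Override
    List<Editor> postDetour(DataParcel parcel) {
        return List.of(new CapitaliseEditor());
    }
}

class DetourRouter {
    private final List<Editor> globalRoute;

    DetourRouter(List<Editor> globalRoute) {
        this.globalRoute = globalRoute;
    }

    // Returns the names of the editors visited, detours included.
    List<String> send(DataParcel parcel) {
        List<String> visited = new ArrayList<>();
        for (Editor editor : globalRoute) {
            for (Editor pre : editor.preDetour(parcel)) visit(pre, parcel, visited);
            visit(editor, parcel, visited); // the parcel is back with the sender
            for (Editor post : editor.postDetour(parcel)) visit(post, parcel, visited);
        }
        return visited;
    }

    private static void visit(Editor editor, DataParcel parcel, List<String> visited) {
        editor.editParcel(parcel);
        visited.add(editor.getClass().getSimpleName());
    }
}
```

The global route here contains only SplitNameEditor; TrimEditor and CapitaliseEditor are reached purely through that editor's local decisions.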
What Goes Up – Must Come Down
This type of editing route is like a “there and back” journey. When editors work on data they may want to examine that data again on its way back. So a pipeline of editors may opt for the global there and back route pattern.
Loopback Routing – Return to Sender
Suppose you want to trim data, then split the data parcel into a collection of parcels. The new parcels need to be trimmed again, so the loopback route lets each parcel in the new chain be sent back for trimming.
Then the data parcel is sent back down to the editor before. This is a global routing pattern. The data parcel goes there and back so that each editor stop gets a chance to change things before and after the changes made by all the later (down the line) editors.
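The there and back pattern can be sketched as an outbound pass followed by a pass in reverse order. The names below are assumptions made for illustration: editOnReturn() is a hypothetical hook for the return visit, ThereAndBackRoute is a hypothetical route class, and the sketch assumes the parcel turns round at the last editor without visiting it twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

interface DataParcel { }

abstract class Editor {
    // Outbound visit.
    abstract void editParcel(DataParcel parcel);

    // Hypothetical hook for the return visit; does nothing by default.
    void editOnReturn(DataParcel parcel) { }
}

class TextParcel implements DataParcel {
    String text;
    boolean changedDownstream;

    TextParcel(String text) {
        this.text = text;
    }
}

// Remembers the text on the way out and compares it on the way back.
class SnapshotEditor extends Editor {
    private String before;

    @Override
    void editParcel(DataParcel parcel) {
        before = ((TextParcel) parcel).text;
    }

    @Override
    void editOnReturn(DataParcel parcel) {
        TextParcel text = (TextParcel) parcel;
        text.changedDownstream = !text.text.equals(before);
    }
}

class TrimEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        TextParcel text = (TextParcel) parcel;
        text.text = text.text.trim();
    }
}

class UpperCaseEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        TextParcel text = (TextParcel) parcel;
        text.text = text.text.toUpperCase(Locale.ROOT);
    }
}

class ThereAndBackRoute {
    private final List<Editor> stops;

    ThereAndBackRoute(List<Editor> stops) {
        this.stops = stops;
    }

    // Returns a log of every visit, in the order it happened.
    List<String> travel(DataParcel parcel) {
        List<String> log = new ArrayList<>();
        // There: every stop edits the parcel in route order.
        for (Editor stop : stops) {
            stop.editParcel(parcel);
            log.add(stop.getClass().getSimpleName() + ":out");
        }
        // Back: every earlier stop sees the parcel again, in reverse order.
        for (int i = stops.size() - 2; i >= 0; i--) {
            Editor stop = stops.get(i);
            stop.editOnReturn(parcel);
            log.add(stop.getClass().getSimpleName() + ":back");
        }
        return log;
    }
}
```

In this sketch SnapshotEditor is the first stop, so on the return leg it sees the combined effect of every editor further down the line.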
Loopback Routing – Pass Go – Collect £200
When the data parcel has completed its editing and transformation journey, a decision must be made: should we stop here or send it for another round of editing? That is where the Monopoly “Pass Go and Collect £200” pattern steps in. This critical decision is encompassed by the Go Or No Go editor.
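One way to picture the Go Or No Go decision is a loop around the whole route, as in the sketch below. The details are assumptions: the page names the Go Or No Go editor but not its API, so the goAgain field, LoopbackRoute, the maxRounds safety cap and CollapsePairsEditor are hypothetical.

```java
import java.util.List;

interface DataParcel { }

abstract class Editor {
    abstract void editParcel(DataParcel parcel);
}

class TextParcel implements DataParcel {
    String text;

    TextParcel(String text) {
        this.text = text;
    }
}

// Collapses each pair of spaces into one space per pass,
// so a long run of spaces needs more than one round.
class CollapsePairsEditor extends Editor {
    @Override
    void editParcel(DataParcel parcel) {
        TextParcel text = (TextParcel) parcel;
        text.text = text.text.replace("  ", " ");
    }
}

// Hypothetical Go Or No Go editor: it changes nothing in the parcel,
// it only records whether another round is needed.
class GoOrNoGoEditor extends Editor {
    boolean goAgain;

    @Override
    void editParcel(DataParcel parcel) {
        goAgain = ((TextParcel) parcel).text.contains("  ");
    }
}

class LoopbackRoute {
    private final List<Editor> editors;
    private final GoOrNoGoEditor gate;
    private final int maxRounds; // safety cap so a bad rule cannot loop forever

    LoopbackRoute(List<Editor> editors, GoOrNoGoEditor gate, int maxRounds) {
        this.editors = editors;
        this.gate = gate;
        this.maxRounds = maxRounds;
    }

    // Returns the number of rounds the parcel travelled.
    int travel(DataParcel parcel) {
        int rounds = 0;
        do {
            for (Editor editor : editors) {
                editor.editParcel(parcel);
            }
            gate.editParcel(parcel); // pass Go: decide on another round
            rounds++;
        } while (gate.goAgain && rounds < maxRounds);
        return rounds;
    }
}
```

A parcel whose text has four spaces between two words needs two rounds here: the first round leaves two spaces, the gate says go, and the second round leaves one.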
Preventing Software Complexity
It is obvious what the data processing pipeline does, but the benefits may not be immediately apparent. The key benefit is the ability to scale the system: you can add editors and routes without your software becoming complex, riddled with hacks, duplication and nested conditionals and, worst of all, impossible to reuse.
Without some help from this lightweight routing and editing structure your software package will contain many concerns that should be separated. These are
- the routing concern
- the data editing and transformation concern
- the routing hacks for anomalies
- the decision of when to stop
- concurrency management interventions
- specific domain concerns
- exception handling interventions
All this binds your domain, the routing aspects, the data editing and transformation aspects and other headaches in one. You will not be able to reuse the routing code, or the editing code in another context because it will all be bound together.
Pipes and Filters Summary
When you need a transformational chain that can act, say, on delimited text files from different sources, without getting bogged down in routing and pipeline code, the pipes and filters pattern and this implementation should be considered.