

In this technical blog post, I want to talk about how to set up simple and flexible ETL-based anonymization. Why? Well, I recently had the opportunity to do a small proof of concept for a customer. The customer wanted to know which options were available to take internal data, remove or anonymize any personally identifiable information (PII), and make the result available to external parties in a simple form. After further requirements gathering, the context for this proof of concept was defined as:
- Whatever the solution, it needs to be able to extract data from an on-premises Oracle database.
- The end result should be a set of CSV files in an Amazon S3 bucket.
- In between ingesting the Oracle data and dumping it in CSV form on S3, there should be something that removes/anonymizes PII data.
- If possible, the chosen solution should be cloud native.
In this 3-part blog series I’ll explain how to set up simple and flexible ETL-based anonymization, covering the following subjects:
- The research into products that might be used to solve the problem, and how suitable they are for what the proof of concept needs to achieve.
- How the chosen product can be used to create an ETL pipeline that fits the requirements. Additionally, how to set up a local Oracle database in Docker that can be used as a data source for the data ingestion part of the proof of concept (just because this was such a PITA to do).
- And whether this can be done in a cloud native way.
Research
The research part of the proof of concept consists of 2 parts:
- How to extract data from an Oracle database, anonymize it somehow and store it as a bunch of CSV files in an S3 bucket, aka the ETL part.
- Figuring out the best way to accomplish the anonymization.
Extracting, transforming and storing the data
Right off the bat, the customer’s problem sounded remarkably like something you might solve with an ETL product: Extract, Transform, Load. So the research for this part of the proof of concept concentrated on this type of product. I also got some input from someone in my team to have a look at singer.io, as that was something they had used successfully in the past for this kind of problem.

When looking at the Singer homepage, there are a number of things that immediately catch your eye:
- Singer powers data extraction and consolidation for all of the tools of your organization.
- The open-source standard for writing scripts that move data.
- Unix-inspired: Singer taps and targets are simple applications composed with pipes.
- JSON-based: Singer applications communicate with JSON, making them easy to work with and implement in any programming language.
So when getting down to the basics, Singer is just a specification, albeit not an official one. It’s a simple JSON-based data format and you can either produce something in this format (a tap in Singer terminology) or consume the format (a target). You’re able to chain these taps and targets together to extract data from one location and store it in another. Out of the box, Singer already comes with a bunch of taps (100+) and targets (10), all written in Python. Because the central point of the system is just a data format, it’s pretty easy to write one yourself or adapt an existing one.
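To make that a bit more concrete, here is a minimal, hypothetical tap written from scratch (the stream name and fields are made up for illustration). All it does is emit a SCHEMA message followed by RECORD messages as JSON lines on stdout, which is essentially what the Singer specification boils down to:

```python
# Hypothetical minimal Singer tap: emits one SCHEMA message and some RECORD
# messages as JSON lines on stdout. Stream and field names are made up.
import json
import sys

STREAM = "customers"

schema_message = {
    "type": "SCHEMA",
    "stream": STREAM,
    "schema": {
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string"},
        }
    },
    "key_properties": ["id"],
}
sys.stdout.write(json.dumps(schema_message) + "\n")

# In a real tap these records would come from a database or an API.
for record in [{"id": 1, "email": "jane.doe@example.com"}]:
    record_message = {"type": "RECORD", "stream": STREAM, "record": record}
    sys.stdout.write(json.dumps(record_message) + "\n")
```

A target is simply the mirror image: a program that reads these JSON lines from stdin and writes the records somewhere.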
When checking out the taps, the default Oracle tap should cover the Extract part of our proof of concept. The same however doesn’t seem to be the case for the Load part when looking at the default targets. There is a CSV target, but it stores its results locally, not in an S3 bucket. There is the option of just using this target and doing the S3 upload ourselves after the ETL pipeline has finished. Another option would be to adapt the existing CSV target and change the file storage to S3. Some quick Googling turns up a community-made S3 CSV Singer target. According to its documentation, this target should do exactly what we want.
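If we were to go for the first option, keeping the plain CSV target and uploading afterwards, that extra step could be as small as the following sketch. It assumes the boto3 package, AWS credentials configured in the environment, and a hypothetical bucket name and output directory:

```python
# Hypothetical post-pipeline step: upload the CSV files produced by the CSV
# target to an S3 bucket. Bucket name, prefix and output directory are placeholders.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
output_dir = Path("./csv-output")         # wherever the CSV target wrote its files
bucket = "example-anonymized-exports"     # hypothetical bucket name

for csv_file in output_dir.glob("*.csv"):
    s3.upload_file(str(csv_file), bucket, f"exports/{csv_file.name}")
```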
Whoops, Singer doesn't transform
With the Extract and Load parts covered, this leaves us with just the Transform part of the ETL pipeline to figure out… and this is where it gets a bit weird. Even though Singer is classified as an ETL tool, it doesn’t seem to have support for the transformation part?

Looking further into this, I came across this ominously titled post: Why our ETL tool doesn’t do transformations. Reading this, it seems they consider their JSON specification/data format as the transformation part. So they support transformation to raw data and storing it, but don’t support other kinds of transformations. That part is up to you after the data has been stored somewhere by a Singer target. So it turns out that Singer is more like the EL part of an ELT product than an “old school” ETL product.
At this point, Singer should at least be sufficient to extract the data from an Oracle database and to put it in an S3 bucket in CSV format. And because Singer is pretty simple, open and extendable, I’m going to leave it at that for now. Let’s continue by looking into the anonymization options that might fit in this Singer context.
Data anonymization
Similarly to the ETL part, I also received some input for this part, pointing me to Microsoft Presidio.

When looking at the homepage we can read the following:
- It provides fast identification and anonymization modules for private entities in text and images such as credit card numbers, names and more.
- It facilitates both fully automated and semi-automated PII de-identification flows on multiple platforms.
- Customizability in PII identification and anonymization.
So there’s a lot of promising stuff in there that could help me solve my anonymization needs. Upon further investigation it looks like I’m evaluating this product during a major transformation (get it? 😉) from V1 to V2. V1 incorporated some ETL-like stuff like retrieving data from sources (even though Oracle support in the roadmap never seems to have materialized) and storing anonymized results in a number of forms/locations. However, V2 has completely dropped this approach to concentrate purely on the detection and replacement of PII data.
At its core, Presidio V2 is a Python-based system built on top of an AI model. This enables it to automatically discover PII data in text and images and to replace it according to the rules you define. I did some testing using their online testing tool and it kind of works, but for our specific context it definitely needs tweaking. Also, when looking at the provided test data, it seems to consist mostly of simple, short values, with no large text blobs or images. This raises the question: even if we’re able to configure Presidio to do what we want it to do, might we be hitting small nails with a big hammer?
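For reference, this is roughly what Presidio V2 looks like from Python, assuming the presidio-analyzer and presidio-anonymizer packages (plus a spaCy language model) are installed; the sample text is obviously made up:

```python
# Rough sketch of Presidio V2 usage: analyze a piece of text for PII entities
# and replace them with placeholders using the default anonymization operator.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Please contact Jane Doe at jane.doe@example.com"

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

anonymizer = AnonymizerEngine()
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized.text)  # something like "Please contact <PERSON> at <EMAIL_ADDRESS>"
```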

Is Presidio too much?
So let’s rethink this. If we can easily know and define which simple columns in which tables need to be anonymized, and when just nulling or hashing the column values is sufficient, we don’t need the auto-detection part of Presidio. We also wouldn’t need Presidio’s full text or image support, nor its fancier substitution options. Presidio could still be a powerful library to build an automatic anonymization transformation step for our Singer-based pipeline, and it helps that Presidio is Python-based. However, my gut feeling says I should first try to find a slightly simpler solution.
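To give a feeling for how little such a simple transformation step would actually need, here is a rough, hypothetical sketch of a stdin/stdout filter that could sit between a Singer tap and target, nulling or hashing a hard-coded set of fields (a real version would of course read its rules from a config file):

```python
# Hypothetical Singer-style filter: read messages on stdin, null or hash the
# configured fields of RECORD messages, and pass everything else through.
import hashlib
import json
import sys

RULES = {"email": "HASH", "ssn": "SET-NULL"}  # hard-coded for illustration

for line in sys.stdin:
    message = json.loads(line)
    if message.get("type") == "RECORD":
        record = message["record"]
        for field, rule in RULES.items():
            if record.get(field) is None:
                continue
            if rule == "SET-NULL":
                record[field] = None
            elif rule == "HASH":
                record[field] = hashlib.sha256(str(record[field]).encode()).hexdigest()
    sys.stdout.write(json.dumps(message) + "\n")
```

Of course, rather than maintaining something like this ourselves, an existing tool would be preferable.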
I started searching for something that can do a simple PII replace and that works in a Singer tap/target context. I found this GitHub repository: pipelinewise-transform-field. The documentation reads “Transformation component between Singer taps and targets”. Sounds suspiciously like the “T” part that Singer as an ETL tool was missing! Further down, in the configuration section, we even read:
“You need to define which columns have to be transformed by which method and in which condition the transformation needs to be applied.”
and the possible transformation types are:
- SET-NULL: Transforms any input to NULL
- HASH: Transforms string input to hash
- HASH-SKIP-FIRST-n: Transforms string input to hash skipping first n characters, e.g. HASH-SKIP-FIRST-2
- MASK-DATE: Replaces the months and day parts of date columns to be always 1st of Jan
- MASK-NUMBER: Transforms any numeric value to zero
- MASK-HIDDEN: Transforms any string to ‘hidden’
This seems to cover our simple anonymization requirements completely! We can even see how we need to use it in the context of Singer:
some-singer-tap | transform-field --config [config.json] | some-singer-target
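Based on the project’s documentation, the config.json would then look something along these lines (stream and column names are made up for illustration):

```json
{
  "transformations": [
    {
      "tap_stream_name": "customers",
      "field_id": "email",
      "type": "HASH"
    },
    {
      "tap_stream_name": "customers",
      "field_id": "date_of_birth",
      "type": "MASK-DATE"
    }
  ]
}
```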
Conclusion
We now have all the pieces of the puzzle on how to set up simple and flexible ETL-based anonymization. In the next blog post, we’ll show how they fit together and whether they produce the results the customer is looking for.
