

Data transformation and generating data from other data are common tasks in software development. Different programming languages offer different ways to achieve this, each with their own strengths and weaknesses. Depending on the problem, some are more suitable than others. In this blog post, you will find simple but powerful methods for generating and transforming data in Python.
Before we discuss a more complex case, let’s start with a basic example. Imagine that we own a few stores and each store has its own database with items added by employees. Some fields are optional, which means employees do not always fill out everything. As we grow, it might become difficult to get a clear view of all the items in our stores. Therefore, we develop a Python script that takes the different items from our stores’ databases and collects them in a single unified database.
from typing import Generator

from stores import store_1, store_2, store_3  # `Item` is also provided by this package

# Type hints are used throughout the code.
items_1: Generator[Item, None, None] = store_1.get_items()
items_2: Generator[Item, None, None] = store_2.get_items()
items_3: Generator[Item, None, None] = store_3.get_items()
Generators
store_1.get_items() returns a generator of items. Generators play a key role in this blog post. With them, we can set up a complex chain of transformations over massive amounts of data without running out of memory, while keeping our code concise and clean. If you are not yet familiar with them, this is what a generator looks like:
def a_generator(some_iterable):
    for something in some_iterable:
        # do logic
        yield something
Two things are important here. First, calling a generator function does not run its body or return any data; it returns an iterator. Second, values are produced on demand, one at a time. A more in-depth explanation can be found in the Python documentation.
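A minimal sketch makes both points concrete (the function name is illustrative):

```python
def count_up_to(n):
    print("starting")   # runs only once iteration actually begins
    for i in range(n):
        yield i

gen = count_up_to(3)    # nothing printed yet: we only got an iterator
values = list(gen)      # now "starting" prints and the values are produced
```

A generator is also exhausted after one pass: a second `list(gen)` would return an empty list.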
Syntax
There are two ways to create generators. The first looks like a normal Python function, but has a yield statement instead of a return statement. The other is more concise but can quickly become convoluted as the logic gets more complex. It’s called the Python generator expression syntax and is mainly used for simpler generators.
# Basic generator syntax
def generate_until(n: int) -> Generator[int, None, None]:
    i = 0
    while i < n:
        yield i
        i += 1

# Generator expression syntax
gen_until_5: Generator[int, None, None] = (i for i in range(5))
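However a generator is written, it is consumed the same way; a quick sketch:

```python
gen_until_5 = (i for i in range(5))

first = next(gen_until_5)   # values come one at a time
rest = list(gen_until_5)    # the remaining values
empty = list(gen_until_5)   # a generator can only be consumed once
```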
Code
To keep it simple, we run the script once at the end of the day, leaving us with a single database containing all the items from all the stores.
from typing import Generator

from stores import store_1, store_2, store_3
from database import all_items

# Type hints are used throughout the code.
items_1: Generator[Item, None, None] = store_1.get_items()
items_2: Generator[Item, None, None] = store_2.get_items()
items_3: Generator[Item, None, None] = store_3.get_items()

# Let's assume our `add_or_update()` function accepts generators.
# If an Item already exists, it is updated; otherwise it is added to the database.
# We can simply process the stores one by one.
all_items.add_or_update(items_1)
all_items.add_or_update(items_2)
all_items.add_or_update(items_3)
# The database now contains all the latest items from all the stores.
For this use case, this is perfectly fine. But when the complexity grows and more stores are added, it can quickly become cluttered. Fortunately, Python has great built-in tools to simplify our code.
Itertools
One of those tools is the standard-library module itertools. According to the Python docs, "the module standardizes a core set of fast, memory-efficient tools that are useful by themselves or in combination. Together, they form an 'iterator algebra', making it possible to construct specialized tools succinctly and efficiently in pure Python."
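As a small taste of that "iterator algebra", the sketch below combines `count()` (an infinite stream) with `islice()` (a lazy slice); at no point is the full sequence held in memory:

```python
from itertools import count, islice

evens = (n for n in count() if n % 2 == 0)  # infinite stream of even numbers
first_five = list(islice(evens, 5))         # take only what we need
```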
A great function is itertools.chain(), which 'chains' together multiple iterables as if they were one. We can use it to chain our generators together.
from typing import Generator, Iterator

from stores import store_1, store_2, store_3
from database import all_items
from itertools import chain

# Type hints are used throughout the code.
items_1: Generator[Item, None, None] = store_1.get_items()
items_2: Generator[Item, None, None] = store_2.get_items()
items_3: Generator[Item, None, None] = store_3.get_items()

# Using itertools.chain we can combine the generators into one.
# chain() is itself lazy, so no data will be generated yet.
# (It returns an iterator rather than a generator, hence the broader type.)
items: Iterator[Item] = chain(items_1, items_2, items_3)
all_items.add_or_update(items)  # <- data will be generated here
# The database now contains all the latest items from all the stores.
Generator functions
Now let’s assume that each item is a tuple with five fields: name, brand, supplier, cost, and the number of pieces in the store. It has the following signature: tuple[str, str, str, int, int]. If we want the total value of the items in the store, we simply multiply the number of pieces by the cost.
# Both receives and returns a generator.
def calc_total_val(items: Generator) -> Generator:
    for item in items:
        # Yield the first 3 fields and the product of the last 2.
        yield *item[:3], item[3] * item[4]

# Since it is so simple, we can also write it as a generator expression.
items = ((*item[:3], item[3] * item[4]) for item in items)
Now each item looks like this: tuple[str, str, str, int]. But we want to output it as JSON. For that, we can create a generator that yields dictionaries and call json.dumps() on them. Let's assume that we can pass an iterator of dicts to the add_or_update() function and that it calls json.dumps() automatically.
# Both receives and returns a generator.
def as_item_dict(items: Generator) -> Generator:
    for item in items:
        yield {
            "name": item[0],
            "brand": item[1],
            "supplier": item[2],
            "total_value": item[3],
        }
Now that we have more logic, let's see how to put it together. One great thing about generators is how clear and concise they are to use. We can create a function for each processing step and run the data through it.
from stores import store_1, store_2, store_3
from database import all_items
from itertools import chain

def calc_total_val(items):
    for item in items:
        yield *item[:3], item[3] * item[4]

def as_item_dict(items):
    for item in items:
        yield {
            "name": item[0],
            "brand": item[1],
            "supplier": item[2],
            "total_value": item[3],
        }

items_1 = store_1.get_items()
items_2 = store_2.get_items()
items_3 = store_3.get_items()

items = chain(items_1, items_2, items_3)  # <- make one big iterable
items = calc_total_val(items)             # <- calc the total value
items = as_item_dict(items)               # <- transform it into a dict
all_items.add_or_update(items)            # <- data will be generated here
# The database now contains all the latest items from all the stores.
To show the steps we have taken, I split everything up. There are still some things that could be improved. Take the function calc_total_val(): it is a perfect example of a situation where a generator expression can be used.
from stores import store_1, store_2, store_3
from database import all_items
from itertools import chain

def as_item_dict(items):
    for item in items:
        yield {
            "name": item[0],
            "brand": item[1],
            "supplier": item[2],
            "total_value": item[3],
        }

items_1 = store_1.get_items()
items_2 = store_2.get_items()
items_3 = store_3.get_items()

items = chain(items_1, items_2, items_3)
items = ((*item[:3], item[3] * item[4]) for item in items)
items = as_item_dict(items)
all_items.add_or_update(items)
To make it even cleaner, we can put all of our functions into a separate module. In this way, our main file only contains the steps the data goes through. If we use descriptive names for our generators, we can immediately see what the code will do. So now we have created a pipeline for the data. While this is only a simple example, it can also be used for more complicated workflows.
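As a sketch of that idea (the helper module, function names, and sample data below are all hypothetical), the main file could shrink to a handful of descriptively named steps:

```python
# transformers.py (hypothetical helper module)
from itertools import chain

def merge_stores(*stores):
    return chain(*stores)

def with_total_value(items):
    return ((*item[:3], item[3] * item[4]) for item in items)

def as_item_dicts(items):
    for name, brand, supplier, total_value in items:
        yield {"name": name, "brand": brand,
               "supplier": supplier, "total_value": total_value}

# main.py: the pipeline reads like a recipe
items = merge_stores(
    [("pen", "Acme", "Supplier A", 2, 10)],
    [("mug", "Acme", "Supplier B", 5, 4)],
)
items = with_total_value(items)
items = as_item_dicts(items)
result = list(items)  # nothing runs until this point
```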
Data products

Everything we did in the example above can be easily applied to a data product. If you are not familiar with data products, the literature on data meshes is a great place to start.
Imagine that we have a data product that does some data aggregation. It has multiple inputs with different kinds of data, and each input needs to be filtered, transformed, and cleaned before we can aggregate everything into one output. The client requires the output to be a single JSON file stored in an S3 bucket, and the existing infrastructure only allows 500 MB of RAM per container. Now let's naively load all the data, do some transformations, aggregate everything, and serialize it into a JSON file.
from typing import Generator

from input_ports import port_1, port_2
from output_ports import S3_port
from json import dumps

data_port_1: Generator = port_1.get_data()
data_port_2: Generator = port_2.get_data()

output = []
for row in data_port_1:
    # do some transformation or filtering here
    output.append(row)
for row in data_port_2:
    # do some transformation or filtering here
    output.append(row)

S3_port.save(dumps(output))
While this looks like a solution that does the job and is easy to understand, our container suddenly crashes with an OutOfMemory error. After some local testing on our own machine, we see that the script produces an 834 MB file, which can never fit in a container with only 500 MB of RAM. The problem with the code above is that we first collect everything in a list, so all of the data is held in memory at once.
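The difference is easy to verify on your own machine: a generator object occupies a small, constant amount of memory regardless of how much data it will eventually produce, while a list grows with its contents.

```python
import sys

as_list = [i for i in range(1_000_000)]  # holds a million ints at once
as_gen = (i for i in range(1_000_000))   # holds only its own iteration state

print(sys.getsizeof(as_list))  # several megabytes
print(sys.getsizeof(as_gen))   # on the order of a hundred bytes
```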
Solution
Let’s give it another try. For S3, we can use a multipart upload, which means we do not need to keep the entire file in memory. And of course, we should replace our lists with generators.
from typing import Generator

from input_ports import port_1, port_2
from output_ports import S3_port
from itertools import chain
from json import dumps

data_port_1: Generator = port_1.get_data()
data_port_2: Generator = port_2.get_data()

def port_1_transformer(data: Generator):
    for row in data:
        # do some transformation or filtering here
        yield row

def port_2_transformer(data: Generator):
    for row in data:
        # do some transformation or filtering here
        yield row

output = chain(port_1_transformer(data_port_1), port_2_transformer(data_port_2))

for part in output:
    S3_port.save_part(dumps(part))
Since we now hold only one item in memory at a time, this uses dramatically less memory than the earlier solution, with almost no extra work. However, sending an upload request to S3 for each individual item might be a bit much, especially if we have 300,000 items. But there is another issue ...

S3 requires each part of a multipart upload to be between 5 MiB and 5 GiB (except the last part). To fix this, we can group multiple items into a chunk before uploading. But if we make the chunks too big, we will once again hit the memory limit, so the right chunk size depends on how large the individual items are. To demonstrate, let's use a chunk size of 1,000. The larger the chunk, the more memory is used but the fewer requests go to S3, so we want our chunks to be as large as possible without running out of memory.
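The batching idiom used below is worth seeing in isolation. This sketch adds an explicit `iter()` call so it also behaves correctly for plain sequences, not just generators:

```python
from itertools import chain, islice

def makebatch(iterable, size):
    iterator = iter(iterable)  # one shared iterator, advanced by every batch
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

batches = [list(batch) for batch in makebatch(range(7), 3)]
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Note that each yielded batch must be fully consumed before advancing to the next one, which an upload loop naturally does.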
from typing import Generator

from input_ports import port_1, port_2
from output_ports import S3_port
from itertools import chain, islice
from json import dumps

data_port_1: Generator = port_1.get_data()
data_port_2: Generator = port_2.get_data()

def makebatch(iterable, size):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def port_1_transformer(data: Generator):
    for row in data:
        # do some transformation or filtering here
        yield row

def port_2_transformer(data: Generator):
    for row in data:
        # do some transformation or filtering here
        yield row

output = chain(port_1_transformer(data_port_1), port_2_transformer(data_port_2))

for chunk in makebatch(output, 1000):
    # Materialize one chunk at a time; json.dumps cannot serialize a generator.
    S3_port.save_part(dumps(list(chunk)))
This is all that needs to happen. It is enough to transform vast amounts of data and save them in an S3 bucket, even when resources are scarce.
Bonus
When your calculations are compute-intensive, running them in parallel is easy. With just a few extra lines, we can run our transformers on multiple cores.
from multiprocessing.pool import Pool

# imap applies the function to each element, so the transformers
# must now accept and return a single row instead of a whole generator.
def port_1_transformer(row):
    # do some transformation or filtering here
    return row

def port_2_transformer(row):
    # do some transformation or filtering here
    return row

with Pool(4) as pool:
    # imap_unordered could also be used if the order is not important
    data_1 = pool.imap(port_1_transformer, data_port_1, chunksize=500)
    data_2 = pool.imap(port_2_transformer, data_port_2, chunksize=500)
    output = chain(data_1, data_2)
    # consume `output` here, before the pool is closed
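To see what `imap` gives us in isolation, here is a sketch using the thread-based `multiprocessing.dummy.Pool`, which shares the `Pool` API but is safe to run anywhere, including interactive sessions. Note that `imap` applies the function to one element at a time, so it expects a per-row function:

```python
from multiprocessing.dummy import Pool  # thread-backed, same interface as Pool

def double_row(row):
    return row * 2  # stand-in for a real per-row transformation

with Pool(4) as pool:
    results = pool.imap(double_row, range(5))
    doubled = list(results)  # imap yields results lazily, in input order
```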
The best part about this? imap returns an iterator that we can consume just like any other generator, so the downstream code barely changes. One caveat: imap calls the given function once per row, so the transformers must accept and return a single row rather than a whole generator, and the results must be consumed before the pool is closed. Now let's throw it all together. This is all we need for compute-intensive transformations over large amounts of data, using multiple cores.
from typing import Generator

from input_ports import port_1, port_2
from output_ports import S3_port
from itertools import chain, islice
from json import dumps
from multiprocessing.pool import Pool

data_port_1: Generator = port_1.get_data()
data_port_2: Generator = port_2.get_data()

def makebatch(iterable, size):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

# imap applies these once per row.
def port_1_transformer(row):
    # do some transformation or filtering here
    return row

def port_2_transformer(row):
    # do some transformation or filtering here
    return row

with Pool(4) as pool:
    # imap_unordered could also be used if the order is not important
    data_1 = pool.imap(port_1_transformer, data_port_1, chunksize=500)
    data_2 = pool.imap(port_2_transformer, data_port_2, chunksize=500)
    output = chain(data_1, data_2)

    # Upload inside the with-block, while the pool is still alive.
    for chunk in makebatch(output, 1000):
        S3_port.save_part(dumps(list(chunk)))
Conclusion
Generators are often misunderstood by new developers, but they can be an excellent tool. Whether for a simple transformation or something more advanced such as a data product, Python is a great choice because of its ease of use and the abundance of tools available within the standard library.


What others have also read


In the ever-evolving landscape of data management, investing in platforms and navigating migrations between them is a recurring theme in many data strategies. How can we ensure that these investments remain relevant and can evolve over time, avoiding endless migration projects? The answer lies in embracing ‘Composability’ - a key principle for designing robust, future-proof data (mesh) platforms. Is there a silver bullet we can buy off-the-shelf? The data-solution market is flooded with data vendor tools positioning themselves as the platform for everything, as the all-in-one silver bullet. It's important to know that there is no silver bullet. While opting for a single off-the-shelf platform might seem like a quick and easy solution at first, it can lead to problems down the line. These monolithic off-the-shelf platforms often end up inflexible to support all use cases, not customizable enough, and eventually become outdated.This results in big complicated migration projects to the next silver bullet platform, and organizations ending up with multiple all-in-one platforms, causing disruptions in day-to-day operations and hindering overall progress. Flexibility is key to your data mesh platform architecture A complete data platform must address numerous aspects: data storage, query engines, security, data access, discovery, observability, governance, developer experience, automation, a marketplace, data quality, etc. Some vendors claim their all-in-one data solution can tackle all of these. However, typically such a platform excels in certain aspects, but falls short in others. For example, a platform might offer a high-end query engine, but lack depth in features of the data marketplace included in their solution. To future-proof your platform, it must incorporate the best tools for each aspect and evolve as new technologies emerge. 
Today's cutting-edge solutions can be outdated tomorrow, so flexibility and evolvability are essential for your data mesh platform architecture. Embrace composability: Engineer your future Rather than locking into one single tool, aim to build a platform with composability at its core. Picture a platform where different technologies and tools can be seamlessly integrated, replaced, or evolved, with an integrated and automated self-service experience on top. A platform that is both generic at its core and flexible enough to accommodate the ever-changing landscape of data solutions and requirements. A platform with a long-term return on investment by allowing you to expand capabilities incrementally, avoiding costly, large-scale migrations. Composability enables you to continually adapt your platform capabilities by adding new technologies under the umbrella of one stable core platform layer. Two key ingredients of composability Building blocks: These are the individual components that make up your platform. Interoperability: All building blocks must work together seamlessly to create a cohesive system. An ecosystem of building blocks When building composable data platforms, the key lies in sourcing the right building blocks. But where do we get these? Traditional monolithic data platforms aim to solve all problems in one package, but this stifles the flexibility that composability demands. Instead, vendors should focus on decomposing these platforms into specialized, cost-effective components that excel at addressing specific challenges. By offering targeted solutions as building blocks, they empower organizations to assemble a data platform tailored to their unique needs. In addition to vendor solutions, open-source data technologies also offer a wealth of building blocks. It should be possible to combine both vendor-specific and open-source tools into a data platform tailored to your needs. 
This approach enhances agility, fosters innovation, and allows for continuous evolution by integrating the latest and most relevant technologies. Standardization as glue between building blocks To create a truly composable ecosystem, the building blocks must be able to work together, i.e. interoperability. This is where standards come into play, enabling seamless integration between data platform building blocks. Standardization ensures that different tools can operate in harmony, offering a flexible, interoperable platform. Imagine a standard for data access management that allows seamless integration across various components. It would enable an access management building block to list data products and grant access uniformly. Simultaneously, it would allow data storage and serving building blocks to integrate their data and permission models, ensuring that any access management solution can be effortlessly composed with them. This creates a flexible ecosystem where data access is consistently managed across different systems. The discovery of data products in a catalog or marketplace can be greatly enhanced by adopting a standard specification for data products. With this standard, each data product can be made discoverable in a generic way. When data catalogs or marketplaces adopt this standard, it provides the flexibility to choose and integrate any catalog or marketplace building block into your platform, fostering a more adaptable and interoperable data ecosystem. A data contract standard allows data products to specify their quality checks, SLOs, and SLAs in a generic format, enabling smooth integration of data quality tools with any data product. It enables you to combine the best solutions for ensuring data reliability across different platforms. Widely accepted standards are key to ensuring interoperability through agreed-upon APIs, SPIs, contracts, and plugin mechanisms. In essence, standards act as the glue that binds a composable data ecosystem. 
A strong belief in evolutionary architectures At ACA Group, we firmly believe in evolutionary architectures and platform engineering, principles that seamlessly extend to data mesh platforms. It's not about locking yourself into a rigid structure but creating an ecosystem that can evolve, staying at the forefront of innovation. That’s where composability comes in. Do you want a data platform that not only meets your current needs but also paves the way for the challenges and opportunities of tomorrow? Let’s engineer it together Ready to learn more about composability in data mesh solutions? {% module_block module "widget_f1f5c870-47cf-4a61-9810-b273e8d58226" %}{% module_attribute "buttons" is_json="true" %}{% raw %}[{"appearance":{"link_color":"light","primary_color":"primary","secondary_color":"primary","tertiary_color":"light","tertiary_icon_accent_color":"dark","tertiary_text_color":"dark","variant":"primary"},"content":{"arrow":"right","icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"tertiary_icon":{"alt":null,"height":null,"loading":"disabled","size_type":null,"src":"","width":null},"text":"Contact us now!"},"target":{"link":{"no_follow":false,"open_in_new_tab":false,"rel":"","sponsored":false,"url":{"content_id":230950468795,"href":"https://25145356.hs-sites-eu1.com/en/contact","href_with_scheme":null,"type":"CONTENT"},"user_generated_content":false}},"type":"normal"}]{% endraw %}{% end_module_attribute %}{% module_attribute "child_css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "definition_id" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "field_types" is_json="true" %}{% raw %}{"buttons":"group","styles":"group"}{% endraw %}{% end_module_attribute %}{% module_attribute "isJsModule" is_json="true" %}{% raw %}true{% endraw %}{% 
end_module_attribute %}{% module_attribute "label" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "module_id" is_json="true" %}{% raw %}201493994716{% endraw %}{% end_module_attribute %}{% module_attribute "path" is_json="true" %}{% raw %}"@projects/aca-group-project/aca-group-app/components/modules/ButtonGroup"{% endraw %}{% end_module_attribute %}{% module_attribute "schema_version" is_json="true" %}{% raw %}2{% endraw %}{% end_module_attribute %}{% module_attribute "smart_objects" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "smart_type" is_json="true" %}{% raw %}"NOT_SMART"{% endraw %}{% end_module_attribute %}{% module_attribute "tag" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "type" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "wrap_field_tag" is_json="true" %}{% raw %}"div"{% endraw %}{% end_module_attribute %}{% end_module_block %}
Read more

You may well be familiar with the term ‘data mesh’. It is one of those buzzwords to do with data that have been doing the rounds for some time now. Even though data mesh has the potential to bring a lot of value for an organization in quite a few situations, we should not stare ourselves blind on all the fancy terminology. If you are looking to develop a proper data strategy, you do well to start off by asking yourselves the following questions: what is the challenge we are seeking to tackle with data? And how can a solution contribute to achieving our business goals? There is certainly nothing new about organizations using data, but we have come a long way. Initially, companies gathered data from various systems in a data warehouse. The drawback being that the data management was handled by a central team and the turnaround time of reports was likely to seriously run up. Moreover, these data engineers needed to have a solid understanding of the entire business. Over the years that followed, the rise of social media meant the sheer amount of data positively mushroomed, which in turn led to the term Big Data. As a result, tools were developed to analyse huge data volumes, with the focus increasingly shifting towards self-service. The latter trend now means that the business itself is increasingly better able to handle data under their own steam. Which in turn brings yet another new challenge: as is often the case, we are unable to dissociate technology from the processes at the company or from the people that use these data. Are these people ready to start using data? Do they have the right skills and have you thought about the kind of skills you will be needing tomorrow? What are the company’s goals and how can employees contribute towards achieving them? The human aspect is a crucial component of any potent data strategy. How to make the difference with data? 
In practice, the truth is that, when it comes to their data strategies, a lot of companies have not progressed from where they were a few years ago. Needless to say, this is hardly a robust foundation to move on to the next step. So let’s hone in on some of the key elements in any data strategy: Data need to incite action: it is not enough to just compare a few numbers; a high-quality report leads to a decision or should at the very least make it clear which kind of action is required. Sharing is caring: if you do have data anyway, why not share them? Not just with your own in-house departments, but also with the outside world. If you manage to make data available again to the customer there is a genuine competitive advantage to be had. Visualise: data are often collected in poorly organised tables without proper layout. Studies show the human brain struggles to read these kinds of tables. Visualising data (using GeoMapping for instance) may see you arrive at insights you had not previously thought of. Connect data sets: in the case of data sets, at all times 1+1 needs to equal 3. If you are measuring the efficacy of a marketing campaign, for example, do not just look at the number of clicks. The real added value resides in correlating the data you have with data about the business, such as (increased) sales figures. Make data transparent: be clear about your business goals and KPIs, so everybody in the organization is able to use the data and, in doing so, contribute to meeting a benchmark. Train people: make sure your people understand how to use technology, but also how data are able to simplify their duties and how data contribute to achieving the company goals. Which problem are you seeking to resolve with data? Once you have got the foundations right, we can work up a roadmap. No solution should ever set out from the data themselves, but at all times needs to be linked to a challenge or a goal. 
This is why ACA Group always organises a workshop first in order to establish what the customer’s goals are. Based on the outcome of this workshop, we come up with concrete problem definition, which sets us on the right track to find a solution for each situation. The integration of data sets will gain even greater importance in the near future, in amongst other things as part of sustainability reporting. In order to prepare and guide companies as best as possible, over the course of this year, we will be digging deeper into some important terminologies, methods and challenges around data with a series of blogs. If in the meantime, are you keen to find out exactly what ‘Data Mesh’ entails, and why this could be rewarding for your organization? {% module_block module "widget_1aee89e6-fefb-47ef-92d6-45fc3014a2b0" %}{% module_attribute "child_css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "css" is_json="true" %}{% raw %}{}{% endraw %}{% end_module_attribute %}{% module_attribute "definition_id" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "field_types" is_json="true" %}{% raw %}{"buttons":"group","styles":"group"}{% endraw %}{% end_module_attribute %}{% module_attribute "isJsModule" is_json="true" %}{% raw %}true{% endraw %}{% end_module_attribute %}{% module_attribute "label" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "module_id" is_json="true" %}{% raw %}201493994716{% endraw %}{% end_module_attribute %}{% module_attribute "path" is_json="true" %}{% raw %}"@projects/aca-group-project/aca-group-app/components/modules/ButtonGroup"{% endraw %}{% end_module_attribute %}{% module_attribute "schema_version" is_json="true" %}{% raw %}2{% endraw %}{% end_module_attribute %}{% module_attribute "smart_objects" is_json="true" %}{% raw %}null{% endraw %}{% end_module_attribute %}{% module_attribute "smart_type" is_json="true" %}{% raw 
%}"NOT_SMART"{% endraw %}{% end_module_attribute %}{% module_attribute "tag" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "type" is_json="true" %}{% raw %}"module"{% endraw %}{% end_module_attribute %}{% module_attribute "wrap_field_tag" is_json="true" %}{% raw %}"div"{% endraw %}{% end_module_attribute %}{% end_module_block %}
Read more

In recent years, the exponential growth of data has led to an increasing demand for more effective ways to manage it. Building a data-driven business remains one of the top strategic goals of many business stakeholders. And while it may seem logical for companies to embrace the idea of being data-driven, it’s far more difficult to execute on that idea. Data Mesh and Data Lakes are two important concepts in the world of data architectures that can work together to provide a flexible and scalable approach to data management. Data Lakes have already proven to be a popular solution, but a newer approach, Data Mesh, is gaining attention. This blog will dive into the two concepts and explore how they can complement each other . Data Lakes A data lake is a large and central storage repository that holds massive amounts of data, from various sources, and in various data formats. It can store structured, semi-structured, and unstructured data (e.g. images). Think of it as a huge pool of water, where you can store all sorts of data, such as customer data, transaction data, social media feeds, images, videos and more. It is a cost-effective and accessible solution for companies dealing with large data volumes and various data formats . Additionally, data lakes allow teams to work with raw data , without the need for extensive preprocessing or normalization. Data Mesh Data Mesh is a relatively new concept that takes a decentralized approach to data management. It treats data as a product and is managed by autonomous teams that are responsible for a particular domain. Data Mesh advocates that data should be owned and managed by the people who understand it best - the domain experts - and should be treated as a product. It means that each team is responsible for the data quality, reliability and accessibility of data within its domain. 
This creates a more scalable and flexible approach to data management, where teams can make decisions about their data independently, without requiring intervention from a centralized data team. How can data lake technology be used in a data mesh approach? In short, Data Mesh is an architecture where data is owned and managed by individual product teams, creating a decentralized approach to data management. A data lake is a technology that provides a centralized storage solution, allowing teams to store and manage large amounts of data without worrying about data structure or format. Decentralization in Data Mesh is about taking ownership of sharing data as products in a decentralized way. It’s not about abandoning centralized storage solutions, such as Data Lakes, but about using them in a way that adheres to the principles of Data Mesh. Data Mesh is all about defining and managing Data Products as a building block to make data easily accessible and reusable for various use cases. Each ‘Data Product’ should be able to provide its data in multiple ways through different output ports . An output port is aimed at making data natively accessible for a specific use case. Example use cases are analytics and reporting, machine learning, real-time processing, etc. As such, multiple types of output ports need corresponding data technologies that enable a specific access mode. One technology that can support a Data Mesh architecture is a data lake. The data in an output port for a data product can be stored in a data lake . This type of output port then receives all the benefits offered by data lake technology. In a Data Mesh architecture, each data product gets its own segment in the data lake (e.g. an S3 Bucket). This segment acts as the output port for the data product, where the team responsible for the data product can write their data to the lake. 
By segmenting the data lake in this way, teams can manage and secure their own data without conflicting with other teams. Decentralized ownership thus becomes possible, even when using a more centralized storage technology.

While a data lake is an important technology for supporting a Data Mesh architecture, it may not be the ideal solution for every use case. Using a data lake as the only data storage technology may limit the flexibility of the Data Mesh platform, as it provides only one type of storage. For example, when it comes to business intelligence and reporting, a data warehouse technology with tabular storage may be more suitable. In other cases, time series databases or graph databases are a better option because of the type of data we want to make natively reusable. To make the Data Mesh platform more flexible, it should provide the capability to plug in different types of data storage technology. Each of them is a different type of output port. In this way, each data product can have its own output ports, with different types of data storage technologies, geared towards specific data usage patterns.

We have noticed that cloud vendors frequently recommend implementing a Data Mesh solution using one of their existing data lake services. Typically, their approach involves defining security boundaries to separate segments within these services, which can then be owned by different domain teams to create various data products. However, the reference architectures they provide incorporate only one storage technology, namely their own data lake technology. Consequently, the resulting Data Mesh platform is less adaptable and tied to a single technology. What is lacking is an explicit 'Data Product' abstraction that goes beyond merely enforcing security boundaries and allows for the integration of various data storage technologies and solutions.

Conclusion

Data management is a critical component of any organization.
Various technologies and approaches are available, like data lakes, data warehouses, data vaults, time series databases, graph databases, and so on. They all have their unique strengths and limitations. Ultimately, a successful Data Mesh architecture provides the flexibility to share and reuse data with the right technology for the right use case. While a data lake is a powerful tool for managing raw data, it may not be the best solution for all types of data usage. By considering different types of data storage technologies, teams can choose the solution that best meets their specific needs and optimize their data management workflows. By using data products in a Data Mesh, teams can create a flexible and scalable architecture that adapts to changing data management needs.

Want to find out more about Data Mesh or Data Lakes?
Want to dive deeper into this topic?
Get in touch with our experts today. They are happy to help!

