(2025-Feb-12) I will jump straight to the problem statement without a “boring” introduction, even though this sentence, in a sense, is already one.

Moving data between two or more endpoints is a common task. Sometimes it’s as simple as migrating data from one place to another. Other times, it’s a request to copy specific documents from source environments. In more complex cases, you might need to consolidate multiple data files into the same destination, such as loading several separate files into a single database table.

Image by Bernd Hildebrandt from Pixabay

Let’s look at a common data scenario where separate data files are used to bring in similarly structured datasets for different periods of time, such as product sales for different months or years. Alternatively, this could involve datasets from the same time period but for different geographic regions.

[Screenshot: Data Format Samples.xlsx — sample source file headers across the four timeframes]

The source file headers evolved progressively across the four timeframes, reflecting changes in data structure and content requirements:

  - Early Files: the headers started with basic fields labelled A, B, C, and D, representing the initial dataset.
  - Some Time Later: an additional header E was introduced, expanding the dataset to accommodate new information while retaining all previous headers.
  - More Time Later: the structure was modified further; header E was repositioned between B and C, indicating a reorganization of data priorities, while headers C and D remained but shifted slightly in placement.
  - Latest Data: the structure underwent a significant transformation; header F was introduced as a new addition, while headers D and E were rearranged to reflect updated data priorities. Headers B and C were removed, suggesting that their data was either deprecated or consolidated elsewhere.

It’s a mess, but it’s a functional mess, a common reality in the data world. Instead of complaining or pointing fingers at the data provider, we can focus on finding a solution. If you’ve worked with Azure Data Factory, this can be resolved quite easily once you identify your target table, which should account for the full scope of your incoming datasets.

[Screenshot: Data Format Samples.xlsx — target table covering the full set of incoming columns]

By utilizing a common, fairly simple, yet powerful pattern in Data Factory, you can achieve this by:

  1. Retrieving the list of incoming data files,
  2. Passing it to a For Each loop container, and
  3. Executing a Copy Data activity for each loop iteration using the default column name mapping.
[Screenshot: Azure Data Factory pipeline with the Get Metadata, ForEach, and Copy Data activities]
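For reference, here is a minimal sketch of what that 1-2-3 pattern can look like in pipeline JSON. The dataset names (SourceFolder, SourceCsvFile, TargetSqlTable), the fileName parameter, and the blob/SQL connector types are illustrative placeholders; your own datasets and linked services will differ:

```json
{
  "activities": [
    {
      "name": "GetFileList",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "ForEachFile",
      "type": "ForEach",
      "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
        "activities": [
          {
            "name": "CopyOneFile",
            "type": "Copy",
            "typeProperties": {
              "source": { "type": "DelimitedTextSource" },
              "sink": { "type": "AzureSqlSink" }
            },
            "inputs": [
              {
                "referenceName": "SourceCsvFile",
                "type": "DatasetReference",
                "parameters": { "fileName": "@item().name" }
              }
            ],
            "outputs": [
              { "referenceName": "TargetSqlTable", "type": "DatasetReference" }
            ]
          }
        ]
      }
    }
  ]
}
```

Notice that the Copy activity carries no "translator" section at all; that is what lets it fall back to the default mapping by column names, per the schema and type mapping documentation linked below.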

https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping

The best part is that you don’t need to specify any explicit mappings. The Copy Data activity will automatically map columns with matching names (e.g., A to A, F to F), regardless of the column order in each file. Even if some incoming files are missing certain columns that exist in the target table, the process will still handle it seamlessly. Problem solved, no need to overcomplicate it.
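For contrast, this is roughly what you would have to hand-craft and keep in sync for every file layout if you did use explicit mappings (the column names follow the samples above); omitting the translator altogether is what switches the Copy Data activity to the automatic name-based behaviour:

```json
"translator": {
  "type": "TabularTranslator",
  "mappings": [
    { "source": { "name": "A" }, "sink": { "name": "A" } },
    { "source": { "name": "D" }, "sink": { "name": "D" } },
    { "source": { "name": "E" }, "sink": { "name": "E" } },
    { "source": { "name": "F" }, "sink": { "name": "F" } }
  ]
}
```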

You can simplify all three steps by using a single Copy Data activity with wildcard folder and file name settings. It can even recursively scan for all similar data files and load them into your target data destination at once. Isn’t that wild?

[Screenshot: Azure Data Factory — Copy Data activity source with wildcard folder and file name settings]
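In JSON, this single-activity variant boils down to a source block roughly like the following; the AzureBlobStorageReadSettings store type and the sales/* and *.csv wildcards are illustrative assumptions, so adjust them to your own storage connector and folder layout:

```json
"source": {
  "type": "DelimitedTextSource",
  "storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFolderPath": "sales/*",
    "wildcardFileName": "*.csv"
  },
  "formatSettings": { "type": "DelimitedTextReadSettings" }
}
```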

But here’s the catch: a single Copy Data activity in Data Factory won’t handle each file individually in terms of file headers. Instead, it will assume that the schema or header of the first scanned file applies to all other files, and it will align the data columns positionally based on that initial structure, as if every file shared the same layout. This approach can still work if all of your data files are equally structured. However, if they’re not, here’s what will happen.

[Screenshot: Data Format Samples.xlsx — mismatched column mapping result]

When using a single Copy Data activity in Data Factory without properly handling schema drift, the process will align data based on the structure of the first scanned file, even with the default mapping by column names. This can lead to mismatched mappings, where data from columns in later files might be incorrectly placed under the wrong target columns or even lost if no corresponding target column exists.

For example, columns E and F from the latest files will be mapped to columns C and D in the target table, respectively. As a result, the actual E and F columns in the target table will remain empty since they never existed in the first scanned file. This is truly wild!

So, if you’re not sure whether the data format is consistent across all incoming files, avoid using a single Copy Data activity with wildcard settings in Data Factory. Instead, rely on the safer approach with the 1-2-3 steps described above, unless this issue gets fixed by the product team at Microsoft.

On the other hand, as I’ve mentioned in one of my earlier blog posts, the Copy Data activity is the most important part of Azure Data Factory. All other activities you see are primarily there to help organize your workflow or to call external components like Azure Functions, Databricks notebooks, and more.

If you’ve mastered the Copy Data activity, that single rectangular box packed with settings for various data connectors, logging options, datasets, linked services, and more, then you’ve mastered most of what you need to know about Azure Data Factory. Yes, you can add additional workflows with mapping data flows and gain sufficient knowledge on how to deploy and execute your Data Factory pipelines. But without the Copy Data activity, all of this might feel like reinventing the wheel. Prove me wrong!

I like the principle “Keep It Simple, Stupid” (KISS): what a nice way to conclude this blog post as we approach Valentine’s Day. A KISS to all the data professionals working with Azure Data Factory! 😄
