Modifying the source input with a key:value map

In this notebook, we will look at one feature of the input processors: they can rename the fields of the loaded metadata using a simple key:value map.

Why? Why not just have the input file match the expected values?

Well, sometimes you will have to do that. But let’s imagine that your pipeline produces an almost-ready input, except that your laboratory identifies its samples with a field other than name, such as sample_id. You want to automatise sending the metadata as soon as the pipeline has it ready, but you don’t want to write another script. Easy then! Let’s see how to do that:

[2]:

## Import everything we need
from biobroker.input_processor import TsvInputProcessor # An input processor

path = "simple_sample_sumple.tsv" # The TSV file we are about to create

sample_tsv = [
    ["sample_id", "collected_at"], # Header row
    ["sumple", "noon"] # One sample
]

# Write the rows as tab-separated lines
writable_sample = "\n".join(["\t".join(row) for row in sample_tsv])
with open(path, "w") as f:
    f.write(writable_sample)

## Set up the required entities

input_processor = TsvInputProcessor(input_data=path)

print(input_processor.input_data)

[{'sample_id': 'sumple', 'collected_at': 'noon'}]

Up to here, everything is the same: you have set up the input processor pointing to the data.

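Here comes the slightly different part: let’s transform the metadata so that “sample_id” becomes “name”. With delete_non_mapped_fields=False, fields that are not listed in the map (here, “collected_at”) are kept as they are: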

[3]:

map_of_fields = {
    "sample_id": "name"
}

input_processor.transform(field_mapping=map_of_fields, delete_non_mapped_fields=False)

Let’s take a look at the metadata now!

[4]:

print(input_processor.input_data)

[{'collected_at': 'noon', 'name': 'sumple'}]

Ta-da! We now have the samples in the format that we want, and we can process and submit them without any issue.

While not a super complicated transformation, this can help you set up your own pipelines without having to tailor the metadata in your pipeline’s output.
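
If you are curious about what a field mapping like this amounts to, here is a minimal sketch in plain Python. It is not biobroker's actual implementation: rename_fields is a hypothetical helper written only for illustration, and the behaviour shown for delete_non_mapped_fields=True (dropping every field that is not in the map) is an assumption based on the parameter's name.

```python
from typing import Dict, List


def rename_fields(
    records: List[Dict[str, str]],
    field_mapping: Dict[str, str],
    delete_non_mapped_fields: bool = False,
) -> List[Dict[str, str]]:
    """Hypothetical helper: return copies of `records` with keys renamed."""
    renamed = []
    for record in records:
        if delete_non_mapped_fields:
            # Assumed behaviour: start empty, so only mapped fields survive.
            new_record = {}
        else:
            # Keep every field that the map does not mention.
            new_record = {k: v for k, v in record.items() if k not in field_mapping}
        # Add each mapped field under its new name.
        for old_name, new_name in field_mapping.items():
            if old_name in record:
                new_record[new_name] = record[old_name]
        renamed.append(new_record)
    return renamed


records = [{"sample_id": "sumple", "collected_at": "noon"}]

print(rename_fields(records, {"sample_id": "name"}))
# [{'collected_at': 'noon', 'name': 'sumple'}]

print(rename_fields(records, {"sample_id": "name"}, delete_non_mapped_fields=True))
# [{'name': 'sumple'}]
```

The first call reproduces the result we got from the input processor above. The sketch returns new dictionaries instead of modifying the originals, which is a design choice of this example only.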