{ "cells": [ { "cell_type": "markdown", "id": "e78a8bd9-90d9-49a8-bee0-c31fe14abc31", "metadata": {}, "source": [ "# Modifying the source input with a key:value map" ] }, { "cell_type": "markdown", "id": "d2e829eb-3ea2-4554-aec5-98ff55ca3309", "metadata": {}, "source": [ "In this notebook, we look at one feature of the input processors: they can rename the fields of the loaded metadata using a simple key:value map.\n", "\n", "Why? Why not just have the input file match the expected field names?\n", "\n", "Well, sometimes you **will** have to do that. But imagine that your pipeline produces an almost-ready input, except that your laboratory identifies its samples by `sample_id` rather than by `name`. You want to automate sending the metadata as soon as the pipeline has it ready, without writing another script. Easy! Let's see how to do that:" ] }, { "cell_type": "code", "execution_count": 2, "id": "0031d648-860b-4863-b733-c55f338fa165", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'sample_id': 'sumple', 'collected_at': 'noon'}]\n" ] } ], "source": [ "## Import everything we need\n", "from biobroker.input_processor import TsvInputProcessor  # An input processor\n", "\n", "## Write a small example TSV file: a header row and one sample\n", "sample_tsv = [\n", "    [\"sample_id\", \"collected_at\"],\n", "    [\"sumple\", \"noon\"]\n", "]\n", "\n", "writable_sample = \"\\n\".join([\"\\t\".join(row) for row in sample_tsv])\n", "with open(\"simple_sample_sumple.tsv\", \"w\") as f:\n", "    f.write(writable_sample)\n", "\n", "path = \"simple_sample_sumple.tsv\"  # This is the file we just wrote\n", "\n", "## Set up the required entities\n", "\n", "input_processor = TsvInputProcessor(input_data=path)\n", "\n", "print(input_processor.input_data)" ] }, { "cell_type": "markdown", "id": "5f3bf4b2-8a0a-4151-ab81-1af153cf84d7", "metadata": {}, "source": [ "Up to here, everything is the same: you have set up the input processor and pointed it at the data.\n", "\n", "Here comes the slightly different part: let's transform the metadata so that the `sample_id` field becomes `name`:" ] }, { "cell_type": "code", "execution_count": 3, "id": "bd6ceb29-d574-4db5-8256-e8c71e766504", "metadata": {}, "outputs": [], "source": [ "# Keys are the field names in the input; values are the names they are renamed to\n", "map_of_fields = {\n", "    \"sample_id\": \"name\"\n", "}\n", "\n", "# With delete_non_mapped_fields=False, unmapped fields such as collected_at are kept\n", "input_processor.transform(field_mapping=map_of_fields, delete_non_mapped_fields=False)" ] }, { "cell_type": "markdown", "id": "381e6523-3126-4cb5-8eb0-d6f20e4cec4a", "metadata": {}, "source": [ "Let's take a look at the metadata now!" ] }, { "cell_type": "code", "execution_count": 4, "id": "9445d66f-2a25-4fd8-8496-a851d593cd5b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'collected_at': 'noon', 'name': 'sumple'}]\n" ] } ], "source": [ "print(input_processor.input_data)" ] }, { "cell_type": "markdown", "id": "2ff07d34-b3d9-42cc-9510-500313cb8f98", "metadata": {}, "source": [ "Ta-da! We now have the samples in the format we want, and we can `process` and `submit` them without any issue.\n", "\n", "While this is not a complicated transformation, it can help you set up your own pipelines without having to tailor the metadata in your pipeline's output." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" } }, "nbformat": 4, "nbformat_minor": 5 }