Submit to biosamples

This notebook will serve to show how to use this library, in the simplest way:

- Before you start
  - Generate a valid input file with metadata about 1 sample (a very, very simple TSV)
- What components do we need
- What do we need to input to each component
- How to correct our samples before submission
- How to submit
- See the results in Excel (or TSV, your decision)

This is the simplest example; in other notebooks, we will explore how to submit multiple samples, with relationships defined amongst them, how to validate our samples against ENA checklists, and how to transform input data.

Before you start

Please make sure you have Python 3.10 or higher
Please make sure you have a Webin account set up in webin-dev
Please make sure you have the latest biobroker library installed: pip install --upgrade biobroker
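
If you are not sure which Python you are running, a quick check like the one below can save you a confusing install error. This is just a small sketch; the helper name `python_is_supported` is made up for this notebook and is not part of biobroker.

```python
import sys

MINIMUM_PYTHON = (3, 10)  # the requirement stated above


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is too old; "
        f"please use {MINIMUM_PYTHON[0]}.{MINIMUM_PYTHON[1]} or higher."
    )
```

If nothing is printed, you are good to go.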

[1]:

%pip install --upgrade biobroker

Collecting biobroker==0.0.4
  Downloading biobroker-0.0.4-py3-none-any.whl.metadata (42 kB)
Requirement already satisfied: numpy~=1.23.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from biobroker==0.0.4) (1.23.1)
Requirement already satisfied: openpyxl==3.1.2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from biobroker==0.0.4) (3.1.2)
Requirement already satisfied: pandas~=2.0.3 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from biobroker==0.0.4) (2.0.3)
Requirement already satisfied: progressbar2~=4.4.2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from biobroker==0.0.4) (4.4.2)
Requirement already satisfied: requests>=2.31.0 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from biobroker==0.0.4) (2.31.0)
Requirement already satisfied: et-xmlfile in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from openpyxl==3.1.2->biobroker==0.0.4) (1.1.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from pandas~=2.0.3->biobroker==0.0.4) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from pandas~=2.0.3->biobroker==0.0.4) (2024.2)
Requirement already satisfied: tzdata>=2022.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from pandas~=2.0.3->biobroker==0.0.4) (2024.2)
Requirement already satisfied: python-utils>=3.8.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from progressbar2~=4.4.2->biobroker==0.0.4) (3.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from requests>=2.31.0->biobroker==0.0.4) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from requests>=2.31.0->biobroker==0.0.4) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from requests>=2.31.0->biobroker==0.0.4) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from requests>=2.31.0->biobroker==0.0.4) (2024.8.30)
Requirement already satisfied: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas~=2.0.3->biobroker==0.0.4) (1.16.0)
Requirement already satisfied: typing-extensions>3.10.0.2 in /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages (from python-utils>=3.8.1->progressbar2~=4.4.2->biobroker==0.0.4) (4.12.2)
Downloading biobroker-0.0.4-py3-none-any.whl (55 kB)
Installing collected packages: biobroker
Successfully installed biobroker-0.0.4
Note: you may need to restart the kernel to use updated packages.

Generate input file

I don’t want to have an example file of this kind in the examples, so, let’s generate it ourselves! Let’s do a simple example with 2 attributes: “name” and “collected_at”

[2]:

sample_tsv = [
    ["name", "collected_at"],
    ["sumple", "noon"]
]

writable_sample = "\n".join(["\t".join(row) for row in sample_tsv])
with open("simple_sample_sumple.tsv", "w") as f:
    f.write(writable_sample)

“name” is a special property in BioSamples. We’ll talk about it later; for now, just remember that this property always has to be set.
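
Joining the values by hand works for this tiny file, but it would break as soon as a value contained a tab or a line break. As an alternative sketch, the standard library `csv` module can write the same TSV and take care of quoting. The file is written to a temporary folder here (my own choice, so the sketch does not overwrite the file created above).

```python
import csv
import os
import tempfile

sample_rows = [
    ["name", "collected_at"],
    ["sumple", "noon"],
]

# Temporary location, only for this sketch
tsv_path = os.path.join(tempfile.mkdtemp(), "simple_sample_sumple.tsv")

# newline="" lets the csv module control the line endings itself
with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(sample_rows)

# Read the file back to confirm it round-trips
with open(tsv_path, newline="", encoding="utf-8") as handle:
    rows_read_back = list(csv.reader(handle, delimiter="\t"))

print(rows_read_back)
```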

What components do we need

Given that you’ve read the documentation on the main page (I know you’ve done it, I wrote it with lots of love) you will know by now that, in order to submit, we need:
- An authenticator: To authenticate ourselves to the archive
- An api: To store/execute the instructions to submit to the archive
- At least 1 metadata entity: To store the metadata about our samples
- An input processor: To process the input file with the metadata

Additionally, I will also import an output processor. It is not needed, but it lets us save our brokering results to a very nice, very demure and readable Excel file.

Additionally number 2: metadata entities have a very nice feature. They have a static method (that just means you can call the method without creating an instance) that gives you guidelines on how to fill out the metadata. Lovely, eh?

[3]:

from biobroker.authenticator import WebinAuthenticator # Biosamples uses the WebinAuthenticator
from biobroker.api import BsdApi # BioSamples Database (BSD) API
from biobroker.metadata_entity import Biosample # The metadata entity
from biobroker.input_processor import TsvInputProcessor # An input processor
from biobroker.output_processor import XlsxOutputProcessor # An output processor

print(Biosample.guidelines())

A Biosamples entity MUST have the following properties set:
        - name: a descriptive title for the sample
        - organism: a string that validates against NCBITaxon records
        - release: date of release for the metadata of the entity, in YYYY-MM-DD format. Accepts iso format
For more information, please see https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submission_minimal_fields.

To indicate relationships in the samples, please use a field named after the relationshipitself: namely, 'derived_from', 'same_as', 'has_member' or 'child_of'.
Please seehttps://www.ebi.ac.uk/biosamples/docs/guides/relationships

2024-10-16 21:36:25,593 - Biosample - ERROR - Metadata content has failed validation for 'sumple':
        - root: Missing mandatory field 'release'. Provided value: '{'characteristics': {'collected_at': [{'text': 'noon'}]}, 'name': 'sumple'}'
        - characteristics: Value error, 'organism' must be set. Please use the keys 'organism', 'Organism', 'species' or 'Species'. Provided value: '{'collected_at': [{'text': 'noon'}]}'
2024-10-16 21:37:42,280 - BsdApi - INFO - Set up BSD API successfully: using base uri 'https://wwwdev.ebi.ac.uk/biosamples/samples'

Now we have imported everything! See how easy it is?

You may have noticed that, aside from the name, there are 2 other mandatory fields:
- organism: BioSamples requires each sample to state the taxonomic classification of the organism it comes from. This is not important now - it will throw an error later and we will correct it.
- release: As with any archive, the [meta]data can be kept private for a period of time. This field sets the release date. BioSamples expects it down to the second, but biobroker provides a relaxed parser that accepts a plain YYYY-MM-DD date, so we’ll simply give it a date.
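
If you would rather catch missing fields before the library does, a small pre-flight check on the raw rows is enough. The sketch below is my own illustration, not something biobroker provides: `missing_mandatory_fields` is a hypothetical helper, and the accepted organism keys are taken from the library's validation message.

```python
# Keys the library's validation message lists as valid ways to give the organism
ORGANISM_KEYS = ("organism", "Organism", "species", "Species")


def missing_mandatory_fields(row):
    """Hypothetical helper: list the mandatory fields a flat input row lacks."""
    missing = []
    if not row.get("name"):
        missing.append("name")
    if not any(row.get(key) for key in ORGANISM_KEYS):
        missing.append("organism")
    if not row.get("release"):
        missing.append("release")
    return missing


print(missing_mandatory_fields({"name": "sumple", "collected_at": "noon"}))
```

An empty list means the row has the three mandatory fields; it says nothing about whether the values themselves are valid.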

How to set up each component

Alrighty! Let’s start setting up the components:

1. Set up the input processor

For the input processor, we just need to give the path to the input file :)

[4]:

path = "simple_sample_sumple.tsv" # This is the file we created previously

input_processor = TsvInputProcessor(input_data=path)

Let’s check it out!

[5]:

input_processor.input_data

[5]:

[{'name': 'sumple', 'collected_at': 'noon'}]

There’s another functionality of the input processors: the transform function. I will discuss it in further notebooks!

For now, we have the input processor set up. Cool!
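
For intuition, the list of dictionaries shown above is the same shape you get when reading a TSV with the standard library. The sketch below uses `csv.DictReader` on an in-memory copy of our file; I am not claiming this is how `TsvInputProcessor` is implemented, only that the result looks the same.

```python
import csv
import io

# In-memory copy of the TSV we generated earlier
tsv_text = "name\tcollected_at\nsumple\tnoon\n"

# DictReader uses the first line as the keys for every following row
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = [dict(row) for row in reader]

print(records)
```

One dictionary per row, with the header names as keys: that is all the "input data" is.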

2. Set up the samples

Now that we have the data in an object… where does that go?

Well, it’s as simple as this: the input processors have a method called process. You pass it the metadata entity class that you want to create, and it returns a list of those entities, created from the .input_data. If you want to see more documentation on that, refer to ReadTheDocs

[6]:

my_sample = input_processor.process(Biosample) # We're giving it a Biosample class to process
print(my_sample)

---------------------------------------------------------------------------
EntityValidationError                     Traceback (most recent call last)
Cell In[6], line 1
----> 1 my_sample = input_processor.process(Biosample) # We're giving it a Biosample class to process
      2 print(my_sample)

File ~/PycharmProjects/biobroker/biobroker/input_processor/input_processor.py:55, in GenericInputProcessor.process(self, entity)
     53 entities = []
     54 for json_entity in self.input_data:
---> 55     new_entity = entity(metadata_content=deepcopy(json_entity))
     56     entities.append(new_entity)
     57 return entities

File ~/PycharmProjects/biobroker/biobroker/metadata_entity/metadata_entity.py:154, in Biosample.__init__(self, metadata_content, data_model, delimiter, verbose)
    151 def __init__(self, metadata_content: dict, data_model: Type[BaseModel] = BiosampleGeneralModel,
    152              delimiter: str = "||", verbose: bool = False):
    153     self.delimiter = delimiter
--> 154     super().__init__(metadata_content, data_model=data_model, verbose=verbose)

File ~/PycharmProjects/biobroker/biobroker/metadata_entity/metadata_entity.py:35, in GenericEntity.__init__(self, metadata_content, data_model, verbose)
     33 self._entity = None
     34 self.entity = metadata_content
---> 35 self.validate(data_model=data_model)

File ~/PycharmProjects/biobroker/biobroker/metadata_entity/metadata_entity.py:82, in GenericEntity.validate(self, data_model)
     80     self.entity = json.loads(data_model(**self.entity).model_dump_json(exclude_unset=True, by_alias=True))
     81 except pydantic_core.ValidationError as pydantic_error:
---> 82     raise EntityValidationError(self.logger, entity_id=self.id, errors=pydantic_error.errors()) from None

EntityValidationError: Metadata content has failed validation for 'sumple':
        - root: Missing mandatory field 'release'. Provided value: '{'characteristics': {'collected_at': [{'text': 'noon'}]}, 'name': 'sumple'}'
        - characteristics: Value error, 'organism' must be set. Please use the keys 'organism', 'Organism', 'species' or 'Species'. Provided value: '{'collected_at': [{'text': 'noon'}]}'

As we can see, the library is already complaining; when setting up a BioSample entity, it gets validated against a BioSample general model generated with pydantic. This model requires the release date and the organism to be set up. Let’s dissect one of the messages:

root: Missing mandatory field ‘release’. Provided value: ‘{‘characteristics’: {‘collected_at’: [{‘text’: ‘noon’}]}, ‘name’: ‘sumple’}’

This message is composed of three parts:
- root: This indicates where the error happened. In this case, it happened at the root of the sample metadata.
- Missing mandatory field 'release'.: This is the error message. It’s telling you that a mandatory field is missing, and the name of the field.
- Provided value: '{'characteristics': {'collected_at': [{'text': 'noon'}]}, 'name': 'sumple'}': This tells you what you provided at the level at which the error happened. For missing fields it’s not super useful, but for incorrect ones it reminds you of the value you sent.
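
If you ever want to collect these messages in a report, the three parts can be pulled apart with plain string handling. This is a rough sketch that assumes the message layout shown above stays the same; `split_validation_message` is a hypothetical helper, not part of biobroker.

```python
def split_validation_message(line):
    """Hypothetical helper: split one validation line into its three parts."""
    # Drop the leading "- " bullet, then cut at the first ": "
    location, _, rest = line.strip().lstrip("- ").partition(": ")
    # Everything after " Provided value: " is what was sent
    message, _, provided = rest.partition(" Provided value: ")
    return {
        "location": location,
        "message": message,
        "provided": provided.strip("'"),
    }


example = "- root: Missing mandatory field 'release'. Provided value: '{'name': 'sumple'}'"
parts = split_validation_message(example)
print(parts["location"])
print(parts["message"])
print(parts["provided"])
```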

Let’s fix this and process the samples again!

(Please note: it is much simpler to fix the input data at the source, but I am not going to create another section that repeats the exact same steps with only the TSV modified)

[7]:

input_processor.input_data[0]['release'] = "2024-10-13"
input_processor.input_data[0]['organism'] = "Homo sapiens" # Characteristics are handled automatically. Fields are set by the Biosample.

my_sample = input_processor.process(Biosample)
print(my_sample)

[<biobroker.metadata_entity.metadata_entity.Biosample object at 0x12eb93a90>]

It’s… a list of objects?

Yup! The process function always returns a list of objects (Biosample entities in this case). This makes writing code against the output much easier, as you never need to handle both the list case and the single-entity case. Don’t be lazy - Write against the list!

(Also, let’s see how the metadata inside has been transformed)

[8]:

print(my_sample[0].entity)

{'characteristics': {'collected_at': [{'text': 'noon'}], 'organism': [{'text': 'Homo sapiens'}]}, 'name': 'sumple', 'release': '2024-10-13T00:00:00Z'}

Now the sample is set up! See how it has re-structured the metadata?

This is the format in which BioSamples expects its metadata. You don’t need to understand everything - just know that there are certain keywords (e.g. name) that get treated differently, and everything else is stored under characteristics. You can review the list of properties in the RTD docs: ROOT PROPERTIES
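
To make the re-structuring less magical, here is a rough sketch that reproduces the output above from the flat row. It is an illustration only, not biobroker's implementation: `to_nested` is a hypothetical function, and `ROOT_KEYS` contains just the two root keywords seen in this notebook (the real list is longer).

```python
from datetime import datetime, timezone

ROOT_KEYS = {"name", "release"}  # assumption: only the root keys used in this notebook


def to_nested(flat_row):
    """Hypothetical sketch: keep root keys at the top, move the rest under 'characteristics'."""
    nested = {"characteristics": {}}
    for key, value in flat_row.items():
        if key in ROOT_KEYS:
            nested[key] = value
        else:
            nested["characteristics"][key] = [{"text": value}]
    if "release" in nested:
        # Expand a plain YYYY-MM-DD date to a UTC timestamp with seconds
        release = datetime.fromisoformat(nested["release"]).replace(tzinfo=timezone.utc)
        nested["release"] = release.strftime("%Y-%m-%dT%H:%M:%SZ")
    return nested


flat = {
    "name": "sumple",
    "collected_at": "noon",
    "release": "2024-10-13",
    "organism": "Homo sapiens",
}
nested_sample = to_nested(flat)
print(nested_sample)
```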

3. Setting up the authenticator + API

Now, we need to set up the authenticator and the API. For this example, we’re going to use BioSamples dev - the testing environment.

For that, we will set up an environment variable, API_ENVIRONMENT, and we will provide the authenticator with our webin-dev username and password.

[9]:

import os
os.environ['API_ENVIRONMENT'] = "dev" # There are multiple ways to set up environment variables

username = "" # Your username goes here
password = "" # Your password goes here
authenticator = WebinAuthenticator(username=username, password=password)

api = BsdApi(authenticator=authenticator)

For your username and password in a workflow environment, I would recommend either setting them up as environment variables and loading them in your script, or using a config file that you are sure will not be pushed to the repository. Be mindful!
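
As a minimal sketch of the environment variable route: the variable names `WEBIN_USERNAME` and `WEBIN_PASSWORD` and the helper `load_webin_credentials` are my own choices for this example, so pick whatever names fit your setup.

```python
import os


def load_webin_credentials(env=os.environ):
    """Read the credentials from the environment (variable names are hypothetical)."""
    try:
        return env["WEBIN_USERNAME"], env["WEBIN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Please set the environment variable {missing}") from None


# Demonstration with a plain dictionary standing in for the real environment
demo_username, demo_password = load_webin_credentials(
    {"WEBIN_USERNAME": "Webin-00000", "WEBIN_PASSWORD": "not-a-real-password"}
)
print(demo_username)
```

In a real script you would call `load_webin_credentials()` without arguments and pass the two values to `WebinAuthenticator(username=..., password=...)`, so the password never appears in the notebook itself.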

Now that we have everything set up, let’s try to submit!

4. Submitting your sample

This step is very easy - Since we’ve done everything, we just need to hit submit on the API object and pass the samples we generated!

[10]:

submitted_samples = api.submit(my_sample)

One very cool thing is that metadata_entity objects behave like dictionaries. The same way you would add a key:value pair to a dictionary, you can add one to a metadata entity - and the entity will handle where and how to write it.
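
If you are curious how an object can behave like a dictionary and still decide where each value goes, the toy class below shows the idea with `__setitem__` and `__getitem__`. It is a hypothetical stand-in written for this notebook, not biobroker's code, and its list of root keys is a small assumed subset.

```python
ROOT_KEYS = {"name", "release", "accession"}  # assumption: small subset, for illustration


class SketchEntity:
    """Hypothetical stand-in for a metadata entity; not biobroker's implementation."""

    def __init__(self):
        self.entity = {"characteristics": {}}

    def __setitem__(self, key, value):
        # Root keywords live at the top level; everything else is a characteristic
        if key in ROOT_KEYS:
            self.entity[key] = value
        else:
            self.entity["characteristics"][key] = [{"text": value}]

    def __getitem__(self, key):
        if key in ROOT_KEYS:
            return self.entity[key]
        return self.entity["characteristics"][key][0]["text"]


sample = SketchEntity()
sample["name"] = "sumple"
sample["collected_at"] = "noon"
print(sample.entity)
```

The caller writes `sample["collected_at"] = "noon"` and never has to know that the value ends up nested under characteristics.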

Let’s see the content!

[11]:

print(submitted_samples[0].entity)

{'characteristics': {'SRA accession': [{'text': 'ERS31055558'}], 'collected_at': [{'text': 'noon'}], 'organism': [{'text': 'Homo sapiens'}]}, 'name': 'sumple', 'accession': 'SAMEA131421913', 'release': '2024-10-13T00:00:00Z', 'sraAccession': 'ERS31055558', 'webinSubmissionAccountId': 'Webin-64342', 'taxId': 9606, 'status': 'PUBLIC', 'update': '2024-10-16T20:37:46.178Z', 'submitted': '2024-10-16T20:37:46.178Z', 'submittedVia': 'JSON_API', 'create': '2024-10-16T20:37:46.178Z', '_links': {'self': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples'}, 'curationDomain': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples{?curationdomain}', 'templated': True}, 'curationLinks': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131421913/curationlinks'}, 'curationLink': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131421913/curationlinks/{hash}', 'templated': True}, 'structuredData': {'href': 'https://wwwdev.ebi.ac.uk/biosamples/structureddata/SAMEA131421913'}}}

And it’s submitted! If you want to see it, it’s already available in BioSamples dev:

[12]:

print(f"https://wwwdev.ebi.ac.uk/biosamples/samples/{submitted_samples[0]['accession']}")

https://wwwdev.ebi.ac.uk/biosamples/samples/SAMEA131421913

(You may need to wait a bit - BioSamples dev operates a bit slower, as is normal for a testing ground. It may take a while for the sample to become public)

Let’s create an output file so you can be happy with your local version of the metadata!

[13]:

from biobroker.output_processor import XlsxOutputProcessor
output_processor = XlsxOutputProcessor(output_path="simple_sample_submitted.xlsx", sheet_name="Awesome submission")
output_processor.save(submitted_samples)

And you should see something like this! Isn’t this demure?

[Image: screenshot of the resulting Excel file]
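
If you prefer TSV over Excel, the submitted metadata can also be flattened by hand with the standard library. The sketch below works on a trimmed copy of the entity printed earlier; `flatten_entity` is a hypothetical helper of mine, and it keeps only the first text value of each characteristic.

```python
import csv
import io


def flatten_entity(entity):
    """Hypothetical helper: one flat row per sample, first text value per characteristic."""
    row = {
        key: value
        for key, value in entity.items()
        if key not in ("characteristics", "_links")
    }
    for key, values in entity.get("characteristics", {}).items():
        row[key] = values[0]["text"]
    return row


# Trimmed copy of the submitted entity shown earlier in this notebook
submitted = [{
    "characteristics": {
        "collected_at": [{"text": "noon"}],
        "organism": [{"text": "Homo sapiens"}],
    },
    "name": "sumple",
    "accession": "SAMEA131421913",
    "release": "2024-10-13T00:00:00Z",
}]

rows = [flatten_entity(entity) for entity in submitted]

# Written to an in-memory buffer here; swap in open("results.tsv", "w", newline="") for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Note that the column names come from the first row only, so with several samples that carry different characteristics you would need to collect the full set of keys first.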