Metadata entity

GenericEntity

Generic definition of metadata entity.

Biosample

Biosamples metadata entity.

Metadata entities. The goal of this module is to define the metadata entities that each archive uses. All the metadata is saved under the self.entity property, but the whole class can be accessed as a dictionary.

To make things easier for users, every property is expected to be accessible as if it were a root property. Subclasses must therefore define how properties are accessed by implementing “__setitem__”, “__getitem__”, “__delitem__” and “__contains__”. If unsure, look at the “Biosample” subclass to see how this is implemented.

The base generic entity also defines a custom JSON encoder to avoid json.dumps failures: any object that is not JSON-serializable is converted to a string.
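
As a rough illustration of the string-fallback idea described above, the sketch below subclasses json.JSONEncoder from the standard library. The class name StringFallbackEncoder and the sample data are hypothetical; the encoder actually shipped with the module may differ.

```python
import json
from datetime import date


class StringFallbackEncoder(json.JSONEncoder):
    """Hypothetical encoder: stringify anything json cannot serialize."""

    def default(self, o):
        # json only calls default() for objects it cannot encode natively,
        # so regular dicts, lists, strings and numbers are unaffected.
        return str(o)


# A date is not JSON-serializable by default; here it becomes a string.
entity = {"name": "sample_1", "release": date(2024, 1, 31)}
encoded = json.dumps(entity, cls=StringFallbackEncoder)
decoded = json.loads(encoded)
```

Without the custom encoder, the json.dumps call above would raise a TypeError because of the date value.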

Mandatory arguments:

  • metadata_content: Metadata content for the entity, in JSON format.

  • field_mapping: Map <archive_key_name>:<metadata_content_key_name> in JSON format

  • keep_non_mapped_fields: Boolean, whether to keep the non-mapped fields.

Optional arguments:

  • verbose: set to True if you want INFO and above-level logging events. If not set or set to False, only WARNING and above will be displayed

Subclasses of GenericEntity must define the following methods/properties:

  • @GenericEntity.setter

  • id property

  • accession property

  • validate

  • flatten

  • dictionary special methods (‘__setitem__’, ‘__delitem__’, ‘__getitem__’, ‘__contains__’)

Subclasses of GenericEntity SHOULD define the following methods/properties:

  • guidelines
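
The required interface listed above could look roughly like the following. This is a minimal standalone sketch: MinimalEntity is a hypothetical class that does not inherit from GenericEntity, it stores everything flat, and it leaves out validate so that it does not depend on Pydantic.

```python
class MinimalEntity:
    """Hypothetical stand-in showing the dict-like interface (not biobroker code)."""

    def __init__(self, metadata_content):
        self._entity = {}
        for key, value in metadata_content.items():
            self[key] = value  # route every assignment through __setitem__

    @property
    def entity(self):
        return self._entity

    @property
    def id(self):
        # Assumption: the ID lives under 'name'; default to an empty string.
        return self._entity.get("name", "")

    @property
    def accession(self):
        return self._entity.get("accession", "")

    def flatten(self):
        # Nothing is nested in this sketch, so a shallow copy is enough.
        return dict(self._entity)

    def __setitem__(self, key, value):
        self._entity[key] = value

    def __getitem__(self, item):
        return self._entity[item]

    def __delitem__(self, key):
        del self._entity[key]

    def __contains__(self, item):
        return item in self._entity

    @staticmethod
    def guidelines():
        return "Provide at least a 'name' for each entity."


sample = MinimalEntity({"name": "sample_1", "organism": "Homo sapiens"})
```

With these methods in place, sample["organism"], "name" in sample and del sample["organism"] all behave as they would on a dictionary.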

Aspects to improve:

  • Biosamples entity: taxonId or organism must be set up. Currently allows entities to be created without those fields.

  • Biosamples entity: the root keys are currently hardcoded in the submodule. It is not yet clear what the best way to declare them would be without complicating the code or depending too much on external files. They are unlikely to change much, but keeping them in the code is still not good practice.

class GenericEntity(metadata_content, data_model, verbose=False)

Bases: object

Generic definition of metadata entity.

Parameters:

  • metadata_content (dict) – dictionary with the content of the entity

  • data_model – BaseModel subclass determining the metadata model used to validate the metadata_content.

property entity: dict

Entity getter.

Returns:

self._entity

property id: str

id property. Must be overridden by subclasses.

Returns:

string with the ID of the entity

property accession: str

accession property. Must be overridden by subclasses.

Returns:

string with the accession of the entity

validate(data_model)

Validate the metadata content using a Pydantic data model. Each subclass can define its own data model, or it can be provided by the user on validation.

flatten()

Flatten the .entity, returning a non-nested dictionary.

__setitem__(key, value)

Must be overridden by subclasses.

__delitem__(key)

Must be overridden by subclasses.

__getitem__(item)

Must be overridden by subclasses.

__contains__(item)

Must be overridden by subclasses.

static guidelines()

Return a printable string with guidelines on how to fill out each entity subclass. Subclasses are not required to override this method, but… it does help, so try to be nice :)

Return type:

str

Returns:

empty string. This is a generic class and should never be used as is!

to_json()

class Biosample(metadata_content, data_model=<class 'biobroker.generic.pydantic_model.BiosampleGeneralModel'>, delimiter='||', verbose=False)

Bases: GenericEntity

Biosamples metadata entity. Contains the necessary information to process a non-nested JSON into a valid Biosamples sample.

Current known issue: Only setting up and expecting one value for the properties inside characteristics. Well, this is because… Biosamples also expects that! No clue why properties are defaulting to arrays.

Parameters:
  • metadata_content (dict) – non-nested dictionary containing the metadata for the sample.

  • data_model (Type[BaseModel]) – Optional parameter, used to validate the metadata content. Defaults to biobroker.generic.pydantic_model.BiosampleGeneralModel

  • delimiter (str) – optional parameter, used as the key delimiter. Mainly used to manage attribute tags, such as ‘unit’ and ‘ontologyTerms’. Explained further in __setitem__().

  • verbose (bool) – True if logger should be set to INFO. Default WARNING.

ROOT_PROPERTIES = ['name', 'release', 'relationships', 'accession', 'sraAccession', 'webinSubmissionAccountId', 'status', 'update', 'characteristics', 'submittedVia', 'create', '_links', 'submitted', 'taxId', 'structuredData', 'externalReferences', 'organization']
VALID_TAGS = ['text', 'ontologyTerms', 'unit', 'tag']
VALID_RELATIONSHIPS = ['derived_from', 'same_as']
EXTERNAL_REFERENCE_FIELD = 'url'
ORGANIZATION = 'organizationName'

property id

Return the ‘id’ property, extracted from the ‘name’ property. Defaults to an empty string.

Returns:

string with the ID of the entity, or an empty string if ‘name’ is not set

property accession: str

Return the ‘accession’ property, extracted from the ‘accession’ root key. Defaults to an empty string.

Returns:

string with the accession of the entity, or an empty string if it is not set

property entity: dict

Entity getter.

Returns:

self._entity

flatten()

Flatten the entity property and return a non-nested dictionary. This will be mostly used for output generation.

Return type:

dict

Returns:

flattened dictionary

__getitem__(item)

Special method to get values from the Biosample.entity. Tries to obtain it from root and then characteristics; raises ValueError if not found.

Parameters:

item – key to look up

Return type:

str | int | dict

Returns:

Value of the item if found.
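
The root-then-characteristics lookup described above might be sketched as follows. The function name lookup is hypothetical, and returning the ‘text’ of the first characteristics entry is an assumption based on the single-value behaviour noted under the known issue; the real method may return something richer.

```python
def lookup(entity, item):
    """Hypothetical helper: look for `item` at the root, then in characteristics."""
    if item in entity:
        return entity[item]
    characteristics = entity.get("characteristics", {})
    if item in characteristics:
        # Assumption: one value per attribute, stored as [{'text': ...}].
        return characteristics[item][0]["text"]
    raise ValueError(f"'{item}' not found in root properties or characteristics")


entity = {
    "name": "sample_1",
    "characteristics": {"size": [{"text": 1, "unit": "cm"}]},
}
```

Here lookup(entity, "name") is resolved at the root, lookup(entity, "size") falls through to the characteristics, and an unknown key raises ValueError.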

__delitem__(key)

Special method to delete the values from the Biosample.entity. You can delete tags by using delimiter e.g. temperature||unit

Parameters:

key (str) – Key to search for and delete

__setitem__(key, value)

This special method sets values on “entity” in a dict-like manner. Each entity can decide to implement checks (depending on the archive’s needs) or simply perform “self.entity[key] = value”.

Tags for the attributes can be set from the flattened input dictionary. Each sample has a delimiter set up; if the delimiter is detected in a key, the tag is added to the attribute instead of replacing its value (e.g. {‘size’: 1, ‘size||unit’: ‘cm’} is translated to {‘size’: [{‘text’: 1, ‘unit’: ‘cm’}]}). Please see https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_sample for more information.

Parameters:
  • key (str) – name of the attribute.

  • value (Any) – value of the attribute.
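
The delimiter handling described above can be sketched as a standalone function. set_value is a hypothetical name, ROOT_PROPERTIES is shortened for brevity, and raising ValueError for unknown tags is an assumption; the sketch only aims to reproduce the ‘size’ / ‘size||unit’ example.

```python
ROOT_PROPERTIES = ["name", "release", "accession"]  # shortened for the sketch
VALID_TAGS = ["text", "ontologyTerms", "unit", "tag"]


def set_value(entity, key, value, delimiter="||"):
    """Hypothetical helper: set a root property, an attribute or an attribute tag."""
    if key in ROOT_PROPERTIES:
        entity[key] = value
        return
    if delimiter in key:
        # 'size||unit' -> attribute 'size', tag 'unit'
        attribute, tag = key.split(delimiter, 1)
    else:
        # A plain attribute value is stored under the 'text' tag.
        attribute, tag = key, "text"
    if tag not in VALID_TAGS:
        raise ValueError(f"'{tag}' is not a valid tag")
    characteristics = entity.setdefault("characteristics", {})
    # Assumption: a single entry per attribute, kept as a one-element list.
    entry = characteristics.setdefault(attribute, [{}])[0]
    entry[tag] = value


entity = {}
for key, value in {"name": "sample_1", "size": 1, "size||unit": "cm"}.items():
    set_value(entity, key, value)
```

After the loop, the ‘size’ attribute holds a single entry carrying both its ‘text’ value and its ‘unit’ tag, while ‘name’ stays at the root.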

__contains__(item)

Special method to check if Biosample.entity contains ‘item’.

Parameters:

item (str) – value of the key to check for.

Return type:

bool

Returns:

True if found, False if not found.

add_relationship(source, target, relationship)

Add a relationship to the entity. Source must be the entity’s accession; target must be a valid BioSamples accession; relationship must be a valid relationship.

Parameters:
  • source – source entity accession (must be equal to the entity’s accession)

  • target – target sample accession

  • relationship – Relationship between source and target. Valid relationships: VALID_RELATIONSHIPS
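
The three checks described above could be sketched like this. Everything here is illustrative: the standalone function, the is_accession callable, the use of plain ValueError instead of the module’s own exceptions, and the stored {‘source’, ‘type’, ‘target’} shape are all assumptions.

```python
VALID_RELATIONSHIPS = ["derived_from", "same_as"]


def add_relationship(entity, source, target, relationship, is_accession):
    """Hypothetical helper: validate and append a relationship to `entity`."""
    if source != entity.get("accession"):
        raise ValueError(f"source '{source}' is not this entity's accession")
    if not is_accession(target):
        raise ValueError(f"target '{target}' is not a valid accession")
    if relationship not in VALID_RELATIONSHIPS:
        raise ValueError(f"'{relationship}' is not a valid relationship")
    # Assumed storage shape; the real entity may translate the type name.
    entity.setdefault("relationships", []).append(
        {"source": source, "type": relationship, "target": target}
    )


entity = {"name": "sample_1", "accession": "SAMEA0000001"}
add_relationship(
    entity,
    source="SAMEA0000001",
    target="SAMEA0000002",
    relationship="derived_from",
    is_accession=lambda acc: acc.startswith("SAM"),  # placeholder check
)
```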

add_external_reference(url)

Add an external reference to the entity.

Parameters:

url (str) – URL to the external reference

add_organization(organization)

_flatten_relationships(flattened_json, relationships)

Flatten relationships. To make it user friendly, limit it to Biosample –> target relationships. Biosamples defines both directionalities, but for flattening, this is way simpler and no information is lost.

Parameters:
  • flattened_json (dict) – Flattened dictionary in progress.

  • relationships (list[dict]) – list of relationship dictionaries.

Return type:

dict

Returns:

flattened dictionary with the processed relationships incorporated.

_flatten_characteristics(flattened_json, characteristics)

Flatten the characteristics.

Parameters:
  • flattened_json (dict) – Flattened dictionary in progress.

  • characteristics (dict) – dictionary with the characteristics.

Return type:

dict

Returns:

flattened dictionary with the processed characteristics incorporated.
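
A minimal sketch of how characteristics might be flattened back into delimiter-separated keys is shown below. The function is a standalone approximation, not the module’s code; it assumes one entry per attribute and that the ‘text’ tag maps to the bare attribute name.

```python
def flatten_characteristics(flattened_json, characteristics, delimiter="||"):
    """Hypothetical helper: turn nested characteristics into flat keys."""
    for attribute, entries in characteristics.items():
        entry = entries[0]  # assumption: a single value per attribute
        for tag, value in entry.items():
            # 'text' becomes the plain key; other tags get '<attribute>||<tag>'.
            key = attribute if tag == "text" else f"{attribute}{delimiter}{tag}"
            flattened_json[key] = value
    return flattened_json


flat = flatten_characteristics(
    {"name": "sample_1"},
    {"size": [{"text": 1, "unit": "cm"}]},
)
```

This is the inverse of the tag handling performed when setting values, so a flattened input dictionary should survive a round trip through the entity.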

_flatten_urls(flattened_json, urls)

Flatten the externalReferences.

Parameters:
  • flattened_json (dict) – Flattened dictionary in progress.

  • urls (dict) – dictionary with the urls.

Returns:

flattened dictionary with the processed urls incorporated.

static check_accession(accession)

Check if the provided accession conforms to a BioSamples identifier. Pattern extracted from https://registry.identifiers.org/registry/biosample#!

Parameters:

accession – Accession ID for the sample in BioSamples

Return type:

bool

Returns:

True if correct format, False if not
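
A format check of this kind is typically a single regular expression. In the sketch below, the pattern is an assumption about what the identifiers.org registry lists for BioSamples and should be verified against the linked page before being relied on.

```python
import re

# Assumed pattern; check it against the identifiers.org entry for 'biosample'.
ACCESSION_PATTERN = re.compile(r"^SAM[NED](\w)?\d+$")


def check_accession(accession):
    """Return True if `accession` looks like a BioSamples identifier."""
    return isinstance(accession, str) and ACCESSION_PATTERN.match(accession) is not None
```

Under this pattern, "SAMEA1234567" and "SAMN12345" are accepted, while "ERS123" and non-string values are rejected.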

static _tag_is_valid(tag)

Check if a tag is valid. Tags are evaluated against the VALID_TAGS class attribute. The valid tags were extracted from: https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_sample

Parameters:

tag (str) – string with the tag name

Return type:

bool

Returns:

True if valid, False if invalid

static guidelines()

Guidelines for filling out sample metadata for BioSamples.

Return type:

str

Returns:

Printable string with guidelines.

Exceptions

exception EntityValidationError(logger, entity_id, errors)

Bases: Exception

exception NoNameSetError(logger, sample_id)

Bases: Exception

Name has not been set in the sample

exception NameShouldBeStringError(logger, name)

Bases: Exception

Name should be a string, no other types allowed

exception RelationshipInvalidSourceError(logger, sample_id, source)

Bases: Exception

Invalid source for relationship

exception RelationshipInvalidTargetError(logger, sample_id, target)

Bases: Exception

Invalid target for relationship

exception NoOrganismSetError(logger, sample_id)

Bases: Exception