Input processor

GenericInputProcessor

Generic input processor.

TsvInputProcessor

TSV input processor.

XlsxInputProcessor

XLSX input processor.

Input metadata processor. This module reads an input file and transforms it into a list of dictionaries containing the metadata for the different entities. Once the data is in this flattened, JSON-like state, it can be processed into each of the different entities.

Mandatory arguments:

  • input_data: Path to the input file, or the already loaded content. Each subclass processor decides which of the two it accepts.

Optional arguments:

  • verbose: Set to True to display logging events of level INFO and above. If unset or set to False, only events of level WARNING and above are displayed.
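The effect of verbose can be illustrated with the standard logging module. This is only a sketch of the behaviour described above: the helper configure_logger and the logger name are hypothetical and not part of this package.

```python
import logging


def configure_logger(verbose=False):
    """Return a logger whose threshold follows the `verbose` flag.

    Hypothetical helper for illustration; the package's own logger
    setup may differ.
    """
    logger = logging.getLogger("input_processor_verbose_sketch")
    # INFO and above when verbose, otherwise WARNING and above.
    logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return logger


if __name__ == "__main__":
    logging.basicConfig()
    log = configure_logger(verbose=True)
    log.info("Shown only because verbose=True")
    log.warning("Shown in both modes")
```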

Subclasses of GenericInputProcessor must define the following methods/properties:

  • @input_data.setter
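A minimal sketch of what defining the setter in a subclass could look like. The classes SketchInputProcessor and SketchTsvProcessor are hypothetical stand-ins, not the package's code, and the subclass takes TSV text rather than a file path so that the example is self-contained.

```python
import csv
import io


class SketchInputProcessor:
    """Stand-in for GenericInputProcessor (assumption, not the real class)."""

    def __init__(self, input_data):
        # Assignment goes through the setter defined by the concrete subclass.
        self.input_data = input_data

    @property
    def input_data(self):
        return self._input_data

    @input_data.setter
    def input_data(self, value):
        raise NotImplementedError("Subclasses must define the input_data setter")


class SketchTsvProcessor(SketchInputProcessor):
    """Parses tab-separated text into a list of dictionaries."""

    @property
    def input_data(self):
        return self._input_data

    @input_data.setter
    def input_data(self, value):
        # Here `value` is TSV text; a real subclass would likely open a path.
        reader = csv.DictReader(io.StringIO(value), delimiter="\t")
        self._input_data = [dict(row) for row in reader]


if __name__ == "__main__":
    processor = SketchTsvProcessor("name\torganism\ns1\tHomo sapiens\n")
    print(processor.input_data)
```

Redefining both the getter and the setter in the subclass keeps the property self-contained; every value is read as a string, so any type conversion would have to happen later.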

Aspects to improve:

  • Currently, process() can fail at any point if any entity fails to validate. This could be handled in two ways:

    • Catch all exceptions (meh), log them, and create the rest of the entities.

    • Current behaviour: fail outright and return nothing. I really like this option because, for me, the contents of an input (spreadsheet, TSV file, etc.) probably have meaning together and should stay together. It could be improved by logging all errors first, then failing.
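The suggested improvement (log every error, then fail) could be sketched as follows. The names EntityValidationError, build_all_or_fail and require_id are hypothetical and only illustrate the idea; nothing here is the package's current implementation.

```python
import logging

logger = logging.getLogger("input_processor_errors_sketch")


class EntityValidationError(Exception):
    """Hypothetical aggregate error carrying every (row index, error) pair."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} row(s) failed validation")
        self.errors = errors


def build_all_or_fail(rows, entity_factory):
    """Try every row, log each failure, and raise once at the end."""
    entities, errors = [], []
    for index, row in enumerate(rows):
        try:
            entities.append(entity_factory(row))
        except Exception as error:  # broad on purpose: collect, do not stop
            logger.error("Row %d failed validation: %s", index, error)
            errors.append((index, error))
    if errors:
        # All-or-nothing: no partial result is returned.
        raise EntityValidationError(errors)
    return entities


def require_id(row):
    """Toy factory used for illustration: a row is valid if it has an 'id'."""
    if "id" not in row:
        raise ValueError("missing 'id'")
    return row["id"]
```

This keeps the all-or-nothing behaviour while giving the user the full list of problems in one run instead of one failure at a time.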

class GenericInputProcessor(input_data_path, verbose=False)

Bases: object

Generic input processor.

Parameters:
  • input_data_path (str) – Path to the input file.

  • verbose (bool) – Boolean indicating if the logger should be verbose.

property input_data: list[dict]

Input data in JSON format. Set from the input path by the setter.

Returns:

Input data in JSON format

process(entity)

Process self.input_data and return a list of metadata entities whose type is the GenericEntity subclass passed to the function.

Any field with value “None” should be processed by the entity type; different services may require different behaviours.

Parameters:

entity (Type[GenericEntity]) – GenericEntity subclass (the class itself, not an instance) to process the input data into.

Return type:

list[GenericEntity]

Returns:

List of entities. Each one is an instance of a GenericEntity subclass.
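As a rough illustration of passing a class rather than an instance, the sketch below builds one entity per row. SketchSample is a hypothetical stand-in for a GenericEntity subclass, the standalone process function is not the library's method, and the sketch assumes that “None” means Python's None and that the entity replaces it with a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSample:
    """Hypothetical stand-in for a GenericEntity subclass."""

    name: str
    organism: Optional[str] = None

    def __post_init__(self):
        # Assumed policy for this sketch only: the entity, not the
        # processor, decides what a None value becomes.
        if self.organism is None:
            self.organism = "not provided"


def process(input_data, entity):
    """Build one `entity` instance per row; `entity` is a class, not an instance."""
    return [entity(**row) for row in input_data]


if __name__ == "__main__":
    rows = [{"name": "s1", "organism": None}]
    print(process(rows, SketchSample))
```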

transform(field_mapping, delete_non_mapped_fields=False)

Transform the input data retrieved from the source to have new keys based on an {<old_key>: <new_key>} map. Only simple key name change operations are allowed.

Parameters:
  • field_mapping (dict) – {‘old_key’: ‘new_key’} dictionary. Not nested.

  • delete_non_mapped_fields (bool) – Boolean indicating whether fields not present in the map above should be deleted. Defaults to False.
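The renaming described above can be sketched as a standalone function. This is an illustration under assumptions, not the method itself: it takes the rows as an argument and returns a new list, whereas the documented transform() has no documented return value and presumably works on self.input_data.

```python
def transform(rows, field_mapping, delete_non_mapped_fields=False):
    """Rename top-level keys of each row according to `field_mapping`."""
    transformed = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key in field_mapping:
                # Simple rename: the value is carried over unchanged.
                new_row[field_mapping[key]] = value
            elif not delete_non_mapped_fields:
                # Unmapped keys survive unless deletion was requested.
                new_row[key] = value
        transformed.append(new_row)
    return transformed


if __name__ == "__main__":
    rows = [{"Sample Name": "s1", "Notes": "kept or dropped"}]
    mapping = {"Sample Name": "name"}
    print(transform(rows, mapping))
    print(transform(rows, mapping, delete_non_mapped_fields=True))
```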

class TsvInputProcessor(input_data)

Bases: GenericInputProcessor

TSV input processor. Loads a TSV file with entity metadata.

Parameters:
  • input_data – Path to the file with the input metadata

  • field_mapping – Path to the file with the field mapping

property input_data: list[dict]

Input data in JSON format. Set from the input path by the setter.

Returns:

Input data in JSON format

class XlsxInputProcessor(input_data, sheet_name='Sheet1')

Bases: GenericInputProcessor

XLSX input processor. Loads an XLSX file with entity metadata.

Parameters:
  • input_data (str) – Path to the file with the input metadata.

  • sheet_name – Name of the worksheet to be processed. Defaults to 'Sheet1'.

property input_data: list[dict]

Input data in JSON format. Set from the input path by the setter.

Returns:

Input data in JSON format

Exceptions

Placeholder submodule for input processor-related exceptions.