tarexp.workflow module#

An instance of the Workflow class executes the user’s declarative specification of a TAR workflow. In doing so, it calls on tarexp.component.Component for the services named in that specification, such as creating training batches, scoring and ranking the collection, and testing for stopping conditions.

After an optional initial seed round where the user can specify a starting set of labeled training data, the workflow is executed as a sequence of training rounds. Each round consists of selecting a batch of training documents (using tarexp.component.Sampler), looking up labels for those documents (using tarexp.component.Labeler), training a model and scoring and ranking the collection documents (using tarexp.component.Ranker).

TARexp supports specifications of both one and two-phase TAR workflows, as described in Yang et al. [1]. One-phase workflows (OnePhaseTARWorkflow) can be run for a fixed number of training rounds, or until all documents have been reviewed. Two-phase reviews also use a stopping rule to determine when to end training, but then follow that by ranking the collection with the final trained model and reviewing to a statistically determined cutoff.

Workflow is implemented as a Python iterator, allowing procedures defined outside the workflow to execute at each round. The iterator yields a tarexp.ledger.FrozenLedger. The user can define a custom per-round evaluation process or record information for later analysis.
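The per-round loop this iterator enables can be sketched with a toy, self-contained illustration (the class and attribute names here are hypothetical stand-ins, not the tarexp API):

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass(frozen=True)
class FrozenSnapshot:
    """Immutable record of which documents were reviewed in each round,
    standing in for the FrozenLedger the real workflow yields."""
    rounds: Tuple[Tuple[int, ...], ...]

class ToyWorkflow:
    """Minimal iterator-style workflow: each round selects a batch of
    unreviewed documents, 'labels' them, and yields a frozen snapshot."""

    def __init__(self, n_docs: int, batch_size: int, max_rounds: int):
        self.n_docs = n_docs
        self.batch_size = batch_size
        self.max_rounds = max_rounds
        self.reviewed: List[List[int]] = []

    def __iter__(self) -> Iterator[FrozenSnapshot]:
        unreviewed = list(range(self.n_docs))
        for _ in range(self.max_rounds):
            if not unreviewed:
                break
            batch = unreviewed[:self.batch_size]      # select a training batch
            unreviewed = unreviewed[self.batch_size:]
            self.reviewed.append(batch)               # labels looked up here
            yield FrozenSnapshot(tuple(map(tuple, self.reviewed)))

# Per-round evaluation defined outside the workflow, via the iterator:
history = [len(snap.rounds)
           for snap in ToyWorkflow(n_docs=10, batch_size=4, max_rounds=5)]
```

Because the snapshot is yielded each round, custom evaluation or logging code can run between rounds without modifying the workflow itself.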

WorkflowReplay is a special kind of workflow that replays an existing workflow record. The replay executes the original TAR run without changing the documents reviewed at each round. It can be used to gather additional information throughout the process and to test other components, such as stopping rules, against an existing TAR run.


class tarexp.workflow.Workflow(dataset: Dataset, component: Component, max_round_exec: int = -1, saved_score_limit: int = -1, saved_checkpoint_limit: int = 2, random_seed: int | None = None, resume: bool = False, **kwargs)[source]#

Bases: Savable

Meta workflow class.

The meta workflow class provides the essential interface and bookkeeping for most workflows. It also implements the iterator interface, which yields a tarexp.ledger.FrozenLedger at each round. The state of a workflow instance is stored in the ledger.

All workflows inheriting this class should implement the step() method, which defines the TAR process at each step. Any initialization of the workflow, including sampling the control set, should be implemented in the __init__() method. getMetrics() and makeReplay() are optional.

Warning

Workflow initialization raises a warning when the specified component does not provide all four essential roles for running a TAR experiment (ranker, stopping rule, labeler, and sampler). The workflow can still execute with one or more of them missing, but it simply skips those processes silently. The user needs to perform these processes manually in the iterator block, or after invoking step(), to ensure the workflow executes correctly.

Parameters:
  • dataset – Textual dataset the workflow will search on. When running experiments, it should contain gold labels, although this is not strictly required (exceptions may be raised depending on the labeler). The user can instead provide labels to the workflow manually at each step.

  • component – A combined component that ideally contains all four essential roles. Components can be combined using tarexp.component.base.combine().

  • max_round_exec – Maximum number of rounds the workflow will execute before the stopping rule suggests stopping. -1 indicates no limit. Default is -1.

  • saved_score_limit – Number of rounds of document scores to keep in memory. A value smaller than 0 indicates no limit; 0 is not allowed. This also affects what is saved in checkpoints, so for full replay capability the value should be < 0. Default is -1.

  • saved_checkpoint_limit – Number of workflow checkpoints to keep on disk in the output directory. When the limit is reached, older checkpoints are deleted silently. Default 2 (fewer than 2 is not recommended for stability, since the latest checkpoint could be incomplete if a failure happens during saving).

  • random_seed – Random seed for any randomization process (if any) used in the workflow. Default None.

  • resume – If True, the components are not initialized. Used internally by the load() method. Default False.
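The silent-skipping behavior described in the warning above can be illustrated with a toy combined component (all names here are illustrative, not the tarexp API):

```python
class ToyComponent:
    """Bundle of the four roles; a missing role is simply None."""
    def __init__(self, ranker=None, labeler=None, sampler=None, stopping_rule=None):
        self.ranker, self.labeler = ranker, labeler
        self.sampler, self.stopping_rule = sampler, stopping_rule

def toy_step(component):
    """One TAR step that silently skips any absent role."""
    executed = []
    if component.sampler is not None:
        executed.append("sample")          # select the review batch
    if component.labeler is not None:
        executed.append("label")           # look up labels
    if component.ranker is not None:
        executed.append("train-and-score")
    return executed

full = toy_step(ToyComponent(ranker=object(), labeler=object(),
                             sampler=object(), stopping_rule=object()))
# Without a labeler, that stage is skipped with no warning; the caller
# must supply labels manually before the next step.
partial = toy_step(ToyComponent(ranker=object(), sampler=object()))
```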

property ledger: Ledger#
property component: Component#
property dataset: Dataset#
property n_rounds: int#

Number of rounds that have been executed.

property isStopped: bool#

Whether the workflow has stopped. A workflow stops when the stopping rule suggests stopping, when all documents have been reviewed, or when the maximum number of rounds (max_round_exec) is reached.

The decision is cached, so the stopping rule is consulted only once for efficiency.

property latest_scores: ndarray#

The latest set of document scores.

save(output_dir, with_component=True, overwrite=False)[source]#

Save the workflow state.

The workflow state (checkpoint) is saved to a directory containing all essential attributes, including the ledger and the components. Document scores are also stored, but if saved_score_limit is not < 0, the stored scores will be incomplete and may contain only the scores from the latest round. Incomplete scores are sufficient for resuming an incomplete workflow, but might not be sufficient for replay experiments, especially ones that leverage the document scores to estimate progress (e.g., tarexp.component.stopping.QuantStoppingRule).

Important

The underlying dataset is not saved along with the workflow, as it is assumed to be static regardless of whether it contains the gold labels. However, a hash of the dataset is stored in the checkpoint to verify that the dataset provided at loading time is consistent.

Parameters:
  • output_dir – Path to the directory containing the checkpoints.

  • with_component – Whether to save the components along with the workflow. Default True. If the components are not saved, resuming the workflow is unsupported.

  • overwrite – Whether to overwrite a checkpoint of the current round if one exists. Default False.

classmethod load(saved_path, dataset: Dataset, force=False)[source]#

Load the workflow from a checkpoint.

An identical dataset must be provided when loading the workflow.

Parameters:
  • saved_path – Path to either the directory containing the checkpoints or a specific checkpoint. If multiple checkpoints exist in the provided directory, the latest checkpoint will be selected and loaded.

  • dataset – The dataset the workflow was using before saving.

  • force – Whether to skip checking the hash of the dataset. Turning this on is not recommended, since it could result in inconsistent workflow behavior. Default False.
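The checkpoint-rotation and latest-checkpoint-selection behavior described for save() and load() can be sketched with plain files (a toy illustration; the filenames and helpers are hypothetical, not tarexp's on-disk format):

```python
import os
import tempfile

def toy_save(out_dir: str, n_round: int, limit: int = 2) -> None:
    """Write a checkpoint for this round, then silently delete the
    oldest checkpoints so at most `limit` remain (mirroring
    saved_checkpoint_limit)."""
    path = os.path.join(out_dir, f"checkpoint.{n_round:04d}")
    with open(path, "w") as f:
        f.write(f"round={n_round}")
    ckpts = sorted(p for p in os.listdir(out_dir) if p.startswith("checkpoint."))
    for old in ckpts[:-limit]:
        os.remove(os.path.join(out_dir, old))

def toy_load_latest(saved_path: str) -> str:
    """If given a directory, pick the latest checkpoint inside it,
    as load() does when handed a checkpoint directory."""
    if os.path.isdir(saved_path):
        ckpts = sorted(p for p in os.listdir(saved_path)
                       if p.startswith("checkpoint."))
        saved_path = os.path.join(saved_path, ckpts[-1])
    with open(saved_path) as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    for r in range(5):
        toy_save(d, r, limit=2)
    remaining = sorted(os.listdir(d))  # only the two newest survive
    latest = toy_load_latest(d)
```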

step(force: bool = False)[source]#

Abstract method defining a TAR step in the workflow. Workflow classes inheriting this meta class should implement this method.

Ideally, a TAR step should start by checking whether the workflow has stopped by consulting isStopped, and then proceed to the other parts, including retraining the ranker and suggesting documents for review.

Parameters:

force – Whether to force execution of a TAR step even if the workflow has already stopped.

getMetrics(measures: List[OptimisticCost | ir_measures.measures.Measure | str]) Dict[MeasureKey, int | float][source]#

Abstract method for providing evaluation metric values at the current round.

Each workflow inheriting this meta class can define its own method for evaluating the measures.

Parameters:

measures – List of evaluation measures.

makeReplay() WorkflowReplay[source]#

Abstract method for creating replay workflow based on the current TAR workflow.

class tarexp.workflow.WorkflowReplay(dataset: Dataset, ledger: FrozenLedger, saved_scores: OrderedDict | None = None, random_seed: int | None = None, **kwargs)[source]#

Bases: Workflow

Meta workflow replay class.

An existing workflow can be transformed into a WorkflowReplay that freezes the past states of the workflow and replays the process. The replay supports efficient experiments on components and procedures that do not interfere with the workflow, such as control sets and stopping rules. Since a replay operates on an existing workflow, it is not savable.

Each replay workflow corresponds to an actual workflow. They should be implemented in pairs, with Replay appended to the class name for consistency within the module.

Parameters:
  • dataset – Instance of the dataset the workflow replay will operate on. Ideally it should be the same one the original workflow operated on, but this is not strictly enforced.

  • ledger – The frozen ledger that records the past states of the workflow.

  • saved_scores – The document scores from each TAR round. It is optional depending on the kind of experiments the user intends to run. Experiments such as testing the size or the sampling process of the control set do not require document scores.

  • random_seed – The random seed for the randomization in the workflow if needed.
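The replay idea, testing an alternative stopping rule against recorded review decisions without re-running the TAR process, can be sketched as follows (a toy with hypothetical names, not the WorkflowReplay API):

```python
def toy_replay(recorded_rounds, stopping_rule):
    """Replay recorded rounds (each a list of (doc_id, is_relevant)
    review decisions) and report the round at which the candidate
    stopping rule would have fired, or None if it never fires."""
    found = 0
    for i, decisions in enumerate(recorded_rounds):
        found += sum(1 for _, rel in decisions if rel)
        if stopping_rule(round_idx=i, total_found=found):
            return i
    return None

record = [
    [(0, True), (1, True)],    # round 0: 2 relevant found
    [(2, False), (3, True)],   # round 1: 1 more
    [(4, False), (5, False)],  # round 2: none
]
# Candidate rule: stop once 3 relevant documents have been found.
stop_round = toy_replay(record, lambda round_idx, total_found: total_found >= 3)
```

Because the reviewed documents are fixed by the record, any number of candidate rules can be evaluated cheaply against the same run.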

property n_rounds#

Number of rounds that have been executed.

property ledger#
property isStopped#

Whether the workflow has stopped. A workflow stops when the stopping rule suggests stopping, when all documents have been reviewed, or when the maximum number of rounds (max_round_exec) is reached.

The decision is cached, so the stopping rule is consulted only once for efficiency.

step()[source]#

Replay a TAR step.

save(*args, **kwargs)[source]#

Not supported in workflow replay.

class tarexp.workflow.OnePhaseTARWorkflow(dataset: Dataset, component: Component, seed_doc: list = [], batch_size: int = 200, control_set_size: int = 0, **kwargs)[source]#

Bases: Workflow

One Phase TAR Workflow

This class defines a one-phase TAR workflow that samples a fixed-size batch of documents at each round based on a sampling strategy that takes document scores as input. The suggested documents are reviewed by the human expert (simulated by revealing the gold label, or by another procedure defined in the component). The ranker is then retrained on all labeled documents, and the entire collection is scored and ranked by the updated ranker. Please refer to Yang et al. [1] for further details.

Please also refer to Workflow for other optional parameters.

review_candidates#

A list of document ids suggested for review at the current round.

Parameters:
  • dataset – Textual dataset the workflow will search on. When running experiments, it should contain gold labels, although this is not strictly required (exceptions may be raised depending on the labeler). The user can instead provide labels to the workflow manually at each step.

  • component – A combined component that ideally contains all four essential roles. Components can be combined using tarexp.component.base.combine().

  • seed_doc – A list of documents the human experts reviewed before the iterative process starts. Often referred to as the seed set in the eDiscovery community.

  • batch_size – The number of documents the human experts review in each TAR round.

  • control_set_size – Size of the control set.
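The per-round batch selection this workflow performs, reviewing the highest-scored documents not yet reviewed, can be sketched as a toy (names hypothetical; the real sampling strategy is supplied by the component's sampler role):

```python
def toy_select_batch(scores, reviewed, batch_size):
    """Pick the `batch_size` highest-scored documents that have not
    been reviewed yet (a relevance-feedback-style batch)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [d for d in ranked if d not in reviewed][:batch_size]

scores = {"d0": 0.9, "d1": 0.2, "d2": 0.7, "d3": 0.5}
reviewed = {"d0"}  # e.g., a seed document reviewed before the loop started
batch = toy_select_batch(scores, reviewed, batch_size=2)
```

After the batch is labeled, the ranker would be retrained and the collection rescored before the next selection.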

step(force=False)[source]#

Step of the one-phase TAR workflow.

Caution

If required roles are missing from the component, those processes are skipped silently, without warnings or exceptions.

Parameters:

force – Whether to force execution of a TAR step even if the workflow has already stopped.

getMetrics(measures: List[OptimisticCost | ir_measures.measures.Measure | str], labels: List | np.ndarray = None) Dict[MeasureKey, int | float][source]#

Calculate the evaluation measures at the current round

If the underlying dataset does not contain gold labels, they should be provided here in order to calculate the values.

Parameters:
  • measures – List of evaluation measures.

  • labels – Labels of the documents in the collection.

makeReplay() OnePhaseTARWorkflowReplay[source]#

Create a replay workflow that contains records up to the current round.

class tarexp.workflow.OnePhaseTARWorkflowReplay(dataset: Dataset, ledger: FrozenLedger, saved_scores=None)[source]#

Bases: WorkflowReplay, OnePhaseTARWorkflow

Replay workflow for OnePhaseTARWorkflow.

Parameters:
  • dataset – Instance of the dataset the workflow replay will operate on. Ideally it should be the same one the original workflow operated on, but this is not strictly enforced.

  • ledger – The frozen ledger that records the past states of the workflow.

  • saved_scores – The document scores from each TAR round. It is optional depending on the kind of experiments the user intends to run. Experiments such as testing the size or the sampling process of the control set do not require document scores.

property latest_scores#

The latest set of document scores.

property review_candidates#

A list of document ids suggested for review at the current round.

class tarexp.workflow.TwoPhaseTARWorkflow(*args, **kwargs)[source]#

Bases: OnePhaseTARWorkflow

Two-phase workflow extended from OnePhaseTARWorkflow.

The two-phase workflow adds a second-phase process, defined as a poststopping role in the component, which is executed after the stopping rule suggests stopping. If no such role exists in the component, an exception is raised.

Please refer to OnePhaseTARWorkflow for parameter documentation.
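The second phase, ranking the unreviewed remainder with the final model and reviewing down to a statistically determined cutoff, can be sketched as a toy (the function and cutoff here are hypothetical; the real cutoff comes from the poststopping role):

```python
def toy_second_phase(scores, reviewed, cutoff):
    """After first-phase stopping, rank the unreviewed remainder by the
    final model's scores and review down to a fixed cutoff."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    remainder = [d for d in ranked if d not in reviewed]
    return remainder[:cutoff]

scores = {"a": 0.8, "b": 0.6, "c": 0.4, "d": 0.1}
phase_two = toy_second_phase(scores, reviewed={"a"}, cutoff=2)
```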

step(*args, **kwargs)[source]#

Step of the two-phase TAR workflow: performs the one-phase TAR step and, if stopping is suggested, invokes the second-phase review process. Please refer to tarexp.workflow.OnePhaseTARWorkflow.step() for more documentation.