tarexp.workflow module#
An instance of class Workflow
executes the user’s declarative specification of a TAR workflow.
In doing so, it reaches out to tarexp.component.Component
for the services named in the specification,
such as creating training batches, scoring and ranking the collection, and testing stopping conditions.
After an optional initial seed round where the user can specify a starting set of labeled training data,
the workflow is executed as a sequence of training rounds. Each round consists of selecting a batch of
training documents (using tarexp.component.Sampler
), looking up labels for those documents
(using tarexp.component.Labeler
), training a model and scoring and ranking the collection
documents (using tarexp.component.Ranker
).
TARexp
supports specifications of both one- and two-phase TAR workflows, as described in Yang et al. [1].
One-phase workflows (OnePhaseTARWorkflow
) can be run for a fixed number of training rounds, or
until all documents have been reviewed. Two-phase reviews also use a stopping rule to determine when to end training,
but then follow that by ranking the collection with the final trained model and reviewing to a statistically determined cutoff.
Workflow
is implemented as a Python iterator, allowing procedures defined outside the workflow
to execute at each round. The iterator yields a tarexp.ledger.FrozenLedger
.
The user can define a custom per-round evaluation process or record information for later analysis.
WorkflowReplay
is a special kind of workflow that replays an existing workflow record.
The replay executes the original TAR run without changing the documents being reviewed at each round.
It can be used to gather more information throughout the process and to test other components,
such as stopping rules, on an existing TAR run.
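The iterator contract described above can be illustrated with a minimal stand-in. Note that `MiniWorkflow` and the two-field `FrozenLedger` below are simplified inventions for illustration only, not tarexp's actual classes; the real workflow trains a ranker and samples batches, which is elided here:

```python
from dataclasses import dataclass

# Frozen per-round snapshot, loosely mimicking tarexp.ledger.FrozenLedger.
@dataclass(frozen=True)
class FrozenLedger:
    round: int
    n_reviewed: int

class MiniWorkflow:
    """Toy workflow: each round 'reviews' one fixed-size batch."""
    def __init__(self, n_docs: int, batch_size: int, max_rounds: int):
        self.n_docs, self.batch_size, self.max_rounds = n_docs, batch_size, max_rounds
        self.round = -1
        self.n_reviewed = 0

    @property
    def isStopped(self) -> bool:
        # Stop when all documents are reviewed or the round cap is reached.
        return self.round >= self.max_rounds or self.n_reviewed >= self.n_docs

    def step(self):
        self.round += 1
        self.n_reviewed = min(self.n_docs, self.n_reviewed + self.batch_size)

    def __iter__(self):
        # The iterator yields a frozen snapshot after every round, so code
        # outside the workflow can evaluate or record information per round.
        while not self.isStopped:
            self.step()
            yield FrozenLedger(self.round, self.n_reviewed)

wf = MiniWorkflow(n_docs=1000, batch_size=200, max_rounds=10)
history = [ledger.n_reviewed for ledger in wf]
# each round reviews one more batch: 200, 400, 600, 800, 1000
```

The per-round `yield` is what lets users attach custom evaluation or logging between rounds without subclassing the workflow.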
- class tarexp.workflow.Workflow(dataset: Dataset, component: Component, max_round_exec: int = -1, saved_score_limit: int = -1, saved_checkpoint_limit: int = 2, random_seed: int | None = None, resume: bool = False, **kwargs)[source]#
Bases:
Savable
Meta workflow class.
The meta workflow class provides the essential interface and bookkeeping for most workflows. It also implements the iterator interface, which yields a
tarexp.ledger.FrozenLedger
at each round. The state of a workflow instance is stored in the ledger. All workflows inheriting this class should implement the
step()
method, which defines the TAR process at each step. Any initialization of the workflow, including sampling the control set, should be implemented in the __init__()
method. getMetrics()
and makeReplay()
are optional.Warning
Workflow initialization raises a warning when the specified component does not provide all four roles essential for running a TAR experiment (ranker, stopping rule, labeler, and sampler). The workflow can still execute with one or more of them missing, but it silently skips the corresponding processes. Users then need to perform these processes manually, either in the iterator block or after invoking
step()
, to ensure the workflow executes correctly.- Parameters:
dataset – Textual dataset the workflow will search on. When running experiments, it should contain gold labels, though this is not strictly required (exceptions might be raised depending on the labeler). Users can also provide labels to the workflow manually at each step.
component – A combined component that ideally contains all 4 essential roles. Components can be combined using
tarexp.component.base.combine()
.
max_round_exec – Maximum number of rounds the workflow will execute before the stopping rule suggests stopping. -1 indicates no limit. Default is -1.
saved_score_limit – Number of rounds for which the set of document scores is kept in memory. A value smaller than 0 indicates no limit; 0 is not allowed. This also affects the saved checkpoints, so for full replay capability the value should be < 0. Default is -1.
saved_checkpoint_limit – Number of workflow checkpoints kept on disk in the output directory. When the limit is reached, older checkpoints are deleted silently. Default is 2 (fewer than 2 is not recommended for stability, as the latest checkpoint could be incomplete if a failure happens during saving).
random_seed – Random seed for any randomization used in the workflow. Default is None.
resume – If True, the components are not initialized. Used internally by the load() method. Default is False.
- property n_rounds: int#
Number of rounds that have been executed.
- property isStopped: bool#
Whether the workflow has stopped. A workflow stops if the stopping rule suggests so, all documents have been reviewed, or the maximum number of rounds (
max_round_exec
) is reached. The decision is cached, so the stopping rule is only consulted once for efficiency.
- property latest_scores: ndarray#
The latest set of document scores.
- save(output_dir, with_component=True, overwrite=False)[source]#
Save the workflow state.
The workflow state (checkpoint) is saved as a directory containing all essential attributes, including the ledger and the components. The document scores are also stored, but if
saved_score_limit
is not < 0, the stored scores will be incomplete and may only contain the document scores from the latest round. Incomplete scores are sufficient for resuming an incomplete workflow, but might not be sufficient for replay experiments, especially ones that leverage the document scores to estimate progress (e.g., tarexp.component.stopping.QuantStoppingRule
).Important
The underlying dataset is not saved along with the workflow, as it is assumed to be static regardless of whether it contains gold labels. However, a hash of the dataset is stored in the checkpoint to verify that the dataset provided at loading time is consistent.
- Parameters:
output_dir – Path to the directory containing the checkpoints.
with_component – Whether the components are saved along with the workflow. Default is True. If the components are not saved, resuming the workflow is unsupported.
overwrite – Whether to overwrite a checkpoint of the current round if one exists. Default is False.
- classmethod load(saved_path, dataset: Dataset, force=False)[source]#
Load the workflow from a checkpoint.
The identical dataset must be provided when loading the workflow.
- Parameters:
saved_path – Path to either the directory containing the checkpoints or a specific checkpoint. If multiple checkpoints exist in the provided directory, the latest checkpoint will be selected and loaded.
dataset – The dataset the workflow was using before saving.
force – Whether to skip checking the hash of the dataset. Turning this on is not recommended, since it could result in inconsistent workflow behavior. Default is False.
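The dataset-hash verification performed at load time can be sketched as follows. This is a simplified illustration, not tarexp's actual implementation; the hashing scheme and field names are assumptions:

```python
import hashlib
import json

def dataset_hash(doc_ids, texts) -> str:
    # Hash a stable serialization of the collection. Gold labels are
    # deliberately excluded, matching the idea that the dataset is treated
    # as static regardless of whether labels are present.
    payload = json.dumps({"ids": list(doc_ids), "texts": list(texts)}).encode()
    return hashlib.sha256(payload).hexdigest()

def check_on_load(checkpoint: dict, doc_ids, texts, force: bool = False):
    # Compare the hash stored in the checkpoint against the dataset
    # provided at load time; `force=True` skips the check (not recommended).
    if not force and checkpoint["dataset_hash"] != dataset_hash(doc_ids, texts):
        raise ValueError("Dataset provided at loading time does not match "
                         "the one used when the checkpoint was saved.")

ids, texts = ["d1", "d2"], ["alpha", "beta"]
ckpt = {"dataset_hash": dataset_hash(ids, texts)}
check_on_load(ckpt, ids, texts)              # consistent: passes silently
check_on_load(ckpt, ids, texts, force=True)  # force skips the check entirely
```

Storing only the hash keeps checkpoints small while still catching the common mistake of resuming a workflow against a different collection.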
- step(force: bool = False)[source]#
Abstract method defining a TAR step in the workflow. Workflow classes inheriting this meta class should implement this method.
Ideally, a TAR step should start by checking whether the workflow has stopped by consulting
isStopped
and then proceed to the other parts, including retraining the ranker and suggesting documents for review.- Parameters:
force – Whether to force the execution of a TAR step even if the workflow has already stopped.
- getMetrics(measures: List[OptimisticCost | ir_measures.measures.Measure | str]) Dict[MeasureKey, int | float] [source]#
Abstract method for providing evaluation metric values at the current round.
Each workflow inheriting this meta class can define its own method for evaluating the measures.
- Parameters:
measures – List of evaluation measures.
- makeReplay() WorkflowReplay [source]#
Abstract method for creating replay workflow based on the current TAR workflow.
- class tarexp.workflow.WorkflowReplay(dataset: Dataset, ledger: FrozenLedger, saved_scores: OrderedDict | None = None, random_seed: int | None = None, **kwargs)[source]#
Bases:
Workflow
Meta workflow replay class.
An existing workflow can be transformed into a
WorkflowReplay
class that freezes the past states of the workflow and replays the process. Replays support efficient experiments on components and procedures that do not interfere with the workflow, such as control sets and stopping rules. Since a replay operates on an existing workflow, it is not savable. Each replay workflow corresponds to an actual workflow; the two should be implemented as a pair, with
Replay
at the end of the class name for consistency within the module.- Parameters:
dataset – The dataset instance the workflow replay will operate on. Ideally it is the same as the one the original workflow operated on, but this is not strictly enforced.
ledger – The frozen ledger that records the past states of the workflow.
saved_scores – The document scores from each TAR round. This is optional, depending on the kind of experiments the user intends to run. Experiments such as testing the size or sampling process of the control set do not require the document scores.
random_seed – The random seed for the randomization in the workflow if needed.
- property n_rounds#
Number of rounds that have been executed.
- property ledger#
- property isStopped#
Whether the workflow has stopped. A workflow stops if the stopping rule suggests so, all documents have been reviewed, or the maximum number of rounds (
max_round_exec
) is reached. The decision is cached, so the stopping rule is only consulted once for efficiency.
- class tarexp.workflow.OnePhaseTARWorkflow(dataset: Dataset, component: Component, seed_doc: list = [], batch_size: int = 200, control_set_size: int = 0, **kwargs)[source]#
Bases:
Workflow
One Phase TAR Workflow
This class defines a one-phase TAR workflow that samples a fixed-size batch of documents at each round based on a sampling strategy that takes document scores as input. The suggested documents are reviewed by the human expert (simulated by revealing the gold label, or by another procedure defined in the component). The ranker is then retrained on all labeled documents, and the entire collection is scored and ranked by the updated ranker. Please refer to Yang et al. [1] for further reference.
Please also refer to
Workflow
for other optional parameters.- review_candidates#
A list of ids of the documents suggested for review at the current round.
- Parameters:
dataset – Textual dataset the workflow will search on. When running experiments, it should contain gold labels, though this is not strictly required (exceptions might be raised depending on the labeler). Users can also provide labels to the workflow manually at each step.
component – A combined component that contains ideally all 4 essential roles. Components can be combined by using
tarexp.component.base.combine()
seed_doc – A list of documents the human experts reviewed before the iterative process starts. Often referred to as the seed set in the eDiscovery community.
batch_size – The number of documents the human experts review in each TAR round.
control_set_size – Size of the control set.
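The sampling half of a single round can be sketched as follows. This is a toy stand-in: the scores are fixed numbers rather than ranker output, and `top_k_unreviewed` is a hypothetical helper; in tarexp these steps are carried out by the Ranker, Sampler, and Labeler components:

```python
def top_k_unreviewed(scores, reviewed, k):
    # Relevance-feedback style sampling: pick the k highest-scoring
    # documents that have not been reviewed yet.
    candidates = [d for d in sorted(scores, key=scores.get, reverse=True)
                  if d not in reviewed]
    return candidates[:k]

# Fake document scores standing in for the ranker's output.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.4, "d4": 0.95, "d5": 0.2}
reviewed = {"d4"}                                  # seed document already labeled
batch = top_k_unreviewed(scores, reviewed, k=2)    # -> ["d1", "d2"]
reviewed.update(batch)                             # human expert labels the batch
# next round: retrain the ranker on `reviewed`, rescore the collection,
# and sample again until the stopping rule fires
```

Other sampling strategies (random, uncertainty-based) would replace only the selection line; the round structure stays the same.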
- step(force=False)[source]#
Step of the one phase TAR workflow.
Caution
If any of the required roles are missing from the component, the corresponding process is skipped silently, without warnings or exceptions.
- Parameters:
force – Whether to force the execution of a TAR step even if the workflow has already stopped.
- getMetrics(measures: List[OptimisticCost | ir_measures.measures.Measure | str], labels: List | np.ndarray = None) Dict[MeasureKey, int | float] [source]#
Calculate the evaluation measures at the current round.
If the underlying dataset does not contain gold labels, they should be provided here in order to calculate the values.
- Parameters:
measures – List of evaluation measures.
labels – Labels of the documents in the collection.
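As an illustration of the kind of per-round value getMetrics produces, recall at the current round can be derived from the set of reviewed documents and the gold labels. This is a simplified sketch, not the tarexp implementation, which supports ir_measures measures and OptimisticCost:

```python
def recall_at_round(reviewed_ids, gold_labels):
    # gold_labels maps doc id -> bool (relevant or not).
    # Recall = positives found among reviewed docs / all positives.
    n_pos = sum(gold_labels.values())
    n_found = sum(gold_labels[d] for d in reviewed_ids)
    return n_found / n_pos if n_pos else 0.0

gold = {"d1": True, "d2": False, "d3": True, "d4": True, "d5": False}
reviewed = ["d1", "d2", "d3"]
r = recall_at_round(reviewed, gold)  # 2 of the 3 positives found so far
```

Tracking such a value from the yielded ledger at every round is the typical way to plot a recall curve over the course of a TAR run.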
- makeReplay() OnePhaseTARWorkflowReplay [source]#
Create a replay workflow that contains records up to the current round.
- class tarexp.workflow.OnePhaseTARWorkflowReplay(dataset: Dataset, ledger: FrozenLedger, saved_scores=None)[source]#
Bases:
WorkflowReplay
,OnePhaseTARWorkflow
Replay workflow for
OnePhaseTARWorkflow
.- Parameters:
dataset – The dataset instance the workflow replay will operate on. Ideally it is the same as the one the original workflow operated on, but this is not strictly enforced.
ledger – The frozen ledger that records the past states of the workflow.
saved_scores – The document scores from each TAR round. This is optional, depending on the kind of experiments the user intends to run. Experiments such as testing the size or sampling process of the control set do not require the document scores.
- property latest_scores#
The latest set of document scores.
- property review_candidates#
A list of ids of the documents suggested for review at the current round.
- class tarexp.workflow.TwoPhaseTARWorkflow(*args, **kwargs)[source]#
Bases:
OnePhaseTARWorkflow
Two phase workflow extended from
OnePhaseTARWorkflow
The two-phase workflow adds a second-phase process, defined as a
poststopping
role in the component, which is executed after the stopping rule suggests stopping. If no such role exists in the component, an exception is raised.Please refer to
OnePhaseTARWorkflow
for parameter documentation.- step(*args, **kwargs)[source]#
Step of the two-phase TAR workflow. If stopping is suggested, the second-phase review process is invoked. Please refer to
tarexp.workflow.OnePhaseTARWorkflow.step()
for more documentation.