tarexp.ledger module#

Any aspect of the history of a batch-based workflow can, if necessary, be reproduced from a record of which documents were labeled in which training rounds (including any initial seed round). The tarexp.ledger.Ledger instance records this state in memory and writes it to disk at user-specified intervals (configured in tarexp.workflow.Workflow) to enable restarts.

The persisted ledger for a complete run can be used to execute TARexp in frozen mode (tarexp.ledger.FrozenLedger) where no batch selection, training, or scoring is done. Frozen mode supports efficient testing of new components that do not change training or scoring, e.g., non-interventional stopping rules [1], effectiveness estimation methods, etc. Evaluating stopping rules for two-phase reviews also requires persisting scores of all documents at the end of each training round, an option the user can specify.
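To illustrate why a persisted record is enough to test a non-interventional stopping rule, the sketch below replays a toy history round by round and reports the first round at which a rule would fire. It does not use TARexp: `history`, `counts_by_round`, `replay`, and the "two rounds without positives" rule are hypothetical names and choices made for this example only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical persisted record: document id -> (round reviewed, label).
history: Dict[int, Tuple[int, bool]] = {
    0: (0, True), 1: (0, False),   # seed round
    2: (1, True), 3: (1, True),
    4: (2, False), 5: (2, False),
    6: (3, False), 7: (3, False),
}

RoundCounts = List[Tuple[int, int]]  # (positives, negatives) per round


def counts_by_round(record: Dict[int, Tuple[int, bool]]) -> RoundCounts:
    """Rebuild per-round label counts from the record alone."""
    n_rounds = max(rnd for rnd, _ in record.values()) + 1
    counts = [[0, 0] for _ in range(n_rounds)]
    for rnd, label in record.values():
        counts[rnd][0 if label else 1] += 1
    return [(pos, neg) for pos, neg in counts]


def replay(record: Dict[int, Tuple[int, bool]],
           rule: Callable[[RoundCounts], bool]) -> Optional[int]:
    """Show the rule the counts seen so far, one round at a time.

    No batch selection, training, or scoring happens here; the rule only
    observes what was already recorded.
    """
    counts = counts_by_round(record)
    for rnd in range(len(counts)):
        if rule(counts[: rnd + 1]):
            return rnd
    return None


def two_empty_rounds(seen: RoundCounts) -> bool:
    """Toy rule: stop after two consecutive rounds with no positives."""
    return len(seen) >= 2 and all(pos == 0 for pos, _ in seen[-2:])


stop_round = replay(history, two_empty_rounds)
print(stop_round)
```

Because the rule never influences which documents are reviewed, many candidate rules can be compared against the same record without rerunning the workflow.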

class tarexp.ledger.Ledger(n_docs: int)[source]#

Bases: Savable

A Ledger records the progress of a TAR run.

Specifically, we record (1) the review result of each document and (2) the round in which each document was reviewed. Control set documents are marked as reviewed at round -1 and seed documents at round 0.

Parameters:

n_docs – Number of documents in the collection.
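The bookkeeping described above can be pictured as two per-document slots, one for the label and one for the round. The following is a minimal sketch under that assumption; `ToyLedger`, its attributes, and its method names are hypothetical stand-ins and not the TARexp implementation.

```python
from typing import Dict, List, Optional

CONTROL_ROUND = -1  # control set documents, as described above
SEED_ROUND = 0      # the first annotated batch


class ToyLedger:
    """Hypothetical stand-in for illustration only."""

    def __init__(self, n_docs: int) -> None:
        # One slot per document; None means "not reviewed yet".
        self.labels: List[Optional[bool]] = [None] * n_docs
        self.rounds: List[Optional[int]] = [None] * n_docs
        self._next_round = SEED_ROUND

    def _record(self, labeled: Dict[int, bool], rnd: int) -> int:
        # Rejecting repeated reviews is a choice made for this sketch.
        for doc_id in labeled:
            if self.rounds[doc_id] is not None:
                raise ValueError(f"document {doc_id} was already reviewed")
        for doc_id, label in labeled.items():
            self.labels[doc_id] = bool(label)
            self.rounds[doc_id] = rnd
        return len(labeled)

    def create_control(self, labeled: Dict[int, bool]) -> int:
        return self._record(labeled, CONTROL_ROUND)

    def annotate(self, labeled: Dict[int, bool]) -> int:
        n_recorded = self._record(labeled, self._next_round)
        self._next_round += 1
        return n_recorded


toy = ToyLedger(n_docs=6)
toy.create_control({5: True})
toy.annotate({0: True, 1: False})  # seed round
toy.annotate({2: False})           # first training round
print(toy.rounds)
```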

createControl(*args)[source]#

Create a control set. Please refer to annotate() for the argument documentation.

annotate(*args)[source]#

Record a round of annotation (review).

Parameters:
  • Dictionary – If only one positional argument is provided, it should be a dictionary whose keys are document ids and whose values are the corresponding labels.

  • Pair – If two positional arguments are provided, the first is treated as the list of documents and the second as the corresponding labels. The two lists must have the same length.

Returns:

Number of documents annotated in this round.

Return type:

int
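The two calling conventions described above can be reduced to a single mapping before anything is recorded. The helper below is a sketch of that idea only; `normalize_annotations` is a hypothetical name, and the real method may validate its arguments differently.

```python
from typing import Dict, Hashable


def normalize_annotations(*args) -> Dict[Hashable, bool]:
    """Turn either calling convention into a {document id: label} mapping."""
    if len(args) == 1:
        # Dictionary form: keys are document ids, values are labels.
        return dict(args[0])
    if len(args) == 2:
        # Pair form: parallel sequences of documents and labels.
        docs, labels = list(args[0]), list(args[1])
        if len(docs) != len(labels):
            raise ValueError("documents and labels must have the same length")
        return dict(zip(docs, labels))
    raise TypeError("expected one dictionary or a (documents, labels) pair")


as_dict = normalize_annotations({3: True, 7: False})
as_pair = normalize_annotations([3, 7], [True, False])
print(as_dict == as_pair)
```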

getReviewedIds(round: int) → np.ndarray[int][source]#

Get the ids of the documents that were reviewed at the given round.

property control_mask: np.ndarray[bool | np.nan]#

A mask with one entry per document in the collection, marking the control documents.

property n_rounds: int#

Number of rounds that have been executed.

property n_docs: int#

Number of documents in the collection.

property n_annotated: int#

Total number of documents that have been reviewed.

property n_pos_annotated: int#

Total number of documents that have been labeled as positive (relevant).

property n_neg_annotated: int#

Total number of documents that have been labeled as negative (non-relevant).

property annotated: np.ndarray[bool]#

A mask with one entry per document in the collection, marking the annotated (reviewed) documents.

property annotation: np.ndarray[bool | np.nan]#

List of the annotations. Documents that have not been reviewed are recorded as np.nan. Control documents are considered not annotated.

property isDone: bool#

Whether all documents have been reviewed (including control documents).
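One way to read the properties above is as simple views derived from a per-document record of rounds and labels. The sketch below follows that reading with plain Python lists; it is not the TARexp code. The variable names mirror the documented properties for readability only, the masks here are plain booleans (the documented types also allow np.nan), and excluding control documents from the counts is an assumption based on the description of annotation.

```python
import math
from typing import List, Optional

# Hypothetical record: round per document (None = unreviewed, -1 = control).
rounds: List[Optional[int]] = [0, 0, 1, None, -1]
labels: List[Optional[bool]] = [True, False, True, None, False]

# Reviewed in the seed round or a later round; control documents excluded.
annotated = [r is not None and r >= 0 for r in rounds]
control_mask = [r == -1 for r in rounds]

# Labels for annotated documents, NaN everywhere else.
annotation = [lab if ann else math.nan for lab, ann in zip(labels, annotated)]

n_docs = len(rounds)
n_annotated = sum(annotated)
n_pos_annotated = sum(1 for lab, ann in zip(labels, annotated) if ann and lab)
n_neg_annotated = n_annotated - n_pos_annotated
n_rounds = max((r for r in rounds if r is not None and r >= 0), default=-1) + 1

# Done only when every document, control or not, has been reviewed.
is_done = all(r is not None for r in rounds)

print(n_annotated, n_pos_annotated, n_neg_annotated, n_rounds, is_done)
```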

getAnnotationCounts() → List[Dict[bool, int]][source]#

Get a list of dictionaries that record the number of positive and negative documents reviewed in each round.
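As an illustration of the return shape, the sketch below builds one `{True: ..., False: ...}` dictionary per round from a toy record. `annotation_counts` is a hypothetical helper, and leaving control documents (round -1) out of the counts is an assumption of this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def annotation_counts(rounds: List[Optional[int]],
                      labels: List[Optional[bool]]) -> List[Dict[bool, int]]:
    """Count positive and negative labels per round (seed round is index 0)."""
    n_rounds = max((r for r in rounds if r is not None and r >= 0), default=-1) + 1
    counters = [Counter() for _ in range(n_rounds)]
    for rnd, label in zip(rounds, labels):
        if rnd is not None and rnd >= 0:   # skip unreviewed and control documents
            counters[rnd][label] += 1
    # Always report both keys so every round has the same shape.
    return [{True: c[True], False: c[False]} for c in counters]


rounds = [0, 0, 1, 1, 1, None, -1]
labels = [True, False, False, False, True, None, True]
per_round = annotation_counts(rounds, labels)
print(per_round)
```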

freeze() → FrozenLedger[source]#

Get a frozen version of the current ledger.

freeze_at(round: int) → FrozenLedger[source]#

Get a frozen version of the ledger as it stood at the given round.
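A plausible reading of freezing at a round is that anything reviewed after that round is treated as not yet reviewed. The sketch below implements that reading on a toy record; `snapshot_at` is a hypothetical helper, the exact semantics of the real method are an assumption here, and tuples stand in for the read-only storage.

```python
from typing import Optional, Sequence, Tuple


def snapshot_at(rounds: Sequence[Optional[int]],
                labels: Sequence[Optional[bool]],
                rnd: int) -> Tuple[tuple, tuple]:
    """Return immutable copies that keep only reviews made up to `rnd`."""
    kept_rounds = tuple(
        r if (r is not None and r <= rnd) else None for r in rounds
    )
    # Drop the label wherever the round was dropped.
    kept_labels = tuple(
        lab if r is not None else None for r, lab in zip(kept_rounds, labels)
    )
    return kept_rounds, kept_labels


rounds = [0, 0, 1, 2, None, -1]
labels = [True, False, True, False, None, True]
frozen_rounds, frozen_labels = snapshot_at(rounds, labels, rnd=1)
print(frozen_rounds)
print(frozen_labels)
```

Control documents (round -1) survive any snapshot in this sketch because -1 is below every regular round number.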

class tarexp.ledger.FrozenLedger(org_ledger: Ledger)[source]#

Bases: Ledger

A frozen ledger prevents the record from being modified. The underlying record (implemented as a numpy array) is locked as not writable.

All methods and properties of Ledger are supported except tarexp.ledger.Ledger.annotate(), which raises a FrozenInstanceError when invoked.
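The behavior described above, read access kept and writes rejected, can be sketched with a small subclass. This is a toy under stated assumptions: `ToyLedger` and `ToyFrozenLedger` are hypothetical classes, a tuple stands in for the read-only numpy array, and the standard library's `dataclasses.FrozenInstanceError` is used as a stand-in for the exception named in the text.

```python
from dataclasses import FrozenInstanceError
from typing import Dict, Optional, Sequence


class ToyLedger:
    """Hypothetical mutable ledger holding one label slot per document."""

    def __init__(self, n_docs: int) -> None:
        self._labels: Sequence[Optional[bool]] = [None] * n_docs

    def annotate(self, labeled: Dict[int, bool]) -> int:
        for doc_id, label in labeled.items():
            self._labels[doc_id] = label
        return len(labeled)

    @property
    def n_annotated(self) -> int:
        return sum(label is not None for label in self._labels)


class ToyFrozenLedger(ToyLedger):
    """Read-only view: inherited properties work, annotate() is rejected."""

    def __init__(self, org_ledger: ToyLedger) -> None:
        # Immutable copy, so later changes to the original do not leak in.
        self._labels = tuple(org_ledger._labels)

    def annotate(self, *args, **kwargs) -> int:
        raise FrozenInstanceError("a frozen ledger cannot be annotated")


live = ToyLedger(n_docs=4)
live.annotate({0: True, 2: False})
frozen = ToyFrozenLedger(live)
live.annotate({1: True})  # the live ledger keeps evolving

print(live.n_annotated, frozen.n_annotated)
```

With NumPy, the analogous lock is an array whose `writeable` flag is set to false, so in-place assignment fails instead of silently changing the record.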

annotate(*args, **kwargs)[source]#

Not supported.

property n_rounds#

Number of rounds this frozen ledger has recorded.