tarexp.dataset module#
A dataset contains the essential information of the collection for retrieval. It is designed to remain static throughout a TAR run.
Documents encoded as vectors are required, but the vectors are not limited to any particular form. Depending on the intended experiments, they can be generated by the scikit-learn TF-IDF vectorizer or even Hugging Face Transformers tokenizers. We leave this flexibility to the users for further extension.
Ground truth labels are essential when running experiments without human intervention. They are also used in most evaluation processes that require ground truth. If the workflow is designed to run with actual human reviewing, the labels are no longer required.
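For example, a minimal sketch of building a labeled dataset with the SparseVectorDataset subclass documented below (the corpus and labels here are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from tarexp.dataset import SparseVectorDataset

corpus = ["first document text", "second document text"]  # hypothetical corpus
labels = [True, False]                                     # hypothetical ground truth

# Vectorize the raw text with scikit-learn's TF-IDF vectorizer.
ds = SparseVectorDataset(vectorizer=TfidfVectorizer())
ds.ingest(corpus)

# Attach ground truth labels for simulated (non-human) experiments.
ds = ds.setLabels(labels)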
- class tarexp.dataset.Dataset(name=None)[source]#
Bases: object
Meta class of TARexp datasets. This class defines the basic features of a dataset. All downstream datasets that inherit from it should implement the following properties and methods (a minimal subclass sketch follows the list):
Essentials
identifier
The unique identifier of the dataset. It is used to verify that the dataset provided is identical when resuming a workflow. The identifier should summarize both the vectors and the labels with a hash that depends on the actual content rather than the memory location of the variable (unlike the built-in hash() function). The utility function tarexp.util.stable_hash() provides such capability.
ingest()
Ingests a list of raw text into vectors and stores them in the attribute _vectors.
getAllData()
(Optional) Returns all vectors of the documents in the dataset. Ideally, it should return a copy of the vectors, but it could also return a reference if the collection is too large to copy in memory. This meta class implements a simple version, but users implementing new dataset classes should consider re-implementing it to support on-demand processing of the vectors (such as collators in PyTorch).
getTrainingData()
(Optional) Takes a tarexp.ledger.Ledger as an argument and returns the vectors and labels of the documents reviewed in the ledger. This meta class also implements a simple version, which should be re-implemented for the same reason as getAllData().
duplicate()
Returns a copy of the dataset along with any information that should be copied. This method should perform a deep copy of all contained objects to avoid shared memory references that would hinder fast multi-processing.
Labels (Optional)
labels
The labels of the dataset. We recommend implementing this information as a property instead of an attribute of the class to prevent modifying the labels by accident during the workflow. If the labels are intended to be unavailable, please consider raising NotImplemented instead of NotImplementedError to reflect the intention.
pos_doc_ids and neg_doc_ids
The set of positive and negative document ids.
setLabels()
Returns a new dataset that contains the labels of all documents in the dataset. It should also check that every document is assigned a label. Spawning a new instance ensures that the original dataset instance is not polluted.
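To make these requirements concrete, here is a minimal sketch of a downstream subclass. The in-memory list storage, the class name, and the exact stable_hash() usage are illustrative assumptions rather than guarantees about the TARexp API:

import copy
from tarexp.dataset import Dataset
from tarexp.util import stable_hash

class InMemoryDataset(Dataset):
    def __init__(self, name=None):
        super().__init__(name=name)
        self._vectors, self._labels = None, None

    @property
    def identifier(self):
        # Content-based hash so a resumed workflow can verify the data
        # (assuming stable_hash accepts these objects).
        return stable_hash((self._vectors, self._labels))

    @property
    def labels(self):
        # Return a copy so callers cannot mutate the stored labels.
        return list(self._labels)

    def ingest(self, text):
        # Trivial "vectorization": store the raw strings as-is.
        self._vectors = list(text)

    def setLabels(self, labels):
        assert len(labels) == len(self._vectors), "every document needs a label"
        new = self.duplicate()
        new._labels = list(labels)
        return new

    def duplicate(self):
        # Deep copy to avoid shared references across processes.
        return copy.deepcopy(self)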
- property identifier#
- property name#
- property n_docs#
Number of documents in the dataset.
- property labels#
- property hasLabels#
- property pos_doc_ids: set#
- property neg_doc_ids: set#
- class tarexp.dataset.SparseVectorDataset(vectorizer=None)[source]#
Bases: Dataset
Dataset backed by a SciPy sparse matrix.
- Parameters:
vectorizer – A function or a class instance that has a fit_transform method (such as the vectorizers from scikit-learn).
- property n_docs#
Number of documents in the dataset.
- property labels#
Returns a copy of the labels of all documents.
- property identifier#
- property pos_doc_ids: set#
Returns the ids of the positive documents.
- property neg_doc_ids: set#
Returns the ids of the negative documents.
- ingest(text, force=False)[source]#
Ingest the text using the vectorizer and store the vectors in this instance.
- Parameters:
text – A list of text that will be ingested and stored. If the labels are set, the length of the list should be identical to the length of labels.
force – Whether to skip the check that the length of the text matches the length of the labels.
- setLabels(labels, inplace=False)[source]#
Returns a new dataset with new labels.
- Parameters:
labels – A list or Numpy array of binary labels. The length should match the number of documents in the dataset.
inplace – Whether to apply this set of labels to the current dataset. If True, the method replaces the labels in place and returns None. Default False.
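A short usage sketch of both modes, continuing the hypothetical ds from the module introduction:

new_ds = ds.setLabels([True, False])       # returns a freshly labeled dataset
ds.setLabels([True, False], inplace=True)  # modifies ds and returns None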
- duplicate(deep=False)[source]#
Duplicate the dataset.
- Parameters:
deep – Whether to perform a deep copy of the vectors. Default False.
- classmethod from_sparse(matrix)[source]#
Create a SparseVectorDataset instance from a sparse matrix.
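A sketch of building a dataset directly from a precomputed matrix (the random matrix stands in for real document vectors):

import scipy.sparse as sp
from tarexp.dataset import SparseVectorDataset

matrix = sp.random(100, 50, density=0.1, format="csr")  # 100 hypothetical documents
ds = SparseVectorDataset.from_sparse(matrix)
print(ds.n_docs)  # presumably 100, one document per row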
- class tarexp.dataset.TaskFeeder(dataset: Dataset, labels: Any)[source]#
Bases: object
Python iterator that yields review tasks with different sets of labels given the same base dataset (a dataset without labels).
This class supports both iteration (in a for loop or via the next() function) and index lookup with [] if the list of labels provided has already been materialized (i.e., it is not an iterator).
- Parameters:
dataset – A Dataset instance that does not contain any labels. This instance will spawn downstream tasks with different labels.
labels –
If a Python dictionary is provided, the keys are considered the names of the tasks and the values are the corresponding labels. The length of each set of labels should be the same as the number of documents in the base dataset.
If a Pandas DataFrame is provided, the columns are considered the tasks, where the column names are used as the task names. The number of rows in the DataFrame should be the same as the number of documents in the base dataset.
If an iterator is provided, the length of the labels is not checked against the number of documents in the base dataset, and a warning will be raised. The iterator should yield tuples of the task name and the corresponding labels. The order should also be stable, especially when running experiments across multiple machines. If the iterator supports length via __len__, the class will respect it.
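A brief usage sketch with a dictionary of labels (the task names and label arrays are hypothetical, and the assumption that each yielded task is a labeled copy of the base dataset is illustrative):

import numpy as np
import scipy.sparse as sp
from tarexp.dataset import SparseVectorDataset, TaskFeeder

base = SparseVectorDataset.from_sparse(sp.random(100, 50, format="csr"))  # unlabeled base
feeder = TaskFeeder(
    dataset=base,
    labels={
        "topic-1": np.random.rand(100) > 0.9,  # hypothetical binary labels
        "topic-2": np.random.rand(100) > 0.8,
    },
)

for task in feeder:        # iterates over the labeled tasks
    print(task.name)
first = feeder["topic-1"]  # index lookup works since the dict is materialized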