lamindb.Artifact

class lamindb.Artifact(data: UPathStr, type: ArtifactType | None = None, key: str | None = None, description: str | None = None, revises: Artifact | None = None, run: Run | None = None)

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Datasets & models stored as files, folders, or arrays.

Artifacts manage data in local or remote storage (Tutorial: Artifacts).

Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.

Parameters:
  • data (UPathStr) – A path to a local or remote folder or file.

  • type (Literal["dataset", "model"] | None, default: None) – The artifact type.

  • key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – The previous version of the artifact. Triggers a revision.

  • run (Run | None, default: None) – The run that creates the artifact.

Typical storage formats & their API accessors

Arrays:

  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Annotated matrix: .h5ad, .h5mu, .zarr ⟷ AnnData, MuData

  • Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

Non-arrays:

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).

See also

Storage

Storage locations for artifacts.

Collection

Collections of artifacts.

from_df()

Create an artifact from a DataFrame.

from_anndata()

Create an artifact from an AnnData.

Examples

Create an artifact from a path to a file or folder:

>>> artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv", description="My file")
>>> artifact = ln.Artifact("./my_local_file.jpg", description="My image")
>>> artifact = ln.Artifact("s3://my_bucket/my_folder", description="My folder")
>>> artifact = ln.Artifact("./my_local_folder", description="My local folder")
Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')

Make a new version of an artifact:

>>> artifact = ln.Artifact.from_df(df, description="My dataframe")
>>> artifact.save()
>>> artifact_v2 = ln.Artifact(df_updated, revises=artifact)

Attributes

features: FeatureManager

Feature manager.

Features denote dataset dimensions, i.e., the variables that measure labels & numbers.

Annotate with features & values:

artifact.features.add_values({
     "species": organism,  # here, organism is an Organism record
     "scientist": ['Barbara McClintock', 'Edgar Anderson'],
     "temperature": 27.6,
     "study": "Candidate marker study"
})

Query for features & values:

ln.Artifact.features.filter(scientist="Barbara McClintock")

Features may or may not be part of the artifact content in storage. For instance, the Curator flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.

property labels: LabelManager

Label manager.

To annotate with labels, you typically use the registry-specific accessors, for instance ulabels:

candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
artifact.ulabels.add(candidate_marker_study)

Similarly, you query based on these accessors:

ln.Artifact.filter(ulabels__name="Candidate marker study").all()

Unlike the registry-specific accessors, the .labels accessor provides a way of associating labels with features:

study = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(candidate_marker_study, feature=study)

Note that the above is equivalent to:

artifact.features.add_values({"study": candidate_marker_study})

params: ParamManager

Param manager.

Example:

artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32,
    "preprocess_params": {
        "normalization_type": "cool",
        "subset_highlyvariable": True,
    },
})

property path: Path | UPath

Path.

File in cloud storage, here AWS S3:

>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')

File in local storage:

>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.get(description="myfile")
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')

Simple fields

uid: str

A universal random id (20-char base62 ~ UUID), valid across DB instances.
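The idea behind such an id can be sketched with the standard library alone (a sketch of the concept, not lamindb's actual generator):

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 10 + 26 + 26 = 62 characters

def base62_uid(n_char: int = 20) -> str:
    """Generate a cryptographically random base62 id of n_char characters."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))

uid = base62_uid()
print(len(uid))  # 20
```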

description: str

A description.

key: str

Storage key, the relative path within the storage location.
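Conceptually, the key is the artifact's path relative to the root of its storage location; a pathlib sketch with hypothetical paths:

```python
from pathlib import PurePosixPath

storage_root = PurePosixPath("/data/lamindb-storage")     # hypothetical storage root
artifact_path = storage_root / "myfolder" / "myfile.fcs"  # a file inside it

# The key is the path relative to the storage root:
key = str(artifact_path.relative_to(storage_root))
print(key)  # myfolder/myfile.fcs
```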

suffix: str

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.

type: Literal['dataset', 'model'] | None

ArtifactType (default None).

size: int

Size in bytes.

Examples: 1 KB is 1e3 bytes, 1 MB is 1e6, 1 GB is 1e9, 1 TB is 1e12, etc.
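A helper that formats byte counts with these decimal (1e3-based) units might look like this (a sketch, not part of the lamindb API):

```python
def human_size(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 KB = 1e3 bytes, as above)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000:
            return f"{n_bytes:g} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:g} PB"

print(human_size(1_500_000))  # 1.5 MB
print(human_size(500))        # 500 B
```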

hash: str

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.
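Content hashing for deduplication can be sketched with hashlib (lamindb's actual scheme differs in detail, e.g., it may use pseudo-hashes for large multi-part objects):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Hash raw content: identical bytes always yield the identical digest."""
    return hashlib.md5(data).hexdigest()

a = content_hash(b"hello world")
b = content_hash(b"hello world")
c = content_hash(b"hello worlds")
print(a == b, a == c)  # True False
```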

n_objects: int

Number of objects.

Typically, this denotes the number of files in an artifact.

n_observations: int

Number of observations.

Typically, this denotes the first array dimension.

visibility: int

Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).

version: str

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning, as in Python package versioning.
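Note that semantic versions compare numerically, not lexicographically; a minimal sketch of the difference (for real projects, consider packaging.version):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse '1.10.0' into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

# Plain string comparison gets multi-digit components wrong:
print("1.10.0" > "1.2.0")                                # False (lexicographic)
print(version_tuple("1.10.0") > version_tuple("1.2.0"))  # True  (numeric)
```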

is_latest: bool

Boolean flag that indicates whether a record is the latest in its version family.

created_at: datetime

Time of creation of record.

updated_at: datetime

Time of last update to record.

Relational fields

storage: Storage

Storage location, e.g. an S3 or GCP bucket or a local directory.

transform: Transform

Transform whose run created the artifact.

run: Run

Run that created the artifact.

created_by: User

Creator of record.

ulabels: ULabel

The ulabels measured in the artifact (ULabel).

input_of_runs: Run

Runs that use this artifact as an input.

feature_sets: FeatureSet

The feature sets measured in the artifact.

collections: Collection

The collections that this artifact is part of.

Class methods

classmethod df(include=None, join='inner', limit=100)

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

If you’d like to include other fields, use parameter include.

Parameters:
  • include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "labels__name", "cell_types__name", etc. or a list of such strings.

  • join (str, default: 'inner') – The join parameter of pandas.

Return type:

DataFrame

Examples

>>> labels = [ln.ULabel(name=f"Label {i}") for i in range(3)]
>>> ln.save(labels)
>>> ln.ULabel.filter().df(include=["created_by__name"])
classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Examples

>>> ln.ULabel(name="my ulabel").save()
>>> ulabel = ln.ULabel.get(name="my ulabel")
classmethod from_anndata(adata, key=None, description=None, run=None, revises=None, **kwargs)

Create from AnnData, validate & link features.

Parameters:
  • adata (AnnData | UPathStr) – An AnnData object or a path of AnnData-like.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()
classmethod from_df(df, key=None, description=None, run=None, revises=None, **kwargs)

Create from DataFrame, validate & link features.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()
classmethod from_dir(path, key=None, *, run=None)

Create a list of artifact objects from a directory.

Hint

If you have a high number of files (several hundred thousand) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.

Parameters:
  • path (lamindb.core.types.UPathStr) – Source path of folder.

  • key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, the inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Run | None, default: None) – A Run object.

Return type:

list[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
classmethod from_mudata(mdata, key=None, description=None, run=None, revises=None, **kwargs)

Create from MuData, validate & link features.

Parameters:
  • mdata (MuData) – A MuData object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".

  • description (str | None, default: None) – A description.

  • revises (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()
classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, a full uid, or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

See also

Examples

>>> ulabel = ln.ULabel.get("2riu039")
>>> ulabel = ln.ULabel.get(name="my-label")
classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum number of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of search results, sorted by relevance score.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
classmethod using(instance)

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods

cache(is_run_input=None)

Download cloud artifact to local cache.

Follows syncing logic: only downloads the artifact if the local cache is outdated.

Returns a path to a locally cached on-disk object (say a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
delete(permanent=None, storage=None, using_key=None)

Trash or permanently delete.

A first call to .delete() puts an artifact into the trash (sets visibility to -1). A second call permanently deletes the artifact.

FAQ: Storage FAQ

Parameters:
  • permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).

  • storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact.delete()
describe(print_types=False)

Describe relations of record.

Examples

>>> artifact.describe()
load(is_run_input=None, **kwargs)

Cache and load into memory.

See all loaders.

Return type:

Any

Examples

Load a DataFrame-like artifact:

>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load an AnnData-like artifact:

>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to cache() if no in-memory representation is configured:

>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
open(mode='r', is_run_input=None)

Return a cloud-backed data object.

Works for AnnData (.h5ad and .zarr), generic HDF5 and zarr stores, tiledbsoma objects (.tiledbsoma), and pyarrow-compatible formats.

Parameters:

mode (str, default: 'r') – Can only be "w" (write mode) for tiledbsoma stores; otherwise must always be "r" (read-only mode).

Return type:

AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment | PyArrowDataset

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact.open()
AnnDataAccessor object with n_obs × n_vars = 70 × 765
    constructed for the AnnData object pbmc68k.h5ad
    ...
replace(data, run=None, format=None)

Replace artifact content.

Parameters:
  • data (lamindb.core.types.UPathStr) – A file path.

  • run (Run | None, default: None) – The run that updates the artifact; gets auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this neither changes the storage key nor the filename.

However, it will update the suffix if it changes.

restore()

Restore from trash.

Return type:

None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()
save(upload=None, **kwargs)

Save to database & storage.

Parameters:

upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.

Return type:

Artifact

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
view_lineage(with_children=True)

Graph of data flow.

Return type:

None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()
async adelete(using=None, keep_parents=False)
async arefresh_from_db(using=None, fields=None, from_queryset=None)
async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)
clean()

Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.

clean_fields(exclude=None)

Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.

date_error_message(lookup_type, field_name, unique_for)
get_constraints()
get_deferred_fields()

Return a set containing names of deferred fields for this instance.

prepare_database_save(field)
refresh_from_db(using=None, fields=None, from_queryset=None)

Reload field values from the database.

By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.

Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.

When accessing deferred fields of an instance, the deferred loading of the field will call this method.

save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)

Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.

The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.

serializable_value(field_name)

Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.

Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.

unique_error_message(model_class, unique_check)
validate_constraints(exclude=None)
validate_unique(exclude=None)

Check unique constraints on the model and raise ValidationError if any failed.