Fixtures For Tests

Most Dataverse test fixtures are based on JSON files stored in the test resources of the codebase.

In addition, as of Dataverse 6.11, you can use a generator utility to create dataset-centered fixtures programmatically. This is most useful for local integration and performance tests, but it may be useful for unit tests as well.

Dataset Fixture Generator

The dataset fixture generator is a test utility for creating connected dataset entity graphs with configurable size and shape. It is located in the core testing utilities at edu.harvard.iq.dataverse.util.testing.fixtures and edu.harvard.iq.dataverse.util.testing.recipes.

The fixture generator is useful when tests need one or more datasets with many files, tabular files, variables, and optional variable metadata, while still keeping the test setup readable. It is primarily intended for integration and performance tests where hand-built entities would be too verbose, too brittle, or too uniform to uncover ORM and serialization issues.

The generator creates an in-memory entity graph. Persisting that graph to a database is optional and requires the usual JPA persistence rules to be respected (see below).

Architecture

The fixture generator is built around three main concepts: a builder, recipes for it, and field populators. This separation keeps entity graph shape, relationship wiring, and scalar field population independent of each other.

Fixture Builder

The builder creates the connected entity graph by consuming recipes. It is responsible for:

Creating the entities
Wiring relationships
Keeping both sides of relationships in sync where needed
Returning a DatasetFixture with convenient references to generated objects
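
Keeping both sides of a relationship in sync can be sketched as follows. This is a minimal illustration, not the actual builder: Table, Variable, and addVariable are hypothetical stand-ins and not the Dataverse entity classes. It only shows the idea of setting the back reference and the owning collection in one place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT the Dataverse entities or builder.
class WiringSketch {

    static final class Table {
        private final List<Variable> variables = new ArrayList<>();
        List<Variable> getVariables() { return variables; }
    }

    static final class Variable {
        private final String name;
        private Table table;
        Variable(String name) { this.name = name; }
        String getName() { return name; }
        Table getTable() { return table; }
        void setTable(Table table) { this.table = table; }
    }

    // Creates a variable and wires both directions of the relationship at once.
    static Variable addVariable(Table table, String name) {
        Variable variable = new Variable(name);
        variable.setTable(table);            // child -> parent
        table.getVariables().add(variable);  // parent -> child
        return variable;
    }

    public static void main(String[] args) {
        Table table = new Table();
        for (int i = 0; i < 3; i++) {
            addVariable(table, "var" + i);
        }
        System.out.println(table.getVariables().size()); // prints 3
    }
}
```

Centralizing the wiring in a single helper means that recipes and populators never have to touch relationship fields themselves.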

Recipes

Recipes describe the shape of the fixture’s entity graph and should not manually wire entity relationships:

How many files should exist?
Which files are tabular?
How many variables should a tabular file contain?
Should variable metadata be created?

Available Recipes:

Recipes are composable using a fluent API and work together.

DatasetRecipe
  -> DatasetTypeRecipe
  -> VersionRecipe
       -> FileRecipe
            -> VariableSetRecipe
                 -> VariableMetadataRecipe

DatasetRecipe
Top-level recipe for creating a dataset fixture. It combines a DatasetTypeRecipe and a VersionRecipe.

DatasetTypeRecipe
Provides the dataset type assigned to the generated dataset. Can create a dataset type from scalar values or wrap an existing instance.

Note: the recipe provides the type object but does not persist it. Tests that persist generated fixtures must ensure the dataset type is managed before the dataset is flushed.

VersionRecipe
Describes the current dataset version. At the moment, this mainly means providing one or more file recipes.

FileRecipe
Describes file populations. A file recipe may create regular files or tabular files.

VariableSetRecipe
Describes how many variables to create for tabular files. It supports uniform and skewed variable populations.

VariableMetadataRecipe
Decides whether a VariableMetadata row should be created for a generated (FileMetadata, DataVariable) pair. At most one metadata row is generated for each such pair.

Fixture Populator

The populator fills scalar and non-relationship fields, which are not primarily about graph shape.

It sets values such as:

Identifiers
Timestamps
File labels
Content types
Checksums
Variable names
Required fields
Null-sensitive collections

The default minimal populator is conservative. It creates enough data for serialization and persistence tests, but it does not try to simulate fully realistic production metadata.

Full Example

The following example creates a small but non-uniform dataset fixture. It’s suitable

for a smoke test of a serializer,
for an integration test with assertions on the result,
for a performance test that benchmarks different implementations, and for other scenarios.

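var recipe = DatasetRecipe.of(
    DatasetTypeRecipe.dataset(),
    VersionRecipe.of(
        FileRecipe.regular(20),
        FileRecipe.tabular(30,
            VariableSetRecipe
                // variable metadata only for variables with an index below 5
                .byPredicate(VariableMetadataRecipe.byPredicate(ctx -> ctx.variableIndex() < 5))
                // files whose index is a multiple of 10 get 1,000 variables
                .when(ctx -> ctx.fileIndex() % 10 == 0, 1_000)
                // all other tabular files get 25 variables
                .otherwise(25)
        )
    )
);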

DatasetFixture fixture = DatasetFixtureBuilder.builder()
    .recipe(recipe)
    .populator(FixturePopulator.minimal())
    .build();

JsonArrayBuilder files = Json.createArrayBuilder();

for (FileMetadata fileMetadata : fixture.fileMetadatas()) {
    files.add(JsonPrinter.json(fileMetadata.getDataFile(), fileMetadata, true));
}

var json = files.build();

This creates:

20 regular files
30 tabular files
- tabular files whose file index is a multiple of 10 have 1,000 variables
- all other tabular files have 25 variables
- variable metadata exists only for variables with an index below 5 in each tabular file

This helps exercise code paths that traverse files, file metadata, data tables, data variables, and variable metadata. All of this happens without the need to pre-produce an enormous fixture as a JSON file. Because the recipe is deterministic, the test can run anywhere without depending on seeded randomness and yields reliable, reproducible results.

Basic Usage

Small Dataset

This example creates:

one dataset
one current version
10 tabular files
10 variables per tabular file
1 regular file

var recipe = DatasetRecipe.of(
    DatasetTypeRecipe.dataset(),
    VersionRecipe.of(
        FileRecipe.tabular(10, VariableSetRecipe.uniform(10)),
        FileRecipe.regular(1)
    )
);

DatasetFixture fixture = DatasetFixtureBuilder.builder()
    .recipe(recipe)
    .populator(FixturePopulator.minimal())
    .build();

Dataset dataset = fixture.dataset();
DatasetVersion version = fixture.currentVersion();

Skewed Variable Populations

Skewed data is useful for performance testing because real datasets are rarely uniform. Some files may have only a few variables, while others may be very large.

This example creates 500 tabular files:

one dataset
one current version
500 tabular files
- every 100th file receives 100,000 variables
- every 10th file that is not also a 100th file receives 10,000 variables
- all others receive 250 variables

var variables = VariableSetRecipe.byPredicate()
    .when(ctx -> ctx.fileIndex() % 100 == 0, 100_000)
    .when(ctx -> ctx.fileIndex() % 10 == 0, 10_000)
    .otherwise(250);

var recipe = DatasetRecipe.of(
    DatasetTypeRecipe.dataset(),
    VersionRecipe.of(
        FileRecipe.tabular(500, variables)
    )
);

DatasetFixture fixture = DatasetFixtureBuilder.builder()
    .recipe(recipe)
    .build();
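
To get a feeling for the size of such a fixture, the rule chain can be modeled in a few lines of plain Java. This is a hedged sketch, not the real VariableSetRecipe: the class SkewSketch and its methods are hypothetical, and it assumes that the first matching rule wins and that file indices are consecutive and start at 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model of a first-match-wins rule chain; NOT the real VariableSetRecipe.
class SkewSketch {

    private record Rule(IntPredicate matches, int count) {}

    private final List<Rule> rules = new ArrayList<>();
    private int fallback;

    SkewSketch when(IntPredicate matches, int count) {
        rules.add(new Rule(matches, count));
        return this;
    }

    SkewSketch otherwise(int count) {
        this.fallback = count;
        return this;
    }

    // Returns the count of the first rule that matches, or the fallback (assumption).
    int variableCountFor(int fileIndex) {
        for (Rule rule : rules) {
            if (rule.matches().test(fileIndex)) {
                return rule.count();
            }
        }
        return fallback;
    }

    // Sums the variable counts for file indices 0..fileCount-1.
    long totalVariables(int fileCount) {
        long total = 0;
        for (int i = 0; i < fileCount; i++) {
            total += variableCountFor(i);
        }
        return total;
    }

    public static void main(String[] args) {
        SkewSketch sketch = new SkewSketch()
            .when(i -> i % 100 == 0, 100_000)
            .when(i -> i % 10 == 0, 10_000)
            .otherwise(250);
        System.out.println(sketch.totalVariables(500)); // prints 1062500
    }
}
```

Under these assumptions, 5 files receive 100,000 variables, 45 files receive 10,000, and 450 files receive 250, which adds up to 1,062,500 variables in memory. Keep that order of magnitude in mind when choosing heap size and test timeouts.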

Adding Variable Metadata

Variable metadata is optional and controlled by VariableMetadataRecipe. The metadata recipe is evaluated for each generated (FileMetadata, DataVariable) pair. This matters because VariableMetadata is versioned indirectly through FileMetadata.

No variable metadata (default):

VariableSetRecipe.uniform(1_000)
- or -
VariableSetRecipe.uniform(1_000, VariableMetadataRecipe.noop())

Metadata for every variable:

VariableSetRecipe.uniform(1_000, VariableMetadataRecipe.always())

Metadata for selected variables:

VariableSetRecipe.uniform(1_000, VariableMetadataRecipe.byPredicate(ctx -> ctx.variableIndex() % 10 == 0))
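
The per-pair evaluation can be illustrated with a small stand-alone sketch. It is only a model of the idea: PairKey, generate, and the String used as a "row" are hypothetical placeholders and not Dataverse types. The map key enforces at most one row per (file metadata, variable) pair.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical sketch; PairKey and the String "row" are placeholders, NOT Dataverse types.
class MetadataPairSketch {

    // Identifies one (file metadata, variable) pair by its indices.
    record PairKey(int fileMetadataIndex, int variableIndex) {}

    // Evaluates the selector once per pair and keeps at most one row per pair.
    static Map<PairKey, String> generate(int fileMetadataCount, int variableCount, IntPredicate selector) {
        Map<PairKey, String> rows = new LinkedHashMap<>();
        for (int f = 0; f < fileMetadataCount; f++) {
            for (int v = 0; v < variableCount; v++) {
                if (selector.test(v)) {
                    rows.putIfAbsent(new PairKey(f, v), "metadata for variable " + v);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two file metadata objects, 1,000 variables, every 10th variable selected.
        Map<PairKey, String> rows = generate(2, 1_000, v -> v % 10 == 0);
        System.out.println(rows.size()); // prints 200
    }
}
```

With 1,000 variables and the predicate from the last example, 100 variables are selected per file metadata object, so the number of metadata rows grows with the number of file metadata objects, not only with the number of variables.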

Persistence Usage

The generator creates an in-memory entity graph. Persisting that graph is optional and follows normal JPA rules.

When persisting a generated fixture to a database, remember that not all relationships cascade from Dataset to every object. In particular, DataFile instances usually need to be persisted explicitly before persisting the dataset graph. The DatasetType must also be managed, either by persisting the generated type or by looking up an existing one in the same persistence context.

A typical persistence sequence is:

jpa.inTransactionVoid(em -> {
    em.persist(fixture.datasetType());
    for (DataFile dataFile : fixture.dataFiles()) {
        em.persist(dataFile);
    }
    em.persist(fixture.dataset());
});

The exact order may change as the fixture generator grows and as the entity classes evolve, and it may depend on the usage scenario. The important point is:
Shared/reference entities and non-cascaded entities must be managed (persisted) before the dataset graph is flushed.
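
The rule can be illustrated with a toy model that does not use JPA at all. Everything in it is hypothetical (PersistOrderSketch, persistRoot, isManaged); it only mimics the idea that a root object is rejected while it still references unmanaged, non-cascaded objects. A real persistence provider reports such problems in its own way and at flush time.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model only: it mimics the "references must be managed" rule and does NOT use JPA.
class PersistOrderSketch {

    // Identity-based set of objects that count as managed in this toy context.
    private final Set<Object> managed = Collections.newSetFromMap(new IdentityHashMap<>());

    void persist(Object entity) {
        managed.add(entity);
    }

    boolean isManaged(Object entity) {
        return managed.contains(entity);
    }

    // Accepts the root only if every non-cascaded reference is already managed.
    void persistRoot(Object root, List<Object> nonCascadedReferences) {
        for (Object reference : nonCascadedReferences) {
            if (!managed.contains(reference)) {
                throw new IllegalStateException("Unmanaged reference: " + reference);
            }
        }
        managed.add(root);
    }

    public static void main(String[] args) {
        PersistOrderSketch ctx = new PersistOrderSketch();
        Object datasetType = "datasetType";
        Object dataFile = "dataFile";
        Object dataset = "dataset";

        try {
            // Wrong order: the root comes first and is rejected.
            ctx.persistRoot(dataset, List.of(datasetType, dataFile));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        // Right order: shared and non-cascaded objects first, then the root.
        ctx.persist(datasetType);
        ctx.persist(dataFile);
        ctx.persistRoot(dataset, List.of(datasetType, dataFile));
        System.out.println("dataset managed: " + ctx.isManaged(dataset)); // prints "dataset managed: true"
    }
}
```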

Discussion and Limitations

Benefits

Readable scenarios: tests describe intent at a high level. For example, FileRecipe.tabular(500, VariableSetRecipe.uniform(1_000)) is easier to understand than manually creating thousands of entities.
Composable graph shape: different recipes can be combined to describe mixed datasets.
Deterministic output: the build context carries fixture-wide values such as sequence and timestamp, making generated data easier to debug and compare.
Reduced boilerplate: relationship wiring and null-sensitive defaults are centralized.
Better performance testing: skewed fixtures can expose ORM issues that uniform data may hide, such as N+1 query expansion over large variable collections.
Serialization safety: the minimal populator initializes fields and collections that serializers commonly traverse.
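
Why fixture-wide values make output deterministic can be shown with a short sketch. It is an illustration under assumptions: BuildContext, nextSequence, fileLabel, and the label format are hypothetical and do not describe the actual build context of the fixture generator. The point is that a fixed timestamp and a counter replace clock reads and random values.

```java
import java.time.Instant;

// Hypothetical sketch: BuildContext and the label format are illustrative, NOT Dataverse API.
class DeterministicSketch {

    static final class BuildContext {
        private final Instant timestamp; // fixed once per fixture, never read from the clock again
        private long sequence;           // monotonically increasing counter

        BuildContext(Instant timestamp) {
            this.timestamp = timestamp;
        }

        long nextSequence() {
            return ++sequence;
        }

        Instant timestamp() {
            return timestamp;
        }
    }

    // Derives a label only from the context and the file index, so reruns give the same value.
    static String fileLabel(BuildContext ctx, int fileIndex) {
        return "file-" + fileIndex + "-" + ctx.nextSequence() + ".tab";
    }

    public static void main(String[] args) {
        BuildContext firstRun = new BuildContext(Instant.EPOCH);
        BuildContext secondRun = new BuildContext(Instant.EPOCH);
        String a = fileLabel(firstRun, 0);
        String b = fileLabel(secondRun, 0);
        System.out.println(a.equals(b)); // prints true
    }
}
```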

Tradeoffs

More concepts to learn: compared to a static factory, developers need to understand builders, recipes, populators, and the resulting fixture objects.
Not a full production object factory: the minimal populator creates safe test data, not necessarily realistic production data.
Persistence still requires care: some entities must be persisted explicitly because the production model does not cascade every relationship.
Hardcoded defaults: the minimal populator uses deterministic placeholder values; tests that need realistic metadata should provide a custom populator.

Limitations

Minimalistic: The current fixture generator is intentionally minimal.
Single dataset version only: the fixture currently models one current dataset version and does not generate multiple versions.
No version evolution recipes: there is no support yet for deriving later versions from earlier versions to model change over time.
Limited dataset metadata: dataset fields and metadata blocks are not generated in detail.
Simple dataset type handling: a DatasetType can be generated or supplied, but persistence of shared types is still the responsibility of the test.
No persistence manager: the fixture system builds graphs, but it does not yet provide a dedicated persister that knows the correct persistence order.
One table per tabular file: tabular files currently get one DataTable. The domain model can allow more, but the fixture generator does not expose that yet.
One variable group per tabular file: each non-empty tabular file currently gets one VarGroup containing all variables; there is no VarGroupRecipe yet.
Limited variable metadata content: variable metadata can be present or absent, but the minimal populator only fills basic scalar values.
No category or statistics recipes: the fixture generator does not yet provide recipes for variable categories, summary statistics, invalid ranges, or category metadata.

Unsupported Usage Scenarios

The following scenarios are not yet directly expressible:

multiple dataset versions sharing the same DataFile objects
metadata-only changes between versions
version-specific VariableMetadata changes across versions
files added or removed between versions
multiple DataTable objects per file
different variable group distributions per file
weighted random or seeded random file populations
Zipf-like or heavy-tail distributions as first-class recipes
realistic dataset field metadata
fixture graphs that mimic a fully published dataset lifecycle

Extending The Fixture Generator

When extending the fixture generator, first decide which responsibility your change belongs to.

Add Recipes For Graph Shaping

Use a new recipe when the test needs to describe what shape should be created.

Examples:

number of var groups
number of data tables per file
whether categories should exist
how many variables receive summary statistics
how versions evolve over time

Recipe changes usually belong in the edu.harvard.iq.dataverse.util.testing.recipes package.

Add Populator Behavior For Scalar Values

Use a new or custom populator when entities should be filled differently, but the graph shape is the same. Extend the populator interface if new types of scalar data are required.

Examples:

more realistic file names
different content types
richer variable labels
custom checksums
realistic variable metadata text

Populator changes usually belong in the edu.harvard.iq.dataverse.util.testing.fixtures package.

Change Builder For Wiring

Change the builder when new relationships must be created or maintained.

Examples:

adding support for VariableCategory
wiring category metadata
creating multiple data tables per file
linking version-evolved file metadata back to shared data files

Builder changes should be kept small and split into helper methods where possible.

Recommended Extension Path

A practical roadmap for further evolution is:

Add a VarGroupRecipe to control group count and membership.
Add category and summary statistic recipes for variable-level enrichment.
Add a fixture persister that knows the correct persistence order.
Add version evolution recipes for multi-version datasets.
Add richer dataset metadata generation.
Add (seeded!) random distribution recipes if a deterministic skew is not enough.
Add fuzz testing by generating fixtures with targeted chaos.

Guidelines For Contributions

Keep recipes declarative: recipes should describe shape, not manually wire entity relationships.
Keep populators focused: populators should fill fields, not decide how many entities exist.
Keep builders responsible for wiring: relationship consistency belongs in the builder.
Prefer deterministic generation: deterministic data makes performance tests easier to reproduce and debug.
Avoid hiding persistence requirements: if an entity must be persisted before another, document it clearly or add a dedicated persister.
Start minimal: add the smallest recipe or populator extension needed for the scenario. Avoid making the DSL generic before there is a concrete test need.
