Version: devel

Weaviate

Weaviate is an open-source vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you load data into Weaviate from dlt resources.

Setup Guide

  1. To use Weaviate as a destination, make sure dlt is installed with the 'weaviate' extra:
pip install "dlt[weaviate]"
  2. Next, configure the destination in the dlt secrets file. The file is located at ~/.dlt/secrets.toml by default. Add the following section to the secrets file:
[destination.weaviate.credentials]
url = "https://your-weaviate-url"
api_key = "your-weaviate-api-key"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"

In this setup guide, we are using the Weaviate Cloud Services to get a Weaviate instance and OpenAI API for generating embeddings through the text2vec-openai module.

You can host your own Weaviate instance using Docker Compose, Kubernetes, or Embedded Weaviate. Refer to Weaviate's How-to: Install guide or the dlt recipe we use for our tests. In that case, you can skip the url and api_key settings altogether and keep only the headers:

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"

The url defaults to http://localhost:8080 and api_key is left undefined; both are the defaults for the Weaviate container.
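As a hedged aside: dlt's general configuration rules also allow values from secrets.toml to be supplied through environment variables, where the section path is upper-cased and joined with double underscores. The sketch below only illustrates that naming rule for the two credential keys shown earlier; `toml_key_to_env_var` is a hypothetical helper, not part of dlt, and the values are the same placeholders used in the TOML example.

```python
import os


def toml_key_to_env_var(*path: str) -> str:
    """Join a TOML key path into a dlt-style environment variable name."""
    return "__".join(part.upper() for part in path)


url_var = toml_key_to_env_var("destination", "weaviate", "credentials", "url")
api_key_var = toml_key_to_env_var("destination", "weaviate", "credentials", "api_key")

# Placeholder values copied from the secrets.toml example.
os.environ[url_var] = "https://your-weaviate-url"
os.environ[api_key_var] = "your-weaviate-api-key"

print(url_var)
print(api_key_var)
```

Setting the variables in the shell or in a deployment environment instead of in Python works the same way; the Python form is used here only to keep the example self-contained.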

  3. Define the source of the data. For starters, let's load some data from a simple data structure:
import dlt
from dlt.destinations.adapters import weaviate_adapter

movies = [
    {
        "title": "Blade Runner",
        "year": 1982,
    },
    {
        "title": "Ghost in the Shell",
        "year": 1995,
    },
    {
        "title": "The Matrix",
        "year": 1999,
    },
]
  4. Define the pipeline:
pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
    dataset_name="MoviesDataset",
)
  5. Run the pipeline:
info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    )
)
  6. Check the results:
print(info)

The data is now loaded into Weaviate.

The Weaviate destination differs from other dlt destinations. To use vector search after the data has been loaded, you must specify which fields Weaviate needs to include in the vector index. You do that by wrapping the data (or dlt resource) with the weaviate_adapter function.

weaviate_adapter

The weaviate_adapter is a helper function that configures the resource for the Weaviate destination:

weaviate_adapter(data, vectorize, tokenization)

It accepts the following arguments:

  • data: a dlt resource object or a Python data structure (e.g., a list of dictionaries).
  • vectorize: a name of the field or a list of names that should be vectorized by Weaviate.
  • tokenization: the dictionary containing the tokenization configuration for a field. The dictionary should have the following structure {'field_name': 'method'}. Valid methods are "word", "lowercase", "whitespace", "field". The default is "word". See Property tokenization in Weaviate documentation for more details.

Returns: a dlt resource object that you can pass to the pipeline.run().

Example:

weaviate_adapter(
    resource,
    vectorize=["title", "description"],
    tokenization={"title": "word", "description": "whitespace"},
)

When using the weaviate_adapter, it's important to apply it directly to resources, not to the whole source. Here's an example:

products_tables = sql_database().with_resources("products", "customers")

pipeline = dlt.pipeline(
pipeline_name="postgres_to_weaviate_pipeline",
destination="weaviate",
)

# apply adapter to the needed resources
weaviate_adapter(products_tables.products, vectorize="description")
weaviate_adapter(products_tables.customers, vectorize="bio")

info = pipeline.run(products_tables)
tip

A more comprehensive pipeline would load data from some API or use one of dlt's verified sources.

Write disposition

A write disposition defines how the data should be written to the destination. All write dispositions are supported by the Weaviate destination.

Replace

The replace disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data.

In the movie example from the setup guide, we can use the replace disposition to reload the data every time we run the pipeline:

info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    write_disposition="replace",
)

Merge

The merge write disposition merges the data from the resource with the data in the destination. For the merge disposition, you would need to specify a primary_key for the resource:

info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    primary_key="document_id",
    write_disposition="merge",
)

Internally, dlt will use primary_key (document_id in the example above) to generate a unique identifier (UUID) for each object in Weaviate. If the object with the same UUID already exists in Weaviate, it will be updated with the new data. Otherwise, a new object will be created.
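The update-or-create behavior follows from the UUID being deterministic: the same primary key always maps to the same identifier. The sketch below illustrates that idea with Python's standard uuid.uuid5; it is not dlt's actual implementation, and `object_uuid`, the namespace choice, and the sample class name and key values are assumptions made purely for illustration.

```python
import uuid


def object_uuid(class_name: str, primary_key_value: str) -> str:
    """Derive a stable UUID from a class name and a primary key value."""
    # uuid5 hashes its input, so identical input always gives the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{class_name}:{primary_key_value}"))


first_load = object_uuid("Movies", "doc-1")
second_load = object_uuid("Movies", "doc-1")
other_document = object_uuid("Movies", "doc-2")

# Same key on a later load: same UUID, so the existing object is updated.
print(first_load == second_load)
# Different key: different UUID, so a new object is created.
print(first_load == other_document)
```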

caution

If you are using the merge write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads.

Append

This is the default disposition. It will append the data to the existing data in the destination, ignoring the primary_key field.

Data loading

Loading data into Weaviate from different sources requires a proper understanding of how data is transformed and integrated into Weaviate's schema.

Data types

Data loaded into Weaviate from various sources might have different types. To ensure compatibility with Weaviate's schema, there's a predefined mapping between the dlt types and Weaviate's native types:

dlt Type  | Weaviate Type
text      | text
double    | number
bool      | boolean
timestamp | date
date      | date
bigint    | int
binary    | blob
decimal   | text
wei       | number
complex   | text
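
For quick reference in your own code, the mapping above can be written as a plain dictionary. This is only a transcription of the table, not an object exported by dlt; the names `DLT_TO_WEAVIATE` and `weaviate_type` are hypothetical.

```python
# Transcription of the dlt-to-Weaviate type table (hypothetical names).
DLT_TO_WEAVIATE = {
    "text": "text",
    "double": "number",
    "bool": "boolean",
    "timestamp": "date",
    "date": "date",
    "bigint": "int",
    "binary": "blob",
    "decimal": "text",
    "wei": "number",
    "complex": "text",
}


def weaviate_type(dlt_type: str) -> str:
    """Look up the Weaviate type for a dlt type, failing loudly if unknown."""
    try:
        return DLT_TO_WEAVIATE[dlt_type]
    except KeyError:
        raise ValueError(f"no Weaviate mapping for dlt type {dlt_type!r}") from None


print(weaviate_type("timestamp"))
print(weaviate_type("decimal"))
```

Note that decimal and complex values both end up as text, so numeric comparisons on those fields are not available on the Weaviate side.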

Dataset name

Weaviate uses classes to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name in the Weaviate class name. This ensures a unique identifier for every class.

For example, if you have a dataset named movies_dataset and a table named actors, the Weaviate class name would be MoviesDataset_Actors (the default separator is an underscore).

However, if you prefer to have class names without the dataset prefix, skip the dataset_name argument.

For example:

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
)
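
The class naming described in this section can be sketched as follows. This is a simplified illustration, not dlt's code: `to_pascal_case` and `weaviate_class_name` are hypothetical helpers, and the Pascal-case conversion covers only plain snake case names.

```python
from typing import Optional


def to_pascal_case(name: str) -> str:
    """Convert a plain snake_case name to PascalCase."""
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)


def weaviate_class_name(
    table_name: str,
    dataset_name: Optional[str] = None,
    separator: str = "_",
) -> str:
    """Build a class name, prefixing the dataset name when one is given."""
    table = to_pascal_case(table_name)
    if not dataset_name:
        return table
    return f"{to_pascal_case(dataset_name)}{separator}{table}"


print(weaviate_class_name("actors", "movies_dataset"))
print(weaviate_class_name("actors"))
```

With a dataset name the result carries the prefix and the separator; without one, only the table name is used.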

Names normalization

When loading data into Weaviate, dlt tries to maintain naming conventions consistent with the Weaviate schema.

Here's a summary of the naming normalization approach:

Table names

  • Snake case identifiers such as snake_case_name get converted to SnakeCaseName (aka Pascal case).
  • Pascal case identifiers such as PascalCaseName remain unchanged.
  • Leading underscores are removed. Hence, _snake_case_name becomes SnakeCaseName.
  • Numbers in names are retained, but if a name starts with a number, it's prefixed with a character, e.g., 1_a_1snake_case_name becomes C1A1snakeCaseName.
  • Double underscores in the middle of names, like Flat__Space, result in a single underscore: Flat_Space. If these appear at the end, they are followed by an 'x', making Flat__Space_ into Flat_Spacex.
  • Special characters and spaces are replaced with underscores, and emojis are simplified. For instance, Flat Sp!ace becomes Flat_SpAce and Flat_Sp💡ace is changed to Flat_SpAce.

Property names

  • Snake case and camel case remain unchanged: snake_case_name and camelCaseName.
  • Names starting with a capital letter have it lowercased: CamelCase becomes camelCase.
  • Names with multiple underscores, such as Snake-______c__ase_, are compacted to snake_c_asex, except when the underscores are leading, in which case they are kept: ___snake_case_name becomes ___snake_case_name.
  • Names starting with a number are prefixed with a "p_". For example, 123snake_case_name becomes p_123snake_case_name.
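
A minimal sketch of three of the property name rules above: unchanged snake and camel case, a lowercased first letter, and the "p_" prefix for names starting with a number. It deliberately leaves out underscore compaction and reserved names, and `normalize_property_name` is a hypothetical helper rather than dlt's naming convention class.

```python
def normalize_property_name(name: str) -> str:
    """Apply a simplified subset of the property naming rules."""
    if not name:
        raise ValueError("property name must not be empty")
    if name[0].isdigit():
        # Names starting with a number get a "p_" prefix.
        return "p_" + name
    # Lowercase only the first character; the rest keeps its casing.
    return name[0].lower() + name[1:]


for raw in ["snake_case_name", "camelCaseName", "CamelCase", "123snake_case_name"]:
    print(raw, "->", normalize_property_name(raw))
```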

Reserved property names

Reserved property names like id or additional are prefixed with underscores for differentiation. Therefore, id becomes __id and _id is rendered as ___id.

Case insensitive naming convention

The default naming convention described above preserves the casing of property names, apart from the first letter, which is lowercased. This produces readable classes in Weaviate, but it also requires that your input data has no property names that clash when compared case-insensitively (e.g., caseName == casename). If such a clash occurs, the Weaviate destination fails to create the classes and reports a conflict.

You can configure an alternative naming convention that lowercases all property names. Clashing properties are then merged and the classes are created. However, a document with clashing properties such as:

{"camelCase": 1, "CamelCase": 2}

it will be normalized to:

{"camelcase": 2}

so your best course of action is to clean up the data yourself before loading and use the default naming convention. Nevertheless, you can configure the alternative in config.toml:

[schema]
naming="dlt.destinations.impl.weaviate.ci_naming"
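
If you follow the advice to clean up the data before loading, a small check like the one below can flag property names that collide when compared case-insensitively. It is a sketch that inspects only top-level keys; `find_case_clashes` is a hypothetical helper and not part of dlt.

```python
from collections import defaultdict
from typing import Dict, List


def find_case_clashes(document: dict) -> Dict[str, List[str]]:
    """Return groups of top-level keys that differ only by letter case."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in document:
        groups[key.lower()].append(key)
    # Keep only the lowercased names shared by more than one original key.
    return {lowered: keys for lowered, keys in groups.items() if len(keys) > 1}


clashes = find_case_clashes({"camelCase": 1, "CamelCase": 2, "title": "x"})
print(clashes)
```

Running such a check on a sample of documents before the first load makes it easier to rename or drop the offending fields deliberately, rather than letting one value silently overwrite the other.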

Additional destination options

  • batch_size: (int) the number of items in the batch insert request. The default is 100.
  • batch_workers: (int) the maximal number of concurrent threads to run batch import. The default is 1.
  • batch_consistency: (str) the number of replica nodes in the cluster that must acknowledge a write or read request before it's considered successful. The default is ONE. The available consistency levels are:
    • ONE: Only one replica node needs to acknowledge.
    • QUORUM: Majority of replica nodes (calculated as replication_factor / 2 + 1) must acknowledge.
    • ALL: All replica nodes in the cluster must send a successful response.
  • batch_retries: (int) number of retries to create a batch that failed with ReadTimeout. The default is 5.
  • dataset_separator: (str) the separator to use when generating the class names in Weaviate.
  • conn_timeout and read_timeout: (float) timeouts, in seconds, for connecting to and reading from the REST API. The defaults are 10.0 and 180.0, respectively.
  • startup_period: (int) how long to wait for Weaviate to start.
  • vectorizer: (str) the name of the vectorizer to use. The default is text2vec-openai.
  • module_config: (dict) configurations of various Weaviate modules.
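
To make the QUORUM formula from the batch_consistency option concrete, here is a small worked sketch. It assumes the division in replication_factor / 2 + 1 is integer division, which is how a majority is normally counted; `quorum_size` is a hypothetical helper used only for illustration.

```python
def quorum_size(replication_factor: int) -> int:
    """Number of replicas that form a majority (integer division assumed)."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


for factor in (1, 3, 4, 5):
    print(factor, "replicas -> quorum of", quorum_size(factor))
```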

Configure Weaviate modules

The default configuration for the Weaviate destination uses text2vec-openai. To configure another vectorizer or a generative module, replace the default module_config value by updating config.toml:

[destination.weaviate]
module_config={text2vec-openai = {}, generative-openai = {}}

This ensures the generative-openai module is used for generative queries.

Run Weaviate fully standalone

Below is an example that configures the contextionary vectorizer. You can put this into config.toml. This configuration does not need external APIs for vectorization and may be used fully offline.

[destination.weaviate]
vectorizer="text2vec-contextionary"
module_config={text2vec-contextionary = { vectorizeClassName = false, vectorizePropertyName = true}}

You can find Docker Compose with the instructions to run here

dbt support

Currently, Weaviate destination does not support dbt.

Syncing of dlt state

Weaviate destination supports syncing of the dlt state.
