Weaviate Explorations: AI and Literature

I first experimented with Chroma in an earlier project, which gave me a foundation for working with vector databases. For this project I decided to switch to Weaviate for its robust capabilities and recent enhancements. I will use two literary works, "Alice in Wonderland" and "The Master and Margarita", to compare the results I get from Chroma and Weaviate. Many of the available tutorials still reference older Weaviate versions, which makes the updated features of the latest release a bit challenging to navigate. Nonetheless, I am excited to explore Weaviate's advanced functionality and see how it measures up against my earlier experience with Chroma.

Weaviate is an open-source vector database designed for handling large-scale, high-dimensional data, making it particularly useful in the realm of large language models (LLMs). It allows for efficient storage and retrieval of embeddings generated by LLMs, facilitating tasks such as semantic search, recommendation systems, and question-answering applications. With its unique combination of a flexible data model and powerful querying capabilities, Weaviate enables developers to build intelligent applications that leverage the power of LLMs while efficiently managing and processing vast amounts of unstructured data.
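To make "retrieval of embeddings" concrete, here is a minimal sketch of what a semantic search does at its core: rank stored vectors by similarity to a query vector. This is only an illustration. The three-dimensional vectors and the texts are made up, and Weaviate itself uses an approximate HNSW index rather than the brute-force scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions or more.
toy_store = [
    ("a tea party with a hatter", [0.9, 0.1, 0.0]),
    ("a trial before Pontius Pilate", [0.0, 0.2, 0.9]),
    ("a rabbit with a pocket watch", [0.8, 0.3, 0.1]),
]

results = top_k([1.0, 0.2, 0.0], toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The query vector points in roughly the same direction as the two "Alice" entries, so they rank first, while the "Pilate" entry is left out.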

Let's begin!

import os
import weaviate

os.environ['OPENAI_API_KEY'] = 'key'  # placeholder: replace with your own OpenAI API key

client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAi-Api-Key": os.environ.get("OPENAI_API_KEY"),  # passed on to the OpenAI modules
    }
)

print("Client is Ready?", client.is_ready())

Here is the output:

{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-10-10T11:06:54+03:00"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"module offload-s3 is enabled","time":"2024-10-10T11:06:54+03:00"}
{"level":"warning","msg":"Multiple vector spaces are present, GraphQL Explore and REST API list objects endpoint module include params has been disabled as a result.","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"open cluster service","servers":{"Embedded_at_8079":47673},"time":"2024-10-10T11:06:54+03:00"}
{"address":"172.31.244.235:47674","level":"info","msg":"starting cloud rpc server ...","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"starting raft sub-system ...","time":"2024-10-10T11:06:54+03:00"}
{"address":"172.31.244.235:47673","level":"info","msg":"tcp transport","tcpMaxPool":3,"tcpTimeout":10000000000,"time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"loading local db","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"local DB successfully loaded","time":"2024-10-10T11:06:54+03:00"}
{"level":"info","msg":"schema manager loaded","n":0,"time":"2024-10-10T11:06:54+03:00"}
{"level":"info","metadata_only_voters":false,"msg":"construct a new raft node","name":"Embedded_at_8079","time":"2024-10-10T11:06:54+03:00"}
{"action":"raft","index":16,"level":"info","msg":"raft initial configuration","servers":"[[{Suffrage:Voter ID:Embedded_at_8079 Address:172.31.244.235:52371}]]","time":"2024-10-10T11:06:54+03:00"}
{"last_snapshot_index":0,"last_store_applied_index":0,"last_store_log_applied_index":33,"level":"info","msg":"raft node constructed","raft_applied_index":0,"raft_last_index":33,"time":"2024-10-10T11:06:54+03:00"}
{"action":"raft","follower":{},"leader-address":"","leader-id":"","level":"info","msg":"raft entering follower state","time":"2024-10-10T11:06:54+03:00"}
{"action":"bootstrap","error":"could not join a cluster from [172.31.244.235:47673]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["172.31.244.235:47673"],"time":"2024-10-10T11:06:55+03:00","voter":true}
{"action":"bootstrap","candidates":[{"Suffrage":0,"ID":"Embedded_at_8079","Address":"172.31.244.235:47673"}],"level":"info","msg":"starting cluster bootstrapping","time":"2024-10-10T11:06:55+03:00"}
{"action":"bootstrap","error":"bootstrap only works on new clusters","level":"error","msg":"could not bootstrapping cluster","time":"2024-10-10T11:06:55+03:00"}
{"action":"bootstrap","level":"info","msg":"notified peers this node is ready to join as voter","servers":["172.31.244.235:47673"],"time":"2024-10-10T11:06:55+03:00"}
{"action":"raft","last-leader-addr":"","last-leader-id":"","level":"warning","msg":"raft heartbeat timeout reached, starting election","time":"2024-10-10T11:06:55+03:00"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":21,"time":"2024-10-10T11:06:55+03:00"}
{"action":"raft","level":"info","msg":"raft election won","tally":1,"term":21,"time":"2024-10-10T11:06:56+03:00"}
{"action":"raft","leader":{},"level":"info","msg":"raft entering leader state","time":"2024-10-10T11:06:56+03:00"}
{"level":"info","msg":"reload local db: update schema ...","time":"2024-10-10T11:06:56+03:00"}
{"index":"WikipediaLangChain","level":"info","msg":"reload local index","time":"2024-10-10T11:06:56+03:00"}
{"docker_image_tag":"unknown","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-10-10T11:06:56+03:00"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50050","time":"2024-10-10T11:06:56+03:00"}
{"address":"172.31.244.235:47673","level":"info","msg":"current Leader","time":"2024-10-10T11:06:56+03:00"}
{"action":"restapi_management","docker_image_tag":"unknown","level":"info","msg":"Serving weaviate at http://127.0.0.1:8079","time":"2024-10-10T11:06:56+03:00"}
Client is Ready? True
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:d1236b0f-b845-428e-a12a-9e0b5c5806f6 Type:INIT Version:1.26.1 NumObjects:0 OS:linux Arch:amd64 UsedModules:[generative-openai text2vec-openai]}","time":"2024-10-10T11:06:56+03:00"}
{"action":"bootstrap","level":"info","msg":"node reporting ready, node has probably recovered cluster from raft config. Exiting bootstrap process","time":"2024-10-10T11:06:57+03:00"}
{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-10T11:06:57+03:00","wait_for_cache_prefill":false}
{"level":"info","msg":"Completed loading shard wikipedialangchain_Pea0cDHeaXZ8 in 11.946172ms","time":"2024-10-10T11:06:57+03:00"}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-10-10T11:06:57+03:00","took":16151273}

Let's create our collection beforehand.

from weaviate import classes as wvc

# Start from a clean slate: drop any previous version of the collection
client.collections.delete("LiteratureLangChain")
# Make sure the collection uses the vectorizer we want
collection = client.collections.create(
    name="LiteratureLangChain",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)

The output is below:

{"level":"warning","msg":"prop len tracker file /home/adduser/.local/share/weaviate/literaturelangchain/iRjLUCUwfkjr/proplengths does not exist, creating new tracker","time":"2024-10-10T11:12:13+03:00"}
{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-10-10T11:12:13+03:00","wait_for_cache_prefill":false}
{"level":"info","msg":"Created shard literaturelangchain_iRjLUCUwfkjr in 3.619562ms","time":"2024-10-10T11:12:13+03:00"}
{"action":"hnsw_vector_cache_prefill","count":1000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-10-10T11:12:13+03:00","took":94404}

Now we have a Weaviate client and a collection! Let's read our two input files, chunk them, and ingest the chunks using LangChain.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_weaviate.vectorstores import WeaviateVectorStore

embeddings = OpenAIEmbeddings()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

# Load and split the first Markdown file
loader1 = TextLoader("Master and Margarita Bulgakov.md")  
docs1 = loader1.load_and_split(text_splitter)
print(f"GOT {len(docs1)} docs for 'Master and Margarita'")

# Load and split the second Markdown file
loader2 = TextLoader("alice_in_wonderland.md")
docs2 = loader2.load_and_split(text_splitter)
print(f"GOT {len(docs2)} docs for 'Alice in Wonderland'")

# Store documents in Weaviate
db1 = WeaviateVectorStore.from_documents(docs1, embeddings, client=client, index_name="LiteratureLangChain")
db2 = WeaviateVectorStore.from_documents(docs2, embeddings, client=client, index_name="LiteratureLangChain")

Here is the output:

GOT 3603 docs for 'Master and Margarita'
GOT 569 docs for 'Alice in Wonderland'
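To show what `chunk_size=400` and `chunk_overlap=50` mean, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to cut at paragraph, line, and word boundaries, so its chunks are usually shorter than 400 characters. The function name `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size=400, chunk_overlap=50):
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "x" * 1000  # placeholder text, 1000 characters long
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Each window starts 350 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.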

Let's count how many objects we have in total

# Aggregate collection info
collection = client.collections.get("LiteratureLangChain")
response = collection.aggregate.over_all(total_count=True)
print(response)

The output is below:

AggregateReturn(properties={}, total_count=4172)

Now, how many objects do we have per source?

response = collection.aggregate.over_all(group_by="source")
for group in response.groups:
    print(group.grouped_by.value, group.total_count)

The output:

Master and Margarita Bulgakov.md 3603
alice_in_wonderland.md 569

Next, let's ask the same question about each book separately, filtering the retrieved chunks by source.

# Define the prompt for RAG
generateTask = "What are the main themes and characters in this literary work?"

# Source files used to restrict each query to a single book
source_file1 = "Master and Margarita Bulgakov.md"
source_file2 = "alice_in_wonderland.md"

query1 = collection.generate.near_text(
    query="literary themes and characters",
    filters=wvc.query.Filter.by_property("source").equal(source_file1),
    limit=10,
    grouped_task=generateTask
)

query2 = collection.generate.near_text(
    query="literary themes and characters",
    filters=wvc.query.Filter.by_property("source").equal(source_file2),
    limit=10,
    grouped_task=generateTask
)

print("Query results for 'Master and Margarita':", query1.generated)
print("Query results for 'Alice in Wonderland':", query2.generated)

Here is the output:

Query results for 'Master and Margarita': The main themes in "Master and Margarita" include formal originality, satire of Soviet life, theatrical rendering of the Great Terror, audacious portrayal of Jesus Christ and Pontius Pilate, and the challenge to the rule of terror in literature. 

The central characters in the novel are Woland (Satan) and his retinue, the poet Ivan Homeless, Pontius Pilate, the unnamed writer known as "the master," and Margarita. Ivan Homeless is a touchstone character who undergoes radical changes and continues the work of the master. 

The novel is composed of two interwoven parts set in contemporary Moscow and ancient Jerusalem, with the Pilate story being a significant aspect. The novel's form excludes psychological analysis and historical commentary, focusing instead on quickness, pungency, and theatricality in its writing.
Query results for 'Alice in Wonderland': The main themes in "Alice's Adventures in Wonderland" include fantasy, imagination, nonsense, and the absurdity of rules and logic. The characters in the story include Alice, the White Rabbit, the Mad Hatter, the Queen of Hearts, the Cheshire Cat, and many other whimsical and eccentric characters. The story follows Alice as she navigates through a strange and surreal world, encountering various challenges and obstacles along the way. The characters and themes in the story reflect the author's exploration of childhood, identity, and the nature of reality.

These are the objects that were retrieved for the 'Master and Margarita' generation:

for obj in query1.objects[0:10]:
    print(obj.properties)

The output is:

{'text': "major novel, the author's crowning work. Then there were the qualities\nof the novel itself --- its formal originality, its devastating satire\nof Soviet life, and of Soviet literary life in particular, its\n'theatrical' rendering of the Great Terror of the thirties, the audacity\nof its portrayal of Jesus Christ and Pontius Pilate, not to mention", 'source': 'Master and Margarita Bulgakov.md'}
{'text': 'version. They also indicate a thematic link between Pilate, the master,\nand the author himself, connecting the historical and contemporary parts\nof the novel.', 'source': 'Master and Margarita Bulgakov.md'}
{'text': "The touchstone character of the novel is Ivan Homeless, who is there at\nthe start, is radically changed by his encounters with Woland and the\nmaster, becomes the latter's 'disciple' and continues his work, is\npresent at almost every turn of the novel's action, and appears finally\nin the epilogue. He remains an uneasy inhabitant of 'normal' reality, as", 'source': 'Master and Margarita Bulgakov.md'}
{'text': "poison', 'Even by moonlight I have no peace' - migrate from one\ncharacter to another, or to the narrator. A more conspicuous case is the\nPilate story itself, successive parts of which are told by Woland,\ndreamed by the poet Homeless, written by the master, and read by\nMargarita, while the whole preserves its stylistic unity. Narrow notions", 'source': 'Master and Margarita Bulgakov.md'}
{'text': "The novel in its definitive version is composed of two distinct but\ninterwoven parts, one set in contemporary Moscow, the other in ancient\nJerusalem (called Yershalaim). Its central characters are Woland (Satan)\nand his retinue, the poet Ivan Homeless, Pontius Pilate, an unnamed\nwriter known as 'the master', and Margarita. The Pilate story is", 'source': 'Master and Margarita Bulgakov.md'}
{'text': "Illustrious Men, was rejected by the publisher. These circumstances are\neverywhere present in [The Master and Margarita,]{.italic} which was in\npart Bulgakov's challenge to the rule of terror in literature. The\nsuccessive stages of his work on the novel, his changing evaluations of\nthe nature of the book and its characters, reflect events in his life", 'source': 'Master and Margarita Bulgakov.md'}
{'text': "These three stories, in form as well as content, embrace virtually all\nthat was excluded from official Soviet ideology and its literature. But\nif the confines of 'socialist realism' are utterly exploded, so are the\nconfines of more traditional novelistic realism. [The Master and\nMargarita]{.italic} as a whole is a consistently free verbal", 'source': 'Master and Margarita Bulgakov.md'}
{'text': 'there was talk of little else. Certain sentences from the novel\nimmediately became proverbial. The very language of the novel was a\ncontradiction of everything wooden, official, imposed. It was a joy to\nspeak.', 'source': 'Master and Margarita Bulgakov.md'}
{'text': "The novel's form excludes psychological analysis and historical\ncommentary. Hence the quickness and pungency of Bulgakov's writing. At\nthe same time, it allows Bulgakov to exploit all the theatricality of\nits great scenes - storms, flight, the attack of vampires, all the\nantics of the demons Koroviev and Behemoth, the seance in the Variety", 'source': 'Master and Margarita Bulgakov.md'}
{'text': '- Modern fiction\n- General & Literary Fiction\n- Modern & contemporary fiction (post c 1945)\n- Fiction\n- General\ntitle: The Master and Margarita\n---', 'source': 'Master and Margarita Bulgakov.md'}
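A grouped task produces one answer from all retrieved objects together, so the answer can only draw on those ten chunks. The sketch below shows the general idea of combining chunks and a task into a single prompt. The exact prompt that Weaviate sends to OpenAI is not shown in this post, so the layout here and the helper name `build_grouped_prompt` are assumptions for illustration.

```python
def build_grouped_prompt(task, retrieved_objects, text_key="text"):
    """Join all retrieved chunks into one context block, followed by the task."""
    context = "\n\n".join(obj[text_key] for obj in retrieved_objects)
    return f"{context}\n\n{task}"

task = "What are the main themes and characters in this literary work?"

# Two shortened chunks taken from the retrieval output above.
retrieved = [
    {"text": "The touchstone character of the novel is Ivan Homeless",
     "source": "Master and Margarita Bulgakov.md"},
    {"text": "The novel's form excludes psychological analysis and historical\ncommentary.",
     "source": "Master and Margarita Bulgakov.md"},
]

prompt = build_grouped_prompt(task, retrieved)
print(prompt)
```

This also helps explain the generated answer: most of the retrieved chunks appear to come from the book's introduction, and the answer reads like a summary of them.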

And of course, we can use different filters and get different content for our questions.

# Define the prompt for RAG
generateTask = "What is common in the food mentioned in these two literary works?"

# List the source files for both literary works
source_files = ["Master and Margarita Bulgakov.md", "alice_in_wonderland.md"]

# Generate the query using the specified question and filtering by the source files
query = collection.generate.near_text(
    query="traditional food",
    filters=wvc.query.Filter.by_property("source").contains_any(source_files),
    limit=10,
    grouped_task=generateTask
)

print(query.generated)

Here is the output:

The common food mentioned in both literary works is soup. In "Master and Margarita" by Bulgakov, there are references to bowls of soup being served in a summer restaurant and a steaming pot of borscht containing a marrow bone. In "Alice in Wonderland," there is a mention of the characters eating comfits, which caused noise and confusion, but they eventually sat down again in a ring and begged the Mouse to tell them something more.
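When a query spans both books with `limit=10`, it is worth checking how many of the retrieved chunks came from each source, since one book may dominate the results. A minimal sketch, assuming property dicts shaped like the ones printed earlier; the 8/2 split is invented for the example and `count_by_source` is a hypothetical helper.

```python
from collections import Counter

def count_by_source(objects, source_key="source"):
    """Count how many retrieved objects came from each source file."""
    return Counter(obj[source_key] for obj in objects)

# Hypothetical retrieval result with placeholder text; the split is made up.
retrieved = (
    [{"text": "placeholder", "source": "Master and Margarita Bulgakov.md"}] * 8
    + [{"text": "placeholder", "source": "alice_in_wonderland.md"}] * 2
)

counts = count_by_source(retrieved)
for source, n in counts.most_common():
    print(source, n)
```

With real results, the same function could be called as `count_by_source([o.properties for o in query.objects])`.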

Let's ask a more difficult question.

# Define the prompt for RAG
generateTask = "How do both authors use humor to convey serious themes in their stories?"

# List the source files for both literary works
source_files = ["Master and Margarita Bulgakov.md", "alice_in_wonderland.md"]

# Generate the query using the specified question and filtering by the source files
query = collection.generate.near_text(
    query="humor",
    filters=wvc.query.Filter.by_property("source").contains_any(source_files),
    limit=10,
    grouped_task=generateTask
)

print(query.generated)

The output is below.

Both authors use humor to convey serious themes in their stories by incorporating elements of absurdity and exaggeration. In "Master and Margarita" by Bulgakov, the descriptions of lavish meals and extravagant settings are exaggerated to the point of being comical, highlighting the absurdity of societal norms and values. The use of humor in these descriptions serves to critique the materialism and superficiality of the characters and society as a whole.

Similarly, in "Alice in Wonderland" by Lewis Carroll, the absurd and nonsensical situations that Alice finds herself in are used to satirize the rigid social conventions and expectations of Victorian society. The humorous interactions between the characters, such as the birds complaining about not being able to taste their comfits or the small ones choking and needing to be patted on the back, serve to highlight the absurdity of societal norms and expectations.

Overall, both authors use humor as a tool to convey deeper themes and critiques of society, using exaggeration and absurdity to shed light on the underlying issues at play.

Finally, let's ask a question about a single book, 'Master and Margarita'.

# Define the prompt for RAG
generateTask = "How does Wolands presence affect the characters and events?"

# Query specifically using the 'Master and Margarita' file
source_file = "Master and Margarita Bulgakov.md"

query = collection.generate.near_text(
    query="literary themes and characters",
    filters=wvc.query.Filter.by_property("source").equal(source_file),
    limit=10,
    grouped_task=generateTask
)

print("Query results for 'Master and Margarita':", query.generated)

The output is:

Query results for 'Master and Margarita': Woland's presence in "The Master and Margarita" has a significant impact on the characters and events in the novel. 

One of the touchstone characters, Ivan Homeless, is radically changed by his encounters with Woland and the master. He becomes the master's disciple and continues his work, playing a key role in the novel's action. Ivan Homeless remains an uneasy inhabitant of "normal" reality, showing the lasting effects of Woland's influence.

Additionally, Woland's presence allows for the migration of themes and motifs throughout the novel. Concepts such as 'madness' and 'poison' move from one character to another, creating a sense of interconnectedness among the characters and events. The Pilate story itself is told by Woland, dreamed by Ivan Homeless, written by the master, and read by Margarita, showcasing the multiple perspectives and layers of storytelling in the novel.

Overall, Woland's presence adds depth and complexity to the characters and events in "The Master and Margarita," shaping their actions and interactions in profound ways.

Returning to my Chroma experiment, let's recall the response to the same question, 'How does Wolands presence affect the characters and events?'.

The response was:

content="Woland's presence causes fear, discomfort, and changes in the lives of the characters. Georges Bengalsky had to give up his work at the Variety due to the memory of black magic and the exposure he experienced. Margarita is in awe of Woland's true image and the transformation of the landscape below them as they fly. Woland's actions, such as raising his sword and transforming a head into a skull, instill a sense of dread and anticipation in the characters. Overall, Woland's presence brings about significant emotional and physical impact on the characters and events in the story." response_metadata={'token_usage': {'completion_tokens': 119, 'prompt_tokens': 513, 'total_tokens': 632}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-cc836402-cbdc-44ae-924e-447011ffa034-0' usage_metadata={'input_tokens': 513, 'output_tokens': 119, 'total_tokens': 632} Sources: ['data\Master and Margarita Bulgakov.md', 'data\Master and Margarita Bulgakov.md', 'data\Master and Margarita Bulgakov.md']

The Chroma response emphasizes the emotional impact of Woland's presence, describing the fear and discomfort he instills in characters like Georges Bengalsky and Margarita. It points to specific actions, such as Woland raising his sword, that create a sense of dread, emphasizing the physical and psychological effects on those around him.

In contrast, the Weaviate response delves deeper into character development, particularly focusing on Ivan Homeless. It illustrates how encounters with Woland transform him into the master's disciple and affect his perception of reality. This response also highlights the thematic interconnectedness that Woland fosters among characters, showcasing how motifs like 'madness' and 'poison' migrate throughout the narrative, reflecting a more intricate understanding of the novel's structure and storytelling layers.

Overall, while both responses recognize Woland's formidable presence, Chroma's answer centers more on immediate emotional reactions, whereas Weaviate's response provides a broader analysis of character evolution and thematic depth. This comparison illustrates the varying focus each platform brings to the analysis of complex literary works.

Conclusion

Chroma and Weaviate both offer powerful capabilities for managing and querying vector data, making them suitable for applications involving large language models. Chroma is known for its simplicity and ease of use, with straightforward integration for storing embeddings and retrieving relevant content. It tends to produce responses that focus on immediate, surface-level details, making it well-suited for tasks requiring quick, context-specific answers.

On the other hand, Weaviate is a more robust solution, designed for handling large-scale data with advanced querying features. It excels in exploring deeper connections within data, often offering richer, more complex insights. This makes it ideal for projects that require a nuanced analysis of themes and relationships across diverse content. However, with Weaviate’s recent updates, the learning curve can be steeper, as many tutorials and resources are still based on older versions.

Overall, Chroma is great for simplicity and speed, while Weaviate provides a more powerful framework for in-depth, sophisticated analyses.

© 2025

Elena Medvedeva. Created by Elena Aseeva. Some assets are created by freepik.com