Nirvana: LLM-powered Semantic Data Analytics Programming Framework

Nirvana is an LLM-powered semantic data analytics programming framework that supports semantic analytics queries over multi-modal data (e.g., text, images, audio). It provides a pandas-like interface with semantic operators that use large language models to process data according to natural language instructions. It also provides an optimizer that finds an execution plan for a given query, striking a balance between quality, runtime, and cost. With Nirvana, users focus on what they want to do rather than how to achieve it.

Step 0: Install Nirvana and set up the default LLM

# Choose one of the following:
# with pip
pip install nirvana-ai
# with uv
uv pip install nirvana-ai
# from the GitHub source
pip install git+https://github.com/JunHao-Zhu/nirvana.git

Before using Nirvana, set up a default LLM. Taking gpt-4o as an example, you can authenticate either by setting the OPENAI_API_KEY environment variable or by passing api_key as shown below.

import nirvana as nv
nv.configure_llm_backbone(model_name="gpt-4o", api_key="YOUR_OPENAI_API_KEY")
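To avoid hardcoding the key in source files, it can be read from the environment before it is passed to configure_llm_backbone. The following is a minimal sketch using only the Python standard library; resolve_api_key is a hypothetical helper written for this guide and is not part of Nirvana.

```python
import os


def resolve_api_key(explicit_key=None, env_var="OPENAI_API_KEY"):
    """Return explicit_key if given, otherwise the value of env_var.

    Hypothetical helper (not part of Nirvana): it fails early with a clear
    message when no key is available, instead of failing on the first LLM call.
    """
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: pass one explicitly or set {env_var}."
        )
    return key


# Intended use with the call shown above (left as a comment so that this
# sketch runs without Nirvana installed):
# import nirvana as nv
# nv.configure_llm_backbone(model_name="gpt-4o", api_key=resolve_api_key())
```

Reading the key at startup keeps credentials out of version control and surfaces a missing key before any query runs.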

Apply Semantic Operators to DataFrame

If you have a simple semantic processing task and want results in a few lines of code, you can call the function wrappers of the semantic operators directly on a pandas DataFrame. Here is an example.

Extract the genre from the movie overview

import pandas as pd
import nirvana as nv

# Two movies, each with a title and a free-text overview.
df = pd.DataFrame(
    {
        "title": ["The Godfather", "The Dark Knight"],
        "overview": [
            "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.",
            "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
        ],
    }
)
# Apply a semantic map: read "overview" and write the result to a new "genre" column.
nv.ops.map(
    df,
    "According to the movie overview, extract the genre of each movie.",
    input_columns=["overview"], output_columns=["genre"], strategy="plain",
)

Possible Output:

MapOpOutputs(
    outputs = {"genre": ["crime, drama", "action, thriller, superhero"]}
)
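Conceptually, the map call above pairs a natural language instruction with the selected input columns of each row and stores the model's answer in a new output column. The sketch below illustrates that idea in plain Python. It is not Nirvana's implementation: semantic_map_sketch and keyword_stub are hypothetical names, and keyword_stub is a trivial stand-in for a real LLM call so the example runs offline.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[str]]  # column name -> list of cell values


def semantic_map_sketch(
    table: Table,
    instruction: str,
    input_columns: List[str],
    output_column: str,
    llm: Callable[[str], str],
) -> Table:
    """Illustrative row-wise map: one prompt per row, one answer per row."""
    n_rows = len(next(iter(table.values())))
    outputs = []
    for i in range(n_rows):
        # Build the prompt from the instruction plus the selected cells of row i.
        fields = "\n".join(f"{col}: {table[col][i]}" for col in input_columns)
        outputs.append(llm(f"{instruction}\n{fields}"))
    # Return a new table with the extra column; the input is left unchanged.
    return {**table, output_column: outputs}


def keyword_stub(prompt: str) -> str:
    """Stand-in for an LLM (assumption for this sketch): a simple keyword rule."""
    return "crime" if "crime" in prompt.lower() else "unknown"


movies: Table = {
    "title": ["The Godfather", "The Dark Knight"],
    "overview": [
        "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.",
        "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.",
    ],
}
result = semantic_map_sketch(
    movies,
    "According to the movie overview, extract the genre of each movie.",
    input_columns=["overview"],
    output_column="genre",
    llm=keyword_stub,
)
```

Because each row is processed independently, the number of model calls grows with the number of rows, which is why reducing the rows that reach a map matters for cost.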

More usage examples for semantic operators can be found in the operators documentation.

Enable Query Optimization

If you have a complex semantic query over a large dataset, you probably want to process it faster and at lower cost. For this case, Nirvana supports lazy execution and query optimization, which automatically search for a plan that reduces runtime and monetary cost. Here is a usage example.

import nirvana as nv

# Describe the query step by step; with lazy execution it runs at optimize_and_execute.
movie = nv.DataFrame.from_external_file("/testdata/movie_data.csv")
movie.semantic_map(user_instruction="According to the movie overview, extract the genre of each movie.", input_columns=["Overview"], output_columns=["Genre"])
movie.semantic_filter(user_instruction="The rating is higher than 7.", input_columns=["IMDB_Rating"])
movie.semantic_filter(user_instruction="The rating is lower than 8.", input_columns=["IMDB_Rating"])
movie.semantic_filter(user_instruction="The movie is a crime movie.", input_columns=["Genre"])
movie.semantic_reduce(user_instruction="Summarize the common plot structure of these high-rated crime movies.", input_column="Overview")

# Enable both logical and physical optimization, then optimize and run the query.
config = nv.optim.OptimizeConfig(do_logical_optimization=True, do_physical_optimization=True, max_rounds=5, num_samples=5, improve_margin=0.2)
result, cost, runtime = movie.optimize_and_execute(optim_config=config)
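To give an intuition for what a logical rewrite can gain on the query above, the sketch below applies filter pushdown to a recorded plan: a filter moves ahead of a preceding map when it does not read any column that the map produces, so fewer rows reach the LLM-based map. This is a generic illustration and not a description of Nirvana's optimizer; Op and push_filters_down are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """One recorded operator of a lazy plan (hypothetical structure)."""
    kind: str                       # "map", "filter", or "reduce"
    instruction: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...] = ()


def push_filters_down(plan: List[Op]) -> List[Op]:
    """Swap a filter with the map right before it when the filter does not
    read any of that map's output columns. Repeat until nothing changes."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            independent = not set(cur.inputs) & set(prev.outputs)
            if cur.kind == "filter" and prev.kind == "map" and independent:
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


# The query from the example above, written as a recorded plan.
plan = [
    Op("map", "Extract the genre of each movie.", ("Overview",), ("Genre",)),
    Op("filter", "The rating is higher than 7.", ("IMDB_Rating",)),
    Op("filter", "The rating is lower than 8.", ("IMDB_Rating",)),
    Op("filter", "The movie is a crime movie.", ("Genre",)),
    Op("reduce", "Summarize the common plot structure.", ("Overview",)),
]
optimized = push_filters_down(plan)
print([op.kind for op in optimized])
```

In this sketch the two rating filters move ahead of the map, while the genre filter stays after it because it depends on the Genre column that the map creates.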

For details and usage of query optimization, refer to the optimization documentation.