DataAnalyzr’s pythonic analysis provides users with a comprehensive toolkit for conducting machine learning-based analysis using Python’s extensive array of libraries such as NumPy, pandas, scikit-learn, and statsmodels. Users can tap into Python’s vast ecosystem of machine learning tools and techniques to derive actionable insights from their data and solve complex analytical problems.

Getting started

Data analysis with DataAnalyzr includes four simple steps:

1

Install lyzr

Install the lyzr package with the data-analyzr variant using pip.

2

Create an instance

Create an instance of the DataAnalyzr class with the desired analysis type and API key.

3

Load data

Load data from files, Redshift, PostgreSQL, or SQLite databases using the get_data method.

4

Ask a question

Ask a question using the ask method to generate visualizations, insights, recommendations, and tasks.

Let’s go over these steps one by one for pythonic data analysis.

Installation

The first step is to install using pip. In order to use lyzr’s data analysis capabilities, install it with the data-analyzr variant.

pip install lyzr[data-analyzr]

This will install the lyzr package alongside all dependencies required for data analysis.

Creating an instance

The next step is to create a class instance.

from lyzr import DataAnalyzr

analyzr = DataAnalyzr(analysis_type="ml", api_key="openai_key")

The analysis_type parameter can take three options:

  1. ml - for analysis with Python code, using Pandas, Scikit-learn and other similar packages.
  2. sql - for SQL analysis.
  3. skip - if you want to skip the analysis altogether, and get insights directly from the uploaded data.
Note that you can also provide an api_key parameter. This parameter is optional and given as an alternative to setting the environment variable.
For documentation on the SQL-based analysis, please refer to the SQL Analysis Agent documentation.

For details on all the options available in instantiating the DataAnalyzr class, visit the API reference.

Loading data

DataAnalyzr provides multiple options for connecting with your data. Whether you are working with data files in CSV, Excel, JSON, etc. formats, or you want to connect to an online database in Redshift, or perhaps you have a local SQLite database, there is an option for you.

The class method used to connect with data is get_data. It takes three parameters - db_type, db_config, and vector_store_config - the values of which depend on the format of the input data. Let’s look at a couple of examples:

Loading data from files

Collect all your data files in a list of dictionaries, with their names, paths and keyword arguments. Then pass this dictionary when calling the get_data method.

db_config = {
    "datasets": [
        {
            "name": "Name of dataset",
            "value": "path/to/dataset.csv",
            # files can be in .csv, .xlsx, .xls, and .json formats
        },
        {
            "name": "Name of dataset",
            "value": "path/to/dataset.xlsx",
            "kwargs": {"sheet_name": "Sheet1"},
            # pass additional keyword arguments for reading the file
        },
        {
            "name": "Name of dataset",
            "value": pd.read_csv("path/to/dataset.csv"),
            # you can also pass pandas DataFrame objects
        },
    ]
}
# Call the get_data method
analyzr.get_data(
    db_type = "files",
    db_config = db_config,
)

The value of db_type tells the system:

  • which type of data it will need to explore
  • what to expect in the db_config parameter

Loading data from Redshift

As a first step, you will need to collect all the Redshift details.

redshift_config = {
    "host": "",
    "port": 3000, # e.g. 5439
    "database": "",
    "user": "",
    "password": "",
    "schema": ["schema1", "public"], # optional
    "tables": ["table1", "table2"], # optional
}
analyzr.get_data(
    db_type = "redshift",
    db_config = redshift_config,
    vector_store_config = { # optional
        "path": "path/to/vector/store",
        "remake": True, # overwrite the existing vector store, if any
    }
)

Once again, note that the value of db_type tells the system:

  • which type of data it will need to explore
  • what to expect in the db_config parameter

While you may pass only one database in db_config, the number of tables and schemas is not limited, they are passed as lists. If no schema and tables are passed, all the tables from the public schema are taken.

Loading data from PostgreSQL

The implementation for PostgreSQL is very similar to that for Redshift. Start by collecting all the DB details.

postgres_config = {
    "host": "",
    "port": 3001, # e.g. 5439
    "database": "",
    "user": "",
    "password": "",
    "schema": ["schema1", "public"], # optional
    "tables": ["table1", "table2"], # optional
}
analyzr.get_data(
    db_type = "postgres",
    db_config = postgres_config,
)

Note that the value of db_type is now postgres, while everything else is the same.

Loading data from SQLite

A local SQLite database can also be used for analysis with DataAnalyzr. You only need to pass the path to this database in the db_config parameter.

analyzr.get_data(
    db_type = "sqlite",
    db_config = {"db_path": "path/to/sqlite.db"},
)

Alternatively, if you have a URL which holds the SQLite database, you can pass this URL as the value of db_path.

Getting results

You can use the DataAnalyzr object to perform an analysis on the DataFrame by passing an analysis query to the method ask. This function enables you to ask questions directly related to the data at hand, allowing DataAnalyzr to process the inquiry and provide the corresponding visualisation, insights, recommendations, and tasks.

A most simple such implementation looks like this:

result = analyzr.ask("Your question here") # returns dict

Here, result has keys visualisation, insights, recommendations and tasks.

You can control the outputs received from ask:

result = analyzr.ask(
    user_input = "Your question here",
    outputs = ["insights"],
)

Here, result still has the keys "visualisation", “insights”, “recommendations” and “tasks” but their values are changed.

>>> print(result["insights"])
This is the insights generated by the LLM analysis...
>>> print(result["recommendations"])
	
>>> repr(result["recommendations"])
"''"

You can also specify the context for the analysis and the generation of outputs. For example, you could let the system know that this analysis is for a specific user profile, or that the outputs should be presented in a specific way.

# A humorous example
result = analyzr.ask(
    "How many columns in my dataset?",
    context = {"insights": "Give your response as a pirate."},
    outputs = ["insights"],
)
print(result['insights'])
- Yer dataset be havin' 11 columns in total, matey.
- Thar be 147 rows o' data, with ranks spannin' from 1 to 147.
- The mean rank be 74, which be the midpoint o' the data, arrr.

Getting visualisations

To retrieve plots from your analysis, use the ask method and pass "visualisation" in your outputs parameter. The dictionary returned has a key "visualisation" which contains the path to the PNG image. By default, visualization images are saved to ./generated_plots/plot.png, but this can be controlled using the plot_path parameter. Here’s an example:

result = analyzr.ask(
    "Your query here",
    outputs = ["visualisation"],
    plot_path = "my_image.png",
)

The result then has a "visualisation" key which is simply the value of plot_path:

>>>> print(result["visualisation"])
my_image.png

It is also possible to pass a directory to plot_path. The generated image is then saved in the directory. Additionally, you may pass a context for the image generation.

result = analyzr.ask(
    "Your query here",
    outputs = ["visualisation"],
    context = {"visualisation": "Your visualisation context here"},
    plot_path = "my_directory",
)
print(result["visualisation"])
> my_directory/plot.png

You can then view the image using pillow or matplotlib:

from PIL import Image

image = Image.open(result["visualisation"])
image.show()

# OR ----

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread(result["visualisation"])
imgplot = plt.imshow(img)
plt.axis('off')
plt.show()