diff --git a/.github/templates/procedure.md b/.github/templates/procedure.md index c6c9ddf..cbf3001 100644 --- a/.github/templates/procedure.md +++ b/.github/templates/procedure.md @@ -95,7 +95,7 @@ of the same thing: You have \. -[create-a-service]: /cloud/tiger/get-started/create-services -[secure-vpc-aws]: /cloud/tiger/secure-access/vpc-peering-and-aws-private-link -[install-linux]: /self-host/timescaledb/install-and-update/install-self-hosted +[create-a-service]: /deploy-and-operate/tiger/get-started/create-services +[secure-vpc-aws]: /deploy-and-operate/tiger/secure-access/vpc-peering-and-aws-private-link +[install-linux]: /deploy-and-operate/timescaledb/install-and-update/install-self-hosted [gdsg]: https://developers.google.com/style/highlights diff --git a/.gitignore b/.gitignore index 8bc210f..410b19b 100644 --- a/.gitignore +++ b/.gitignore @@ -93,7 +93,7 @@ dist # vuepress v2.x temp and cache directory .temp - +n # Sveltekit cache directory .svelte-kit/ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 17ac55b..6fc13d6 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -32,7 +32,7 @@ Each major doc section has a dedicated directory with `.md` files inside, repres - An argument table with `Name`, `Type`, `Default`, `Required`, `Description` columns. - A return table with `Column`, `Type`, and `Description` columns. -- **Troubleshooting pages** are not written as whole Markdown files, but are programmatically assembled from individual files in the`_troubleshooting` folder. Each entry describes a single troubleshooting case and its solution, and contains the following front matter: +- **Troubleshoot pages** are not written as whole Markdown files, but are programmatically assembled from individual files in the`_troubleshooting` folder. Each entry describes a single troubleshooting case and its solution, and contains the following front matter: |Key| Type |Required| Description | |-|-------|-|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -69,7 +69,7 @@ The navigation hierarchy of a doc section is governed by `page-index/page-index. excerpt: "Tiger Cloud services explorer", }, { - title: "Troubleshooting Tiger Cloud services", + title: "Troubleshoot Tiger Cloud services", href: "troubleshooting", type: "placeholder", }, diff --git a/agentic-postgres/agents/mcp-server.mdx b/agentic-postgres/agents/mcp-server.mdx new file mode 100644 index 0000000..81777d8 --- /dev/null +++ b/agentic-postgres/agents/mcp-server.mdx @@ -0,0 +1,223 @@ +--- +title: Integrate Tiger Cloud with your AI Assistant +description: Manage your services and optimize your schema and queries using your AI Assistant +products: [cloud, self_hosted] +keywords: [ai, mcp, server, security] +tags: [ai] +--- + +import RESTPrereqs from '/snippets/prerequisites/_prereqs-cloud-account-only.mdx'; +import CLIINSTALL from '/snippets/devops/_devops-cli-install.mdx'; +import MCPCOMMANDS from '/snippets/devops/_devops-mcp-commands.mdx'; +import MCPCOMMANDSCLI from '/snippets/devops/_devops-mcp-commands-cli.mdx'; +import GLOBALFLAGS from '/snippets/devops/_devops-cli-global-flags.mdx'; +import { MCP_LONG, MCP_SHORT, CLI_LONG, CLI_SHORT, CLOUD_LONG, ACCOUNT_LONG, PROJECT_SHORT, SERVICE_SHORT, SERVICE_LONG, COMPANY } from '/snippets/vars.mdx'; + + +{MCP_LONG} provides access to your {CLOUD_LONG} resources through Claude and other AI Assistants. 
{MCP_SHORT} +mirrors the functionality of {CLI_LONG} and is integrated directly into the {CLI_SHORT} binary. You manage your +{CLOUD_LONG} resources using natural language from your AI Assistant. As {MCP_SHORT} is integrated with the +{COMPANY} documentation, ask any question and you will get the best answer. + +This page shows you how to install {CLI_LONG} and set up secure authentication for {MCP_SHORT}, then manage the +resources in your {ACCOUNT_LONG} through {MCP_LONG} using your AI Assistant. + +## Prerequisites + + + +- Install an AI Assistant on your developer device with an active API key. + + The following AI Assistants are automatically configured by {MCP_LONG}: `claude-code`, `cursor`, `windsurf`, `codex`, `gemini/gemini-cli`, `vscode/code/vs-code`. + You can also [manually configure][manual-config] {MCP_SHORT}. + +## Install and configure MCP Server + +{MCP_SHORT} is bundled with {CLI_LONG}: + + + +1. **Configure your AI Assistant to interact with the {PROJECT_SHORT} and {SERVICE_SHORT}s in your {ACCOUNT_LONG}** + + For example: + ```shell + tiger mcp install + ``` + +1. **Choose the client to integrate with, then press `Enter` ** + + ```shell + Select an MCP client to configure: + + > 1. Claude Code + 2. Codex + 3. Cursor + 4. Gemini CLI + 5. VS Code + 6. Windsurf + + Use ↑/↓ arrows or number keys to navigate, enter to select, q to quit + ``` + +And that is it, you are ready to use {MCP_LONG} to manage your {SERVICE_SHORT}s in {CLOUD_LONG}. + +## Manage the resources in your Tiger Cloud account through your AI Assistant + +Your AI Assistant is connected to your {ACCOUNT_LONG} and the {COMPANY} documentation, you can now use it to +manage your {SERVICE_SHORT}s and learn more about how to implement {CLOUD_LONG} features. For example: + +1. **Run your AI Assistant** + ```shell + claude + ``` + Claude automatically runs {MCP_SHORT} server that enables you to interact with {CLOUD_LONG} from your + AI Assistant. + +1. **Check your {MCP_LONG} configuration** + ```shell + > is the tigerdata mcp server active for you? + ``` + You see something like: + ```shell + MCP server is active. I can see the following Tiger Data-related tools available: + + - mcp__tiger__get_guide - Retrieve TimescaleDB guides and best practices + - mcp__tiger__semantic_search_postgres_docs - Search PostgreSQL documentation + - mcp__tiger__semantic_search_tiger_docs - Search Tiger Cloud and TimescaleDB documentation + - mcp__tiger__tiger_service_create - Create new database services + - mcp__tiger__tiger_service_list - List all database services + - mcp__tiger__tiger_service_show - Show detailed service information + - mcp__tiger__tiger_service_update_password - Update service passwords + + Is there something specific you'd like to do with the Tiger Data MCP server? + ``` + +1. **Ask a basic question about your {SERVICE_SHORT}s** + ```shell + > can you list my active services please + ``` + You see something like: + ```shell + ⏺ You have 3 active services, all with the same name "my-first-service": + + 1. tgrservice1 - READY - TimescaleDB - us-east-1 - 1 CPU/4GB - Created: 2025-09-30 + 2. tgrservice1 - READY - TimescaleDB - us-east-1 - 1 CPU/4GB - Created: 2025-09-30 + 3. tgrservice1 - READY - TimescaleDB - us-east-1 - 1 CPU/4GB - Created: 2025-09-30 + ``` + +1. 
**Manage your {SERVICE_SHORT}s without having to learn how to** + + For example: + ```shell + Can you create a new ai service called bob with a replica for me please + ``` + You see something like: + ```shell + tiger - Create Database Service (MCP)(name: "bob", addons: ["ai"], replicas: 1, wait: false) + ``` + +1. **Find best practice for things you need to do** + + For example: + ```shell + > I need to migrate a database with 25 million rows of data to Tiger Cloud, + please tell me what service optimizations i should do before migration. + + ``` + + You see something like: + + ```shell + ⏺ For migrating a database with 25 million rows to Tiger Cloud, here are the key service optimizations to do before migration: + + Pre-Migration Service Optimizations + + 1. Right-size your compute resources: + - For 25M rows, start with at least 2 CPU/8GB or 4 CPU/16GB + - You can scale up during migration if needed, then scale down after + + 2. Enable the appropriate addons: + - Use time-series addon if you are working with time-series data + - Use ai addon if you need vector/AI capabilities + + 3. Choose optimal region: + - Select the region closest to your current database to minimize network transfer time + - This significantly reduces migration duration + + 4. Consider enhanced storage (Enterprise only): + - If you expect high write throughput during migration, enhanced storage provides better IOPS (up to 32,000) + + 5. Skip replicas initially: + - Start with 0 replicas during migration + - Add replicas after migration completes to avoid duplicate data copying + + 6. Plan for storage space: + - Estimate your uncompressed data size + - Allocate sufficient high-performance storage (scales automatically but good to plan) + - You can enable tiered storage later for older data + + During Migration Best Practices + + Migration method selection: + - 25M rows: Use schema-then-data migration (migrate schema first, then copy data in chunks) + - Allows restart if a copy operation fails + - Better for databases >10GB + + Database configuration to set AFTER migration: + - Set up compression policies (once data is loaded) + - Configure retention policies + - Create continuous aggregates + - Enable tiered storage for cost optimization + + Connection settings: + - Use the pooled connection endpoint for better performance during bulk inserts + - Disable unnecessary constraints/indexes during migration, re-enable after + + Would you like me to help you create an optimized service for this migration? + ``` + +That beats working. Let {MCP_SHORT} do it all for you. + +## Manually configure MCP Server + +If your MCP client is not supported by `tiger mcp install`, follow the client's instructions to install +MCP servers. 
For example, many clients use a JSON file like the following that use `tiger mcp start` to +start {MCP_LONG}: + +```json +{ + "mcpServers": { + "tiger": { + "command": "tiger", + "args": [ + "mcp", + "start" + ] + } + } +} +``` + +## Tiger MCP Server tools + + + +## Tiger CLI commands for MCP Server + + + +## Global flags + +You can use the following {CLI_LONG} global flags when you run {MCP_SHORT}: + + + + +[rest-api-reference]: /api/api-reference +[rest-api-credentials]: https://console.cloud.timescale.com/dashboard/settings +[get-project-id]: /integrations/find-connection-details#find-your-project-and-service-id +[create-client-credentials]: /integrations/find-connection-details#create-client-credentials +[curl]: https://curl.se/ +[cloud-regions]: /deploy-and-operate/understand/regions +[readreplica]: /deploy-and-operate/scale/ha-replicas +[manual-config]: #manually-configure-mcp-server \ No newline at end of file diff --git a/agentic-postgres/agents/tiger-agents-for-work.mdx b/agentic-postgres/agents/tiger-agents-for-work.mdx new file mode 100644 index 0000000..34af219 --- /dev/null +++ b/agentic-postgres/agents/tiger-agents-for-work.mdx @@ -0,0 +1,277 @@ +--- +title: Integrate a slack-native AI agent +description: Unify company knowledge with slack-native AI agents +products: [cloud] +keywords: [ai, vector, pgvector, TigerData vector, pgvectorizer] +tags: [ai, vector, pgvectorizer] +--- + +import { AGENTS_LONG, AGENTS_SHORT, AGENTS_CLI, PG, COMPANY, MCP_SHORT, CLOUD_LONG, SERVICE_LONG, CONSOLE } from '/snippets/vars.mdx'; +import PrereqAccount from '/snippets/prerequisites/_prereqs-cloud-and-self.mdx'; + +{AGENTS_LONG} is a Slack-native AI agent that you use to unify the knowledge in your company. This includes your Slack +history, docs, GitHub repositories, Salesforce and so on. You use your {AGENTS_SHORT} to get instant answers for real +business, technical, and operations questions in your Slack channels. + +![Query Tiger Agent](https://assets.timescale.com/docs/images/tiger-agent/query-in-slack.png) + +{AGENTS_LONG} can handle concurrent conversations with enterprise-grade reliability. They have the following features: + +- **Durable and atomic event handling**: {PG}-backed event claiming ensures exactly-once processing, even under high concurrency and failure conditions +- **Bounded concurrency**: fixed worker pools prevent resource exhaustion while maintaining predictable performance under load +- **Immediate event processing**: {AGENTS_LONG} provide real-time responsiveness. Events are processed within milliseconds of arrival rather than waiting for polling cycles +- **Resilient retry logic**: automatic retry with visibility thresholds, plus stuck or expired event cleanup +- **Horizontal scalability**: run multiple {AGENTS_SHORT} instances simultaneously with coordinated work distribution across all instances +- **AI-Powered Responses**: use the AI model of your choice, you can also integrate with MCP servers +- **Extensible architecture**: zero code integration for basic agents. For more specialized use cases, easily customize your agent using [Jinja templates][jinja-templates] +- **Complete observability**: detailed tracing of event flow, worker activity, and database operations with full [Logfire][logfire] instrumentation + +This page shows you how to install the {AGENTS_CLI}, connect to the {COMPANY} MCP server, and customize prompts for +your specific needs. 
+ +## Prerequisites + + + +- Install the [uv package manager][uv-install] +- Get an [Anthropic API key][claude-api-key] +- Optional: get a [Logfire token][logfire] + +## Create a Slack app + +Before installing {AGENTS_LONG}, you need to create a Slack app that the {AGENTS_SHORT} will connect to. This app +provides the security tokens for Slack integration with your {AGENTS_SHORT}: + +1. **Create a manifest for your Slack App** + + 1. In a temporary directory, download the {AGENTS_SHORT} Slack manifest template: + + ```bash + curl -O https://raw.githubusercontent.com/timescale/tiger-agents-for-work/main/slack-manifest.json + ``` + + 1. Edit `slack-manifest.json` and customize your name and description of your Slack App. For example: + + ```json + "display_information": { + "name": "Tiger Agent", + "description": "Tiger AI Agent helps you easily access your business information, and tune your Tiger services", + "background_color": "#000000" + }, + "features": { + "bot_user": { + "display_name": "Tiger Agent", + "always_online": true + } + }, + ``` + + 1. Copy the contents of `slack-manifest.json` to the clipboard: + + ```shell + cat slack-manifest.json| pbcopy + ``` + +1. **Create the Slack app** + + 1. Go to [api.slack.com/apps](https://api.slack.com/apps). + 1. Click `Create New App`. + 1. Select `From a manifest`. + 1. Choose your workspace, then click `Next`. + 1. Paste the contents of `slack-manifest.json` and click `Next`. + 1. Click `Create`. +1. **Generate an app-level token** + + 1. In your app settings, go to `Basic Information`. + 1. Scroll to `App-Level Tokens`. + 1. Click `Generate Token and Scopes`. + 1. Add a `Token Name`, then click `Add Scope`, add `connections:write` then click `Generate`. + 1. Copy the `xapp-*` token locally and click `Done`. + +1. **Install your app to a Slack workspace** + + 1. In the sidebar, under `Settings`, click `Install App`. + 1. Click `Install to `, then click `Allow`. + 1. Copy the `xoxb-` Bot User OAuth Token locally. + +You have created a Slack app and obtained the necessary tokens for {AGENTS_SHORT} integration. + + +## Install and configure your Agent instance + +{AGENTS_LONG} are a production-ready library and CLI written in Python that you use to create Slack-native AI agents. +This section shows you how to configure a {AGENTS_SHORT} to connect to your Slack app, and give it access to your +data and analytics stored in {CLOUD_LONG}. + +1. **Create a project directory** + + ```bash + mkdir my-tiger-agent + cd my-tiger-agent + ``` + +1. **Create a {AGENTS_SHORT} environment with your Slack, AI Assistant, and database configuration** + + 1. Download `.env.sample` to a local `.env` file: + ```shell + curl -L -o .env https://raw.githubusercontent.com/timescale/tiger-agent/refs/heads/main/.env.sample + ``` + 1. In `.env`, add your Slack tokens and Anthropic API key: + + ```bash + # Slack tokens (from the Slack app you created) + SLACK_APP_TOKEN=xapp-your-app-token + SLACK_BOT_TOKEN=xoxb-your-bot-token + + # Anthropic API key + ANTHROPIC_API_KEY=sk-ant-your-api-key + + # Optional: Logfire token for enhanced logging + LOGFIRE_TOKEN=your-logfire-token + ``` + 1. Add the [connection details][connection-info] for the {SERVICE_LONG} you are using for this {AGENTS_SHORT}: + ```bash + PGHOST= + PGDATABASE=tsdb + PGPORT= + PGUSER=tsdbadmin + PGPASSWORD= + ``` + 1. Save and close `.env`. + +1. 
**Add the default {AGENTS_SHORT} prompts to your project** + ```bash + mkdir prompts + curl -L -o prompts/system_prompt.md https://raw.githubusercontent.com/timescale/tiger-agent/refs/heads/main/prompts/system_prompt.md + curl -L -o prompts/user_prompt.md https://raw.githubusercontent.com/timescale/tiger-agent/refs/heads/main/prompts/user_prompt.md + ``` + +1. **Install {AGENTS_LONG} to manage and run your AI-powered Slack bots** + + 1. Install the {AGENTS_CLI} using uv. + + ```bash + uv tool install --from git+https://github.com/timescale/tiger-agents-for-work.git tiger-agent + ``` + `tiger-agent` is installed in `~/.local/bin/tiger-agent`. If necessary, add this folder to your `PATH`. + + 1. Verify the installation. + + ```bash + tiger-agent --help + ``` + + You see the {AGENTS_CLI} help output with the available commands and options. + + +1. **Connect your {AGENTS_SHORT} with Slack** + + 1. Run your {AGENTS_SHORT}: + ```bash + tiger-agent run --prompts prompts/ --env .env + ``` + If you open the explorer in [{CONSOLE}][portal-ops-mode], you can see the tables used by your {AGENTS_SHORT}. + + 1. In Slack, open a public channel app and ask {AGENTS_SHORT} a couple of questions. You see the response in your + public channel and log messages in the terminal. + + ![Query Tiger Agent](https://assets.timescale.com/docs/images/tiger-agent/query-in-terminal.png) + +## Add information from MCP servers to your Agent + +To increase the amount of specialized information your AI Assistant can use, you can add MCP servers supplying data +your users need. For example, to add the {COMPANY} MCP server to your {AGENTS_SHORT}: + +1. **Copy the example `mcp_config.json` to your project** + + In `my-tiger-agent`, run the following command: + + ```bash + curl -L -o mcp_config.json https://raw.githubusercontent.com/timescale/tiger-agent/refs/heads/main/examples/mcp_config.json + ``` + +1. **Configure your {AGENTS_SHORT} to connect to the most useful MCP servers for your organization** + + For example, to add the {COMPANY} documentation MCP server to your {AGENTS_SHORT}, update the docs entry to the + following: + ```json + "docs": { + "tool_prefix": "docs", + "url": "https://mcp.tigerdata.com/docs", + "allow_sampling": false + }, + ``` + To avoid errors, delete all entries in `mcp_config.json` with invalid URLs. For example the `github` entry with `http://github-mcp-server/mcp`. + +1. **Restart your {AGENTS_SHORT}** + ```bash + tiger-agent run --prompts prompts/ --mcp-config mcp_config.json + ``` + +You have configured your {AGENTS_SHORT} to connect to {MCP_SHORT}. For more information, +see [MCP Server Configuration][mcp-configuration-docs]. + +## Customize prompts for personalization + +{AGENTS_LONG} uses Jinja2 templates for dynamic, context-aware prompt generation. This system allows for sophisticated +prompts that adapt to conversation context, user preferences, and event metadata. {AGENTS_LONG} uses the following +templates: + +- `system_prompt.md`: defines the AI Assistant's role, capabilities, and behavior patterns. This template sets the + foundation for the way your {AGENTS_SHORT} will respond and interact. +- `user_prompt.md`: formats the user's request with relevant context, providing the AI Assistant with the + information necessary to generate an appropriate response. + +To change the way your {AGENTS_SHORT}s interact with users in your Slack app: + +1. 
**Update the prompt** + + For example, in `prompts/system_prompt.md`, add another item in the `Response Protocol` section to fine tune + the behavior of your {AGENTS_SHORT}s. For example: + ```shell + 5. Be snarky but vaguely amusing + ``` + +1. **Test your configuration** + + Run {AGENTS_SHORT} with your custom prompt: + + ```bash + tiger-agent run --mcp-config mcp_config.json --prompts prompts/ + ``` + +For more information, see [Prompt tempates][prompt-templates]. + +## Advanced configuration options + +For additional customization, you can modify the following {AGENTS_SHORT} parameters: + +- `--model`: change AI model (default: `anthropic:claude-sonnet-4-20250514`) +- `--num-workers`: adjust concurrent workers (default: `5`) +- `--max-attempts`: set retry attempts per event (default: `3`) + +Example with custom settings: + +```bash +tiger-agent run \ + --model claude-3-5-sonnet-latest \ + --mcp-config mcp_config.json \ + --prompts prompts/ \ + --num-workers 10 \ + --max-attempts 5 +``` + +Your {AGENTS_SHORT}s are now configured with {COMPANY} MCP server access and personalized prompts. + + + + +[jinja-templates]: https://jinja.palletsprojects.com/en/stable/ +[logfire]: https://pydantic.dev/logfire +[claude-api-key]: https://console.anthropic.com/settings/keys +[create-a-service]: /deploy-and-operate/get-started/create-services +[uv-install]: https://docs.astral.sh/uv/getting-started/installation/ +[connection-info]: /integrations/find-connection-details +[portal-ops-mode]: https://console.cloud.timescale.com/dashboard/services +[mcp-configuration-docs]: https://github.com/timescale/tiger-agents-for-work/blob/main/docs/mcp_config.md +[prompt-templates]: https://github.com/timescale/tiger-agents-for-work/blob/main/docs/prompt_templates.md diff --git a/agentic-postgres/agents/tiger-eon.mdx b/agentic-postgres/agents/tiger-eon.mdx new file mode 100644 index 0000000..3b696b4 --- /dev/null +++ b/agentic-postgres/agents/tiger-eon.mdx @@ -0,0 +1,176 @@ +--- +title: Aggregate organizational data with AI agents +description: Unify company knowledge with slack-native AI agents +products: [cloud, self_hosted] +keywords: [ai, vector, pgvector, TigerData vector, pgvectorizer] +tags: [ai, vector, pgvectorizer] +--- + +import { EON_LONG, EON_SHORT, AGENTS_LONG, AGENTS_SHORT, CLOUD_LONG, SERVICE_LONG, PG, TIMESCALE_DB, CLI_LONG } from '/snippets/vars.mdx'; +import PrereqAccount from '/snippets/prerequisites/_prereqs-cloud-and-self.mdx'; + +Your business already has the answers in Slack threads, GitHub pull requests, Linear tasks, your own docs, Salesforce +service tickets, anywhere you store data. However, those answers are scattered, hard to find, and often forgotten. +{EON_LONG} automatically integrates {AGENTS_LONG} with your organizational data so you can let AI Assistants analyze your +company data and give you the answers you need. For example: +- What did we ship last week? +- What's blocking the release? +- Summarize the latest GitHub pull requests. + +{EON_SHORT} responds instantly, pulling from the tools you already use. No new UI, no new workflow, just answers in Slack. + +![Query Tiger Agent](https://assets.timescale.com/docs/images/tiger-eon-big-question.png) + +{EON_LONG}: + +- **Unlocks hidden value**: your data in Slack, GitHub, and Linear already contains the insights you need. {EON_SHORT} makes them accessible. +- **Enables faster decisions**: no need to search or ask around, you get answers in seconds. 
+- **Is easy to use**: {EON_SHORT} runs a {AGENTS_SHORT} and MCP servers statelessly in lightweight Docker containers. +- **Integrates seamlessly with {CLOUD_LONG}**: {EON_SHORT} uses a {SERVICE_LONG} so you securely and reliably store + your company data. Prefer to self-host? Use a [{PG} instance with {TIMESCALE_DB}](/deploy-and-operate/timescaledb/install-and-update/install-self-hosted). + +{EON_LONG}'s real-time ingestion system connects to Slack and captures everything: every message, reaction, edit, and +channel update. It can also process historical Slack exports. {EON_SHORT} had instant access to years +of institutional knowledge from the very beginning. + +All of this data is stored in your {SERVICE_LONG} as time-series data: conversations are events unfolding over time, +and {CLOUD_LONG} is purpose-built for precisely this. Your data is optimized by: + +- Automatically partitioning the data into 7-day chunks for efficient queries +- Compressing the data after 45 days to save space +- Segmenting by channel for faster retrieval + +When someone asks {EON_SHORT} a question, it uses simple SQL to instantly retrieve the full thread context, related +conversations, and historical decisions. No rate limits. No API quotas. Just direct access to your data. + +This page shows you how to install and run {EON_SHORT}. + +## Prerequisites + + + +- [Install Docker](https://docs.docker.com/engine/install/) on your developer device +- Install [{CLI_LONG}](https://github.com/timescale/tiger-cli/) +- Have rights to create an [Anthropic API key](https://console.anthropic.com/settings/keys) +- Optionally: + - Have rights to create a [GitHub token](https://github.com/settings/tokens/new?description=Tiger%20Agent&scopes=repo,read:org) + - Have rights to create a [Logfire token](http://logfire.pydantic.dev/docs/how-to-guides/create-write-tokens/) + - Have rights to create a [Linear token](https://linear.app/docs/api-and-webhooks#api-keys) + +## Interactive setup + +{EON_LONG} is a production-ready repository running [{CLI_LONG}](https://github.com/timescale/tiger-cli/) and [{AGENTS_LONG}](https://github.com/timescale/tiger-agents-for-work) that creates +and runs the following components for you: + +- An ingest Slack app that consumes all messages and reactions from public channels in your Slack workspace +- A [{AGENTS_SHORT}](https://github.com/timescale/tiger-agents-for-work) that analyzes your company data for you +- A {SERVICE_LONG} instance that stores data from the Slack apps +- MCP servers that connect data sources to {EON_SHORT} +- A listener Slack app that passes questions to the {AGENTS_SHORT} when you @tag it in a public channel, and returns the + AI analysis on your data + +All local components are run in lightweight Docker containers via Docker Compose. + +This section shows you how to run the {EON_SHORT} setup to configure {EON_SHORT} to connect to your Slack app, and give it access to your +data and analytics stored in {CLOUD_LONG}. + +1. **Install {EON_LONG} to manage and run your AI-powered Slack bots** + + In a local folder, run the following command from the terminal: + ```shell + git clone git@github.com:timescale/tiger-eon.git + ``` + +2. **Start the {EON_SHORT} setup** + + ```shell + cd tiger-eon + ./setup-tiger-eon.sh + ``` + You see a summary of the setup procedure. Type `y` and press `Enter`. + +3. **Create the {SERVICE_LONG} to use with {EON_SHORT}** + + You see `Do you want to use a free tier Tiger Cloud Database? [y/N]:`. Press `Y` to create a free {SERVICE_LONG}. 
   {EON_SHORT} opens the {CLOUD_LONG} authentication page in your browser. Click `Authorize`. {EON_SHORT} creates a
   {SERVICE_LONG} called [tiger-eon](https://console.cloud.timescale.com/dashboard/services) and stores the credentials in your local keychain.

   If you press `N`, the {EON_SHORT} setup creates and runs {TIMESCALE_DB} in a local Docker container.

4. **Create the ingest Slack app**

   1. In the terminal, name your ingest Slack app:
      - {EON_SHORT} proposes to create an ingest app called `tiger-slack-ingest`, press `Enter`.
      - Do the same for the app description.
      - {EON_SHORT} opens `Your Apps` in https://api.slack.com/apps/.

   2. Start configuring your ingest app in Slack:
      - In the Slack `Your Apps` page:
        - Click `Create New App`, click `From a manifest`, then select a workspace.
        - Click `Next`. Slack opens `Create app from manifest`.

   3. Add the Slack app manifest:
      - In the terminal, press `Enter`. The setup prints the Slack app manifest to the terminal and copies it to your clipboard.
      - In the Slack `Create app from manifest` window, paste the manifest.
      - Click `Next`, then click `Create`.

   4. Configure an app-level token:
      - In your app settings, go to `Basic Information`.
      - Scroll to `App-Level Tokens`.
      - Click `Generate Token and Scopes`.
      - Add a `Token Name`, click `Add Scope`, add `connections:write`, then click `Generate`.
      - Copy the `xapp-*` token and click `Done`.
      - In the terminal, paste the token, then press `Enter`.

   5. Configure a bot user OAuth token:
      - In your app settings, under `Features`, click `App Home`.
      - Scroll down, then enable `Allow users to send Slash commands and messages from the messages tab`.
      - In your app settings, under `Settings`, click `Install App`.
      - Click `Install to <your workspace>`, then click `Allow`.
      - Copy the `xoxb-` Bot User OAuth Token locally.
      - In the terminal, paste the token, then press `Enter`.

5. **Create the {EON_SHORT} Slack app**

   Follow the same procedure as you did for the ingest Slack app.

6. **Integrate {EON_SHORT} with Anthropic**

   The {EON_SHORT} setup opens https://console.anthropic.com/settings/keys. Create a Claude Code key, then
   paste it in the terminal.

7. **Integrate {EON_SHORT} with Logfire**

   If you would like to integrate Logfire with {EON_SHORT}, paste your token and press `Enter`. If not, press `Enter`.

8. **Integrate {EON_SHORT} with GitHub**

   The {EON_SHORT} setup asks if you would like to enable the GitHub MCP server, which lets {EON_SHORT} answer
   questions about the activity in your GitHub organization. Press `y` to integrate with GitHub.

9. **Integrate {EON_SHORT} with Linear**

   The {EON_SHORT} setup asks if you would like to `Enable linear MCP server? [y/N]:`. Press `y` to integrate with Linear.

10. **Give {EON_SHORT} access to private repositories**

    1. The setup asks if you would like to include access to private repositories. Press `y`.
    2. Follow the GitHub token creation process.
    3. In the {EON_SHORT} setup, add your organization name, then paste the GitHub token.

    The setup creates the new {SERVICE_LONG} called `tiger-eon` for you, then starts {EON_SHORT} in Docker.
+ + ![Eon running in Docker](https://assets.timescale.com/docs/images/tiger-eon-docker-services.png) + +You have created: +* The {EON_SHORT} ingest and chat apps in Slack +* A private MCP server connecting {EON_SHORT} to your data in GitHub +* A {SERVICE_LONG} that securely stores the data used by {EON_SHORT} + +## Integrate Eon in your Slack workspace + +To enable your AI Assistant to analyze your data for you when you ask a question, open a public channel, +invite `@eon` to join, then ask a question: + +![Eon running in Docker](https://assets.timescale.com/docs/images/tiger-eon-slack-channel-add.png) diff --git a/agentic-postgres/index.mdx b/agentic-postgres/index.mdx new file mode 100644 index 0000000..06c19ac --- /dev/null +++ b/agentic-postgres/index.mdx @@ -0,0 +1,91 @@ +--- +title: Agentic Postgres +sidebarTitle: Overview +description: Build AI assistants and agents with Tiger Data using pgvector, pgai, pgvectorscale, and intelligent agent frameworks +products: [cloud, mst, self_hosted] +keywords: [ai, vector, pgvector, pgvectorscale, pgai, agents, assistants] +mode: "wide" +--- + + + + Get started with AI on Tiger Data. Set up pgai and pgvectorscale to automate embeddings, perform semantic search, and build intelligent applications. + + + +## By use case + + + + Build high-performance semantic search with vector embeddings. Use pgvectorscale for billion-scale similarity search and intelligent document retrieval. + + + Create enterprise-grade multi-agent systems. Deploy Slack-native AI agents with durable event handling, horizontal scalability, and complete observability. + + + Automate Retrieval-Augmented Generation workflows. Use the vectorizer to automatically generate, sync, and update embeddings from your data sources. + + + Build organizational AI that integrates with your data. Connect Slack, GitHub, and Linear for real-time AI-powered insights and analytics. + + + +## By product + + + + Complete organizational AI that automatically integrates agents with your data from Slack, GitHub, and Linear. Process data in real-time with time-series partitioning. + + + Enterprise-grade Slack-native AI agents with durable event handling, horizontal scalability, and flexible model choices. Get complete observability and integrate with specialized data sources. + + + Integrate Tiger Data directly with AI assistants like Claude Code, Cursor, and VS Code. Manage services and optimize queries through natural language with secure authentication. + + + Automate AI workflows in your database with embeddings, vector search, and LLM integrations. Use the vectorizer to automatically generate and sync embeddings from your data. + + + High-performance vector search with StreamingDiskANN indexing. Extend pgvector with optimized algorithms for billion-scale vector workloads and faster similarity search. + + diff --git a/agentic-postgres/interfaces/python-interface.mdx b/agentic-postgres/interfaces/python-interface.mdx new file mode 100644 index 0000000..1969e1d --- /dev/null +++ b/agentic-postgres/interfaces/python-interface.mdx @@ -0,0 +1,799 @@ +--- +title: Python interface for pgvector and pgvectorscale +description: Working with pgvectorscale and pgvector in python +products: [cloud] +keywords: [ai, vector, pgvector, TigerData vector, pgvectorscale, python] +tags: [ai, vector, python] +--- + +import { SERVICE_LONG, CLOUD_LONG, PG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +You use pgai to power production grade AI applications. 
`timescale_vector` is the + Python interface you use to interact with a pgai on {SERVICE_LONG} programmatically. + +Before you get started with `timescale_vector`: + +- [Sign up for pgai on {CLOUD_LONG}](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=docs&utm_medium=direct): Get 90 days free to try pgai on {CLOUD_LONG}. +- [Follow the Get Started Tutorial](https://timescale.github.io/python-vector/tsv_python_getting_started_tutorial.html): +Learn how to use pgai on {CLOUD_LONG} for semantic search on a real-world dataset. + +If you prefer to use an LLM development or data framework, see pgai's integrations with [LangChain](https://python.langchain.com/docs/integrations/vectorstores/timescalevector) and [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/Timescalevector.html). + +## Prerequisites + +`timescale_vector` depends on the source distribution of `psycopg2` and adheres +to [best practices for psycopg2](https://www.psycopg.org/docs/install.html#psycopg-vs-psycopg-binary). + +Before you install `timescale_vector`: + +- Follow the [psycopg2 build prerequisites](https://www.psycopg.org/docs/install.html#build-prerequisites). + +## Install + +To interact with pgai on {CLOUD_LONG} using Python: + +1. Install `timescale_vector`: + + ```bash + pip install timescale_vector + ``` +1. Install `dotenv`: + + ```bash + pip install python-dotenv + ``` + + In these examples, you use `dotenv` to pass secrets and keys. + +That is it, you are ready to go. + +## Basic usage of the timescale_vector library + +First, import all the necessary libraries: + +``` python +from dotenv import load_dotenv, find_dotenv +import os +from timescale_vector import client +import uuid +from datetime import datetime, timedelta +``` + +Load up your {PG} credentials, the safest way is with a `.env` file: + +``` python +_ = load_dotenv(find_dotenv(), override=True) +service_url = os.environ['TIMESCALE_SERVICE_URL'] +``` + +Next, create the client. This tutorial, uses the sync client. But the library has an async client as well (with an identical interface that +uses async functions). + +The client constructor takes three required arguments: + +| name | description | +|----------------|-------------------------------------------------------------------------------------------| +| `service_url` | {SERVICE_LONG} URL / connection string | +| `table_name` | Name of the table to use for storing the embeddings. Think of this as the collection name | +| `num_dimensions` | Number of dimensions in the vector | + +``` python +vec = client.Sync(service_url, "my_data", 2) +``` + +Next, create the tables for the collection: + +``` python +vec.create_tables() +``` + +Next, insert some data. The data record contains: + +- A UUID to uniquely identify the embedding +- A JSON blob of metadata about the embedding +- The text the embedding represents +- The embedding itself + +Because this data includes UUIDs which become primary keys, upserts should be used for ingest. 
+ +``` python +vec.upsert([\ + (uuid.uuid1(), {"animal": "fox"}, "the brown fox", [1.0,1.3]),\ + (uuid.uuid1(), {"animal": "fox", "action":"jump"}, "jumped over the", [1.0,10.8]),\ +]) +``` + +You can now create a vector index to speed up similarity search: + +``` python +vec.create_embedding_index(client.TimescaleVectorIndex()) +``` + +Then, you can query for similar items: + +``` python +vec.search([1.0, 9.0]) +``` + +```python +[[UUID('73d05df0-84c1-11ee-98da-6ee10b77fd08'), + {'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456], + [UUID('73d05d6e-84c1-11ee-98da-6ee10b77fd08'), + {'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +There are many search options which are covered below in the +`Advanced search` section. + +A simple search example that returns one item using a similarity search +constrained by a metadata filter is shown below: + +``` python +vec.search([1.0, 9.0], limit=1, filter={"action": "jump"}) +``` + +```python +[[UUID('73d05df0-84c1-11ee-98da-6ee10b77fd08'), + {'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +The returned records contain 5 fields: + +| name | description | +|-----------|---------------------------------------------------------| +| id | The UUID of the record | +| metadata | The JSON metadata associated with the record | +| contents | the text content that was embedded | +| embedding | The vector embedding | +| distance | The distance between the query embedding and the vector | + +You can access the fields by simply using the record as a dictionary +keyed on the field name: + +``` python +records = vec.search([1.0, 9.0], limit=1, filter={"action": "jump"}) +(records[0]["id"],records[0]["metadata"], records[0]["contents"], records[0]["embedding"], records[0]["distance"]) +``` + +```python +(UUID('73d05df0-84c1-11ee-98da-6ee10b77fd08'), + {'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456) +``` + +You can delete by ID: + +``` python +vec.delete_by_ids([records[0]["id"]]) +``` + +Or you can delete by metadata filters: + +``` python +vec.delete_by_metadata({"action": "jump"}) +``` + +To delete all records use: + +``` python +vec.delete_all() +``` + +## Advanced usage + +This section goes into more detail about the Python interface. It covers: + +1. Search filter options - how to narrow your search by additional + constraints +2. Indexing - how to speed up your similarity queries +3. Time-based partitioning - how to optimize similarity queries that + filter on time +4. Setting different distance types to use in distance calculations + +### Search options + +The `search` function is very versatile and allows you to search for the right vector in a wide variety of ways. This section describes the search option in 3 parts: + +1. Basic similarity search. +2. How to filter your search based on the associated metadata. +3. Filtering on time when time-partitioning is enabled. 
+ +The following examples are based on this data: + +``` python +vec.upsert([\ + (uuid.uuid1(), {"animal":"fox", "action": "sit", "times":1}, "the brown fox", [1.0,1.3]),\ + (uuid.uuid1(), {"animal":"fox", "action": "jump", "times":100}, "jumped over the", [1.0,10.8]),\ +]) +``` + +The basic query looks like this: + +``` python +vec.search([1.0, 9.0]) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456], + [UUID('7487af14-84c1-11ee-98da-6ee10b77fd08'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +You could provide a limit for the number of items returned: + +``` python +vec.search([1.0, 9.0], limit=1) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +#### Narrowing your search by metadata + +There are two main ways to filter results by metadata: +- `filters` for equality matches on metadata. +- `predicates` for complex conditions on metadata. + +Filters are more limited in what they can express, but are also more performant. You should use filters if your use case allows it. + +##### Using filters for equality matches + +You could specify a match on the metadata as a dictionary where all keys +have to match the provided values (keys not in the filter are +unconstrained): + +``` python +vec.search([1.0, 9.0], limit=1, filter={"action": "sit"}) +``` + +```python +[[UUID('7487af14-84c1-11ee-98da-6ee10b77fd08'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +You can also specify a list of filter dictionaries, where an item is +returned if it matches any dict: + +``` python +vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}]) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456], + [UUID('7487af14-84c1-11ee-98da-6ee10b77fd08'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +##### Using predicates for more advanced filtering on metadata + +Predicates allow for more complex search conditions. For example, you +could use greater than and less than conditions on numeric values. + +``` python +vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("times", ">", 1)) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +`Predicates` +objects are defined by the name of the metadata key, an operator, and a value. + +The supported operators are: `==`, `!=`, `<`, `<=`, `>`, `>=` + +The type of the values determines the type of comparison to perform. For +example, passing in `"Sam"` (a string) performs a string comparison while +a `10` (an int) performs an integer comparison, and a `10.0` +(float) performs a float comparison. It is important to note that using a +value of `"10"` performs a string comparison as well so it's important to +use the right type. 
Supported Python types are: `str`, `int`, and +`float`. + +One more example with a string comparison: + +``` python +vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump")) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +The real power of predicates is that they can also be combined using the +`&` operator (for combining predicates with `AND` semantics) and `|`(for +combining using OR semantic). So you can do: + +``` python +vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", ">", 1)) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +Just for sanity, the next example shows a case where no results are returned because +of predicates: + +``` python +vec.search([1.0, 9.0], limit=2, predicates=client.Predicates("action", "==", "jump") & client.Predicates("times", "==", 1)) +``` + +```python +[] +``` + +And one more example where the predicates are defined as a variable +and use grouping with parenthesis: + +``` python +my_predicates = client.Predicates("action", "==", "jump") & (client.Predicates("times", "==", 1) | client.Predicates("times", ">", 1)) +vec.search([1.0, 9.0], limit=2, predicates=my_predicates) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +There is also semantic sugar for combining many predicates with `AND` +semantics. You can pass in multiple 3-tuples to +`Predicates`: + +``` python +vec.search([1.0, 9.0], limit=2, predicates=client.Predicates(("action", "==", "jump"), ("times", ">", 10))) +``` + +```python +[[UUID('7487af96-84c1-11ee-98da-6ee10b77fd08'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +#### Filter your search by time + +When using `time-partitioning` (see below) you can very efficiently +filter your search by time. Time-partitioning associates the timestamp embedded +in a UUID-based ID with an embedding. First, +create a collection with time partitioning and insert some data (one +item from January 2018 and another in January 2019): + +``` python +tpvec = client.Sync(service_url, "time_partitioned_table", 2, time_partition_interval=timedelta(hours=6)) +tpvec.create_tables() + +specific_datetime = datetime(2018, 1, 1, 12, 0, 0) +tpvec.upsert([\ + (client.uuid_from_time(specific_datetime), {"animal":"fox", "action": "sit", "times":1}, "the brown fox", [1.0,1.3]),\ + (client.uuid_from_time(specific_datetime+timedelta(days=365)), {"animal":"fox", "action": "jump", "times":100}, "jumped over the", [1.0,10.8]),\ +]) +``` + +Then, you can filter using the timestamps by specifying a +`uuid_time_filter`: + +``` python +tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime, specific_datetime+timedelta(days=1))) +``` + +```python +[[UUID('33c52800-ef15-11e7-be03-4f1f9a1bde5a'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. 
, 1.3], dtype=float32), + 0.14489260377438218]] +``` + +A +[`UUIDTimeRange`](https://timescale.github.io/python-vector/vector.html#uuidtimerange) +can specify a `start_date` or `end_date` or both(as in the example above). +Specifying only the `start_date` or `end_date` leaves the other end +unconstrained. + +``` python +tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime)) +``` + +```python +[[UUID('ac8be800-0de6-11e9-889a-5eec84ba8a7b'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456], + [UUID('33c52800-ef15-11e7-be03-4f1f9a1bde5a'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +You have the option to define whether the start and end dates +are inclusive with the `start_inclusive` and `end_inclusive` parameters. Setting +`start_inclusive` to true results in comparisons using the `>=` +operator, whereas setting it to false applies the `>` operator. By +default, the start date is inclusive, while the end date is exclusive. +One example: + +``` python +tpvec.search([1.0, 9.0], limit=4, uuid_time_filter=client.UUIDTimeRange(start_date=specific_datetime, start_inclusive=False)) +``` + +```python +[[UUID('ac8be800-0de6-11e9-889a-5eec84ba8a7b'), + {'times': 100, 'action': 'jump', 'animal': 'fox'}, + 'jumped over the', + array([ 1. , 10.8], dtype=float32), + 0.00016793422934946456]] +``` + +Notice how the results are different when using the +`start_inclusive=False` option because the first row has the exact +timestamp specified by `start_date`. + +It is also easy to integrate time filters using the `filter` and +`predicates` parameters described above using special reserved key names +to make it appear that the timestamps are part of your metadata. This +is useful when integrating with other systems that just want to +specify a set of filters (often these are "auto retriever" type +systems). The reserved key names are `__start_date` and `__end_date` for +filters and `__uuid_timestamp` for predicates. Some examples below: + +``` python +tpvec.search([1.0, 9.0], limit=4, filter={ "__start_date": specific_datetime, "__end_date": specific_datetime+timedelta(days=1)}) +``` + +```python +[[UUID('33c52800-ef15-11e7-be03-4f1f9a1bde5a'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +``` python +tpvec.search([1.0, 9.0], limit=4, + predicates=client.Predicates("__uuid_timestamp", ">", specific_datetime) & client.Predicates("__uuid_timestamp", "<", specific_datetime+timedelta(days=1))) +``` + +```python +[[UUID('33c52800-ef15-11e7-be03-4f1f9a1bde5a'), + {'times': 1, 'action': 'sit', 'animal': 'fox'}, + 'the brown fox', + array([1. , 1.3], dtype=float32), + 0.14489260377438218]] +``` + +### Indexing + +Indexing speeds up queries over your data. By default, the system creates indexes +to query your data by the UUID and the metadata. + +To speed up similarity search based on the embeddings, you have to +create additional indexes. + +Note that if performing a query without an index, you always get an +exact result, but the query is slow (it has to read all of the data +you store for every query). With an index, your queries are +order-of-magnitude faster, but the results are approximate (because there +are no known indexing techniques that are exact). 
+ +Luckily, {TIMESCALE_DB} provides 3 excellent approximate indexing algorithms, +StreamingDiskANN, HNSW, and ivfflat. + +Below are the trade-offs between these algorithms: + +| Algorithm | Build speed | Query speed | Need to rebuild after updates | +|------------------|-------------|-------------|-------------------------------| +| StreamingDiskAnn | Fast | Fastest | No | +| HNSW | Fast | Faster | No | +| ivfflat | Fastest | Slowest | Yes | + +You can see +[benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/) +on the blog. + +You should use the StreamingDiskANN index for most use cases. This +can be created with: + +``` python +vec.create_embedding_index(client.TimescaleVectorIndex()) +``` + +Indexes are created for a particular distance metric type. So it is +important that the same distance metric is set on the client during +index creation as it is during queries. See the `distance type` section +below. + +Each of these indexes has a set of build-time options for controlling +the speed/accuracy trade-off when creating the index and an additional +query-time option for controlling accuracy during a particular query. The +library uses smart defaults for all of these options. The +details for how to adjust these options manually are below. + +#### StreamingDiskANN index + +The StreamingDiskANN index is a graph-based algorithm that uses the +[DiskANN](https://github.com/microsoft/DiskANN) algorithm. You can read +more about it in the +[blog](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/) +announcing its release. + +To create this index, run: + +``` python +vec.create_embedding_index(client.TimescaleVectorIndex()) +``` + +The above command creates the index using smart defaults. There are +a number of parameters you could tune to adjust the accuracy/speed +trade-off. + +The parameters you can set at index build time are: + +| Parameter name | Description | Default value | +|------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `num_neighbors` | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower. | 50 | +| `search_list_size` | This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds. | 100 | +| `max_alpha` | Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds. | 1.0 | + +To set these parameters, you could run: + +``` python +vec.create_embedding_index(client.TimescaleVectorIndex(num_neighbors=50, search_list_size=100, max_alpha=1.0)) +``` + +You can also set a parameter to control the accuracy vs. query speed +trade-off at query time. The parameter is set in the `search()` function +using the `query_params` argument. You can set the +`search_list_size`(default: 100). This is the number of additional +candidates considered during the graph search at query time. Higher +values improve query accuracy while making the query slower. 
+ +You can specify this value during search as follows: + +``` python +vec.search([1.0, 9.0], limit=4, query_params=TimescaleVectorIndexParams(search_list_size=10)) +``` + +To drop the index, run: + +``` python +vec.drop_embedding_index() +``` + +#### pgvector HNSW index + +Pgvector provides a graph-based indexing algorithm based on the popular +[HNSW algorithm](https://arxiv.org/abs/1603.09320). + +To create this index, run: + +``` python +vec.create_embedding_index(client.HNSWIndex()) +``` + +The above command creates the index using smart defaults. There are +a number of parameters you could tune to adjust the accuracy/speed +trade-off. + +The parameters you can set at index build time are: + +| Parameter name | Description | Default value | +|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `m` | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size. | 16 | +| `ef_construction` | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing `ef_construction` enables more accurate search results at the expense of lengthier index build times. | 64 | + +To set these parameters, you could run: + +``` python +vec.create_embedding_index(client.HNSWIndex(m=16, ef_construction=64)) +``` + +You can also set a parameter to control the accuracy vs. query speed +trade-off at query time. The parameter is set in the `search()` function +using the `query_params` argument. You can set the `ef_search`(default: +40). This parameter specifies the size of the dynamic candidate list +used during search. Higher values improve query accuracy while making +the query slower. + +You can specify this value during search as follows: + +``` python +vec.search([1.0, 9.0], limit=4, query_params=HNSWIndexParams(ef_search=10)) +``` + +To drop the index run: + +``` python +vec.drop_embedding_index() +``` + +#### pgvector ivfflat index + +Pgvector provides a clustering-based indexing algorithm. The [blog +post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/) +describes how it works in detail. It provides the fastest +index-build speed but the slowest query speeds of any indexing +algorithm. + +To create this index, run: + +``` python +vec.create_embedding_index(client.IvfflatIndex()) +``` + +Note: *ivfflat should never be created on empty tables* because it needs +to cluster data, and that only happens when an index is first created, +not when new rows are inserted or modified. Also, if your table +undergoes a lot of modifications, you need to rebuild this index +occasionally to maintain good accuracy. See the [blog +post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work/) +for details. + +Pgvector ivfflat has a `lists` index parameter that is automatically set +with a smart default based on the number of rows in your table. 
If you know that you'll have a different table size, you can specify the number
of records to use for calculating the `lists` parameter as follows:

``` python
vec.create_embedding_index(client.IvfflatIndex(num_records=1000000))
```

You can also set the `lists` parameter directly:

``` python
vec.create_embedding_index(client.IvfflatIndex(num_lists=100))
```

You can also set a parameter to control the accuracy vs. query speed
trade-off at query time. The parameter is set in the `search()` function
using the `query_params` argument. You can set the `probes` parameter. This
parameter specifies the number of clusters searched during a query. It
is recommended to set this parameter to `sqrt(lists)`, where `lists` is the
`num_lists` parameter used above during index creation. Higher values
improve query accuracy while making the query slower.

You can specify this value during search as follows:

``` python
vec.search([1.0, 9.0], limit=4, query_params=IvfflatIndexParams(probes=10))
```

To drop the index, run:

``` python
vec.drop_embedding_index()
```

### Time partitioning

In many use cases where you have many embeddings, time is an important
component associated with the embeddings. For example, when embedding
news stories, you often search by time as well as similarity
(for example, stories related to Bitcoin in the past week or stories about
Clinton in November 2016).

Yet, traditionally, searching by two components, "similarity" and "time",
is challenging for Approximate Nearest Neighbor (ANN) indexes and makes the
similarity-search index less effective.

One approach to solving this is partitioning the data by time and
creating ANN indexes on each partition individually. Then, during search,
you can:

- Step 1: filter partitions that don't match the time predicate.
- Step 2: perform the similarity search on all matching partitions.
- Step 3: combine all the results from each partition in step 2, re-rank,
  and filter out results by time.

Step 1 makes the search a lot more efficient by filtering out whole
swaths of data in one go.

Timescale-vector supports time partitioning using {TIMESCALE_DB}'s
hypertables. To use this feature, simply indicate the length of time for
each partition when creating the client:

``` python
from datetime import timedelta
from datetime import datetime
```

``` python
vec = client.Async(service_url, "my_data_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
await vec.create_tables()
```

Then, insert data where the IDs use UUIDs v1 and the time component of
the UUID specifies the time of the embedding. For example, to create an
embedding for the current time, simply do:

``` python
id = uuid.uuid1()
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])
```

To insert data for a specific time in the past, create the UUID using the
`uuid_from_time` function:

``` python
specific_datetime = datetime(2018, 8, 10, 15, 30, 0)
await vec.upsert([(client.uuid_from_time(specific_datetime), {"key": "val"}, "the brown fox", [1.0, 1.2])])
```

You can then query the data by specifying a `uuid_time_filter` in the
search call:

``` python
rec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime-timedelta(days=7), specific_datetime+timedelta(days=7)))
```

### Distance metrics

Cosine distance is used by default to measure how similar an embedding
is to a given query.
In addition to cosine distance, Euclidean/L2 distance is
also supported. The distance type is set when creating the client
using the `distance_type` parameter. For example, to use the Euclidean
distance metric, create the client with:

``` python
vec = client.Sync(service_url, "my_data", 2, distance_type="euclidean")
```

Valid values for `distance_type` are `cosine` and `euclidean`.

It is important to use consistent distance types on
clients that create indexes and clients that perform queries, because an
index is only valid for one particular type of distance measure.

Note that the StreamingDiskANN index only supports cosine distance at
this time.
\ No newline at end of file
diff --git a/agentic-postgres/interfaces/sql-interface.mdx b/agentic-postgres/interfaces/sql-interface.mdx
new file mode 100644
index 0000000..095564a
--- /dev/null
+++ b/agentic-postgres/interfaces/sql-interface.mdx
@@ -0,0 +1,336 @@
---
title: SQL interface for pgvector and pgvectorscale
description: Use the SQL interface to work with pgvector and pgvectorscale, including installing the extensions, creating a table, querying the vector embeddings, and more
products: [cloud, mst, self_hosted]
keywords: [ai, vector, pgvector, tiger data vector, sql, pgvectorscale]
tags: [ai, vector, sql]
---

import { COMPANY, PG, TIMESCALE_DB } from '/snippets/vars.mdx';

## Installing the pgvector and pgvectorscale extensions

If they are not already installed, install the `vector` and `vectorscale` extensions on your {COMPANY} database:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale;
```

## Creating the table for storing embeddings using pgvector

Vectors inside the database are stored in regular {PG} tables using `vector` columns. The `vector` column type is provided by the pgvector extension. A common way to store vectors is alongside the data they are embedding. For example, to store embeddings for documents, a common table structure is:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    document_id BIGINT REFERENCES document(id),
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
)
```

This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the `contents` column), and the embedded vector.

You may ask: why not just add an embedding column to the document table? The answer is that there is a limit on the length of text an embedding can encode, so there needs to be a one-to-many relationship between the full document and its embeddings.

The above table is just an illustration; it is totally fine to have a table without a foreign key and/or without a metadata column. The important thing is to have a column with the data being embedded and the vector in the same row, enabling you to return the raw data for a given similarity search query.

The vector type can specify an optional number of dimensions (1,536 in the example above). If specified, it enforces the constraint that all vectors in the column have that number of dimensions. A plain `VECTOR` column (without a specified number of dimensions) is also possible and allows a variable number of dimensions.

## Query the vector embeddings

The canonical query is:

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10
```

This returns the 10 rows whose distance to the query embedding is the smallest.
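
If you also want the distance value itself, a minimal variation (shown here as a sketch; the column names match the example table above) is to repeat the operator in the `SELECT` list:

```sql
SELECT id, contents, embedding <=> $1 AS distance
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10
```
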
The distance function used here is cosine distance, specified by the `<=>` operator. Other distance functions are available; see the [discussion][distance-functions].

The available distance types and their operators are:

| Distance type | Operator |
|------------------------|---------------|
| Cosine/Angular | `<=>` |
| Euclidean | `<->` |
| Negative inner product | `<#>` |

If you are using an index, make sure that the distance function used at index creation is the same one used during the query (see below). This is important because if you create your index with one distance function but query with another, your index cannot be used to speed up the query.

## Indexing the vector data using indexes provided by pgvector and pgvectorscale

Indexing helps speed up similarity queries of the basic form:

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10
```

The key part is that the `ORDER BY` contains a distance measure against a constant or a pseudo-constant.

Note that when performing a query without an index, you always get an exact result, but the query is slow (it has to read all of the data you store for every query). With an index, your queries are an order of magnitude faster, but the results are approximate, because there are no known indexing techniques that are exact; see [here for more][vector-search-indexing].

Nevertheless, there are excellent approximate algorithms. There are three different indexing algorithms available on {TIMESCALE_DB}: StreamingDiskANN, HNSW, and ivfflat. The table below summarizes the trade-offs between these algorithms:

| Algorithm | Build Speed | Query Speed | Need to rebuild after updates |
|------------------|-------------|-------------|-------------------------------|
| StreamingDiskANN | Fast | Fastest | No |
| HNSW | Fast | Fast | No |
| ivfflat | Fastest | Slowest | Yes |

You can see the [benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/) in the blog.

For most use cases, the StreamingDiskANN index is recommended.

Each of these indexes has a set of build-time options for controlling the speed/accuracy trade-off when creating the index and an additional query-time option for controlling accuracy during a particular query.

You can see the details of each index below.

### StreamingDiskANN index

The StreamingDiskANN index is a graph-based algorithm inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm.
You can read more about it in
[How We Made {PG} as Fast as Pinecone for Vector Data](https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data).

To create an index with the cosine distance metric on the `embedding` vector column of the `document_embedding` table, run:
```sql
CREATE INDEX document_embedding_cos_idx ON document_embedding
USING diskann (embedding vector_cosine_ops);
```

Since this index uses cosine distance, you should use the `<=>` operator in your queries. StreamingDiskANN also supports L2 distance:
```sql
CREATE INDEX document_embedding_l2_idx ON document_embedding
USING diskann (embedding vector_l2_ops);
```
For L2 distance, use the `<->` operator in queries.

These examples create the index with smart defaults for all parameters not listed. These should be the right values for most cases. But if you want to delve deeper, the available parameters are described below.
+ +#### StreamingDiskANN index build-time parameters + +These parameters can be set when an index is created. + +| Parameter name | Description | Default value | +|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `storage_layout` | `memory_optimized` which uses SBQ to compress vector data or `plain` which stores data uncompressed | memory_optimized +| `num_neighbors` | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower. | 50 | +| `search_list_size` | This is the S parameter used in the greedy search algorithm used during construction. Higher values improve graph quality at the cost of slower index builds. | 100 | +| `max_alpha` | Is the alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds. | 1.2 | +| `num_dimensions` | The number of dimensions to index. By default, all dimensions are indexed. But you can also index less dimensions to make use of [Matryoshka embeddings](https://huggingface.co/blog/matryoshka) | 0 (all dimensions) +| `num_bits_per_dimension` | Number of bits used to encode each dimension when using SBQ | 2 for less than 900 dimensions, 1 otherwise + +An example of how to set the `num_neighbors` parameter is: + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding) WITH(num_neighbors=50); +``` + +#### StreamingDiskANN query-time parameters + +You can also set two parameters to control the accuracy vs. query speed trade-off at query time. We suggest adjusting `diskann.query_rescore` to fine-tune accuracy. + +| Parameter name | Description | Default value | +|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `diskann.query_search_list_size` | The number of additional candidates considered during the graph search. | 100 +| `diskann.query_rescore` | The number of elements rescored (0 to disable rescoring) | 50 + +You can set the value by using `SET` before executing a query. For example: + +```sql +SET diskann.query_rescore = 400; +``` + +Note the [SET command](https://www.postgresql.org/docs/current/sql-set.html) applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using `LOCAL` which will +be reset after the end of the transaction: + +```sql +BEGIN; +SET LOCAL diskann.query_search_list_size= 10; +SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10 +COMMIT; +``` + +#### StreamingDiskANN index-supported queries + +You need to use the cosine-distance embedding measure (`<=>`) in your `ORDER BY` clause. A canonical query would be: + +```sql +SELECT * +FROM document_embedding +ORDER BY embedding <=> $1 +LIMIT 10 +``` + +### pgvector HNSW + +Pgvector provides a graph-based indexing algorithm based on the popular [HNSW algorithm](https://arxiv.org/abs/1603.09320). + +To create an index named `document_embedding_idx` on table `document_embedding` having a vector column named `embedding`, run: +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING hnsw(embedding vector_cosine_ops); +``` + +This command creates an index for cosine-distance queries because of `vector_cosine_ops`. 
There are also "ops" classes for Euclidean distance and negative inner product: + +| Distance type | Query operator | Index ops class | +|------------------------|----------------|-------------------| +| Cosine / Angular | `<=>` | `vector_cosine_ops` | +| Euclidean / L2 | `<->` | `vector_ip_ops` | +| Negative inner product | `<#>` | `vector_l2_ops` | + +Pgvector HNSW also includes several index build-time and query-time parameters. + +#### pgvector HNSW index build-time parameters + +These parameters can be set at index build time: + +| Parameter name | Description | Default value | +|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------| +| `m` | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size. | 16 | +| `ef_construction` | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing `ef_construction` enables more accurate search results at the expense of lengthier index build times. | 64 | + +An example of how to set the m parameter is: + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING hnsw(embedding vector_cosine_ops) WITH (m = 20); +``` + +#### pgvector HNSW query-time parameters + +You can also set a parameter to control the accuracy vs. query speed trade-off at query time. The parameter is called `hnsw.ef_search`. This parameter specifies the size of the dynamic candidate list used during search. Defaults to 40. Higher values improve query accuracy while making the query slower. + +You can set the value by running: + +```sql +SET hnsw.ef_search = 100; +``` + +Before executing the query, note the [SET command](https://www.postgresql.org/docs/current/sql-set.html) applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using `LOCAL`: + +```sql +BEGIN; +SET LOCAL hnsw.ef_search = 100; +SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10 +COMMIT; +``` + +#### pgvector HNSW index-supported queries + +You need to use the distance operator (`<=>`, `<->`, or `<#>`) matching the ops class you used during index creation in your `ORDER BY` clause. A canonical query would be: + +```sql +SELECT * +FROM document_embedding +ORDER BY embedding <=> $1 +LIMIT 10 +``` + +### pgvector ivfflat + +Pgvector provides a clustering-based indexing algorithm. The [blog post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work) describes how it works in detail. It provides the fastest index-build speed but the slowest query speeds of any indexing algorithm. + +To create an index named `document_embedding_idx` on table `document_embedding` having a vector column named `embedding`, run: +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100); +``` + +This command creates an index for cosine-distance queries because of `vector_cosine_ops`. 
There are also "ops" classes for Euclidean distance and negative inner product: + +| Distance type | Query operator | Index ops class | +|------------------------|----------------|-------------------| +| Cosine / Angular | `<=>` | `vector_cosine_ops` | +| Euclidean / L2 | `<->` | `vector_ip_ops` | +| Negative inner product | `<#>` | `vector_l2_ops` | + +Note: *ivfflat should never be created on empty tables* because it needs to cluster data, and that only happens when an index is first created, not when new rows are inserted or modified. Also, if your table undergoes a lot of modifications, you need to rebuild this index occasionally to maintain good accuracy. See the [blog post](https://www.timescale.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work) for details. + +Pgvector ivfflat has a `lists` index parameter that should be set. See the next section. + +#### pgvector ivfflat index build-time parameters + +Pgvector has a `lists` parameter that should be set as follows: +For datasets with less than one million rows, use lists = rows / 1000. +For datasets with more than one million rows, use lists = sqrt(rows). +It is generally advisable to have at least 10 clusters. + + +You can use the following code to simplify creating ivfflat indexes: +```python +def create_ivfflat_index(conn, table_name, column_name, query_operator="<=>"): + index_method = "invalid" + if query_operator == "<->": + index_method = "vector_l2_ops" + elif query_operator == "<#>": + index_method = "vector_ip_ops" + elif query_operator == "<=>": + index_method = "vector_cosine_ops" + else: + raise ValueError(f"unrecognized operator {query_operator}") + + with conn.cursor() as cur: + cur.execute(f"SELECT COUNT(*) as cnt FROM {table_name};") + num_records = cur.fetchone()[0] + + num_lists = num_records / 1000 + if num_lists < 10: + num_lists = 10 + if num_records > 1000000: + num_lists = math.sqrt(num_records) + + cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});') + conn.commit() +``` + + +#### pgvector ivfflat query-time parameters + +You can also set a parameter to control the accuracy vs. query speed tradeoff at query time. The parameter is called `ivfflat.probes`. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to `sqrt(lists)` where lists is the parameter used above during index creation. Higher values improve query accuracy while making the query slower. + +You can set the value by running: + +```sql +SET ivfflat.probes = 100; +``` + +Before executing the query, note the [SET command](https://www.postgresql.org/docs/current/sql-set.html) applies to the entire session (database connection) from the point of execution. You can use a transaction-local variant using `LOCAL`: + +```sql +BEGIN; +SET LOCAL ivfflat.probes = 100; +SELECT * FROM document_embedding ORDER BY embedding <=> $1 LIMIT 10 +COMMIT; +``` + + +#### pgvector ivfflat index-supported queries + +You need to use the distance operator (`<=>`, `<->`, or `<#>`) matching the ops class you used during index creation in your `ORDER BY` clause. 
A canonical query would be:

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10
```

[distance-functions]: /agentic-postgres/key-vector-database-concepts#vector-distance-types
[vector-search-indexing]: /agentic-postgres/key-vector-database-concepts#vector-search-indexing-approximate-nearest-neighbor-search
diff --git a/agentic-postgres/key-vector-database-concepts.mdx b/agentic-postgres/key-vector-database-concepts.mdx
new file mode 100644
index 0000000..a7ff331
--- /dev/null
+++ b/agentic-postgres/key-vector-database-concepts.mdx
@@ -0,0 +1,102 @@
---
title: Key vector database concepts for pgvector
description: Learn the most important vector database concepts to understand AI in Postgres
products: [cloud, mst, self_hosted]
keywords: [ai, vector, pgvector, pgvectorscale, pgai]
tags: [ai, vector]
---

import { PG, CLOUD_LONG, COMPANY } from '/snippets/vars.mdx';

Vectors inside the database are stored in regular {PG} tables using `vector` columns. The `vector` column type
is provided by the [pgvector](https://github.com/pgvector/pgvector) extension. A common way to store vectors is
alongside the data they embed. For example, to store embeddings for documents, a common table structure is:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    document_id BIGINT REFERENCES document(id),
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
)
```

This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the `contents` column), and the embedded vector.

This may seem like a bit of a weird design: why aren't the embeddings simply a separate column in the document table? The answer has to do with the context length limits of embedding models and of LLMs. When embedding data, there is a limit to the length of content you can embed (for example, OpenAI's ada-002 has a limit of [8191 tokens](https://platform.openai.com/docs/guides/embeddings/embedding-models)), and so, if you are embedding a long piece of text, you have to break it up into smaller chunks and embed each chunk individually. Therefore, at the database layer, there is usually a one-to-many relationship between the thing being embedded and the embeddings, which is represented by a foreign key from the embedding to the thing.

Of course, if you do not want to store the original data in the database and you are storing only the embeddings, that's totally fine too. Just omit the foreign key from the table. Another popular alternative is to put the foreign key into the metadata JSONB.

## Querying vectors using pgvector

The canonical query finds the stored vectors closest to an embedding of the user's query. This is also known as finding the [K nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

In the example query below, `$1` is a parameter taking a query embedding, and the `<=>` operator calculates the distance between the query embedding and the embedding vectors stored in the database (and returns a float value).

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10
```

The query above returns the 10 rows with the smallest distance between the query's embedding and the row's embedding. Of course, this being {PG}, you can add additional `WHERE` clauses (such as filters on the metadata), joins, etc.
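
As a sketch of what such a combined query can look like, the example below filters on a hypothetical `category` key in the `metadata` column before ranking by distance; the key name is an assumption used only for illustration:

```sql
-- Return the 10 closest embeddings among documents tagged as 'news'
SELECT id, contents, embedding <=> $1 AS distance
FROM document_embedding
WHERE metadata->>'category' = 'news'
ORDER BY embedding <=> $1
LIMIT 10
```
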
### Vector distance types

The query shown above uses cosine distance (via the `<=>` operator) as a measure of how similar two embeddings are. But there are multiple ways to quantify how far apart two vectors are from each other.

In practice, the choice of distance measure doesn't matter much, and it is recommended to stick with cosine distance for most applications.

#### Description of cosine distance, negative inner product, and Euclidean distance

Here's a succinct description of three common vector distance measures:

- **Cosine distance a.k.a. angular distance**: This is based on the cosine of the angle between two vectors. The underlying cosine similarity is not a true "distance" in the mathematical sense but a similarity measure, where a smaller angle corresponds to a higher similarity. It is particularly useful in high-dimensional spaces where the magnitude of the vectors (their length) is less important, such as in text analysis or information retrieval. Cosine similarity ranges from -1 (meaning exactly opposite) to 1 (exactly the same), with 0 typically indicating orthogonality (no similarity); the cosine distance is 1 minus this similarity. See here for more on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

- **Negative inner product**: This is simply the negative of the inner product (also known as the dot product) of two vectors. The inner product measures vector similarity based on the vectors' magnitudes and the cosine of the angle between them. A higher inner product indicates greater similarity. However, it's important to note that, unlike cosine similarity, the magnitude of the vectors influences the inner product.

- **Euclidean distance**: This is the "ordinary" straight-line distance between two points in Euclidean space. In terms of vectors, it's the square root of the sum of the squared differences between corresponding elements of the vectors. This measure is sensitive to the magnitude of the vectors and is widely used in various fields such as clustering and nearest neighbor search.

Many embedding systems (for example, OpenAI's ada-002) use vectors with length 1 (unit vectors). For those systems, the rankings (ordering) of all three measures are the same. In particular:
- The cosine distance is `1−dot product`.
- The negative inner product is `−dot product`.
- The Euclidean distance is related to the dot product, where the squared Euclidean distance is `2(1−dot product)`.

#### Recommended vector distance for use in Postgres

Using cosine distance, especially on unit vectors, is recommended. This recommendation is based on OpenAI's [recommendation](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use) as well as the fact that the ranking of different distances on unit vectors is preserved.

## Vector search indexing (approximate nearest neighbor search)

In {PG} and other relational databases, indexing is a way to speed up queries. For vector data, indexes speed up the similarity search query shown above, where you find the most similar embedding to some given query embedding. This problem is often referred to as finding the [K nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

The term "index" in the context of vector databases has multiple meanings. It can refer to both the storage mechanism for your data and the tool that enhances query efficiency. These docs use the latter meaning.

Finding the K nearest neighbors is not a new problem in {PG}, but existing techniques only work with low-dimensional data. These approaches cease to be effective when dealing with data of more than approximately 10 dimensions due to the "curse of dimensionality." Given that embeddings often consist of more than a thousand dimensions (OpenAI's are 1,536), new techniques had to be developed.

There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent approximate algorithms that fall into the category of approximate nearest neighbor algorithms.

There are three different indexing algorithms available as part of pgai on {CLOUD_LONG}: StreamingDiskANN, HNSW, and ivfflat. The table below illustrates the high-level differences between these algorithms:

| Algorithm | Build Speed | Query Speed | Need to rebuild after updates |
|------------------|-------------|-------------|-------------------------------|
| StreamingDiskANN | Fast | Fastest | No |
| HNSW | Fast | Fast | No |
| ivfflat | Fastest | Slowest | Yes |

See the [performance benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database) for details on how each index performs on a dataset of 1 million OpenAI embeddings.

## Recommended index types

For most applications, the StreamingDiskANN index is recommended.
diff --git a/agentic-postgres/pgai/pgai.mdx b/agentic-postgres/pgai/pgai.mdx
new file mode 100644
index 0000000..f10f0df
--- /dev/null
+++ b/agentic-postgres/pgai/pgai.mdx
@@ -0,0 +1,381 @@
---
title: Retrieval for RAG and Agentic apps
description: Power your RAG and Agentic applications with PostgreSQL
products: [cloud, mst, self_hosted]
keywords: [ai, vector, pgai, embeddings, RAG]
---

import { CLOUD_LONG, PGAI_SHORT, PGVECTORSCALE } from '/snippets/vars.mdx';

A Python library that transforms PostgreSQL into a robust, production-ready retrieval engine for RAG and Agentic applications.

- **Automatically create and synchronize vector embeddings** from PostgreSQL data and S3 documents. Embeddings update automatically as data changes.

- **[Semantic Catalog](/agentic-postgres/pgai/semantic-catalog): Enable natural language to SQL with AI**. Automatically generate database descriptions and power text-to-SQL for agentic applications.

- Powerful vector and semantic search with pgvector and {PGVECTORSCALE}.

- **Production-ready out-of-the-box**: Supports batch processing for efficient embedding generation, with built-in handling for model failures, rate limits, and latency spikes.

- Works with any PostgreSQL database, including {CLOUD_LONG}, Amazon RDS, Supabase, and more.

## Features

Our {PGAI_SHORT} Python library lets you work with embeddings generated from your data:

* Automatically create and sync vector embeddings for your data using the [vectorizer](/agentic-postgres/pgai/vectorizer-overview).
* [Load data](/agentic-postgres/pgai/vectorizer-api-reference#loading-configuration) from a column in your table or from a file, S3 bucket, etc.
* Create multiple embeddings for the same data with different models and parameters for testing and experimentation.
* [Customize](#a-configurable-vectorizer-pipeline) how your embedding pipeline parses, chunks, formats, and embeds your data.

You can use the vector embeddings to:
- [Perform semantic search](/agentic-postgres/pgai/vectorizer-overview#query-an-embedding) using pgvector.
+- Implement Retrieval Augmented Generation (RAG) +- Perform high-performance, cost-efficient ANN search on large vector workloads with [{PGVECTORSCALE}](https://github.com/timescale/pgvectorscale), which complements pgvector. + +**Text-to-SQL with Semantic Catalog:** Transform natural language into accurate SQL queries. The semantic catalog generates database descriptions automatically, lets a human in the loop review and improve the descriptions and stores SQL examples and business facts. This enables LLMs to understand your schema and data context. See the [semantic catalog](/agentic-postgres/pgai/semantic-catalog) for more details. + +We also offer a [PostgreSQL extension](https://github.com/timescale/pgai/tree/main/projects/extension) that can perform LLM model calling directly from SQL. This is often useful for use cases like classification, summarization, and data enrichment on your existing data. + +### A configurable vectorizer pipeline + +The vectorizer is designed to be flexible and customizable. Each vectorizer defines a pipeline for creating embeddings from your data. The pipeline is defined by a series of components that are applied in sequence to the data: + +- **[Loading](/agentic-postgres/pgai/vectorizer-api-reference#loading-configuration):** First, you define the source of the data to embed. It can be the data stored directly in a column of the source table or a URI referenced in a column of the source table that points to a file, s3 bucket, etc. +- **[Parsing](/agentic-postgres/pgai/vectorizer-api-reference#parsing-configuration):** Then, you define the way the data is parsed if it is a non-text document such as a PDF, HTML, or markdown file. +- **[Chunking](/agentic-postgres/pgai/vectorizer-api-reference#chunking-configuration):** Next, you define the way text data is split into chunks. +- **[Formatting](/agentic-postgres/pgai/vectorizer-api-reference#formatting-configuration):** Then, for each chunk, you define the way the data is formatted before it is sent for embedding. For example, you can add the title of the document as the first line of the chunk. +- **[Embedding](/agentic-postgres/pgai/vectorizer-api-reference#embedding-configuration):** Finally, you specify the LLM provider, model, and the parameters to be used when generating the embeddings. + +### Supported embedding models + +The following models are supported for embedding: + +- [Ollama](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_ollama) +- [OpenAI](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_openai) +- [Voyage AI](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_voyageai) +- [Cohere](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) +- [Huggingface](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) +- [Mistral](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) +- [Azure OpenAI](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) +- [AWS Bedrock](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) +- [Vertex AI](/agentic-postgres/pgai/vectorizer-api-reference#aiembedding_litellm) + +### Error handling + +Simply creating vector embeddings is easy and straightforward. The challenge is +that LLMs are somewhat unreliable and the endpoints exhibit intermittent +failures and/or degraded performance. A critical part of properly handling +failures is that your primary data-modification operations (INSERT, UPDATE, +DELETE) should not be dependent on the embedding operation. 
Otherwise, your +application will be down every time the endpoint is slow or fails and your user +experience will suffer. + +Normally, you would need to implement a custom MLops pipeline to properly handle +endpoint failures. This commonly involves queuing system like Kafka, specialized +workers, and other infrastructure for handling the queue and retrying failed +requests. This is a lot of work and it is easy to get wrong. + +With {PGAI_SHORT}, you can skip all that and focus on building your application because +the vectorizer is managing the embeddings for you. We have built in queueing and +retry logic to handle the various failure modes you can encounter. Because we do +this work in the background, the primary data modification operations are not +dependent on the embedding operation. This is why {PGAI_SHORT} is production-ready out of the box. + +Many specialized vector databases create embeddings for you. However, they typically fail when embedding endpoints are down or degraded, placing the burden of error handling and retries back on you. + +## Architecture + +The system consists of an application you write, a PostgreSQL database, and stateless vectorizer workers. The application defines a vectorizer configuration to embed data from sources like PostgreSQL or S3. The workers read this configuration, processes the data queue into embeddings and chunked text, and writes the results back. The application then queries this data to power RAG and semantic search. + +The key strength of this architecture lies in its resilience: data modifications made by the application are decoupled from the embedding process, ensuring that failures in the embedding service do not affect the core data operations. + + + Pgai Architecture: application, database, vectorizer worker + + +## Install + +First, install the {PGAI_SHORT} package. + +```bash +pip install pgai +``` + +Then, install the {PGAI_SHORT} database components. You can do this from the terminal using the CLI or in your Python application code using the {PGAI_SHORT} python package. + +```bash +# from the cli +pgai install -d +``` + +```python +# or from the python package, often done as part of your application setup +import pgai +pgai.install(DB_URL) +``` + +If you are not on {CLOUD_LONG} you will also need to run the {PGAI_SHORT} vectorizer worker. Install the dependencies for it via: + +```bash +pip install "pgai[vectorizer-worker]" +``` + +If you are using the [semantic catalog](/agentic-postgres/pgai/semantic-catalog), you will need to run: + +```bash +pip install "pgai[semantic-catalog]" +``` + +## Quick Start + +This quickstart demonstrates how {PGAI_SHORT} Vectorizer enables semantic search and RAG over PostgreSQL data by automatically creating and synchronizing embeddings as data changes. + +**Looking for text-to-SQL?** Check out the [Semantic Catalog quickstart](/agentic-postgres/pgai/semantic-catalog) to transform natural language questions into SQL queries. + +The key "secret sauce" of {PGAI_SHORT} Vectorizer is its declarative approach to +embedding generation. Simply define your pipeline and let Vectorizer handle the +operational complexity of keeping embeddings in sync, even when embedding +endpoints are unreliable. 
You can define a simple version of the pipeline as +follows: + +```sql +CREATE TABLE IF NOT EXISTS wiki ( + id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY, + url TEXT NOT NULL, + title TEXT NOT NULL, + text TEXT NOT NULL +) + +SELECT ai.create_vectorizer( + 'wiki'::regclass, + loading => ai.loading_column(column_name=>'text'), + destination => ai.destination_table(target_table=>'wiki_embedding_storage'), + embedding => ai.embedding_openai(model=>'text-embedding-ada-002', dimensions=>'1536') + ) +``` + +The vectorizer will automatically create embeddings for all the rows in the +`wiki` table, and, more importantly, will keep the embeddings synced with the +underlying data as it changes. **Think of it almost like declaring an index** on +the `wiki` table, but instead of the database managing the index datastructure +for you, the Vectorizer is managing the embeddings. + +## Running the quick start + +**Prerequisites:** +- A PostgreSQL database (see [Docker installation](https://docs.timescale.com/self-hosted/latest/install/installation-docker/)). +- An OpenAI API key (we use openai for embedding in the quick start, but you can use [multiple providers](#supported-embedding-models)). + +Create a `.env` file with the following: + +``` +OPENAI_API_KEY= +DB_URL= +``` + +You can download the full [python code](https://github.com/timescale/pgai/blob/main/examples/quickstart/main.py) and [requirements.txt](https://github.com/timescale/pgai/blob/main/examples/quickstart/requirements.txt) from the quickstart example and run it in the same directory as the `.env` file. + + + +```bash +curl -O https://raw.githubusercontent.com/timescale/pgai/main/examples/quickstart/main.py +curl -O https://raw.githubusercontent.com/timescale/pgai/main/examples/quickstart/requirements.txt +python -m venv venv +source venv/bin/activate +pip install -r requirements.txt +python main.py +``` + + +Sample output: + + + +``` +Search results 1: +[WikiSearchResult(id=7, + url='https://en.wikipedia.org/wiki/Aristotle', + title='Aristotle', + text='Aristotle (; Aristotélēs, ; 384–322\xa0BC) was an ' + 'Ancient Greek philosopher and polymath. His writings ' + 'cover a broad range of subjects spanning the natural ' + 'sciences, philosophy, linguistics, economics, ' + 'politics, psychology and the arts. As the founder of ' + 'the Peripatetic school of philosophy in the Lyceum in ' + 'Athens, he began the wider Aristotelian tradition that ' + 'followed, which set the groundwork for the development ' + 'of modern science.\n' + '\n' + "Little is known about Aristotle's life. He was born in " + 'the city of Stagira in northern Greece during the ' + 'Classical period. His father, Nicomachus, died when ' + 'Aristotle was a child, and he was brought up by a ' + "guardian. At 17 or 18 he joined Plato's Academy in " + 'Athens and remained there till the age of 37 (). ' + 'Shortly after Plato died, Aristotle left Athens and, ' + 'at the request of Philip II of Macedon, tutored his ' + 'son Alexander the Great beginning in 343 BC. He ' + 'established a library in the Lyceum which helped him ' + 'to produce many of his hundreds of books on papyru', + chunk='Aristotle (; Aristotélēs, ; 384–322\xa0BC) was an ' + 'Ancient Greek philosopher and polymath. His writings ' + 'cover a broad range of subjects spanning the natural ' + 'sciences, philosophy, linguistics, economics, ' + 'politics, psychology and the arts. 
As the founder of ' + 'the Peripatetic school of philosophy in the Lyceum in ' + 'Athens, he began the wider Aristotelian tradition ' + 'that followed, which set the groundwork for the ' + 'development of modern science.', + distance=0.22242502364217387)] +Search results 2: +[WikiSearchResult(id=41, + url='https://en.wikipedia.org/wiki/pgai', + title='pgai', + text='pgai is a Python library that turns PostgreSQL into ' + 'the retrieval engine behind robust, production-ready ' + 'RAG and Agentic applications. It does this by ' + 'automatically creating vector embeddings for your data ' + 'based on the vectorizer you define.', + chunk='pgai is a Python library that turns PostgreSQL into ' + 'the retrieval engine behind robust, production-ready ' + 'RAG and Agentic applications. It does this by ' + 'automatically creating vector embeddings for your ' + 'data based on the vectorizer you define.', + distance=0.13639101792546204)] +RAG response: +The main thing pgai does right now is generating vector embeddings for data in PostgreSQL databases based on the vectorizer defined by the user, enabling the creation of robust RAG and Agentic applications. +``` + + +## Code walkthrough + +### Install the pgai database components + +{PGAI_SHORT} requires a few catalog tables and functions to be installed into the database. This is done using the `pgai.install` function, which will install the necessary components into the `ai` schema of the database. + +```python +pgai.install(DB_URL) +``` + +### Create the vectorizer + +This defines the vectorizer, which tells the system how to create the embeddings from the `text` column in the `wiki` table. The vectorizer creates a view `wiki_embedding` that we can query for the embeddings (as we'll see below). + +```python +async def create_vectorizer(conn: psycopg.AsyncConnection): + async with conn.cursor() as cur: + await cur.execute(""" + SELECT ai.create_vectorizer( + 'wiki'::regclass, + if_not_exists => true, + loading => ai.loading_column(column_name=>'text'), + embedding => ai.embedding_openai(model=>'text-embedding-ada-002', dimensions=>'1536'), + destination => ai.destination_table(view_name=>'wiki_embedding') + ) + """) + await conn.commit() +``` + +### Run the vectorizer worker + +In this example, we run the vectorizer worker once to create the embeddings for the existing data. + +```python +worker = Worker(DB_URL, once=True) +worker.run() +``` + +In a real application, we would not call the worker manually like this every time we want to create the embeddings. Instead, we would run the worker in the background and it would run continuously, polling for work from the vectorizer. + +You can run the worker in the background from the application, the cli, or docker. See the [vectorizer worker](/agentic-postgres/pgai/vectorizer-worker) documentation for more details. + +### Search the wiki articles using semantic search + +This is standard pgvector semantic search in PostgreSQL. The search is performed against the `wiki_embedding` view, which is created by the vectorizer and includes all the columns from the `wiki` table plus the `embedding` column and the chunk text. This function returns both the entire `text` column from the `wiki` table and smaller chunks of the text that are most relevant to the query. 

```python
@dataclass
class WikiSearchResult:
    id: int
    url: str
    title: str
    text: str
    chunk: str
    distance: float

async def _find_relevant_chunks(client: AsyncOpenAI, query: str, limit: int = 1) -> List[WikiSearchResult]:
    # Generate embedding for the query using OpenAI's API
    response = await client.embeddings.create(
        model="text-embedding-ada-002",
        input=query,
        encoding_format="float",
    )

    embedding = np.array(response.data[0].embedding)

    # Query the database for the most similar chunks using pgvector's cosine distance operator (<=>)
    async with pool.connection() as conn:
        async with conn.cursor(row_factory=class_row(WikiSearchResult)) as cur:
            await cur.execute("""
                SELECT w.id, w.url, w.title, w.text, w.chunk, w.embedding <=> %s as distance
                FROM wiki_embedding w
                ORDER BY distance
                LIMIT %s
            """, (embedding, limit))

            return await cur.fetchall()
```

### Insert a new article into the wiki table

This code is notable for what it is not doing. It is a simple insert of a new article into the `wiki` table. We did not need to do anything different to create the embeddings; the vectorizer worker takes care of updating the embeddings as the data changes.

```python
async def insert_article_about_pgai(conn: psycopg.AsyncConnection):
    async with conn.cursor(row_factory=class_row(WikiSearchResult)) as cur:
        await cur.execute("""
            INSERT INTO wiki (url, title, text) VALUES
            ('https://en.wikipedia.org/wiki/pgai', 'pgai', 'pgai is a Python library that turns PostgreSQL into the retrieval engine behind robust, production-ready RAG and Agentic applications. It does this by automatically creating vector embeddings for your data based on the vectorizer you define.')
        """)
        await conn.commit()
```

### Perform RAG with the LLM

This code performs RAG with the LLM. It uses the `_find_relevant_chunks` function defined above to find the most relevant chunks of text from the `wiki` table and then uses the LLM to generate a response.

```python
    query = "What is the main thing pgai does right now?"
    relevant_chunks = await _find_relevant_chunks(client, query)
    context = "\n\n".join(
        f"{chunk.title}:\n{chunk.text}"
        for chunk in relevant_chunks
    )
    prompt = f"""Question: {query}

Please use the following context to provide an accurate response:

{context}

Answer:"""

    # Ask the chat model to answer using only the retrieved context
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print("RAG response:")
    print(response.choices[0].message.content)
```

## Next steps

### More RAG and Vectorization Examples
- [FastAPI + psycopg quickstart](https://github.com/timescale/pgai/tree/main/examples/simple_fastapi_app)
- [Vectorizer overview](/agentic-postgres/pgai/vectorizer-overview)
- [Vectorizer worker documentation](/agentic-postgres/pgai/vectorizer-worker)
- [Vectorizer API reference](/agentic-postgres/pgai/vectorizer-api-reference)

### Text-to-SQL with Semantic Catalog
- **[Semantic Catalog Quickstart](/agentic-postgres/pgai/semantic-catalog)** - Learn how to use the semantic catalog to translate natural language to SQL for agentic applications.
+ diff --git a/manage-data/pgai/vectorizer-automate-ai-embeddings.mdx b/agentic-postgres/pgai/vectorizer-automate-ai-embeddings.mdx similarity index 99% rename from manage-data/pgai/vectorizer-automate-ai-embeddings.mdx rename to agentic-postgres/pgai/vectorizer-automate-ai-embeddings.mdx index 3a451a2..c1faf0d 100644 --- a/manage-data/pgai/vectorizer-automate-ai-embeddings.mdx +++ b/agentic-postgres/pgai/vectorizer-automate-ai-embeddings.mdx @@ -1,9 +1,8 @@ --- -name: Automate AI embedding with pgai Vectorizer +title: Automate AI embeddings description: Create a table or a hypertable --- - Vector embeddings have emerged as a powerful tool for transforming text into compact, semantically rich representations. This approach unlocks the potential for more nuanced and context-aware searches, surpassing traditional diff --git a/manage-data/pgvectorscale/pgvectorscale-get-started.mdx b/agentic-postgres/pgvectorscale/pgvectorscale-get-started.mdx similarity index 85% rename from manage-data/pgvectorscale/pgvectorscale-get-started.mdx rename to agentic-postgres/pgvectorscale/pgvectorscale-get-started.mdx index ddccd52..8199611 100644 --- a/manage-data/pgvectorscale/pgvectorscale-get-started.mdx +++ b/agentic-postgres/pgvectorscale/pgvectorscale-get-started.mdx @@ -1,17 +1,19 @@ --- -title: Get started with pgvectorscale +title: Improve database performance description: Improve database performance with hypertables, time bucketing, compression and continuous aggregates. products: [cloud, self_hosted, mst] content_group: Getting started --- -pgvectorscale complements [pgvector][pgvector], the open-source vector data extension for PostgreSQL, and introduces the following key innovations for pgvector data: +import { PG, PGVECTORSCALE, TIMESCALE_DB, CLOUD_LONG } from '/snippets/vars.mdx'; + +{PGVECTORSCALE} complements [pgvector][pgvector], the open-source vector data extension for {PG}, and introduces the following key innovations for pgvector data: - A new index type called StreamingDiskANN, inspired by the [DiskANN](https://github.com/microsoft/DiskANN) algorithm, based on research from Microsoft. - Statistical Binary Quantization: developed by Timescale researchers, This compression method improves on standard Binary Quantization. - Label-based filtered vector search: based on Microsoft's Filtered DiskANN research, this allows you to combine vector similarity search with label filtering for more precise and efficient results. On a benchmark dataset of 50 million Cohere embeddings with 768 dimensions -each, PostgreSQL with `pgvector` and `pgvectorscale` achieves **28x lower p95 +each, {PG} with `pgvector` and `pgvectorscale` achieves **28x lower p95 latency** and **16x higher query throughput** compared to Pinecone's storage optimized (s1) index for approximate nearest neighbor queries at 99% recall, all at 75% less cost when self-hosted on AWS EC2. @@ -23,20 +25,20 @@ all at 75% less cost when self-hosted on AWS EC2. To learn more about the performance impact of pgvectorscale, and details about benchmark methodology and results, see the [pgvector vs Pinecone comparison blog post](http://www.timescale.com/blog/pgvector-vs-pinecone). -In contrast to pgvector, which is written in C, pgvectorscale is developed in [Rust][rust-language] using the [PGRX framework](https://github.com/pgcentralfoundation/pgrx), -offering the PostgreSQL community a new avenue for contributing to vector support. 
+In contrast to pgvector, which is written in C, {PGVECTORSCALE} is developed in [Rust][rust-language] using the [PGRX framework](https://github.com/pgcentralfoundation/pgrx), +offering the {PG} community a new avenue for contributing to vector support. -**Application developers or DBAs** can use pgvectorscale with their PostgreSQL databases. +**Application developers or DBAs** can use {PGVECTORSCALE} with their {PG} databases. * [Install pgvectorscale](#installation) * [Get started using pgvectorscale](#get-started-with-pgvectorscale) If you **want to contribute** to this extension, see how to [build pgvectorscale from source in a developer environment](./DEVELOPMENT.md) and our [testing guide](./TESTING.md). -For production vector workloads, get **private beta access to vector-optimized databases** with pgvector and pgvectorscale on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ). +For production vector workloads, get **private beta access to vector-optimized databases** with pgvector and {PGVECTORSCALE} on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ). ## Installation -The fastest ways to run PostgreSQL with pgvectorscale are: +The fastest ways to run {PG} with {PGVECTORSCALE} are: * [Using a pre-built Docker container](#using-a-pre-built-docker-container) * [Installing from source](#installing-from-source) @@ -44,7 +46,7 @@ The fastest ways to run PostgreSQL with pgvectorscale are: ### Using a pre-built Docker container -1. [Run the TimescaleDB Docker image](https://docs.timescale.com/self-hosted/latest/install/installation-docker/). +1. [Run the {TIMESCALE_DB} Docker image](https://docs.timescale.com/self-hosted/latest/install/installation-docker/). 1. Connect to your database: ```bash @@ -61,7 +63,7 @@ The `CASCADE` automatically installs `pgvector`. ### Installing from source -You can install pgvectorscale from source and install it in an existing PostgreSQL server +You can install {PGVECTORSCALE} from source and install it in an existing {PG} server > [!WARNING] > Building pgvectorscale on macOS X86 (Intel) machines is currently not @@ -116,16 +118,16 @@ instructions][pgvector-install]. The `CASCADE` automatically installs `pgvector`. -### Enable pgvectorscale in a Timescale Cloud service +### Enable pgvectorscale in a Tiger Cloud service -Note: the instructions below are for Timescale's standard compute instance. For production vector workloads, we're offering **private beta access to vector-optimized databases** with pgvector and pgvectorscale on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ). +Note: the instructions below are for Timescale's standard compute instance. For production vector workloads, we're offering **private beta access to vector-optimized databases** with pgvector and {PGVECTORSCALE} on Timescale. [Sign up here for priority access](https://timescale.typeform.com/to/H7lQ10eQ). -To enable pgvectorscale: +To enable {PGVECTORSCALE}: 1. Create a new [Timescale Service](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch). -If you want to use an existing service, pgvectorscale is added as an available extension on the first maintenance window -after the pgvectorscale release date. +If you want to use an existing service, {PGVECTORSCALE} is added as an available extension on the first maintenance window +after the {PGVECTORSCALE} release date. 1. 
Connect to your Timescale service: ```bash @@ -176,7 +178,7 @@ Note: pgvectorscale currently supports: cosine distance (`<=>`) queries, for ind ## Filtered Vector Search -pgvectorscale supports combining vector similarity search with metadata filtering. There are two basic kinds of filtering, which can be combined in a single query: +{PGVECTORSCALE} supports combining vector similarity search with metadata filtering. There are two basic kinds of filtering, which can be combined in a single query: 1. **Label-based filtering with the diskann index**: This provides optimized performance for filtering by labels. 2. **Arbitrary WHERE clause filtering**: This uses post-filtering after the vector search. @@ -207,9 +209,9 @@ For optimal performance with label filtering, you must specify the label column CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels); ``` -> **Note**: Label values must be within the PostgreSQL `smallint` range (-32768 to 32767). Using `smallint[]` for labels ensures that PostgreSQL's type system will automatically enforce these bounds. +> **Note**: Label values must be within the {PG} `smallint` range (-32768 to 32767). Using `smallint[]` for labels ensures that {PG}'s type system will automatically enforce these bounds. > -> pgvectorscale includes an implementation of the `&&` overlap operator for `smallint[]` arrays, which is used for efficient label-based filtering. +> {PGVECTORSCALE} includes an implementation of the `&&` overlap operator for `smallint[]` arrays, which is used for efficient label-based filtering. 3. Perform label-filtered vector searches using the `&&` operator (array overlap): @@ -284,7 +286,7 @@ This approach gives you the performance benefits of integer-based label filterin ### Arbitrary WHERE Clause Filtering -You can also use any PostgreSQL WHERE clause with vector search, but these conditions will be applied as post-filtering: +You can also use any {PG} WHERE clause with vector search, but these conditions will be applied as post-filtering: ```postgresql -- Find similar documents with specific status and date range @@ -401,16 +403,16 @@ ERROR: ambuildempty: not yet implemented ## Get involved -pgvectorscale is still at an early stage. Now is a great time to help shape the +{PGVECTORSCALE} is still at an early stage. Now is a great time to help shape the direction of this project; we are currently deciding priorities. Have a look at the list of features we're thinking of working on. Feel free to comment, expand the list, or hop on the Discussions forum. ## About Timescale -Timescale is a PostgreSQL cloud company. To learn more visit the [timescale.com](https://www.timescale.com). +Timescale is a {PG} cloud company. To learn more visit the [timescale.com](https://www.timescale.com). -[Timescale Cloud](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch) is a high-performance, developer focused, cloud platform that provides PostgreSQL services for the most demanding AI, time-series, analytics, and event workloads. Timescale Cloud is ideal for production applications and provides high availability, streaming backups, upgrades over time, roles and permissions, and great security. +[{CLOUD_LONG}](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch) is a high-performance, developer focused, cloud platform that provides {PG} services for the most demanding AI, time-series, analytics, and event workloads. 
{CLOUD_LONG} is ideal for production applications and provides high availability, streaming backups, upgrades over time, roles and permissions, and great security. [pgvector]: https://github.com/pgvector/pgvector/blob/master/README.md [rust-language]: https://www.rust-lang.org/ diff --git a/api-reference/api-reference.mdx b/api-reference/api-reference.mdx deleted file mode 100644 index 1b1742c..0000000 --- a/api-reference/api-reference.mdx +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: Tiger Data APIs -description: Simulate and analyze a transport dataset in your Tiger Cloud service -products: [cloud, mst, self_hosted] -keywords: [IoT, simulate] -mode: "wide" ---- - - - - Postgres tables in TimescaleDB that automatically partition your time-series data by time. - - - The hybrid row-columnar storage engine in TimescaleDB used by hypertables. - - - Incrementally refresh aggregation queies in the background. - - \ No newline at end of file diff --git a/api-reference/overview.mdx b/api-reference/overview.mdx new file mode 100644 index 0000000..8b9c983 --- /dev/null +++ b/api-reference/overview.mdx @@ -0,0 +1,45 @@ +--- +title: Tiger Data APIs +description: Complete API reference for TimescaleDB, pgai, pgvectorscale, and Tiger Cloud +products: [cloud, mst, self_hosted] +keywords: [API, reference, TimescaleDB, pgai, pgvectorscale] +mode: "center" +--- + + + + Core time-series database functions including hypertables, continuous aggregates, compression, and data retention policies. + + + Advanced hyperfunctions for time-series analysis: statistical aggregates, percentile approximation, financial analysis, and more. + + + AI model integration functions for OpenAI, Ollama, Anthropic, Cohere, and automated vectorization. + + + High-performance vector similarity search with StreamingDiskANN indexes for AI/ML applications. + + + Programmatic service management and operations for Tiger Cloud. + + \ No newline at end of file diff --git a/api-reference/pgai/index.mdx b/api-reference/pgai/index.mdx new file mode 100644 index 0000000..0ef66f3 --- /dev/null +++ b/api-reference/pgai/index.mdx @@ -0,0 +1,67 @@ +--- +title: pgai API reference +sidebarTitle: Overview +description: Complete API reference for pgai functions and AI operations in PostgreSQL +products: [cloud, mst, self_hosted] +keywords: [API, reference, AI, pgai, embeddings, LLM, vectorizer] +mode: "wide" +--- + + + + GPT models, embeddings, moderation, and tokenization + + + + Local LLM hosting with embedding and chat support + + + + Claude models for advanced reasoning + + + + Embeddings, classification, and reranking + + + + Specialized embedding models for retrieval + + + + Unified interface for 100+ LLM providers + + + + Automatically create and sync embeddings for your data + + + diff --git a/api-reference/pgai/model-calling/anthropic/anthropic_generate.mdx b/api-reference/pgai/model-calling/anthropic/anthropic_generate.mdx new file mode 100644 index 0000000..f791246 --- /dev/null +++ b/api-reference/pgai/model-calling/anthropic/anthropic_generate.mdx @@ -0,0 +1,169 @@ +--- +title: anthropic_generate() +description: Generate text completions using Claude models for sophisticated reasoning and analysis +keywords: [Anthropic, Claude, generation, reasoning, AI] +tags: [AI, Anthropic, Claude, generation] +license: community +type: function +--- + +Generate text completions using Anthropic's Claude models. 
This function supports multi-turn conversations, system +prompts, tool use, and vision capabilities for sophisticated reasoning and analysis tasks. + +## Samples + +### Basic text generation + +Generate a simple response: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Explain PostgreSQL in one sentence') + ) +)->'content'->0->>'text'; +``` + +### Multi-turn conversation + +Continue a conversation with message history: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?'), + jsonb_build_object('role', 'assistant', 'content', 'PostgreSQL is a powerful open-source relational database.'), + jsonb_build_object('role', 'user', 'content', 'What makes it different from MySQL?') + ) +)->'content'->0->>'text'; +``` + +### Use a system prompt + +Guide Claude's behavior: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Explain databases') + ), + system_prompt => 'You are a helpful database expert. Give concise, technical answers with code examples.' +)->'content'->0->>'text'; +``` + +### Control creativity with temperature + +Adjust the randomness of responses: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Write a creative story about databases') + ), + temperature => 0.9, + max_tokens => 2000 +)->'content'->0->>'text'; +``` + +### Use tools (function calling) + +Enable Claude to call functions: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is the weather in Paris?') + ), + tools => '[ + { + "name": "get_weather", + "description": "Get current weather for a location", + "input_schema": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "City name" + } + }, + "required": ["location"] + } + } + ]'::jsonb, + tool_choice => '{"type": "auto"}'::jsonb +); +``` + +### Control stop sequences + +Stop generation at specific sequences: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'List three database types') + ), + stop_sequences => ARRAY['4.', 'Fourth'] +)->'content'->0->>'text'; +``` + +### Use with API key name + +Reference a stored API key: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Hello, Claude!') + ), + api_key_name => 'ANTHROPIC_API_KEY' +)->'content'->0->>'text'; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Claude model to use (e.g., `claude-3-5-sonnet-20241022`) | +| `messages` | `JSONB` | - | ✔ | Array of message objects with `role` and `content` | +| `max_tokens` | `INT` | `1024` | ✖ | Maximum tokens to generate (required by Anthropic API) | +| `api_key` | `TEXT` | `NULL` | ✖ | Anthropic API key. 
If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `base_url` | `TEXT` | `NULL` | ✖ | Custom API base URL | +| `timeout` | `FLOAT8` | `NULL` | ✖ | Request timeout in seconds | +| `max_retries` | `INT` | `NULL` | ✖ | Maximum number of retry attempts | +| `system_prompt` | `TEXT` | `NULL` | ✖ | System prompt to guide model behavior | +| `user_id` | `TEXT` | `NULL` | ✖ | Unique identifier for the end user | +| `stop_sequences` | `TEXT[]` | `NULL` | ✖ | Sequences that stop generation | +| `temperature` | `FLOAT8` | `NULL` | ✖ | Sampling temperature (0.0 to 1.0) | +| `tool_choice` | `JSONB` | `NULL` | ✖ | How the model should use tools (e.g., `{"type": "auto"}`) | +| `tools` | `JSONB` | `NULL` | ✖ | Function definitions for tool use | +| `top_k` | `INT` | `NULL` | ✖ | Only sample from top K options | +| `top_p` | `FLOAT8` | `NULL` | ✖ | Nucleus sampling threshold (0.0 to 1.0) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`JSONB`: The complete API response including: +- `id`: Unique message identifier +- `type`: Response type (always `"message"`) +- `role`: Role of the responder (always `"assistant"`) +- `content`: Array of content blocks (text, tool use, etc.) +- `model`: Model used for generation +- `stop_reason`: Why generation stopped (e.g., `"end_turn"`, `"max_tokens"`) +- `usage`: Token usage statistics + +## Related functions + +- [`anthropic_list_models()`][anthropic_list_models]: list available Claude models +- [`openai_chat_complete()`][openai_chat_complete]: alternative with OpenAI models + +[anthropic_list_models]: /api-reference/pgai/model-calling/anthropic/anthropic_list_models +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete diff --git a/api-reference/pgai/model-calling/anthropic/anthropic_list_models.mdx b/api-reference/pgai/model-calling/anthropic/anthropic_list_models.mdx new file mode 100644 index 0000000..73e92f7 --- /dev/null +++ b/api-reference/pgai/model-calling/anthropic/anthropic_list_models.mdx @@ -0,0 +1,90 @@ +--- +title: anthropic_list_models() +description: List available Anthropic Claude models +keywords: [Anthropic, Claude, models, list] +tags: [AI, Anthropic, models, management] +license: community +type: function +--- + +List all Claude models available through the Anthropic API. This function returns basic information about each model +including its ID, name, and creation date. + +## Samples + +### List all models + +See all available Claude models: + +```sql +SELECT * FROM ai.anthropic_list_models(); +``` + +Returns: + +```text + id | name | created +------------------------------------+-------------------------+------------------- + claude-3-5-sonnet-20241022 | Claude 3.5 Sonnet | 2024-10-22 ... + claude-3-opus-20240229 | Claude 3 Opus | 2024-02-29 ... + claude-3-sonnet-20240229 | Claude 3 Sonnet | 2024-02-29 ... + claude-3-haiku-20240307 | Claude 3 Haiku | 2024-03-07 ... 
+``` + +### Use with API key name + +Reference a stored API key: + +```sql +SELECT * FROM ai.anthropic_list_models( + api_key_name => 'ANTHROPIC_API_KEY' +); +``` + +### Filter by model name + +Find specific models: + +```sql +SELECT id, name +FROM ai.anthropic_list_models() +WHERE name LIKE '%Sonnet%'; +``` + +### Get the latest model + +Find the most recently released model: + +```sql +SELECT id, name, created +FROM ai.anthropic_list_models() +ORDER BY created DESC +LIMIT 1; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `api_key` | `TEXT` | `NULL` | ✖ | Anthropic API key. If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `base_url` | `TEXT` | `NULL` | ✖ | Custom API base URL | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `id` | `TEXT` | Model identifier (e.g., `claude-3-5-sonnet-20241022`) | +| `name` | `TEXT` | Human-readable model name (e.g., `Claude 3.5 Sonnet`) | +| `created` | `TIMESTAMPTZ` | When the model was released | + +## Related functions + +- [`anthropic_generate()`][anthropic_generate]: generate text with Claude models +- [`openai_list_models()`][openai_list_models]: list OpenAI models + +[anthropic_generate]: /api-reference/pgai/model-calling/anthropic/anthropic_generate +[openai_list_models]: /api-reference/pgai/model-calling/openai/openai_list_models diff --git a/api-reference/pgai/model-calling/anthropic/index.mdx b/api-reference/pgai/model-calling/anthropic/index.mdx new file mode 100644 index 0000000..067484b --- /dev/null +++ b/api-reference/pgai/model-calling/anthropic/index.mdx @@ -0,0 +1,119 @@ +--- +title: Anthropic functions +sidebarTitle: Overview +description: Generate completions with Claude models for advanced reasoning and analysis +keywords: [Anthropic, Claude, AI, reasoning, analysis] +tags: [AI, Anthropic, Claude, generation] +license: community +type: function +--- + +import { PG } from '/snippets/vars.mdx'; + +Call Anthropic's Claude API directly from SQL to generate sophisticated text responses, analyze content, and perform +complex reasoning tasks using state-of-the-art language models. + +## What is Anthropic Claude? + +Claude is Anthropic's family of large language models known for sophisticated reasoning, nuanced analysis, and helpful, +harmless, and honest responses. Claude models excel at complex tasks like analysis, summarization, creative writing, +and detailed explanations. + +## Key features + +- **Advanced reasoning**: Superior performance on complex analytical tasks +- **Long context**: Support for extended conversations and large documents +- **Tool use**: Built-in function calling capabilities +- **Safety-focused**: Designed to be helpful, harmless, and honest +- **Vision support**: Analyze images with Claude 3 models + +## Prerequisites + +To use Anthropic functions, you need: + +1. An Anthropic API key from [console.anthropic.com](https://console.anthropic.com) +2. 
API key configured in your database (see configuration section below) + +## Quick start + +### Generate a completion + +Generate text with Claude: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Explain PostgreSQL in one sentence') + ) +)->'content'->0->>'text'; +``` + +### Use a system prompt + +Guide Claude's behavior with a system prompt: + +```sql +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is a database?') + ), + system_prompt => 'You are a helpful database expert. Give concise, technical answers.' +)->'content'->0->>'text'; +``` + +### List available models + +See which Claude models you can use: + +```sql +SELECT * FROM ai.anthropic_list_models(); +``` + +## Configuration + +Store your Anthropic API key securely in the database: + +```sql +-- Store API key as a secret +SELECT ai.create_secret('ANTHROPIC_API_KEY', 'your-api-key-here'); + +-- Use the secret by name +SELECT ai.anthropic_generate( + 'claude-3-5-sonnet-20241022', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Hello, Claude!') + ), + api_key_name => 'ANTHROPIC_API_KEY' +); +``` + +## Available functions + +### Generation + +- [`anthropic_generate()`][anthropic_generate]: generate text with Claude models + +### Model management + +- [`anthropic_list_models()`][anthropic_list_models]: list available Claude models + +## Available models + +Anthropic offers several Claude model families: + +- **Claude 3.5 Sonnet** (`claude-3-5-sonnet-20241022`): Best balance of intelligence and speed +- **Claude 3 Opus** (`claude-3-opus-20240229`): Most powerful for complex tasks +- **Claude 3 Sonnet** (`claude-3-sonnet-20240229`): Balance of intelligence and speed +- **Claude 3 Haiku** (`claude-3-haiku-20240307`): Fast and cost-effective + +## Resources + +- [Anthropic documentation](https://docs.anthropic.com) +- [Claude models overview](https://docs.anthropic.com/en/docs/models-overview) +- [Claude API reference](https://docs.anthropic.com/en/api) +- [Anthropic console](https://console.anthropic.com) + +[anthropic_generate]: /api-reference/pgai/model-calling/anthropic/anthropic_generate +[anthropic_list_models]: /api-reference/pgai/model-calling/anthropic/anthropic_list_models diff --git a/api-reference/pgai/model-calling/cohere/cohere_chat_complete.mdx b/api-reference/pgai/model-calling/cohere/cohere_chat_complete.mdx new file mode 100644 index 0000000..f53bf64 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_chat_complete.mdx @@ -0,0 +1,97 @@ +--- +title: cohere_chat_complete() +description: Generate chat completions with RAG support using Cohere's Command models +keywords: [Cohere, chat, RAG, conversation] +tags: [AI, Cohere, chat, RAG] +license: community +type: function +--- + +Generate chat completions using Cohere's Command models with built-in support for retrieval augmented generation (RAG). +Ground responses in your documents and get automatic citations. 
+ +## Samples + +### Basic chat + +```sql +SELECT ai.cohere_chat_complete( + 'command-r-plus', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'message'->>'content'; +``` + +### Chat with documents (RAG) + +```sql +SELECT ai.cohere_chat_complete( + 'command-r-plus', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What are the key features?') + ), + documents => '[ + {"text": "PostgreSQL supports ACID transactions"}, + {"text": "TimescaleDB extends PostgreSQL for time-series"} + ]'::jsonb +); +``` + +### Use tools + +```sql +SELECT ai.cohere_chat_complete( + 'command-r-plus', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is the weather?') + ), + tools => '[ + { + "name": "get_weather", + "description": "Get current weather", + "parameter_definitions": { + "location": {"type": "str", "required": true} + } + } + ]'::jsonb +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | Cohere model (e.g., `command-r-plus`, `command-r`) | +| `messages` | `JSONB` | - | ✔ | Array of message objects with `role` and `content` | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of secret containing the API key | +| `tools` | `JSONB` | `NULL` | ✖ | Tool definitions for function calling | +| `documents` | `JSONB` | `NULL` | ✖ | Documents for RAG | +| `citation_options` | `JSONB` | `NULL` | ✖ | Citation configuration | +| `response_format` | `JSONB` | `NULL` | ✖ | Response format specification | +| `safety_mode` | `TEXT` | `NULL` | ✖ | Safety mode setting | +| `max_tokens` | `INT` | `NULL` | ✖ | Maximum tokens to generate | +| `stop_sequences` | `TEXT[]` | `NULL` | ✖ | Sequences that stop generation | +| `temperature` | `FLOAT8` | `NULL` | ✖ | Sampling temperature (0.0 to 1.0) | +| `seed` | `INT` | `NULL` | ✖ | Random seed for reproducibility | +| `frequency_penalty` | `FLOAT8` | `NULL` | ✖ | Frequency penalty | +| `presence_penalty` | `FLOAT8` | `NULL` | ✖ | Presence penalty | +| `k` | `INT` | `NULL` | ✖ | Top-k sampling parameter | +| `p` | `FLOAT8` | `NULL` | ✖ | Top-p (nucleus) sampling parameter | +| `logprobs` | `BOOLEAN` | `NULL` | ✖ | Return log probabilities | +| `tool_choice` | `TEXT` | `NULL` | ✖ | Tool choice strategy | +| `strict_tools` | `BOOL` | `NULL` | ✖ | Enforce strict tool schemas | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging | + +## Returns + +`JSONB`: Complete API response with message, citations, and metadata. 
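Because the full response is returned as `JSONB`, you can unpack both the grounded answer and its supporting citations from a single call. A minimal sketch, assuming the citations are exposed under `message` as in Cohere's chat API (the exact JSON path may vary by API version):

```sql
-- Ask a grounded question, then pull out the answer text and its citations
WITH response AS (
    SELECT ai.cohere_chat_complete(
        'command-r-plus',
        jsonb_build_array(
            jsonb_build_object('role', 'user', 'content', 'Which extension handles time-series data?')
        ),
        documents => '[
            {"text": "PostgreSQL supports ACID transactions"},
            {"text": "TimescaleDB extends PostgreSQL for time-series"}
        ]'::jsonb
    ) AS r
)
SELECT
    r->'message'->>'content'  AS answer,
    r->'message'->'citations' AS citations  -- assumed path for citation metadata
FROM response;
```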
+ +## Related functions + +- [`cohere_rerank()`][cohere_rerank]: rerank documents for RAG +- [`anthropic_generate()`][anthropic_generate]: alternative with Claude models + +[cohere_rerank]: /api-reference/pgai/model-calling/cohere/cohere_rerank +[anthropic_generate]: /api-reference/pgai/model-calling/anthropic/anthropic_generate diff --git a/api-reference/pgai/model-calling/cohere/cohere_classify.mdx b/api-reference/pgai/model-calling/cohere/cohere_classify.mdx new file mode 100644 index 0000000..e4e4f69 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_classify.mdx @@ -0,0 +1,25 @@ +--- +title: cohere_classify() +description: Classify text into categories with complete API response +keywords: [Cohere, classification, API response] +tags: [AI, Cohere, classification] +license: community +type: function +--- + +Classify text into categories and get the complete API response including confidence scores, labels, and additional +metadata. For a simpler response format, use [`cohere_classify_simple()`][cohere_classify_simple]. + +## Arguments + +This function accepts the same arguments as [`cohere_classify_simple()`][cohere_classify_simple]. + +## Returns + +`JSONB`: The complete API response including classifications array with predictions, confidences, and metadata. + +## Related functions + +- [`cohere_classify_simple()`][cohere_classify_simple]: simplified response format + +[cohere_classify_simple]: /api-reference/pgai/model-calling/cohere/cohere_classify_simple diff --git a/api-reference/pgai/model-calling/cohere/cohere_classify_simple.mdx b/api-reference/pgai/model-calling/cohere/cohere_classify_simple.mdx new file mode 100644 index 0000000..eb7da40 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_classify_simple.mdx @@ -0,0 +1,109 @@ +--- +title: cohere_classify_simple() +description: Classify text into categories with a simplified response +keywords: [Cohere, classification, categorization, sentiment] +tags: [AI, Cohere, classification, categorization] +license: community +type: function +--- + +Classify text into categories using Cohere's models with custom examples. This function returns a simplified table +format with the input text, predicted label, and confidence score. 
+ +## Samples + +### Classify sentiment + +Categorize text as positive or negative: + +```sql +SELECT * +FROM ai.cohere_classify_simple( + 'embed-english-v3.0', + ARRAY['I love this product', 'This is terrible', 'Pretty good overall'], + examples => '[ + {"text": "This is amazing", "label": "positive"}, + {"text": "Absolutely wonderful", "label": "positive"}, + {"text": "This is awful", "label": "negative"}, + {"text": "Terrible experience", "label": "negative"} + ]'::jsonb +); +``` + +Returns: + +```text + input | prediction | confidence +--------------------+------------+------------ + I love this product| positive | 0.98 + This is terrible | negative | 0.95 + Pretty good overall| positive | 0.72 +``` + +### Classify support tickets + +Categorize customer inquiries: + +```sql +SELECT * +FROM ai.cohere_classify_simple( + 'embed-english-v3.0', + ARRAY[ + 'My password is not working', + 'When will my order arrive?', + 'I want to cancel my subscription' + ], + examples => '[ + {"text": "Cannot log in", "label": "technical"}, + {"text": "Forgot my password", "label": "technical"}, + {"text": "Where is my package", "label": "shipping"}, + {"text": "Delivery status", "label": "shipping"}, + {"text": "Cancel my account", "label": "billing"}, + {"text": "Refund request", "label": "billing"} + ]'::jsonb +); +``` + +### Batch classification + +Classify multiple texts at once: + +```sql +INSERT INTO classified_feedback (feedback_text, category, confidence) +SELECT input, prediction, confidence +FROM ai.cohere_classify_simple( + 'embed-english-v3.0', + ARRAY(SELECT feedback FROM pending_feedback LIMIT 100), + examples => (SELECT classification_examples FROM model_config WHERE model = 'feedback') +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Cohere model to use (e.g., `embed-english-v3.0`) | +| `inputs` | `TEXT[]` | - | ✔ | Array of texts to classify | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key. 
If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `examples` | `JSONB` | `NULL` | ✖ | Training examples as array of `{"text": "...", "label": "..."}` objects | +| `truncate_long_inputs` | `TEXT` | `NULL` | ✖ | How to handle long inputs: `START`, `END`, `NONE` | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `input` | `TEXT` | The input text that was classified | +| `prediction` | `TEXT` | The predicted category label | +| `confidence` | `FLOAT8` | Confidence score (0.0 to 1.0) for the prediction | + +## Related functions + +- [`cohere_classify()`][cohere_classify]: full API response with additional metadata +- [`cohere_embed()`][cohere_embed]: generate embeddings for custom classification + +[cohere_classify]: /api-reference/pgai/model-calling/cohere/cohere_classify +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed diff --git a/api-reference/pgai/model-calling/cohere/cohere_detokenize.mdx b/api-reference/pgai/model-calling/cohere/cohere_detokenize.mdx new file mode 100644 index 0000000..d1fc8d1 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_detokenize.mdx @@ -0,0 +1,52 @@ +--- +title: cohere_detokenize() +description: Convert token IDs back into text +keywords: [Cohere, detokenize, decode] +tags: [AI, Cohere, tokens] +license: community +type: function +--- + +Convert an array of token IDs back into readable text. This is the inverse operation of `cohere_tokenize()`. + +## Samples + +### Detokenize tokens + +```sql +SELECT ai.cohere_detokenize( + 'embed-english-v3.0', + ARRAY[5432, 8754, 389, 264, 8147, 4729] +); +``` + +Returns: `'PostgreSQL is a powerful database'` + +### Round-trip tokenization + +```sql +SELECT ai.cohere_detokenize( + 'embed-english-v3.0', + ai.cohere_tokenize('embed-english-v3.0', 'Hello, world!') +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | Cohere model for detokenization | +| `tokens` | `INT[]` | - | ✔ | Array of token IDs to convert | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of secret containing the API key | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging | + +## Returns + +`TEXT`: The reconstructed text from the token IDs. + +## Related functions + +- [`cohere_tokenize()`][cohere_tokenize]: convert text into tokens + +[cohere_tokenize]: /api-reference/pgai/model-calling/cohere/cohere_tokenize diff --git a/api-reference/pgai/model-calling/cohere/cohere_embed.mdx b/api-reference/pgai/model-calling/cohere/cohere_embed.mdx new file mode 100644 index 0000000..e1c2b39 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_embed.mdx @@ -0,0 +1,94 @@ +--- +title: cohere_embed() +description: Generate vector embeddings using Cohere's multilingual models +keywords: [Cohere, embeddings, vectors, semantic search] +tags: [AI, embeddings, Cohere, vectors] +license: community +type: function +--- + +Generate vector embeddings from text using Cohere's enterprise-grade embedding models. Cohere embeddings excel at +multilingual semantic search, clustering, and classification tasks. 
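A typical search pattern is to embed the query with `input_type => 'search_query'` and compare it against stored `search_document` embeddings using pgvector's cosine distance operator. A minimal sketch, assuming a hypothetical `documents` table with an `embedding` vector column populated by this function:

```sql
-- Return the five documents closest to the query by cosine distance (<=>)
SELECT id, content
FROM documents
ORDER BY embedding <=> ai.cohere_embed(
    'embed-english-v3.0',
    'best database for time-series workloads',
    input_type => 'search_query'
)
LIMIT 5;
```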
+ +## Samples + +### Generate an embedding + +Create a vector embedding: + +```sql +SELECT ai.cohere_embed( + 'embed-english-v3.0', + 'PostgreSQL is a powerful database' +); +``` + +### Specify input type + +Optimize embeddings for your use case: + +```sql +-- For search queries +SELECT ai.cohere_embed( + 'embed-english-v3.0', + 'best database for time-series', + input_type => 'search_query' +); + +-- For documents to be searched +SELECT ai.cohere_embed( + 'embed-english-v3.0', + 'PostgreSQL is a relational database', + input_type => 'search_document' +); +``` + +### Store embeddings in a table + +Generate and store embeddings for your data: + +```sql +UPDATE documents +SET embedding = ai.cohere_embed( + 'embed-english-v3.0', + content, + input_type => 'search_document' +) +WHERE embedding IS NULL; +``` + +### Multilingual embeddings + +Use multilingual models for non-English content: + +```sql +SELECT ai.cohere_embed( + 'embed-multilingual-v3.0', + 'La base de datos PostgreSQL es poderosa', + input_type => 'search_document' +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Cohere embedding model to use (e.g., `embed-english-v3.0`) | +| `input_text` | `TEXT` | - | ✔ | Text to embed | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key. If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `input_type` | `TEXT` | `NULL` | ✖ | Type of input: `search_query`, `search_document`, `classification`, `clustering` | +| `truncate_long_inputs` | `TEXT` | `NULL` | ✖ | How to handle long inputs: `START`, `END`, `NONE` | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`vector`: A pgvector compatible vector containing the embedding. + +## Related functions + +- [`cohere_rerank()`][cohere_rerank]: rerank search results for better relevance +- [`openai_embed()`][openai_embed]: alternative with OpenAI models + +[cohere_rerank]: /api-reference/pgai/model-calling/cohere/cohere_rerank +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed diff --git a/api-reference/pgai/model-calling/cohere/cohere_list_models.mdx b/api-reference/pgai/model-calling/cohere/cohere_list_models.mdx new file mode 100644 index 0000000..6dc581d --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_list_models.mdx @@ -0,0 +1,64 @@ +--- +title: cohere_list_models() +description: List available Cohere models +keywords: [Cohere, models, list] +tags: [AI, Cohere, models] +license: community +type: function +--- + +List all models available through the Cohere API, including embedding models, reranking models, and chat models. 
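You can also narrow the list with the `endpoint` argument instead of filtering the results in SQL; for example, to see only reranking models (a small sketch using the argument documented below):

```sql
-- List only the models that serve the rerank endpoint
SELECT name, context_length
FROM ai.cohere_list_models(endpoint => 'rerank');
```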
+ +## Samples + +### List all models + +```sql +SELECT * FROM ai.cohere_list_models(); +``` + +### Filter embedding models + +```sql +SELECT name, context_length +FROM ai.cohere_list_models() +WHERE 'embed' = ANY(endpoints); +``` + +### Find default models + +```sql +SELECT name, endpoints +FROM ai.cohere_list_models(default_only => true); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of secret containing the API key | +| `endpoint` | `TEXT` | `NULL` | ✖ | Filter by endpoint type (e.g., `embed`, `rerank`, `chat`) | +| `default_only` | `BOOL` | `NULL` | ✖ | Show only default models | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `name` | `TEXT` | Model name | +| `endpoints` | `TEXT[]` | Supported endpoints (embed, rerank, chat, etc.) | +| `finetuned` | `BOOL` | Whether the model is fine-tuned | +| `context_length` | `INT` | Maximum context length in tokens | +| `tokenizer_url` | `TEXT` | URL to the tokenizer | +| `default_endpoints` | `TEXT[]` | Endpoints where this is the default model | + +## Related functions + +- [`cohere_embed()`][cohere_embed]: generate embeddings +- [`cohere_chat_complete()`][cohere_chat_complete]: chat with Cohere models + +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed +[cohere_chat_complete]: /api-reference/pgai/model-calling/cohere/cohere_chat_complete diff --git a/api-reference/pgai/model-calling/cohere/cohere_rerank.mdx b/api-reference/pgai/model-calling/cohere/cohere_rerank.mdx new file mode 100644 index 0000000..8006e96 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_rerank.mdx @@ -0,0 +1,25 @@ +--- +title: cohere_rerank() +description: Rerank documents by relevance with complete API response +keywords: [Cohere, rerank, API response] +tags: [AI, Cohere, reranking] +license: community +type: function +--- + +Rerank documents by semantic relevance and get the complete API response including relevance scores and metadata. For a +simpler response format, use [`cohere_rerank_simple()`][cohere_rerank_simple]. + +## Arguments + +This function accepts the same arguments as [`cohere_rerank_simple()`][cohere_rerank_simple]. + +## Returns + +`JSONB`: The complete API response including results array with indexes, relevance scores, and document metadata. + +## Related functions + +- [`cohere_rerank_simple()`][cohere_rerank_simple]: simplified response format + +[cohere_rerank_simple]: /api-reference/pgai/model-calling/cohere/cohere_rerank_simple diff --git a/api-reference/pgai/model-calling/cohere/cohere_rerank_simple.mdx b/api-reference/pgai/model-calling/cohere/cohere_rerank_simple.mdx new file mode 100644 index 0000000..b001d00 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_rerank_simple.mdx @@ -0,0 +1,117 @@ +--- +title: cohere_rerank_simple() +description: Rerank search results by semantic relevance with a simplified response +keywords: [Cohere, rerank, search, relevance] +tags: [AI, Cohere, reranking, search] +license: community +type: function +--- + +Rerank a list of documents by semantic relevance to a query. This function returns a simplified table format with just +the document index, text, and relevance score. 
+ +## Samples + +### Rerank search results + +Improve search relevance by reordering results: + +```sql +SELECT * +FROM ai.cohere_rerank_simple( + 'rerank-english-v3.0', + 'What is a database?', + ARRAY[ + 'PostgreSQL is a relational database', + 'Python is a programming language', + 'A database stores and manages data', + 'JavaScript is used for web development' + ] +) +ORDER BY relevance_score DESC; +``` + +Returns: + +```text + index | document | relevance_score +-------+------------------------------------+----------------- + 2 | A database stores and manages data | 0.95 + 0 | PostgreSQL is a relational database| 0.89 + 1 | Python is a programming language | 0.12 + 3 | JavaScript is used for web dev | 0.08 +``` + +### Limit results with top_n + +Return only the most relevant documents: + +```sql +SELECT * +FROM ai.cohere_rerank_simple( + 'rerank-english-v3.0', + 'time-series databases', + ARRAY[ + 'TimescaleDB extends PostgreSQL for time-series data', + 'MongoDB is a document database', + 'Time-series data has temporal ordering', + 'Redis is an in-memory cache' + ], + top_n => 2 +) +ORDER BY relevance_score DESC; +``` + +### Use in a search pipeline + +Combine vector search with reranking: + +```sql +WITH vector_results AS ( + SELECT content, embedding <=> query_embedding AS distance + FROM documents + ORDER BY embedding <=> query_embedding + LIMIT 20 +) +SELECT content, relevance_score +FROM vector_results +CROSS JOIN LATERAL ai.cohere_rerank_simple( + 'rerank-english-v3.0', + 'user query here', + ARRAY_AGG(content) OVER (), + top_n => 5 +) AS reranked +WHERE content = document +ORDER BY relevance_score DESC; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Cohere reranking model (e.g., `rerank-english-v3.0`) | +| `query` | `TEXT` | - | ✔ | The search query | +| `documents` | `TEXT[]` | - | ✔ | Array of documents to rerank | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key. 
If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `top_n` | `INT` | `NULL` | ✖ | Return only the top N most relevant documents | +| `max_tokens_per_doc` | `INT` | `NULL` | ✖ | Maximum tokens per document (for truncation) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `index` | `INT` | Original index of the document in the input array (0-based) | +| `document` | `TEXT` | The document text | +| `relevance_score` | `FLOAT8` | Relevance score (0.0 to 1.0, higher is more relevant) | + +## Related functions + +- [`cohere_rerank()`][cohere_rerank]: full API response with additional metadata +- [`cohere_embed()`][cohere_embed]: generate embeddings for vector search + +[cohere_rerank]: /api-reference/pgai/model-calling/cohere/cohere_rerank +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed diff --git a/api-reference/pgai/model-calling/cohere/cohere_tokenize.mdx b/api-reference/pgai/model-calling/cohere/cohere_tokenize.mdx new file mode 100644 index 0000000..2b844db --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/cohere_tokenize.mdx @@ -0,0 +1,53 @@ +--- +title: cohere_tokenize() +description: Convert text into token IDs using Cohere's tokenizer +keywords: [Cohere, tokenize, tokens] +tags: [AI, Cohere, tokens] +license: community +type: function +--- + +Convert text into an array of token IDs using Cohere's tokenizer. Useful for counting tokens before API calls or +analyzing tokenization patterns. + +## Samples + +### Tokenize text + +```sql +SELECT ai.cohere_tokenize( + 'embed-english-v3.0', + 'PostgreSQL is a powerful database' +); +``` + +Returns: `{5432, 8754, 389, 264, 8147, 4729}` + +### Count tokens + +```sql +SELECT array_length(ai.cohere_tokenize( + 'embed-english-v3.0', + 'PostgreSQL is a powerful database' +), 1) AS token_count; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | Cohere model for tokenization | +| `text_input` | `TEXT` | - | ✔ | Text to tokenize | +| `api_key` | `TEXT` | `NULL` | ✖ | Cohere API key | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of secret containing the API key | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging | + +## Returns + +`INT[]`: Array of token IDs. + +## Related functions + +- [`cohere_detokenize()`][cohere_detokenize]: convert tokens back to text + +[cohere_detokenize]: /api-reference/pgai/model-calling/cohere/cohere_detokenize diff --git a/api-reference/pgai/model-calling/cohere/index.mdx b/api-reference/pgai/model-calling/cohere/index.mdx new file mode 100644 index 0000000..eb952b3 --- /dev/null +++ b/api-reference/pgai/model-calling/cohere/index.mdx @@ -0,0 +1,163 @@ +--- +title: Cohere functions +sidebarTitle: Overview +description: Generate embeddings, classify text, rerank results, and chat with Cohere's enterprise AI models +keywords: [Cohere, embeddings, classification, reranking, chat] +tags: [AI, Cohere, embeddings, classification, reranking] +license: community +type: function +--- + +import { PG } from '/snippets/vars.mdx'; + +Call Cohere's API directly from SQL to access enterprise-grade embeddings, text classification, semantic reranking, +and chat capabilities optimized for production use. + +## What is Cohere? 
+ +Cohere provides production-ready AI models for enterprises, specializing in embeddings, classification, and retrieval +augmented generation (RAG). Cohere's models excel at semantic search, text classification, and reranking results for +improved relevance. + +## Key features + +- **Enterprise embeddings**: High-quality multilingual embeddings for semantic search +- **Text classification**: Classify text into categories with custom examples +- **Semantic reranking**: Improve search relevance by reordering results +- **Chat with RAG**: Ground responses in your documents +- **Production-ready**: Built for scale with enterprise support + +## Prerequisites + +To use Cohere functions, you need: + +1. A Cohere API key from [dashboard.cohere.com](https://dashboard.cohere.com) +2. API key configured in your database (see configuration section below) + +## Quick start + +### Generate embeddings + +Create vector embeddings for semantic search: + +```sql +SELECT ai.cohere_embed( + 'embed-english-v3.0', + 'PostgreSQL is a powerful database' +); +``` + +### Classify text + +Categorize text using custom examples: + +```sql +SELECT * FROM ai.cohere_classify_simple( + 'embed-english-v3.0', + ARRAY['I love this product', 'This is terrible'], + examples => '[ + {"text": "This is amazing", "label": "positive"}, + {"text": "This is awful", "label": "negative"} + ]'::jsonb +); +``` + +### Rerank search results + +Improve search relevance by reordering results: + +```sql +SELECT * FROM ai.cohere_rerank_simple( + 'rerank-english-v3.0', + 'What is a database?', + ARRAY[ + 'PostgreSQL is a relational database', + 'Python is a programming language', + 'A database stores and manages data' + ] +) +ORDER BY relevance_score DESC; +``` + +### Chat completion + +Have conversations grounded in your documents: + +```sql +SELECT ai.cohere_chat_complete( + 'command-r-plus', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'message'->>'content'; +``` + +## Configuration + +Store your Cohere API key securely in the database: + +```sql +-- Store API key as a secret +SELECT ai.create_secret('COHERE_API_KEY', 'your-api-key-here'); + +-- Use the secret by name +SELECT ai.cohere_embed( + 'embed-english-v3.0', + 'sample text', + api_key_name => 'COHERE_API_KEY' +); +``` + +## Available functions + +### Embeddings + +- [`cohere_embed()`][cohere_embed]: generate vector embeddings from text + +### Classification + +- [`cohere_classify()`][cohere_classify]: classify text with full API response +- [`cohere_classify_simple()`][cohere_classify_simple]: classify text with simplified response + +### Reranking + +- [`cohere_rerank()`][cohere_rerank]: rerank documents with full API response +- [`cohere_rerank_simple()`][cohere_rerank_simple]: rerank documents with simplified response + +### Chat + +- [`cohere_chat_complete()`][cohere_chat_complete]: chat with RAG support + +### Tokenization + +- [`cohere_tokenize()`][cohere_tokenize]: convert text to tokens +- [`cohere_detokenize()`][cohere_detokenize]: convert tokens back to text + +### Model management + +- [`cohere_list_models()`][cohere_list_models]: list available Cohere models + +## Available models + +Cohere offers specialized models for different tasks: + +- **Embeddings**: `embed-english-v3.0`, `embed-multilingual-v3.0` +- **Reranking**: `rerank-english-v3.0`, `rerank-multilingual-v3.0` +- **Chat**: `command-r-plus`, `command-r` + +## Resources + +- [Cohere documentation](https://docs.cohere.com) +- [Cohere models 
overview](https://docs.cohere.com/docs/models) +- [Cohere API reference](https://docs.cohere.com/reference) +- [Cohere dashboard](https://dashboard.cohere.com) + +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed +[cohere_classify]: /api-reference/pgai/model-calling/cohere/cohere_classify +[cohere_classify_simple]: /api-reference/pgai/model-calling/cohere/cohere_classify_simple +[cohere_rerank]: /api-reference/pgai/model-calling/cohere/cohere_rerank +[cohere_rerank_simple]: /api-reference/pgai/model-calling/cohere/cohere_rerank_simple +[cohere_chat_complete]: /api-reference/pgai/model-calling/cohere/cohere_chat_complete +[cohere_tokenize]: /api-reference/pgai/model-calling/cohere/cohere_tokenize +[cohere_detokenize]: /api-reference/pgai/model-calling/cohere/cohere_detokenize +[cohere_list_models]: /api-reference/pgai/model-calling/cohere/cohere_list_models diff --git a/api-reference/pgai/model-calling/litellm/index.mdx b/api-reference/pgai/model-calling/litellm/index.mdx new file mode 100644 index 0000000..48e9ce3 --- /dev/null +++ b/api-reference/pgai/model-calling/litellm/index.mdx @@ -0,0 +1,106 @@ +--- +title: LiteLLM functions +sidebarTitle: Overview +description: Unified interface for 100+ LLM providers with a single API +keywords: [LiteLLM, unified API, multi-provider, embeddings] +tags: [AI, LiteLLM, embeddings, multi-provider] +license: community +type: function +--- + +import { PG } from '/snippets/vars.mdx'; + +Call any LLM provider's API through LiteLLM's unified interface. LiteLLM translates a single API format into calls to +OpenAI, Azure, AWS Bedrock, Google Vertex AI, Anthropic, Cohere, and 100+ other providers. + +## What is LiteLLM? + +LiteLLM is a unified interface that lets you call different LLM APIs using the same format. Instead of learning each +provider's API, you can use one consistent interface and switch between providers by changing the model name. + +## Key features + +- **100+ providers**: Access OpenAI, Azure, AWS Bedrock, Google Vertex AI, Anthropic, Cohere, and more +- **Unified API**: One interface for all providers +- **Easy switching**: Change providers by updating the model name +- **Fallback support**: Automatically retry failed requests with different providers +- **Cost tracking**: Built-in usage and cost tracking across providers + +## Prerequisites + +To use LiteLLM functions, you need: + +1. API keys for the providers you want to use +2. 
API keys configured in your database (see configuration section below) + +## Quick start + +### Use OpenAI through LiteLLM + +```sql +SELECT ai.litellm_embed( + 'text-embedding-ada-002', + 'PostgreSQL is a powerful database', + api_key_name => 'OPENAI_API_KEY' +); +``` + +### Use Azure OpenAI + +```sql +SELECT ai.litellm_embed( + 'azure/my-deployment', + 'PostgreSQL is a powerful database', + api_key_name => 'AZURE_API_KEY', + extra_options => '{"api_base": "https://my-endpoint.openai.azure.com/"}'::jsonb +); +``` + +### Use AWS Bedrock + +```sql +SELECT ai.litellm_embed( + 'bedrock/amazon.titan-embed-text-v1', + 'PostgreSQL is a powerful database', + extra_options => '{"aws_region_name": "us-east-1"}'::jsonb +); +``` + +### Use Google Vertex AI + +```sql +SELECT ai.litellm_embed( + 'vertex_ai/textembedding-gecko', + 'PostgreSQL is a powerful database', + api_key_name => 'VERTEX_AI_KEY', + extra_options => '{"vertex_project": "my-project", "vertex_location": "us-central1"}'::jsonb +); +``` + +## Configuration + +LiteLLM typically requires provider-specific configuration through environment variables or the `extra_options` +parameter. Consult the [LiteLLM documentation](https://docs.litellm.ai/docs/providers) for provider-specific setup. + +## Available functions + +### Embeddings + +- [`litellm_embed()`][litellm_embed]: generate embeddings from any supported provider + +## Model naming convention + +LiteLLM uses a simple naming convention: + +- **OpenAI models**: Use the model name directly (e.g., `text-embedding-ada-002`) +- **Other providers**: Prefix with provider name (e.g., `azure/deployment-name`, `bedrock/model-id`, `vertex_ai/model-name`) + +See the [LiteLLM providers documentation](https://docs.litellm.ai/docs/providers) for the complete list. + +## Resources + +- [LiteLLM documentation](https://docs.litellm.ai) +- [LiteLLM providers](https://docs.litellm.ai/docs/providers) +- [LiteLLM GitHub](https://github.com/BerriAI/litellm) + +[litellm_embed]: /api-reference/pgai/model-calling/litellm/litellm_embed diff --git a/api-reference/pgai/model-calling/litellm/litellm_embed.mdx b/api-reference/pgai/model-calling/litellm/litellm_embed.mdx new file mode 100644 index 0000000..84190ae --- /dev/null +++ b/api-reference/pgai/model-calling/litellm/litellm_embed.mdx @@ -0,0 +1,160 @@ +--- +title: litellm_embed() +description: Generate embeddings from 100+ providers through a unified API +keywords: [LiteLLM, embeddings, multi-provider, unified API] +tags: [AI, LiteLLM, embeddings, vectors] +license: community +type: function +--- + +Generate vector embeddings from any LLM provider through LiteLLM's unified interface. Switch between OpenAI, Azure, +AWS Bedrock, Google Vertex AI, and 100+ other providers by simply changing the model name. 
+ +## Samples + +### Use OpenAI + +```sql +SELECT ai.litellm_embed( + 'text-embedding-ada-002', + 'PostgreSQL is a powerful database', + api_key_name => 'OPENAI_API_KEY' +); +``` + +### Use Azure OpenAI + +Embed with Azure OpenAI deployment: + +```sql +SELECT ai.litellm_embed( + 'azure/my-embedding-deployment', + 'PostgreSQL is a powerful database', + api_key_name => 'AZURE_API_KEY', + extra_options => '{ + "api_base": "https://my-resource.openai.azure.com/", + "api_version": "2023-05-15" + }'::jsonb +); +``` + +### Use AWS Bedrock + +Embed with Amazon Titan: + +```sql +SELECT ai.litellm_embed( + 'bedrock/amazon.titan-embed-text-v1', + 'PostgreSQL is a powerful database', + extra_options => '{ + "aws_region_name": "us-east-1", + "aws_access_key_id": "your-key", + "aws_secret_access_key": "your-secret" + }'::jsonb +); +``` + +### Use Google Vertex AI + +Embed with Vertex AI: + +```sql +SELECT ai.litellm_embed( + 'vertex_ai/textembedding-gecko', + 'PostgreSQL is a powerful database', + extra_options => '{ + "vertex_project": "my-project-id", + "vertex_location": "us-central1" + }'::jsonb +); +``` + +### Batch embeddings + +Process multiple texts efficiently: + +```sql +SELECT index, embedding +FROM ai.litellm_embed( + 'text-embedding-ada-002', + ARRAY[ + 'PostgreSQL is a powerful database', + 'TimescaleDB extends PostgreSQL', + 'pgai brings AI to PostgreSQL' + ], + api_key_name => 'OPENAI_API_KEY' +); +``` + +### Store embeddings in a table + +```sql +UPDATE documents +SET embedding = ai.litellm_embed( + 'text-embedding-ada-002', + content, + api_key_name => 'OPENAI_API_KEY' +) +WHERE embedding IS NULL; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | Model identifier with optional provider prefix (e.g., `text-embedding-ada-002`, `azure/deployment`, `bedrock/model-id`) | +| `input_text` | `TEXT` | - | ✔ | Single text input to embed (use this OR `input_texts`) | +| `input_texts` | `TEXT[]` | - | ✔ | Array of text inputs to embed in a batch | +| `api_key` | `TEXT` | `NULL` | ✖ | API key for the provider | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `extra_options` | `JSONB` | `NULL` | ✖ | Provider-specific options (API base URL, region, project, etc.) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +**For single text input:** +- `vector`: A pgvector compatible vector containing the embedding + +**For array input:** +- `TABLE(index INT, embedding vector)`: A table with an index and embedding for each input text + +## Provider-specific configuration + +Different providers require different configurations through the `extra_options` parameter: + +### Azure OpenAI +```json +{ + "api_base": "https://resource.openai.azure.com/", + "api_version": "2023-05-15" +} +``` + +### AWS Bedrock +```json +{ + "aws_region_name": "us-east-1", + "aws_access_key_id": "key", + "aws_secret_access_key": "secret" +} +``` + +### Google Vertex AI +```json +{ + "vertex_project": "project-id", + "vertex_location": "us-central1" +} +``` + +See [LiteLLM providers documentation](https://docs.litellm.ai/docs/providers) for complete configuration options. 
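Rather than passing raw keys in each call, you can store provider credentials in the database and reference them by name. A minimal sketch, assuming `ai.create_secret()` is available as shown on the other model-calling pages:

```sql
-- Store the provider key once, then reference it by name
SELECT ai.create_secret('OPENAI_API_KEY', 'your-api-key-here');

SELECT ai.litellm_embed(
    'text-embedding-ada-002',
    'PostgreSQL is a powerful database',
    api_key_name => 'OPENAI_API_KEY'
);
```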
+ +## Related functions + +- [`openai_embed()`][openai_embed]: direct OpenAI integration +- [`cohere_embed()`][cohere_embed]: direct Cohere integration +- [`voyageai_embed()`][voyageai_embed]: direct Voyage AI integration + +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed +[voyageai_embed]: /api-reference/pgai/model-calling/voyageai/voyageai_embed diff --git a/api-reference/pgai/model-calling/ollama/index.mdx b/api-reference/pgai/model-calling/ollama/index.mdx new file mode 100644 index 0000000..7bafc9e --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/index.mdx @@ -0,0 +1,116 @@ +--- +title: Ollama functions +sidebarTitle: Overview +description: Run local LLMs with embeddings, chat completion, and model management +keywords: [Ollama, local LLM, embeddings, chat, open source] +tags: [AI, Ollama, local, embeddings, chat] +license: community +type: function +--- + +import { PG } from '/snippets/vars.mdx'; + +Call Ollama's local LLM API directly from SQL to generate embeddings, completions, and chat responses using open-source +models running on your infrastructure. + +## What is Ollama? + +Ollama is a tool for running large language models locally on your own hardware. Unlike cloud-based APIs, Ollama +provides complete control over your models, data privacy, and costs. It supports popular open-source models like Llama, +Mistral, and CodeLlama. + +## Key features + +- **Privacy-first**: All data stays on your infrastructure +- **Cost-effective**: No per-token API costs +- **Offline operation**: Works without internet connectivity +- **Open-source models**: Access to Llama 2, Mistral, CodeLlama, and more +- **Full control**: Manage model versions and configurations + +## Prerequisites + +Before using Ollama functions, you need to: + +1. Install and run Ollama on your infrastructure +2. Pull the models you want to use +3. Ensure your {PG} database can access the Ollama host + +For installation instructions, visit [ollama.com](https://ollama.com). 
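Once Ollama is running, a quick way to confirm that the database can reach it is to list the locally installed models from SQL. A minimal check, assuming the default host (all Ollama functions accept a `host` parameter, as described under Configuration below):

```sql
-- Returns one row per locally installed model if the Ollama host is reachable
SELECT * FROM ai.ollama_list_models(host => 'http://localhost:11434');
```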
+ +## Quick start + +### Generate embeddings + +Create vector embeddings using a local model: + +```sql +SELECT ai.ollama_embed( + 'llama2', + 'PostgreSQL is a powerful database', + host => 'http://localhost:11434' +); +``` + +### Generate completions + +Get text completions from a local model: + +```sql +SELECT ai.ollama_generate( + 'llama2', + 'Explain what PostgreSQL is in one sentence' +)->'response'; +``` + +### Chat completion + +Have a conversation with a local model: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'message'->>'content'; +``` + +## Available functions + +### Embeddings + +- [`ollama_embed()`][ollama_embed]: generate vector embeddings from text + +### Completions and chat + +- [`ollama_generate()`][ollama_generate]: generate text completions with optional images +- [`ollama_chat_complete()`][ollama_chat_complete]: multi-turn conversations with tool support + +### Model management + +- [`ollama_list_models()`][ollama_list_models]: list all locally installed models +- [`ollama_ps()`][ollama_ps]: show currently running models and their resource usage + +## Configuration + +All Ollama functions accept a `host` parameter to specify the Ollama server location: + +```sql +-- Use default host (http://localhost:11434) +SELECT ai.ollama_embed('llama2', 'sample text'); + +-- Specify custom host +SELECT ai.ollama_embed('llama2', 'sample text', host => 'http://ollama-server:11434'); +``` + +## Resources + +- [Ollama documentation](https://github.com/ollama/ollama/tree/main/docs) +- [Ollama models library](https://ollama.com/library) +- [Ollama API reference](https://github.com/ollama/ollama/blob/main/docs/api.md) + +[ollama_embed]: /api-reference/pgai/model-calling/ollama/ollama_embed +[ollama_generate]: /api-reference/pgai/model-calling/ollama/ollama_generate +[ollama_chat_complete]: /api-reference/pgai/model-calling/ollama/ollama_chat_complete +[ollama_list_models]: /api-reference/pgai/model-calling/ollama/ollama_list_models +[ollama_ps]: /api-reference/pgai/model-calling/ollama/ollama_ps diff --git a/api-reference/pgai/model-calling/ollama/ollama_chat_complete.mdx b/api-reference/pgai/model-calling/ollama/ollama_chat_complete.mdx new file mode 100644 index 0000000..15a5f1c --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/ollama_chat_complete.mdx @@ -0,0 +1,145 @@ +--- +title: ollama_chat_complete() +description: Generate chat completions using local Ollama models +keywords: [Ollama, chat, conversation, local LLM] +tags: [AI, chat, Ollama, local] +license: community +type: function +--- + +Generate chat completions using locally hosted Ollama models. This function supports multi-turn conversations, tool +calling, and structured output with complete data privacy. 
+ +## Samples + +### Basic chat completion + +Have a conversation with a local model: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'message'->>'content'; +``` + +### Multi-turn conversation + +Continue a conversation with message history: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?'), + jsonb_build_object('role', 'assistant', 'content', 'PostgreSQL is a powerful open-source database.'), + jsonb_build_object('role', 'user', 'content', 'What makes it different from MySQL?') + ) +)->'message'->>'content'; +``` + +### Use with specific host + +Connect to a custom Ollama server: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Explain databases') + ), + host => 'http://ollama-server:11434' +)->'message'->>'content'; +``` + +### Configure chat options + +Customize the chat parameters: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Write a creative story') + ), + chat_options => '{"temperature": 0.9, "top_p": 0.95}'::jsonb +)->'message'->>'content'; +``` + +### Structured output with JSON + +Request JSON responses: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'List 3 database types') + ), + response_format => '{"type": "json"}'::jsonb +)->'message'->>'content'; +``` + +### Use tools (function calling) + +Enable the model to call tools: + +```sql +SELECT ai.ollama_chat_complete( + 'llama2', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is the weather in Paris?') + ), + tools => '[ + { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get current weather", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string"} + } + } + } + } + ]'::jsonb +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Ollama model to use (e.g., `llama2`, `mistral`, `codellama`) | +| `messages` | `JSONB` | - | ✔ | Array of message objects with `role` and `content` | +| `host` | `TEXT` | `NULL` | ✖ | Ollama server URL (defaults to `http://localhost:11434`) | +| `keep_alive` | `TEXT` | `NULL` | ✖ | How long to keep the model loaded (e.g., `5m`, `1h`) | +| `chat_options` | `JSONB` | `NULL` | ✖ | Model-specific options like temperature, top_p | +| `tools` | `JSONB` | `NULL` | ✖ | Function definitions for tool calling | +| `response_format` | `JSONB` | `NULL` | ✖ | Format specification (e.g., `{"type": "json"}`) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`JSONB`: The complete API response including: +- `model`: Model used for the chat +- `message`: The assistant's response with `role` and `content` +- `created_at`: Response timestamp +- `done`: Whether generation is complete +- `total_duration`: Total time taken +- `prompt_eval_count`: Number of tokens in prompt +- `eval_count`: Number of tokens generated + +## Related functions + +- [`ollama_generate()`][ollama_generate]: single-turn text completion +- [`ollama_embed()`][ollama_embed]: generate embeddings +- [`ollama_list_models()`][ollama_list_models]: see available models + 
+[ollama_generate]: /api-reference/pgai/model-calling/ollama/ollama_generate +[ollama_embed]: /api-reference/pgai/model-calling/ollama/ollama_embed +[ollama_list_models]: /api-reference/pgai/model-calling/ollama/ollama_list_models diff --git a/api-reference/pgai/model-calling/ollama/ollama_embed.mdx b/api-reference/pgai/model-calling/ollama/ollama_embed.mdx new file mode 100644 index 0000000..c2db3a4 --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/ollama_embed.mdx @@ -0,0 +1,88 @@ +--- +title: ollama_embed() +description: Generate vector embeddings using local Ollama models +keywords: [Ollama, embeddings, vectors, local LLM] +tags: [AI, embeddings, Ollama, local] +license: community +type: function +--- + +Generate vector embeddings from text using locally hosted Ollama models. Embeddings are numerical representations of +text that capture semantic meaning, ideal for semantic search, recommendations, and clustering without sending data to +external APIs. + +## Samples + +### Generate an embedding + +Create a vector embedding using a local model: + +```sql +SELECT ai.ollama_embed( + 'llama2', + 'PostgreSQL is a powerful database' +); +``` + +### Specify Ollama host + +Connect to a specific Ollama server: + +```sql +SELECT ai.ollama_embed( + 'llama2', + 'PostgreSQL is a powerful database', + host => 'http://ollama-server:11434' +); +``` + +### Configure model options + +Customize the embedding generation: + +```sql +SELECT ai.ollama_embed( + 'llama2', + 'PostgreSQL is a powerful database', + embedding_options => '{"temperature": 0.5}'::jsonb +); +``` + +### Store embeddings in a table + +Generate and store embeddings for your data: + +```sql +UPDATE documents +SET embedding = ai.ollama_embed( + 'llama2', + content, + host => 'http://localhost:11434' +) +WHERE embedding IS NULL; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Ollama model to use (e.g., `llama2`, `mistral`, `nomic-embed-text`) | +| `input_text` | `TEXT` | - | ✔ | Text input to embed | +| `host` | `TEXT` | `NULL` | ✖ | Ollama server URL (defaults to `http://localhost:11434`) | +| `keep_alive` | `TEXT` | `NULL` | ✖ | How long to keep the model loaded (e.g., `5m`, `1h`) | +| `embedding_options` | `JSONB` | `NULL` | ✖ | Model-specific options as JSON | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`vector`: A pgvector compatible vector containing the embedding. 
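Because the result is a pgvector `vector`, it can be used directly in similarity queries. A minimal sketch, assuming a hypothetical `documents` table whose `embedding` column was populated with the same model:

```sql
-- Five nearest documents to an ad-hoc question, by cosine distance (pgvector's <=> operator)
SELECT id, content
FROM documents
ORDER BY embedding <=> ai.ollama_embed('nomic-embed-text', 'How do I partition time-series data?')
LIMIT 5;
```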
+ +## Related functions + +- [`ollama_generate()`][ollama_generate]: generate text completions +- [`ollama_chat_complete()`][ollama_chat_complete]: chat with local models +- [`ollama_list_models()`][ollama_list_models]: see available models + +[ollama_generate]: /api-reference/pgai/model-calling/ollama/ollama_generate +[ollama_chat_complete]: /api-reference/pgai/model-calling/ollama/ollama_chat_complete +[ollama_list_models]: /api-reference/pgai/model-calling/ollama/ollama_list_models diff --git a/api-reference/pgai/model-calling/ollama/ollama_generate.mdx b/api-reference/pgai/model-calling/ollama/ollama_generate.mdx new file mode 100644 index 0000000..00ac4f3 --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/ollama_generate.mdx @@ -0,0 +1,114 @@ +--- +title: ollama_generate() +description: Generate text completions using local Ollama models +keywords: [Ollama, completion, generation, local LLM] +tags: [AI, generation, Ollama, local] +license: community +type: function +--- + +Generate text completions using locally hosted Ollama models. Unlike chat completion, this function is designed for +single-turn text generation with optional system prompts, images, and custom templates. + +## Samples + +### Generate a completion + +Get a text completion from a local model: + +```sql +SELECT ai.ollama_generate( + 'llama2', + 'Explain what PostgreSQL is in one sentence' +)->'response'; +``` + +### Use a system prompt + +Set a system prompt to control the model's behavior: + +```sql +SELECT ai.ollama_generate( + 'llama2', + 'What is a database?', + system_prompt => 'You are a helpful database expert. Give concise answers.' +)->'response'; +``` + +### Add context for continuation + +Continue a previous generation using context: + +```sql +-- First generation +WITH first_gen AS ( + SELECT ai.ollama_generate('llama2', 'Tell me about databases') AS result +) +-- Continue with context +SELECT ai.ollama_generate( + 'llama2', + 'Tell me more about PostgreSQL specifically', + context => (SELECT (result->'context')::text::int[] FROM first_gen) +)->'response'; +``` + +### Generate with images + +Analyze images with vision-capable models: + +```sql +SELECT ai.ollama_generate( + 'llava', + 'What do you see in this image?', + images => ARRAY[(SELECT content FROM images WHERE id = 1)] +)->'response'; +``` + +### Configure model options + +Customize the generation parameters: + +```sql +SELECT ai.ollama_generate( + 'llama2', + 'Write a creative story', + embedding_options => '{"temperature": 0.9, "top_p": 0.9}'::jsonb +)->'response'; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The Ollama model to use (e.g., `llama2`, `mistral`, `codellama`) | +| `prompt` | `TEXT` | - | ✔ | The prompt to generate a response for | +| `host` | `TEXT` | `NULL` | ✖ | Ollama server URL (defaults to `http://localhost:11434`) | +| `images` | `BYTEA[]` | `NULL` | ✖ | Array of images for multimodal models | +| `keep_alive` | `TEXT` | `NULL` | ✖ | How long to keep the model loaded (e.g., `5m`, `1h`) | +| `embedding_options` | `JSONB` | `NULL` | ✖ | Model-specific options like temperature, top_p | +| `system_prompt` | `TEXT` | `NULL` | ✖ | System prompt to set model behavior | +| `template` | `TEXT` | `NULL` | ✖ | Custom prompt template | +| `context` | `INT[]` | `NULL` | ✖ | Context from a previous generation for continuation | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + 
+`JSONB`: The complete API response including: +- `model`: Model used for generation +- `response`: The generated text +- `context`: Context array for continuation +- `created_at`: Generation timestamp +- `done`: Whether generation is complete +- `total_duration`: Total time taken +- `prompt_eval_count`: Number of tokens in prompt +- `eval_count`: Number of tokens generated + +## Related functions + +- [`ollama_chat_complete()`][ollama_chat_complete]: multi-turn conversations +- [`ollama_embed()`][ollama_embed]: generate embeddings +- [`ollama_list_models()`][ollama_list_models]: see available models + +[ollama_chat_complete]: /api-reference/pgai/model-calling/ollama/ollama_chat_complete +[ollama_embed]: /api-reference/pgai/model-calling/ollama/ollama_embed +[ollama_list_models]: /api-reference/pgai/model-calling/ollama/ollama_list_models diff --git a/api-reference/pgai/model-calling/ollama/ollama_list_models.mdx b/api-reference/pgai/model-calling/ollama/ollama_list_models.mdx new file mode 100644 index 0000000..5427427 --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/ollama_list_models.mdx @@ -0,0 +1,99 @@ +--- +title: ollama_list_models() +description: List all locally installed Ollama models +keywords: [Ollama, models, list, management] +tags: [AI, Ollama, models, management] +license: community +type: function +--- + +List all models that are locally installed and available on your Ollama server. This function provides detailed +information about each model including size, family, parameters, and when it was last modified. + +## Samples + +### List all models + +See all installed models: + +```sql +SELECT * FROM ai.ollama_list_models(); +``` + +Returns: + +```text + name | model | size | digest | family | format | ... +-------------+--------------+------------+------------+---------+--------+----- + llama2 | llama2:latest| 3825819519 | sha256:... | llama | gguf | ... + mistral | mistral:7b | 4109865159 | sha256:... | mistral | gguf | ... + codellama | codellama:7b | 3825819519 | sha256:... | llama | gguf | ... 
+``` + +### Connect to specific host + +List models from a remote Ollama server: + +```sql +SELECT * FROM ai.ollama_list_models( + host => 'http://ollama-server:11434' +); +``` + +### Filter by model name + +Find a specific model: + +```sql +SELECT name, size, modified_at +FROM ai.ollama_list_models() +WHERE name LIKE 'llama%'; +``` + +### Check model sizes + +See how much disk space models are using: + +```sql +SELECT + name, + pg_size_pretty(size) AS disk_space, + modified_at +FROM ai.ollama_list_models() +ORDER BY size DESC; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `host` | `TEXT` | `NULL` | ✖ | Ollama server URL (defaults to `http://localhost:11434`) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `name` | `TEXT` | Model name (e.g., `llama2`, `mistral:7b`) | +| `model` | `TEXT` | Full model identifier | +| `size` | `BIGINT` | Model size in bytes | +| `digest` | `TEXT` | SHA256 digest of the model | +| `family` | `TEXT` | Model family (e.g., `llama`, `mistral`) | +| `format` | `TEXT` | Model format (typically `gguf`) | +| `families` | `JSONB` | Array of model families | +| `parent_model` | `TEXT` | Parent model if this is a derivative | +| `parameter_size` | `TEXT` | Number of parameters (e.g., `7B`, `13B`) | +| `quantization_level` | `TEXT` | Quantization level (e.g., `Q4_0`, `Q5_K_M`) | +| `modified_at` | `TIMESTAMPTZ` | Last modification timestamp | + +## Related functions + +- [`ollama_ps()`][ollama_ps]: see currently running models +- [`ollama_embed()`][ollama_embed]: generate embeddings with a model +- [`ollama_chat_complete()`][ollama_chat_complete]: chat with a model + +[ollama_ps]: /api-reference/pgai/model-calling/ollama/ollama_ps +[ollama_embed]: /api-reference/pgai/model-calling/ollama/ollama_embed +[ollama_chat_complete]: /api-reference/pgai/model-calling/ollama/ollama_chat_complete diff --git a/api-reference/pgai/model-calling/ollama/ollama_ps.mdx b/api-reference/pgai/model-calling/ollama/ollama_ps.mdx new file mode 100644 index 0000000..19d5c8e --- /dev/null +++ b/api-reference/pgai/model-calling/ollama/ollama_ps.mdx @@ -0,0 +1,113 @@ +--- +title: ollama_ps() +description: List currently running Ollama models and their resource usage +keywords: [Ollama, models, running, monitoring, VRAM] +tags: [AI, Ollama, monitoring, performance] +license: community +type: function +--- + +List all models currently loaded and running on your Ollama server. This function shows active models, when they will +expire from memory, and how much VRAM they are using. 
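+
+Before sending a request, you can also use this view to check whether a particular model is already loaded, so you know whether the next call has to pay the model load cost. A minimal sketch:
+
+```sql
+-- Check whether any llama2 variant is currently loaded in memory
+SELECT EXISTS (
+    SELECT 1
+    FROM ai.ollama_ps()
+    WHERE name LIKE 'llama2%'
+) AS llama2_loaded;
+```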
+ +## Samples + +### List running models + +See which models are currently loaded: + +```sql +SELECT * FROM ai.ollama_ps(); +``` + +Returns: + +```text + name | model | size | expires_at | size_vram +-------------+--------------+------------+---------------------+----------- + llama2 | llama2:latest| 3825819519 | 2024-01-15 14:30:00 | 4096000000 +``` + +### Monitor model expiration + +Check when models will unload from memory: + +```sql +SELECT + name, + expires_at, + expires_at - now() AS time_until_unload +FROM ai.ollama_ps() +WHERE expires_at IS NOT NULL +ORDER BY expires_at; +``` + +### Check VRAM usage + +See how much video memory models are consuming: + +```sql +SELECT + name, + pg_size_pretty(size_vram) AS vram_usage, + pg_size_pretty(size) AS total_size +FROM ai.ollama_ps() +ORDER BY size_vram DESC; +``` + +### Connect to specific host + +Monitor models on a remote Ollama server: + +```sql +SELECT * FROM ai.ollama_ps( + host => 'http://ollama-server:11434' +); +``` + +### Total resource usage + +Calculate total VRAM used by all running models: + +```sql +SELECT + count(*) AS running_models, + pg_size_pretty(sum(size_vram)) AS total_vram +FROM ai.ollama_ps(); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `host` | `TEXT` | `NULL` | ✖ | Ollama server URL (defaults to `http://localhost:11434`) | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +`TABLE`: A table with the following columns: + +| Column | Type | Description | +|--------|------|-------------| +| `name` | `TEXT` | Model name (e.g., `llama2`, `mistral:7b`) | +| `model` | `TEXT` | Full model identifier | +| `size` | `BIGINT` | Model size in bytes | +| `digest` | `TEXT` | SHA256 digest of the model | +| `parent_model` | `TEXT` | Parent model if this is a derivative | +| `format` | `TEXT` | Model format (typically `gguf`) | +| `family` | `TEXT` | Model family (e.g., `llama`, `mistral`) | +| `families` | `JSONB` | Array of model families | +| `parameter_size` | `TEXT` | Number of parameters (e.g., `7B`, `13B`) | +| `quantization_level` | `TEXT` | Quantization level (e.g., `Q4_0`, `Q5_K_M`) | +| `expires_at` | `TIMESTAMPTZ` | When the model will unload from memory | +| `size_vram` | `BIGINT` | VRAM usage in bytes | + +## Related functions + +- [`ollama_list_models()`][ollama_list_models]: see all installed models +- [`ollama_embed()`][ollama_embed]: generate embeddings with a model +- [`ollama_chat_complete()`][ollama_chat_complete]: chat with a model + +[ollama_list_models]: /api-reference/pgai/model-calling/ollama/ollama_list_models +[ollama_embed]: /api-reference/pgai/model-calling/ollama/ollama_embed +[ollama_chat_complete]: /api-reference/pgai/model-calling/ollama/ollama_chat_complete diff --git a/api-reference/pgai/model-calling/openai/index.mdx b/api-reference/pgai/model-calling/openai/index.mdx new file mode 100644 index 0000000..95ed900 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/index.mdx @@ -0,0 +1,114 @@ +--- +title: OpenAI functions overview +sidebarTitle: Overview +description: Call OpenAI models directly from PostgreSQL for embeddings, chat completion, moderation, and tokenization +keywords: [OpenAI, AI, LLM, embeddings, chat, moderation] +tags: [AI, embeddings, chat, GPT] +--- + +import { PG, PGAI_SHORT } from '/snippets/vars.mdx'; + +Use OpenAI's powerful language models directly from {PG} with {PGAI_SHORT}. 
These functions enable you to generate +embeddings, complete chat conversations, moderate content, and work with tokens without leaving your database. + +## Prerequisites + +To use OpenAI functions, you need an [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + +Set your API key as an environment variable and configure it when connecting: + +```bash +export OPENAI_API_KEY="your-api-key" +PGOPTIONS="-c ai.openai_api_key=$OPENAI_API_KEY" psql -d "postgres://..." +``` + +For more configuration options, see [handling API keys](/api-reference/pgai/utilities/api-keys). + +## Samples + +### Generate embeddings for semantic search + +Create embeddings from text for vector similarity search: + +```sql +SELECT ai.openai_embed( + 'text-embedding-ada-002', + 'PostgreSQL is a powerful database' +); +``` + +### Complete a chat conversation + +Use GPT models for natural language responses: + +```sql +SELECT ai.openai_chat_complete( + 'gpt-4o-mini', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'choices'->0->'message'->>'content'; +``` + +### Moderate content + +Check if content violates OpenAI's usage policies: + +```sql +SELECT ai.openai_moderate( + 'text-moderation-latest', + 'some text to check' +); +``` + +### Tokenize text + +Count tokens to manage API costs and limits: + +```sql +SELECT array_length( + ai.openai_tokenize('text-embedding-ada-002', 'Hello world'), + 1 +) as token_count; +``` + +## Available functions + +### Model management +- [`openai_list_models()`][openai_list_models]: list available OpenAI models + +### Embeddings +- [`openai_embed()`][openai_embed]: generate vector embeddings from text, text arrays, or tokens + +### Chat completion +- [`openai_chat_complete()`][openai_chat_complete]: generate chat completions with full control over parameters +- [`openai_chat_complete_simple()`][openai_chat_complete_simple]: simplified chat completion for quick queries + +### Content moderation +- [`openai_moderate()`][openai_moderate]: check content for policy violations + +### Token management +- [`openai_tokenize()`][openai_tokenize]: convert text into tokens +- [`openai_detokenize()`][openai_detokenize]: convert tokens back into text + +### Advanced usage +- [`openai_embed_with_raw_response()`][openai_embed_with_raw_response]: get raw API response for embeddings +- [`openai_chat_complete_with_raw_response()`][openai_chat_complete_with_raw_response]: get raw API response for + chat completion +- [`openai_moderate_with_raw_response()`][openai_moderate_with_raw_response]: get raw API response for moderation +- [`openai_list_models_with_raw_response()`][openai_list_models_with_raw_response]: get raw API response for model + list +- [`openai_client_config()`][openai_client_config]: configure OpenAI client settings + +[openai_list_models]: /api-reference/pgai/model-calling/openai/openai_list_models +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete +[openai_chat_complete_simple]: /api-reference/pgai/model-calling/openai/openai_chat_complete_simple +[openai_moderate]: /api-reference/pgai/model-calling/openai/openai_moderate +[openai_tokenize]: /api-reference/pgai/model-calling/openai/openai_tokenize +[openai_detokenize]: /api-reference/pgai/model-calling/openai/openai_detokenize +[openai_embed_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_embed_with_raw_response 
+[openai_chat_complete_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_chat_complete_with_raw_response +[openai_moderate_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_moderate_with_raw_response +[openai_list_models_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_list_models_with_raw_response +[openai_client_config]: /api-reference/pgai/model-calling/openai/openai_client_config diff --git a/api-reference/pgai/model-calling/openai/openai_chat_complete.mdx b/api-reference/pgai/model-calling/openai/openai_chat_complete.mdx new file mode 100644 index 0000000..b2bdaf5 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_chat_complete.mdx @@ -0,0 +1,119 @@ +--- +title: openai_chat_complete() +description: Generate text completions and have conversations with OpenAI's chat models +keywords: [OpenAI, chat, GPT, completion, LLM] +tags: [AI, chat, GPT, text generation] +license: community +type: function +--- + +import OpenAPIChatComplete from '/snippets/api-reference/pgai/_openai_chat_complete_arguments.mdx'; + +Generate text completions using OpenAI's chat models like GPT-4 and GPT-3.5. This function enables you to have +multi-turn conversations, generate text from prompts, and leverage advanced language model capabilities directly from +PostgreSQL. + +## Samples + +### Generate a simple completion + +Ask a question and get a response: + +```sql +SELECT ai.openai_chat_complete( + 'gpt-4o-mini', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'What is PostgreSQL?') + ) +)->'choices'->0->'message'->>'content'; +``` + +### Use system prompts + +Set the behavior and context with a system message: + +```sql +SELECT ai.openai_chat_complete( + 'gpt-4o', + jsonb_build_array( + jsonb_build_object('role', 'system', 'content', 'You are a helpful database expert'), + jsonb_build_object('role', 'user', 'content', 'Explain hypertables in simple terms') + ) +)->'choices'->0->'message'->>'content'; +``` + +### Multi-turn conversation + +Continue a conversation with message history: + +```sql +SELECT ai.openai_chat_complete( + 'gpt-4o-mini', + jsonb_build_array( + jsonb_build_object('role', 'system', 'content', 'You are a SQL expert'), + jsonb_build_object('role', 'user', 'content', 'How do I create a table?'), + jsonb_build_object('role', 'assistant', 'content', 'Use CREATE TABLE...'), + jsonb_build_object('role', 'user', 'content', 'Now show me how to add an index') + ) +)->'choices'->0->'message'->>'content'; +``` + +### Control response format + +Specify max tokens and temperature: + +```sql +SELECT ai.openai_chat_complete( + 'gpt-4o-mini', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Write a haiku about databases') + ), + max_tokens => 50, + temperature => 0.7 +); +``` + +### Get the full response + +Access all response metadata: + +```sql +SELECT jsonb_pretty( + ai.openai_chat_complete( + 'gpt-4o-mini', + jsonb_build_array( + jsonb_build_object('role', 'user', 'content', 'Hello!') + ) + ) +); +``` + +## Arguments + + + +## Returns + +`JSONB`: A JSON object containing the completion response with the following structure: +- `id`: Unique identifier for the completion +- `object`: Always `"chat.completion"` +- `created`: Unix timestamp of when the completion was created +- `model`: The model used for completion +- `choices`: Array of completion choices + - `index`: Choice index + - `message`: The generated message with `role` and `content` + - `finish_reason`: Why the model stopped generating 
+- `usage`: Token usage information + - `prompt_tokens`: Tokens in the prompt + - `completion_tokens`: Tokens in the completion + - `total_tokens`: Total tokens used + +## Related functions + +- [`openai_chat_complete_simple()`][openai_chat_complete_simple]: simplified interface for quick queries +- [`openai_chat_complete_with_raw_response()`][openai_chat_complete_with_raw_response]: get raw HTTP response +- [`openai_tokenize()`][openai_tokenize]: count tokens before making API calls + +[openai_chat_complete_simple]: /api-reference/pgai/model-calling/openai/openai_chat_complete_simple +[openai_chat_complete_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_chat_complete_with_raw_response +[openai_tokenize]: /api-reference/pgai/model-calling/openai/openai_tokenize diff --git a/api-reference/pgai/model-calling/openai/openai_chat_complete_simple.mdx b/api-reference/pgai/model-calling/openai/openai_chat_complete_simple.mdx new file mode 100644 index 0000000..f8bc8c8 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_chat_complete_simple.mdx @@ -0,0 +1,52 @@ +--- +title: openai_chat_complete_simple() +description: Simplified interface for quick chat completions with OpenAI models +keywords: [OpenAI, chat, GPT, simple] +tags: [AI, chat, GPT] +license: community +type: function +--- + +A simplified wrapper around `openai_chat_complete()` for quick, single-turn chat interactions. Use this when you need +a straightforward question-and-answer interaction without the complexity of managing message arrays. + +## Samples + +### Quick question + +Get a response to a simple question: + +```sql +SELECT ai.openai_chat_complete_simple('What is PostgreSQL?'); +``` + +### Specify a model + +Use a different GPT model: + +```sql +SELECT ai.openai_chat_complete_simple( + 'What is a hypertable?', + model => 'gpt-4o' +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `message` | `TEXT` | - | ✔ | The user message/question | +| `api_key` | `TEXT` | `NULL` | ✖ | OpenAI API key. If not provided, uses `ai.openai_api_key` setting | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `model` | `TEXT` | `'gpt-4o-mini'` | ✖ | The OpenAI model to use | +| `client_config` | `JSONB` | `NULL` | ✖ | Advanced client configuration options | + +## Returns + +`TEXT`: The text response from the model. + +## Related functions + +- [`openai_chat_complete()`][openai_chat_complete]: full-featured chat completion with message history + +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete diff --git a/api-reference/pgai/model-calling/openai/openai_chat_complete_with_raw_response.mdx b/api-reference/pgai/model-calling/openai/openai_chat_complete_with_raw_response.mdx new file mode 100644 index 0000000..74af954 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_chat_complete_with_raw_response.mdx @@ -0,0 +1,27 @@ +--- +title: openai_chat_complete_with_raw_response() +description: Generate chat completions and get the raw HTTP response +keywords: [OpenAI, chat, raw response, HTTP] +tags: [AI, chat, advanced] +license: community +type: function +--- + +import OpenAPIChatComplete from '/snippets/api-reference/pgai/_openai_chat_complete_arguments.mdx'; + +Generate chat completions and receive the raw HTTP response. Use this when you need access to HTTP headers, +status codes, or other low-level response details. 
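+
+## Samples
+
+### Inspect the raw response
+
+This function takes the same arguments as `openai_chat_complete()`. A minimal sketch that pretty-prints the JSONB response so you can inspect it:
+
+```sql
+SELECT jsonb_pretty(
+    ai.openai_chat_complete_with_raw_response(
+        'gpt-4o-mini',
+        jsonb_build_array(
+            jsonb_build_object('role', 'user', 'content', 'Hello!')
+        )
+    )
+);
+```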
+ +## Arguments + + + +## Returns + +`JSONB`: The complete HTTP response including headers and body. + +## Related functions + +- [`openai_chat_complete()`][openai_chat_complete]: standard chat completion function + +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete diff --git a/api-reference/pgai/model-calling/openai/openai_client_config.mdx b/api-reference/pgai/model-calling/openai/openai_client_config.mdx new file mode 100644 index 0000000..ac0816b --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_client_config.mdx @@ -0,0 +1,73 @@ +--- +title: openai_client_config() +description: Configure OpenAI client settings for API requests +keywords: [OpenAI, configuration, client, settings] +tags: [AI, configuration] +license: community +type: function +--- + +Create a configuration object for customizing OpenAI API client behavior. Use this to set base URLs, timeouts, +organization IDs, and other advanced client options. + +## Samples + +### Configure timeout + +Set a custom timeout for API requests: + +```sql +SELECT ai.openai_client_config( + timeout_seconds => 60 +); +``` + +### Use custom base URL + +Point to a different OpenAI-compatible API endpoint: + +```sql +SELECT ai.openai_client_config( + base_url => 'https://api.custom-endpoint.com/v1' +); +``` + +### Full configuration + +Combine multiple settings: + +```sql +SELECT ai.openai_embed( + 'text-embedding-ada-002', + 'sample text', + client_config => ai.openai_client_config( + base_url => 'https://custom.openai.com/v1', + timeout_seconds => 120, + max_retries => 3 + ) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `base_url` | `TEXT` | `NULL` | ✖ | Custom base URL for the OpenAI API | +| `timeout_seconds` | `FLOAT8` | `NULL` | ✖ | Timeout in seconds for API requests | +| `organization` | `TEXT` | `NULL` | ✖ | OpenAI organization ID | +| `project` | `TEXT` | `NULL` | ✖ | OpenAI project ID | +| `max_retries` | `INT` | `NULL` | ✖ | Maximum number of retry attempts | +| `default_headers` | `JSONB` | `NULL` | ✖ | Default headers to include in all requests | +| `default_query` | `JSONB` | `NULL` | ✖ | Default query parameters for all requests | + +## Returns + +`JSONB`: A configuration object that can be passed to other OpenAI functions via the `client_config` parameter. + +## Related functions + +- [`openai_embed()`][openai_embed]: use custom client config with embeddings +- [`openai_chat_complete()`][openai_chat_complete]: use custom client config with chat + +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete diff --git a/api-reference/pgai/model-calling/openai/openai_detokenize.mdx b/api-reference/pgai/model-calling/openai/openai_detokenize.mdx new file mode 100644 index 0000000..82a881a --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_detokenize.mdx @@ -0,0 +1,70 @@ +--- +title: openai_detokenize() +description: Convert tokens back into readable text +keywords: [OpenAI, tokens, detokenize, decode] +tags: [AI, tokens, utilities] +license: community +type: function +--- + +Convert an array of token IDs back into readable text. This is the inverse operation of `openai_tokenize()` and is +useful for debugging tokenization or reconstructing text from tokens. 
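+
+Because detokenization accepts any array of token IDs, you can also decode tokens one at a time to see exactly how a sentence is split, which helps when debugging tokenization. A minimal sketch (individual tokens may decode to partial words):
+
+```sql
+-- Decode each token of a sentence individually
+SELECT
+    t AS token_id,
+    ai.openai_detokenize('text-embedding-ada-002', ARRAY[t]) AS token_text
+FROM unnest(
+    ai.openai_tokenize('text-embedding-ada-002', 'Hello, world!')
+) AS t;
+```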
+ +## Samples + +### Detokenize tokens + +Convert token IDs back into text: + +```sql +SELECT ai.openai_detokenize( + 'text-embedding-ada-002', + array[1820, 25977, 46840, 23874, 389, 264, 2579, 58466] +); +``` + +Returns: + +```text + openai_detokenize +-------------------------------------------- + the purple elephant sits on a red mushroom +``` + +### Round-trip tokenization + +Verify tokenization is reversible: + +```sql +SELECT ai.openai_detokenize( + 'text-embedding-ada-002', + ai.openai_tokenize('text-embedding-ada-002', 'Hello, world!') +); +``` + +Returns: + +```text + openai_detokenize +------------------- + Hello, world! +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The OpenAI model to detokenize for (e.g., `text-embedding-ada-002`, `gpt-4o`) | +| `tokens` | `INT[]` | - | ✔ | Array of token IDs to convert back into text | + +## Returns + +`TEXT`: The reconstructed text from the token IDs. + +## Related functions + +- [`openai_tokenize()`][openai_tokenize]: convert text into tokens +- [`openai_embed()`][openai_embed]: generate embeddings from tokens + +[openai_tokenize]: /api-reference/pgai/model-calling/openai/openai_tokenize +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed diff --git a/api-reference/pgai/model-calling/openai/openai_embed.mdx b/api-reference/pgai/model-calling/openai/openai_embed.mdx new file mode 100644 index 0000000..4aca184 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_embed.mdx @@ -0,0 +1,100 @@ +--- +title: openai_embed() +description: Generate vector embeddings from text using OpenAI models +keywords: [OpenAI, embeddings, vectors, semantic search] +tags: [AI, embeddings, vectors] +license: community +type: function +--- + +import OpenAPIEmbedArguments from '/snippets/api-reference/pgai/_openai_embed_arguments.mdx'; + +Generate vector embeddings from text, text arrays, or tokens using OpenAI's embedding models. Embeddings are numerical +representations of text that capture semantic meaning, making them ideal for semantic search, recommendations, and +clustering. 
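+
+Because embeddings that are close in vector space are close in meaning, you can compare two embeddings directly with a pgvector distance operator. A minimal sketch using cosine distance (`<=>`), where a smaller value means more similar text:
+
+```sql
+-- Related phrases produce a smaller cosine distance than unrelated ones
+SELECT
+    ai.openai_embed('text-embedding-3-small', 'PostgreSQL')
+    <=> ai.openai_embed('text-embedding-3-small', 'relational databases')
+    AS cosine_distance;
+```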
+ +## Samples + +### Generate an embedding from text + +Create a vector embedding for a single piece of text: + +```sql +SELECT ai.openai_embed( + 'text-embedding-ada-002', + 'PostgreSQL is a powerful database' +); +``` + +### Generate embeddings for multiple texts + +Process multiple texts at once for efficiency: + +```sql +SELECT ai.openai_embed( + 'text-embedding-ada-002', + array[ + 'PostgreSQL is a powerful database', + 'TimescaleDB extends PostgreSQL for time-series', + 'pgai brings AI capabilities to PostgreSQL' + ] +); +``` + +### Specify embedding dimensions + +Control the size of the output vector (model-dependent): + +```sql +SELECT ai.openai_embed( + 'text-embedding-3-small', + 'PostgreSQL is a powerful database', + dimensions => 768 +); +``` + +### Use pre-tokenized input + +Provide tokens directly instead of text: + +```sql +SELECT ai.openai_embed( + 'text-embedding-ada-002', + array[1820, 25977, 46840, 23874, 389, 264, 2579, 58466] +); +``` + +### Store embeddings in a table + +Generate and store embeddings for your data: + +```sql +UPDATE documents +SET embedding = ai.openai_embed( + 'text-embedding-ada-002', + content +) +WHERE embedding IS NULL; +``` + +## Arguments + + + +## Returns + +**For single text input:** +- `vector`: A pgvector compatible vector containing the embedding + +**For array input:** +- `TABLE(index INT, embedding vector)`: A table with an index and embedding for each input text + +## Related functions + +- [`openai_embed_with_raw_response()`][openai_embed_with_raw_response]: get the full API response including metadata +- [`openai_tokenize()`][openai_tokenize]: convert text to tokens before embedding +- [`openai_list_models()`][openai_list_models]: see available embedding models + +[openai_embed_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_embed_with_raw_response +[openai_tokenize]: /api-reference/pgai/model-calling/openai/openai_tokenize +[openai_list_models]: /api-reference/pgai/model-calling/openai/openai_list_models diff --git a/api-reference/pgai/model-calling/openai/openai_embed_with_raw_response.mdx b/api-reference/pgai/model-calling/openai/openai_embed_with_raw_response.mdx new file mode 100644 index 0000000..1967ba6 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_embed_with_raw_response.mdx @@ -0,0 +1,56 @@ +--- +title: openai_embed_with_raw_response() +description: Generate embeddings and get the complete API response including metadata +keywords: [OpenAI, embeddings, raw response, metadata] +tags: [AI, embeddings, advanced] +license: community +type: function +--- + +import OpenAPIEmbedArguments from '/snippets/api-reference/pgai/_openai_embed_arguments.mdx'; + +Generate embeddings and receive the complete raw API response including all metadata. Use this when you need access to +token usage, model information, or other response details not included in the standard `openai_embed()` function. 
+ +## Samples + +### Get raw embedding response + +Receive the full API response: + +```sql +SELECT ai.openai_embed_with_raw_response( + 'text-embedding-ada-002', + 'PostgreSQL is powerful' +); +``` + +### Extract token usage + +Check how many tokens were used: + +```sql +SELECT + (ai.openai_embed_with_raw_response( + 'text-embedding-ada-002', + 'sample text' + )->>'usage')::jsonb; +``` + +## Arguments + + + +## Returns + +`JSONB`: The complete API response including: +- `object`: Response type +- `data`: Array of embedding objects +- `model`: Model used +- `usage`: Token usage information + +## Related functions + +- [`openai_embed()`][openai_embed]: standard embedding function returning just the vector + +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed diff --git a/api-reference/pgai/model-calling/openai/openai_list_models.mdx b/api-reference/pgai/model-calling/openai/openai_list_models.mdx new file mode 100644 index 0000000..2baff17 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_list_models.mdx @@ -0,0 +1,81 @@ +--- +title: openai_list_models() +description: List all models available from OpenAI +keywords: [OpenAI, models, GPT, list] +tags: [AI, models] +license: community +type: function +--- + +import OpenAPIListModels from '/snippets/api-reference/pgai/_openai_list_models_arguments.mdx'; + +Retrieve a list of all models available from OpenAI, including their creation dates and ownership information. This is +useful for discovering available models and verifying access to specific models. + +## Samples + +### List all models + +Get all available OpenAI models: + +```sql +SELECT * +FROM ai.openai_list_models() +ORDER BY created DESC; +``` + +Returns: + +```text + id | created | owned_by +---------------------------+------------------------+----------------- + gpt-4o | 2024-05-10 13:50:49-05 | system + gpt-4o-mini | 2024-07-17 09:33:20-05 | system + gpt-4-turbo | 2024-04-05 18:57:21-05 | system + text-embedding-ada-002 | 2022-12-16 12:48:49-06 | openai-internal + ... 
+``` + +### Find specific models + +Search for models matching a pattern: + +```sql +SELECT * +FROM ai.openai_list_models() +WHERE id LIKE '%gpt-4o%' +ORDER BY created DESC; +``` + +### Check model availability + +Verify a specific model is available: + +```sql +SELECT EXISTS ( + SELECT 1 + FROM ai.openai_list_models() + WHERE id = 'gpt-4o-mini' +) AS model_available; +``` + +## Arguments + + + +## Returns + +`TABLE(id TEXT, created TIMESTAMPTZ, owned_by TEXT)`: A table with one row per model containing: +- `id`: The model identifier (e.g., `gpt-4o-mini`) +- `created`: When the model was created +- `owned_by`: The organization that owns the model + +## Related functions + +- [`openai_list_models_with_raw_response()`][openai_list_models_with_raw_response]: get the full API response +- [`openai_chat_complete()`][openai_chat_complete]: use models for chat completion +- [`openai_embed()`][openai_embed]: use models for embeddings + +[openai_list_models_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_list_models_with_raw_response +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed diff --git a/api-reference/pgai/model-calling/openai/openai_list_models_with_raw_response.mdx b/api-reference/pgai/model-calling/openai/openai_list_models_with_raw_response.mdx new file mode 100644 index 0000000..d818a14 --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_list_models_with_raw_response.mdx @@ -0,0 +1,27 @@ +--- +title: openai_list_models_with_raw_response() +description: List models and get the raw HTTP response +keywords: [OpenAI, models, raw response, HTTP] +tags: [AI, models, advanced] +license: community +type: function +--- + +import OpenAPIListModels from '/snippets/api-reference/pgai/_openai_list_models_arguments.mdx'; + +List available models and receive the raw HTTP response. Use this when you need access to HTTP headers, status codes, +or other low-level response details. + +## Arguments + + + +## Returns + +`JSONB`: The complete HTTP response including headers and body. + +## Related functions + +- [`openai_list_models()`][openai_list_models]: standard model listing function + +[openai_list_models]: /api-reference/pgai/model-calling/openai/openai_list_models diff --git a/api-reference/pgai/model-calling/openai/openai_moderate.mdx b/api-reference/pgai/model-calling/openai/openai_moderate.mdx new file mode 100644 index 0000000..f69ff3b --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_moderate.mdx @@ -0,0 +1,128 @@ +--- +title: openai_moderate() +description: Check content for policy violations using OpenAI's moderation API +keywords: [OpenAI, moderation, content filtering, safety] +tags: [AI, moderation, safety] +license: community +type: function +--- + +import OpenAPIModerate from '/snippets/api-reference/pgai/_openai_moderate_arguments.mdx'; + +Analyze text content to detect potential policy violations including hate speech, violence, sexual content, self-harm, +and harassment. This uses OpenAI's moderation API to help ensure your application complies with usage policies. 
+ +## Samples + +### Check content for violations + +Analyze text for potentially harmful content: + +```sql +SELECT ai.openai_moderate( + 'text-moderation-latest', + 'I want to hurt someone' +); +``` + +Returns a JSON object with flagged categories and confidence scores: + +```json +{ + "id": "modr-...", + "model": "text-moderation-007", + "results": [{ + "flagged": true, + "categories": { + "violence": true, + "harassment": true, + ... + }, + "category_scores": { + "violence": 0.997, + "harassment": 0.571, + ... + } + }] +} +``` + +### Check if content is flagged + +Get a simple boolean result: + +```sql +SELECT + content, + (ai.openai_moderate('text-moderation-latest', content)-> + 'results'->0->>'flagged')::boolean AS is_flagged +FROM user_comments; +``` + +### Filter by specific categories + +Check for specific types of violations: + +```sql +SELECT + id, + content, + (ai.openai_moderate('text-moderation-latest', content)-> + 'results'->0->'categories'->>'violence')::boolean AS has_violence +FROM posts +WHERE (ai.openai_moderate('text-moderation-latest', content)-> + 'results'->0->'categories'->>'violence')::boolean = true; +``` + +### Moderate user-generated content with triggers + +Automatically flag problematic content: + +```sql +CREATE TABLE comments ( + id SERIAL PRIMARY KEY, + content TEXT, + is_flagged BOOLEAN +); + +CREATE OR REPLACE FUNCTION moderate_comment() +RETURNS TRIGGER AS $$ +BEGIN + NEW.is_flagged := ( + ai.openai_moderate('text-moderation-latest', NEW.content)-> + 'results'->0->>'flagged' + )::boolean; + RETURN NEW; +END; +$$ LANGUAGE plpgsql; + +CREATE TRIGGER moderate_on_insert + BEFORE INSERT ON comments + FOR EACH ROW + EXECUTE FUNCTION moderate_comment(); +``` + +## Arguments + + + +## Returns + +`JSONB`: A JSON object containing moderation results with the following structure: +- `id`: Unique identifier for the moderation request +- `model`: The model used +- `results`: Array of result objects (one per input) + - `flagged`: Boolean indicating if content was flagged + - `categories`: Object with boolean flags for each category + - `hate`, `hate/threatening` + - `harassment`, `harassment/threatening` + - `self-harm`, `self-harm/intent`, `self-harm/instructions` + - `sexual`, `sexual/minors` + - `violence`, `violence/graphic` + - `category_scores`: Object with confidence scores (0-1) for each category + +## Related functions + +- [`openai_moderate_with_raw_response()`][openai_moderate_with_raw_response]: get the full HTTP response + +[openai_moderate_with_raw_response]: /api-reference/pgai/model-calling/openai/openai_moderate_with_raw_response diff --git a/api-reference/pgai/model-calling/openai/openai_moderate_with_raw_response.mdx b/api-reference/pgai/model-calling/openai/openai_moderate_with_raw_response.mdx new file mode 100644 index 0000000..e8f3d1e --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_moderate_with_raw_response.mdx @@ -0,0 +1,27 @@ +--- +title: openai_moderate_with_raw_response() +description: Moderate content and get the raw HTTP response +keywords: [OpenAI, moderation, raw response, HTTP] +tags: [AI, moderation, advanced] +license: community +type: function +--- + +import OpenAPIModerate from '/snippets/api-reference/pgai/_openai_moderate_arguments.mdx'; + +Moderate content and receive the raw HTTP response. Use this when you need access to HTTP headers, status codes, or +other low-level response details. + +## Arguments + + + +## Returns + +`JSONB`: The complete HTTP response including headers and body. 
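+
+For example, you can keep the complete response next to the content you checked to build an audit trail. A minimal sketch, assuming a hypothetical `moderation_audit` table and the same arguments as `openai_moderate()`:
+
+```sql
+-- Hypothetical audit table for storing raw moderation responses
+CREATE TABLE IF NOT EXISTS moderation_audit (
+    checked_at TIMESTAMPTZ DEFAULT now(),
+    content TEXT,
+    raw_response JSONB
+);
+
+INSERT INTO moderation_audit (content, raw_response)
+VALUES (
+    'some text to check',
+    ai.openai_moderate_with_raw_response('text-moderation-latest', 'some text to check')
+);
+```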
+ +## Related functions + +- [`openai_moderate()`][openai_moderate]: standard moderation function + +[openai_moderate]: /api-reference/pgai/model-calling/openai/openai_moderate diff --git a/api-reference/pgai/model-calling/openai/openai_tokenize.mdx b/api-reference/pgai/model-calling/openai/openai_tokenize.mdx new file mode 100644 index 0000000..275cadb --- /dev/null +++ b/api-reference/pgai/model-calling/openai/openai_tokenize.mdx @@ -0,0 +1,87 @@ +--- +title: openai_tokenize() +description: Convert text into tokens for token counting and API cost estimation +keywords: [OpenAI, tokens, tokenize, tiktoken] +tags: [AI, tokens, utilities] +license: community +type: function +--- + +Convert text into an array of token IDs using OpenAI's tokenization algorithm. This is useful for counting tokens to +estimate API costs, stay within model limits, and understand how your text is processed. + +## Samples + +### Tokenize text + +Convert a string into tokens: + +```sql +SELECT ai.openai_tokenize( + 'text-embedding-ada-002', + 'Tiger Data is Postgres made Powerful' +); +``` + +Returns: + +```text + openai_tokenize +---------------------------------------- + {19422,2296,374,3962,18297,1903,75458} +``` + +### Count tokens + +Determine how many tokens a text will use: + +```sql +SELECT array_length( + ai.openai_tokenize( + 'text-embedding-ada-002', + 'Tiger Data is Postgres made Powerful' + ), + 1 +) AS token_count; +``` + +Returns: + +```text + token_count +------------- + 7 +``` + +### Check token count before API call + +Ensure your text fits within model limits: + +```sql +SELECT + content, + array_length(ai.openai_tokenize('gpt-4o-mini', content), 1) AS tokens +FROM documents +WHERE array_length(ai.openai_tokenize('gpt-4o-mini', content), 1) > 8000; +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | The OpenAI model to tokenize for (e.g., `text-embedding-ada-002`, `gpt-4o`) | +| `text_input` | `TEXT` | - | ✔ | The text to convert into tokens | + +## Returns + +`INT[]`: An array of token IDs representing the input text. + +## Related functions + +- [`openai_detokenize()`][openai_detokenize]: convert tokens back into text +- [`openai_embed()`][openai_embed]: generate embeddings from text or tokens +- [`openai_chat_complete()`][openai_chat_complete]: use tokens for completion + +[openai_detokenize]: /api-reference/pgai/model-calling/openai/openai_detokenize +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed +[openai_chat_complete]: /api-reference/pgai/model-calling/openai/openai_chat_complete diff --git a/api-reference/pgai/model-calling/voyageai/index.mdx b/api-reference/pgai/model-calling/voyageai/index.mdx new file mode 100644 index 0000000..573a1cf --- /dev/null +++ b/api-reference/pgai/model-calling/voyageai/index.mdx @@ -0,0 +1,122 @@ +--- +title: Voyage AI functions +sidebarTitle: Overview +description: Generate specialized embeddings optimized for retrieval and semantic search +keywords: [Voyage AI, embeddings, retrieval, semantic search] +tags: [AI, VoyageAI, embeddings, search] +license: community +type: function +--- + +import { PG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Call Voyage AI's API directly from SQL to generate high-quality embeddings optimized for retrieval and semantic search +tasks. + +## What is Voyage AI? + +Voyage AI provides state-of-the-art embedding models specifically optimized for retrieval tasks. 
Their models excel at +capturing semantic relationships and delivering superior performance on retrieval and reranking benchmarks. + +## Key features + +- **Retrieval-optimized**: Embeddings designed specifically for search and retrieval +- **High quality**: State-of-the-art performance on embedding benchmarks +- **Flexible input types**: Support for queries, documents, and general use +- **Cost-effective**: Competitive pricing for production use + +## Prerequisites + +To use Voyage AI functions, you need: + +1. A Voyage AI API key from [dash.voyageai.com](https://dash.voyageai.com) +2. API key configured in your database (see configuration section below) + +## Quick start + +### Generate an embedding + +Create a vector embedding for semantic search: + +```sql +SELECT ai.voyageai_embed( + 'voyage-3', + 'PostgreSQL is a powerful database' +); +``` + +### Specify input type + +Optimize embeddings for your use case: + +```sql +-- For search queries +SELECT ai.voyageai_embed( + 'voyage-3', + 'best database for time-series', + input_type => 'query' +); + +-- For documents +SELECT ai.voyageai_embed( + 'voyage-3', + 'PostgreSQL is a relational database', + input_type => 'document' +); +``` + +### Batch embeddings + +Process multiple texts efficiently: + +```sql +SELECT * FROM ai.voyageai_embed( + 'voyage-3', + ARRAY[ + 'PostgreSQL is a powerful database', + 'TimescaleDB extends PostgreSQL for time-series', + 'pgai brings AI capabilities to PostgreSQL' + ], + input_type => 'document' +); +``` + +## Configuration + +Store your Voyage AI API key securely in the database: + +```sql +-- Store API key as a secret +SELECT ai.create_secret('VOYAGE_API_KEY', 'your-api-key-here'); + +-- Use the secret by name +SELECT ai.voyageai_embed( + 'voyage-3', + 'sample text', + api_key_name => 'VOYAGE_API_KEY' +); +``` + +## Available functions + +### Embeddings + +- [`voyageai_embed()`][voyageai_embed]: generate vector embeddings from text + +## Available models + +Voyage AI offers several specialized embedding models: + +- **voyage-3**: Latest model with best overall performance +- **voyage-3-lite**: Faster and more cost-effective option +- **voyage-code-3**: Optimized for code search and understanding +- **voyage-finance-2**: Specialized for financial documents +- **voyage-law-2**: Specialized for legal documents + +## Resources + +- [Voyage AI documentation](https://docs.voyageai.com) +- [Voyage AI models overview](https://docs.voyageai.com/docs/embeddings) +- [Voyage AI dashboard](https://dash.voyageai.com) + +[voyageai_embed]: /api-reference/pgai/model-calling/voyageai/voyageai_embed diff --git a/api-reference/pgai/model-calling/voyageai/voyageai_embed.mdx b/api-reference/pgai/model-calling/voyageai/voyageai_embed.mdx new file mode 100644 index 0000000..e3041a9 --- /dev/null +++ b/api-reference/pgai/model-calling/voyageai/voyageai_embed.mdx @@ -0,0 +1,127 @@ +--- +title: voyageai_embed() +description: Generate retrieval-optimized embeddings using Voyage AI models +keywords: [Voyage AI, embeddings, retrieval, vectors] +tags: [AI, VoyageAI, embeddings, vectors] +license: community +type: function +--- + +Generate vector embeddings from text using Voyage AI's retrieval-optimized models. Voyage embeddings excel at semantic +search, retrieval augmented generation (RAG), and clustering tasks. 
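+
+A typical retrieval pattern embeds documents with `input_type => 'document'`, embeds the search text with `input_type => 'query'`, and then ranks stored rows by vector distance. A minimal sketch, assuming a `documents` table with a pgvector `embedding` column like the one populated in the samples below:
+
+```sql
+-- Rank stored document embeddings against a query embedding (cosine distance)
+SELECT id, content
+FROM documents
+ORDER BY embedding <=> ai.voyageai_embed(
+    'voyage-3',
+    'How do hypertables work?',
+    input_type => 'query'
+)
+LIMIT 5;
+```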
+ +## Samples + +### Generate a single embedding + +Create a vector embedding: + +```sql +SELECT ai.voyageai_embed( + 'voyage-3', + 'PostgreSQL is a powerful database' +); +``` + +### Specify input type for queries + +Optimize embeddings for search queries: + +```sql +SELECT ai.voyageai_embed( + 'voyage-3', + 'best time-series database', + input_type => 'query' +); +``` + +### Specify input type for documents + +Optimize embeddings for documents: + +```sql +SELECT ai.voyageai_embed( + 'voyage-3', + 'TimescaleDB is an extension for PostgreSQL that adds time-series capabilities.', + input_type => 'document' +); +``` + +### Generate embeddings for multiple texts + +Process multiple texts in one API call: + +```sql +SELECT index, embedding +FROM ai.voyageai_embed( + 'voyage-3', + ARRAY[ + 'PostgreSQL is a powerful database', + 'TimescaleDB extends PostgreSQL', + 'pgai brings AI to PostgreSQL' + ], + input_type => 'document' +); +``` + +### Store embeddings in a table + +Generate and store embeddings for your data: + +```sql +UPDATE documents +SET embedding = ai.voyageai_embed( + 'voyage-3', + content, + input_type => 'document' +) +WHERE embedding IS NULL; +``` + +### Use domain-specific models + +Use specialized models for your domain: + +```sql +-- For code search +SELECT ai.voyageai_embed( + 'voyage-code-3', + 'def calculate_sum(a, b): return a + b', + input_type => 'document' +); + +-- For financial documents +SELECT ai.voyageai_embed( + 'voyage-finance-2', + 'Q3 revenue increased by 15% year-over-year', + input_type => 'document' +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `TEXT` | - | ✔ | Voyage AI model (e.g., `voyage-3`, `voyage-code-3`) | +| `input_text` | `TEXT` | - | ✔ | Single text input to embed (use this OR `input_texts`) | +| `input_texts` | `TEXT[]` | - | ✔ | Array of text inputs to embed in a batch | +| `input_type` | `TEXT` | `NULL` | ✖ | Type of input: `query` for search queries, `document` for documents to search | +| `api_key` | `TEXT` | `NULL` | ✖ | Voyage AI API key. If not provided, uses configured secret | +| `api_key_name` | `TEXT` | `NULL` | ✖ | Name of the secret containing the API key | +| `verbose` | `BOOLEAN` | `FALSE` | ✖ | Enable verbose logging for debugging | + +## Returns + +**For single text input:** +- `vector`: A pgvector compatible vector containing the embedding + +**For array input:** +- `TABLE(index INT, embedding vector)`: A table with an index and embedding for each input text + +## Related functions + +- [`cohere_embed()`][cohere_embed]: alternative with Cohere models +- [`openai_embed()`][openai_embed]: alternative with OpenAI models + +[cohere_embed]: /api-reference/pgai/model-calling/cohere/cohere_embed +[openai_embed]: /api-reference/pgai/model-calling/openai/openai_embed diff --git a/api-reference/pgai/pgai-api-reference.mdx b/api-reference/pgai/pgai-api-reference.mdx deleted file mode 100644 index 85759e5..0000000 --- a/api-reference/pgai/pgai-api-reference.mdx +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: pgai API Reference -description: Complete API reference for pgai functions and AI operations -products: [cloud, mst, self_hosted] -keywords: [API, reference, AI, vector, embeddings] -mode: "wide" ---- - - - - Complete API reference for pgai vectorizer functions and AI embedding operations. 
- - \ No newline at end of file diff --git a/api-reference/pgai/vectorizer-api-reference.mdx b/api-reference/pgai/vectorizer-api-reference.mdx index 0e0ff7c..c955c0c 100644 --- a/api-reference/pgai/vectorizer-api-reference.mdx +++ b/api-reference/pgai/vectorizer-api-reference.mdx @@ -3,6 +3,8 @@ title: Vectorizer API reference description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. --- +import { CLOUD_LONG } from '/snippets/vars.mdx'; + This page provides an API reference for Vectorizer functions. For an overview of Vectorizer and how it works, see the [Vectorizer Guide](/docs/vectorizer/overview.md). @@ -676,7 +678,6 @@ The function takes several parameters to customize the LiteLLM embedding configu | api_key_name | text | - | ✖ | Set the name of the environment variable that contains the API key. This allows for flexible API key management without hardcoding keys in the database. | | extra_options | jsonb | - | ✖ | Set provider-specific configuration options. | -[LiteLLM embedding documentation]: https://docs.litellm.ai/docs/embedding/supported_embedding #### Returns @@ -705,7 +706,6 @@ Note: The [Cohere documentation on input_type] specifies that the `input_type` p By default, LiteLLM sets this to `search_document`. The input type can be provided via `extra_options`, i.e. `extra_options => '{"input_type": "search_document"}'::jsonb`. -[Cohere documentation on input_type]: https://docs.cohere.com/v2/docs/embeddings#the-input_type-parameter #### Mistral @@ -779,7 +779,6 @@ never succeed. ); ``` -[Huggingface inference]: https://huggingface.co/docs/huggingface_hub/en/guides/inference #### AWS Bedrock @@ -792,7 +791,6 @@ The simplest method is to provide the `AWS_ACCESS_KEY_ID`, vectorizer worker. Consult the [boto3 credentials documentation] for more options. -[boto3 credentials documentation]: (https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) ```sql SELECT ai.create_vectorizer( @@ -838,7 +836,6 @@ correspond to the project id, and the path to a file containing credentials for a service account. Consult the [Authentication methods at Google] for more options. -[Authentication methods at Google]: https://cloud.google.com/docs/authentication ```sql SELECT ai.create_vectorizer( @@ -915,7 +912,7 @@ The function takes several parameters to customize the OpenAI embedding configur | model | text | - | ✔ | Specify the name of the OpenAI embedding model to use. For example, `text-embedding-3-small`. | | dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. | | chat_user | text | - | ✖ | The identifier for the user making the API call. This can be useful for tracking API usage or for OpenAI's monitoring purposes. | -| api_key_name | text | `OPENAI_API_KEY` | ✖ | Set [the name of the environment variable that contains the OpenAI API key][openai-use-env-var]. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the OpenAI API key. | +| api_key_name | text | `OPENAI_API_KEY` | ✖ | Set [the name of the environment variable that contains the OpenAI API key][openai-use-env-var]. This allows for flexible API key management without hardcoding keys in the database. On {CLOUD_LONG}, you should set this to the name of the secret that contains the OpenAI API key. 
| | base_url | text | - | ✖ | Set the base_url of the OpenAI API. Note: no default configured here to allow configuration of the vectorizer worker through `OPENAI_BASE_URL` env var. | #### Returns @@ -1001,7 +998,7 @@ The function takes several parameters to customize the Voyage AI embedding confi | model | text | - | ✔ | Specify the name of the [Voyage AI model](https://docs.voyageai.com/docs/embeddings#model-choices) to use. | | dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. | | input_type | text | 'document' | ✖ | Type of the input text, null, 'query', or 'document'. | -| api_key_name | text | `VOYAGE_API_KEY` | ✖ | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the Voyage AI API key. | +| api_key_name | text | `VOYAGE_API_KEY` | ✖ | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On {CLOUD_LONG}, you should set this to the name of the secret that contains the Voyage AI API key. | #### Returns @@ -1127,7 +1124,7 @@ The available functions are: You use `ai.indexing_default` to use the platform-specific default value for indexing. -On Timescale Cloud, the default is `ai.indexing_diskann()`. On self-hosted, the default is `ai.indexing_none()`. +On {CLOUD_LONG}, the default is `ai.indexing_diskann()`. On self-hosted, the default is `ai.indexing_none()`. A timescaledb background job is used for automatic index creation. Since timescaledb may not be installed in a self-hosted environment, we default to `ai.indexing_none()`. @@ -1259,17 +1256,17 @@ considerations. The available functions are: -- [ai.scheduling_default](#aischeduling_default): uses the platform-specific default scheduling configuration. On Timescale Cloud this is equivalent to `ai.scheduling_timescaledb()`. On self-hosted deployments, this is equivalent to `ai.scheduling_none()`. +- [ai.scheduling_default](#aischeduling_default): uses the platform-specific default scheduling configuration. On {CLOUD_LONG} this is equivalent to `ai.scheduling_timescaledb()`. On self-hosted deployments, this is equivalent to `ai.scheduling_none()`. - [ai.scheduling_none](#aischeduling_none): when you want manual control over when the vectorizer runs. Use this when you're using an external scheduling system, as is the case with self-hosted deployments. -- [ai.scheduling_timescaledb](#aischeduling_timescaledb): leverages TimescaleDB's robust job scheduling system, which is designed for reliability and scalability. Use this when you're using Timescale Cloud. +- [ai.scheduling_timescaledb](#aischeduling_timescaledb): leverages TimescaleDB's robust job scheduling system, which is designed for reliability and scalability. Use this when you're using {CLOUD_LONG}. ### ai.scheduling_default You use `ai.scheduling_default` to use the platform-specific default scheduling configuration. -On Timescale Cloud, the default is `ai.scheduling_timescaledb()`. On self-hosted, the default is `ai.scheduling_none()`. -A timescaledb background job is used to periodically trigger a cloud vectorizer on Timescale Cloud. +On {CLOUD_LONG}, the default is `ai.scheduling_timescaledb()`. On self-hosted, the default is `ai.scheduling_none()`. 
+A timescaledb background job is used to periodically trigger a cloud vectorizer on {CLOUD_LONG}. This is not available in a self-hosted environment. #### Example usage @@ -1819,6 +1816,11 @@ SELECT ai.vectorizer_queue_pending(1, exact_count=>true); The number of items in the queue for the specified vectorizer +[LiteLLM embedding documentation]: https://docs.litellm.ai/docs/embedding/supported_embedding +[Cohere documentation on input_type]: https://docs.cohere.com/v2/docs/embeddings#the-input_type-parameter +[Huggingface inference]: https://huggingface.co/docs/huggingface_hub/en/guides/inference +[boto3 credentials documentation]: (https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) +[Authentication methods at Google]: https://cloud.google.com/docs/authentication [timescale-cloud]: https://console.cloud.timescale.com/ [openai-use-env-var]: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety#h_a1ab3ba7b2 [openai-set-key]: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety#h_a1ab3ba7b2 diff --git a/api-reference/pgai/vectorizer/chunking_character_text_splitter.mdx b/api-reference/pgai/vectorizer/chunking_character_text_splitter.mdx new file mode 100644 index 0000000..e860d34 --- /dev/null +++ b/api-reference/pgai/vectorizer/chunking_character_text_splitter.mdx @@ -0,0 +1,84 @@ +--- +title: chunking_character_text_splitter() +description: Split text into chunks based on character count with configurable separators +keywords: [vectorizer, chunking, character, text splitting] +tags: [vectorizer, configuration, chunking] +license: community +type: function +--- + +Split text into chunks based on a specified separator, with control over chunk size and overlap between chunks. + +You use this function to: + +- Split text into chunks based on a specified separator +- Control the chunk size and amount of overlap between chunks +- Simple, predictable chunking strategy + +## Samples + +### Basic character splitting + +Split content into 128-character chunks with 10-character overlap: + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(128, 10) +); +``` + +### Custom separator + +Split on newlines: + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512, 50, E'\n') +); +``` + +### Regex separator + +Split using a regular expression: + +```sql +SELECT ai.create_vectorizer( + 'text_data'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter( + chunk_size => 800, + chunk_overlap => 100, + separator => E'\\n\\n|\\. 
', + is_separator_regex => true + ) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `chunk_size` | `int` | `800` | ✖ | Maximum number of characters in a chunk | +| `chunk_overlap` | `int` | `400` | ✖ | Number of characters to overlap between chunks | +| `separator` | `text` | `E'\n\n'` | ✖ | String or character used to split the text | +| `is_separator_regex` | `bool` | `false` | ✖ | Set to `true` if `separator` is a regular expression | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`chunking_recursive_character_text_splitter()`][chunking_recursive]: more sophisticated recursive splitting +- [`chunking_none()`][chunking_none]: disable chunking + +[chunking_recursive]: /api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter +[chunking_none]: /api-reference/pgai/vectorizer/chunking_none +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/chunking_none.mdx b/api-reference/pgai/vectorizer/chunking_none.mdx new file mode 100644 index 0000000..0368c4b --- /dev/null +++ b/api-reference/pgai/vectorizer/chunking_none.mdx @@ -0,0 +1,62 @@ +--- +title: chunking_none() +description: Disable chunking for one-to-one embedding relationships +keywords: [vectorizer, chunking, none, column destination] +tags: [vectorizer, configuration, chunking] +license: community +type: function +--- + +Disable chunking to maintain a one-to-one relationship between source rows and embeddings. This is required when using +column destination, where embeddings are stored directly in the source table. + +## When to use + +Use `chunking_none()` when: +- Using `destination_column()` (required) +- Source text is already chunked or naturally short (< 512 tokens) +- You need exactly one embedding per source row + +## Samples + +### With column destination (required) + +```sql +SELECT ai.create_vectorizer( + 'products'::regclass, + destination => ai.destination_column('description_embedding'), + loading => ai.loading_column('description'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_none() -- Required for column destination +); +``` + +### With pre-chunked data + +```sql +SELECT ai.create_vectorizer( + 'pre_chunked_text'::regclass, + loading => ai.loading_column('chunk'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_none() +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. 
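+
+## Verify the one-to-one mapping
+
+After the vectorizer has processed its queue, each source row should have exactly one embedding. A minimal
+sanity check, assuming the pre-chunked sample above with the default table destination and its generated
+`pre_chunked_text_embedding` view (adjust the view name if you configured a different destination):
+
+```sql
+-- Both counts should match when chunking is disabled.
+SELECT
+    (SELECT count(*) FROM pre_chunked_text)           AS source_rows,
+    (SELECT count(*) FROM pre_chunked_text_embedding) AS embedding_rows;
+```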
+ +## Related functions + +- [`destination_column()`][destination_column]: requires chunking_none() +- [`chunking_character_text_splitter()`][chunking_character]: split by character count +- [`chunking_recursive_character_text_splitter()`][chunking_recursive]: recursive splitting + +[destination_column]: /api-reference/pgai/vectorizer/destination_column +[chunking_character]: /api-reference/pgai/vectorizer/chunking_character_text_splitter +[chunking_recursive]: /api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter.mdx b/api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter.mdx new file mode 100644 index 0000000..e427838 --- /dev/null +++ b/api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter.mdx @@ -0,0 +1,86 @@ +--- +title: chunking_recursive_character_text_splitter() +description: Recursively split text using multiple separators for better semantic preservation +keywords: [vectorizer, chunking, recursive, semantic] +tags: [vectorizer, configuration, chunking] +license: community +type: function +--- + +Recursively split text into chunks using multiple separators. This provides more fine-grained control over the +chunking process and can better preserve semantic meaning by trying separators in order. + +The function tries each separator in order. If a chunk is still too large after applying a separator, it tries the +next separator in the list. This helps preserve natural text boundaries like paragraphs and sentences. + +You use this function to: + +- Recursively split text using multiple separators +- Preserve more semantic meaning in chunks +- Try separators in order (paragraphs, then sentences, then words) +- Default configuration balances context preservation and chunk size + +## Samples + +### Default recursive splitting + +Use the default separator hierarchy: + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_recursive_character_text_splitter() +); +``` + +### Custom chunk size and overlap + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_recursive_character_text_splitter(256, 20) +); +``` + +### Custom separator hierarchy + +Try newlines first, then spaces: + +```sql +SELECT ai.create_vectorizer( + 'text_data'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_recursive_character_text_splitter( + chunk_size => 512, + chunk_overlap => 50, + separators => array[E'\n\n', E'\n', ' ', ''] + ) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `chunk_size` | `int` | `800` | ✖ | Maximum number of characters per chunk | +| `chunk_overlap` | `int` | `400` | ✖ | Number of characters to overlap between chunks | +| `separators` | `text[]` | `array[E'\n\n', E'\n', '.', '?', '!', ' ', '']` | ✖ | Array of separators to try in order | +| `is_separator_regex` | `bool` | `false` | ✖ | Set to `true` if separators are regular expressions | + +## Returns + +A JSON configuration object for use in 
[`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`chunking_character_text_splitter()`][chunking_character]: simpler single-separator splitting +- [`chunking_none()`][chunking_none]: disable chunking + +[chunking_character]: /api-reference/pgai/vectorizer/chunking_character_text_splitter +[chunking_none]: /api-reference/pgai/vectorizer/chunking_none +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/create_vectorizer.mdx b/api-reference/pgai/vectorizer/create_vectorizer.mdx new file mode 100644 index 0000000..56c0b88 --- /dev/null +++ b/api-reference/pgai/vectorizer/create_vectorizer.mdx @@ -0,0 +1,116 @@ +--- +title: create_vectorizer() +description: Create and configure an automated embedding system for a table +keywords: [vectorizer, create, embeddings, automation] +tags: [vectorizer, embeddings, automation] +license: community +type: function +--- + +Set up and configure an automated system for generating and managing embeddings for a specific table in your database. +This function creates the necessary infrastructure (tables, views, triggers, columns) and configures the embedding +generation process. + +You use this function to: + +- Automate the process of creating embeddings for table data +- Set up necessary infrastructure (tables, views, triggers, columns) +- Configure the embedding generation process +- Integrate with AI providers for embedding creation +- Set up scheduling for background processing + +## Samples + +### Table destination (default) + +Create a separate table to store embeddings with a view that joins with the source table: + +```sql +SELECT ai.create_vectorizer( + 'website.blog'::regclass, + name => 'website_blog_vectorizer', + loading => ai.loading_column('contents'), + embedding => ai.embedding_ollama('nomic-embed-text', 768), + chunking => ai.chunking_character_text_splitter(128, 10), + formatting => ai.formatting_python_template('title: $title published: $published $chunk'), + grant_to => ai.grant_to('bob', 'alice'), + destination => ai.destination_table( + target_schema => 'website', + target_table => 'blog_embeddings_store', + view_name => 'blog_embeddings' + ) +); +``` + +This creates: +1. A vectorizer named 'website_blog_vectorizer' for the `website.blog` table +2. A separate table `website.blog_embeddings_store` to store embeddings +3. A view `website.blog_embeddings` joining source and embeddings +4. Loads the `contents` column +5. Uses Ollama `nomic-embed-text` model to create 768 dimensional embeddings +6. Chunks content into 128-character pieces with 10-character overlap +7. Formats each chunk with title and published date +8. Grants necessary permissions to roles `bob` and `alice` + +### Column destination + +Store embeddings directly in the source table (requires no chunking): + +```sql +SELECT ai.create_vectorizer( + 'website.product_descriptions'::regclass, + name => 'product_descriptions_vectorizer', + loading => ai.loading_column('description'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_none(), -- Required for column destination + grant_to => ai.grant_to('marketing_team'), + destination => ai.destination_column('description_embedding') +); +``` + +This creates: +1. A vectorizer named 'product_descriptions_vectorizer' +2. A column `description_embedding` directly in the source table +3. Loads the `description` column +4. No chunking (required for column destination) +5. 
Uses OpenAI's embedding model to create 768 dimensional embeddings +6. Grants necessary permissions to role `marketing_team` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `source` | `regclass` | - | ✔ | The source table that embeddings are generated for | +| `name` | `text` | Auto-generated | ✖ | Unique name for the vectorizer. Auto-generated based on destination type if not provided. Must follow snake_case pattern `^[a-z][a-z_0-9]*$` | +| `destination` | Destination config | `ai.destination_table()` | ✖ | How embeddings will be stored: `ai.destination_table()` (default) or `ai.destination_column()` | +| `embedding` | Embedding config | - | ✔ | How to embed the data using `ai.embedding_*()` functions | +| `loading` | Loading config | - | ✔ | How to load data from source table using `ai.loading_*()` functions | +| `parsing` | Parsing config | `ai.parsing_auto()` | ✖ | How to parse the data using `ai.parsing_*()` functions | +| `chunking` | Chunking config | `ai.chunking_recursive_character_text_splitter()` | ✖ | How to split text data using `ai.chunking_*()` functions | +| `indexing` | Indexing config | `ai.indexing_default()` | ✖ | How to index embeddings using `ai.indexing_*()` functions | +| `formatting` | Formatting config | `ai.formatting_python_template()` | ✖ | How to format data before embedding | +| `scheduling` | Scheduling config | `ai.scheduling_default()` | ✖ | How often to run the vectorizer using `ai.scheduling_*()` functions | +| `processing` | Processing config | `ai.processing_default()` | ✖ | How to process embeddings | +| `queue_schema` | `name` | - | ✖ | Schema where the work queue table is created | +| `queue_table` | `name` | - | ✖ | Name of the work queue table | +| `grant_to` | Grant config | `ai.grant_to_default()` | ✖ | Which users can use objects created by the vectorizer | +| `enqueue_existing` | `bool` | `true` | ✖ | Whether existing rows should be immediately queued for embedding | +| `if_not_exists` | `bool` | `false` | ✖ | Avoid error if the vectorizer already exists | + +## Returns + +`INT`: The ID of the vectorizer created. You can also reference the vectorizer by its name in management functions. 
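+
+## Check the vectorizer after creation
+
+A minimal sketch of confirming the new vectorizer and its backlog, using the monitoring helpers documented in
+this reference. The returned ID, or the `name` you chose, can be passed to the management functions:
+
+```sql
+-- List vectorizers with their status and pending work.
+SELECT * FROM ai.vectorizer_status;
+
+-- Check how many items are still queued for a specific vectorizer (by ID).
+SELECT ai.vectorizer_queue_pending(1);
+```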
+ +## Related functions + +- [`drop_vectorizer()`][drop_vectorizer]: remove a vectorizer +- [`destination_table()`][destination_table]: store embeddings in separate table +- [`destination_column()`][destination_column]: store embeddings in source table +- [`enable_vectorizer_schedule()`][enable_vectorizer_schedule]: resume automatic processing +- [`disable_vectorizer_schedule()`][disable_vectorizer_schedule]: pause automatic processing + +[drop_vectorizer]: /api-reference/pgai/vectorizer/drop_vectorizer +[destination_table]: /api-reference/pgai/vectorizer/destination_table +[destination_column]: /api-reference/pgai/vectorizer/destination_column +[enable_vectorizer_schedule]: /api-reference/pgai/vectorizer/enable_vectorizer_schedule +[disable_vectorizer_schedule]: /api-reference/pgai/vectorizer/disable_vectorizer_schedule diff --git a/api-reference/pgai/vectorizer/destination_column.mdx b/api-reference/pgai/vectorizer/destination_column.mdx new file mode 100644 index 0000000..fbcc2b1 --- /dev/null +++ b/api-reference/pgai/vectorizer/destination_column.mdx @@ -0,0 +1,81 @@ +--- +title: destination_column() +description: Store embeddings directly in the source table as a new column +keywords: [vectorizer, destination, embeddings, column] +tags: [vectorizer, configuration, embeddings] +license: community +type: function +--- + +Store embeddings directly in the source table as a new column. This approach requires a one-to-one relationship +between source data and embeddings, so chunking must be disabled. Ideal when source text is short or chunking is done +upstream. + +Key features for this function are: + +- Adds vector column directly to source table +- No separate view created +- Requires `chunking_none()` (no chunking) +- Exactly one embedding per row +- Simpler schema with fewer objects + +## Workflow + +1. Application inserts data with NULL in embedding column +2. Vectorizer detects the NULL value +3. Vectorizer generates embedding +4. Vectorizer updates row with embedding value + +## Samples + +### Basic usage + +Store embeddings in a column for pre-chunked data: + +```sql +SELECT ai.create_vectorizer( + 'product_descriptions'::regclass, + destination => ai.destination_column('description_embedding'), + loading => ai.loading_column('description'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_none() -- Required for column destination +); +``` + +### With specific embedding model + +```sql +SELECT ai.create_vectorizer( + 'short_text_data'::regclass, + destination => ai.destination_column('text_embedding'), + loading => ai.loading_column('text'), + embedding => ai.embedding_ollama('nomic-embed-text', 768), + chunking => ai.chunking_none() +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `embedding_column` | `NAME` | - | ✔ | Name of the column to add to the source table for storing embeddings | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. 
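+
+## Use the embedding column
+
+To illustrate the workflow above: rows are inserted with the embedding column left `NULL`, the vectorizer fills
+it in, and you can then search the source table directly. A sketch, assuming the `product_descriptions` sample
+above, a hypothetical `product_id` column, and a query embedding supplied as a parameter of matching dimensions
+(`<=>` is pgvector's cosine distance operator):
+
+```sql
+-- New rows are picked up automatically; leave the embedding column NULL.
+INSERT INTO product_descriptions (product_id, description)
+VALUES (42, 'Lightweight waterproof hiking boots with ankle support');
+
+-- Once processed, search the source table directly.
+SELECT product_id, description
+FROM product_descriptions
+WHERE description_embedding IS NOT NULL
+ORDER BY description_embedding <=> $1
+LIMIT 5;
+```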
+ +## Important notes + +- **Chunking must be disabled**: Use `chunking => ai.chunking_none()` when using column destination +- **One embedding per row**: This approach cannot handle multiple chunks per source row +- **Best for short text**: Ideal when text is already chunked or naturally short (< 512 tokens) + +## Related functions + +- [`destination_table()`][destination_table]: alternative approach with separate embeddings table +- [`chunking_none()`][chunking_none]: required chunking configuration for column destination +- [`create_vectorizer()`][create_vectorizer]: main function using this configuration + +[destination_table]: /api-reference/pgai/vectorizer/destination_table +[chunking_none]: /api-reference/pgai/vectorizer/chunking_none +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/destination_table.mdx b/api-reference/pgai/vectorizer/destination_table.mdx new file mode 100644 index 0000000..28dc973 --- /dev/null +++ b/api-reference/pgai/vectorizer/destination_table.mdx @@ -0,0 +1,78 @@ +--- +title: destination_table() +description: Store embeddings in a separate table with automatic view creation +keywords: [vectorizer, destination, embeddings, table] +tags: [vectorizer, configuration, embeddings] +license: community +type: function +--- + +Store embeddings in a separate table. This is the default behavior, where a new table is created to store embeddings +and a view joins the source table with the embeddings. This approach supports chunking, allowing multiple embeddings +per source row. + +Key features for this function are: + +- New table created to store embeddings +- View automatically joins source and embeddings tables +- Supports chunking (multiple chunks per row) +- Separate storage optimizes for different access patterns + +## Samples + +### Full configuration + +Specify all destination details: + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + destination => ai.destination_table( + target_schema => 'public', + target_table => 'my_table_embeddings_store', + view_schema => 'public', + view_name => 'my_table_embeddings' + ), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### Simpler configuration with defaults + +Use a base name and let defaults apply: + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + destination => ai.destination_table('my_table_embeddings'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +This creates: +- Table: `my_table_embeddings_store` +- View: `my_table_embeddings` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `destination` | `NAME` | - | ✖ | Base name for view and table. View is named ``, table is named `_store` | +| `target_schema` | `NAME` | Source table schema | ✖ | Schema where the embeddings table will be created | +| `target_table` | `NAME` | `_embedding_store` or `_store` | ✖ | Name of the table where embeddings will be stored | +| `view_schema` | `NAME` | Source table schema | ✖ | Schema where the view will be created | +| `view_name` | `NAME` | `_embedding` or `` | ✖ | Name of the view that joins source and embeddings tables | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. 
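+
+## Query the generated view
+
+A sketch of semantic search over the view created by the simpler configuration above, assuming the vectorizer's
+usual `chunk` and `embedding` view columns and a query embedding supplied as a parameter (`<=>` is pgvector's
+cosine distance operator):
+
+```sql
+-- The view joins each chunk and its embedding with the source row's columns.
+SELECT chunk, embedding <=> $1 AS distance
+FROM my_table_embeddings
+ORDER BY distance
+LIMIT 5;
+```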
+ +## Related functions + +- [`destination_column()`][destination_column]: alternative approach storing embeddings in source table +- [`create_vectorizer()`][create_vectorizer]: main function using this configuration + +[destination_column]: /api-reference/pgai/vectorizer/destination_column +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/disable_vectorizer_schedule.mdx b/api-reference/pgai/vectorizer/disable_vectorizer_schedule.mdx new file mode 100644 index 0000000..3134542 --- /dev/null +++ b/api-reference/pgai/vectorizer/disable_vectorizer_schedule.mdx @@ -0,0 +1,48 @@ +--- +title: disable_vectorizer_schedule +description: Deactivate or pause automatic scheduling for a vectorizer +keywords: [pgai, vectorizer, scheduling, disable, management] +tags: [vectorizer, management, scheduling] +license: community +type: function +--- + +Deactivate the scheduled job for a specific vectorizer. + +- Temporarily stop the automatic processing of new or updated data +- Can be called with either a vectorizer name (recommended) or ID +- Disabling a schedule does not delete the vectorizer or its configuration + +## Samples + +### Using vectorizer name + +```sql +SELECT ai.disable_vectorizer_schedule('public_blog_embeddings'); +``` + +### Using vectorizer ID + +```sql +SELECT ai.disable_vectorizer_schedule(1); +``` + +## Arguments + +`ai.disable_vectorizer_schedule` can be called in two ways: + +### With vectorizer name (recommended) + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| name | text | - | ✔ | The name of the vectorizer whose schedule you want to disable | + +### With vectorizer ID + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| vectorizer_id | int | - | ✔ | The identifier of the vectorizer whose schedule you want to disable | + +## Returns + +`ai.disable_vectorizer_schedule` does not return a value. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/drop_vectorizer.mdx b/api-reference/pgai/vectorizer/drop_vectorizer.mdx new file mode 100644 index 0000000..c70d99f --- /dev/null +++ b/api-reference/pgai/vectorizer/drop_vectorizer.mdx @@ -0,0 +1,91 @@ +--- +title: drop_vectorizer() +description: Remove a vectorizer and clean up associated resources +keywords: [vectorizer, drop, remove, cleanup] +tags: [vectorizer, management, cleanup] +license: community +type: function +--- + +Remove a vectorizer that you created previously and clean up the associated resources. This provides a controlled way +to delete a vectorizer when it's no longer needed or when you want to reconfigure it from scratch. 
+ +You use this function to: + +- Remove a specific vectorizer configuration from the system +- Clean up associated database objects and scheduled jobs +- Safely undo the creation of a vectorizer + +## What gets dropped + +By default, `drop_vectorizer` removes: +- Scheduled job associated with the vectorizer (if one exists) +- Trigger from the source table used to queue changes +- Trigger function that backed the source table trigger +- Queue table used to manage updates to be processed +- Vectorizer row from the `ai.vectorizer` table + +By default, `drop_vectorizer` does NOT remove: +- Target table containing the embeddings +- View joining the target and source tables + +This allows you to keep generated embeddings and the view even after dropping the vectorizer, useful when you want to +stop automatic updates but still use existing embeddings. + +## Samples + +### Drop by name (recommended) + +```sql +SELECT ai.drop_vectorizer('public_blog_embeddings'); +``` + +### Drop by ID + +```sql +SELECT ai.drop_vectorizer(1); +``` + +### Drop everything including embeddings + +Drop the vectorizer and also remove the target table and view: + +```sql +SELECT ai.drop_vectorizer('public_blog_embeddings', drop_all => true); +``` + +## Arguments + +You can call `drop_vectorizer` in two ways: + +### By name (recommended) + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `name` | `text` | - | ✔ | The name of the vectorizer to drop | +| `drop_all` | `bool` | `false` | ✖ | Set to `true` to also drop the target table and view | + +### By ID + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `vectorizer_id` | `int` | - | ✔ | The identifier of the vectorizer to drop | +| `drop_all` | `bool` | `false` | ✖ | Set to `true` to also drop the target table and view | + +## Returns + +This function does not return a value, but it performs several cleanup operations. + +## Best practices + +- Before dropping a vectorizer, ensure you will not need the automatic embedding updates it provides +- After dropping a vectorizer, manually clean up the target table and view if they're no longer needed +- Reference vectorizers by name (recommended) rather than ID for better readability + +## Related functions + +- [`create_vectorizer()`][create_vectorizer]: create a new vectorizer +- [`disable_vectorizer_schedule()`][disable_vectorizer_schedule]: temporarily pause without dropping + +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer +[disable_vectorizer_schedule]: /api-reference/pgai/vectorizer/disable_vectorizer_schedule diff --git a/api-reference/pgai/vectorizer/embedding_litellm.mdx b/api-reference/pgai/vectorizer/embedding_litellm.mdx new file mode 100644 index 0000000..0202748 --- /dev/null +++ b/api-reference/pgai/vectorizer/embedding_litellm.mdx @@ -0,0 +1,109 @@ +--- +title: embedding_litellm() +description: Use LiteLLM to access 100+ embedding providers with a unified interface +keywords: [vectorizer, embedding, LiteLLM, multi-provider, configuration] +tags: [vectorizer, configuration, embeddings, LiteLLM] +license: community +type: function +--- + +Use LiteLLM to generate embeddings from models across multiple providers with a unified interface. LiteLLM supports +OpenAI, Azure, AWS Bedrock, Google Vertex AI, Hugging Face, and 100+ other providers. 
+ +You use this function to: + +- Define the embedding model to use (from any supported provider) +- Specify the dimensionality of the embeddings +- Configure optional, provider-specific parameters +- Set the name of the environment variable that holds your API key + +## Samples + +### Hugging Face model + +```sql +SELECT ai.create_vectorizer( + 'code_snippets'::regclass, + loading => ai.loading_column('code'), + embedding => ai.embedding_litellm( + 'huggingface/microsoft/codebert-base', + 768, + api_key_name => 'HUGGINGFACE_API_KEY', + extra_options => '{"wait_for_model": true}'::jsonb + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### Azure OpenAI + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_litellm( + 'azure/my-embedding-deployment', + 1536, + api_key_name => 'AZURE_API_KEY', + extra_options => '{ + "api_base": "https://my-resource.openai.azure.com/", + "api_version": "2023-05-15" + }'::jsonb + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### AWS Bedrock + +```sql +SELECT ai.create_vectorizer( + 'text_data'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_litellm( + 'bedrock/amazon.titan-embed-text-v1', + 1536, + extra_options => '{ + "aws_region_name": "us-east-1" + }'::jsonb + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `text` | - | ✔ | Name of the embedding model with optional provider prefix (e.g., `huggingface/model-name`, `azure/deployment-name`) | +| `dimensions` | `int` | - | ✔ | Number of dimensions for the embedding vectors | +| `api_key_name` | `text` | - | ✖ | Name of the environment variable containing the API key | +| `extra_options` | `jsonb` | - | ✖ | Provider-specific configuration options (API base URL, region, etc.) | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Supported providers + +LiteLLM supports 100+ providers including: +- OpenAI and Azure OpenAI +- AWS Bedrock +- Google Vertex AI +- Hugging Face +- Cohere +- Anthropic +- And many more + +See the [LiteLLM documentation](https://docs.litellm.ai/docs/embedding/supported_embedding) for the complete list. + +## Related functions + +- [`embedding_openai()`][embedding_openai]: direct OpenAI integration +- [`embedding_ollama()`][embedding_ollama]: local Ollama models +- [`embedding_voyageai()`][embedding_voyageai]: Voyage AI models + +[embedding_openai]: /api-reference/pgai/vectorizer/embedding_openai +[embedding_ollama]: /api-reference/pgai/vectorizer/embedding_ollama +[embedding_voyageai]: /api-reference/pgai/vectorizer/embedding_voyageai +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/embedding_ollama.mdx b/api-reference/pgai/vectorizer/embedding_ollama.mdx new file mode 100644 index 0000000..b6c57be --- /dev/null +++ b/api-reference/pgai/vectorizer/embedding_ollama.mdx @@ -0,0 +1,88 @@ +--- +title: embedding_ollama() +description: Use local Ollama models to generate embeddings +keywords: [vectorizer, embedding, Ollama, local, configuration] +tags: [vectorizer, configuration, embeddings, Ollama] +license: community +type: function +--- + +Use a local Ollama model to generate embeddings for your vectorizer. 
Ollama allows you to run open-source models +locally for complete data privacy and control. + +You use this function to: + +- Define which Ollama model to use +- Specify the dimensionality of the embeddings +- Configure how the Ollama API is accessed +- Configure the model's truncation behavior and keep alive settings +- Configure optional, model-specific parameters like temperature + +## Samples + +### Basic Ollama embedding + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_ollama('nomic-embed-text', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With custom Ollama server + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_ollama( + 'nomic-embed-text', + 768, + base_url => 'http://my.ollama.server:11434' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With model options and keep alive + +```sql +SELECT ai.create_vectorizer( + 'text_data'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_ollama( + 'nomic-embed-text', + 768, + options => '{"num_ctx": 1024, "temperature": 0.5}'::jsonb, + keep_alive => '10m' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `text` | - | ✔ | Name of the Ollama model to use (e.g., `nomic-embed-text`). The model must already be pulled on your Ollama server | +| `dimensions` | `int` | - | ✔ | Number of dimensions for the embedding vectors | +| `base_url` | `text` | - | ✖ | Base URL of the Ollama API. If not provided, uses `OLLAMA_HOST` environment variable | +| `options` | `jsonb` | - | ✖ | Additional model parameters such as `temperature` or `num_ctx` | +| `keep_alive` | `text` | - | ✖ | How long the model stays loaded in memory after the request (e.g., `5m`, `1h`) | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`embedding_openai()`][embedding_openai]: use OpenAI models +- [`embedding_litellm()`][embedding_litellm]: use any provider through LiteLLM +- [`embedding_voyageai()`][embedding_voyageai]: use Voyage AI models + +[embedding_openai]: /api-reference/pgai/vectorizer/embedding_openai +[embedding_litellm]: /api-reference/pgai/vectorizer/embedding_litellm +[embedding_voyageai]: /api-reference/pgai/vectorizer/embedding_voyageai +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/embedding_openai.mdx b/api-reference/pgai/vectorizer/embedding_openai.mdx new file mode 100644 index 0000000..7410020 --- /dev/null +++ b/api-reference/pgai/vectorizer/embedding_openai.mdx @@ -0,0 +1,100 @@ +--- +title: embedding_openai() +description: Use OpenAI models to generate embeddings +keywords: [vectorizer, embedding, OpenAI, configuration] +tags: [vectorizer, configuration, embeddings, OpenAI] +license: community +type: function +--- + +Use an OpenAI model to generate embeddings for your vectorizer. 
+ +You use this function to: + +- Define which OpenAI embedding model to use +- Specify the dimensionality of the embeddings +- Configure optional parameters like user identifier for API calls +- Set the name of the environment variable that holds your OpenAI API key + +## Samples + +### Basic OpenAI embedding + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With custom API key name + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai( + 'text-embedding-3-small', + 768, + api_key_name => 'MY_OPENAI_API_KEY' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With user tracking + +```sql +SELECT ai.create_vectorizer( + 'user_content'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_openai( + 'text-embedding-3-small', + 768, + chat_user => 'analytics_team' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With custom base URL + +```sql +SELECT ai.create_vectorizer( + 'data_table'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai( + 'text-embedding-3-small', + 768, + base_url => 'https://custom-openai-endpoint.com/v1' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `text` | - | ✔ | Name of the OpenAI embedding model (e.g., `text-embedding-3-small`) | +| `dimensions` | `int` | - | ✔ | Number of dimensions for the embedding vectors | +| `chat_user` | `text` | - | ✖ | Identifier for the user making the API call (for tracking/monitoring) | +| `api_key_name` | `text` | `OPENAI_API_KEY` | ✖ | Name of the environment variable containing the OpenAI API key | +| `base_url` | `text` | - | ✖ | Custom base URL for the OpenAI API | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`embedding_ollama()`][embedding_ollama]: use local Ollama models +- [`embedding_litellm()`][embedding_litellm]: use any provider through LiteLLM +- [`embedding_voyageai()`][embedding_voyageai]: use Voyage AI models + +[embedding_ollama]: /api-reference/pgai/vectorizer/embedding_ollama +[embedding_litellm]: /api-reference/pgai/vectorizer/embedding_litellm +[embedding_voyageai]: /api-reference/pgai/vectorizer/embedding_voyageai +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/embedding_voyageai.mdx b/api-reference/pgai/vectorizer/embedding_voyageai.mdx new file mode 100644 index 0000000..796579d --- /dev/null +++ b/api-reference/pgai/vectorizer/embedding_voyageai.mdx @@ -0,0 +1,84 @@ +--- +title: embedding_voyageai() +description: Use Voyage AI models for retrieval-optimized embeddings +keywords: [vectorizer, embedding, Voyage AI, retrieval, configuration] +tags: [vectorizer, configuration, embeddings, VoyageAI] +license: community +type: function +--- + +Use a Voyage AI model to generate retrieval-optimized embeddings for your vectorizer. 
+ +You use this function to: + +- Define which Voyage AI model to use +- Specify the dimensionality of the embeddings +- Configure the model's truncation behavior and API key name +- Configure the input type (query vs document) + +## Samples + +### Basic Voyage AI embedding + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_voyageai('voyage-3-lite', 512), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With custom API key name + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_voyageai( + 'voyage-3', + 1024, + api_key_name => 'MY_VOYAGE_API_KEY' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### With input type + +```sql +SELECT ai.create_vectorizer( + 'search_content'::regclass, + loading => ai.loading_column('text'), + embedding => ai.embedding_voyageai( + 'voyage-3', + 1024, + input_type => 'document' + ), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `model` | `text` | - | ✔ | Name of the Voyage AI model to use (e.g., `voyage-3-lite`, `voyage-3`) | +| `dimensions` | `int` | - | ✔ | Number of dimensions for the embedding vectors | +| `input_type` | `text` | `'document'` | ✖ | Type of input text: `null`, `'query'`, or `'document'` | +| `api_key_name` | `text` | `VOYAGE_API_KEY` | ✖ | Name of the environment variable containing the Voyage AI API key | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`embedding_openai()`][embedding_openai]: use OpenAI models +- [`embedding_ollama()`][embedding_ollama]: use local Ollama models +- [`embedding_litellm()`][embedding_litellm]: use any provider through LiteLLM + +[embedding_openai]: /api-reference/pgai/vectorizer/embedding_openai +[embedding_ollama]: /api-reference/pgai/vectorizer/embedding_ollama +[embedding_litellm]: /api-reference/pgai/vectorizer/embedding_litellm +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/enable_vectorizer_schedule.mdx b/api-reference/pgai/vectorizer/enable_vectorizer_schedule.mdx new file mode 100644 index 0000000..f468f35 --- /dev/null +++ b/api-reference/pgai/vectorizer/enable_vectorizer_schedule.mdx @@ -0,0 +1,47 @@ +--- +title: enable_vectorizer_schedule +description: Activate or resume automatic scheduling for a vectorizer +keywords: [pgai, vectorizer, scheduling, enable, management] +tags: [vectorizer, management, scheduling] +license: community +type: function +--- + +Activate or reactivate the scheduled job for a specific vectorizer. 
+ +- Allow the vectorizer to resume automatic processing of new or updated data +- Can be called with either a vectorizer name (recommended) or ID + +## Samples + +### Using vectorizer name + +```sql +SELECT ai.enable_vectorizer_schedule('public_blog_embeddings'); +``` + +### Using vectorizer ID + +```sql +SELECT ai.enable_vectorizer_schedule(1); +``` + +## Arguments + +`ai.enable_vectorizer_schedule` can be called in two ways: + +### With vectorizer name (recommended) + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| name | text | - | ✔ | The name of the vectorizer whose schedule you want to enable | + +### With vectorizer ID + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| vectorizer_id | int | - | ✔ | The identifier of the vectorizer whose schedule you want to enable | + +## Returns + +`ai.enable_vectorizer_schedule` does not return a value. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/formatting_python_template.mdx b/api-reference/pgai/vectorizer/formatting_python_template.mdx new file mode 100644 index 0000000..07442d7 --- /dev/null +++ b/api-reference/pgai/vectorizer/formatting_python_template.mdx @@ -0,0 +1,83 @@ +--- +title: formatting_python_template +description: Configure how data is formatted before embedding using Python template strings +keywords: [pgai, vectorizer, formatting, template, embedding] +tags: [vectorizer, configuration, formatting] +license: community +type: function +--- + +Configure the way data from the source table is formatted before it is sent for embedding. + +`ai.formatting_python_template` provides a flexible way to structure the input for embedding models. This enables you to incorporate relevant metadata and additional text. This can significantly enhance the quality and usefulness of the generated embeddings, especially in scenarios where context from multiple fields is important for understanding or searching the content. + +- Define a template for formatting the data before embedding +- Allow the combination of multiple fields from the source table +- Add consistent context or structure to the text being embedded +- Customize the input for the embedding model to improve relevance and searchability + +Formatting happens after chunking and the special `$chunk` variable contains the chunked text. + +## Samples + +### Default formatting + +The default formatter uses the `$chunk` template, resulting in outputting the chunk text as-is. + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + formatting => ai.formatting_python_template('$chunk'), + -- other parameters... +); +``` + +### Add context from other columns + +Add the title and publication date to each chunk, providing more context for the embedding. + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + formatting => ai.formatting_python_template('Title: $title\nDate: $published\nContent: $chunk'), + -- other parameters... +); +``` + +### Combine multiple fields + +Prepend author and category information to each chunk. + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + formatting => ai.formatting_python_template('Author: $author\nCategory: $category\n$chunk'), + -- other parameters... +); +``` + +### Add consistent structure + +Add start and end markers to each chunk, which could be useful for certain types of embeddings or retrieval tasks. 
+ +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + formatting => ai.formatting_python_template('BEGIN DOCUMENT\n$chunk\nEND DOCUMENT'), + -- other parameters... +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `template` | `TEXT` | `$chunk` | ✔ | A string using [Python template strings](https://docs.python.org/3/library/string.html#template-strings) with $-prefixed variables that defines how the data should be formatted | + +- The $chunk placeholder is required and represents the text chunk that will be embedded +- Other placeholders can be used to reference columns from the source table +- The template allows for adding static text or structuring the input in a specific way + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/grant_to.mdx b/api-reference/pgai/vectorizer/grant_to.mdx new file mode 100644 index 0000000..9c2d882 --- /dev/null +++ b/api-reference/pgai/vectorizer/grant_to.mdx @@ -0,0 +1,30 @@ +--- +title: grant_to +description: Grant permissions to users for vectorizer-created objects +keywords: [pgai, vectorizer, permissions, grant, access] +tags: [vectorizer, configuration, security] +license: community +type: function +--- + +Grant permissions to a comma-separated list of users. + +Includes the users specified in the `ai.grant_to_default` setting. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + grant_to => ai.grant_to('bob', 'alice'), + -- other parameters... +); +``` + +## Arguments + +This function takes a comma-separated list of usernames to grant permissions to. + +## Returns + +An array of name values that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/index.mdx b/api-reference/pgai/vectorizer/index.mdx new file mode 100644 index 0000000..105ad28 --- /dev/null +++ b/api-reference/pgai/vectorizer/index.mdx @@ -0,0 +1,171 @@ +--- +title: Vectorizer API reference +sidebarTitle: Overview +description: Automate embedding generation and synchronization for your PostgreSQL data +keywords: [vectorizer, embeddings, automation, pgai] +tags: [AI, vectorizer, embeddings, automation] +--- + +import { PG } from '/snippets/vars.mdx'; + +A vectorizer provides a powerful and automated way to generate and manage LLM embeddings for your {PG} data, +keeping them synchronized with your source data automatically. + +## What is a vectorizer? 
+ +A vectorizer automates the entire embedding workflow: + +- **Automated embedding generation**: Create embeddings for table data automatically +- **Automatic synchronization**: Triggers keep embeddings in sync with source data +- **Background processing**: Async processing minimizes impact on database operations +- **Scalability**: Batch processing handles large datasets efficiently +- **Highly configurable**: Customize embedding models, chunking, formatting, indexing, and scheduling + +## Key features + +- **Multiple AI providers**: OpenAI, Ollama, Cohere, Voyage AI, and LiteLLM support +- **Efficient storage**: Separate tables with appropriate indexing for similarity searches +- **View creation**: Automatic views join source data with embeddings +- **Access control**: Fine-grained permissions for vectorizer objects +- **Monitoring**: Built-in tools to track queue status and performance + +## Quick start + +### Create a basic vectorizer + +```sql +SELECT ai.create_vectorizer( + 'blog.posts'::regclass, + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### Table destination (separate embeddings table) + +```sql +SELECT ai.create_vectorizer( + 'website.blog'::regclass, + destination => ai.destination_table( + target_table => 'blog_embeddings_store', + view_name => 'blog_embeddings' + ), + loading => ai.loading_column('content'), + embedding => ai.embedding_ollama('nomic-embed-text', 768), + chunking => ai.chunking_character_text_splitter(128, 10) +); +``` + +### Column destination (embedding in source table) + +```sql +SELECT ai.create_vectorizer( + 'products'::regclass, + destination => ai.destination_column('description_embedding'), + loading => ai.loading_column('description'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_none() -- Required for column destination +); +``` + +## Configuration functions + +### Core functions + +- [`create_vectorizer()`][create_vectorizer]: create and configure a new vectorizer +- [`drop_vectorizer()`][drop_vectorizer]: remove a vectorizer and clean up resources + +### Destination configuration + +- [`destination_table()`][destination_table]: store embeddings in a separate table (default) +- [`destination_column()`][destination_column]: store embeddings in the source table + +### Loading configuration + +- [`loading_column()`][loading_column]: load data from a column +- [`loading_uri()`][loading_uri]: load data from a file URI + +### Parsing configuration + +- [`parsing_auto()`][parsing_auto]: auto-detect document format (default) +- [`parsing_none()`][parsing_none]: no parsing for text data +- [`parsing_docling()`][parsing_docling]: parse documents with Docling +- [`parsing_pymupdf()`][parsing_pymupdf]: parse PDFs with PyMuPDF + +### Chunking configuration + +- [`chunking_character_text_splitter()`][chunking_character]: split by character count +- [`chunking_recursive_character_text_splitter()`][chunking_recursive]: recursive splitting (default) + +### Embedding configuration + +- [`embedding_openai()`][embedding_openai]: OpenAI embedding models +- [`embedding_ollama()`][embedding_ollama]: local Ollama models +- [`embedding_litellm()`][embedding_litellm]: unified API for 100+ providers +- [`embedding_voyageai()`][embedding_voyageai]: Voyage AI models + +### Formatting configuration + +- [`formatting_python_template()`][formatting_python_template]: format with Python templates + +### Indexing configuration + +- 
[`indexing_default()`][indexing_default]: default HNSW indexing +- [`indexing_diskann()`][indexing_diskann]: DiskANN indexing +- [`indexing_hnsw()`][indexing_hnsw]: HNSW indexing with options +- [`indexing_none()`][indexing_none]: no automatic indexing + +### Scheduling configuration + +- [`scheduling_default()`][scheduling_default]: run every 5 minutes +- [`scheduling_timescaledb()`][scheduling_timescaledb]: use TimescaleDB job scheduling +- [`scheduling_none()`][scheduling_none]: disable automatic scheduling + +### Processing configuration + +- [`processing_default()`][processing_default]: default processing settings + +### Access control + +- [`grant_to()`][grant_to]: specify user permissions + +## Management functions + +- [`enable_vectorizer_schedule()`][enable_vectorizer_schedule]: resume automatic processing +- [`disable_vectorizer_schedule()`][disable_vectorizer_schedule]: pause automatic processing + +## Monitoring + +- [`vectorizer_status`][vectorizer_status]: view vectorizer status and statistics +- [`vectorizer_queue_pending()`][vectorizer_queue_pending]: check pending work items + +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer +[drop_vectorizer]: /api-reference/pgai/vectorizer/drop_vectorizer +[destination_table]: /api-reference/pgai/vectorizer/destination_table +[destination_column]: /api-reference/pgai/vectorizer/destination_column +[loading_column]: /api-reference/pgai/vectorizer/loading_column +[loading_uri]: /api-reference/pgai/vectorizer/loading_uri +[parsing_auto]: /api-reference/pgai/vectorizer/parsing_auto +[parsing_none]: /api-reference/pgai/vectorizer/parsing_none +[parsing_docling]: /api-reference/pgai/vectorizer/parsing_docling +[parsing_pymupdf]: /api-reference/pgai/vectorizer/parsing_pymupdf +[chunking_character]: /api-reference/pgai/vectorizer/chunking_character_text_splitter +[chunking_recursive]: /api-reference/pgai/vectorizer/chunking_recursive_character_text_splitter +[embedding_openai]: /api-reference/pgai/vectorizer/embedding_openai +[embedding_ollama]: /api-reference/pgai/vectorizer/embedding_ollama +[embedding_litellm]: /api-reference/pgai/vectorizer/embedding_litellm +[embedding_voyageai]: /api-reference/pgai/vectorizer/embedding_voyageai +[formatting_python_template]: /api-reference/pgai/vectorizer/formatting_python_template +[indexing_default]: /api-reference/pgai/vectorizer/indexing_default +[indexing_diskann]: /api-reference/pgai/vectorizer/indexing_diskann +[indexing_hnsw]: /api-reference/pgai/vectorizer/indexing_hnsw +[indexing_none]: /api-reference/pgai/vectorizer/indexing_none +[scheduling_default]: /api-reference/pgai/vectorizer/scheduling_default +[scheduling_timescaledb]: /api-reference/pgai/vectorizer/scheduling_timescaledb +[scheduling_none]: /api-reference/pgai/vectorizer/scheduling_none +[processing_default]: /api-reference/pgai/vectorizer/processing_default +[grant_to]: /api-reference/pgai/vectorizer/grant_to +[enable_vectorizer_schedule]: /api-reference/pgai/vectorizer/enable_vectorizer_schedule +[disable_vectorizer_schedule]: /api-reference/pgai/vectorizer/disable_vectorizer_schedule +[vectorizer_status]: /api-reference/pgai/vectorizer/vectorizer_status +[vectorizer_queue_pending]: /api-reference/pgai/vectorizer/vectorizer_queue_pending diff --git a/api-reference/pgai/vectorizer/indexing_default.mdx b/api-reference/pgai/vectorizer/indexing_default.mdx new file mode 100644 index 0000000..fe7b7fa --- /dev/null +++ b/api-reference/pgai/vectorizer/indexing_default.mdx @@ -0,0 +1,32 @@ +--- +title: indexing_default 
+description: Use platform-specific default indexing configuration for embeddings +keywords: [pgai, vectorizer, indexing, default, configuration] +tags: [vectorizer, configuration, indexing] +license: community +type: function +--- + +import { CLOUD_LONG } from '/snippets/vars.mdx'; + +Use the platform-specific default value for indexing. + +On {CLOUD_LONG}, the default is `ai.indexing_diskann()`. On self-hosted, the default is `ai.indexing_none()`. A TimescaleDB background job is used for automatic index creation. Since TimescaleDB may not be installed in a self-hosted environment, we default to `ai.indexing_none()`. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + indexing => ai.indexing_default(), + -- other parameters... +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/indexing_diskann.mdx b/api-reference/pgai/vectorizer/indexing_diskann.mdx new file mode 100644 index 0000000..bef8ad0 --- /dev/null +++ b/api-reference/pgai/vectorizer/indexing_diskann.mdx @@ -0,0 +1,37 @@ +--- +title: indexing_diskann +description: Configure DiskANN indexing for high-performance vector search on large datasets +keywords: [pgai, vectorizer, indexing, diskann, performance] +tags: [vectorizer, configuration, indexing] +license: community +type: function +--- + +Configure indexing using the DiskANN algorithm, which is designed for high-performance approximate nearest neighbor search on large-scale datasets. This is suitable for very large datasets that need to be stored on disk. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + indexing => ai.indexing_diskann(min_rows => 500000, storage_layout => 'memory_optimized'), + -- other parameters... +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| min_rows | int | 100000 | ✖ | The minimum number of rows before creating the index | +| storage_layout | text | - | ✖ | Set to either `memory_optimized` or `plain` | +| num_neighbors | int | - | ✖ | Advanced [DiskANN](https://github.com/microsoft/DiskANN/tree/main) parameter | +| search_list_size | int | - | ✖ | Advanced [DiskANN](https://github.com/microsoft/DiskANN/tree/main) parameter | +| max_alpha | float8 | - | ✖ | Advanced [DiskANN](https://github.com/microsoft/DiskANN/tree/main) parameter | +| num_dimensions | int | - | ✖ | Advanced [DiskANN](https://github.com/microsoft/DiskANN/tree/main) parameter | +| num_bits_per_dimension | int | - | ✖ | Advanced [DiskANN](https://github.com/microsoft/DiskANN/tree/main) parameter | +| create_when_queue_empty | boolean | true | ✖ | Create the index only after all of the embeddings have been generated | + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. 
\ No newline at end of file diff --git a/api-reference/pgai/vectorizer/indexing_hnsw.mdx b/api-reference/pgai/vectorizer/indexing_hnsw.mdx new file mode 100644 index 0000000..4345276 --- /dev/null +++ b/api-reference/pgai/vectorizer/indexing_hnsw.mdx @@ -0,0 +1,36 @@ +--- +title: indexing_hnsw +description: Configure HNSW indexing for fast and accurate vector search +keywords: [pgai, vectorizer, indexing, hnsw, performance] +tags: [vectorizer, configuration, indexing] +license: community +type: function +--- + +Configure indexing using the [Hierarchical Navigable Small World (HNSW) algorithm](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world), which is known for fast and accurate approximate nearest neighbor search. + +HNSW is suitable for in-memory datasets and scenarios where query speed is crucial. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + indexing => ai.indexing_hnsw(min_rows => 50000, opclass => 'vector_l1_ops'), + -- other parameters... +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| min_rows | int | 100000 | ✖ | The minimum number of rows before creating the index | +| opclass | text | `vector_cosine_ops` | ✖ | The operator class for the index. Possible values are: `vector_cosine_ops`, `vector_l1_ops`, or `vector_ip_ops` | +| m | int | - | ✖ | Advanced [HNSW parameters](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) | +| ef_construction | int | - | ✖ | Advanced [HNSW parameters](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) | +| create_when_queue_empty | boolean | true | ✖ | Create the index only after all of the embeddings have been generated | + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/indexing_none.mdx b/api-reference/pgai/vectorizer/indexing_none.mdx new file mode 100644 index 0000000..7c1b9d0 --- /dev/null +++ b/api-reference/pgai/vectorizer/indexing_none.mdx @@ -0,0 +1,30 @@ +--- +title: indexing_none +description: Disable automatic indexing for embeddings +keywords: [pgai, vectorizer, indexing, none, manual] +tags: [vectorizer, configuration, indexing] +license: community +type: function +--- + +Specify that no special indexing should be used for the embeddings. + +This is useful when you don't need fast similarity searches or when you're dealing with a small amount of data. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + indexing => ai.indexing_none(), + -- other parameters... +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/loading_column.mdx b/api-reference/pgai/vectorizer/loading_column.mdx new file mode 100644 index 0000000..a6ebcc3 --- /dev/null +++ b/api-reference/pgai/vectorizer/loading_column.mdx @@ -0,0 +1,60 @@ +--- +title: loading_column() +description: Load data to embed directly from a column in the source table +keywords: [vectorizer, loading, column, data source] +tags: [vectorizer, configuration, loading] +license: community +type: function +--- + +Load data to embed directly from a column in the source table. This is the most common loading method for embedding +textual content stored in your database. 
+ +## Samples + +### Load from a text column + +```sql +SELECT ai.create_vectorizer( + 'blog_posts'::regclass, + loading => ai.loading_column('content'), + embedding => ai.embedding_openai('text-embedding-3-small', 768), + chunking => ai.chunking_character_text_splitter(512) +); +``` + +### Load from multiple tables + +```sql +-- For products table +SELECT ai.create_vectorizer( + 'products'::regclass, + loading => ai.loading_column('description'), + embedding => ai.embedding_openai('text-embedding-3-small', 768) +); + +-- For reviews table +SELECT ai.create_vectorizer( + 'reviews'::regclass, + loading => ai.loading_column('review_text'), + embedding => ai.embedding_openai('text-embedding-3-small', 768) +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `column_name` | `TEXT` | - | ✔ | Name of the column containing the data to load | + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`loading_uri()`][loading_uri]: load data from files referenced by URIs +- [`create_vectorizer()`][create_vectorizer]: main function using this configuration + +[loading_uri]: /api-reference/pgai/vectorizer/loading_uri +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/loading_uri.mdx b/api-reference/pgai/vectorizer/loading_uri.mdx new file mode 100644 index 0000000..b1326d5 --- /dev/null +++ b/api-reference/pgai/vectorizer/loading_uri.mdx @@ -0,0 +1,52 @@ +--- +title: loading_uri +description: Load data to embed from a file referenced in a source table column +keywords: [pgai, vectorizer, loading, uri, s3, gcs, azure] +tags: [vectorizer, configuration] +license: community +type: function +--- + +Load the data to embed from a file that is referenced in a column of the source table. + +This file path is internally passed to [smart_open](https://github.com/piskvorky/smart_open), so it supports any protocol that smart_open supports, including: + +- Local files +- Amazon S3 +- Google Cloud Storage +- Azure Blob Storage +- HTTP/HTTPS +- SFTP +- and [many more](https://github.com/piskvorky/smart_open/blob/master/help.txt) + +## Environment configuration + +Ensure the vectorizer worker has the correct credentials to access the file, such as in environment variables. Here is an example for AWS S3: + +```bash +export AWS_ACCESS_KEY_ID='your_access_key' +export AWS_SECRET_ACCESS_KEY='your_secret_key' +export AWS_REGION='your_region' # optional +``` + +Make sure these environment variables are properly set in the environment where the vectorizer worker runs. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + loading => ai.loading_uri('file_uri_column_name'), + -- other parameters... +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `column_name` | `TEXT` | - | ✔ | The name of the column containing the file path | + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. 
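+
+For example, a fuller configuration that loads documents referenced by S3 URIs stored in a `file_uri` column might look like the following. This is a sketch: the table and column names are illustrative, and the parsing, chunking, and embedding settings are documented on their own reference pages.
+
+```sql
+SELECT ai.create_vectorizer(
+    'documents'::regclass,
+    loading => ai.loading_uri('file_uri'),
+    parsing => ai.parsing_auto(),
+    chunking => ai.chunking_character_text_splitter(512),
+    embedding => ai.embedding_openai('text-embedding-3-small', 768)
+);
+```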
\ No newline at end of file diff --git a/api-reference/pgai/vectorizer/parsing_auto.mdx b/api-reference/pgai/vectorizer/parsing_auto.mdx new file mode 100644 index 0000000..9f0ca91 --- /dev/null +++ b/api-reference/pgai/vectorizer/parsing_auto.mdx @@ -0,0 +1,50 @@ +--- +title: parsing_auto() +description: Automatically select an appropriate parser based on detected file types +keywords: [vectorizer, parsing, automatic, documents] +tags: [vectorizer, configuration, parsing] +license: community +type: function +--- + +Automatically select an appropriate parser based on detected file types. Documents with unrecognizable formats won't +be processed and will generate an error in the `ai.vectorizer_errors` table. + +The parser selection examines file extensions and content types: +- **PDF files, images, Office documents** (DOCX, XLSX, etc.): Uses Docling +- **EPUB and MOBI** (e-book formats): Uses PyMuPDF +- **Text formats** (TXT, MD, etc.): No parser used (content read directly) + +## Samples + +### Use automatic parser selection + +```sql +SELECT ai.create_vectorizer( + 'documents'::regclass, + loading => ai.loading_uri('file_path'), + parsing => ai.parsing_auto(), + embedding => ai.embedding_openai('text-embedding-3-small', 768) +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`parsing_none()`][parsing_none]: skip parsing for textual data +- [`parsing_docling()`][parsing_docling]: explicitly use Docling parser +- [`parsing_pymupdf()`][parsing_pymupdf]: explicitly use PyMuPDF parser +- [`loading_uri()`][loading_uri]: load data from file URIs + +[parsing_none]: /api-reference/pgai/vectorizer/parsing_none +[parsing_docling]: /api-reference/pgai/vectorizer/parsing_docling +[parsing_pymupdf]: /api-reference/pgai/vectorizer/parsing_pymupdf +[loading_uri]: /api-reference/pgai/vectorizer/loading_uri +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/parsing_docling.mdx b/api-reference/pgai/vectorizer/parsing_docling.mdx new file mode 100644 index 0000000..48d6e75 --- /dev/null +++ b/api-reference/pgai/vectorizer/parsing_docling.mdx @@ -0,0 +1,37 @@ +--- +title: parsing_docling +description: Parse documents using Docling for robust text extraction from complex documents +keywords: [pgai, vectorizer, parsing, docling, ocr, pdf] +tags: [vectorizer, configuration, parsing] +license: community +type: function +--- + +Parse the data provided by the loader using [Docling](https://docling-project.github.io/docling/). + +Docling is a more robust and thorough document parsing library that: +- Uses OCR capabilities to extract text from images +- Can parse complex documents with tables and multi-column layouts +- Supports Office formats (DOCX, XLSX, etc.) +- Preserves document structure better than other parsers +- Converts documents to markdown format + +Note that docling uses ML models for improved parsing, which makes it slower than simpler parsers like pymupdf. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + parsing => ai.parsing_docling(), + -- other parameters... +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. 
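+
+For example, to extract text from scanned PDFs referenced in a `file_uri` column, combine Docling with `ai.loading_uri` (a sketch; the table and column names are illustrative):
+
+```sql
+SELECT ai.create_vectorizer(
+    'documents'::regclass,
+    loading => ai.loading_uri('file_uri'),
+    parsing => ai.parsing_docling(),
+    -- other parameters...
+);
+```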
\ No newline at end of file diff --git a/api-reference/pgai/vectorizer/parsing_none.mdx b/api-reference/pgai/vectorizer/parsing_none.mdx new file mode 100644 index 0000000..82bf29c --- /dev/null +++ b/api-reference/pgai/vectorizer/parsing_none.mdx @@ -0,0 +1,40 @@ +--- +title: parsing_none() +description: Skip the parsing step for textual data +keywords: [vectorizer, parsing, text, none] +tags: [vectorizer, configuration, parsing] +license: community +type: function +--- + +Skip the parsing step. Only appropriate for textual data that doesn't require parsing. + +## Samples + +### Skip parsing for text data + +```sql +SELECT ai.create_vectorizer( + 'text_content'::regclass, + loading => ai.loading_column('content'), + parsing => ai.parsing_none(), + embedding => ai.embedding_openai('text-embedding-3-small', 768) +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object for use in [`create_vectorizer()`][create_vectorizer]. + +## Related functions + +- [`parsing_auto()`][parsing_auto]: automatically select parser based on file type +- [`loading_column()`][loading_column]: load text data from a column + +[parsing_auto]: /api-reference/pgai/vectorizer/parsing_auto +[loading_column]: /api-reference/pgai/vectorizer/loading_column +[create_vectorizer]: /api-reference/pgai/vectorizer/create_vectorizer diff --git a/api-reference/pgai/vectorizer/parsing_pymupdf.mdx b/api-reference/pgai/vectorizer/parsing_pymupdf.mdx new file mode 100644 index 0000000..b885560 --- /dev/null +++ b/api-reference/pgai/vectorizer/parsing_pymupdf.mdx @@ -0,0 +1,36 @@ +--- +title: parsing_pymupdf +description: Parse documents using PyMuPDF for fast processing of PDFs and e-books +keywords: [pgai, vectorizer, parsing, pymupdf, pdf, epub] +tags: [vectorizer, configuration, parsing] +license: community +type: function +--- + +Parse the data provided by the loader using [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/). + +PyMuPDF is a faster, simpler document parser that: +- Processes PDF documents with basic structure preservation +- Supports e-book formats like EPUB and MOBI +- Is generally faster than docling for simpler documents +- Works well for documents with straightforward layouts + +Choose pymupdf when processing speed is more important than perfect structure preservation. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + parsing => ai.parsing_pymupdf(), + -- other parameters... +); +``` + +## Arguments + +This function takes no arguments. + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/processing_default.mdx b/api-reference/pgai/vectorizer/processing_default.mdx new file mode 100644 index 0000000..159b114 --- /dev/null +++ b/api-reference/pgai/vectorizer/processing_default.mdx @@ -0,0 +1,33 @@ +--- +title: processing_default +description: Configure batch size and concurrency for vectorizer processing +keywords: [pgai, vectorizer, processing, batch, concurrency] +tags: [vectorizer, configuration, processing] +license: community +type: function +--- + +Specify the concurrency and batch size for the vectorizer. + +These are advanced options and most users should use the default. + +## Samples + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + processing => ai.processing_default(batch_size => 200, concurrency => 5), + -- other parameters... 
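+    -- batch_size and concurrency are optional; omit them to let the vectorizer choose its defaults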
+);
+```
+
+## Arguments
+
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| batch_size | int | Determined by the vectorizer | ✖ | The number of items to process in each batch. The optimal batch size depends on your data and cloud function configuration. Larger batch sizes can improve efficiency but may increase memory usage. The default is 1 for vectorizers that use document loading (`ai.loading_uri`) and 50 otherwise |
+| concurrency | int | Determined by the vectorizer | ✖ | The number of concurrent processing tasks to run. The optimal concurrency depends on your cloud infrastructure and rate limits. Higher concurrency can speed up processing but may increase costs and resource usage |
+
+## Returns
+
+A JSON configuration object that you can use in `ai.create_vectorizer`.
\ No newline at end of file
diff --git a/api-reference/pgai/vectorizer/scheduling_default.mdx b/api-reference/pgai/vectorizer/scheduling_default.mdx
new file mode 100644
index 0000000..2b5ec37
--- /dev/null
+++ b/api-reference/pgai/vectorizer/scheduling_default.mdx
@@ -0,0 +1,32 @@
+---
+title: scheduling_default
+description: Use platform-specific default scheduling configuration for vectorizers
+keywords: [pgai, vectorizer, scheduling, default, configuration]
+tags: [vectorizer, configuration, scheduling]
+license: community
+type: function
+---
+
+import { CLOUD_LONG } from '/snippets/vars.mdx';
+
+Use the platform-specific default scheduling configuration.
+
+On {CLOUD_LONG}, the default is `ai.scheduling_timescaledb()`. On self-hosted, the default is `ai.scheduling_none()`. On {CLOUD_LONG}, a TimescaleDB background job periodically triggers the cloud vectorizer; this job is not available in a self-hosted environment.
+
+## Samples
+
+```sql
+SELECT ai.create_vectorizer(
+    'my_table'::regclass,
+    scheduling => ai.scheduling_default(),
+    -- other parameters...
+);
+```
+
+## Arguments
+
+This function takes no arguments.
+
+## Returns
+
+A JSON configuration object that you can use in `ai.create_vectorizer`.
\ No newline at end of file
diff --git a/api-reference/pgai/vectorizer/scheduling_none.mdx b/api-reference/pgai/vectorizer/scheduling_none.mdx
new file mode 100644
index 0000000..379870a
--- /dev/null
+++ b/api-reference/pgai/vectorizer/scheduling_none.mdx
@@ -0,0 +1,31 @@
+---
+title: scheduling_none
+description: Disable automatic scheduling for manual vectorizer control
+keywords: [pgai, vectorizer, scheduling, none, manual]
+tags: [vectorizer, configuration, scheduling]
+license: community
+type: function
+---
+
+Specify that no automatic scheduling should be set up for the vectorizer.
+
+- Use this when you want to control manually when the vectorizer runs, or when an external scheduling system triggers it
+- Use this for self-hosted deployments
+
+## Samples
+
+```sql
+SELECT ai.create_vectorizer(
+    'my_table'::regclass,
+    scheduling => ai.scheduling_none(),
+    -- other parameters...
+);
+```
+
+## Arguments
+
+This function takes no arguments.
+
+## Returns
+
+A JSON configuration object that you can use in `ai.create_vectorizer`.
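+
+With scheduling disabled, you decide when the queue is processed. Before triggering a run from your own scheduler, you can check how much work is waiting, for example with the `ai.vectorizer_status` view described later in this reference:
+
+```sql
+SELECT id, source_table, pending_items FROM ai.vectorizer_status;
+```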
\ No newline at end of file diff --git a/api-reference/pgai/vectorizer/scheduling_timescaledb.mdx b/api-reference/pgai/vectorizer/scheduling_timescaledb.mdx new file mode 100644 index 0000000..d6381bf --- /dev/null +++ b/api-reference/pgai/vectorizer/scheduling_timescaledb.mdx @@ -0,0 +1,80 @@ +--- +title: scheduling_timescaledb +description: Configure automated scheduling using TimescaleDB job scheduling +keywords: [pgai, vectorizer, scheduling, timescaledb, automation] +tags: [vectorizer, configuration, scheduling] +license: community +type: function +--- + +Configure automated scheduling using TimescaleDB's job scheduling system. + +- Allow periodic execution of the vectorizer to process new or updated data +- Provide fine-grained control over when and how often the vectorizer runs + +## Samples + +### Basic usage + +Run every 5 minutes (default). + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + scheduling => ai.scheduling_timescaledb(), + -- other parameters... +); +``` + +### Custom interval + +Run every hour. + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + scheduling => ai.scheduling_timescaledb(interval '1 hour'), + -- other parameters... +); +``` + +### Specific start time and timezone + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + scheduling => ai.scheduling_timescaledb( + interval '30 minutes', + initial_start => '2024-01-01 00:00:00'::timestamptz, + timezone => 'America/New_York' + ), + -- other parameters... +); +``` + +### Fixed schedule + +```sql +SELECT ai.create_vectorizer( + 'my_table'::regclass, + scheduling => ai.scheduling_timescaledb( + interval '1 day', + fixed_schedule => true, + timezone => 'UTC' + ), + -- other parameters... +); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| schedule_interval | interval | '10m' | ✔ | Set how frequently the vectorizer checks for new or updated data to process | +| initial_start | timestamptz | - | ✖ | Delay the start of scheduling. This is useful for coordinating with other system processes or maintenance windows | +| fixed_schedule | bool | - | ✖ | Set to `true` to use a fixed schedule such as every day at midnight. Set to `false` for a sliding window such as every 24 hours from the last run | +| timezone | text | - | ✖ | Set the timezone this schedule operates in. This ensures that schedules are interpreted correctly, especially important for fixed schedules or when coordinating with business hours | + +## Returns + +A JSON configuration object that you can use in `ai.create_vectorizer`. \ No newline at end of file diff --git a/api-reference/pgai/vectorizer/vectorizer_queue_pending.mdx b/api-reference/pgai/vectorizer/vectorizer_queue_pending.mdx new file mode 100644 index 0000000..fc69af5 --- /dev/null +++ b/api-reference/pgai/vectorizer/vectorizer_queue_pending.mdx @@ -0,0 +1,63 @@ +--- +title: vectorizer_queue_pending +description: Retrieve the number of pending items for a specific vectorizer +keywords: [pgai, vectorizer, monitoring, queue, pending] +tags: [vectorizer, monitoring, function] +license: community +type: function +--- + +Retrieve the number of items in a vectorizer queue when you need to focus on a particular vectorizer or troubleshoot issues. 
+
+- Retrieve the number of pending items for a specific vectorizer
+- Allow for more granular monitoring of individual vectorizer queues
+
+## Samples
+
+### Using vectorizer name
+
+```sql
+SELECT ai.vectorizer_queue_pending('public_blog_embeddings');
+```
+
+### Using vectorizer ID
+
+```sql
+SELECT ai.vectorizer_queue_pending(1);
+```
+
+### Get exact count
+
+A queue with a very large number of items can be slow to count, so the optional `exact_count` parameter defaults to false and the count is capped. With the default, the function returns an exact count when the queue holds 10,000 items or fewer, and returns 9223372036854775807 (the maximum bigint value) when the queue holds more than 10,000 items.
+
+To get an exact count, regardless of queue size:
+
+```sql
+-- Using name (recommended)
+SELECT ai.vectorizer_queue_pending('public_blog_embeddings', exact_count=>true);
+
+-- Using ID
+SELECT ai.vectorizer_queue_pending(1, exact_count=>true);
+```
+
+## Arguments
+
+`ai.vectorizer_queue_pending` can be called in two ways:
+
+### With vectorizer name (recommended)
+
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| name | text | - | ✔ | The name of the vectorizer you want to check |
+| exact_count | bool | false | ✖ | If true, return the exact count. If false, the count is capped at 10,000 |
+
+### With vectorizer ID
+
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| vectorizer_id | int | - | ✔ | The identifier of the vectorizer you want to check |
+| exact_count | bool | false | ✖ | If true, return the exact count. If false, the count is capped at 10,000 |
+
+## Returns
+
+The number of items in the queue for the specified vectorizer.
\ No newline at end of file
diff --git a/api-reference/pgai/vectorizer/vectorizer_status.mdx b/api-reference/pgai/vectorizer/vectorizer_status.mdx
new file mode 100644
index 0000000..24b3a0e
--- /dev/null
+++ b/api-reference/pgai/vectorizer/vectorizer_status.mdx
@@ -0,0 +1,59 @@
+---
+title: vectorizer_status
+description: View to monitor vectorizer state and pending items
+keywords: [pgai, vectorizer, monitoring, status, health]
+tags: [vectorizer, monitoring, view]
+license: community
+type: view
+---
+
+Get a high-level overview of all vectorizers in the system.
+
+- Regularly monitor and check the health of the entire system
+- Display key information about each vectorizer's configuration and current state
+- Use the `pending_items` column to get a quick indication of processing backlogs
+
+## Samples
+
+### Retrieve all vectorizers with pending items
+
+```sql
+SELECT * FROM ai.vectorizer_status WHERE pending_items > 0;
+```
+
+### System health monitoring
+
+Alert if any vectorizer has more than 1000 pending items.
+ +```sql +SELECT id, source_table, pending_items +FROM ai.vectorizer_status +WHERE pending_items > 1000; +``` + +### Get overview of all vectorizers + +```sql +SELECT * FROM ai.vectorizer_status; +``` + +Sample output: + +| id | source_table | target_table | view | pending_items | +|----|--------------|--------------|------|---------------| +| 1 | public.blog | public.blog_contents_embedding_store | public.blog_contents_embeddings | 1 | + +## Returns + +| Column name | Description | +|-------------|-------------| +| id | The unique identifier of this vectorizer | +| source_table | The fully qualified name of the source table | +| target_table | The fully qualified name of the table storing the embeddings | +| view | The fully qualified name of the view joining source and target tables | +| pending_items | The number of items waiting to be processed by the vectorizer | + +The `pending_items` column indicates the number of items still awaiting embedding creation. The pending items count helps you to: +- Identify bottlenecks in processing +- Determine if you need to adjust scheduling or processing configurations +- Monitor the impact of large data imports or updates on your vectorizers \ No newline at end of file diff --git a/api-reference/pgvectorscale/create_index.mdx b/api-reference/pgvectorscale/create_index.mdx new file mode 100644 index 0000000..fcd7b00 --- /dev/null +++ b/api-reference/pgvectorscale/create_index.mdx @@ -0,0 +1,96 @@ +--- +title: CREATE INDEX with diskann +description: Create a StreamingDiskANN index for high-performance vector search +keywords: [pgvectorscale, index, diskann, vector search, create index] +tags: [indexes, vector, performance] +license: community +type: function +--- + +import { PG } from '/snippets/vars.mdx'; + +Create a StreamingDiskANN index on a vector column for high-performance similarity search with optional label-based filtering. + +- Create indexes for fast approximate nearest neighbor search +- Configure index build-time parameters for optimal performance +- Enable label-based filtering for precise vector search +- Choose storage layouts for memory optimization or plain storage + +Note that: + +- The index build process can be memory-intensive. 
Consider increasing `maintenance_work_mem`: + ```sql + SET maintenance_work_mem = '2GB'; + ``` + +- Label values must be within {PG} `smallint` range (-32768 to 32767) + +- Creating indexes on UNLOGGED tables is not currently supported + +- Null vectors are not indexed; null labels are treated as empty arrays + +## Samples + +### Basic index creation + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops); +``` + +### With custom parameters + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH (num_neighbors = 50, search_list_size = 100); +``` + +### With label-based filtering + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops, labels); +``` + +### With storage layout + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH (storage_layout = 'memory_optimized'); +``` + +## Syntax + +```sql +CREATE INDEX index_name ON table_name +USING diskann (embedding_column distance_ops [, labels_column]) +[WITH (parameter = value, ...)]; +``` + +## Distance operators + +| Operator | Description | Use with | +|----------|-------------|----------| +| `vector_cosine_ops` | Cosine distance (`<=>`) | Normalized embeddings | +| `vector_l2_ops` | L2 distance (`<->`) | Euclidean distance | +| `vector_ip_ops` | Inner product (`<#>`) | Dot product (not compatible with plain storage) | + +## Build-time parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `storage_layout` | `text` | `memory_optimized` | `memory_optimized` uses Statistical Binary Quantization; `plain` stores uncompressed data | +| `num_neighbors` | `int` | 50 | Maximum number of neighbors per node. Higher values increase accuracy but slow graph traversal | +| `search_list_size` | `int` | 100 | Search parameter during construction. Higher values improve graph quality but slow index builds | +| `max_alpha` | `float` | 1.2 | Alpha parameter in the algorithm. Higher values improve graph quality but slow index builds | +| `num_dimensions` | `int` | 0 (all dimensions) | Number of dimensions to index. Useful for Matryoshka embeddings | +| `num_bits_per_dimension` | `int` | 2 (if less than 900 dims), 1 otherwise | Bits per dimension when using Statistical Binary Quantization | + +For detailed information about each parameter and tuning recommendations, see [Index build-time parameters][index_parameters]. + +To configure query behavior at runtime, see [Query-time parameters][query_parameters]. 
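+
+Once the index is built, the planner can use it automatically for ordered similarity queries with the matching operator. For example, a cosine-distance search over the `document_embedding` table used above (the vector literal is a placeholder):
+
+```sql
+SELECT *
+FROM document_embedding
+ORDER BY embedding <=> '[...]'
+LIMIT 10;
+```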
+ +[index_parameters]: /api-reference/pgvectorscale/index_parameters +[query_parameters]: /api-reference/pgvectorscale/query_parameters diff --git a/api-reference/pgvectorscale/index.mdx b/api-reference/pgvectorscale/index.mdx new file mode 100644 index 0000000..fb85bc7 --- /dev/null +++ b/api-reference/pgvectorscale/index.mdx @@ -0,0 +1,34 @@ +--- +title: pgvectorscale API reference +sidebarTitle: Overview +description: Complete API reference for pgvectorscale StreamingDiskANN indexes and configuration +products: [cloud, mst, self_hosted] +keywords: [API, reference, vector, pgvectorscale, diskann, indexing] +mode: "wide" +--- + + + + Create StreamingDiskANN indexes for high-performance vector search + + + + Configure index behavior during creation for optimal performance + + + + Tune accuracy and performance dynamically at query time + + diff --git a/api-reference/pgvectorscale/index_parameters.mdx b/api-reference/pgvectorscale/index_parameters.mdx new file mode 100644 index 0000000..67c7c8c --- /dev/null +++ b/api-reference/pgvectorscale/index_parameters.mdx @@ -0,0 +1,139 @@ +--- +title: Index build-time parameters +description: Configure StreamingDiskANN index build-time behavior for optimal performance +keywords: [pgvectorscale, parameters, index, build, configuration] +tags: [indexes, configuration, performance] +license: community +type: configuration +--- + +import { COMPANY } from '/snippets/vars.mdx'; + +Configure StreamingDiskANN index behavior at build time to optimize for your specific workload, accuracy requirements, and resource constraints. + +- Control index build performance and memory usage +- Tune accuracy vs performance trade-offs +- Configure storage layout and compression +- Enable Matryoshka embedding support + +## Samples + +### Memory-optimized storage with custom neighbors + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH ( + storage_layout = 'memory_optimized', + num_neighbors = 75 +); +``` + +### Plain storage for maximum accuracy + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH ( + storage_layout = 'plain', + num_neighbors = 100, + search_list_size = 200 +); +``` + +### Matryoshka embeddings + +```sql +-- Index only first 256 dimensions of a 1536-dimension embedding +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH (num_dimensions = 256); +``` + +### High-accuracy build configuration + +```sql +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH ( + num_neighbors = 100, + search_list_size = 200, + max_alpha = 1.5 +); +``` + +### Build performance considerations + +Index builds can be memory-intensive. To improve build performance: + +```sql +-- Increase maintenance work memory +SET maintenance_work_mem = '2GB'; + +-- Then create the index +CREATE INDEX document_embedding_idx ON document_embedding +USING diskann (embedding vector_cosine_ops) +WITH (num_neighbors = 50); +``` + +The default `maintenance_work_mem` is typically 64MB, which may be too low for building indexes on large datasets. 
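+
+To see the value currently in effect before you start a build:
+
+```sql
+SHOW maintenance_work_mem;
+```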
+ +## Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `storage_layout` | `text` | `memory_optimized` | Storage format: `memory_optimized` uses Statistical Binary Quantization; `plain` stores uncompressed vectors | +| `num_neighbors` | `int` | 50 | Maximum number of neighbors per node. Higher values increase accuracy but slow graph traversal | +| `search_list_size` | `int` | 100 | Search list size during construction (S parameter). Higher values improve graph quality at the cost of slower builds | +| `max_alpha` | `float` | 1.2 | Alpha parameter in the DiskANN algorithm. Higher values improve graph quality but slow index builds | +| `num_dimensions` | `int` | 0 (all dimensions) | Number of dimensions to index. Enables Matryoshka embeddings by indexing fewer dimensions | +| `num_bits_per_dimension` | `int` | 2 (if less than 900 dims), 1 otherwise | Bits per dimension for Statistical Binary Quantization encoding | + +### storage_layout + +Controls how vector data is stored: + +- **memory_optimized** (default): Uses Statistical Binary Quantization (SBQ) developed by {COMPANY} researchers to compress vectors. Provides excellent performance with reduced memory footprint. + +- **plain**: Stores vectors uncompressed. Uses more memory but may provide slightly higher accuracy. Required for certain distance operators. + +### num_neighbors + +The maximum number of neighbors per node in the graph. This is a key parameter for balancing accuracy and performance: + +- Lower values (20-50): Faster queries, lower memory, slightly reduced accuracy +- Higher values (50-100): Better accuracy, slower queries, more memory + +### search_list_size + +The size of the candidate list during graph construction (S parameter in DiskANN). Affects index build quality: + +- Lower values (50-100): Faster index builds, slightly lower quality +- Higher values (100-200): Slower builds, higher quality graph structure + +### max_alpha + +The alpha parameter controls pruning during graph construction: + +- Lower values (1.0-1.2): More aggressive pruning, faster builds +- Higher values (1.2-2.0): Less aggressive pruning, higher quality graph + +### num_dimensions + +Enables Matryoshka embedding support by indexing only the first N dimensions: + +- 0 (default): Index all dimensions +- Positive integer: Index only first N dimensions + +Useful for embeddings trained with Matryoshka representation learning, where information is organized hierarchically across dimensions. + +### num_bits_per_dimension + +Controls the precision of Statistical Binary Quantization: + +- 1 bit: Maximum compression, fastest queries, lower accuracy +- 2 bits: Balanced compression and accuracy (default for \<900 dimensions) + + +[create_index]: /api-reference/pgvectorscale/create_index +[query_parameters]: /api-reference/pgvectorscale/query_parameters diff --git a/api-reference/pgvectorscale/pgvectorscale-api-reference-landing.mdx b/api-reference/pgvectorscale/pgvectorscale-api-reference-landing.mdx deleted file mode 100644 index eeb705f..0000000 --- a/api-reference/pgvectorscale/pgvectorscale-api-reference-landing.mdx +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: pgvectorscale API Reference -description: Complete API reference for pgvectorscale functions and vector operations -products: [cloud, mst, self_hosted] -keywords: [API, reference, vector, scale, indexing] -mode: "wide" ---- - - - - Complete API reference for pgvectorscale functions, indexing, and vector scaling operations. 
- - \ No newline at end of file diff --git a/api-reference/pgvectorscale/pgvectorscale-api-reference.mdx b/api-reference/pgvectorscale/pgvectorscale-api-reference.mdx deleted file mode 100644 index 1f204ca..0000000 --- a/api-reference/pgvectorscale/pgvectorscale-api-reference.mdx +++ /dev/null @@ -1,11 +0,0 @@ ---- -title: pgvectorscale API reference -description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ---- - -This page provides an API reference for Vectorizer functions. For an overview -of Vectorizer and how it works, see the [Vectorizer Guide](/docs/vectorizer/overview.md). - -A vectorizer provides you with a powerful and automated way to generate and -manage LLM embeddings for your PostgreSQL data. Here's a summary of what you -gain from Vectorizers: diff --git a/api-reference/pgvectorscale/query_parameters.mdx b/api-reference/pgvectorscale/query_parameters.mdx new file mode 100644 index 0000000..94f204f --- /dev/null +++ b/api-reference/pgvectorscale/query_parameters.mdx @@ -0,0 +1,175 @@ +--- +title: Query-time parameters +description: Tune StreamingDiskANN query performance and accuracy dynamically +keywords: [pgvectorscale, parameters, query, performance, accuracy] +tags: [query, configuration, performance] +license: community +type: configuration +--- + +Configure StreamingDiskANN query behavior at runtime to dynamically tune the accuracy vs performance trade-off for individual queries or sessions. + +- Adjust accuracy and speed trade-offs per query +- Fine-tune search quality without rebuilding indexes +- Optimize for different use cases (fast approximate vs high-accuracy search) +- Control rescoring behavior for better accuracy + +## Samples + +### Increase accuracy for important queries + +```sql +-- Temporarily increase search list size and rescoring +BEGIN; +SET LOCAL diskann.query_search_list_size = 200; +SET LOCAL diskann.query_rescore = 100; + +SELECT * FROM document_embedding +ORDER BY embedding <=> '[...]' +LIMIT 10; + +COMMIT; +``` + +### Fast approximate search + +```sql +-- Reduce search list size for faster queries +SET diskann.query_search_list_size = 50; +SET diskann.query_rescore = 20; + +SELECT * FROM document_embedding +ORDER BY embedding <=> '[...]' +LIMIT 10; +``` + +### Disable rescoring for maximum speed + +```sql +-- Use quantized results without rescoring +SET diskann.query_rescore = 0; + +SELECT * FROM document_embedding +ORDER BY embedding <=> '[...]' +LIMIT 10; +``` + +### Session-wide configuration + +```sql +-- Set parameters for the entire session +SET diskann.query_search_list_size = 150; +SET diskann.query_rescore = 75; + +-- All queries in this session use these settings +SELECT * FROM document_embedding ORDER BY embedding <=> '[...]' LIMIT 10; +SELECT * FROM document_embedding ORDER BY embedding <=> '[...]' LIMIT 20; +``` + +## Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `diskann.query_search_list_size` | `int` | 100 | Number of additional candidates considered during graph search. Higher values improve accuracy but slow queries | +| `diskann.query_rescore` | `int` | 50 | Number of elements rescored with full precision. 
Set to 0 to disable rescoring | + +### diskann.query_search_list_size + +Controls the size of the candidate list during graph traversal: + +- **Lower values (20-50)**: Faster queries, reduced accuracy, fewer candidates explored +- **Default (100)**: Balanced performance and accuracy +- **Higher values (100-200)**: Better accuracy, slower queries, more thorough search + +This parameter has the most direct impact on the accuracy/speed trade-off. + +### diskann.query_rescore + +The number of top candidates to rescore using full-precision vectors: + +- **0**: Disable rescoring, use quantized results directly (fastest) +- **Default (50)**: Rescore top 50 candidates for improved accuracy +- **Higher values (50-200)**: More thorough rescoring, better accuracy, slower queries + +Rescoring helps correct errors introduced by quantization in memory-optimized indexes. For plain storage indexes, rescoring has minimal effect. + +## Setting parameters + +### Session-level (persistent) + +```sql +SET diskann.query_search_list_size = 150; +``` + +Applies to all queries in the current session until changed or the session ends. + +### Transaction-local (temporary) + +```sql +BEGIN; +SET LOCAL diskann.query_rescore = 100; +-- Parameter applies only within this transaction +SELECT * FROM document_embedding ORDER BY embedding <=> '[...]' LIMIT 10; +COMMIT; +-- Parameter is reset after commit +``` + +The `LOCAL` keyword ensures parameters reset after the transaction ends. + +## Tuning recommendations + +### High-accuracy applications + +For applications where accuracy is critical (e.g., medical diagnosis, legal document search): + +```sql +SET diskann.query_search_list_size = 200; +SET diskann.query_rescore = 100; +``` + +### Real-time applications + +For latency-sensitive applications (e.g., chatbots, autocomplete): + +```sql +SET diskann.query_search_list_size = 50; +SET diskann.query_rescore = 20; +``` + +### Batch processing + +For offline batch processing where throughput matters: + +```sql +SET diskann.query_search_list_size = 75; +SET diskann.query_rescore = 30; +``` + +### A/B testing different configurations + +```sql +-- Test configuration A +BEGIN; +SET LOCAL diskann.query_search_list_size = 100; +SET LOCAL diskann.query_rescore = 50; +EXPLAIN ANALYZE SELECT * FROM document_embedding ORDER BY embedding <=> '[...]' LIMIT 10; +COMMIT; + +-- Test configuration B +BEGIN; +SET LOCAL diskann.query_search_list_size = 150; +SET LOCAL diskann.query_rescore = 75; +EXPLAIN ANALYZE SELECT * FROM document_embedding ORDER BY embedding <=> '[...]' LIMIT 10; +COMMIT; +``` + +### Performance considerations + +- Start with default values and adjust `diskann.query_rescore` first +- Use transaction-local settings (`SET LOCAL`) for experimentation +- Monitor query latency with `EXPLAIN ANALYZE` when tuning +- Higher values consume more CPU but not significantly more memory + + +[create_index]: /api-reference/pgvectorscale/create_index +[index_parameters]: /api-reference/pgvectorscale/index_parameters diff --git a/api-reference/tiger-cloud-rest-api/introduction.mdx b/api-reference/tiger-cloud-rest-api/introduction.mdx new file mode 100644 index 0000000..cabc06d --- /dev/null +++ b/api-reference/tiger-cloud-rest-api/introduction.mdx @@ -0,0 +1,84 @@ +--- +title: Tiger Cloud REST API +sidebarTitle: Overview +description: A comprehensive RESTful API for managing Tiger Cloud resources including VPCs, services, and read replicas +--- + +- **API Version:** 1.0.0 +- **Base URL:** 
`https://console.cloud.timescale.com/public/api/v1` + +## Authentication + +The Tiger Cloud REST API uses HTTP Basic Authentication. Include your access key and secret key in the Authorization header. + +### Basic Authentication +```http +Authorization: Basic +``` + +### Example +```bash +# Using cURL +curl -X GET "https://console.cloud.timescale.com/public/api/v1/projects/{project_id}/services" \ + -H "Authorization: Basic $(echo -n 'your_access_key:your_secret_key' | base64)" +``` + +## API Endpoints + +The REST API is organized around three main resource types: + +### Service Management + +Manage Tiger Cloud database services: + +- [**List All Services**](/api-reference/services/list-all-services) - `GET /projects/{project_id}/services` +- [**Create a Service**](/api-reference/services/create-a-service) - `POST /projects/{project_id}/services` +- [**Get a Service**](/api-reference/services/get-a-service) - `GET /projects/{project_id}/services/{service_id}` +- [**Delete a Service**](/api-reference/services/delete-a-service) - `DELETE /projects/{project_id}/services/{service_id}` +- [**Resize a Service**](/api-reference/services/resize-a-service) - `POST /projects/{project_id}/services/{service_id}/resize` +- [**Update Service Password**](/api-reference/services/update-service-password) - `POST /projects/{project_id}/services/{service_id}/updatePassword` +- [**Set Environment for a Service**](/api-reference/services/set-environment-for-a-service) - `POST /projects/{project_id}/services/{service_id}/setEnvironment` +- [**Change HA Configuration for a Service**](/api-reference/services/change-ha-configuration-for-a-service) - `POST /projects/{project_id}/services/{service_id}/setHA` +- [**Enable Connection Pooler for a Service**](/api-reference/services/enable-connection-pooler-for-a-service) - `POST /projects/{project_id}/services/{service_id}/enablePooler` +- [**Disable Connection Pooler for a Service**](/api-reference/services/disable-connection-pooler-for-a-service) - `POST /projects/{project_id}/services/{service_id}/disablePooler` +- [**Fork a Service**](/api-reference/services/fork-a-service) - `POST /projects/{project_id}/services/{service_id}/forkService` +- [**Attach Service to VPC**](/api-reference/services/attach-service-to-vpc) - `POST /projects/{project_id}/services/{service_id}/attachToVPC` +- [**Detach Service from VPC**](/api-reference/services/detach-service-from-vpc) - `POST /projects/{project_id}/services/{service_id}/detachFromVPC` + +### Read Replica Sets + +Manage read replicas for improved read performance: + +- [**Get Read Replica Sets**](/api-reference/services/get-read-replica-sets) - `GET /projects/{project_id}/services/{service_id}/replicaSets` +- [**Create a Read Replica Set**](/api-reference/services/create-a-read-replica-set) - `POST /projects/{project_id}/services/{service_id}/replicaSets` +- [**Delete a Read Replica Set**](/api-reference/services/delete-a-read-replica-set) - `DELETE /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}` +- [**Resize a Read Replica Set**](/api-reference/services/resize-a-read-replica-set) - `POST /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/resize` +- [**Enable Connection Pooler for a Read Replica**](/api-reference/services/enable-connection-pooler-for-a-read-replica) - `POST /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/enablePooler` +- [**Disable Connection Pooler for a Read 
Replica**](/api-reference/services/disable-connection-pooler-for-a-read-replica) - `POST /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/disablePooler` +- [**Set Environment for a Read Replica**](/api-reference/services/set-environment-for-a-read-replica) - `POST /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/setEnvironment` + +### VPC Management + +Manage Virtual Private Clouds for network isolation: + +- [**List All VPCs**](/api-reference/vpcs/list-all-vpcs) - `GET /projects/{project_id}/vpcs` +- [**Create a VPC**](/api-reference/vpcs/create-a-vpc) - `POST /projects/{project_id}/vpcs` +- [**Get a VPC**](/api-reference/vpcs/get-a-vpc) - `GET /projects/{project_id}/vpcs/{vpc_id}` +- [**Delete a VPC**](/api-reference/vpcs/delete-a-vpc) - `DELETE /projects/{project_id}/vpcs/{vpc_id}` +- [**Rename a VPC**](/api-reference/vpcs/rename-a-vpc) - `POST /projects/{project_id}/vpcs/{vpc_id}/rename` + +### VPC Peering + +Manage peering connections between VPCs: + +- [**List VPC Peerings**](/api-reference/vpcs/list-vpc-peerings) - `GET /projects/{project_id}/vpcs/{vpc_id}/peerings` +- [**Create a VPC Peering**](/api-reference/vpcs/create-a-vpc-peering) - `POST /projects/{project_id}/vpcs/{vpc_id}/peerings` +- [**Get a VPC Peering**](/api-reference/vpcs/get-a-vpc-peering) - `GET /projects/{project_id}/vpcs/{vpc_id}/peerings/{peering_id}` +- [**Delete a VPC Peering**](/api-reference/vpcs/delete-a-vpc-peering) - `DELETE /projects/{project_id}/vpcs/{vpc_id}/peerings/{peering_id}` + +### Analytics + +Track usage and events for analytics purposes: + +- [**Identify a User**](/api-reference/analytics/identify-a-user) - `POST /analytics/identify` +- [**Track an Analytics Event**](/api-reference/analytics/track-an-analytics-event) - `POST /analytics/track` diff --git a/api-reference/tiger-cloud-rest-api/openapi.yaml b/api-reference/tiger-cloud-rest-api/openapi.yaml new file mode 100644 index 0000000..0da0641 --- /dev/null +++ b/api-reference/tiger-cloud-rest-api/openapi.yaml @@ -0,0 +1,1162 @@ +openapi: 3.0.3 +info: + title: Tiger Cloud REST API + description: | + A comprehensive RESTful API for managing Tiger Cloud resources including VPCs, services, and read replicas. + + ## Authentication + + The Tiger Cloud REST API uses HTTP Basic Authentication. Include your access key and secret key in the Authorization header. + + ### Basic Authentication + ```http + Authorization: Basic + ``` + + ### Example + ```bash + # Using cURL + curl -X GET "https://console.cloud.timescale.com/public/api/v1/projects/{project_id}/services" \ + -H "Authorization: Basic $(echo -n 'your_access_key:your_secret_key' | base64)" + ``` + + ## Service Management + + You use this endpoint to create a Tiger Cloud service with one or more of the following addons: + + - `time-series`: a Tiger Cloud service optimized for real-time analytics. For time-stamped data like events, prices, metrics, sensor readings, or any information that changes over time. + - `ai`: a Tiger Cloud service instance with vector extensions. + + To have multiple addons when you create a new service, set `"addons": ["time-series", "ai"]`. To create a vanilla Postgres instance, set `addons` to an empty list `[]`. 
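+
+    For example, a minimal request body for a service with both addons (the `name` value is illustrative; the full field list is defined by the `ServiceCreate` schema later in this file):
+
+    ```json
+    {
+      "name": "my-analytics-service",
+      "addons": ["time-series", "ai"]
+    }
+    ```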
+ version: 1.0.0 + license: + name: Proprietary + url: https://www.tigerdata.com/legal/terms + contact: + name: Tiger Data Support + url: https://www.tigerdata.com/contact +servers: + - url: https://console.cloud.timescale.com/public/api/v1 + description: Tiger Cloud API server + - url: http://localhost:8080 + description: Local development server + +tags: + - name: Services + description: Manage services, read replicas, and their associated actions. + - name: VPCs + description: Manage VPCs and their peering connections. + - name: Analytics + description: Track analytics events. + + +paths: + /projects/{project_id}/services: + get: + tags: + - Services + summary: List All Services + description: Retrieves a list of all services within a specific project. + parameters: + - $ref: '#/components/parameters/ProjectId' + responses: + '200': + description: A list of services. + content: + application/json: + schema: + type: array + items: + $ref: '#/components/schemas/Service' + '4XX': + $ref: '#/components/responses/ClientError' + post: + tags: + - Services + summary: Create a Service + description: Creates a new database service within a project. This is an asynchronous operation. + parameters: + - $ref: '#/components/parameters/ProjectId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ServiceCreate' + responses: + '202': + description: Service creation request has been accepted. + content: + application/json: + schema: + $ref: '#/components/schemas/Service' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}: + get: + tags: + - Services + summary: Get a Service + description: Retrieves the details of a specific service by its ID. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + responses: + '200': + description: Service details. + content: + application/json: + schema: + $ref: '#/components/schemas/Service' + '4XX': + $ref: '#/components/responses/ClientError' + delete: + tags: + - Services + summary: Delete a Service + description: Deletes a specific service. This is an asynchronous operation. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + responses: + '202': + description: Deletion request has been accepted. + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/attachToVPC: + post: + tags: + - Services + summary: Attach Service to VPC + description: Associates a service with a VPC. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ServiceVPCInput' + responses: + '202': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/detachFromVPC: + post: + tags: + - Services + summary: Detach Service from VPC + description: Disassociates a service from its VPC. 
+ parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ServiceVPCInput' + responses: + '202': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/resize: + post: + tags: + - Services + summary: Resize a Service + description: Changes the CPU and memory allocation for a specific service within a project. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ResizeInput' + responses: + '202': + description: Resize request has been accepted and is in progress. + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/enablePooler: + post: + tags: + - Services + summary: Enable Connection Pooler for a Service + description: Activates the connection pooler for a specific service within a project. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/disablePooler: + post: + tags: + - Services + summary: Disable Connection Pooler for a Service + description: Deactivates the connection pooler for a specific service within a project. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/forkService: + post: + tags: + - Services + summary: Fork a Service + description: Creates a new, independent service within a project by taking a snapshot of an existing one. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ForkServiceCreate' + responses: + '202': + description: Fork request accepted. The response contains the details of the new service being created. + content: + application/json: + schema: + $ref: '#/components/schemas/Service' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/updatePassword: + post: + tags: + - Services + summary: Update Service Password + description: Sets a new master password for the service within a project. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/UpdatePasswordInput' + responses: + '204': + description: Password updated successfully. No content returned. + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/setEnvironment: + post: + tags: + - Services + summary: Set Environment for a Service + description: Sets the environment type for the service. 
+ parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/SetEnvironmentInput' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/setHA: + post: + tags: + - Services + summary: Change HA configuration for a Service + description: Changes the HA configuration for a specific service. This is an asynchronous operation. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/SetHAReplicaInput' + responses: + '202': + description: HA replica configuration updated + content: + application/json: + schema: + $ref: '#/components/schemas/Service' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets: + get: + tags: + - Read Replica Sets + summary: Get Read Replica Sets + description: Retrieves a list of all read replica sets associated with a primary service within a project. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + responses: + '200': + description: A list of read replica sets. + content: + application/json: + schema: + type: array + items: + $ref: '#/components/schemas/ReadReplicaSet' + '4XX': + $ref: '#/components/responses/ClientError' + post: + tags: + - Read Replica Sets + summary: Create a Read Replica Set + description: Creates a new read replica set for a service. This is an asynchronous operation. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ReadReplicaSetCreate' + responses: + '202': + description: Read replica set creation request has been accepted. + content: + application/json: + schema: + $ref: '#/components/schemas/ReadReplicaSet' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}: + delete: + tags: + - Read Replica Sets + summary: Delete a Read Replica Set + description: Deletes a specific read replica set. This is an asynchronous operation. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + - $ref: '#/components/parameters/ReplicaSetId' + responses: + '202': + description: Deletion request has been accepted. + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/resize: + post: + tags: + - Read Replica Sets + summary: Resize a Read Replica Set + description: Changes the resource allocation for a specific read replica set. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + - $ref: '#/components/parameters/ReplicaSetId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/ResizeInput' + responses: + '202': + description: Resize request has been accepted and is in progress. 
+ '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/enablePooler: + post: + tags: + - Read Replica Sets + summary: Enable Connection Pooler for a Read Replica + description: Activates the connection pooler for a specific read replica set. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + - $ref: '#/components/parameters/ReplicaSetId' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/disablePooler: + post: + tags: + - Read Replica Sets + summary: Disable Connection Pooler for a Read Replica + description: Deactivates the connection pooler for a specific read replica set. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + - $ref: '#/components/parameters/ReplicaSetId' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/services/{service_id}/replicaSets/{replica_set_id}/setEnvironment: + post: + tags: + - Read Replica Sets + summary: Set Environment for a Read Replica + description: Sets the environment type for the read replica set. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/ServiceId' + - $ref: '#/components/parameters/ReplicaSetId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/SetEnvironmentInput' + responses: + '200': + $ref: '#/components/responses/SuccessMessage' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/vpcs: + get: + tags: + - VPCs + parameters: + - $ref: '#/components/parameters/ProjectId' + summary: List All VPCs + description: Retrieves a list of all Virtual Private Clouds (VPCs). + responses: + '200': + description: A list of VPCs. + content: + application/json: + schema: + type: array + items: + $ref: '#/components/schemas/VPC' + '4XX': + $ref: '#/components/responses/ClientError' + post: + tags: + - VPCs + parameters: + - $ref: '#/components/parameters/ProjectId' + summary: Create a VPC + description: Creates a new Virtual Private Cloud (VPC). + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/VPCCreate' + responses: + '201': + description: VPC created successfully. + content: + application/json: + schema: + $ref: '#/components/schemas/VPC' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/vpcs/{vpc_id}: + get: + tags: + - VPCs + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + summary: Get a VPC + description: Retrieves the details of a specific VPC by its ID. + responses: + '200': + description: VPC details. + content: + application/json: + schema: + $ref: '#/components/schemas/VPC' + '4XX': + $ref: '#/components/responses/ClientError' + delete: + tags: + - VPCs + summary: Delete a VPC + description: Deletes a specific VPC. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + responses: + '204': + description: VPC deleted successfully. 
+ '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/vpcs/{vpc_id}/rename: + post: + tags: + - VPCs + summary: Rename a VPC + description: Updates the name of a specific VPC. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/VPCRename' + responses: + '200': + description: VPC renamed successfully. + content: + application/json: + schema: + $ref: '#/components/schemas/VPC' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/vpcs/{vpc_id}/peerings: + get: + tags: + - VPCs + summary: List VPC Peerings + description: Retrieves a list of all VPC peering connections for a given VPC. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + responses: + '200': + description: A list of VPC peering connections. + content: + application/json: + schema: + type: array + items: + $ref: '#/components/schemas/Peering' + '4XX': + $ref: '#/components/responses/ClientError' + post: + tags: + - VPCs + summary: Create a VPC Peering + description: Creates a new VPC peering connection. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + requestBody: + required: true + content: + application/json: + schema: + $ref: '#/components/schemas/PeeringCreate' + responses: + '201': + description: VPC peering created successfully. + content: + application/json: + schema: + $ref: '#/components/schemas/Peering' + '4XX': + $ref: '#/components/responses/ClientError' + /projects/{project_id}/vpcs/{vpc_id}/peerings/{peering_id}: + get: + tags: + - VPCs + summary: Get a VPC Peering + description: Retrieves the details of a specific VPC peering connection. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + - $ref: '#/components/parameters/PeeringId' + responses: + '200': + description: VPC peering details. + content: + application/json: + schema: + $ref: '#/components/schemas/Peering' + '4XX': + $ref: '#/components/responses/ClientError' + delete: + tags: + - VPCs + summary: Delete a VPC Peering + description: Deletes a specific VPC peering connection. + parameters: + - $ref: '#/components/parameters/ProjectId' + - $ref: '#/components/parameters/VPCId' + - $ref: '#/components/parameters/PeeringId' + responses: + '204': + description: VPC peering deleted successfully. + '4XX': + $ref: '#/components/responses/ClientError' + /analytics/identify: + post: + tags: + - Analytics + summary: Identify a user + description: Identifies a user with optional properties for analytics tracking. + requestBody: + required: true + content: + application/json: + schema: + type: object + properties: + properties: + type: object + additionalProperties: true + description: Optional map of arbitrary properties associated with the user + example: + email: "user@example.com" + name: "John Doe" + responses: + '200': + $ref: '#/components/responses/AnalyticsResponse' + '4XX': + $ref: '#/components/responses/ClientError' + /analytics/track: + post: + tags: + - Analytics + summary: Track an analytics event + description: Tracks an analytics event with optional properties. 
+ requestBody: + required: true + content: + application/json: + schema: + type: object + required: + - event + properties: + event: + type: string + description: The name of the event to track + example: service_created + properties: + type: object + additionalProperties: true + description: Optional map of arbitrary properties associated with the event + example: + region: "us-east-1" + responses: + '200': + $ref: '#/components/responses/AnalyticsResponse' + '4XX': + $ref: '#/components/responses/ClientError' + +components: + parameters: + ProjectId: + name: project_id + in: path + required: true + description: The unique identifier of the project. + schema: + type: string + example: "rp1pz7uyae" + ServiceId: + name: service_id + in: path + required: true + description: The unique identifier of the service. + schema: + type: string + example: "d1k5vk7hf2" + ReplicaSetId: + name: replica_set_id + in: path + required: true + description: The unique identifier of the read replica set. + schema: + type: string + example: "alb8jicdpr" + VPCId: + name: vpc_id + in: path + required: true + description: The unique identifier of the VPC. + schema: + type: string + example: "1234567890" + PeeringId: + name: peering_id + in: path + required: true + description: The unique identifier of the VPC peering connection. + schema: + type: string + example: "1234567890" + + schemas: + VPC: + type: object + properties: + id: + type: string + readOnly: true + example: "1234567890" + name: + type: string + example: "my-production-vpc" + cidr: + type: string + example: "10.0.0.0/16" + region_code: + type: string + example: "us-east-1" + VPCCreate: + type: object + required: + - name + - cidr + - region_code + properties: + name: + type: string + example: "my-production-vpc" + cidr: + type: string + example: "10.0.0.0/16" + region_code: + type: string + example: "us-east-1" + VPCRename: + type: object + required: + - name + properties: + name: + type: string + description: The new name for the VPC. + example: "my-renamed-vpc" + Peering: + type: object + properties: + id: + type: string + readOnly: true + example: "1234567890" + peer_account_id: + type: string + example: "acc-12345" + peer_region_code: + type: string + example: "aws-us-east-1" + peer_vpc_id: + type: string + example: "1234567890" + provisioned_id: + type: string + example: "1234567890" + status: + type: string + example: "active" + error_message: + type: string + example: "VPC not found" + PeeringCreate: + type: object + required: + - peer_account_id + - peer_region_code + - peer_vpc_id + properties: + peer_account_id: + type: string + example: "acc-12345" + peer_region_code: + type: string + example: "aws-us-east-1" + peer_vpc_id: + type: string + example: "1234567890" + Endpoint: + type: object + properties: + host: + type: string + example: "my-service.com" + port: + type: integer + example: 8080 + ConnectionPooler: + type: object + properties: + endpoint: + $ref: '#/components/schemas/Endpoint' + Service: + type: object + properties: + service_id: + type: string + description: The unique identifier for the service. + project_id: + type: string + description: The project this service belongs to. + name: + type: string + description: The name of the service. + region_code: + type: string + description: The cloud region where the service is hosted. + example: "us-east-1" + service_type: + $ref: '#/components/schemas/ServiceType' + description: The type of the service. 
+ created: + type: string + format: date-time + description: Creation timestamp + initial_password: + type: string + description: The initial password for the service. + format: password + example: "a-very-secure-initial-password" + paused: + type: boolean + description: Whether the service is paused + status: + $ref: '#/components/schemas/DeployStatus' + description: Current status of the service + resources: + type: array + description: List of resources allocated to the service + items: + type: object + properties: + id: + type: string + description: Resource identifier + spec: + type: object + description: Resource specification + properties: + cpu_millis: + type: integer + description: CPU allocation in millicores + memory_gbs: + type: integer + description: Memory allocation in gigabytes + volume_type: + type: string + description: Type of storage volume + metadata: + type: object + description: Additional metadata for the service + properties: + environment: + type: string + description: Environment tag for the service + endpoint: + $ref: '#/components/schemas/Endpoint' + vpcEndpoint: + type: object + nullable: true + description: VPC endpoint configuration if available + forked_from: + $ref: '#/components/schemas/ForkSpec' + ha_replicas: + $ref: '#/components/schemas/HAReplica' + connection_pooler: + $ref: '#/components/schemas/ConnectionPooler' + read_replica_sets: + type: array + items: + $ref: '#/components/schemas/ReadReplicaSet' + ServiceType: + type: string + enum: + - TIMESCALEDB + - POSTGRES + - VECTOR + EnvironmentTag: + type: string + enum: + - DEV + - PROD + description: The environment tag for the service. + ForkStrategy: + type: string + enum: + - LAST_SNAPSHOT + - NOW + - PITR + description: | + Strategy for creating the fork: + - LAST_SNAPSHOT: Use existing snapshot for fast fork + - NOW: Create new snapshot for up-to-date fork + - PITR: Point-in-time recovery using target_time + DeployStatus: + type: string + enum: + - QUEUED + - DELETING + - CONFIGURING + - READY + - DELETED + - UNSTABLE + - PAUSING + - PAUSED + - RESUMING + - UPGRADING + - OPTIMIZING + ForkSpec: + type: object + properties: + project_id: + type: string + example: "asda1b2c3" + service_id: + type: string + example: "bbss422fg" + is_standby: + type: boolean + example: false + ReadReplicaSet: + type: object + properties: + id: + type: string + example: "alb8jicdpr" + name: + type: string + example: "reporting-replica-1" + status: + type: string + enum: [creating, active, resizing, deleting, error] + example: "active" + nodes: + type: integer + description: Number of nodes in the replica set. + example: 2 + cpu_millis: + type: integer + description: CPU allocation in milli-cores. + example: 250 + memory_gbs: + type: integer + description: Memory allocation in gigabytes. + example: 1 + metadata: + type: object + description: Additional metadata for the read replica set + properties: + environment: + type: string + description: Environment tag for the read replica set + endpoint: + $ref: '#/components/schemas/Endpoint' + connection_pooler: + $ref: '#/components/schemas/ConnectionPooler' + ServiceCreate: + type: object + required: + - name + properties: + name: + type: string + description: A human-readable name for the service. + example: "my-production-db" + addons: + type: array + items: + type: string + enum: ["time-series", "ai"] + description: List of addons to enable for the service. 'time-series' enables TimescaleDB, 'ai' enables AI/vector extensions. 
+ example: ["time-series", "ai"] + region_code: + type: string + description: The region where the service will be created. If not provided, we'll choose the best region for you. + example: "us-east-1" + replica_count: + type: integer + description: Number of high-availability replicas to create (all replicas are asynchronous by default). + example: 2 + cpu_millis: + type: string + description: The initial CPU allocation in milli-cores, or 'shared' for a shared-resource service. + example: "1000" + memory_gbs: + type: string + description: The initial memory allocation in gigabytes, or 'shared' for a shared-resource service. + example: "4" + environment_tag: + $ref: '#/components/schemas/EnvironmentTag' + description: The environment tag for the service, 'DEV' by default. + default: DEV + ForkServiceCreate: + type: object + required: + - fork_strategy + properties: + name: + type: string + description: A human-readable name for the forked service. If not provided, will use parent service name with "-fork" suffix. + example: "my-production-db-fork" + cpu_millis: + type: string + description: The initial CPU allocation in milli-cores, or 'shared' for a shared-resource service. If not provided, will inherit from parent service. + example: "1000" + memory_gbs: + type: string + description: The initial memory allocation in gigabytes, or 'shared' for a shared-resource service. If not provided, will inherit from parent service. + example: "4" + fork_strategy: + $ref: '#/components/schemas/ForkStrategy' + description: Strategy for creating the fork. This field is required. + target_time: + type: string + format: date-time + description: Target time for point-in-time recovery. Required when fork_strategy is PITR. + example: "2024-01-01T00:00:00Z" + environment_tag: + $ref: '#/components/schemas/EnvironmentTag' + description: The environment tag for the forked service, 'DEV' by default. + default: DEV + description: | + Create a fork of an existing service. Service type, region code, and storage are always inherited from the parent service. + HA replica count is always set to 0 for forked services. + HAReplica: + type: object + properties: + sync_replica_count: + type: integer + description: Number of synchronous high-availability replicas. + example: 1 + replica_count: + type: integer + description: Number of high-availability replicas (all replicas are asynchronous by default). + example: 1 + SetHAReplicaInput: + type: object + properties: + sync_replica_count: + type: integer + description: Number of synchronous high-availability replicas. + example: 1 + replica_count: + type: integer + description: Number of high-availability replicas (all replicas are asynchronous by default). + example: 1 + description: At least one of sync_replica_count or replica_count must be provided. + ReadReplicaSetCreate: + type: object + required: + - name + - nodes + - cpu_millis + - memory_gbs + properties: + name: + type: string + description: A human-readable name for the read replica. + example: "my-reporting-replica" + nodes: + type: integer + description: Number of nodes to create in the replica set. + example: 2 + cpu_millis: + type: integer + description: The initial CPU allocation in milli-cores. + example: 250 + memory_gbs: + type: integer + description: The initial memory allocation in gigabytes. + example: 1 + ResizeInput: + type: object + required: + - cpu_millis + - memory_gbs + properties: + cpu_millis: + type: integer + description: The new CPU allocation in milli-cores (e.g., 1000 for 1 vCPU). 
+ example: 1000 + memory_gbs: + type: integer + description: The new memory allocation in gigabytes. + example: 4 + nodes: + type: integer + description: The new number of nodes in the replica set. + example: 2 + UpdatePasswordInput: + type: object + required: + - password + properties: + password: + type: string + description: The new password. + format: password + example: "a-very-secure-new-password" + SetEnvironmentInput: + type: object + required: + - environment + properties: + environment: + type: string + description: The target environment for the service. + enum: [PROD, DEV] + example: + environment: "PROD" + ServiceVPCInput: + type: object + required: + - vpc_id + properties: + vpc_id: + type: string + description: The ID of the VPC to attach the service to. + example: "1234567890" + Error: + type: object + properties: + code: + type: string + message: + type: string + + responses: + AnalyticsResponse: + description: Analytics action completed successfully. + content: + application/json: + schema: + type: object + properties: + status: + type: string + description: Status of the analytics operation + example: "success" + SuccessMessage: + description: The action was completed successfully. + content: + application/json: + schema: + type: object + properties: + message: + type: string + example: "Action completed successfully." + ClientError: + description: Client error response (4xx status codes). + content: + application/json: + schema: + $ref: '#/components/schemas/Error' diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/candlestick.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/candlestick.mdx new file mode 100644 index 0000000..61fe1dd --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/candlestick.mdx @@ -0,0 +1,51 @@ +--- +title: candlestick() +description: Transform pre-aggregated candlestick data into the correct form to use with `candlestick_agg` functions +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, open, high, low, close] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: pseudo-aggregate + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Transform pre-aggregated candlestick data into a candlestick aggregate object. This object contains the data in the +correct form to use with the accessors and rollups in this function group. + +If you're starting with raw tick data rather than candlestick data, use [`candlestick_agg()`](#candlestick_agg) instead. 
+ +## Arguments + +The syntax is: + +```sql +candlestick( + ts TIMESTAMPTZ, + open DOUBLE PRECISION, + high DOUBLE PRECISION, + low DOUBLE PRECISION, + close DOUBLE PRECISION, + volume DOUBLE PRECISION +) RETURNS Candlestick +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamp associated with stock price | +| open | DOUBLE PRECISION | - | ✔ | Opening price of candlestick | +| high | DOUBLE PRECISION | - | ✔ | High price of candlestick | +| low | DOUBLE PRECISION | - | ✔ | Low price of candlestick | +| close | DOUBLE PRECISION | - | ✔ | Closing price of candlestick | +| volume | DOUBLE PRECISION | - | ✔ | Total volume of trades during the candlestick period | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | Candlestick | An object storing `(timestamp, value)` pairs for each of the opening, high, low, and closing prices, in addition to information used to calculate the total volume and Volume Weighted Average Price. | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/candlestick_agg.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/candlestick_agg.mdx new file mode 100644 index 0000000..861d3ac --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/candlestick_agg.mdx @@ -0,0 +1,55 @@ +--- +title: candlestick_agg() +description: Aggregate tick data into an intermediate form for further calculation +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, open, high, low, close] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: aggregate + aggregates: + - candlestick_agg() +products: [cloud, mst, self_hosted] +--- + + Since 1.14.0 + +This is the first step for performing financial calculations on raw tick +data. Use `candlestick_agg` to create an intermediate aggregate from your +tick data. This intermediate form can then be used by one or more accessors +in this group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined +using [`rollup()`](#rollup) before an accessor is applied. + +If you're starting with pre-aggregated candlestick data rather than raw tick +data, use the companion [`candlestick()`](#candlestick) function instead. +This function transforms the existing aggregated data into the correct form +for use with the candlestick accessors. + + +## Arguments + +The syntax is: + +```sql +candlestick_agg( + ts TIMESTAMPTZ, + price DOUBLE PRECISION, + volume DOUBLE PRECISION +) RETURNS Candlestick +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `ts` | TIMESTAMPTZ | - | ✔ | Timestamp associated with stock price | +| `price` | DOUBLE PRECISION | - | ✔ | Stock quote/price at the given time | +| `volume` | DOUBLE PRECISION | - | ✔ | Volume of the trade | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | Candlestick | An object storing `(timestamp, value)` pairs for each of the opening, high, low, and closing prices, in addition to information used to calculate the total volume and Volume Weighted Average Price. 
| diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/close.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/close.mdx new file mode 100644 index 0000000..55f8bf9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/close.mdx @@ -0,0 +1,38 @@ +--- +title: close() +description: Get the closing price from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, close] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the closing price from a candlestick aggregate. + +## Arguments + +The syntax is: + +```sql +close( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| close | DOUBLE PRECISION | The closing price | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/close_time.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/close_time.mdx new file mode 100644 index 0000000..91bdfc7 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/close_time.mdx @@ -0,0 +1,38 @@ +--- +title: close_time() +description: Get the timestamp corresponding to the closing time from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, close] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the timestamp corresponding to the closing time from a candlestick aggregate. + +## Arguments + +The syntax is: + +```sql +close_time( + candlestick Candlestick +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| close_time | TIMESTAMPTZ | The time at which the closing price occurred | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/high.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/high.mdx new file mode 100644 index 0000000..ce15cbe --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/high.mdx @@ -0,0 +1,38 @@ +--- +title: high() +description: Get the high price from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, high] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the high price from a candlestick aggregate. 
+ +## Arguments + +The syntax is: + +```sql +high( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| high | DOUBLE PRECISION | The high price | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/high_time.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/high_time.mdx new file mode 100644 index 0000000..f5ab0ce --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/high_time.mdx @@ -0,0 +1,38 @@ +--- +title: high_time() +description: Get the timestamp corresponding to the high time from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, high] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the timestamp corresponding to the high time from a candlestick aggregate. + +## Arguments + +The syntax is: + +```sql +high_time( + candlestick Candlestick +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| high_time | TIMESTAMPTZ | The first time at which the high price occurred | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/index.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/index.mdx new file mode 100644 index 0000000..67a46eb --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/index.mdx @@ -0,0 +1,278 @@ +--- +title: Financial analysis overview +description: Perform analysis of financial asset data +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Perform analysis of financial asset data. These specialized hyperfunctions make +it easier to write financial analysis queries that involve candlestick data. + +They help you answer questions such as: + +* What are the opening and closing prices of these stocks? +* When did the highest price occur for this stock? + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function, +[`candlestick_agg`][candlestick_agg], it also includes the pseudo-aggregate +function [`candlestick`][candlestick]. `candlestick_agg` produces a candlestick aggregate from +raw tick data, which can then be used with the accessor and rollup functions in +this group. `candlestick` takes pre-aggregated data and transforms it into the +same format that `candlestick_agg` produces. This allows you to use the +accessors and rollups with existing candlestick data. 
+ +## Two-step aggregation + + + +## Samples + +### Get candlestick values from tick data + +Query your tick data table for the opening, high, low, and closing prices, and +the trading volume, for each 1 hour period in the last day: + +``` sql +SELECT + time_bucket('1 hour'::interval, "time") AS ts, + symbol, + open(candlestick_agg("time", price, volume)), + high(candlestick_agg("time", price, volume)), + low(candlestick_agg("time", price, volume)), + close(candlestick_agg("time", price, volume)), + volume(candlestick_agg("time", price, volume)) +FROM crypto_ticks +WHERE "time" > now() - '1 day'::interval +GROUP BY ts, symbol +; + +-- or + +WITH cs AS ( + SELECT time_bucket('1 hour'::interval, "time") AS hourly_bucket, + symbol, + candlestick_agg("time", price, volume) AS candlestick + FROM crypto_ticks + WHERE "time" > now() - '1 day'::interval + GROUP BY hourly_bucket, symbol +) +SELECT hourly_bucket, + symbol, + open(candlestick), + high(candlestick), + low(candlestick), + close(candlestick), + volume(candlestick) +FROM cs +; +``` + +### Create a continuous aggregate from tick data and roll it up + +Create a continuous aggregate on your stock trade data: + +```sql +CREATE MATERIALIZED VIEW candlestick +WITH (timescaledb.continuous) AS +SELECT time_bucket('1 minute'::interval, "time") AS ts, + symbol, + candlestick_agg("time", price, volume) AS candlestick +FROM crypto_ticks +GROUP BY ts, symbol +; +``` + +Query your by-minute continuous aggregate over stock trade data for the opening, +high, low, and closing (OHLC) prices, along with their timestamps, in the last +hour: + +``` sql +SELECT ts, + symbol, + open_time(candlestick), + open(candlestick), + high_time(candlestick), + high(candlestick), + low_time(candlestick), + low(candlestick), + close_time(candlestick), + close(candlestick) +FROM candlestick +WHERE ts > now() - '1 hour'::interval +; +``` + +Roll up your by-minute continuous aggregate into daily buckets and return the +Volume Weighted Average Price for `AAPL` for the last month: + +``` sql +SELECT + time_bucket('1 day'::interval, ts) AS daily_bucket, + symbol, + vwap(rollup(candlestick)) +FROM candlestick +WHERE symbol = 'AAPL' + AND ts > now() - '1 month'::interval +GROUP BY daily_bucket +ORDER BY daily_bucket +; +``` + +Roll up your by-minute continuous aggregate into hourly buckets and return the +the opening, high, low, and closing prices and the volume for each 1 hour period +in the last day: + +``` sql +SELECT + time_bucket('1 hour'::interval, ts) AS hourly_bucket, + symbol, + open(rollup(candlestick)), + high(rollup(candlestick)), + low(rollup(candlestick)), + close(rollup(candlestick)), + volume(rollup(candlestick)) +FROM candlestick +WHERE ts > now() - '1 day'::interval +GROUP BY hourly_bucket +; +``` + +### Starting from already-aggregated data + +If you have a table of pre-aggregated stock data, it might look similar this +this format: + +``` sql + ts │ symbol │ open │ high │ low │ close │ volume +────────────────────────┼────────┼────────┼────────┼────────┼────────┼────────── + 2022-11-17 00:00:00-05 │ VTI │ 195.67 │ 197.9 │ 195.45 │ 197.49 │ 3704700 + 2022-11-16 00:00:00-05 │ VTI │ 199.45 │ 199.72 │ 198.03 │ 198.32 │ 2905000 + 2022-11-15 00:00:00-05 │ VTI │ 201.5 │ 202.14 │ 198.34 │ 200.36 │ 4606200 + 2022-11-14 00:00:00-05 │ VTI │ 199.26 │ 200.92 │ 198.21 │ 198.35 │ 4248200 + 2022-11-11 00:00:00-05 │ VTI │ 198.58 │ 200.7 │ 197.82 │ 200.16 │ 4538500 + 2022-11-10 00:00:00-05 │ VTI │ 194.35 │ 198.31 │ 193.65 │ 198.14 │ 3981600 + 2022-11-09 00:00:00-05 │ VTI │ 
190.46 │ 191.04 │ 187.21 │ 187.53 │ 13959600 + 2022-11-08 00:00:00-05 │ VTI │ 191.25 │ 193.31 │ 189.42 │ 191.66 │ 4847500 + 2022-11-07 00:00:00-05 │ VTI │ 189.59 │ 190.97 │ 188.47 │ 190.66 │ 3420000 + 2022-11-04 00:00:00-04 │ VTI │ 189.32 │ 190.3 │ 185.75 │ 188.94 │ 3584600 + 2022-11-03 00:00:00-04 │ VTI │ 186.5 │ 188.09 │ 185.13 │ 186.54 │ 3935600 + 2022-11-02 00:00:00-04 │ VTI │ 193.07 │ 195.27 │ 188.29 │ 188.34 │ 4686000 + 2022-11-01 00:00:00-04 │ VTI │ 196 │ 196.44 │ 192.76 │ 193.43 │ 9873800 + 2022-10-31 00:00:00-04 │ VTI │ 193.99 │ 195.17 │ 193.51 │ 194.03 │ 5053900 + 2022-10-28 00:00:00-04 │ VTI │ 190.84 │ 195.53 │ 190.74 │ 195.29 │ 3178800 + 2022-10-27 00:00:00-04 │ VTI │ 192.46 │ 193.47 │ 190.61 │ 190.85 │ 3556300 + 2022-10-26 00:00:00-04 │ VTI │ 191.26 │ 194.64 │ 191.26 │ 191.75 │ 4091100 + 2022-10-25 00:00:00-04 │ VTI │ 189.57 │ 193.16 │ 189.53 │ 192.94 │ 3287100 + 2022-10-24 00:00:00-04 │ VTI │ 188.38 │ 190.12 │ 186.69 │ 189.51 │ 4527800 + 2022-10-21 00:00:00-04 │ VTI │ 182.99 │ 187.78 │ 182.29 │ 187.49 │ 3381200 + 2022-10-20 00:00:00-04 │ VTI │ 184.54 │ 186.99 │ 182.81 │ 183.27 │ 2636200 + 2022-10-19 00:00:00-04 │ VTI │ 185.25 │ 186.64 │ 183.34 │ 184.87 │ 2589100 + 2022-10-18 00:00:00-04 │ VTI │ 188.14 │ 188.7 │ 184.71 │ 186.46 │ 3906800 +``` + +You can use the [`candlestick`](#candlestick) function to transform the data +into a form that you'll be able pass to all of the accessors and +[`rollup`](#rollup) functions. To show that your data is preserved, this example +shows how these accessors return a table that looks just like your data: + +``` sql +SELECT + ts, + symbol, + open(candlestick), + high(candlestick), + low(candlestick), + close(candlestick), + volume(candlestick) +FROM ( + SELECT + ts, + symbol, + candlestick(ts, open, high, low, close, volume) + FROM historical_data +) AS _(ts, symbol, candlestick); +; + +-- or + +WITH cs AS ( + SELECT ts + symbol, + candlestick(ts, open, high, low, close, volume) + FROM historical_data +) +SELECT + ts + symbol, + open(candlestick), + high(candlestick), + low(candlestick), + close(candlestick), + volume(candlestick) +FROM cs +; +``` + +The advantage of transforming your data into the candlestick aggergate form is +that you can then use other functions in this group, such as [`rollup`](#rollup) +and [`vwap`](#vwap). 
+ +Roll up your by-day historical data into weekly buckets and return the Volume +Weighted Average Price: + +``` sql +SELECT + time_bucket('1 week'::interval, ts) AS weekly_bucket, + symbol, + vwap(rollup(candlestick)) +FROM ( + SELECT + ts, + symbol, + candlestick(ts, open, high, low, close, volume) + FROM historical_data +) AS _(ts, symbol, candlestick) +GROUP BY weekly_bucket, symbol +; +``` + +## Available functions + +### Aggregate +- [`candlestick_agg()`][candlestick_agg]: aggregate tick data into an intermediate form for further calculation + +### Pseudo-aggregate +- [`candlestick()`][candlestick]: transform pre-aggregated candlestick data into the correct form to use with + candlestick_agg functions + +### Accessors +- [`open()`][open]: get the opening price from a candlestick aggregate +- [`open_time()`][open_time]: get the timestamp of the opening price from a candlestick aggregate +- [`high()`][high]: get the high price from a candlestick aggregate +- [`high_time()`][high_time]: get the timestamp of the high price from a candlestick aggregate +- [`low()`][low]: get the low price from a candlestick aggregate +- [`low_time()`][low_time]: get the timestamp of the low price from a candlestick aggregate +- [`close()`][close]: get the closing price from a candlestick aggregate +- [`close_time()`][close_time]: get the timestamp of the closing price from a candlestick aggregate +- [`volume()`][volume]: get the total volume from a candlestick aggregate +- [`vwap()`][vwap]: calculate the volume-weighted average price from a candlestick aggregate + +### Rollup +- [`rollup()`][rollup]: roll up multiple candlestick aggregates + +[two-step-aggregation]: #two-step-aggregation +[candlestick_agg]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/candlestick_agg +[candlestick]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/candlestick +[open]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/open +[open_time]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/open_time +[high]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/high +[high_time]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/high_time +[low]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/low +[low_time]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/low_time +[close]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/close +[close_time]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/close_time +[volume]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/volume +[vwap]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/vwap +[rollup]: /api-reference/timescaledb/hyperfunctions/candlestick_agg/rollup diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/low.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/low.mdx new file mode 100644 index 0000000..c1f1d71 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/low.mdx @@ -0,0 +1,38 @@ +--- +title: low() +description: Get the low price from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, low] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the low price from a candlestick aggregate. 
+ +## Arguments + +The syntax is: + +```sql +low( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| low | DOUBLE PRECISION | The low price | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/low_time.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/low_time.mdx new file mode 100644 index 0000000..4f5066b --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/low_time.mdx @@ -0,0 +1,38 @@ +--- +title: low_time() +description: Get the timestamp corresponding to the low time from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, low] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the timestamp corresponding to the low time from a candlestick aggregate. + +## Arguments + +The syntax is: + +```sql +low_time( + candlestick Candlestick +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| low_time | TIMESTAMPTZ | The first time at which the low price occurred | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/open.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/open.mdx new file mode 100644 index 0000000..177ef4f --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/open.mdx @@ -0,0 +1,38 @@ +--- +title: open() +description: Get the opening price from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, open] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the opening price from a candlestick aggregate. + +## Arguments + +The syntax is: + +```sql +open( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| open | DOUBLE PRECISION | The opening price | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/open_time.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/open_time.mdx new file mode 100644 index 0000000..2116e04 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/open_time.mdx @@ -0,0 +1,38 @@ +--- +title: open_time() +description: Get the timestamp corresponding to the open time from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, open] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the timestamp corresponding to the open time from a candlestick aggregate. 
+ +## Arguments + +The syntax is: + +```sql +open_time( + candlestick Candlestick +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| open_time | TIMESTAMPTZ | The time at which the opening price occurred | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/rollup.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/rollup.mdx new file mode 100644 index 0000000..1404105 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/rollup.mdx @@ -0,0 +1,40 @@ +--- +title: rollup() +description: Roll up multiple Candlestick aggregates +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: rollup + aggregates: + - candlestick_agg() +--- + + Early access 1.12.0 + +Combine multiple intermediate candlestick aggregates, produced by `candlestick_agg` or `candlestick`, into a single +intermediate candlestick aggregate. For example, you can use `rollup` to combine candlestick aggregates from 15-minute +buckets into daily buckets. + +## Arguments + +The syntax is: + +```sql +rollup( + candlestick Candlestick +) RETURNS Candlestick +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | The aggregate produced by a `candlestick` or `candlestick_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| candlestick | Candlestick | A new candlestick aggregate produced by combining the input candlestick aggregates | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/volume.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/volume.mdx new file mode 100644 index 0000000..d6abb19 --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/volume.mdx @@ -0,0 +1,38 @@ +--- +title: volume() +description: Get the total volume from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, volume] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the total volume from a candlestick aggregate. 
+ +## Arguments + +The syntax is: + +```sql +volume( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| volume | DOUBLE PRECISION | Total volume of trades within the period | diff --git a/api-reference/timescaledb-toolkit/candlestick_agg/vwap.mdx b/api-reference/timescaledb-toolkit/candlestick_agg/vwap.mdx new file mode 100644 index 0000000..dc053de --- /dev/null +++ b/api-reference/timescaledb-toolkit/candlestick_agg/vwap.mdx @@ -0,0 +1,42 @@ +--- +title: vwap() +description: Get the Volume Weighted Average Price from a candlestick aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, finance, candlestick, average, volume] +license: community +type: function +experimental: false +toolkit: true +hyperfunction: + family: financial analysis + type: accessor + aggregates: + - candlestick_agg() +--- + + Since 1.14.0 + +Get the Volume Weighted Average Price from a candlestick aggregate. + +For Candlesticks constructed from data that is already aggregated, the Volume Weighted Average Price is calculated using +the typical price for each period (where the typical price refers to the arithmetic mean of the high, low, and closing +prices). + +## Arguments + +The syntax is: + +```sql +vwap( + candlestick Candlestick +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| candlestick | Candlestick | - | ✔ | Candlestick aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| vwap | DOUBLE PRECISION | The volume weighted average price | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/corr.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/corr.mdx new file mode 100644 index 0000000..d8d9fdc --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/corr.mdx @@ -0,0 +1,58 @@ +--- +title: corr() +description: Calculate the correlation coefficient from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the correlation coefficient from a counter aggregate. The calculation uses a linear least-squares fit, and +returns a value between 0.0 and 1.0, from no correlation to the strongest possible correlation. + + +## Samples + +Calculate the correlation coefficient to determine the goodness of a linear fit between counter value and time. 
+ +```sql +SELECT + id, + bucket, + corr(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +corr( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| corr | DOUBLE PRECISION | The correlation coefficient calculated with time as the independent variable and counter value as the dependent variable. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_agg.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_agg.mdx new file mode 100644 index 0000000..56af862 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_agg.mdx @@ -0,0 +1,64 @@ +--- +title: counter_agg() +description: Aggregate counter data into an intermediate form for further analysis +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: aggregate + aggregates: + - counter_agg() +topics: +- hyperfunctions +products: +- cloud +- mst +- self_hosted +--- + + Since 1.3.0 + +This is the first step for performing any aggregate calculations +on counter data. Use `counter_agg` to create an intermediate aggregate +from your data. This intermediate form can then be used +by one or more accessors in this group to compute final results. Optionally, +you can combine multiple intermediate aggregate objects using +[`rollup()`](#rollup) before an accessor is applied. + + +## Samples + +Create a counter aggregate to summarize daily counter data. + +```sql +SELECT + time_bucket('1 day'::interval, ts) as dt, + counter_agg(ts, val) AS cs +FROM foo +WHERE id = 'bar' +GROUP BY time_bucket('1 day'::interval, ts) +``` +## Arguments + +The syntax is: + +```sql +counter_agg( + ts TIMESTAMPTZ, + value DOUBLE PRECISION + [, bounds TSTZRANGE] +) RETURNS CounterSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `ts` | `TIMESTAMPTZ` | - | ✔ | The time at each point | +| `value` | `DOUBLE PRECISION` | - | ✔ | The value of the counter at each point | +| `bounds` | `TSTZRANGE` | - | | The smallest and largest possible times that can be input to this aggregate. Bounds are required for extrapolation, but not for other accessor functions. If you don't specify bounds at aggregate creation time, you can add them later using the [`with_bounds`](#with_bounds) function. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_agg | CounterSummary | The counter aggregate, containing data about the variables in an intermediate form. Pass the aggregate to accessor functions in the counter aggregates API to perform final calculations. Or, pass the aggregate to rollup functions to combine multiple counter aggregates into larger aggregates. 
| diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_zero_time.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_zero_time.mdx new file mode 100644 index 0000000..b0864d4 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/counter_zero_time.mdx @@ -0,0 +1,58 @@ +--- +title: counter_zero_time() +description: Calculate the time when the counter value is predicted to have been zero +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the time when the counter value is predicted to have been zero. This is the x-intercept of the linear fit +between counter value and time. + + +## Samples + +Estimate the time when the counter started + +```sql +SELECT + id, + bucket, + counter_zero_time(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +counter_zero_time( + summary CounterSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_zero_time | TIMESTAMPTZ | The time when the counter value is predicted to have been zero | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/delta.mdx new file mode 100644 index 0000000..ca86142 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/delta.mdx @@ -0,0 +1,56 @@ +--- +title: delta() +description: Calculate the change in a counter from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Get the change in a counter over a time period. This is the simple delta, computed by subtracting the last seen value +from the first, after accounting for resets. + + +## Samples + +Get the change in each counter over the entire time interval in table `foo`. 
+ +```sql +SELECT + id, + delta(summary) +FROM ( + SELECT + id, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id +) t +``` +## Arguments + +The syntax is: + +```sql +delta( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregated created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| delta | DOUBLE PRECISION | The change in the counter over the bucketed interval | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_delta.mdx new file mode 100644 index 0000000..8c21741 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_delta.mdx @@ -0,0 +1,68 @@ +--- +title: extrapolated_delta() +description: Calculate the extrapolated change from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the change in a counter during the time period specified by the bounds +in the counter aggregate. The bounds must be specified for the `extrapolated_delta` +function to work. You can provide them as part of the original [`counter_agg`](#counter_agg) +call, or by using the [`with_bounds`](#with_bounds) function on an existing +counter aggregate. + + +## Samples + +Extrapolate the change in a counter over every 15-minute interval. + +```sql +SELECT + id, + bucket, + extrapolated_delta( + with_bounds( + summary, + toolkit_experimental.time_bucket_range('15 min'::interval, bucket) + ),'prometheus' + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t; +``` +## Arguments + +The syntax is: + +```sql +extrapolated_delta( + summary CounterSummary, + method TEXT +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | +| `method` | `TEXT` | - | ✔ | The extrapolation method to use. Not case-sensitive. The only allowed value is `prometheus`, for the Prometheus extrapolation protocol. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| extrapolated_delta | DOUBLE PRECISION | The extrapolated change in the counter over the time period of the counter aggregate. 
| diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_rate.mdx new file mode 100644 index 0000000..99ef2f9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/extrapolated_rate.mdx @@ -0,0 +1,66 @@ +--- +title: extrapolated_rate() +description: Calculate the extrapolated rate of change from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the rate of change of a counter during the time period specified by the bounds +in the counter aggregate. The bounds must be specified for the `extrapolated_rate` +function to work. You can provide them as part of the original [`counter_agg`](#counter_agg) +call, or by using the [`with_bounds`](#with_bounds) function on an existing +counter aggregate. + + +## Samples + +```sql +SELECT + id, + bucket, + extrapolated_rate( + with_bounds( + summary, + toolkit_experimental.time_bucket_range('15 min'::interval, bucket) + ),'prometheus' + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t; +``` +## Arguments + +The syntax is: + +```sql +extrapolated_rate( + summary CounterSummary, + method TEXT +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | +| `method` | `TEXT` | - | ✔ | The extrapolation method to use. Not case-sensitive. The only allowed value is `prometheus`, for the Prometheus extrapolation protocol. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| extrapolated_rate | DOUBLE PRECISION | The extrapolated rate of change of the counter over the timer period of the counter aggregate. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_time.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_time.mdx new file mode 100644 index 0000000..36ad45c --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_time.mdx @@ -0,0 +1,57 @@ +--- +title: first_time() +description: Get the first timestamp from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.11.0 + +Get the first timestamp from a counter aggregate. + + +## Samples + +Get the first and last point of each daily counter aggregate. 
+ +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + counter_agg(ts, val) AS cs -- get a CounterSummary + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_time(cs) -- extract the timestamp of the first point in the CounterSummary + last_time(cs) -- extract the timestamp of the last point in the CounterSummary +FROM t; +``` +## Arguments + +The syntax is: + +```sql +first_time( + cs CounterSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `CounterSummary` | - | ✔ | A counter aggregate produced using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| first_time | TIMESTAMPTZ | The timestamp of the first point in the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_val.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_val.mdx new file mode 100644 index 0000000..487d98a --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/first_val.mdx @@ -0,0 +1,57 @@ +--- +title: first_val() +description: Get the first value from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.11.0 + +Get the value of the first point from a counter aggregate. + + +## Samples + +Get the first and last value of each daily counter aggregate. + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + counter_agg(ts, val) AS cs -- get a CounterSummary + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_val(cs) -- extract the value of the first point in the CounterSummary + last_val(cs) -- extract the value of the last point in the CounterSummary +FROM t; +``` +## Arguments + +The syntax is: + +```sql +first_val( + cs CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `CounterSummary` | - | ✔ | A counter aggregate produced using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| first_val | DOUBLE PRECISION | The value of the first point in the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_left.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_left.mdx new file mode 100644 index 0000000..11bdf65 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_left.mdx @@ -0,0 +1,58 @@ +--- +title: idelta_left() +description: Calculate the instantaneous change at the left, or earliest, edge of a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the instantaneous change at the left, or earliest, edge of a counter aggregate. This is equal to the second +value minus the first value, after accounting for resets. + + +## Samples + +Get the instantaneous change at the start of each 15-minute counter aggregate. 
+ +```sql +SELECT + id, + bucket, + idelta_left(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +idelta_left( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_left | DOUBLE PRECISION | The instantaneous delta at the left, or earliest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_right.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_right.mdx new file mode 100644 index 0000000..bca8a42 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/idelta_right.mdx @@ -0,0 +1,58 @@ +--- +title: idelta_right() +description: Calculate the instantaneous change at the right, or latest, edge of a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the instantaneous change at the right, or latest, edge of a counter aggregate. This is equal to the last value +minus the second-last value, after accounting for resets. + + +## Samples + +Get the instantaneous change at the end of each 15-minute counter aggregate. + +```sql +SELECT + id, + bucket, + idelta_right(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +idelta_right( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_right | DOUBLE PRECISION | The instantaneous delta at the right, or latest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/index.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/index.mdx new file mode 100644 index 0000000..d0afc91 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/index.mdx @@ -0,0 +1,137 @@ +--- +title: Counter aggregation overview +sidebarTitle: Overview +description: Functions for analyzing monotonically increasing counter metrics +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Analyze data whose values are designed to monotonically increase, and where any decreases are treated as resets. The +`counter_agg` functions simplify this task, which can be difficult to do in pure SQL. + +If it's possible for your readings to decrease as well as increase, use [`gauge_agg`][gauge_agg] instead. 
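For a quick sense of the reset behavior, here is a minimal, self-contained sketch that uses hypothetical inline readings instead of a real table. The drop from 20 to 5 is treated as a reset, so `delta()` reports the total accumulated increase rather than a negative change:

```sql
-- Hypothetical readings for a single counter; the drop from 20 to 5 is a reset.
SELECT
    delta(counter_agg(ts, val))      AS total_increase, -- 25: (20 - 10) + 15 after the reset
    num_resets(counter_agg(ts, val)) AS resets          -- 1
FROM (
    VALUES
        ('2025-01-01 00:00'::timestamptz, 10.0::double precision),
        ('2025-01-01 00:15'::timestamptz, 20.0),
        ('2025-01-01 00:30'::timestamptz,  5.0), -- counter reset
        ('2025-01-01 00:45'::timestamptz, 15.0)
) AS readings (ts, val);
```

The timestamps and values here are placeholders; the point is only that the decrease is folded into the running total instead of producing a negative delta.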
+ +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Roll up counter aggregates and calculate deltas + +Create daily counter aggregates for a counter with id `bar`: + +```sql +SELECT + date_trunc('day', ts) AS dt, + counter_agg(ts, val) AS counter_summary +FROM foo +WHERE id = 'bar' +GROUP BY date_trunc('day', ts); +``` + +Roll up the daily aggregates to get a counter aggregate that covers all recorded timestamps: + +```sql +WITH t AS ( + SELECT + date_trunc('day', ts) AS dt, + counter_agg(ts, val) AS counter_summary + FROM foo + WHERE id = 'bar' + GROUP BY date_trunc('day', ts) +) +SELECT rollup(counter_summary) AS full_cs +FROM t; +``` + +Calculate the delta, or the difference between the final and first values, from each daily counter aggregate. Also +calculate the fraction of the total delta that happens on each day: + +```sql +WITH t AS ( + SELECT + date_trunc('day', ts) AS dt, + counter_agg(ts, val) AS counter_summary + FROM foo + WHERE id = 'bar' + GROUP BY date_trunc('day', ts) +), q AS ( + SELECT rollup(counter_summary) AS full_cs + FROM t +) +SELECT + dt, + delta(counter_summary), + delta(counter_summary) / (SELECT delta(full_cs) FROM q LIMIT 1) AS normalized +FROM t; +``` + +## Available functions + +### Aggregate +- [`counter_agg()`][counter_agg]: aggregate counter data into an intermediate form for further analysis + +### Accessors +- [`corr()`][corr]: calculate the correlation coefficient from a counter aggregate +- [`counter_zero_time()`][counter_zero_time]: calculate the time when a counter value was zero +- [`delta()`][delta]: calculate the change in a counter's value +- [`extrapolated_delta()`][extrapolated_delta]: estimate the total change in a counter over a time period +- [`extrapolated_rate()`][extrapolated_rate]: estimate the average rate of change over a time period +- [`first_time()`][first_time]: get the timestamp of the first point in a counter aggregate +- [`first_val()`][first_val]: get the value of the first point in a counter aggregate +- [`idelta_left()`][idelta_left]: calculate the instantaneous change at the left boundary +- [`idelta_right()`][idelta_right]: calculate the instantaneous change at the right boundary +- [`intercept()`][intercept]: calculate the y-intercept from a counter aggregate +- [`interpolated_delta()`][interpolated_delta]: calculate the change over a specific time range with interpolation +- [`interpolated_rate()`][interpolated_rate]: calculate the rate of change over a specific time range with interpolation +- [`irate_left()`][irate_left]: calculate the instantaneous rate at the left boundary +- [`irate_right()`][irate_right]: calculate the instantaneous rate at the right boundary +- [`last_time()`][last_time]: get the timestamp of the last point in a counter aggregate +- [`last_val()`][last_val]: get the value of the last point in a counter aggregate +- [`num_changes()`][num_changes]: get the number of times the counter changed value +- [`num_elements()`][num_elements]: get the number of points in a counter aggregate +- [`num_resets()`][num_resets]: get the number of counter resets +- [`rate()`][rate]: calculate the average rate of change +- [`slope()`][slope]: calculate the slope from a counter aggregate +- [`time_delta()`][time_delta]: calculate the elapsed time in a counter aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple counter aggregates + +### Mutator +- [`with_bounds()`][with_bounds]: add time bounds to a
counter aggregate for extrapolation + +[two-step-aggregation]: #two-step-aggregation +[gauge_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/index +[counter_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/counter_agg +[corr]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/corr +[counter_zero_time]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/counter_zero_time +[delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/delta +[extrapolated_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/extrapolated_delta +[extrapolated_rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/extrapolated_rate +[first_time]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/first_time +[first_val]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/first_val +[idelta_left]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/idelta_left +[idelta_right]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/idelta_right +[intercept]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/intercept +[interpolated_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/interpolated_delta +[interpolated_rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/interpolated_rate +[irate_left]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/irate_left +[irate_right]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/irate_right +[last_time]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/last_time +[last_val]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/last_val +[num_changes]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/num_changes +[num_elements]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/num_elements +[num_resets]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/num_resets +[rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/rate +[slope]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/slope +[time_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/time_delta +[rollup]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/rollup +[with_bounds]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/with_bounds \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/intercept.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/intercept.mdx new file mode 100644 index 0000000..62f7868 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/intercept.mdx @@ -0,0 +1,59 @@ +--- +title: intercept() +description: Calculate the y-intercept from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the y-intercept of a linear least-squares fit between counter value and time. This corresponds to the +projected value at the Postgres epoch `(2000-01-01 00:00:00+00)`. 
You can use the y-intercept with the slope to plot a +best-fit line. + + +## Samples + +Calculate the y-intercept of the linear fit for each 15-minute counter aggregate. + +```sql +SELECT + id, + bucket, + intercept(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +intercept( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| intercept | DOUBLE PRECISION | The y-intercept of the linear least-squares fit | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_delta.mdx new file mode 100644 index 0000000..04ad953 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_delta.mdx @@ -0,0 +1,74 @@ +--- +title: interpolated_delta() +description: Calculate the change in a counter, interpolating values at boundaries as needed +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.14.0 + +Calculate the change in a counter over the time period covered by a counter aggregate. Data points at the exact +boundaries of the time period aren't needed. The function interpolates the counter values at the boundaries from +adjacent counter aggregates if needed. + + +## Samples + +Calculate the counter delta for each 15-minute interval, using interpolation to get the values at the interval +boundaries if they don't exist in the data. + +```sql +SELECT + id, + bucket, + interpolated_delta( + summary, + bucket, + '15 min', + LAG(summary) OVER (PARTITION BY id ORDER by bucket), + LEAD(summary) OVER (PARTITION BY id ORDER by bucket) + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +interpolated_delta( + summary CounterSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev CounterSummary] + [, next CounterSummary] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | +| `start` | `TIMESTAMPTZ` | - | ✔ | The start of the time period to compute the delta over | +| `interval` | `INTERVAL` | - | ✔ | The length of the time period to compute the delta over | +| `prev` | `CounterSummary` | - | | The counter aggregate from the previous interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `summary` is used as the start of the interval. | +| `next` | `CounterSummary` | - | | The counter aggregate from the next interval, used to interpolate the value at `start + interval`. If `NULL`, the last timestamp in `summary` is used as the end of the interval. 
| + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_delta | DOUBLE PRECISION | The delta between the first and last points of the time interval. If exact values are missing in the raw data for the first and last points, these values are interpolated linearly from the neighboring counter aggregates. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_rate.mdx new file mode 100644 index 0000000..14ddd14 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/interpolated_rate.mdx @@ -0,0 +1,74 @@ +--- +title: interpolated_rate() +description: Calculate the rate of change in a counter, interpolating values at boundaries as needed +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.14.0 + +Calculate the rate of change in a counter over a time period. Data points at the exact boundaries of the time period +aren't needed. The function interpolates the counter values at the boundaries from adjacent counter aggregates if +needed. + + +## Samples + +Calculate the per-second rate of change for each 15-minute interval, using interpolation to get the values at the +interval boundaries if they don't exist in the data. + +```sql +SELECT + id, + bucket, + interpolated_rate( + summary, + bucket, + '15 min', + LAG(summary) OVER (PARTITION BY id ORDER by bucket), + LEAD(summary) OVER (PARTITION BY id ORDER by bucket) + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +interpolated_rate( + summary CounterSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev CounterSummary] + [, next CounterSummary] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | +| `start` | `TIMESTAMPTZ` | - | ✔ | The start of the time period to compute the rate over | +| `interval` | `INTERVAL` | - | ✔ | The length of the time period to compute the rate over | +| `prev` | `CounterSummary` | - | | The counter aggregate from the previous interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `summary` is used as the start of the interval. | +| `next` | `CounterSummary` | - | | The counter aggregate from the next interval, used to interpolate the value at `start + interval`. If `NULL`, the last timestamp in `summary` is used as the end of the interval. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_rate | DOUBLE PRECISION | The per-second rate of change of the counter between the specified bounds. If exact values are missing in the raw data for the first and last points, these values are interpolated linearly from the neighboring counter aggregates. 
| diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_left.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_left.mdx new file mode 100644 index 0000000..6ae007b --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_left.mdx @@ -0,0 +1,59 @@ +--- +title: irate_left() +description: Calculate the instantaneous rate of change at the left, or earliest, edge of a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the instantaneous rate of change at the left, or earliest, edge of a counter aggregate. This is equal to the +second value minus the first value, divided by the elapsed time between the two points, after accounting for resets. This +calculation is useful for fast-moving counters. + + +## Samples + +Get the instantaneous rate of change at the start of each 15-minute counter aggregate. + +```sql +SELECT + id, + bucket, + irate_left(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +irate_left( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| irate_left | DOUBLE PRECISION | The instantaneous rate of change at the left, or earliest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_right.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_right.mdx new file mode 100644 index 0000000..a604ad8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/irate_right.mdx @@ -0,0 +1,59 @@ +--- +title: irate_right() +description: Calculate the instantaneous rate of change at the right, or latest, edge of a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the instantaneous rate of change at the right, or latest, edge of a counter aggregate. This is equal to the +last value minus the second-last value, divided by the elapsed time between the two points, after accounting for resets. +This calculation is useful for fast-moving counters. + + +## Samples + +Get the instantaneous rate of change at the end of each 15-minute counter aggregate.
+ +```sql +SELECT + id, + bucket, + irate_right(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +irate_right( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| irate_right | DOUBLE PRECISION | The instantaneous rate of change at the right, or latest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_time.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_time.mdx new file mode 100644 index 0000000..4bf0500 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_time.mdx @@ -0,0 +1,57 @@ +--- +title: last_time() +description: Get the last timestamp from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.11.0 + +Get the last timestamp from a counter aggregate. + + +## Samples + +Get the first and last point of each daily counter aggregate. + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + counter_agg(ts, val) AS cs -- get a CounterSummary + FROM foo + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_time(cs), -- extract the timestamp of the first point in the CounterSummary + last_time(cs) -- extract the timestamp of the last point in the CounterSummary +FROM t; +``` +## Arguments + +The syntax is: + +```sql +last_time( + cs CounterSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `CounterSummary` | - | ✔ | A counter aggregate produced using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| last_time | TIMESTAMPTZ | The timestamp of the last point in the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_val.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_val.mdx new file mode 100644 index 0000000..ca56b9c --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/last_val.mdx @@ -0,0 +1,57 @@ +--- +title: last_val() +description: Get the last value from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.11.0 + +Get the value of the last point from a counter aggregate. + + +## Samples + +Get the first and last value of each daily counter aggregate.
+ +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + counter_agg(ts, val) AS cs -- get a CounterSummary + FROM foo + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_val(cs), -- extract the value of the first point in the CounterSummary + last_val(cs) -- extract the value of the last point in the CounterSummary +FROM t; +``` +## Arguments + +The syntax is: + +```sql +last_val( + cs CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `CounterSummary` | - | ✔ | A counter aggregate produced using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| last_val | DOUBLE PRECISION | The value of the last point in the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_changes.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_changes.mdx new file mode 100644 index 0000000..18de4d3 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_changes.mdx @@ -0,0 +1,58 @@ +--- +title: num_changes() +description: Get the number of times a counter changed from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Get the number of times the counter changed during the period summarized by the counter aggregate. Any change is +counted, including resets to zero. + + +## Samples + +Get the number of times the counter changed over each 15-minute interval. + +```sql +SELECT + id, + bucket, + num_changes(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +num_changes( + summary CounterSummary +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_changes | BIGINT | The number of times the counter changed | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_elements.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_elements.mdx new file mode 100644 index 0000000..da980fa --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_elements.mdx @@ -0,0 +1,57 @@ +--- +title: num_elements() +description: Get the number of points with distinct timestamps from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Get the number of points with distinct timestamps from a counter aggregate. Duplicate timestamps are ignored. + + +## Samples + +Get the number of points for each 15-minute counter aggregate.
+ +```sql +SELECT + id, + bucket, + num_elements(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +num_elements( + summary CounterSummary +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_elements | BIGINT | The number of points with distinct timestamps | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_resets.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_resets.mdx new file mode 100644 index 0000000..f93b507 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/num_resets.mdx @@ -0,0 +1,57 @@ +--- +title: num_resets() +description: Get the number of counter resets from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Get the number of times the counter is reset. + + +## Samples + +Get the number of counter resets for each 15-minute counter aggregate. + +```sql +SELECT + id, + bucket, + num_resets(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +num_resets( + summary CounterSummary +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_resets | BIGINT | The number of resets within the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rate.mdx new file mode 100644 index 0000000..0846bf6 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rate.mdx @@ -0,0 +1,56 @@ +--- +title: rate() +description: Calculate the rate of change from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the rate of change of the counter. This is the simple rate, equal to the last value minus the first value, +divided by the time elapsed, after accounting for resets. + + +## Samples + +Get the rate of change per `id` over the entire recorded interval. 
+ +```sql +SELECT + id, + rate(summary) +FROM ( + SELECT + id, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id +) t +``` +## Arguments + +The syntax is: + +```sql +rate( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rate | DOUBLE PRECISION | The rate of change of the counter | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rollup.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rollup.mdx new file mode 100644 index 0000000..0ac1a5a --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/rollup.mdx @@ -0,0 +1,43 @@ +--- +title: rollup() +description: Combine multiple counter aggregates +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: rollup + aggregates: + - counter_agg() +topics: + - hyperfunctions +products: + - cloud + - mst + - self_hosted +--- + + Since 1.3.0 + +This function combines multiple counter aggregates into one. This can be used to combine aggregates from adjacent +intervals into one larger interval, such as rolling daily aggregates into a weekly or monthly aggregate. + + +## Arguments + +The syntax is: + +```sql +rollup( + cs CounterSummary +) RETURNS CounterSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_agg | CounterSummary | A new counter aggregate created by combining the input counter aggregates | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/slope.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/slope.mdx new file mode 100644 index 0000000..81b6b15 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/slope.mdx @@ -0,0 +1,57 @@ +--- +title: slope() +description: Calculate the slope from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Calculate the slope of the linear least-squares fit for a counter aggregate. The dependent variable is the counter value, adjusted for resets, and the independent variable is time. Time is always in seconds, so the slope estimates the per-second rate of change. This gives a result similar to [`rate`](#rate), but it can more accurately reflect the usual counter behavior in the presence of infrequent, abnormally large changes. + + +## Samples + +Calculate the counter slope per `id` and per 15-minute interval. 
+ +```sql +SELECT + id, + bucket, + slope(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +slope( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| slope | DOUBLE PRECISION | The slope of the linear least-squares fit | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/time_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/time_delta.mdx new file mode 100644 index 0000000..d13db83 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/time_delta.mdx @@ -0,0 +1,59 @@ +--- +title: time_delta() +description: Calculate the difference between the first and last times from a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - counter_agg() +topics: +- hyperfunctions +--- + + Since 1.3.0 + +Get the number of seconds between the first and last measurements in a counter aggregate + + +## Samples + +Get the time difference between the first and last counter readings for each 15-minute interval. Note this difference +isn't necessarily equal to `15 minutes * 60 seconds / minute`, because the first and last readings might not fall +exactly on the interval boundaries. + +```sql +SELECT + id, + bucket, + time_delta(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +time_delta( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| time_delta | DOUBLE PRECISION | The difference, in seconds, between the first and last times | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/with_bounds.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/with_bounds.mdx new file mode 100644 index 0000000..d278ed1 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/counter_agg/with_bounds.mdx @@ -0,0 +1,70 @@ +--- +title: with_bounds() +description: Add bounds to a counter aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: mutator + aggregates: + - counter_agg() +topics: + - hyperfunctions +products: + - cloud + - mst + - self_hosted +--- + + Since 1.3.0 + +Add time bounds to an already-computed counter aggregate. Bounds are necessary to use extrapolation accessors on the +aggregate. + + +## Samples + +Create a counter aggregate for each `id` and each 15-minute interval. Then add bounds to the counter aggregate, so you +can calculate the extrapolated rate. 
+ +```sql +SELECT + id, + bucket, + extrapolated_rate( + with_bounds( + summary, + time_bucket_range('15 min'::interval, bucket) + ),'prometheus' + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + counter_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` + +## Arguments + +The syntax is: + +```sql +with_bounds( + summary CounterSummary, + bounds TSTZRANGE +) RETURNS CounterSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `CounterSummary` | - | ✔ | A counter aggregate created using [`counter_agg`](#counter_agg) | +| `bounds` | `TSTZRANGE` | - | ✔ | A range of `timestamptz` giving the smallest and largest allowed times in the counter aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_agg | CounterSummary | A new counter aggregate with the bounds applied | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/corr.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/corr.mdx new file mode 100644 index 0000000..9ce4101 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/corr.mdx @@ -0,0 +1,58 @@ +--- +title: corr() +description: Calculate the correlation coefficient from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the correlation coefficient from a gauge aggregate. The calculation uses a linear least-squares fit, and +returns a value between -1.0 and 1.0, from the strongest possible negative correlation, through no correlation, to the strongest possible positive correlation. + + +## Samples + +Calculate the correlation coefficient to determine the goodness of a linear fit between gauge value and time. + +```sql +SELECT + id, + bucket, + corr(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +corr( + summary GaugeSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| corr | DOUBLE PRECISION | The correlation coefficient calculated with time as the independent variable and gauge value as the dependent variable. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/delta.mdx new file mode 100644 index 0000000..24da77a --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/delta.mdx @@ -0,0 +1,56 @@ +--- +title: delta() +description: Calculate the change in a gauge from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Get the change in a gauge over a time period. This is the simple delta, computed by subtracting the first seen value from +the last. + + +## Samples + +Get the change in each gauge over the entire time interval in table `foo`.
+ +```sql +SELECT + id, + delta(summary) +FROM ( + SELECT + id, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id +) t +``` +## Arguments + +The syntax is: + +```sql +delta( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregated created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| delta | DOUBLE PRECISION | The change in the counter over the bucketed interval | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_delta.mdx new file mode 100644 index 0000000..8e77697 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_delta.mdx @@ -0,0 +1,68 @@ +--- +title: extrapolated_delta() +description: Calculate the extrapolated change from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the change in a gauge during the time period specified by the bounds +in the gauge aggregate. The bounds must be specified for the `extrapolated_delta` +function to work. You can provide them as part of the original [`gauge_agg`](#gauge_agg) +call, or by using the [`with_bounds`](#with_bounds) function on an existing +gauge aggregate. + + +## Samples + +Extrapolate the change in a gauge over every 15-minute interval. + +```sql +SELECT + id, + bucket, + extrapolated_delta( + with_bounds( + summary, + toolkit_experimental.time_bucket_range('15 min'::interval, bucket) + ),'prometheus' + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t; +``` +## Arguments + +The syntax is: + +```sql +extrapolated_delta( + summary CounterSummary, + method TEXT +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | +| `method` | `TEXT` | - | ✔ | The extrapolation method to use. Not case-sensitive. The only allowed value is `prometheus`, for the Prometheus extrapolation protocol. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| extrapolated_delta | DOUBLE PRECISION | The extrapolated change in the counter over the time period of the counter aggregate. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_rate.mdx new file mode 100644 index 0000000..d11e2d8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/extrapolated_rate.mdx @@ -0,0 +1,66 @@ +--- +title: extrapolated_rate() +description: Calculate the extrapolated rate of change from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the rate of change of a gauge during the time period specified by the bounds +in the gauge aggregate. 
The bounds must be specified for the `extrapolated_rate` +function to work. You can provide them as part of the original [`gauge_agg`](#gauge_agg) +call, or by using the [`with_bounds`](#with_bounds) function on an existing +gauge aggregate. + + +## Samples + +```sql +SELECT + id, + bucket, + extrapolated_rate( + with_bounds( + summary, + toolkit_experimental.time_bucket_range('15 min'::interval, bucket) + ),'prometheus' + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t; +``` +## Arguments + +The syntax is: + +```sql +extrapolated_rate( + summary CounterSummary, + method TEXT +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | +| `method` | `TEXT` | - | ✔ | The extrapolation method to use. Not case-sensitive. The only allowed value is `prometheus`, for the Prometheus extrapolation protocol. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| extrapolated_rate | DOUBLE PRECISION | The extrapolated rate of change of the counter over the timer period of the counter aggregate. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_agg.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_agg.mdx new file mode 100644 index 0000000..c335325 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_agg.mdx @@ -0,0 +1,64 @@ +--- +title: gauge_agg() +description: Aggregate gauge data into an intermediate form for further analysis +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: aggregate + aggregates: + - gauge_agg() +topics: +- hyperfunctions +products: +- cloud +- mst +- self_hosted +--- + + Early access 1.6.0 + +This is the first step for performing any aggregate calculations +on gauge data. Use `gauge_agg` to create an intermediate aggregate +from your data. This intermediate form can then be used +by one or more accessors in this group to compute final results. Optionally, +you can combine multiple intermediate aggregate objects with +[`rollup()`](#rollup) before an accessor is applied. + + +## Samples + +Create a gauge aggregate to summarize daily gauge data. + +```sql +SELECT + time_bucket('1 day'::interval, ts) as dt, + gauge_agg(ts, val) AS cs +FROM foo +WHERE id = 'bar' +GROUP BY time_bucket('1 day'::interval, ts) +``` +## Arguments + +The syntax is: + +```sql +gauge_agg( + ts TIMESTAMPTZ, + value DOUBLE PRECISION + [, bounds TSTZRANGE] +) RETURNS GaugeSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `ts` | `TIMESTAMPTZ` | - | ✔ | The time at each point | +| `value` | `DOUBLE PRECISION` | - | ✔ | The value of the gauge at each point | +| `bounds` | `TSTZRANGE` | - | | The smallest and largest possible times that can be input to this aggregate. Bounds are required for extrapolation, but not for other accessor functions. If you don't specify bounds at aggregate creation time, you can add them later using the [`with_bounds`](#with_bounds) function. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| gauge_agg | GaugeSummary | The gauge aggregate, containing data about the variables in an intermediate form. 
Pass the aggregate to accessor functions in the gauge aggregates API to perform final calculations. Or, pass the aggregate to rollup functions to combine multiple gauge aggregates into larger aggregates. | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_zero_time.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_zero_time.mdx new file mode 100644 index 0000000..f8620d9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/gauge_zero_time.mdx @@ -0,0 +1,58 @@ +--- +title: gauge_zero_time() +description: Calculate the time when the gauge value is predicted to have been zero +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the time when the gauge value is modeled to have been zero. This is the x-intercept of the linear fit between +gauge value and time. + + +## Samples + +Estimate the time when the gauge started + +```sql +SELECT + id, + bucket, + gauge_zero_time(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +gauge_zero_time( + summary GaugeSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| gauge_zero_time | TIMESTAMPTZ | The time when the gauge value is predicted to have been zero | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_left.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_left.mdx new file mode 100644 index 0000000..72f8bc5 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_left.mdx @@ -0,0 +1,58 @@ +--- +title: idelta_left() +description: Calculate the instantaneous change at the left, or earliest, edge of a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the instantaneous change at the left, or earliest, edge of a gauge aggregate. This is equal to the second +value minus the first value. + + +## Samples + +Get the instantaneous change at the start of each 15-minute gauge aggregate. 
+ +```sql +SELECT + id, + bucket, + idelta_left(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +idelta_left( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_left | DOUBLE PRECISION | The instantaneous delta at the left, or earliest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_right.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_right.mdx new file mode 100644 index 0000000..fe608aa --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/idelta_right.mdx @@ -0,0 +1,58 @@ +--- +title: idelta_right() +description: Calculate the instantaneous change at the right, or latest, edge of a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the instantaneous change at the right, or latest, edge of a gauge aggregate. This is equal to the last value +minus the second-last value. + + +## Samples + +Get the instantaneous change at the end of each 15-minute gauge aggregate. + +```sql +SELECT + id, + bucket, + idelta_right(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +idelta_right( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_right | DOUBLE PRECISION | The instantaneous delta at the right, or latest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/index.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/index.mdx new file mode 100644 index 0000000..53a053a --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/index.mdx @@ -0,0 +1,108 @@ +--- +title: Gauge aggregation overview +sidebarTitle: Overview +description: Functions for analyzing gauge metrics that can increase or decrease +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access 1.6.0 + +Analyze data coming from gauges. Unlike counters, gauges can decrease as well as increase. + +If your value can only increase, use [`counter_agg`][counter_agg] instead to appropriately account for resets. 
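To see the difference in behavior, here is a minimal, self-contained sketch with hypothetical inline readings: because a gauge may legitimately decrease, the drop from 20 to 5 is kept as a real decrease and `delta()` is simply the last value minus the first:

```sql
-- Hypothetical readings for a single gauge; the drop from 20 to 5 is not a reset.
SELECT
    delta(gauge_agg(ts, val)) AS net_change -- 5: last value (15) minus first value (10)
FROM (
    VALUES
        ('2025-01-01 00:00'::timestamptz, 10.0::double precision),
        ('2025-01-01 00:15'::timestamptz, 20.0),
        ('2025-01-01 00:30'::timestamptz,  5.0), -- a real decrease, not a reset
        ('2025-01-01 00:45'::timestamptz, 15.0)
) AS readings (ts, val);
```

With the same readings, `counter_agg` would have treated the decrease as a reset and reported a much larger delta.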
+ +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Analyze gauge metrics + +Create hourly gauge aggregates and calculate changes: + +```sql +SELECT + time_bucket('1 hour'::interval, ts) AS hour, + gauge_agg(ts, temperature) AS gauge_summary +FROM sensors +WHERE location = 'warehouse' +GROUP BY hour; +``` + +Calculate the delta and rate of change for gauge values: + +```sql +WITH hourly AS ( + SELECT + time_bucket('1 hour'::interval, ts) AS hour, + gauge_agg(ts, temperature) AS gauge_summary + FROM sensors + WHERE location = 'warehouse' + GROUP BY hour +) +SELECT + hour, + delta(gauge_summary) AS temp_change, + rate(gauge_summary) AS temp_change_rate +FROM hourly +ORDER BY hour; +``` + +## Available functions + +### Aggregate +- [`gauge_agg()`][gauge_agg]: aggregate gauge data into an intermediate form for further analysis + +### Accessors +- [`corr()`][corr]: calculate the correlation coefficient from a gauge aggregate +- [`delta()`][delta]: calculate the change in a gauge's value +- [`extrapolated_delta()`][extrapolated_delta]: estimate the total change in a gauge over a time period +- [`extrapolated_rate()`][extrapolated_rate]: estimate the average rate of change over a time period +- [`gauge_zero_time()`][gauge_zero_time]: calculate the time when a gauge value was zero +- [`idelta_left()`][idelta_left]: calculate the instantaneous change at the left boundary +- [`idelta_right()`][idelta_right]: calculate the instantaneous change at the right boundary +- [`intercept()`][intercept]: calculate the y-intercept from a gauge aggregate +- [`interpolated_delta()`][interpolated_delta]: calculate the change over a specific time range with interpolation +- [`interpolated_rate()`][interpolated_rate]: calculate the rate of change over a specific time range with interpolation +- [`irate_left()`][irate_left]: calculate the instantaneous rate at the left boundary +- [`irate_right()`][irate_right]: calculate the instantaneous rate at the right boundary +- [`num_changes()`][num_changes]: get the number of times the gauge changed value +- [`num_elements()`][num_elements]: get the number of points in a gauge aggregate +- [`rate()`][rate]: calculate the average rate of change +- [`slope()`][slope]: calculate the slope from a gauge aggregate +- [`time_delta()`][time_delta]: calculate the elapsed time in a gauge aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple gauge aggregates + +### Mutator +- [`with_bounds()`][with_bounds]: add time bounds to a gauge aggregate for extrapolation + +[two-step-aggregation]: #two-step-aggregation +[counter_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/index +[gauge_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/gauge_agg +[corr]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/corr +[delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/delta +[extrapolated_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/extrapolated_delta +[extrapolated_rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/extrapolated_rate +[gauge_zero_time]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/gauge_zero_time +[idelta_left]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/idelta_left +[idelta_right]: 
/api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/idelta_right +[intercept]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/intercept +[interpolated_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/interpolated_delta +[interpolated_rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/interpolated_rate +[irate_left]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/irate_left +[irate_right]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/irate_right +[num_changes]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/num_changes +[num_elements]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/num_elements +[rate]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/rate +[slope]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/slope +[time_delta]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/time_delta +[rollup]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/rollup +[with_bounds]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/with_bounds \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/intercept.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/intercept.mdx new file mode 100644 index 0000000..ef643b2 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/intercept.mdx @@ -0,0 +1,59 @@ +--- +title: intercept() +description: Calculate the y-intercept from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the y-intercept of a linear least-squares fit between gauge value and time. This corresponds to the projected +value at the Postgres epoch `(2000-01-01 00:00:00+00)`. You can use the y-intercept with the slope to plot a best-fit +line. + + +## Samples + +Calculate the y-intercept of the linear fit for each 15-minute gauge aggregate. 
+ +```sql +SELECT + id, + bucket, + intercept(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +intercept( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| intercept | DOUBLE PRECISION | The y-intercept of the linear least-squares fit | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_delta.mdx new file mode 100644 index 0000000..4d5c71c --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_delta.mdx @@ -0,0 +1,74 @@ +--- +title: interpolated_delta() +description: Calculate the change in a gauge, interpolating values at boundaries as needed +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.8.0 + +Calculate the change in a gauge over the time period covered by a gauge aggregate. Data points at the exact boundaries +of the time period aren't needed. The function interpolates the gauge values at the boundaries from adjacent gauge +aggregates if needed. + + +## Samples + +Calculate the gauge delta for each 15-minute interval, using interpolation to get the values at the interval boundaries +if they don't exist in the data. + +```sql +SELECT + id, + bucket, + interpolated_delta( + summary, + bucket, + '15 min', + LAG(summary) OVER (PARTITION BY id ORDER by bucket), + LEAD(summary) OVER (PARTITION BY id ORDER by bucket) + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +interpolated_delta( + summary CounterSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev CounterSummary] + [, next CounterSummary] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | +| `start` | `TIMESTAMPTZ` | - | ✔ | The start of the time period to compute the delta over | +| `interval` | `INTERVAL` | - | ✔ | The length of the time period to compute the delta over | +| `prev` | `GaugeSummary` | - | | The gauge aggregate from the previous interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `summary` is used as the start of the interval. | +| `next` | `GaugeSummary` | - | | The gauge aggregate from the next interval, used to interpolate the value at `start + interval`. If `NULL`, the last timestamp in `summary` is used as the end of the interval. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_delta | DOUBLE PRECISION | The delta between the first and last points of the time interval. If exact values are missing in the raw data for the first and last points, these values are interpolated linearly from the neighboring counter aggregates. 
| diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_rate.mdx new file mode 100644 index 0000000..2e9a749 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/interpolated_rate.mdx @@ -0,0 +1,73 @@ +--- +title: interpolated_rate() +description: Calculate the rate of change in a gauge, interpolating values at boundaries as needed +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.8.0 + +Calculate the rate of change in a gauge over a time period. Data points at the exact boundaries of the time period +aren't needed. The function interpolates the gauge values at the boundaries from adjacent gauge aggregates if needed. + + +## Samples + +Calculate the per-second rate of change for each 15-minute interval, using interpolation to get the values at the +interval boundaries if they don't exist in the data. + +```sql +SELECT + id, + bucket, + interpolated_rate( + summary, + bucket, + '15 min', + LAG(summary) OVER (PARTITION BY id ORDER by bucket), + LEAD(summary) OVER (PARTITION BY id ORDER by bucket) + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +interpolated_rate( + summary CounterSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev CounterSummary] + [, next CounterSummary] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | +| `start` | `TIMESTAMPTZ` | - | ✔ | The start of the time period to compute the rate over | +| `interval` | `INTERVAL` | - | ✔ | The length of the time period to compute the rate over | +| `prev` | `GaugeSummary` | - | | The gauge aggregate from the previous interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `summary` is used as the start of the interval. | +| `next` | `GaugeSummary` | - | | The gauge aggregate from the next interval, used to interpolate the value at `start + interval`. If `NULL`, the last timestamp in `summary` is used as the end of the interval. | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_rate | DOUBLE PRECISION | The per-second rate of change of the counter between the specified bounds. If exact values are missing in the raw data for the first and last points, these values are interpolated linearly from the neighboring counter aggregates. 
| diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_left.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_left.mdx new file mode 100644 index 0000000..7c4944a --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_left.mdx @@ -0,0 +1,59 @@ +--- +title: irate_left() +description: Calculate the instantaneous rate of change at the left, or earliest, edge of a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the instantaneous rate of change at the left, or earliest, edge of a gauge aggregate. This is equal to the +second value minus the first value, divided by the time lapse between the two points. This calculation is useful for +fast-moving gauges. + + +## Samples + +Get the instantaneous rate of change at the start of each 15-minute gauge aggregate. + +```sql +SELECT + id, + bucket, + irate_left(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +irate_left( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_left | DOUBLE PRECISION | The instantaneous rate of change at the left, or earliest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_right.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_right.mdx new file mode 100644 index 0000000..14efd5d --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/irate_right.mdx @@ -0,0 +1,59 @@ +--- +title: irate_right() +description: Calculate the instantaneous rate of change at the right, or latest, edge of a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the instantaneous rate of change at the right, or latest, edge of a gauge aggregate. This is equal to the last +value minus the second-last value, divided by the time lapse between the two points. This calculation is useful for +fast-moving gauges. + + +## Samples + +Get the instantaneous rate of change at the end of each 15-minute gauge aggregate. 
+ +```sql +SELECT + id, + bucket, + irate_right(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +irate_right( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| idelta_right | DOUBLE PRECISION | The instantaneous rate of change at the right, or latest, edge of the counter aggregate | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_changes.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_changes.mdx new file mode 100644 index 0000000..b5a89df --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_changes.mdx @@ -0,0 +1,57 @@ +--- +title: num_changes() +description: Get the number of times a gauge changed from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Get the number of times the gauge changed during the period summarized by the gauge aggregate. + + +## Samples + +Get the number of times the gauge changed over each 15-minute interval. + +```sql +SELECT + id, + bucket, + num_changes(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +num_changes( + summary CounterSummary +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge summary created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_changes | BIGINT | The number of times the counter changed | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_elements.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_elements.mdx new file mode 100644 index 0000000..47894d6 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/num_elements.mdx @@ -0,0 +1,57 @@ +--- +title: num_elements() +description: Get the number of points with distinct timestamps from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Get the number of points with distinct timestamps from a gauge aggregate. Duplicate timestamps are ignored. + + +## Samples + +Get the number of points for each 15-minute gauge aggregate. 
+ +```sql +SELECT + id, + bucket, + num_elements(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +num_elements( + summary CounterSummary +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_elements | BIGINT | The number of points with distinct timestamps | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rate.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rate.mdx new file mode 100644 index 0000000..ef4d36e --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rate.mdx @@ -0,0 +1,56 @@ +--- +title: rate() +description: Calculate the rate of change from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the rate of change of the gauge. This is the simple rate, equal to the last value minus the first value, +divided by the time elapsed. + + +## Samples + +Get the rate of change per `id` over the entire recorded interval. + +```sql +SELECT + id, + rate(summary) +FROM ( + SELECT + id, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id +) t +``` +## Arguments + +The syntax is: + +```sql +rate( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rate | DOUBLE PRECISION | The rate of change of the counter | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rollup.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rollup.mdx new file mode 100644 index 0000000..6ab846f --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/rollup.mdx @@ -0,0 +1,43 @@ +--- +title: rollup() +description: Combine multiple gauge aggregates +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: rollup + aggregates: + - gauge_agg() +topics: + - hyperfunctions +products: + - cloud + - mst + - self_hosted +--- + + Early access 1.6.0 + +This function combines multiple gauge aggregates into one. This can be used to combine aggregates from adjacent +intervals into one larger interval, such as rolling daily aggregates into a weekly or monthly aggregate. 
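+
+## Samples
+
+A minimal sketch of how `rollup()` is typically used, reusing the hypothetical `foo(id, ts, val)` table from the other
+samples in this section: build 15-minute gauge aggregates, then roll them up into one aggregate per `id` per day.
+Accessors such as `delta()` or `rate()` can then be applied to the rolled-up aggregate.
+
+```sql
+SELECT
+    id,
+    date_trunc('day', bucket) AS day,
+    rollup(summary) AS daily_summary
+FROM (
+    SELECT
+        id,
+        time_bucket('15 min'::interval, ts) AS bucket,
+        gauge_agg(ts, val) AS summary
+    FROM foo
+    GROUP BY id, time_bucket('15 min'::interval, ts)
+) t
+GROUP BY id, date_trunc('day', bucket);
+```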
+ + +## Arguments + +The syntax is: + +```sql +rollup( + cs CounterSummary +) RETURNS CounterSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_agg | CounterSummary | A new counter aggregate created by combining the input counter aggregates | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/slope.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/slope.mdx new file mode 100644 index 0000000..9d17ae5 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/slope.mdx @@ -0,0 +1,57 @@ +--- +title: slope() +description: Calculate the slope from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Calculate the slope of the linear least-squares fit for a gauge aggregate. The dependent variable is the gauge value, and the independent variable is time. Time is always in seconds, so the slope estimates the per-second rate of change. This gives a result similar to [`rate`](#rate), but it can more accurately reflect the usual gauge behavior in the presence of infrequent, abnormally large changes. + + +## Samples + +Calculate the gauge slope per `id` and per 15-minute interval. + +```sql +SELECT + id, + bucket, + slope(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +slope( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| slope | DOUBLE PRECISION | The slope of the linear least-squares fit | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/time_delta.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/time_delta.mdx new file mode 100644 index 0000000..90e65a2 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/time_delta.mdx @@ -0,0 +1,59 @@ +--- +title: time_delta() +description: Calculate the difference between the first and last times from a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: accessor + aggregates: + - gauge_agg() +topics: +- hyperfunctions +--- + + Early access 1.6.0 + +Get the number of seconds between the first and last measurements in a gauge aggregate + + +## Samples + +Get the time difference between the first and last gauge readings for each 15-minute interval. Note this difference +isn't necessarily equal to `15 minutes * 60 seconds / minute`, because the first and last readings might not fall +exactly on the interval boundaries. 
+ +```sql +SELECT + id, + bucket, + time_delta(summary) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +time_delta( + summary CounterSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `summary` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | + + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| time_delta | DOUBLE PRECISION | The difference, in seconds, between the first and last times | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/with_bounds.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/with_bounds.mdx new file mode 100644 index 0000000..9ae3750 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/gauge_agg/with_bounds.mdx @@ -0,0 +1,69 @@ +--- +title: with_bounds() +description: Add bounds to a gauge aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: counters and gauges + type: mutator + aggregates: + - gauge_agg() +topics: + - hyperfunctions +products: + - cloud + - mst + - self_hosted +--- + + Early access 1.6.0 + +Add time bounds to an already-computed gauge aggregate. Bounds are necessary to use extrapolation accessors on the +aggregate. + + +## Samples + +Create a gauge aggregate for each `id` and each 15-minute interval. Then add bounds to the gauge aggregate, so you can +calculate the extrapolated rate. + +```sql +SELECT + id, + bucket, + extrapolated_rate( + with_bounds( + summary, + time_bucket_range('15 min'::interval, bucket) + ) + ) +FROM ( + SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + gauge_agg(ts, val) AS summary + FROM foo + GROUP BY id, time_bucket('15 min'::interval, ts) +) t +``` +## Arguments + +The syntax is: + +```sql +with_bounds( + summary CounterSummary, + bounds TSTZRANGE, +) RETURNS CounterSummary +``` +| Name | Type | Default | Required | Description | +| --- | --- | --- | --- | --- | +| `cs` | `GaugeSummary` | - | ✔ | A gauge aggregate created using [`gauge_agg`](#gauge_agg) | +| `bounds` | `TSTZRANGE` | - | ✔ | A range of `timestamptz` giving the smallest and largest allowed times in the gauge aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| counter_agg | CounterSummary | A new counter aggregate with the bounds applied | diff --git a/api-reference/timescaledb-toolkit/counters-and-gauges/index.mdx b/api-reference/timescaledb-toolkit/counters-and-gauges/index.mdx new file mode 100644 index 0000000..55e9a14 --- /dev/null +++ b/api-reference/timescaledb-toolkit/counters-and-gauges/index.mdx @@ -0,0 +1,106 @@ +--- +title: Counters and gauges overview +sidebarTitle: Overview +description: Functions for analyzing monotonic counters and gauge metrics +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Analyze counter and gauge metrics commonly found in monitoring and observability systems. These functions help you +calculate rates, deltas, and trends from time-series measurements. 
+ +- **Counters**: Analyze data whose values are designed to monotonically increase, with any decreases treated as resets + (for example, request counts, bytes sent) +- **Gauges**: Analyze data that can both increase and decrease (for example, temperature, memory usage, queue depth) + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Calculate counter delta and rate + +Create daily counter aggregates and calculate the change over each day: + +```sql +WITH daily_counters AS ( + SELECT + date_trunc('day', ts) AS day, + counter_agg(ts, requests) AS counter_summary + FROM metrics + WHERE service_id = 'api-server' + GROUP BY day +) +SELECT + day, + delta(counter_summary) AS daily_requests, + rate(counter_summary) AS avg_requests_per_second +FROM daily_counters +ORDER BY day; +``` + +### Calculate gauge statistics + +Analyze gauge metrics to understand trends and variability: + +```sql +WITH hourly_gauges AS ( + SELECT + time_bucket('1 hour'::interval, ts) AS hour, + gauge_agg(ts, memory_usage) AS gauge_summary + FROM system_metrics + WHERE host = 'web-01' + GROUP BY hour +) +SELECT + hour, + delta(gauge_summary) AS memory_change, + rate(gauge_summary) AS memory_change_rate +FROM hourly_gauges +ORDER BY hour; +``` + +### Roll up and extrapolate counter values + +Roll up hourly counter aggregates into daily summaries and extrapolate rates: + +```sql +WITH hourly AS ( + SELECT + time_bucket('1 hour'::interval, ts) AS hour, + counter_agg(ts, bytes_sent, tstzrange('2024-01-01', '2024-01-02')) AS cs + FROM network_metrics + GROUP BY hour +), +daily AS ( + SELECT + date_trunc('day', hour) AS day, + rollup(cs) AS daily_cs + FROM hourly + GROUP BY day +) +SELECT + day, + extrapolated_delta(daily_cs, '1 day'::interval) AS estimated_total_bytes, + extrapolated_rate(daily_cs, '1 day'::interval) AS estimated_avg_bytes_per_sec +FROM daily; +``` + +## Available functions + +### Counter aggregation +- [`counter_agg()`][counter_agg]: analyze monotonically increasing counter metrics + +### Gauge aggregation +- [`gauge_agg()`][gauge_agg]: analyze gauge metrics that can increase or decrease + +[two-step-aggregation]: #two-step-aggregation +[counter_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/counter_agg/index +[gauge_agg]: /api-reference/timescaledb/hyperfunctions/counters-and-gauges/gauge_agg/index diff --git a/api-reference/timescaledb-toolkit/downsampling/asap_smooth.mdx b/api-reference/timescaledb-toolkit/downsampling/asap_smooth.mdx new file mode 100644 index 0000000..4572f59 --- /dev/null +++ b/api-reference/timescaledb-toolkit/downsampling/asap_smooth.mdx @@ -0,0 +1,74 @@ +--- +title: asap_smooth() +description: Downsample a time series using the ASAP smoothing algorithm +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: downsampling + type: function +--- + + Since 1.11.0 + +Downsample your data with the [ASAP smoothing algorithm](http://arxiv.org/pdf/1703.00983). This algorithm preserves the approximate shape and larger trends of the input data, while minimizing the local variance between points. + + +## Samples + +This example uses a table called `metrics`, with columns for `date` and `reading`. The columns contain measurements that +have been accumulated over a large interval of time. 
This example takes that data and provides a smoothed representation +of approximately 10 points, but that still shows any anomalous readings: + +```sql +SET TIME ZONE 'UTC'; +CREATE TABLE metrics(date TIMESTAMPTZ, reading DOUBLE PRECISION); +INSERT INTO metrics +SELECT + '2020-1-1 UTC'::timestamptz + make_interval(hours=>foo), + (5 + 5 * sin(foo / 12.0 * PI())) + FROM generate_series(1,168) foo; + +SELECT * FROM unnest( + (SELECT asap_smooth(date, reading, 8) + FROM metrics) +); +``` + +```text +time | value +------------------------+--------------------- +2020-01-01 01:00:00+00 | 5.3664814565722665 +2020-01-01 21:00:00+00 | 5.949469264090644 +2020-01-02 17:00:00+00 | 5.582987807518377 +2020-01-03 13:00:00+00 | 4.633518543427733 +2020-01-04 09:00:00+00 | 4.050530735909357 +2020-01-05 05:00:00+00 | 4.417012192481623 +2020-01-06 01:00:00+00 | 5.366481456572268 +2020-01-06 21:00:00+00 | 5.949469264090643 +``` + +## Arguments + +The syntax is: + +```sql +asap_smooth( + ts TIMESTAMPTZ, + value DOUBLE PRECISION, + resolution INT +) RETURNS Timevector +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamps for each data point | +| value | DOUBLE PRECISION | - | ✔ | The value at each timestamp | +| resolution | INT | - | ✔ | The approximate number of points to return. Determines the horizontal resolution of the resulting graph. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| asap_smooth | Timevector | An object representing a series of values occurring at set intervals from a starting time. It can be unpacked with `unnest`. For more information, see the documentation on [timevectors](/use-timescale/latest/hyperfunctions/function-pipelines/#timevectors). | + diff --git a/api-reference/timescaledb-toolkit/downsampling/gp_lttb.mdx b/api-reference/timescaledb-toolkit/downsampling/gp_lttb.mdx new file mode 100644 index 0000000..c14856a --- /dev/null +++ b/api-reference/timescaledb-toolkit/downsampling/gp_lttb.mdx @@ -0,0 +1,75 @@ +--- +title: gp_lttb() +description: Downsample a time series using the Largest Triangle Three Buckets method, while preserving gaps in original data +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: downsampling + type: function +--- + + Early access 1.11.0 + +Downsample your data with the [Largest Triangle Three Buckets algorithm](https://github.com/sveinn-steinarsson/flot-downsample), while preserving gaps in the underlying data. This method is a specialization of the [LTTB](/api/latest/hyperfunctions/downsampling/#lttb) algorithm. + + +## Samples + +This example uses a table with raw data generated as a sine wave, and removes a day from the middle of the data. You can +use gap preserving LTTB to downsample the data while keeping the bounds of the missing region. 
+ +```sql +SET TIME ZONE 'UTC'; +CREATE TABLE metrics(date TIMESTAMPTZ, reading DOUBLE PRECISION); +INSERT INTO metrics +SELECT + '2020-1-1 UTC'::timestamptz + make_interval(hours=>foo), + (5 + 5 * sin(foo / 24.0 * PI())) + FROM generate_series(1,168) foo; +DELETE FROM metrics WHERE date BETWEEN '2020-1-4 UTC' AND '2020-1-5 UTC'; + +SELECT time, value +FROM unnest(( + SELECT toolkit_experimental.gp_lttb(date, reading, 8) + FROM metrics)) +``` + +```text +time | value +-----------------------+------------------- +2020-01-01 01:00:00+00 | 5.652630961100257 +2020-01-02 12:00:00+00 | 0 +2020-01-03 23:00:00+00 | 5.652630961100255 +2020-01-05 01:00:00+00 | 5.652630961100259 +2020-01-05 13:00:00+00 | 9.957224306869051 +2020-01-06 12:00:00+00 | 0 +2020-01-07 10:00:00+00 | 9.82962913144534 +2020-01-08 00:00:00+00 | 5.000000000000004 +``` +## Arguments + +The syntax is: + +```sql +gp_lttb( + ts TIMESTAMPTZ, + value DOUBLE PRECISION, + resolution INT + [, gapsize INTERVAL] +) RETURNS Timevector +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamps for each data point | +| value | DOUBLE PRECISION | - | ✔ | The value at each timestamp | +| resolution | INT | - | ✔ | The approximate number of points to return. Determines the horizontal resolution of the resulting graph. | +| gapsize | INTERVAL | - | | Minimum gap size to divide input on | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| gp_lttb | Timevector | An object representing a series of values occurring at set intervals from a starting time. It can be unpacked with `unnest`. For more information, see the documentation on [timevectors](/use-timescale/latest/hyperfunctions/function-pipelines/#timevectors). | + diff --git a/api-reference/timescaledb-toolkit/downsampling/index.mdx b/api-reference/timescaledb-toolkit/downsampling/index.mdx new file mode 100644 index 0000000..ef8e3aa --- /dev/null +++ b/api-reference/timescaledb-toolkit/downsampling/index.mdx @@ -0,0 +1,79 @@ +--- +title: Downsampling overview +sidebarTitle: Overview +description: Functions for downsampling time-series data to visualize trends while preserving visual similarity +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.10.1 + +Downsample your data to visualize trends while preserving fewer data points. Downsampling replaces a set of values with +a much smaller set that is highly representative of the original data. This is particularly useful for graphing +applications where displaying millions of points would be inefficient and visually overwhelming. 
+ +TimescaleDB Toolkit provides two downsampling algorithms: +- **LTTB (Largest Triangle Three Buckets)**: Retains visual similarity between the downsampled data and the original + dataset by selecting points that form the largest triangles +- **ASAP smooth**: Preserves the approximate shape and larger trends while minimizing local variance between points + +## Samples + +### Downsample with LTTB + +Downsample a sine wave dataset from 168 points to approximately 8 points using LTTB: + +```sql +SET TIME ZONE 'UTC'; +CREATE TABLE metrics(date TIMESTAMPTZ, reading DOUBLE PRECISION); +INSERT INTO metrics +SELECT + '2020-1-1 UTC'::timestamptz + make_interval(hours=>foo), + (5 + 5 * sin(foo / 24.0 * PI())) +FROM generate_series(1,168) foo; + +SELECT time, value +FROM unnest(( + SELECT lttb(date, reading, 8) + FROM metrics +)); +``` + +### Downsample with gap preservation + +Use gap-preserving LTTB to downsample data while maintaining boundaries of missing regions: + +```sql +SELECT time, value +FROM unnest(( + SELECT toolkit_experimental.gp_lttb(date, reading, 8, '12 hours'::interval) + FROM metrics +)); +``` + +### Downsample with ASAP smoothing + +Smooth and downsample data to show larger trends while minimizing local variance: + +```sql +SELECT time, value +FROM unnest(( + SELECT asap_smooth(date, reading, 10) + FROM metrics +)); +``` + +## Available functions + +### LTTB algorithm +- [`lttb()`][lttb]: downsample using the Largest Triangle Three Buckets method +- [`gp_lttb()`][gp_lttb]: downsample using LTTB while preserving gaps in data + +### ASAP smoothing +- [`asap_smooth()`][asap_smooth]: downsample using the ASAP smoothing algorithm + +[lttb]: /api-reference/timescaledb/hyperfunctions/downsampling/lttb +[gp_lttb]: /api-reference/timescaledb/hyperfunctions/downsampling/gp_lttb +[asap_smooth]: /api-reference/timescaledb/hyperfunctions/downsampling/asap_smooth diff --git a/api-reference/timescaledb-toolkit/downsampling/lttb.mdx b/api-reference/timescaledb-toolkit/downsampling/lttb.mdx new file mode 100644 index 0000000..c684d11 --- /dev/null +++ b/api-reference/timescaledb-toolkit/downsampling/lttb.mdx @@ -0,0 +1,72 @@ +--- +title: lttb() +description: Downsample a time series using the Largest Triangle Three Buckets method +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: downsampling + type: function +--- + + Since 1.10.1 + +Downsample your data with the [Largest Triangle Three Buckets algorithm](https://github.com/sveinn-steinarsson/flot-downsample). This algorithm tries to retain visual similarity between the downsampled data and the original dataset. + + +## Samples + +This example uses a table with raw data generated as a sine wave. You can use LTTB to dramatically reduce the number of +points while still capturing the peaks and valleys in the data. 
+ +```sql +SET TIME ZONE 'UTC'; +CREATE TABLE metrics(date TIMESTAMPTZ, reading DOUBLE PRECISION); +INSERT INTO metrics +SELECT + '2020-1-1 UTC'::timestamptz + make_interval(hours=>foo), + (5 + 5 * sin(foo / 24.0 * PI())) + FROM generate_series(1,168) foo; + +SELECT time, value +FROM unnest(( + SELECT lttb(date, reading, 8) + FROM metrics)) +``` + +```text +time | value +------------------------+--------------------- +2020-01-01 01:00:00+00 | 5.652630961100257 +2020-01-01 13:00:00+00 | 9.957224306869053 +2020-01-02 11:00:00+00 | 0.04277569313094798 +2020-01-03 11:00:00+00 | 9.957224306869051 +2020-01-04 13:00:00+00 | 0.04277569313094709 +2020-01-05 16:00:00+00 | 9.330127018922191 +2020-01-06 20:00:00+00 | 2.4999999999999996 +2020-01-08 00:00:00+00 | 5.000000000000004 +``` +## Arguments + +The syntax is: + +```sql +lttb( + ts TIMESTAMPTZ, + value DOUBLE PRECISION, + resolution INT +) RETURNS Timevector +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamps for each data point | +| value | DOUBLE PRECISION | - | ✔ | The value at each timestamp | +| resolution | INT | - | ✔ | The approximate number of points to return. Determines the horizontal resolution of the resulting graph. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| lttb | Timevector | An object representing a series of values occurring at set intervals from a starting time. It can be unpacked with `unnest`. For more information, see the documentation on [timevectors](/use-timescale/latest/hyperfunctions/function-pipelines/#timevectors). | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/approx_count.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/approx_count.mdx new file mode 100644 index 0000000..55cf7c0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/approx_count.mdx @@ -0,0 +1,60 @@ +--- +title: approx_count() +description: Estimate the number of times a value appears from a `CountMinSketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: accessor + aggregates: + - count_min_sketch() +--- + + Early access 1.8.0 + +Estimate the number of times a given text value appears in a column. 
+ +## Samples + +Given a table of stock data, estimate how many times the symbol `AAPL` appears: + +```sql +WITH t AS ( + SELECT toolkit_experimental.count_min_sketch(symbol, 0.01, 0.01) AS symbol_sketch + FROM crypto_ticks +) +SELECT toolkit_experimental.approx_count('AAPL', symbol_sketch) +FROM t; +``` + +## Arguments + +The syntax is: + +```sql +approx_count ( + item TEXT, + agg CountMinSketch +) RETURNS INTEGER +``` + + +```sql +approx_count ( + item TEXT, + agg CountMinSketch +) RETURNS INTEGER +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `item` | TEXT | - | ✔ | The value you want to estimate occurrences of | +| `agg` | CountMinSketch | - | ✔ | A `CountMinSketch` object created using [`count_min_sketch`](#count_min_sketch) | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_count | INTEGER | The estimated number of times `item` appeared in the sketch | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/count_min_sketch.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/count_min_sketch.mdx new file mode 100644 index 0000000..aadc178 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/count_min_sketch.mdx @@ -0,0 +1,44 @@ +--- +title: count_min_sketch() +description: Aggregate data into a `CountMinSketch` for approximate counting +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: aggregate + aggregates: + - count_min_sketch() +--- + + Early access 1.8.0 + +Aggregate data into a `CountMinSketch` object, which you can use to estimate the number of times a given item appears in +a column. The sketch produces a biased estimator of frequency. It might overestimate the item count, but it can't +underestimate. + +You can control the relative error and the probability that the estimate falls outside the error bounds. 
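+
+## Samples
+
+A minimal sketch, reusing the hypothetical `crypto_ticks` table and its `symbol` column from the
+[`approx_count`](#approx_count) sample: build a sketch with a 1% relative error and a 1% probability of exceeding that
+error, then read an estimate from it. With these parameters, an estimate can overshoot the true count by at most
+roughly 1% of the total number of values added to the sketch, with about 99% probability.
+
+```sql
+WITH t AS (
+    SELECT toolkit_experimental.count_min_sketch(symbol, 0.01, 0.01) AS symbol_sketch
+    FROM crypto_ticks
+)
+SELECT toolkit_experimental.approx_count('AAPL', symbol_sketch)
+FROM t;
+```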
+ +## Arguments + +The syntax is: + +```sql +count_min_sketch( + values TEXT, + error DOUBLE PRECISION, + probability DOUBLE PRECISION, +) RETURNS CountMinSketch +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `values` | TEXT | - | ✔ | The column of values to count | +| `error` | DOUBLE PRECISION | - | ✔ | Error tolerance in estimate, calculated relative to the number of values added to the sketch | +| `probability` | DOUBLE PRECISION | - | ✔ | Probability that an estimate falls outside the error bounds | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| `count_min_sketch` | CountMinSketch | An object storing a table of counters | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/index.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/index.mdx new file mode 100644 index 0000000..44d93ef --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/count_min_sketch/index.mdx @@ -0,0 +1,57 @@ +--- +title: Count-min sketch overview +sidebarTitle: Overview +description: Functions for estimating value counts using the count-min sketch data structure +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access 1.8.0 + +Count the number of times a value appears in a column, using the probabilistic count-min sketch data structure and its +associated algorithms. For applications where a small error rate is tolerable, this can result in huge savings in both +CPU time and memory, especially for large datasets. + +The count-min sketch produces a biased estimator of frequency. It might overestimate the item count, but it can't +underestimate. You can control the relative error and the probability that the estimate falls outside the error bounds. 
+ +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Estimate counts for specific values + +Create a count-min sketch and estimate how many times specific user IDs appear: + +```sql +WITH sketch AS ( + SELECT toolkit_experimental.count_min_sketch( + user_id::text, + 0.01, -- 1% error tolerance + 0.01 -- 1% probability of exceeding error bounds + ) AS cms + FROM user_events +) +SELECT + toolkit_experimental.approx_count(cms, 'user123') AS user123_count, + toolkit_experimental.approx_count(cms, 'user456') AS user456_count +FROM sketch; +``` + +## Available functions + +### Aggregate +- [`count_min_sketch()`][count_min_sketch]: aggregate data into a count-min sketch for approximate counting + +### Accessor +- [`approx_count()`][approx_count]: estimate the number of times a value appears in a count-min sketch + +[two-step-aggregation]: #two-step-aggregation +[count_min_sketch]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/count_min_sketch/count_min_sketch +[approx_count]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/count_min_sketch/approx_count diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/freq_agg.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/freq_agg.mdx new file mode 100644 index 0000000..c441ce1 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/freq_agg.mdx @@ -0,0 +1,49 @@ +--- +title: freq_agg() +description: Aggregate data into a space-saving aggregate for further frequency analysis +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: aggregate + aggregates: + - freq_agg() +--- + + Early access 1.5.0 + +Aggregate data into a space-saving aggregate object, which stores frequency information in an intermediate form. You can +then use any of the accessors in this group to return estimated frequencies or the most common elements. + +## Samples + +Create a space-saving aggregate over a field `ZIP` in a `HomeSales` table. This aggregate tracks any `ZIP` value that +occurs in at least 5% of rows: + +```sql +SELECT toolkit_experimental.freq_agg(0.05, ZIP) FROM HomeSales; +``` + +## Arguments + +The syntax is: + +```sql +freq_agg( + min_freq DOUBLE PRECISION, + value AnyElement +) RETURNS SpaceSavingAggregate +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `min_freq` | DOUBLE PRECISION | - | ✔ | Frequency cutoff for keeping track of a value. Values that occur less frequently than the cutoff are not stored. | +| `value` | AnyElement | - | ✔ | The column to store frequencies for | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | SpaceSavingAggregate | An object storing the most common elements of the given table and their estimated frequency. You can pass this object to any of the accessor functions to get a final result. 
| + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/index.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/index.mdx new file mode 100644 index 0000000..2b93ad9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/index.mdx @@ -0,0 +1,108 @@ +--- +title: Frequency aggregation overview +sidebarTitle: Overview +description: Functions for finding the most common values using the SpaceSaving algorithm +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access 1.5.0 + +Get the most common elements of a set and their relative frequency. The estimation uses the SpaceSaving algorithm. + +This group of functions contains two aggregate functions, which let you set the cutoff for keeping track of a value in different ways. [`freq_agg`](#freq_agg) allows you to specify a minimum frequency, and [`mcv_agg`](#mcv_agg) allows you to specify the target number of values to keep. + +To estimate the absolute number of times a value appears, use [`count_min_sketch`][count_min_sketch]. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Get the 5 most common values from a table + +This test uses a table of randomly generated data. The values used are the integer square roots of a random number in +the range 0 to 400. + +```sql +CREATE TABLE value_test(value INTEGER); +INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000); +``` + +This returns the 5 most common values seen in the table: + +```sql +SELECT topn( + toolkit_experimental.freq_agg(0.05, value), + 5) +FROM value_test; +``` + +The output for this query: + +```sql + topn +------ + 19 + 18 + 17 + 16 + 15 +``` + +### Generate a table with frequencies of the most commonly seen values + +Return values that represent more than 5% of the input: + +```sql +SELECT value, min_freq, max_freq +FROM into_values( + (SELECT toolkit_experimental.freq_agg(0.05, value) FROM value_test)); +``` + +The output for this query looks like this, with some variation due to randomness: + +```sql + value | min_freq | max_freq +-------+----------+---------- + 19 | 0.09815 | 0.09815 + 18 | 0.09169 | 0.09169 + 17 | 0.08804 | 0.08804 + 16 | 0.08248 | 0.08248 + 15 | 0.07703 | 0.07703 + 14 | 0.07157 | 0.07157 + 13 | 0.06746 | 0.06746 + 12 | 0.06378 | 0.06378 + 11 | 0.05565 | 0.05595 + 10 | 0.05286 | 0.05289 +``` + +## Available functions + +### Aggregates +- [`freq_agg()`][freq_agg]: aggregate data into a space-saving aggregate with a minimum frequency cutoff +- [`mcv_agg()`][mcv_agg]: aggregate data into a space-saving aggregate with a target number of values + +### Accessors +- [`into_values()`][into_values]: return the values and their estimated frequencies from a frequency aggregate +- [`max_frequency()`][max_frequency]: get the maximum frequency of a value from a frequency aggregate +- [`min_frequency()`][min_frequency]: get the minimum frequency of a value from a frequency aggregate +- [`topn()`][topn]: get the N most common values from a frequency aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple frequency aggregates + +[two-step-aggregation]: #two-step-aggregation +[count_min_sketch]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/count_min_sketch/index +[freq_agg]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/freq_agg +[mcv_agg]: 
/api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/mcv_agg +[into_values]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/into_values +[max_frequency]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/max_frequency +[min_frequency]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/min_frequency +[topn]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/topn +[rollup]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/rollup diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/into_values.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/into_values.mdx new file mode 100644 index 0000000..cb04f53 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/into_values.mdx @@ -0,0 +1,40 @@ +--- +title: into_values() +description: Get a table of all frequency estimates from a space-saving aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: accessor + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Return the data from a space-saving aggregate as a table. The table lists the stored values with the minimum and maximum +bounds for their estimated frequencies. + +## Arguments + +The syntax is: + +```sql +into_values( + agg SpaceSavingAggregate +) RETURNS (AnyElement, DOUBLE PRECISION, DOUBLE PRECISION) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `agg` | SpaceSavingAggregate | - | ✔ | A space-saving aggregate created using either [`freq_agg`](#freq_agg) or [`mcv_agg`](#mcv_agg) | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| value | AnyElement | A commonly seen value in the original dataset | +| min_freq | DOUBLE PRECISION | The minimum bound for the estimated frequency | +| max_freq | DOUBLE PRECISION | The maximum bound for the estimated frequency | +| `max_freq` | DOUBLE PRECISION | The maximum bound for the estimated frequency | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/max_frequency.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/max_frequency.mdx new file mode 100644 index 0000000..27fadce --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/max_frequency.mdx @@ -0,0 +1,58 @@ +--- +title: max_frequency() +description: Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: accessor + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Get the maximum bound of the estimated frequency for a given value in a space-saving aggregate. 
+ +## Samples + +Find the maximum frequency of the value `3` in a column named `value` within the table `value_test`: + +```sql +SELECT max_frequency( + (SELECT mcv_agg(20, value) FROM value_test), + 3 +); +``` + +## Arguments + +The syntax is: + +```sql +max_frequency ( + agg SpaceSavingAggregate, + value AnyElement +) RETURNS DOUBLE PRECISION +``` + + +```sql +max_frequency ( + agg SpaceSavingAggregate, + value AnyElement +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `agg` | SpaceSavingAggregate | - | ✔ | A space-saving aggregate created using either [`freq_agg`](#freq_agg) or [`mcv_agg`](#mcv_agg) | +| `value` | AnyElement | - | ✔ | The value to get the frequency of | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| max_frequency | DOUBLE PRECISION | The maximum bound for the value's estimated frequency. The maximum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see [`freq_agg`](#freq_agg). | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/mcv_agg.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/mcv_agg.mdx new file mode 100644 index 0000000..46dc2be --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/mcv_agg.mdx @@ -0,0 +1,68 @@ +--- +title: mcv_agg() +description: Aggregate data into a space-saving aggregate for further calculation of most-frequent values +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: alternate aggregate + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Aggregate data into a space-saving aggregate, which stores frequency information in an intermediate form. You can then +use any of the accessors in this group to return estimated frequencies or the most common elements. + +This differs from [`freq_agg`](#freq_agg) in that you can specify a target number of values to keep, rather than a frequency cutoff. + +## Samples + +Create a topN aggregate over the `country` column of the `users` table. Targets the top 10 most-frequent values: + +```sql +SELECT mcv_agg(10, country) FROM users; +``` + +Create a topN aggregate over the `type` column of the `devices` table. Estimates the skew of the data to be 1.05, and +targets the 5 most-frequent values: + +```sql +SELECT mcv_agg(5, 1.05, type) FROM devices; +``` + +## Arguments + +The syntax is: + +```sql +mcv_agg ( + n INTEGER, + value AnyElement + [, skew DOUBLE PRECISION] +) RETURNS SpaceSavingAggregate +``` + + +```sql +mcv_agg ( + n INTEGER, + value AnyElement + [, skew DOUBLE PRECISION] +) RETURNS SpaceSavingAggregate +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `n` | INTEGER | - | ✔ | The target number of most-frequent values | +| `value` | AnyElement | - | ✔ | The column to store frequencies for | +| `skew` | DOUBLE PRECISION | 1.1 | | The estimated skew of the data, defined as the `s` parameter of a zeta distribution. Must be greater than `1.0`. Defaults to `1.1`. For more information, see the section on [skew](#estimated-skew). | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | SpaceSavingAggregate | An object storing the most common elements of the given table and their estimated frequency. 
You can pass this object to any of the accessor functions to get a final result. | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/min_frequency.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/min_frequency.mdx new file mode 100644 index 0000000..3546415 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/min_frequency.mdx @@ -0,0 +1,58 @@ +--- +title: min_frequency() +description: Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: accessor + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Get the minimum bound of the estimated frequency for a given value in a space-saving aggregate. + +## Samples + +Find the minimum frequency of the value `3` in a column named `value` within the table `value_test`: + +```sql +SELECT min_frequency( + (SELECT mcv_agg(20, value) FROM value_test), + 3 +); +``` + +## Arguments + +The syntax is: + +```sql +min_frequency ( + agg SpaceSavingAggregate, + value AnyElement +) RETURNS DOUBLE PRECISION +``` + + +```sql +min_frequency ( + agg SpaceSavingAggregate, + value AnyElement +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `agg` | SpaceSavingAggregate | - | ✔ | A space-saving aggregate created using either [`freq_agg`](#freq_agg) or [`mcv_agg`](#mcv_agg) | +| `value` | AnyElement | - | ✔ | The value to get the frequency of | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| min_frequency | DOUBLE PRECISION | The minimum bound for the value's estimated frequency. The minimum frequency might be 0 if the value's frequency falls below the space-saving aggregate's cut-off threshold. For more information, see [`freq_agg`](#freq_agg). | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/rollup.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/rollup.mdx new file mode 100644 index 0000000..50a7be3 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/rollup.mdx @@ -0,0 +1,42 @@ +--- +title: rollup() +description: Combine multiple frequency aggregates +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: rollup + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Combine multiple aggregates created with `freq_agg` or `mcv_agg` functions. This function requires that the source +aggregates have been created with the same parameters (same `min_freq` for `freq_agg`, same n-factor and `skew`, if +used, for a `mcv_agg`). + +This produces a very similar aggregate to running the same aggregate function over all the source data. In most cases, +any difference is no more than what you might get from simply reordering the input. However, if the source data for the +different aggregates is very differently distributed, the rollup result may have looser frequency bounds. + +## Arguments + +The syntax is: + +```sql +rollup( + agg SpaceSavingAggregate +) RETURNS SpaceSavingAggregate +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `agg` | SpaceSavingAggregate | - | ✔ | The aggregates to roll up. These must have been created with the same parameters. 
| + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| `rollup` | SpaceSavingAggregate | An aggregate containing the most common elements from all of the underlying data for all of the aggregates. | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/topn.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/topn.mdx new file mode 100644 index 0000000..b68146a --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/freq_agg/topn.mdx @@ -0,0 +1,47 @@ +--- +title: topn() +description: Get the top N most common values from a space-saving aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: frequency analysis + type: accessor + aggregates: + - freq_agg() +--- + + Since 1.16.0 + +Get the top N most common values from a space-saving aggregate. The space-saving aggregate can be created from either [`freq_agg`](#freq_agg) or [`mcv_agg`](#mcv_agg). + +## Samples + +Get the 20 most frequent `zip_codes` from an `employees` table: + +```sql +SELECT topn(mcv_agg(20, zip_code)) FROM employees; +``` + +## Arguments + +The syntax is: + +```sql +topn ( + agg SpaceSavingAggregate, + n INTEGER +) RETURNS AnyElement +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| `agg` | SpacingsavingAggregate | - | ✔ | A space-saving aggregate created using either [`freq_agg`](#freq_agg) or [`mcv_agg`](#mcv_agg) | +| `n` | INTEGER | - | ✔ | The number of values to return. Required only for frequency aggregates. For top N aggregates, defaults to target N of the aggregate itself, and requests for a higher N return an error. In some cases, the function might return fewer than N values. This might happen if a frequency aggregate doesn't contain N values above the minimum frequency, or if the data isn't skewed enough to support N values from a top N aggregate. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| topn | AnyElement | The N most-frequent values in the aggregate | + diff --git a/api-reference/timescaledb-toolkit/frequency-analysis/index.mdx b/api-reference/timescaledb-toolkit/frequency-analysis/index.mdx new file mode 100644 index 0000000..ab94ce6 --- /dev/null +++ b/api-reference/timescaledb-toolkit/frequency-analysis/index.mdx @@ -0,0 +1,77 @@ +--- +title: Frequency analysis overview +sidebarTitle: Overview +description: Functions for analyzing the frequency of values in time-series data +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access 1.5.0 + +Analyze the frequency of values in time-series data using memory-efficient probabilistic data structures. These +functions help you identify the most common elements and estimate occurrence counts without storing every individual +value. 
+ +TimescaleDB Toolkit provides two approaches to frequency analysis: +- **freq_agg**: Get the most common elements and their relative frequency using the SpaceSaving algorithm +- **count_min_sketch**: Estimate the absolute number of times a specific value appears using the count-min sketch data + structure + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Find the most common values + +Get the 5 most common values from a dataset: + +```sql +CREATE TABLE value_test(value INTEGER); +INSERT INTO value_test SELECT floor(sqrt(random() * 400)) FROM generate_series(1,100000); + +SELECT topn( + toolkit_experimental.freq_agg(0.05, value), + 5) +FROM value_test; +``` + +### Get frequency information for common values + +Return values that represent more than 5% of the input, along with their frequency bounds: + +```sql +SELECT value, min_freq, max_freq +FROM into_values( + (SELECT toolkit_experimental.freq_agg(0.05, value) FROM value_test)); +``` + +### Estimate absolute counts + +Use count-min sketch to estimate how many times specific values appear: + +```sql +WITH sketch AS ( + SELECT toolkit_experimental.count_min_sketch(user_id::text, 0.01, 0.01) AS cms + FROM user_events +) +SELECT toolkit_experimental.approx_count(cms, 'user123') AS estimated_count +FROM sketch; +``` + +## Available functions + +### Frequency aggregation +- [`freq_agg()`][freq_agg]: track the most common values using a minimum frequency cutoff + +### Count-min sketch +- [`count_min_sketch()`][count_min_sketch]: estimate absolute counts using the count-min sketch data structure + +[two-step-aggregation]: #two-step-aggregation +[freq_agg]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/freq_agg/index +[count_min_sketch]: /api-reference/timescaledb/hyperfunctions/frequency-analysis/count_min_sketch/index diff --git a/api-reference/timescaledb-toolkit/hyperloglog/approx_count_distinct.mdx b/api-reference/timescaledb-toolkit/hyperloglog/approx_count_distinct.mdx new file mode 100644 index 0000000..17805fb --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/approx_count_distinct.mdx @@ -0,0 +1,63 @@ +--- +title: approx_count_distinct() +description: Aggregate data into a hyperloglog for approximate counting without specifying the number of buckets +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: approximate count distinct + type: alternate aggregate + aggregates: + - hyperloglog() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This is an alternate first step for approximating the number of distinct +values. It provides some added convenience by using some sensible default +parameters to create a `hyperloglog`. + +Use `approx_count_distinct` to create an intermediate aggregate from your raw data. +This intermediate form can then be used by one or more accessors in this +group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined +using [`rollup()`](#rollup) before an accessor is applied. + + +## Samples + +Given a table called `samples`, with a column called `weights`, return +a `hyperloglog` over the `weights` column: + +```sql +SELECT toolkit_experimental.approx_count_distinct(weights) FROM samples; +``` + +Using the same data, build a view from the aggregate that you can pass +to other `hyperloglog` functions. 
+ +```sql +CREATE VIEW hll AS SELECT toolkit_experimental.approx_count_distinct(data) FROM samples; +``` + +## Arguments + +The syntax is: + +```sql +approx_count_distinct( + value AnyElement +) RETURNS Hyperloglog +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `value` | AnyElement | - | ✔ | The column containing the elements to count. The type must have an extended, 64-bit, hash function. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| hyperloglog | Hyperloglog | A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation | diff --git a/api-reference/timescaledb-toolkit/hyperloglog/distinct_count.mdx b/api-reference/timescaledb-toolkit/hyperloglog/distinct_count.mdx new file mode 100644 index 0000000..311ab4c --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/distinct_count.mdx @@ -0,0 +1,56 @@ +--- +title: distinct_count() +description: Estimate the number of distinct values from a hyperloglog +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: approximate count distinct + type: accessor + aggregates: + - hyperloglog() +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Estimate the number of distinct values from a hyperloglog + + +## Samples + +Estimate the number of distinct values from a hyperloglog named +`hyperloglog`. The expected output is 98,814. + +```sql +SELECT distinct_count(hyperloglog(8192, data)) + FROM generate_series(1, 100000) data +``` + +Output: + +```sql +distinct_count +---------------- + 98814 +``` + +## Arguments + +The syntax is: + +```sql +distinct_count( + hyperloglog Hyperloglog +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `hyperloglog` | Hyperloglog | - | ✔ | The hyperloglog to extract the count from. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| distinct_count | BIGINT | The number of distinct elements counted by the hyperloglog. | diff --git a/api-reference/timescaledb-toolkit/hyperloglog/hyperloglog.mdx b/api-reference/timescaledb-toolkit/hyperloglog/hyperloglog.mdx new file mode 100644 index 0000000..f8ad16e --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/hyperloglog.mdx @@ -0,0 +1,67 @@ +--- +title: hyperloglog() +description: Aggregate data into a hyperloglog for approximate counting +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: approximate count distinct + type: aggregate + aggregates: + - hyperloglog() +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +This is the first step for estimating the approximate number of distinct +values using the `hyperloglog` algorithm. Use `hyperloglog` to create an +intermediate aggregate from your raw data. This intermediate form can then +be used by one or more accessors in this group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined +using [`rollup()`](#rollup) before an accessor is applied. + +If you're not sure what value to set for `buckets`, try using the alternate +aggregate function, [`approx_count_distinct()`](#approx_count_distinct). +`approx_count_distinct` also creates a `hyperloglog`, but it sets a +default bucket value that should work for many use cases. + + +## Samples + +Given a table called `samples`, with a column called `weights`, return +a `hyperloglog` over the `weights` column. 
+ +```sql +SELECT hyperloglog(32768, weights) FROM samples; +``` + +Using the same data, build a view from the aggregate that you can pass +to other `hyperloglog` functions. + +```sql +CREATE VIEW hll AS SELECT hyperloglog(32768, data) FROM samples; +``` + +## Arguments + +The syntax is: + +```sql +hyperloglog( + buckets INTEGER, + value AnyElement +) RETURNS Hyperloglog +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `buckets` | INTEGER | - | ✔ | Number of buckets in the hyperloglog. Increasing the number of buckets improves accuracy but increases memory use. Value is rounded up to the next power of 2, and must be between 2^4 (16) and 2^18. Setting a value less than 2^10 (1,024) may result in poor accuracy if the true cardinality is high and is not recommended. If unsure, start experimenting with 8,192 (2^13) which has an approximate error rate of 1.15%. | +| `value` | AnyElement | - | ✔ | The column containing the elements to count. The type must have an extended, 64-bit, hash function. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| hyperloglog | Hyperloglog | A `hyperloglog` object which can be passed to other hyperloglog APIs for rollups and final calculation | diff --git a/api-reference/timescaledb-toolkit/hyperloglog/index.mdx b/api-reference/timescaledb-toolkit/hyperloglog/index.mdx new file mode 100644 index 0000000..ed78a91 --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/index.mdx @@ -0,0 +1,101 @@ +--- +title: Approximate count distinct overview +description: Estimate the number of distinct values in a dataset, also known as cardinality estimation +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +For large datasets and datasets with high cardinality (many distinct values), this can be much more efficient in +both CPU and memory than an exact count using `count(DISTINCT)`. + +The estimation uses the [`hyperloglog++`][hyperloglog-wiki] algorithm. If you aren't +sure what parameters to set for the `hyperloglog`, try using the +[`approx_count_distinct`][approx_count_distinct] aggregate, which sets some +reasonable default values. + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function, +[`hyperloglog`][hyperloglog], it also includes the alternate aggregate function +[`approx_count_distinct`][approx_count_distinct]. Both produce a hyperloglog aggregate, which can then be used with the +accessor and rollup functions in +this group. + +## Two-step aggregation + + + +## Samples + +### Roll up two hyperloglogs + +The first hyperloglog buckets the integers from 1 to +100,000, and the second hyperloglog buckets the integers from 50,000 to +150,000. Accounting for overlap, the exact number of distinct values in the +combined set is 150,000. 
+ +Calling `distinct_count` on the rolled-up hyperloglog yields a final value of +150,552, so the approximation is off by only 0.368%: + +```sql +SELECT distinct_count(rollup(logs)) +FROM ( + (SELECT hyperloglog(4096, v::text) logs FROM generate_series(1, 100000) v) + UNION ALL + (SELECT hyperloglog(4096, v::text) FROM generate_series(50000, 150000) v) +) hll; +``` + +Output: + +```sql + distinct_count +---------------- + 150552 +``` + +## Approximate relative errors + +These are the approximate errors for each bucket size: + +| precision | registers (bucket size) | error | column size (in bytes) | +|-----------|-------------------------|--------|-------------------------| +| 4 | 16 | 0.2600 | 12 | +| 5 | 32 | 0.1838 | 24 | +| 6 | 64 | 0.1300 | 48 | +| 7 | 128 | 0.0919 | 96 | +| 8 | 256 | 0.0650 | 192 | +| 9 | 512 | 0.0460 | 384 | +| 10 | 1024 | 0.0325 | 768 | +| 11 | 2048 | 0.0230 | 1536 | +| 12 | 4096 | 0.0163 | 3072 | +| 13 | 8192 | 0.0115 | 6144 | +| 14 | 16384 | 0.0081 | 12288 | +| 15 | 32768 | 0.0057 | 24576 | +| 16 | 65536 | 0.0041 | 49152 | +| 17 | 131072 | 0.0029 | 98304 | +| 18 | 262144 | 0.0020 | 196608 | + +## Available functions + +### Aggregate +- [`hyperloglog()`][hyperloglog]: aggregate data into a hyperloglog for approximate counting + +### Alternate aggregate +- [`approx_count_distinct()`][approx_count_distinct]: aggregate data into a hyperloglog without specifying the number + of buckets + +### Accessors +- [`distinct_count()`][distinct_count]: estimate the number of distinct values from a hyperloglog +- [`stderror()`][stderror]: estimate the relative standard error of a hyperloglog + +### Rollup +- [`rollup()`][rollup]: combine multiple hyperloglogs + +[two-step-aggregation]: #two-step-aggregation +[hyperloglog-wiki]: https://en.wikipedia.org/wiki/HyperLogLog +[hyperloglog]: /api-reference/timescaledb/hyperfunctions/hyperloglog/hyperloglog +[approx_count_distinct]: /api-reference/timescaledb/hyperfunctions/hyperloglog/approx_count_distinct +[distinct_count]: /api-reference/timescaledb/hyperfunctions/hyperloglog/distinct_count +[stderror]: /api-reference/timescaledb/hyperfunctions/hyperloglog/stderror +[rollup]: /api-reference/timescaledb/hyperfunctions/hyperloglog/rollup diff --git a/api-reference/timescaledb-toolkit/hyperloglog/rollup.mdx b/api-reference/timescaledb-toolkit/hyperloglog/rollup.mdx new file mode 100644 index 0000000..966dbf0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/rollup.mdx @@ -0,0 +1,41 @@ +--- +title: rollup() +description: Roll up multiple hyperloglogs +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: approximate count distinct + type: rollup + aggregates: + - hyperloglog() +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Combine multiple intermediate hyperloglog aggregates, produced by +hyperloglog, into a single intermediate hyperloglog aggregate. For example, +you can use `rollup` to combine hyperloglog from 15-minute buckets into +daily buckets. + + +## Arguments + +The syntax is: + +```sql +rollup( + hyperloglog Hyperloglog +) RETURNS Hyperloglog +``` +| Name | Type | Default | Required | Description | +|---|---|---|---|---| +| `hyperloglog` | Hyperloglog | - | ✔ | The hyperloglog aggregates to roll up. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | Hyperloglog | A new hyperloglog aggregate created by combining the input hyperloglog aggregates. 
| diff --git a/api-reference/timescaledb-toolkit/hyperloglog/stderror.mdx b/api-reference/timescaledb-toolkit/hyperloglog/stderror.mdx new file mode 100644 index 0000000..f66b87f --- /dev/null +++ b/api-reference/timescaledb-toolkit/hyperloglog/stderror.mdx @@ -0,0 +1,58 @@ +--- +title: stderror() +description: Estimate the relative standard error of a hyperloglog +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: approximate count distinct + type: accessor + aggregates: + - hyperloglog() +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Estimate the relative standard error of a `Hyperloglog`. For approximate +relative errors by number of buckets, see the +[relative errors section](#approximate-relative-errors). + + +## Samples + +Estimate the relative standard error of a hyperloglog named +`hyperloglog`. The expected output is 0.011490485194281396. + +```sql +SELECT stderror(hyperloglog(8192, data)) + FROM generate_series(1, 100000) data +``` + +Output: + +```sql +stderror +---------------------- +0.011490485194281396 +``` + +## Arguments + +The syntax is: + +```sql +stderror( + hyperloglog Hyperloglog +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `hyperloglog` | Hyperloglog | - | ✔ | The hyperloglog to estimate the error of. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| stderror | DOUBLE PRECISION | The approximate relative standard error of the hyperloglog. | diff --git a/api-reference/timescaledb-toolkit/index.mdx b/api-reference/timescaledb-toolkit/index.mdx new file mode 100644 index 0000000..15aedec --- /dev/null +++ b/api-reference/timescaledb-toolkit/index.mdx @@ -0,0 +1,103 @@ +--- +title: TimescaleDB Toolkit API Reference +sidebarTitle: Overview +description: Analyze anything you have stored as time-series data, including IoT devices, IT systems, marketing analytics, user behavior, financial metrics, and cryptocurrency. +products: [cloud, mst, self_hosted] +keywords: [API, reference, toolkit, utilities, functions] +mode: "wide" +--- + +import { TOOLKIT_LONG, TIMESCALE_DB, HYPERFUNC } from '/snippets/vars.mdx'; + + + + Functions for statistical analysis and linear regression on time-series data + + + + Functions for analyzing monotonic counters and gauge metrics + + + + Functions for tracking state transitions and system liveness over time + + + + Calculate time-weighted summary statistics for unevenly sampled data + + + + Estimate percentile values and percentile ranks using memory-efficient approximation algorithms + + + + Estimate the number of distinct values in a dataset, also known as cardinality estimation + + + + Functions for analyzing the frequency of values in time-series data + + + + Functions for downsampling time-series data to visualize trends while preserving visual similarity + + + + Perform analysis of financial asset data + + + + Find the smallest and largest values in a dataset + + + + Perform saturating math operations on integers + + + +{TOOLKIT_LONG} extends {TIMESCALE_DB} with additional {HYPERFUNC} for advanced time-series analysis. For +{HYPERFUNC} included by default in {TIMESCALE_DB}, see the [{TIMESCALE_DB} {HYPERFUNC} documentation](/api-reference/timescaledb/hyperfunctions/index). 
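+
+Before you call any Toolkit hyperfunctions, the `timescaledb_toolkit` extension must be enabled in the database you are connected to. The following sketch shows one way to check and enable it; availability and the exact steps depend on your deployment, so treat this as a starting point rather than a definitive procedure:
+
+```sql
+-- Check whether the Toolkit extension is available and which version, if any, is installed
+SELECT name, default_version, installed_version
+FROM pg_available_extensions
+WHERE name = 'timescaledb_toolkit';
+
+-- Enable the extension if it is available but not yet installed
+CREATE EXTENSION IF NOT EXISTS timescaledb_toolkit;
+```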
diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/index.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/index.mdx new file mode 100644 index 0000000..4aa7575 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/index.mdx @@ -0,0 +1,156 @@ +--- +title: Minimum and maximum overview +description: Find the smallest and largest values in a dataset +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Find the smallest and largest values in a dataset. These specialized hyperfunctions make +it easier to write queries that identify extreme values in your data. + +They help you answer questions such as: + +* What are the N smallest or largest values in my dataset? +* Which rows contain the minimum or maximum values? +* How can I efficiently track top/bottom values over time? + +This function family provides four related function groups: + +- [`min_n()`][min_n]: Get the N smallest values from a column +- [`max_n()`][max_n]: Get the N largest values from a column +- [`min_n_by()`][min_n_by]: Get the N smallest values with accompanying data (like full rows) +- [`max_n_by()`][max_n_by]: Get the N largest values with accompanying data (like full rows) + +These function groups use the [two-step aggregation][two-step-aggregation] +pattern. Each group includes an aggregate function to create intermediate aggregates, +accessor functions to extract results, and rollup functions to combine aggregates. + +The minimum and maximum functions give the same results as the regular SQL query +`SELECT ... ORDER BY ... LIMIT n`. But unlike the SQL query, they can be composed +and combined like other aggregate hyperfunctions. + +## Two-step aggregation + + + +## Samples + +### Find the smallest values + +Get the 5 smallest values from a calculation. This example uses `min_n()` to +find the bottom 5 values from `i * 13 % 10007` for i = 1 to 10000: + +```sql +SELECT into_array( + min_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_array +--------------------------------- +{1,2,3,4,5} +``` + +### Find the largest values + +Get the 5 largest values from a calculation. This example uses `max_n()` to +find the top 5 values from `i * 13 % 10007` for i = 1 to 10000: + +```sql +SELECT into_array( + max_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_array +--------------------------------- +{10006,10005,10004,10003,10002} +``` + +### Find the smallest transactions with details + +This example assumes you have a table of stock trades: + +```sql +CREATE TABLE stock_sales( + ts TIMESTAMPTZ, + symbol TEXT, + price FLOAT, + volume INT +); +``` + +Find the 10 smallest transactions each day with their timestamps and symbols. +This example uses `min_n_by()` to track both the transaction size and +associated row data: + +```sql +WITH daily_min AS ( + SELECT + time_bucket('1 day'::interval, ts) as day, + min_n_by(price * volume, stock_sales, 10) AS min_transactions + FROM stock_sales + GROUP BY day +) +SELECT + day, + (data).ts, + (data).symbol, + value AS transaction_size +FROM daily_min, + LATERAL into_values(min_transactions, NULL::stock_sales); +``` + +### Find the largest transactions with details + +Find the 10 largest transactions each day. 
This example uses `max_n_by()`: + +```sql +WITH daily_max AS ( + SELECT + time_bucket('1 day'::interval, ts) as day, + max_n_by(price * volume, stock_sales, 10) AS max_transactions + FROM stock_sales + GROUP BY day +) +SELECT + day, + (data).ts, + (data).symbol, + value AS transaction_size +FROM daily_max, + LATERAL into_values(max_transactions, NULL::stock_sales); +``` + +## Available functions + +### Minimum values +- [`min_n()`][min_n]: get the N smallest values from a column + +### Maximum values +- [`max_n()`][max_n]: get the N largest values from a column + +### Minimum values with data +- [`min_n_by()`][min_n_by]: get the N smallest values with accompanying data + +### Maximum values with data +- [`max_n_by()`][max_n_by]: get the N largest values with accompanying data + +[two-step-aggregation]: #two-step-aggregation +[min_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/index +[max_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/index +[min_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/index +[max_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/index diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/index.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/index.mdx new file mode 100644 index 0000000..900ad94 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/index.mdx @@ -0,0 +1,74 @@ +--- +title: Maximum values overview +description: Get the N largest values from a column +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Get the N largest values from a column. + +The `max_n()` functions give the same results as the regular SQL query `SELECT +... ORDER BY ... LIMIT n`. But unlike the SQL query, they can be composed and +combined like other aggregate hyperfunctions. + +To get the N smallest values, use [`min_n()`][min_n]. To get the N largest +values with accompanying data, use [`max_n_by()`][max_n_by]. + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function [`max_n`][max_n], it also +includes accessors and rollup functions. 
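+
+As a quick illustration of that equivalence, the following sketch assumes a hypothetical `metrics` table with a `value` column; both queries return the five largest values, the second as an array in decreasing order:
+
+```sql
+-- Plain SQL: the five largest values, one per row
+SELECT value FROM metrics ORDER BY value DESC LIMIT 5;
+
+-- Composable aggregate form: the same five values, returned as an array
+SELECT into_array(max_n(value, 5)) FROM metrics;
+```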
+ +## Two-step aggregation + + + +## Samples + +### Get the 10 largest transactions from a table of stock trades + +This example assumes that you have a table of stock trades in this format: + +```sql +CREATE TABLE stock_sales( + ts TIMESTAMPTZ, + symbol TEXT, + price FLOAT, + volume INT +); +``` + +You can query for the 10 largest transactions each day: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as day, + max_n(price * volume, 10) AS daily_max + FROM stock_sales + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + day, into_array(daily_max) +FROM t; +``` + +## Available functions + +### Aggregate +- [`max_n()`][max_n]: construct an aggregate that keeps track of the largest values passed through it + +### Accessors +- [`into_values()`][into_values]: return the N highest values seen by the aggregate +- [`into_array()`][into_array]: return the N highest values seen by the aggregate as an array + +### Rollup +- [`rollup()`][rollup]: combine multiple MaxN aggregates + +[two-step-aggregation]: #two-step-aggregation +[max_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/max_n +[min_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/min_n +[max_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/max_n_by +[into_values]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/into_values +[into_array]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/into_array +[rollup]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/rollup diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_array.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_array.mdx new file mode 100644 index 0000000..0351a3f --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_array.mdx @@ -0,0 +1,61 @@ +--- +title: into_array() +description: Returns an array of the highest values from a MaxN aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - max_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Return the N largest values seen by the aggregate. The values are formatted +as an array in decreasing order. + + +## Samples + +Find the top 5 values from `i * 13 % 10007` for i = 1 to 10000. + +```sql +SELECT into_array( + max_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_array +--------------------------------- +{10006,10005,10004,10003,10002} +``` + +## Arguments + +The syntax is: + +```sql +into_array ( + agg MinN +) BIGINT[] | DOUBLE PRECISION[] | TIMESTAMPTZ[] +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MaxN | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored in the aggregate. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_array | BIGINT[ ] \| DOUBLE PRECISION[ ] \| TIMESTAMPTZ[ ] | The lowest values seen while creating this aggregate. 
| diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_values.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_values.mdx new file mode 100644 index 0000000..c317e4d --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/into_values.mdx @@ -0,0 +1,63 @@ +--- +title: into_values() +description: Returns the highest values from a MaxN aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - max_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Return the N highest values seen by the aggregate. + +## Samples + +Find the top 5 values from i * 13 % 10007 for i = 1 to 10000: + +```sql +SELECT into_values( + max_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: +``` +into_values +------------- +10006 +10005 +10004 +10003 +10002 +``` + +## Arguments + +The syntax is: + +```sql +into_values ( + agg MaxN +) SETOF BIGINT | SETOF DOUBLE PRECISION | SETOF TIMESTAMPTZ +``` + +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MaxN | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_values | SETOF BIGINT \| SETOF DOUBLE PRECISION \| SETOF TIMESTAMPTZ | The lowest values seen while creating this aggregate. | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/max_n.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/max_n.mdx new file mode 100644 index 0000000..9bd2400 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/max_n.mdx @@ -0,0 +1,41 @@ +--- +title: max_n() +description: Find the largest values in a set of data +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: aggregate + aggregates: + - max_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Construct an aggregate which will keep track of the largest values passed through it. + + +## Arguments + +The syntax is: + +```sql +max_n( + value BIGINT | DOUBLE PRECISION | TIMESTAMPTZ, + capacity BIGINT +) MaxN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `value` | BIGINT \| DOUBLE PRECISION \| TIMESTAMPTZ | - | ✔ | The values passed into the aggregate | +| `capacity` | BIGINT | - | ✔ | The number of values to retain. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| max_n | MaxN | The compiled aggregate. 
Note that the exact type will be `MaxInts`, `MaxFloats`, or `MaxTimes` depending on the input type | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/rollup.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/rollup.mdx new file mode 100644 index 0000000..66d87af --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n/rollup.mdx @@ -0,0 +1,41 @@ +--- +title: rollup() +description: Combine multiple MaxN aggregates +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: rollup + aggregates: + - max_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This aggregate combines the aggregates generated by other `max_n` +aggregates and returns the maximum values found across all the +aggregated data. + + +## Arguments + +The syntax is: + +```sql +rollup( + agg MaxN +) MaxN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MaxN | - | ✔ | The aggregates being combined | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | MaxN | An aggregate over all of the contributing values. | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/index.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/index.mdx new file mode 100644 index 0000000..6cf59ff --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/index.mdx @@ -0,0 +1,73 @@ +--- +title: Maximum values by overview +description: Get the N largest values with accompanying data +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Get the N largest values from a column, with an associated piece of data per +value. For example, you can return an accompanying column, or the full row. + +The `max_n_by()` functions give the same results as the regular SQL query +`SELECT ... ORDER BY ... LIMIT n`. But unlike the SQL query, they can be +composed and combined like other aggregate hyperfunctions. + +To get the N smallest values with accompanying data, use +[`min_n_by()`][min_n_by]. To get the N largest values without accompanying data, +use [`max_n()`][max_n]. + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function [`max_n_by`][max_n_by], it also +includes accessors and rollup functions.
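+
+Because the intermediate aggregates can be rolled up, you can, for example, combine per-day top-10 aggregates into a single top 10 over a longer range, which a plain `ORDER BY ... LIMIT` query cannot do incrementally. The following is a sketch only, reusing the `stock_sales` table defined in the samples below:
+
+```sql
+WITH daily AS (
+    SELECT
+        time_bucket('1 day'::interval, ts) AS day,
+        max_n_by(price * volume, stock_sales, 10) AS top_transactions
+    FROM stock_sales
+    GROUP BY 1
+)
+SELECT
+    (data).ts,
+    (data).symbol,
+    value AS transaction
+FROM into_values(
+    (SELECT rollup(top_transactions) FROM daily),
+    NULL::stock_sales);
+```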
+ +## Two-step aggregation + + + +## Samples + +This example assumes that you have a table of stock trades in this format: + +```sql +CREATE TABLE stock_sales( + ts TIMESTAMPTZ, + symbol TEXT, + price FLOAT, + volume INT +); +``` + +Find the 10 largest transactions in the table, what time they occurred, and what +symbol was being traded: + +```sql +SELECT + (data).ts, + (data).symbol, + value AS transaction +FROM + into_values(( + SELECT max_n_by(price * volume, stock_sales, 10) + FROM stock_sales + ), + NULL::stock_sales); +``` + +## Available functions + +### Aggregate +- [`max_n_by()`][max_n_by]: construct an aggregate that keeps track of the largest values and associated data + +### Accessors +- [`into_values()`][into_values]: return the N highest values with their associated data + +### Rollup +- [`rollup()`][rollup]: combine multiple MaxNBy aggregates + +[two-step-aggregation]: #two-step-aggregation +[max_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/max_n_by +[min_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/min_n_by +[max_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/max_n +[into_values]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/into_values +[rollup]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/rollup diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/into_values.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/into_values.mdx new file mode 100644 index 0000000..fef2fae --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/into_values.mdx @@ -0,0 +1,71 @@ +--- +title: into_values() +description: Returns the highest values and associated data from a MaxNBy aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - max_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This returns the largest values seen by the aggregate and the +corresponding values associated with them. Note that Postgres requires +an input argument with type matching the associated value in order to +determine the response type. + +## Samples + +Find the top 5 values from i * 13 % 10007 for i = 1 to 10000: + +```sql +SELECT into_values( + max_n_by(sub.mod, sub.div, 5), + NULL::INT) +FROM ( + SELECT (i * 13) % 10007 AS mod, (i * 13) / 10007 AS div + FROM generate_series(1,10000) as i +) sub; +``` + +Output: +``` +into_values +------------- +(10006,3) +(10005,7) +(10004,11) +(10003,2) +(10002,6) +``` + +## Arguments + +The syntax is: + +```sql +into_values( + agg MaxNBy, + dummy ANYELEMENT +) TABLE ( + value BIGINT | DOUBLE PRECISION | TIMESTAMPTZ, + data ANYELEMENT +) +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MaxNBy | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored. | +| `dummy` | ANYELEMENT | - | ✔ | This is purely to inform Postgres of the response type. A NULL cast to the appropriate type is typical. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_values | SETOF BIGINT \| SETOF DOUBLE PRECISION \| SETOF TIMESTAMPTZ | The lowest values seen while creating this aggregate. 
| diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/max_n_by.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/max_n_by.mdx new file mode 100644 index 0000000..5d951c0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/max_n_by.mdx @@ -0,0 +1,44 @@ +--- +title: max_n_by() +description: Track the largest values and associated data in a set of values +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: aggregate + aggregates: + - max_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Construct an aggregate that keeps track of the largest values passed through it, as well as some associated data which +is passed alongside the value. + + +## Arguments + +The syntax is: + +```sql +max_n_by( + value BIGINT | DOUBLE PRECISION | TIMESTAMPTZ, + data ANYELEMENT, + capacity BIGINT +) MaxNBy +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `value` | BIGINT \| DOUBLE PRECISION \| TIMESTAMPTZ | - | ✔ | The values passed into the aggregate | +| `data` | ANYELEMENT | - | ✔ | The data associated with a particular value | +| `capacity` | BIGINT | - | ✔ | The number of values to retain. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| max_n_by | MaxNBy | The compiled aggregate. Note that the exact type will be MaxByInts, MaxByFloats, or MaxByTimes depending on the input type | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/rollup.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/rollup.mdx new file mode 100644 index 0000000..2bacee9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/max_n_by/rollup.mdx @@ -0,0 +1,41 @@ +--- +title: rollup() +description: Combine multiple MaxNBy aggregates +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, maximum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: rollup + aggregates: + - max_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This aggregate combines the aggregates generated by other `max_n_by` +aggregates and returns the maximum values and associated data found +across all the aggregated data. + + +## Arguments + +The syntax is: + +```sql +rollup( + agg MinN +) MinN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MaxNBy | - | ✔ | The aggregates being combined | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | MinN | An aggregate over all of the contributing values. | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/index.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/index.mdx new file mode 100644 index 0000000..70e5ec7 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/index.mdx @@ -0,0 +1,72 @@ +--- +title: Minimum values overview +description: Get the N smallest values from a column +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Get the N smallest values from a column. + +The `min_n()` functions give the same results as the regular SQL query `SELECT +... ORDER BY ... LIMIT n`. But unlike the SQL query, they can be composed and +combined like other aggregate hyperfunctions. 
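+
+For example, because `min_n` is a regular aggregate, it can sit alongside other aggregates in a grouped query, which a single `ORDER BY ... LIMIT` query cannot. A minimal sketch, assuming the `stock_sales` table used in the samples below:
+
+```sql
+-- Per symbol: the three lowest prices as an array, next to an ordinary aggregate
+SELECT
+    symbol,
+    into_array(min_n(price, 3)) AS three_lowest_prices,
+    avg(price) AS avg_price
+FROM stock_sales
+GROUP BY symbol;
+```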
+ +To get the N largest values, use [`max_n()`][max_n]. To get the N smallest +values with accompanying data, use [`min_n_by()`][min_n_by]. + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function [`min_n`][min_n], it also +includes accessors and rollup functions. + +## Two-step aggregation + + + +## Samples + +This example assumes that you have a table of stock trades in this format: + +```sql +CREATE TABLE stock_sales( + ts TIMESTAMPTZ, + symbol TEXT, + price FLOAT, + volume INT +); +``` + +You can query for the 10 smallest transactions each day: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as day, + min_n(price * volume, 10) AS daily_min + FROM stock_sales + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + day, into_array(daily_min) +FROM t; +``` + +## Available functions + +### Aggregate +- [`min_n()`][min_n]: construct an aggregate that keeps track of the smallest values passed through it + +### Accessors +- [`into_values()`][into_values]: return the N lowest values seen by the aggregate +- [`into_array()`][into_array]: return the N lowest values seen by the aggregate as an array + +### Rollup +- [`rollup()`][rollup]: combine multiple MinN aggregates + +[two-step-aggregation]: #two-step-aggregation +[min_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/min_n +[max_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n/max_n +[min_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/min_n_by +[into_values]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/into_values +[into_array]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/into_array +[rollup]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/rollup diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_array.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_array.mdx new file mode 100644 index 0000000..aa9e11d --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_array.mdx @@ -0,0 +1,61 @@ +--- +title: into_array() +description: Returns an array of the lowest values from a MinN aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - min_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Returns the N lowest values seen by the aggregate. The values are +formatted as an array in increasing order. + + +## Samples + +Find the bottom 5 values from `i * 13 % 10007` for i = 1 to 10000. + +```sql +SELECT into_array( + min_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_array +--------------------------------- +{1,2,3,4,5} +``` + +## Arguments + +The syntax is: + +```sql +into_array ( + agg MinN +) BIGINT[] | DOUBLE PRECISION[] | TIMESTAMPTZ[] +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MinN | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_array | BIGINT[] \| DOUBLE PRECISION[] \| TIMESTAMPTZ[] | The lowest values seen while creating this aggregate. 
| diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_values.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_values.mdx new file mode 100644 index 0000000..b963dfa --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/into_values.mdx @@ -0,0 +1,64 @@ +--- +title: into_values() +description: Returns the lowest values from a MinN aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - min_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Return the N lowest values seen by the aggregate. + + +## Samples + +Find the bottom 5 values from `i * 13 % 10007` for i = 1 to 10000. + +```sql +SELECT toolkit_experimental.into_values( + toolkit_experimental.min_n(sub.val, 5)) +FROM ( + SELECT (i * 13) % 10007 AS val + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_values +--------------------------------- +1 +2 +3 +4 +5 +``` + +## Arguments + +The syntax is: + +```sql +into_values ( + agg MinN +) SETOF BIGINT | SETOF DOUBLE PRECISION | SETOF TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MinN | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_values | SETOF BIGINT \| SETOF DOUBLE PRECISION \| SETOF TIMESTAMPTZ | The lowest values seen while creating this aggregate. | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/min_n.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/min_n.mdx new file mode 100644 index 0000000..a8482c0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/min_n.mdx @@ -0,0 +1,41 @@ +--- +title: min_n() +description: Find the smallest values in a set of data +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: aggregate + aggregates: + - min_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Construct an aggregate that keeps track of the smallest values passed through it. + + +## Arguments + +The syntax is: + +```sql +min_n( + value BIGINT | DOUBLE PRECISION | TIMESTAMPTZ, + capacity BIGINT +) MinN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `value` | BIGINT \| DOUBLE PRECISION \| TIMESTAMPTZ | - | ✔ | The values passed into the aggregate | +| `capacity` | BIGINT | - | ✔ | The number of values to retain. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| min_n | MinN | The compiled aggregate. 
Note that the exact type is `MinInts`, `MinFloats`, or `MinTimes` depending on the input type | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/rollup.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/rollup.mdx new file mode 100644 index 0000000..a5b9ec8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n/rollup.mdx @@ -0,0 +1,41 @@ +--- +title: rollup() +description: Combine multiple MinN aggregates +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: rollup + aggregates: + - min_n() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This aggregate combines the aggregates generated by other `min_n` +aggregates and returns the minimum values found across all the +aggregated data. + + +## Arguments + +The syntax is: + +```sql +rollup( + agg MinN +) MinN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MinN | - | ✔ | The aggregates being combined | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | MinN | An aggregate over all of the contributing values. | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/index.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/index.mdx new file mode 100644 index 0000000..ffd802f --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/index.mdx @@ -0,0 +1,73 @@ +--- +title: Minimum values by overview +description: Get the N smallest values with accompanying data +sidebarTitle: Overview +--- + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +Get the N smallest values from a column, with an associated piece of data per +value. For example, you can return an accompanying column, or the full row. + +The `min_n_by()` functions give the same results as the regular SQL query +`SELECT ... ORDER BY ... LIMIT n`. But unlike the SQL query, they can be +composed and combined like other aggregate hyperfunctions. + +To get the N largest values with accompanying data, use +[`max_n_by()`][max_n_by]. To get the N smallest values without accompanying +data, use [`min_n()`][min_n]. + +This function group uses the [two-step aggregation][two-step-aggregation] +pattern. In addition to the usual aggregate function [`min_n_by`][min_n_by], it also +includes accessors and rollup functions. + +## Two-step aggregation + + + +## Samples + +This example assumes that you have a table of stock trades in this format: + +```sql +CREATE TABLE stock_sales( + ts TIMESTAMPTZ, + symbol TEXT, + price FLOAT, + volume INT +); +``` + +Find the 10 smallest transactions in the table, what time they occurred, and +what symbol was being traded. 
+ +```sql +SELECT + (data).ts, + (data).symbol, + value AS transaction +FROM + into_values(( + SELECT min_n_by(price * volume, stock_sales, 10) + FROM stock_sales + ), + NULL::stock_sales); +``` + +## Available functions + +### Aggregate +- [`min_n_by()`][min_n_by]: construct an aggregate that keeps track of the smallest values and associated data + +### Accessors +- [`into_values()`][into_values]: return the N lowest values with their associated data + +### Rollup +- [`rollup()`][rollup]: combine multiple MinNBy aggregates + +[two-step-aggregation]: #two-step-aggregation +[min_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/min_n_by +[max_n_by]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/max_n_by/max_n_by +[min_n]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n/min_n +[into_values]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/into_values +[rollup]: /api-reference/timescaledb/hyperfunctions/minimum-and-maximum/min_n_by/rollup \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/into_values.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/into_values.mdx new file mode 100644 index 0000000..434a849 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/into_values.mdx @@ -0,0 +1,70 @@ +--- +title: into_values() +description: Returns the lowest values and associated data from a MinNBy aggregate +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: accessor + aggregates: + - min_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This returns the smallest values seen by the aggregate and the +corresponding values associated with them. Note that Postgres requires +an input argument with type matching the associated value in order to +determine the response type. + + +## Samples + +Find the bottom 5 values from `i * 13 % 10007` for i = 1 to 10000, and +the integer result of the operation that generated that modulus. + +```sql +SELECT into_values( + min_n_by(sub.mod, sub.div, 5), + NULL::INT) +FROM ( + SELECT (i * 13) % 10007 AS mod, (i * 13) / 10007 AS div + FROM generate_series(1,10000) as i +) sub; +``` + +Output: + +```sql +into_values +------------- +(1,9) +(2,5) +(3,1) +(4,10) +(5,6) +``` + +## Arguments + +The syntax is: + +```sql +into_values ( + agg MinN +) SETOF BIGINT | SETOF DOUBLE PRECISION | SETOF TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MinNBy | - | ✔ | The aggregate to return the results from. Note that the exact type here varies based on the type of data stored. | +| `dummy` | ANYELEMENT | - | ✔ | This is purely to inform Postgres of the response type. A NULL cast to the appropriate type is typical. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| into_values | SETOF BIGINT \| SETOF DOUBLE PRECISION \| SETOF TIMESTAMPTZ | The lowest values seen while creating this aggregate. 
| diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/min_n_by.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/min_n_by.mdx new file mode 100644 index 0000000..620c508 --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/min_n_by.mdx @@ -0,0 +1,44 @@ +--- +title: min_n_by() +description: Track the smallest values and associated data in a set of values +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: aggregate + aggregates: + - min_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +Construct an aggregate that keeps track of the smallest values passed through it, as well as some associated data which +is passed alongside the value. + + +## Arguments + +The syntax is: + +```sql +min_n_by( + value BIGINT | DOUBLE PRECISION | TIMESTAMPTZ, + data ANYELEMENT, + capacity BIGINT +) MinNBy +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `value` | BIGINT \| DOUBLE PRECISION \| TIMESTAMPTZ | - | ✔ | The values passed into the aggregate | +| `data` | ANYELEMENT | - | ✔ | The data associated with a particular value | +| `capacity` | BIGINT | - | ✔ | The number of values to retain. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| min_n_by | MinNBy | The compiled aggregate. Note that the exact type is `MinByInts`, `MinByFloats`, or `MinByTimes` depending on the input type | diff --git a/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/rollup.mdx b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/rollup.mdx new file mode 100644 index 0000000..989d15c --- /dev/null +++ b/api-reference/timescaledb-toolkit/minimum-and-maximum/min_n_by/rollup.mdx @@ -0,0 +1,41 @@ +--- +title: rollup() +description: Combine multiple MinNBy aggregates +topics: [hyperfunctions] +tags: [hyperfunctions, toolkit, minimum] +license: community +type: function +toolkit: true +hyperfunction: + family: minimum and maximum + type: rollup + aggregates: + - min_n_by() +products: [cloud, mst, self_hosted] +--- + + Since 1.16.0 + +This aggregate combines the aggregates generated by other `min_n_by` +aggregates and returns the minimum values and associated data found +across all the aggregated data. + + +## Arguments + +The syntax is: + +```sql +rollup( + agg MinN +) MinN +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `agg` | MinNBy | - | ✔ | The aggregates being combined | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | MinN | An aggregate over all of the contributing values. 
| diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/index.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/index.mdx new file mode 100644 index 0000000..2eada94 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/index.mdx @@ -0,0 +1,93 @@ +--- +title: Percentile approximation overview +sidebarTitle: Overview +description: Estimate percentile values and percentile ranks using memory-efficient approximation algorithms +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.0.0 + +These functions are more CPU- and memory-efficient than exact calculations using PostgreSQL's `percentile_cont` and +`percentile_disc` functions, making them ideal for large datasets and continuous aggregates. + +TimescaleDB Toolkit provides two advanced percentile approximation algorithms: +- **UddSketch**: Produces stable estimates within a guaranteed relative error +- **t-digest**: More accurate at extreme quantiles, though somewhat dependent on input order + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Using percentile_agg (recommended for most use cases) + +Create an hourly continuous aggregate and calculate daily percentiles: + +```sql +CREATE MATERIALIZED VIEW foo_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) AS bucket, + percentile_agg(value) AS pct_agg +FROM foo +GROUP BY 1; + +-- Query daily percentiles +SELECT + time_bucket('1 day'::interval, bucket) AS bucket, + approx_percentile(0.95, rollup(pct_agg)) AS p95, + approx_percentile(0.99, rollup(pct_agg)) AS p99 +FROM foo_hourly +GROUP BY 1; +``` + +### Using uddsketch for custom error control + +Aggregate percentile data with specific error bounds: + +```sql +SELECT + time_bucket('1 day'::interval, ts) AS day, + uddsketch(200, 0.001, value) AS sketch +FROM measurements +GROUP BY day; +``` + +### Using tdigest for extreme quantiles + +Calculate percentiles at extreme ends of the distribution: + +```sql +CREATE MATERIALIZED VIEW response_times_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) AS bucket, + tdigest(100, response_time) AS digest +FROM requests +GROUP BY 1; + +-- Query for extreme percentiles +SELECT + bucket, + approx_percentile(0.999, digest) AS p999, + approx_percentile(0.9999, digest) AS p9999 +FROM response_times_hourly; +``` + +## Available functions + +### UddSketch (recommended) +- [`uddsketch()`][uddsketch]: estimate percentiles using the UddSketch algorithm with guaranteed relative error + +### t-digest (for extreme quantiles) +- [`tdigest()`][tdigest]: estimate percentiles using the t-digest algorithm, optimized for extreme quantiles + +[two-step-aggregation]: #two-step-aggregation +[uddsketch]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/index +[tdigest]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/index diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile.mdx new file mode 100644 index 0000000..cee5878 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile.mdx @@ -0,0 +1,55 @@ +--- +title: approx_percentile() +description: Estimate the value at a given percentile from a 
`tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Estimate the approximate value at a percentile from a `tdigest` aggregate. + +## Samples + +Estimate the value at the first percentile, given a sample containing the numbers from 0 to 100. + +```sql +SELECT + approx_percentile(0.01, tdigest(data)) +FROM generate_series(0, 100) data; +``` + +``` +approx_percentile +------------------- + 0.999 +``` + +## Arguments + +The syntax is: + +```sql +approx_percentile( + percentile DOUBLE PRECISION, + tdigest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| percentile | DOUBLE PRECISION | - |  | The percentile to compute. Must be within the range `[0.0, 1.0]` | +| tdigest | TDigest | - |  | The `tdigest` aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_percentile | DOUBLE PRECISION | The estimated value at the requested percentile. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile_rank.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile_rank.mdx new file mode 100644 index 0000000..8119732 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/approx_percentile_rank.mdx @@ -0,0 +1,55 @@ +--- +title: approx_percentile_rank() +description: Estimate the percentile of a given value from a `tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Estimate the percentile at which a given value would be located. + +## Samples + +Estimate the percentile rank of the value `99`, given a sample containing the numbers from 0 to 100. + +```sql +SELECT + approx_percentile_rank(99, tdigest(data)) +FROM generate_series(0, 100) data; +``` + +``` +approx_percentile_rank +---------------------------- + 0.9851485148514851 +``` + +## Arguments + +The syntax is: + +```sql +approx_percentile_rank( + value DOUBLE PRECISION, + digest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| value | DOUBLE PRECISION | - |  | The value to estimate the percentile of | +| digest | TDigest | - |  | The `tdigest` aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_percentile_rank | DOUBLE PRECISION | The estimated percentile associated with the provided value. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/index.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/index.mdx new file mode 100644 index 0000000..485f496 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/index.mdx @@ -0,0 +1,87 @@ +--- +title: t-digest overview +sidebarTitle: Overview +description: Percentile approximation optimized for extreme quantiles using the t-digest algorithm +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.0.0 + +Estimate the value at a given percentile, or the percentile rank of a given value, using the t-digest algorithm. 
This +estimation is more memory- and CPU-efficient than an exact calculation using PostgreSQL's `percentile_cont` and +`percentile_disc` functions. + +`tdigest` is one of two advanced percentile approximation aggregates provided in TimescaleDB Toolkit. It is a +space-efficient aggregation, and it provides more accurate estimates at extreme quantiles than traditional methods. + +`tdigest` is somewhat dependent on input order. If `tdigest` is run on the same data arranged in different order, the +results should be nearly equal, but they are unlikely to be exact. + +The other advanced percentile approximation aggregate is [`uddsketch`][uddsketch], which produces stable estimates +within a guaranteed relative error. If you aren't sure which to use, try the default percentile estimation method, +[`percentile_agg`][percentile_agg]. It uses the `uddsketch` algorithm with some sensible defaults. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Aggregate and roll up percentile data to calculate daily percentiles + +Create an hourly continuous aggregate that contains a percentile aggregate: + +```sql +CREATE MATERIALIZED VIEW foo_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) AS bucket, + tdigest(100, value) AS tdigest +FROM foo +GROUP BY 1; +``` + +Use accessors to query directly from the continuous aggregate for hourly data. You can also roll the hourly data up into +daily buckets, then calculate approximate percentiles: + +```sql +SELECT + time_bucket('1 day'::interval, bucket) AS bucket, + approx_percentile(0.95, rollup(tdigest)) AS p95, + approx_percentile(0.99, rollup(tdigest)) AS p99 +FROM foo_hourly +GROUP BY 1; +``` + +## Available functions + +### Aggregate +- [`tdigest()`][tdigest]: aggregate data in a t-digest for percentile calculation + +### Accessors +- [`approx_percentile()`][approx_percentile]: estimate the value at a given percentile from a t-digest +- [`approx_percentile_rank()`][approx_percentile_rank]: estimate the percentile rank of a given value from a t-digest +- [`max_val()`][max_val]: get the maximum value from a t-digest +- [`mean()`][mean]: calculate the exact mean from values in a t-digest +- [`min_val()`][min_val]: get the minimum value from a t-digest +- [`num_vals()`][num_vals]: get the number of values in a t-digest + +### Rollup +- [`rollup()`][rollup]: combine multiple t-digest aggregates + +[two-step-aggregation]: #two-step-aggregation +[uddsketch]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/index +[percentile_agg]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/percentile_agg +[tdigest]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/tdigest +[approx_percentile]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/approx_percentile +[approx_percentile_rank]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/approx_percentile_rank +[max_val]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/max_val +[mean]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/mean +[min_val]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/min_val +[num_vals]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/num_vals +[rollup]: 
/api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/rollup \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/max_val.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/max_val.mdx new file mode 100644 index 0000000..9609b95 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/max_val.mdx @@ -0,0 +1,52 @@ +--- +title: max_val() +description: Get the maximum value from a `tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Get the maximum value from a `tdigest`. This accessor allows you to calculate the maximum alongside percentiles, without +needing to create two separate aggregates from the same raw data. + +## Samples + +Get the maximum of the integers from 1 to 100. + +```sql +SELECT max_val(tdigest(100, data)) + FROM generate_series(1, 100) data; +``` + +``` +max_val +--------- + 100 +``` +## Arguments + +The syntax is: + +```sql +max_val( + digest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| digest | TDigest | - |  | The digest to extract the max value from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| max_val | DOUBLE PRECISION | The maximum value entered into the `tdigest` | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/mean.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/mean.mdx new file mode 100644 index 0000000..f950e24 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/mean.mdx @@ -0,0 +1,54 @@ +--- +title: mean() +description: Calculate the exact mean from values in a `tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Calculate the exact mean of the values in a `tdigest` aggregate. Unlike percentile calculations, the mean calculation is +exact. This accessor allows you to calculate the mean alongside percentiles, without needing to create two separate +aggregates from the same raw data. + +## Samples + +Calculate the mean of the integers from 0 to 100. + +```sql +SELECT mean(tdigest(data)) +FROM generate_series(0, 100) data; +``` + +``` +mean +------ +50 +``` + +## Arguments + +The syntax is: + +```sql +mean( + digest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| digest | TDigest | - |  | The `tdigest` aggregate to extract the mean from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| mean | DOUBLE PRECISION | The mean of the values in the `uddsketch`. 
| + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/min_val.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/min_val.mdx new file mode 100644 index 0000000..4a9b457 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/min_val.mdx @@ -0,0 +1,53 @@ +--- +title: min_val() +description: Get the minimum value from a `tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Get the minimum value from a `tdigest`. This accessor allows you to calculate the minimum alongside percentiles, without +needing to create two separate aggregates from the same raw data. + +## Samples + +Get the minimum of the integers from 1 to 100. + +```sql +SELECT min_val(tdigest(100, data)) + FROM generate_series(1, 100) data; +``` + +``` +min_val +--------- + 1 +``` + +## Arguments + +The syntax is: + +```sql +min_val( + digest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| digest | TDigest | - |  | The digest to extract the minimum value from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| min_val | DOUBLE PRECISION | The minimum value entered into the `tdigest` | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/num_vals.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/num_vals.mdx new file mode 100644 index 0000000..1cf7311 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/num_vals.mdx @@ -0,0 +1,53 @@ +--- +title: num_vals() +description: Get the number of values contained in a `tdigest` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Get the number of values contained in a `tdigest` aggregate. This accessor allows you to calculate a count alongside +percentiles, without needing to create two separate aggregates from the same raw data. + +## Samples + +Count the number of integers from 0 to 100. + +```sql +SELECT num_vals(tdigest(data)) +FROM generate_series(0, 100) data; +``` + +``` +num_vals +----------- + 101 +``` + +## Arguments + +The syntax is: + +```sql +num_vals( + digest TDigest +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| digest | TDigest | - |  | The `tdigest` aggregate to extract the number of values from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_vals | DOUBLE PRECISION | The number of values in the `uddsketch`. 
| + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/rollup.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/rollup.mdx new file mode 100644 index 0000000..99502ee --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/rollup.mdx @@ -0,0 +1,37 @@ +--- +title: rollup() +description: Roll up multiple `tdigest`s +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: rollup + aggregates: + - tdigest() +--- + + Since 1.0.0 + +Combine multiple intermediate `tdigest` aggregates, produced by `tdigest`, into a single intermediate `tdigest` +aggregate. For example, you can use `rollup` to combine `tdigest`s from 15-minute buckets into daily buckets. + +## Arguments + +The syntax is: + +```sql +rollup( + digest TDigest +) RETURNS TDigest +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| digest | TDigest | - |  | The `tdigest`s to roll up | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | TDigest | A new `tdigest` created by combining the input `tdigests` | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/tdigest.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/tdigest.mdx new file mode 100644 index 0000000..a82ace3 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/tdigest/tdigest.mdx @@ -0,0 +1,52 @@ +--- +title: tdigest() +description: Aggregate data in a `tdigest` for further calculation of percentile estimates +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: aggregate + aggregates: + - tdigest() +--- + + Since 1.0.0 + +This is the first step for calculating approximate percentiles with the `tdigest` algorithm. Use `tdigest` to create an +intermediate aggregate from your raw data. This intermediate form can then be used by one or more accessors in this +group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined using [`rollup()`](#rollup) before an accessor is applied. + +## Samples + +Given a table called `samples`, with a column called `data`, build a `tdigest` using the `data` column. Use 100 buckets +for the approximation. + +```sql +SELECT tdigest(100, data) FROM samples; +``` + +## Arguments + +The syntax is: + +```sql +tdigest( + buckets INTEGER, + value DOUBLE PRECISION +) RETURNS TDigest +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| buckets | INTEGER | - |  | Number of buckets in the digest. 
Increasing this provides more accurate quantile estimates, but requires more memory | +| value | DOUBLE PRECISION | - |  | Column of values to aggregate for the `tdigest` object | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| tdigest | TDigest | A percentile estimator object created to calculate percentiles using the `tdigest` algorithm | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile.mdx new file mode 100644 index 0000000..6a48c64 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile.mdx @@ -0,0 +1,55 @@ +--- +title: approx_percentile() +description: estimate the value at a given percentile from a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Estimate the approximate value at a percentile from a `uddsketch` aggregate. + +## Samples + +Estimate the value at the first percentile, given a sample containing the numbers from 0 to 100. + +```sql +SELECT + approx_percentile(0.01, uddsketch(data)) +FROM generate_series(0, 100) data; +``` + +```sql +approx_percentile +------------------- + 0.999 +``` + +## Arguments + +The syntax is: + +```sql +approx_percentile( + percentile DOUBLE PRECISION, + uddsketch UddSketch +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| percentile | DOUBLE PRECISION | - | ✔ | the percentile to compute. Must be within the range `[0.0, 1.0]` | +| sketch | UddSketch | - | ✔ | the `uddsketch` aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_percentile | DOUBLE PRECISION | The estimated value at the requested percentile. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_array.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_array.mdx new file mode 100644 index 0000000..4953dce --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_array.mdx @@ -0,0 +1,55 @@ +--- +title: approx_percentile_array() +description: estimate the values for an array of given percentiles from a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.16.0 + +Estimate the approximate values of an array of percentiles from a `uddsketch` aggregate. + +## Samples + +Estimate the value at the 90th, 50th, and 20th percentiles, given a sample containing the numbers from 0 to 100. + +```sql +SELECT + approx_percentile_array(array[0.9,0.5,0.2], uddsketch(100,0.005,data)) +FROM generate_series(0, 100) data; +``` + +```sql +approx_percentile_array +------------------- + {90.0,50.0,20.0} +``` + +## Arguments + +The syntax is: + +```sql +approx_percentile_array( + percentiles DOUBLE PRECISION[], + uddsketch UddSketch +) RETURNS DOUBLE PRECISION[] +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| percentiles | DOUBLE PRECISION[] | - | ✔ | array of percentiles to compute. 
Must be within the range `[0.0, 1.0]` | +| sketch | UddSketch | - | ✔ | the `uddsketch` aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_percentile_array | DOUBLE PRECISION[] | The estimated values at the requested percentiles. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_rank.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_rank.mdx new file mode 100644 index 0000000..3b6815a --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/approx_percentile_rank.mdx @@ -0,0 +1,55 @@ +--- +title: approx_percentile_rank() +description: estimate the percentile of a given value from a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Estimate the percentile at which a given value would be located. + +## Samples + +Estimate the percentile rank of the value `99`, given a sample containing the numbers from 0 to 100. + +```sql +SELECT + approx_percentile_rank(99, uddsketch(data)) +FROM generate_series(0, 100) data; +``` + +```sql +approx_percentile_rank +---------------------------- + 0.9851485148514851 +``` + +## Arguments + +The syntax is: + +```sql +approx_percentile_rank( + value DOUBLE PRECISION, + sketch UddSketch +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| value | DOUBLE PRECISION | - | ✔ | the value to estimate the percentile of | +| sketch | UddSketch | - | ✔ | the `uddsketch` aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| approx_percentile_rank | DOUBLE PRECISION | The estimated percentile associated with the provided value. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/error.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/error.mdx new file mode 100644 index 0000000..792b033 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/error.mdx @@ -0,0 +1,52 @@ +--- +title: error() +description: Get the maximum relative error for a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Get the maximum relative error of a `uddsketch`. The correct (non-estimated) percentile falls within the range defined +by `approx_percentile(sketch) +/- (approx_percentile(sketch) * error(sketch))`. + +## Samples + +Calculate the maximum relative error when estimating percentiles using `uddsketch`. + +```sql +SELECT error(uddsketch(data)) +FROM generate_series(0, 100) data; +``` + +``` +error +------- +0.001 +``` +## Arguments + +The syntax is: + +```sql +error( + sketch UddSketch +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| sketch | UddSketch | - |  | The `uddsketch` to determine the error of | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| error | DOUBLE PRECISION | The maximum relative error of any percentile estimate. 
| + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/index.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/index.mdx new file mode 100644 index 0000000..f309026 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/index.mdx @@ -0,0 +1,115 @@ +--- +title: UddSketch overview +sidebarTitle: Overview +description: Percentile approximation with guaranteed relative error using the UddSketch algorithm +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.0.0 + +Estimate the value at a given percentile, or the percentile rank of a given value, using the UddSketch algorithm. This +estimation is more memory- and CPU-efficient than an exact calculation using PostgreSQL's `percentile_cont` and +`percentile_disc` functions. + +`uddsketch` is one of two advanced percentile approximation aggregates provided in TimescaleDB Toolkit. It produces +stable estimates within a guaranteed relative error. + +The other advanced percentile approximation aggregate is [`tdigest`][tdigest], which is more accurate at extreme +quantiles, but is somewhat dependent on input order. + +If you aren't sure which aggregate to use, try the default percentile estimation method, +[`percentile_agg`][percentile_agg]. It uses the `uddsketch` algorithm with some sensible defaults. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Aggregate and roll up percentile data using percentile_agg + +Create an hourly continuous aggregate that contains a percentile aggregate: + +```sql +CREATE MATERIALIZED VIEW foo_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) AS bucket, + percentile_agg(value) AS pct_agg +FROM foo +GROUP BY 1; +``` + +Use accessors to query directly from the continuous aggregate for hourly data. You can also roll the hourly data up into +daily buckets, then calculate approximate percentiles: + +```sql +SELECT + time_bucket('1 day'::interval, bucket) AS bucket, + approx_percentile(0.95, rollup(pct_agg)) AS p95, + approx_percentile(0.99, rollup(pct_agg)) AS p99 +FROM foo_hourly +GROUP BY 1; +``` + +### Aggregate and roll up percentile data using uddsketch + +Create an hourly continuous aggregate that contains a percentile aggregate: + +```sql +CREATE MATERIALIZED VIEW foo_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) AS bucket, + uddsketch(100, 0.01, value) AS uddsketch +FROM foo +GROUP BY 1; +``` + +Use accessors to query directly from the continuous aggregate for hourly data. 
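+For example, a minimal sketch of such a direct hourly query, reusing the `foo_hourly`
+view and its `uddsketch` column defined above:
+
+```sql
+-- Read the hourly 95th percentile straight from the continuous aggregate
+SELECT
+    bucket,
+    approx_percentile(0.95, uddsketch) AS hourly_p95
+FROM foo_hourly;
+```
+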
You can also roll the hourly data up into +daily buckets, then calculate approximate percentiles: + +```sql +SELECT + time_bucket('1 day'::interval, bucket) AS bucket, + approx_percentile(0.95, rollup(uddsketch)) AS p95, + approx_percentile(0.99, rollup(uddsketch)) AS p99 +FROM foo_hourly +GROUP BY 1; +``` + +## Available functions + +### Aggregate +- [`uddsketch()`][uddsketch]: aggregate data in a uddsketch for percentile calculation + +### Alternate aggregate +- [`percentile_agg()`][percentile_agg]: aggregate data using sensible defaults for percentile calculation + +### Accessors +- [`approx_percentile()`][approx_percentile]: estimate the value at a given percentile from a uddsketch +- [`approx_percentile_array()`][approx_percentile_array]: estimate values at multiple percentiles from a uddsketch +- [`approx_percentile_rank()`][approx_percentile_rank]: estimate the percentile rank of a given value from a uddsketch +- [`error()`][error]: get the maximum relative error of a uddsketch +- [`mean()`][mean]: calculate the exact mean from values in a uddsketch +- [`num_vals()`][num_vals]: get the number of values in a uddsketch + +### Rollup +- [`rollup()`][rollup]: combine multiple uddsketch aggregates + +[two-step-aggregation]: #two-step-aggregation +[tdigest]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/tdigest/index +[uddsketch]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/uddsketch +[percentile_agg]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/percentile_agg +[approx_percentile]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/approx_percentile +[approx_percentile_array]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/approx_percentile_array +[approx_percentile_rank]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/approx_percentile_rank +[error]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/error +[mean]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/mean +[num_vals]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/num_vals +[rollup]: /api-reference/timescaledb/hyperfunctions/percentile-approximation/uddsketch/rollup \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/mean.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/mean.mdx new file mode 100644 index 0000000..120aa97 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/mean.mdx @@ -0,0 +1,53 @@ +--- +title: mean() +description: Calculate the exact mean from values in a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Calculate the exact mean of the values in a `uddsketch`. Unlike percentile calculations, the mean calculation is exact. +This accessor allows you to calculate the mean alongside percentiles, without needing to create two separate aggregates +from the same raw data. + +## Samples + +Calculate the mean of the integers from 0 to 100. 
+ +```sql +SELECT mean(uddsketch(data)) +FROM generate_series(0, 100) data; +``` + +``` +mean +------ +50 +``` +## Arguments + +The syntax is: + +```sql +mean( + sketch UddSketch +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| sketch | UddSketch | - |  | The `uddsketch` to extract the mean from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| mean | DOUBLE PRECISION | The mean of the values in the `uddsketch`. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/num_vals.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/num_vals.mdx new file mode 100644 index 0000000..4fd1a9d --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/num_vals.mdx @@ -0,0 +1,53 @@ +--- +title: num_vals() +description: Get the number of values contained in a `uddsketch` +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: accessor + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Get the number of values contained in a `uddsketch`. This accessor allows you to calculate a count alongside +percentiles, without needing to create two separate aggregates from the same raw data. + +## Samples + +Count the number of integers from 0 to 100. + +```sql +SELECT num_vals(uddsketch(data)) +FROM generate_series(0, 100) data; +``` + +``` +num_vals +----------- + 101 +``` + +## Arguments + +The syntax is: + +```sql +num_vals( + sketch UddSketch +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| sketch | UddSketch | - |  | The `uddsketch` to extract the number of values from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_vals | DOUBLE PRECISION | The number of values in the `uddsketch`. | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/percentile_agg.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/percentile_agg.mdx new file mode 100644 index 0000000..8e3a31a --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/percentile_agg.mdx @@ -0,0 +1,59 @@ +--- +title: percentile_agg() +description: aggregate data in a uddsketch, using some reasonable default values, for further calculation of percentile estimates +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: alternate aggregate + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Aggregate data in a UddSketch, using reasonable default values, for further calculation of percentile estimates. This is +an alternate first step for calculating approximate percentiles. It provides added convenience by using sensible +defaults to create a `UddSketch`. Internally, it calls `uddsketch` with 200 buckets and a maximum error rate of 0.001. + +Use `percentile_agg` to create an intermediate aggregate from your raw data. This intermediate form can then be used by +one or more accessors in this group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined using [`rollup()`](#rollup) before an accessor is applied. + +## Samples + +Create a continuous aggregate that stores percentile aggregate objects. 
These objects can later be used with multiple +accessors for retrospective analysis. + +```sql +CREATE MATERIALIZED VIEW foo_hourly +WITH (timescaledb.continuous) +AS SELECT + time_bucket('1 h'::interval, ts) as bucket, + percentile_agg(value) as pct_agg +FROM foo +GROUP BY 1; +``` + +## Arguments + +The syntax is: + +```sql +percentile_agg( + value DOUBLE PRECISION +) RETURNS UddSketch +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| value | DOUBLE PRECISION | - | ✔ | column of values to aggregate for percentile calculation | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| percentile_agg | UddSketch | A percentile estimator object created to calculate percentiles using the `UddSketch` algorithm | + diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/rollup.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/rollup.mdx new file mode 100644 index 0000000..aff547a --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/rollup.mdx @@ -0,0 +1,37 @@ +--- +title: rollup() +description: Roll up multiple `uddsketch`es +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: rollup + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Combine multiple intermediate `uddsketch` aggregates, produced by `uddsketch`, into a single intermediate `uddsketch` +aggregate. For example, you can use `rollup` to combine `uddsketch`es from 15-minute buckets into daily buckets. + +## Arguments + +The syntax is: + +```sql +rollup( + sketch UddSketch +) RETURNS UddSketch +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| sketch | UddSketch | - |  | The `uddsketch` aggregates to roll up | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | UddSketch | A new `uddsketch` aggregate created by combining the input `uddsketch` aggregates. | diff --git a/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/uddsketch.mdx b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/uddsketch.mdx new file mode 100644 index 0000000..2873e75 --- /dev/null +++ b/api-reference/timescaledb-toolkit/percentile-approximation/uddsketch/uddsketch.mdx @@ -0,0 +1,56 @@ +--- +title: uddsketch() +description: aggregate data in a `uddsketch` for further calculation of percentile estimates +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: percentile approximation + type: aggregate + aggregates: + - uddsketch() +--- + + Since 1.0.0 + +Aggregate data in a `uddsketch` for further calculation of percentile estimates. This is the first step for calculating +approximate percentiles with the `uddsketch` algorithm. Use `uddsketch` to create an intermediate aggregate from your +raw data. This intermediate form can then be used by one or more accessors in this group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined using [`rollup()`](#rollup) before an accessor is applied. + +If you aren't sure what values to set for `size` and `max_error`, try using the alternate aggregate function, [`percentile_agg()`](#percentile_agg). 
`percentile_agg` also creates a `UddSketch`, but it sets sensible default values for `size` and `max_error` that should work for many use cases. + +## Samples + +Build a `uddsketch` using a column called `data` from a table called `samples`. Use a maximum of 100 buckets and a +relative error of 0.01. + +```sql +SELECT uddsketch(100, 0.01, data) FROM samples; +``` + +## Arguments + +The syntax is: + +```sql +uddsketch( + size INTEGER, + max_error DOUBLE PRECISION, + value DOUBLE PRECISION +) RETURNS UddSketch +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| size | INTEGER | - | ✔ | maximum number of buckets in the `uddsketch`. Providing a larger value here makes it more likely that the aggregate is able to maintain the desired error, but potentially increases the memory usage | +| max_error | DOUBLE PRECISION | - | ✔ | the desired maximum relative error of the sketch. The true error may exceed this if too few buckets are provided for the data distribution. You can get the true error using the [`error`](#error) function | +| value | DOUBLE PRECISION | - | ✔ | the column to aggregate for further calculation | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| uddsketch | UddSketch | A percentile estimator object created to calculate percentiles using the `uddsketch` algorithm | + diff --git a/api-reference/timescaledb-toolkit/saturating-math/index.mdx b/api-reference/timescaledb-toolkit/saturating-math/index.mdx new file mode 100644 index 0000000..9edc742 --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/index.mdx @@ -0,0 +1,40 @@ +--- +title: Saturating math overview +sidebarTitle: Overview +description: Perform saturating math operations on integers +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access + +The saturating math hyperfunctions help you perform saturating math on integers. In saturating math, the final result is +bounded. If the result of a normal mathematical operation exceeds either the minimum or maximum bound, the result of the +corresponding saturating math operation is capped at the bound. For example, `2 + (-3) = -1`. But in a saturating math +function with a lower bound of `0`, such as [`saturating_add_pos`][saturating_add_pos], the result is `0`. + +You can use saturating math to make sure your results don't overflow the allowed range of integers, or to force a result +to be greater than or equal to zero. 
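+
+For example, this minimal sketch compares plain addition with its saturating
+counterpart. It assumes the functions are callable by their bare names; depending on
+your Toolkit version, you may need to schema-qualify them with `toolkit_experimental`.
+
+```sql
+-- Plain addition drops below 0; saturating_add_pos clamps the result at its lower bound.
+-- Schema-qualify with toolkit_experimental if your Toolkit version requires it.
+SELECT
+    2 + (-3)                  AS plain_sum,       -- -1
+    saturating_add_pos(2, -3) AS saturating_sum;  -- 0
+```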
+ +## Available functions + +### Addition +- [`saturating_add()`][saturating_add]: add two numbers, saturating at the 32-bit integer bounds instead of overflowing +- [`saturating_add_pos()`][saturating_add_pos]: add two numbers, saturating at 0 for the minimum bound + +### Subtraction +- [`saturating_sub()`][saturating_sub]: subtract one number from another, saturating at the 32-bit integer bounds + instead of overflowing +- [`saturating_sub_pos()`][saturating_sub_pos]: subtract one number from another, saturating at 0 for the minimum bound + +### Multiplication +- [`saturating_mul()`][saturating_mul]: multiply two numbers, saturating at the 32-bit integer bounds instead of + overflowing + +[saturating_add]: /api-reference/timescaledb/hyperfunctions/saturating-math/saturating_add +[saturating_add_pos]: /api-reference/timescaledb/hyperfunctions/saturating-math/saturating_add_pos +[saturating_sub]: /api-reference/timescaledb/hyperfunctions/saturating-math/saturating_sub +[saturating_sub_pos]: /api-reference/timescaledb/hyperfunctions/saturating-math/saturating_sub_pos +[saturating_mul]: /api-reference/timescaledb/hyperfunctions/saturating-math/saturating_mul \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/saturating-math/saturating_add.mdx b/api-reference/timescaledb-toolkit/saturating-math/saturating_add.mdx new file mode 100644 index 0000000..df7ab41 --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/saturating_add.mdx @@ -0,0 +1,38 @@ +--- +title: saturating_add() +description: Add two numbers, saturating at the 32-bit integer bounds instead of overflowing +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: saturating math + type: function +--- + + Early access + +The `saturating_add` function adds two numbers, saturating at -2,147,483,648 and 2,147,483,647 instead of overflowing. + + +## Arguments + +The syntax is: + +```sql +saturating_add( + x INT, + y INT +) RETURNS INT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| x | INT | - | ✔ | An integer to add to `y` | +| y | INT | - | ✔ | An integer to add to `x` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| saturating_add | INT | The result of `x + y`, saturating at the numeric bounds instead of overflowing. The numeric bounds are the upper and lower bounds of the 32-bit signed integers | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/saturating-math/saturating_add_pos.mdx b/api-reference/timescaledb-toolkit/saturating-math/saturating_add_pos.mdx new file mode 100644 index 0000000..b5c3b1d --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/saturating_add_pos.mdx @@ -0,0 +1,38 @@ +--- +title: saturating_add_pos() +description: Add two numbers, saturating at 0 for the minimum bound +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: saturating math + type: function +--- + + Early access + +The `saturating_add_pos` function adds two numbers, saturating at 0 and 2,147,483,647 instead of overflowing. 
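+
+## Samples
+
+A minimal sketch showing both bounds. Depending on your Toolkit version, you may need
+to schema-qualify these calls with `toolkit_experimental`.
+
+```sql
+SELECT
+    saturating_add_pos(10, -15)        AS clamped_low,   -- -5 would cross the lower bound, so the result is 0
+    saturating_add_pos(2147483647, 10) AS clamped_high;  -- capped at 2,147,483,647
+```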
+ + +## Arguments + +The syntax is: + +```sql +saturating_add_pos( + x INT, + y INT +) RETURNS INT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| x | INT | - | ✔ | An integer to add to `y` | +| y | INT | - | ✔ | An integer to add to `x` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| saturating_add_pos | INT | The result of `x + y`, saturating at 0 for the minimum bound | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/saturating-math/saturating_mul.mdx b/api-reference/timescaledb-toolkit/saturating-math/saturating_mul.mdx new file mode 100644 index 0000000..9f5e8c6 --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/saturating_mul.mdx @@ -0,0 +1,39 @@ +--- +title: saturating_mul() +description: Multiply two numbers, saturating at the 32-bit integer bounds instead of overflowing +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: saturating math + type: function +--- + + Early access + +The `saturating_mul` function multiplies two numbers, saturating at -2,147,483,648 and 2,147,483,647 instead of +overflowing. + + +## Arguments + +The syntax is: + +```sql +saturating_mul( + x INT, + y INT +) RETURNS INT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| x | INT | - | ✔ | An integer to multiply with `y` | +| y | INT | - | ✔ | An integer to multiply with `x` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| saturating_mul | INT | The result of `x * y`, saturating at the numeric bounds instead of overflowing. The numeric bounds are the upper and lower bounds of the 32-bit signed integers | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/saturating-math/saturating_sub.mdx b/api-reference/timescaledb-toolkit/saturating-math/saturating_sub.mdx new file mode 100644 index 0000000..4d3f7fe --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/saturating_sub.mdx @@ -0,0 +1,39 @@ +--- +title: saturating_sub() +description: Subtract one number from another, saturating at the 32-bit integer bounds instead of overflowing +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: saturating math + type: function +--- + + Early access + +The `saturating_sub` function subtracts the second number from the first, saturating at -2,147,483,648 and 2,147,483,647 +instead of overflowing. + + +## Arguments + +The syntax is: + +```sql +saturating_sub( + x INT, + y INT +) RETURNS INT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| x | INT | - | ✔ | An integer for `y` to subtract from | +| y | INT | - | ✔ | An integer to subtract from `x` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| saturating_sub | INT | The result of `x - y`, saturating at the numeric bounds instead of overflowing. 
The numeric bounds are the upper and lower bounds of the 32-bit signed integers | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/saturating-math/saturating_sub_pos.mdx b/api-reference/timescaledb-toolkit/saturating-math/saturating_sub_pos.mdx new file mode 100644 index 0000000..3c1e254 --- /dev/null +++ b/api-reference/timescaledb-toolkit/saturating-math/saturating_sub_pos.mdx @@ -0,0 +1,39 @@ +--- +title: saturating_sub_pos() +description: Subtract one number from another, saturating at 0 for the minimum bound +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: saturating math + type: function +--- + + Early access + +The `saturating_sub_pos` function subtracts the second number from the first, saturating at 0 and 2,147,483,647 instead +of overflowing. + + +## Arguments + +The syntax is: + +```sql +saturating_sub_pos( + x INT, + y INT +) RETURNS INT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| x | INT | - | ✔ | An integer for `y` to subtract from | +| y | INT | - | ✔ | An integer to subtract from `x` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| saturating_sub_pos | INT | The result of `x - y`, saturating at 0 for the minimum bound | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/compact_state_agg.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/compact_state_agg.mdx new file mode 100644 index 0000000..d9045ee --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/compact_state_agg.mdx @@ -0,0 +1,47 @@ +--- +title: compact_state_agg() +description: Aggregate state data into a state aggregate for further analysis +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: aggregate + aggregates: + - compact_state_agg() +--- + + Early access 1.5.0 + +Aggregate a dataset containing state data into a state aggregate to track the time spent in each state. + +## Samples + +Create a state aggregate to track the status of some devices. 
+ +```sql +SELECT toolkit_experimental.compact_state_agg(time, status) FROM devices; +``` + +## Arguments + +The syntax is: + +```sql +compact_state_agg( + ts TIMESTAMPTZ, + value {TEXT | BIGINT} +) RETURNS StateAgg +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamps associated with each state reading | +| value | TEXT \| BIGINT | - | ✔ | The state at that time | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | StateAgg | An object storing the total time spent in each state | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/duration_in.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/duration_in.mdx new file mode 100644 index 0000000..ba1fa21 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/duration_in.mdx @@ -0,0 +1,77 @@ +--- +title: duration_in() +description: Calculate the total time spent in a given state from a state aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - compact_state_agg() +--- + + Early access 1.5.0 + +Calculate the total time spent in the given state from a state aggregate. If you need to interpolate missing values +across time bucket boundaries, use [`interpolated_duration_in`][interpolated_duration_in]. + +## Samples + +Create a test table that tracks when a system switches between `starting`, `running`, and `error` states. Query the +table for the time spent in the `running` state. + +If you prefer to see the result in seconds, [`EXTRACT`](https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) the epoch from the returned result. + +```sql +SET timezone TO 'UTC'; +CREATE TABLE states(time TIMESTAMPTZ, state TEXT); +INSERT INTO states VALUES + ('1-1-2020 10:00', 'starting'), + ('1-1-2020 10:30', 'running'), + ('1-3-2020 16:00', 'error'), + ('1-3-2020 18:30', 'starting'), + ('1-3-2020 19:30', 'running'), + ('1-5-2020 12:00', 'stopping'); + +SELECT toolkit_experimental.duration_in( + toolkit_experimental.compact_state_agg(time, state), + 'running' +) FROM states; +``` + +Returns: + +``` +duration_in +--------------- +3 days 22:00:00 +``` + + +The syntax is: + +```sql +duration_in( + agg StateAgg, + state {TEXT | BIGINT} + [, start TIMESTAMPTZ] + [, interval INTERVAL] +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`compact_state_agg`][compact_state_agg] | +| state | TEXT \| BIGINT | - | ✔ | The state to query | +| start | TIMESTAMPTZ | - | | If specified, only the time in the state after this time is returned | +| interval | INTERVAL | - | | If specified, only the time in the state from the start time to the end of the interval is returned | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| duration_in | INTERVAL | The time spent in the given state. Displayed in `days`, `hh:mm:ss`, or a combination of the two. 
| + +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/compact_state_agg +[interpolated_duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/interpolated_duration_in## Arguments diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/index.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/index.mdx new file mode 100644 index 0000000..3b7b814 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/index.mdx @@ -0,0 +1,63 @@ +--- +title: Compact state aggregation overview +sidebarTitle: Overview +description: Track the amount of time spent in each discrete state with compact_state_agg functions +--- + + Early access 1.5.0 + +Track the amount of time a system or value spends in each discrete state. For example, use the `compact_state_agg` +functions to track how much time a system spends in `error`, `running`, or `starting` states. + +`compact_state_agg` is designed to work with a relatively small number of states. It might not perform well on datasets +where states are mostly distinct between rows. + +If you need to track when each state is entered and exited, use the [`state_agg`][state_agg] functions. If you need to +track the liveness of a system based on a heartbeat signal, consider using the [`heartbeat_agg`][heartbeat_agg] +functions. + +## Two-step aggregation + +This group of functions uses the two-step aggregation pattern. + +Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate +function. + +Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple +intermediate aggregates with the rollup functions. + +The two-step aggregation pattern has several advantages: + +1. More efficient because multiple accessors can reuse the same aggregate +1. Easier to reason about performance, because aggregation is separate from final computation +1. Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and +[continuous aggregates][caggs] +1. Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores +extra information not available in the final result + +To learn more, see the [blog post on two-step aggregates][blog-two-step-aggregates]. 
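+
+For example, a minimal two-step sketch over a `states(time, state)` table, such as the
+one used in the `duration_in` example, first builds a per-day intermediate aggregate and
+then applies the `duration_in` accessor (following its documented `duration_in(agg, state)`
+signature):
+
+```sql
+SELECT
+    day,
+    -- Step 2: apply an accessor to each intermediate aggregate
+    toolkit_experimental.duration_in(agg, 'running') AS time_running
+FROM (
+    SELECT
+        time_bucket('1 day', time) AS day,
+        -- Step 1: build the intermediate aggregate per time bucket
+        toolkit_experimental.compact_state_agg(time, state) AS agg
+    FROM states
+    GROUP BY 1
+) AS daily;
+```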
+ +## Functions in this group + +### Aggregate +- [`compact_state_agg()`][compact_state_agg]: aggregate state data into an intermediate form for further computation + +### Accessors +- [`duration_in()`][duration_in]: get the total duration in the specified states +- [`interpolated_duration_in()`][interpolated_duration_in]: get the total duration in the specified states, + interpolating values at the boundary +- [`into_values()`][into_values]: return an array of `(state, duration)` pairs from the aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple intermediate aggregates + +[two-step-aggregation]: #two-step-aggregation +[blog-two-step-aggregates]: https://www.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design +[caggs]: /use-timescale/continuous-aggregates/about-continuous-aggregates/ +[heartbeat_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/ +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/ +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/compact_state_agg +[duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/duration_in +[interpolated_duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/interpolated_duration_in +[into_values]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/into_values +[rollup]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/rollup \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/interpolated_duration_in.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/interpolated_duration_in.mdx new file mode 100644 index 0000000..20c00bf --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/interpolated_duration_in.mdx @@ -0,0 +1,90 @@ +--- +title: interpolated_duration_in() +description: Calculate the total time spent in a given state from a state aggregate, interpolating values at time bucket boundaries +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - compact_state_agg() +--- + + Early access 1.8.0 + +Calculate the total duration in the given state. Unlike [`duration_in`][duration_in], you can use this function across +multiple state aggregates that cover multiple time buckets. Any missing values at the time bucket boundaries are +interpolated from adjacent state aggregates. + +## Samples + +Create a test table that tracks when a system switches between `starting`, `running`, and `error` states. Query the +table for the time spent in the `running` state. Use `LAG` and `LEAD` to get the neighboring aggregates for +interpolation. + +If you prefer to see the result in seconds, [`EXTRACT`][extract] the epoch from the returned result. 
+ +```sql +SELECT + time, + toolkit_experimental.interpolated_duration_in( + agg, + 'running', + time, + '1 day', + LAG(agg) OVER (ORDER BY time) +) FROM ( + SELECT + time_bucket('1 day', time) as time, + toolkit_experimental.compact_state_agg(time, state) as agg + FROM + states + GROUP BY time_bucket('1 day', time) +) s; +``` + +Returns: + +``` +time | interpolated_duration_in +------------------------+-------------------------- +2020-01-01 00:00:00+00 | 13:30:00 +2020-01-02 00:00:00+00 | 16:00:00 +2020-01-03 00:00:00+00 | 04:30:00 +2020-01-04 00:00:00+00 | 12:00:00 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_duration_in( + agg StateAgg, + state {TEXT | BIGINT}, + start TIMESTAMPTZ, + interval INTERVAL + [, prev StateAgg] +) RETURNS DOUBLE PRECISION +``` + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`compact_state_agg`][compact_state_agg] | +| state | TEXT \| BIGINT | - | ✔ | The state to query | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval to be calculated | +| interval | INTERVAL | - | ✔ | The length of the interval to be calculated | +| prev | StateAgg | - | | The state aggregate from the prior interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `aggregate` is used as the start of the interval | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_duration_in | INTERVAL | The total time spent in the queried state. Displayed as `days`, `hh:mm:ss`, or a combination of the two. | + + +[extract]:https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/compact_state_agg +[duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/duration_in \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/into_values.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/into_values.mdx new file mode 100644 index 0000000..873431b --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/into_values.mdx @@ -0,0 +1,65 @@ +--- +title: into_values() +description: Expand a state aggregate into a set of rows displaying the duration of each state +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - compact_state_agg() +--- + + Early access 1.6.0 + +Unpack the state aggregate into a set of rows with two columns, displaying the duration of each state. By default, the +columns are named `state` and `duration`. You can rename them using the same method as renaming a table. + +## Samples + +Create a state aggregate from the table `states_test`. The time column is named `time`, and the `state` column contains +text values corresponding to different states of a system. Use `into_values` to display the data from the state +aggregate. 
+ +```sql +SELECT state, duration FROM toolkit_experimental.into_values( + (SELECT toolkit_experimental.compact_state_agg(time, state) FROM states_test) +); +``` + +Returns: + +``` +state | duration +------+---------- +ERROR | 00:00:03 +OK | 00:01:46 +START | 00:00:11 +``` + + +The syntax is: + +```sql +into_values( + agg StateAgg +) RETURNS (TEXT, INTERVAL) + +into_int_values( + agg StateAgg +) RETURNS (INT, INTERVAL) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`compact_state_agg`][compact_state_agg] | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| state | TEXT \| BIGINT | A state found in the state aggregate | +| duration | INTERVAL | The total time spent in that state | + +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/compact_state_agg## Arguments diff --git a/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/rollup.mdx b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/rollup.mdx new file mode 100644 index 0000000..bde8ccb --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/compact_state_agg/rollup.mdx @@ -0,0 +1,55 @@ +--- +title: rollup() +description: Combine multiple state aggregates +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: rollup + aggregates: + - compact_state_agg() +--- + + Early access 1.13.0 + +Combine multiple state aggregates into a single state aggregate. For example, you can use `rollup` to combine state +aggregates from 15-minute buckets into daily buckets. + +## Samples + +Combine multiple state aggregates and calculate the duration spent in the `START` state. + +```sql +WITH buckets AS (SELECT + time_bucket('1 minute', ts) as dt, + toolkit_experimental.compact_state_agg(ts, state) AS sa +FROM states_test +GROUP BY time_bucket('1 minute', ts)) +SELECT toolkit_experimental.duration_in( + 'START', + toolkit_experimental.rollup(buckets.sa) +) +FROM buckets; +``` + +## Arguments + +The syntax is: + +```sql +rollup( + agg StateAgg +) RETURNS StateAgg +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | State aggregates created using `compact_state_agg` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | StateAgg | A new state aggregate that combines the input state aggregates | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/dead_ranges.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/dead_ranges.mdx new file mode 100644 index 0000000..3b0f7ac --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/dead_ranges.mdx @@ -0,0 +1,61 @@ +--- +title: dead_ranges() +description: Get the down intervals from a heartbeat_agg +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Return a set of (start_time, end_time) pairs representing when the underlying system did not have a valid heartbeat +during the interval of the aggregate. 
+ +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use the following to get the intervals where the system was down during the week of Jan 9, 2022. + +```sql +SELECT dead_ranges(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + dead_ranges +----------------------------------------------------- +("2022-01-09 00:00:00+00","2022-01-09 00:00:30+00") +("2022-01-12 15:27:22+00","2022-01-12 15:31:17+00") +``` + +## Arguments + +The syntax is: + +```sql +dead_ranges( + agg HEARTBEATAGG +) RETURNS TABLE ( + start TIMESTAMPTZ, + end TIMESTAMPTZ +) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| dead_ranges | TABLE (start TIMESTAMPTZ, end TIMESTAMPTZ) | The (start, end) pairs of when the system was down. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/downtime.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/downtime.mdx new file mode 100644 index 0000000..fa414b4 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/downtime.mdx @@ -0,0 +1,63 @@ +--- +title: downtime() +description: Get the total time dead during a heartbeat aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Sum all the ranges where the system did not have a recent enough heartbeat from a heartbeat aggregate. + +There may appear to be some downtime between the start of the aggregate and the first heartbeat. If there is a heartbeat +aggregate covering the previous period, you can use its last heartbeat to correct for this using +[`interpolated_downtime()`][interpolated_downtime]. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use the following to get the total downtime of the system during the week of Jan 9, 2022. + +```sql +SELECT downtime(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` +downtime +-------- +00:04:25 +``` + + + +## Arguments + +The syntax is: + +```sql +downtime( + agg HEARTBEATAGG +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| downtime | INTERVAL | The sum of all the dead ranges in the aggregate. 
| + +[interpolated_downtime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/interpolated_downtime diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/heartbeat_agg.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/heartbeat_agg.mdx new file mode 100644 index 0000000..23bd648 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/heartbeat_agg.mdx @@ -0,0 +1,57 @@ +--- +title: heartbeat_agg() +description: Create a liveness aggregate from a set of heartbeats +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: aggregate + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Aggregate a set of heartbeat timestamps to track the liveness state of the underlying system for the specified time +range. + +## Samples + +Given a table called `system_health` with a `ping_time` column, construct an aggregate of system liveness for 10 days +starting from Jan 1, 2022. This assumes a system is unhealthy if it hasn't been heard from in a 5 minute window. + +```sql +SELECT heartbeat_agg( + ping_time, + '01-01-2022 UTC', + '10 days', + '5 min') +FROM system_health; +``` +## Arguments + +The syntax is: + +```sql +heartbeat_agg( + heartbeat TIMESTAMPTZ, + agg_start TIMESTAMPTZ, + agg_duration INTERVAL, + heartbeat_liveness INTERVAL +) RETURNS HeartbeatAgg +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| heartbeat | TIMESTAMPTZ | - | ✔ | The column containing the timestamps of the heartbeats | +| agg_start | TIMESTAMPTZ | - | ✔ | The start of the time range over which this aggregate is tracking liveness | +| agg_duration | INTERVAL | - | ✔ | The length of the time range over which this aggregate is tracking liveness. Any point in this range that doesn't closely follow a heartbeat is considered to be dead | +| heartbeat_liveness | INTERVAL | - | ✔ | How long the system is considered to be live after each heartbeat | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| heartbeat_agg | HeartbeatAgg | The liveness data for the heartbeated system over the provided interval. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/index.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/index.mdx new file mode 100644 index 0000000..7162cae --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/index.mdx @@ -0,0 +1,76 @@ +--- +title: Heartbeat aggregation overview +sidebarTitle: Overview +description: Determine system liveness from timestamped heartbeats with heartbeat_agg functions +--- + + Since 1.15.0 + +Determine the overall liveness of a system from a series of timestamped heartbeats and a liveness interval. This +aggregate can be used to report total uptime or downtime as well as report the time ranges where the system was live or +dead. + +You can combine multiple heartbeat aggregates to determine the overall health of a service. For example, the heartbeat +aggregates from a primary and standby server could be combined to see if there was ever a window where both machines +were down at the same time. + +## Two-step aggregation + +This group of functions uses the two-step aggregation pattern. + +Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate +function. 
+ +Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple +intermediate aggregates with the rollup functions. + +The two-step aggregation pattern has several advantages: + +1. More efficient because multiple accessors can reuse the same aggregate +1. Easier to reason about performance, because aggregation is separate from final computation +1. Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and +[continuous aggregates][caggs] +1. Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores +extra information not available in the final result + +To learn more, see the [blog post on two-step aggregates][blog-two-step-aggregates]. + +## Functions in this group + +### Aggregate +- [`heartbeat_agg()`][heartbeat_agg]: aggregate heartbeat data into an intermediate form for further computation + +### Accessors +- [`uptime()`][uptime]: get the total uptime from the aggregate +- [`downtime()`][downtime]: get the total downtime from the aggregate +- [`interpolated_uptime()`][interpolated_uptime]: get the total uptime, interpolating values at the boundary +- [`interpolated_downtime()`][interpolated_downtime]: get the total downtime, interpolating values at the boundary +- [`live_at()`][live_at]: determine if the system was live at a given time +- [`live_ranges()`][live_ranges]: get all time ranges when the system was live +- [`dead_ranges()`][dead_ranges]: get all time ranges when the system was dead +- [`num_live_ranges()`][num_live_ranges]: count the number of live ranges +- [`num_gaps()`][num_gaps]: count the number of gaps (downtime periods) +- [`trim_to()`][trim_to]: trim the aggregate to a specific time range + +### Mutator +- [`interpolate()`][interpolate]: interpolate the state at interval boundaries + +### Rollup +- [`rollup()`][rollup]: combine multiple intermediate aggregates + +[two-step-aggregation]: #two-step-aggregation +[blog-two-step-aggregates]: https://www.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design +[caggs]: /use-timescale/continuous-aggregates/about-continuous-aggregates/ +[heartbeat_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/heartbeat_agg +[uptime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/uptime +[downtime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/downtime +[interpolated_uptime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/interpolated_uptime +[interpolated_downtime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/interpolated_downtime +[live_at]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/live_at +[live_ranges]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/live_ranges +[dead_ranges]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/dead_ranges +[num_live_ranges]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/num_live_ranges +[num_gaps]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/num_gaps +[trim_to]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/trim_to +[interpolate]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/interpolate +[rollup]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/rollup diff --git 
a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolate.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolate.mdx new file mode 100644 index 0000000..d90c2ec --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolate.mdx @@ -0,0 +1,66 @@ +--- +title: interpolate() +description: Adjust a heartbeat aggregate with predecessor information +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: mutator + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Update a heartbeat aggregate to include any live ranges that should have been carried over from the last heartbeat in +the predecessor, even if there aren't heartbeats for that range in the interval covered by this aggregate. Return the +updated aggregate, which can then be used with any of the heartbeat aggregate accessors. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use the following to get the intervals where the system was unhealthy during the week of Jan 9, 2022. This correctly +excludes any ranges covered by a heartbeat at the end of the Jan 2 week. + +```sql +SELECT dead_ranges( + interpolate( + health, + LAG(health) OVER (ORDER BY date) + ) +) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + dead_ranges +----------------------------------------------------- +("2022-01-12 15:27:22+00","2022-01-12 15:31:17+00") +``` + +## Arguments + +The syntax is: + +```sql +interpolate( + agg HEARTBEATAGG, + pred HEARTBEATAGG +) RETURNS HEARTBEATAGG +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate containing liveness data for a particular interval | +| pred | HeartbeatAgg | - | | The heartbeat aggregate for the preceding interval, if one exists | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolate | HeartbeatAgg | A copy of `agg` which has been update to include any heartbeat intervals extending past the end of `pred`. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_downtime.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_downtime.mdx new file mode 100644 index 0000000..c3ce947 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_downtime.mdx @@ -0,0 +1,64 @@ +--- +title: interpolated_downtime() +description: Get the total time dead from a heartbeat aggregate and predecessor +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Calculate downtime from a heartbeat aggregate, taking the heartbeat aggregate from the preceding interval to interpolate +values at the boundary. This checks when the last heartbeat in the predecessor was received and makes sure not to +consider the heartbeat interval after that time as unhealthy, even if it extends into the current aggregate prior to the +first heartbeat. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this command to get the total interpolated downtime of the system during the week of Jan 9, 2022. 
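+
+If you don't already have a table like this, the following sketch shows one way to build it from a raw heartbeat table such as the `system_health` table used in the `heartbeat_agg()` samples. The table and column names, and the 1-minute liveness window, are assumptions for illustration only:
+
+```sql
+-- Sketch: materialize one heartbeat aggregate per week into a liveness table.
+-- Assumes a system_health table with a ping_time column; adjust the liveness window as needed.
+CREATE TABLE liveness AS
+SELECT
+    time_bucket('1 week', ping_time) AS date,
+    heartbeat_agg(
+        ping_time,
+        time_bucket('1 week', ping_time),
+        '1 week',
+        '1 min'
+    ) AS health
+FROM system_health
+GROUP BY time_bucket('1 week', ping_time);
+```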
+ +```sql +SELECT interpolated_downtime( + health, + LAG(health) OVER (ORDER BY date) +) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` +interpolated_downtime +--------------------- + 00:03:55 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_downtime( + agg HEARTBEATAGG, + pred HEARTBEATAGG +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | +| pred | HeartbeatAgg | - | | The heartbeat aggregate for the interval before the one being measured, if one exists | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_downtime | INTERVAL | The sum of all the unhealthy ranges in the aggregate, excluding those covered by the last heartbeat of the previous interval. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_uptime.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_uptime.mdx new file mode 100644 index 0000000..702e766 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/interpolated_uptime.mdx @@ -0,0 +1,64 @@ +--- +title: interpolated_uptime() +description: Get the total time live from a heartbeat aggregate and predecessor +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Calculate uptime from a heartbeat aggregate, taking the heartbeat aggregate from the preceding interval to interpolate +values at the boundary. This checks when the last heartbeat in the predecessor was received and makes sure that the +entire heartbeat interval after that is considered live. This addresses the issue where `uptime` would consider the +interval between the start of the interval and the first heartbeat as dead. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this command to get the total interpolated uptime of the system during the week of Jan 9, 2022. + +```sql +SELECT interpolated_uptime( + health, + LAG(health) OVER (ORDER BY date) +) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` +interpolated_uptime +------------------- + 6 days 23:56:05 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_uptime( + agg HEARTBEATAGG, + pred HEARTBEATAGG +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | +| pred | HeartbeatAgg | - | | The heartbeat aggregate for the interval before the one being measured, if one exists | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_uptime | INTERVAL | The sum of all the live ranges in the aggregate, including those covered by the last heartbeat of the previous interval. 
| + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_at.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_at.mdx new file mode 100644 index 0000000..3581290 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_at.mdx @@ -0,0 +1,60 @@ +--- +title: live_at() +description: Test if the aggregate has a heartbeat covering a given time +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Determine whether the aggregate has a heartbeat indicating the system was live at a given time. + +Note that this returns false for any time not covered by the aggregate. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use the following to see if the system was live at a particular time. + +```sql +SELECT live_at(health, '2022-01-12 15:30:00+00') +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + live_at +--------- + f +``` + +## Arguments + +The syntax is: + +```sql +live_at( + agg HEARTBEATAGG, + test TIMESTAMPTZ +) RETURNS BOOL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | +| test | TimestampTz | - | ✔ | The time to test the liveness of | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| live_at | bool | True if the heartbeat aggregate had a heartbeat close before the test time. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_ranges.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_ranges.mdx new file mode 100644 index 0000000..bb6ce82 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/live_ranges.mdx @@ -0,0 +1,61 @@ +--- +title: live_ranges() +description: Get the live intervals from a heartbeat_agg +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Return a set of (start_time, end_time) pairs representing when the underlying system was live during the interval of the +aggregate. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use the following to get the intervals where the system was live during the week of Jan 9, 2022. + +```sql +SELECT live_ranges(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + live_ranges +----------------------------------------------------- +("2022-01-09 00:00:30+00","2022-01-12 15:27:22+00") +("2022-01-12 15:31:17+00","2022-01-16 00:00:00+00") +``` + +## Arguments + +The syntax is: + +```sql +live_ranges( + agg HEARTBEATAGG +) RETURNS TABLE ( + start TIMESTAMPTZ, + end TIMESTAMPTZ +) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| live_ranges | TABLE (start TIMESTAMPTZ, end TIMESTAMPTZ) | The (start, end) pairs of when the system was live. 
| + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_gaps.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_gaps.mdx new file mode 100644 index 0000000..4486af8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_gaps.mdx @@ -0,0 +1,57 @@ +--- +title: num_gaps() +description: Count the number of gaps between live ranges +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.16.0 + +Return the number of gaps between the periods of liveness. Additionally, if the aggregate is not live at the start or +end of its covered interval, these are also considered gaps. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this query to see how many times the system was down in a particular week: + +```sql +SELECT num_gaps(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + num_gaps +--------- + 4 +``` + +## Arguments + +The syntax is: + +```sql +num_gaps( + agg HEARTBEATAGG +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the number of gaps from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_gaps | bigint | The number of gaps in the aggregate. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_live_ranges.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_live_ranges.mdx new file mode 100644 index 0000000..14908e2 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/num_live_ranges.mdx @@ -0,0 +1,56 @@ +--- +title: num_live_ranges() +description: Count the number of live ranges +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.16.0 + +Return the number of live periods from a heartbeat aggregate. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this query to see how many intervals the system was up in a given week: + +```sql +SELECT num_live_ranges(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + num_live_ranges +--------- + 5 +``` + +## Arguments + +The syntax is: + +```sql +num_live_ranges( + agg HEARTBEATAGG +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the number of ranges from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_live_ranges | bigint | The number of live ranges in the aggregate. 
| + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/rollup.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/rollup.mdx new file mode 100644 index 0000000..fcf252d --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/rollup.mdx @@ -0,0 +1,42 @@ +--- +title: rollup() +description: Combine multiple heartbeat aggregates +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: rollup + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Combine multiple heartbeat aggregates into one. This can be used to combine aggregates into adjacent intervals into one +larger interval, such as rolling daily aggregates into a weekly or monthly aggregate. + +Another use for this is to combine heartbeat aggregates for redundant systems to determine if there were any overlapping +failures. For instance, a master and standby system can have their heartbeats combined to see if there were any +intervals where both systems were down at the same time. The result of rolling overlapping heartbeats together like this +is a heartbeat aggregate which considers a time live if any of its component aggregates were live. + +## Arguments + +The syntax is: + +```sql +rollup( + heartbeatagg HEARTBEATAGG +) RETURNS HEARTBEATAGG +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| heartbeatagg | HeartbeatAgg | - | ✔ | The heartbeat aggregates to roll up | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | HeartbeatAgg | A heartbeat aggregate covering the interval from the earliest start time of its component aggregates to the latest end time. It combines the live ranges of all the components | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/trim_to.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/trim_to.mdx new file mode 100644 index 0000000..d7e4cb9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/trim_to.mdx @@ -0,0 +1,53 @@ +--- +title: trim_to() +description: Reduce the covered interval of a heartbeat aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.16.0 + +Reduce the time range covered by a heartbeat aggregate. This can only be used to narrow the covered interval, passing +arguments that would extend beyond the range covered by the initial aggregate gives an error. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this query to roll up several weeks and trim the result to an exact month: + +```sql +SELECT trim_to(rollup(health), '03-1-2022 UTC', '1 month') +FROM liveness +WHERE date > '02-21-2022 UTC' AND date < '3-7-2022 UTC' +``` + +## Arguments + +The syntax is: + +```sql +trim_to( + agg HEARTBEATAGG, + start TIMESTAMPTZ, + duration INTERVAL +) RETURNS HEARTBEATAGG +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to trim down | +| start | TimestampTz | - | | The start of the trimmed range. 
If not provided, the returned heartbeat aggregate starts from the same time as the starting one | +| duration | Interval | - | | How long the resulting aggregate should cover. If not provided, the returned heartbeat aggregate ends at the same time as the starting one | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| trim_to | heartbeat_agg | The trimmed aggregate. | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/uptime.mdx b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/uptime.mdx new file mode 100644 index 0000000..7feb6e4 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/heartbeat_agg/uptime.mdx @@ -0,0 +1,61 @@ +--- +title: uptime() +description: Get the total time live during a heartbeat aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - heartbeat_agg() +--- + + Since 1.15.0 + +Sum all the ranges where the system was live and return the total from a heartbeat aggregate. + +There may appear to be some downtime between the start of the aggregate and the first heartbeat. If there is a heartbeat +aggregate covering the previous period, you can use its last heartbeat to correct for this using +[`interpolated_uptime()`][interpolated_uptime]. + +## Samples + +Given a table called `liveness` containing weekly heartbeat aggregates in column `health` with timestamp column `date`, +use this command to get the total uptime of the system during the week of Jan 9, 2022. + +```sql +SELECT uptime(health) +FROM liveness +WHERE date = '01-9-2022 UTC' +``` + +Returns: + +``` + uptime +----------------- +6 days 23:55:35 +``` + +## Arguments + +The syntax is: + +```sql +uptime( + agg HEARTBEATAGG +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | HeartbeatAgg | - | ✔ | A heartbeat aggregate to get the liveness data from | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| uptime | INTERVAL | The sum of all the live ranges in the aggregate. | + +[interpolated_uptime]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/interpolated_uptime diff --git a/api-reference/timescaledb-toolkit/state-tracking/index.mdx b/api-reference/timescaledb-toolkit/state-tracking/index.mdx new file mode 100644 index 0000000..61baafb --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/index.mdx @@ -0,0 +1,106 @@ +--- +title: State tracking overview +sidebarTitle: Overview +description: Functions for tracking state transitions and system liveness over time +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Early access 1.5.0 + +Track state transitions and system liveness over time. These functions help you analyze systems that switch between +discrete states, monitor heartbeat signals for liveness detection, and calculate durations spent in different states. 
+ +TimescaleDB Toolkit provides three approaches to state tracking: +- **compact_state_agg**: Track time spent in each state with minimal memory usage +- **state_agg**: Track state transitions with full timestamp information +- **heartbeat_agg**: Monitor system liveness based on heartbeat signals + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Track time in different states + +Use compact_state_agg to efficiently track how much time a system spends in each state: + +```sql +SELECT + device_id, + toolkit_experimental.compact_state_agg(time, status) AS state_data +FROM devices +GROUP BY device_id; +``` + +Query the duration spent in each state: + +```sql +WITH state_data AS ( + SELECT + device_id, + toolkit_experimental.compact_state_agg(time, status) AS agg + FROM devices + GROUP BY device_id +) +SELECT + device_id, + toolkit_experimental.duration_in('running', agg) AS running_time, + toolkit_experimental.duration_in('error', agg) AS error_time +FROM state_data; +``` + +### Analyze state transitions + +Use state_agg to track when state transitions occur: + +```sql +WITH state_data AS ( + SELECT state_agg(time, status) AS agg + FROM devices + WHERE device_id = 'device_1' +) +SELECT * +FROM unnest((SELECT state_timeline(agg) FROM state_data)); +``` + +### Monitor system liveness + +Use heartbeat_agg to track uptime and downtime: + +```sql +WITH heartbeats AS ( + SELECT heartbeat_agg( + ping_time, + '2024-01-01'::timestamptz, + '7 days'::interval, + '5 minutes'::interval + ) AS agg + FROM system_health +) +SELECT + uptime(agg) AS total_uptime, + downtime(agg) AS total_downtime +FROM heartbeats; +``` + +## Available functions + +### Compact state aggregation +- [`compact_state_agg()`][compact_state_agg]: track time spent in states with minimal memory usage + +### State aggregation with transitions +- [`state_agg()`][state_agg]: track state transitions with full timestamp information + +### Heartbeat monitoring +- [`heartbeat_agg()`][heartbeat_agg]: monitor system liveness based on heartbeat signals + +[two-step-aggregation]: #two-step-aggregation +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/index +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/index +[heartbeat_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/heartbeat_agg/index diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/duration_in.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/duration_in.mdx new file mode 100644 index 0000000..a2174fe --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/duration_in.mdx @@ -0,0 +1,78 @@ +--- +title: duration_in() +description: Calculate the total time spent in a given state from a state aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Calculate the total time spent in a state from a state aggregate. If you need to interpolate missing values across time +bucket boundaries, use [`interpolated_duration_in`][interpolated_duration_in]. + +## Samples + +Create a test table that tracks when a system switches between `starting`, `running`, and `error` states. Query the +table for the time spent in the `running` state. 
+ +If you prefer to see the result in seconds, [`EXTRACT`](https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) the epoch from the returned result. + +```sql +SET timezone TO 'UTC'; +CREATE TABLE states(time TIMESTAMPTZ, state TEXT); +INSERT INTO states VALUES + ('1-1-2020 10:00', 'starting'), + ('1-1-2020 10:30', 'running'), + ('1-3-2020 16:00', 'error'), + ('1-3-2020 18:30', 'starting'), + ('1-3-2020 19:30', 'running'), + ('1-5-2020 12:00', 'stopping'); + +SELECT duration_in( + state_agg(time, state), + 'running' +) FROM states; +``` + +Returns: + +``` +duration_in +--------------- +3 days 22:00:00 +``` + +## Arguments + +The syntax is: + +```sql +duration_in( + agg StateAgg, + state {TEXT | BIGINT} + [, start TIMESTAMPTZ] + [, interval INTERVAL] +) RETURNS INTERVAL +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | +| state | TEXT \| BIGINT | - | ✔ | The state to query | +| start | TIMESTAMPTZ | - | | If specified, only the time in the state after this time is returned | +| interval | INTERVAL | - | | If specified, only the time in the state from the start time to the end of the interval is returned | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| duration_in | INTERVAL | The time spent in the given state. Displayed in `days`, `hh:mm:ss`, or a combination of the two. | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg +[interpolated_duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_duration_in diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/index.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/index.mdx new file mode 100644 index 0000000..5a16704 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/index.mdx @@ -0,0 +1,75 @@ +--- +title: State aggregation overview +sidebarTitle: Overview +description: Track transitions between discrete states with state_agg functions +--- + + Since 1.15.0 + +Track transitions between discrete states for a system or value that switches between them. For example, use `state_agg` +to create a timeline of state transitions, or to calculate the durations of states. `state_agg` extends the capabilities +of [`compact_state_agg`][compact_state_agg]. + +`state_agg` is designed to work with a relatively small number of states. It might not perform well on datasets where +states are mostly distinct between rows. + +Because `state_agg` tracks more information, it uses more memory than `compact_state_agg`. If you want to minimize +memory use and don't need to query the timestamps of state transitions, consider using +[`compact_state_agg`][compact_state_agg] instead. + +## Two-step aggregation + +This group of functions uses the two-step aggregation pattern. + +Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate +function. + +Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple +intermediate aggregates with the rollup functions. + +The two-step aggregation pattern has several advantages: + +1. More efficient because multiple accessors can reuse the same aggregate +1. Easier to reason about performance, because aggregation is separate from final computation +1. 
Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and +[continuous aggregates][caggs] +1. Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores +extra information not available in the final result + +To learn more, see the [blog post on two-step aggregates][blog-two-step-aggregates]. + +## Functions in this group + +### Aggregate +- [`state_agg()`][state_agg]: aggregate state data into an intermediate form for further computation + +### Accessors +- [`state_at()`][state_at]: get the state at a given time +- [`duration_in()`][duration_in]: get the total duration in the specified states +- [`interpolated_duration_in()`][interpolated_duration_in]: get the total duration in the specified states, + interpolating values at the boundary +- [`state_periods()`][state_periods]: get an array of periods for each state +- [`state_timeline()`][state_timeline]: get a timeline of state changes +- [`interpolated_state_periods()`][interpolated_state_periods]: get an array of periods for each state, interpolating + values at the boundary +- [`interpolated_state_timeline()`][interpolated_state_timeline]: get a timeline of state changes, interpolating values + at the boundary +- [`into_values()`][into_values]: return an array of `(state, duration)` pairs from the aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple intermediate aggregates + +[two-step-aggregation]: #two-step-aggregation +[blog-two-step-aggregates]: https://www.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design +[caggs]: /use-timescale/continuous-aggregates/about-continuous-aggregates/ +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/ +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg +[state_at]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_at +[duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/duration_in +[interpolated_duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_duration_in +[state_periods]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_periods +[state_timeline]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_timeline +[interpolated_state_periods]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_state_periods +[interpolated_state_timeline]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_state_timeline +[into_values]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/into_values +[rollup]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/rollup diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_duration_in.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_duration_in.mdx new file mode 100644 index 0000000..9832ffa --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_duration_in.mdx @@ -0,0 +1,87 @@ +--- +title: interpolated_duration_in() +description: Calculate the total time spent in a given state from a state aggregate, interpolating values at time bucket boundaries +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - 
state_agg() +--- + + Since 1.15.0 + +Calculate the total duration in a given state. Unlike [`duration_in`][duration_in], you can use this function across +multiple state aggregates that cover multiple time buckets. Any missing values at the time bucket boundaries are +interpolated from adjacent state aggregates. + +## Samples + +Create a test table that tracks when a system switches between `starting`, `running`, and `error` states. Query the +table for the time spent in the `running` state. Use `LAG` and `LEAD` to get the neighboring aggregates for +interpolation. + +If you prefer to see the result in seconds, [`EXTRACT`](https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) the epoch from the returned result. + +```sql +SELECT + time, + interpolated_duration_in( + agg, + 'running', + time, + '1 day', + LAG(agg) OVER (ORDER BY time) +) FROM ( + SELECT + time_bucket('1 day', time) as time, + state_agg(time, state) as agg + FROM + states + GROUP BY time_bucket('1 day', time) +) s; +``` + +Returns: + +``` +time | interpolated_duration_in +------------------------+-------------------------- +2020-01-01 00:00:00+00 | 13:30:00 +2020-01-02 00:00:00+00 | 16:00:00 +2020-01-03 00:00:00+00 | 04:30:00 +2020-01-04 00:00:00+00 | 12:00:00 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_duration_in( + agg StateAgg, + state {TEXT | BIGINT}, + start TIMESTAMPTZ, + interval INTERVAL + [, prev StateAgg] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | +| state | TEXT \| BIGINT | - | ✔ | The state to query | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval to be calculated | +| interval | INTERVAL | - | ✔ | The length of the interval to be calculated | +| prev | StateAgg | - | | The state aggregate from the prior interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `aggregate` is used as the start of the interval | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| interpolated_duration_in | INTERVAL | The total time spent in the queried state. Displayed as `days`, `hh:mm:ss`, or a combination of the two. | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg +[duration_in]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/duration_in diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_periods.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_periods.mdx new file mode 100644 index 0000000..5a28dfd --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_periods.mdx @@ -0,0 +1,87 @@ +--- +title: interpolated_state_periods() +description: Get the time periods corresponding to a given state from a state aggregate, interpolating values at time bucket boundaries +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +List the periods when the system is in a specific state from a state aggregate. Periods are defined by the start time +and end time. + +Unlike [`state_periods`][state_periods], you can use this function across multiple state aggregates that cover different +time buckets. 
Any missing values at the time bucket boundaries are interpolated from adjacent state aggregates. + +## Samples + +Given state aggregates bucketed by 1-minute intervals, interpolate the states at the bucket boundaries and list all time +periods corresponding to the state `OK`. + +To perform the interpolation, the `LAG` and `LEAD` functions are used to get the previous and next state aggregates. + +```sql +SELECT + bucket, + (interpolated_state_periods( + summary, + 'OK', + bucket, + '15 min', + LAG(summary) OVER (ORDER by bucket) + )).* +FROM ( + SELECT + time_bucket('1 min'::interval, ts) AS bucket, + state_agg(ts, state) AS summary + FROM states_test + GROUP BY time_bucket('1 min'::interval, ts) +) t; +``` + +Returns: + +``` + bucket | start_time | end_time +------------------------+------------------------+------------------------ + 2020-01-01 00:00:00+00 | 2020-01-01 00:00:11+00 | 2020-01-01 00:15:00+00 + 2020-01-01 00:01:00+00 | 2020-01-01 00:01:03+00 | 2020-01-01 00:16:00+00 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_state_periods( + agg StateAgg, + state [TEXT | BIGINT], + start TIMESTAMPTZ, + interval INTERVAL, + [, prev StateAgg] +) RETURNS (TIMESTAMPTZ, TIMESTAMPTZ) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | +| state | TEXT \| BIGINT | - | ✔ | The state to query | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval to be calculated | +| interval | INTERVAL | - | ✔ | The length of the interval to be calculated | +| prev | StateAgg | - | | The state aggregate from the prior interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `aggregate` is used as the start of the interval | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| start_time | TIMESTAMPTZ | The time when the state started (inclusive) | +| end_time | TIMESTAMPTZ | The time when the state ended (exclusive) | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg +[state_periods]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_periods diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_timeline.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_timeline.mdx new file mode 100644 index 0000000..9dd2c1f --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/interpolated_state_timeline.mdx @@ -0,0 +1,95 @@ +--- +title: interpolated_state_timeline() +description: Get a timeline of all states from a state aggregate, interpolating values at time bucket boundaries +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Get a timeline of all states, showing each time a state is entered and exited. + +Unlike [`state_timeline`][state_timeline], you can use this function across multiple state aggregates that cover +different time buckets. Any missing values at the time bucket boundaries are interpolated from adjacent state +aggregates. + +## Samples + +Given state aggregates bucketed by 1-minute intervals, interpolate the states at the bucket boundaries and get the +history of all states. 
+ +To perform the interpolation, the `LAG` and `LEAD` functions are used to get the previous and next state aggregates. + +```sql +SELECT + bucket, + (interpolated_state_timeline( + summary, + bucket, + '15 min', + LAG(summary) OVER (ORDER by bucket) + )).* +FROM ( + SELECT + time_bucket('1 min'::interval, ts) AS bucket, + state_agg(ts, state) AS summary + FROM states_test + GROUP BY time_bucket('1 min'::interval, ts) +) t; +``` + +Returns: + +``` + bucket | state | start_time | end_time +------------------------+-------+------------------------+------------------------ + 2020-01-01 00:00:00+00 | START | 2020-01-01 00:00:00+00 | 2020-01-01 00:00:11+00 + 2020-01-01 00:00:00+00 | OK | 2020-01-01 00:00:11+00 | 2020-01-01 00:15:00+00 + 2020-01-01 00:01:00+00 | ERROR | 2020-01-01 00:01:00+00 | 2020-01-01 00:01:03+00 + 2020-01-01 00:01:00+00 | OK | 2020-01-01 00:01:03+00 | 2020-01-01 00:16:00+00 + 2020-01-01 00:02:00+00 | STOP | 2020-01-01 00:02:00+00 | 2020-01-01 00:17:00+00 +``` + +## Arguments + +The syntax is: + +```sql +interpolated_state_timeline( + agg StateAgg, + start TIMESTAMPTZ, + interval INTERVAL, + [, prev StateAgg] +) RETURNS (TIMESTAMPTZ, TIMESTAMPTZ) + +interpolated_state_int_timeline( + agg StateAgg, + start TIMESTAMPTZ, + interval INTERVAL, + [, prev StateAgg] +) RETURNS (TIMESTAMPTZ, TIMESTAMPTZ) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval to be calculated | +| interval | INTERVAL | - | ✔ | The length of the interval to be calculated | +| prev | StateAgg | - | | The state aggregate from the prior interval, used to interpolate the value at `start`. If `NULL`, the first timestamp in `aggregate` is used as the start of the interval | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| state | TEXT | BIGINT | A state found in the state aggregate | +| start_time | TIMESTAMPTZ | The time when the state started (inclusive) | +| end_time | TIMESTAMPTZ | The time when the state ended (exclusive) | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg +[state_timeline]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_timeline diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/into_values.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/into_values.mdx new file mode 100644 index 0000000..442e5ba --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/into_values.mdx @@ -0,0 +1,65 @@ +--- +title: into_values() +description: Expand the state aggregate into a set of rows, displaying the duration of each state +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Unpack the state aggregate into a set of rows with two columns, displaying the duration of each state. By default, the +columns are named `state` and `duration`. You can rename them using the same method as renaming a table. + +## Samples + +Create a state aggregate from the table `states_test`. The time column is named `time`, and the `state` column contains +text values corresponding to different states of a system. Use `into_values` to display the data from the state +aggregate. 
+ +```sql +SELECT state, duration FROM into_values( + (SELECT state_agg(time, state) FROM states_test) +); +``` + +Returns: + +``` +state | duration +------+---------- +ERROR | 00:00:03 +OK | 00:01:46 +START | 00:00:11 +``` + + +The syntax is: + +```sql +into_values( + agg StateAgg +) RETURNS (TEXT, INTERVAL) + +into_int_values( + agg StateAgg +) RETURNS (INT, INTERVAL) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| state | TEXT | BIGINT | A state found in the state aggregate | +| duration | INTERVAL | The total time spent in that state | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg## Arguments diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/rollup.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/rollup.mdx new file mode 100644 index 0000000..f91e467 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/rollup.mdx @@ -0,0 +1,55 @@ +--- +title: rollup() +description: Combine multiple state aggregates +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: rollup + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Combine multiple state aggregates into a single state aggregate. For example, you can use `rollup` to combine state +aggregates from 15-minute buckets into daily buckets. + +## Samples + +Combine multiple state aggregates and calculate the duration spent in the `START` state. + +```sql +WITH buckets AS (SELECT + time_bucket('1 minute', ts) as dt, + state_agg(ts, state) AS sa +FROM states_test +GROUP BY time_bucket('1 minute', ts)) +SELECT duration_in( + 'START', + rollup(buckets.sa) +) +FROM buckets; +``` + +## Arguments + +The syntax is: + +```sql +rollup( + agg StateAgg +) RETURNS StateAgg +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | State aggregates created using `state_agg` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | StateAgg | A new state aggregate that combines the input state aggregates | + diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_agg.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_agg.mdx new file mode 100644 index 0000000..5612be4 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_agg.mdx @@ -0,0 +1,49 @@ +--- +title: state_agg() +description: Aggregate state data into a state aggregate for further analysis +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: aggregate + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Aggregate state data into a state aggregate to track state transitions. Unlike [`compact_state_agg`][compact_state_agg], +which only stores durations, `state_agg` also stores the timestamps of state transitions. + +## Samples + +Create a state aggregate to track the status of some devices. 
+ +```sql +SELECT state_agg(time, status) FROM devices; +``` + +## Arguments + +The syntax is: + +```sql +state_agg( + ts TIMESTAMPTZ, + value {TEXT | BIGINT} +) RETURNS StateAgg +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| ts | TIMESTAMPTZ | - | ✔ | Timestamps associated with each state reading | +| value | TEXT \| BIGINT | - | ✔ | The state at that time | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| agg | StateAgg | An object storing the periods spent in each state, including timestamps of state transitions | + +[compact_state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/compact_state_agg/ diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_at.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_at.mdx new file mode 100644 index 0000000..19bfb4a --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_at.mdx @@ -0,0 +1,64 @@ +--- +title: state_at() +description: Determine the state at a given time +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +Determine the state at a given time from a state aggregate. + +## Samples + +Create a state aggregate and determine the state at a particular time. + +```sql +SELECT state_at( + (SELECT state_agg(ts, state) FROM states_test), + '2020-01-01 00:00:05+00' +); +``` + +Returns: + +``` +state_at +---------- +START +``` + +## Arguments + +The syntax is: + +```sql +state_at( + agg StateAgg, + ts TIMESTAMPTZ +) RETURNS TEXT + +state_at_int( + agg StateAgg, + ts TIMESTAMPTZ +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | A state aggregate created with [`state_agg`][state_agg] | +| ts | TIMESTAMPTZ | - | ✔ | The time to get the state at | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| state | TEXT \| BIGINT | The state at the given time. | + +[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_periods.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_periods.mdx new file mode 100644 index 0000000..4c3ff08 --- /dev/null +++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_periods.mdx @@ -0,0 +1,66 @@ +--- +title: state_periods() +description: Get the time periods corresponding to a given state from a state aggregate +license: community +type: function +toolkit: true +topics: [hyperfunctions] +hyperfunction: + family: state tracking + type: accessor + aggregates: + - state_agg() +--- + + Since 1.15.0 + +List the periods when the system is in a specific state from a state aggregate. Periods are defined by the start time +and end time. + +If you have multiple state aggregates and need to interpolate the state across interval boundaries, use +[`interpolated_state_periods`][interpolated_state_periods]. + +## Samples + +Create a state aggregate and list all periods corresponding to the state `OK`. 
+
+```sql
+SELECT start_time, end_time FROM state_periods(
+    (SELECT state_agg(ts, state) FROM states_test),
+    'OK'
+);
+```
+
+Returns:
+
+```
+       start_time       |        end_time
+------------------------+------------------------
+ 2020-01-01 00:00:11+00 | 2020-01-01 00:01:00+00
+ 2020-01-01 00:01:03+00 | 2020-01-01 00:02:00+00
+```
+
+## Arguments
+
+The syntax is:
+
+```sql
+state_periods(
+    agg StateAgg,
+    state {TEXT | BIGINT}
+) RETURNS (TIMESTAMPTZ, TIMESTAMPTZ)
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| agg | StateAgg | - | ✔ | A state aggregate created using [`state_agg`][state_agg] |
+| state | TEXT \| BIGINT | - | ✔ | The target state to get data for |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| start_time | TIMESTAMPTZ | The time when the state started (inclusive) |
+| end_time | TIMESTAMPTZ | The time when the state ended (exclusive) |
+
+[state_agg]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/state_agg
+[interpolated_state_periods]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_state_periods
diff --git a/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_timeline.mdx b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_timeline.mdx
new file mode 100644
index 0000000..111dcf0
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/state-tracking/state_agg/state_timeline.mdx
@@ -0,0 +1,71 @@
+---
+title: state_timeline()
+description: Get a timeline of all states from a state aggregate
+license: community
+type: function
+toolkit: true
+topics: [hyperfunctions]
+hyperfunction:
+  family: state tracking
+  type: accessor
+  aggregates:
+    - state_agg()
+---
+
+ Since 1.15.0 
+
+Get a timeline of all states, showing each time a state is entered and exited.
+
+If you have multiple state aggregates and need to interpolate the state across interval boundaries, use
+[`interpolated_state_timeline`][interpolated_state_timeline].
+
+## Samples
+
+Get the history of states from a state aggregate.
+ +```sql +SELECT state, start_time, end_time + FROM state_timeline( + (SELECT state_agg(ts, state) FROM states_test) + ); +``` + +Returns: + +``` + state | start_time | end_time +-------+------------------------+------------------------ + START | 2020-01-01 00:00:00+00 | 2020-01-01 00:00:11+00 + OK | 2020-01-01 00:00:11+00 | 2020-01-01 00:01:00+00 + ERROR | 2020-01-01 00:01:00+00 | 2020-01-01 00:01:03+00 + OK | 2020-01-01 00:01:03+00 | 2020-01-01 00:02:00+00 + STOP | 2020-01-01 00:02:00+00 | 2020-01-01 00:02:00+00 +``` + + +## Arguments + +The syntax is: + +```sql +state_timeline( + agg StateAgg +) RETURNS (TEXT, TIMESTAMPTZ, TIMESTAMPTZ) + +state_int_timeline( + agg StateAgg +) RETURNS (BIGINT, TIMESTAMPTZ, TIMESTAMPTZ) +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| agg | StateAgg | - | ✔ | The aggregate from which to get a timeline | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| state | TEXT \| BIGINT | A state found in the state aggregate | +| start_time | TIMESTAMPTZ | The time when the state started (inclusive) | +| end_time | TIMESTAMPTZ | The time when the state ended (exclusive) | + +[interpolated_state_timeline]: /api-reference/timescaledb/hyperfunctions/state-tracking/state_agg/interpolated_state_timeline diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/index.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/index.mdx new file mode 100644 index 0000000..1e11074 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/index.mdx @@ -0,0 +1,93 @@ +--- +title: Statistical and regression analysis overview +sidebarTitle: Overview +description: Functions for statistical analysis and linear regression on time-series data +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Perform statistical analysis and linear regression on time-series data. These functions are similar to PostgreSQL +statistical aggregates, but they include more features and are easier to use in continuous aggregates and window +functions. 
+ +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### One-dimensional statistical analysis + +Calculate the average, standard deviation, and skewness of daily temperature readings: + +```sql +WITH daily_stats AS ( + SELECT + time_bucket('1 day'::interval, time) AS day, + stats_agg(temperature) AS stats + FROM weather_data + GROUP BY day +) +SELECT + day, + average(stats) AS avg_temp, + stddev(stats) AS std_dev, + skewness(stats) AS skew +FROM daily_stats +ORDER BY day; +``` + +### Two-dimensional regression analysis + +Calculate the correlation coefficient and linear regression slope between two variables: + +```sql +WITH daily_stats AS ( + SELECT + time_bucket('1 day'::interval, time) AS day, + stats_agg(sales, temperature) AS stats + FROM store_data + GROUP BY day +) +SELECT + day, + corr(stats) AS correlation, + slope(stats) AS regression_slope, + intercept(stats) AS y_intercept +FROM daily_stats +ORDER BY day; +``` + +### Rolling window calculations + +Calculate a 7-day rolling average using the rolling window function: + +```sql +SELECT + time_bucket('1 day'::interval, time) AS day, + average(rolling(stats_agg(temperature)) OVER (ORDER BY time_bucket('1 day'::interval, time) ROWS 6 PRECEDING)) AS rolling_avg_7day +FROM weather_data +GROUP BY day +ORDER BY day; +``` + +## Available functions + +### One-dimensional statistics + +- [`stats_agg() (one variable)`][stats_agg-one-variable]: analyze statistical properties of a single variable + +### Two-dimensional statistics and regression + +- [`stats_agg() (two variables)`][stats_agg-two-variables]: analyze statistical properties and linear regression of two + variables + +[two-step-aggregation]: #two-step-aggregation +[stats_agg-one-variable]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/index +[stats_agg-two-variables]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/index \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/average.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/average.mdx new file mode 100644 index 0000000..2afdd4d --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/average.mdx @@ -0,0 +1,52 @@ +--- +title: average() +description: Calculate the average from a one-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [average, statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Calculate a simple average (or mean) from the values in a statistical aggregate. 
+ +## Samples + +Calculate the average of the integers from 0 to 100: + +```sql +SELECT average(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +``` +average +----------- +50 +``` + +## Arguments + +The syntax is: + +```sql +average( + summary StatsSummary1D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| average | DOUBLE PRECISION | The average of the values in the statistical aggregate | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/index.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/index.mdx new file mode 100644 index 0000000..8aefb05 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/index.mdx @@ -0,0 +1,80 @@ +--- +title: stats_agg (one variable) overview +sidebarTitle: Overview +description: Statistical analysis functions for one-dimensional data +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Perform common statistical analyses, such as calculating averages and standard deviations, using this group of +functions. These functions are similar to the PostgreSQL statistical aggregates, but they include more features and are +easier to use in continuous aggregates and window functions. + +These functions work on one-dimensional data. To work with two-dimensional data, for example to perform linear +regression, see [the two-dimensional `stats_agg` functions][stats_agg-2d]. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Calculate statistical properties + +Create a statistical aggregate to summarize daily statistical data about the variable `val1`. 
Use the statistical +aggregate to calculate average, standard deviation, and skewness of the variable: + +```sql +WITH t AS ( + SELECT + time_bucket('1 day'::interval, ts) AS dt, + stats_agg(val1) AS stats1D + FROM foo + WHERE id = 'bar' + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + average(stats1D), + stddev(stats1D), + skewness(stats1D) +FROM t; +``` + +## Available functions + +### Aggregate +- [`stats_agg()`][stats_agg]: aggregate data into an intermediate statistical aggregate form for further calculation + +### Accessors +- [`average()`][average]: calculate the average from a statistical aggregate +- [`stddev()`][stddev]: calculate the standard deviation from a statistical aggregate +- [`variance()`][variance]: calculate the variance from a statistical aggregate +- [`skewness()`][skewness]: calculate the skewness from a statistical aggregate +- [`kurtosis()`][kurtosis]: calculate the kurtosis from a statistical aggregate +- [`sum()`][sum]: calculate the sum from a statistical aggregate +- [`num_vals()`][num_vals]: get the number of values contained in a statistical aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple one-dimensional statistical aggregates + +### Mutator +- [`rolling()`][rolling]: create a rolling window aggregate for use in window functions + +[two-step-aggregation]: #two-step-aggregation +[stats_agg-2d]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/index +[stats_agg]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/stats_agg +[average]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/average +[stddev]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/stddev +[variance]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/variance +[skewness]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/skewness +[kurtosis]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/kurtosis +[sum]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/sum +[num_vals]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/num_vals +[rollup]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/rollup +[rolling]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/rolling diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/kurtosis.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/kurtosis.mdx new file mode 100644 index 0000000..48081de --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/kurtosis.mdx @@ -0,0 +1,57 @@ +--- +title: kurtosis() +description: Calculate the kurtosis from a one-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [skew] +--- + + Since 1.3.0 + +Calculate the kurtosis from the values in a 
statistical aggregate. The kurtosis is the fourth statistical moment. It is
+a measure of "tailedness" of a data distribution compared to a normal distribution.
+
+## Samples
+
+Calculate the kurtosis of a sample containing the integers from 0 to 100:
+
+```sql
+SELECT kurtosis(stats_agg(data))
+    FROM generate_series(0, 100) data;
+```
+This returns something like:
+```sql
+kurtosis
+----------
+1.78195
+```
+
+## Arguments
+
+The syntax is:
+
+```sql
+kurtosis(
+    summary StatsSummary1D,
+    [ method TEXT ]
+) RETURNS DOUBLE PRECISION
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+| method | TEXT | `sample` | - | The method used for calculating the kurtosis. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| kurtosis | DOUBLE PRECISION | The kurtosis of the values in the statistical aggregate |
+
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/num_vals.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/num_vals.mdx
new file mode 100644
index 0000000..6890c5f
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/num_vals.mdx
@@ -0,0 +1,54 @@
+---
+title: num_vals()
+description: Calculate the number of values in a one-dimensional statistical aggregate
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: accessor
+  aggregates:
+    - stats_agg() (one variable)
+topics: [hyperfunctions]
+keywords: [statistics, hyperfunctions, Toolkit]
+tags: [count, number]
+---
+
+ Since 1.3.0 
+
+Calculate the number of values contained in a statistical aggregate.
+ +## Samples + +Calculate the number of values from 0 to 100, inclusive: + +```sql +SELECT num_vals(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +``` +num_vals +-------- +101 +``` + +## Arguments + +The syntax is: + +```sql +num_vals( + summary StatsSummary1D +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_vals | DOUBLE PRECISION | The number of values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rolling.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rolling.mdx new file mode 100644 index 0000000..002ba29 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rolling.mdx @@ -0,0 +1,69 @@ +--- +title: rolling() +description: Combine multiple one-dimensional statistical aggregates to calculate rolling window aggregates +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: rollup + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Combine multiple intermediate statistical aggregate (`StatsSummary1D`) objects into a single `StatsSummary1D` object. It +is optimized for use in a window function context for computing tumbling window statistical aggregates. + + +This is especially useful for computing tumbling window aggregates from a continuous aggregate. It can be orders of +magnitude faster because it uses inverse transition and combine functions, with the possibility that bigger floating +point errors can occur in unusual scenarios. + +For re-aggregation in a non-window function context, such as combining hourly buckets into daily buckets, see +`rollup()`. + + +## Samples + +Combine hourly continuous aggregates to create a tumbling window daily aggregate. 
Calculate the average and standard
+deviation using the appropriate accessors:
+
+```sql
+CREATE MATERIALIZED VIEW foo_hourly
+WITH (timescaledb.continuous)
+AS SELECT
+    time_bucket('1h'::interval, ts) AS bucket,
+    stats_agg(value) as stats
+FROM foo
+GROUP BY 1;
+
+SELECT
+    bucket,
+    average(rolling(stats) OVER (ORDER BY bucket RANGE '1 day' PRECEDING)),
+    stddev(rolling(stats) OVER (ORDER BY bucket RANGE '1 day' PRECEDING))
+FROM foo_hourly;
+```
+
+## Arguments
+
+The syntax is:
+
+```sql
+rolling(
+    ss StatsSummary1D
+) RETURNS StatsSummary1D
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| rolling | StatsSummary1D | A new statistical aggregate produced by combining the input statistical aggregates |
+
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rollup.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rollup.mdx
new file mode 100644
index 0000000..924881f
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/rollup.mdx
@@ -0,0 +1,41 @@
+---
+title: rollup()
+description: Combine multiple one-dimensional statistical aggregates
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: rollup
+  aggregates:
+    - stats_agg() (one variable)
+topics: [hyperfunctions]
+keywords: [statistics, hyperfunctions, Toolkit]
+---
+
+ Since 1.3.0 
+
+Combine multiple intermediate statistical aggregate (`StatsSummary1D`) objects produced by `stats_agg` (one variable)
+into a single intermediate `StatsSummary1D` object. For example, you can use `rollup` to combine statistical aggregates
+from 15-minute buckets into daily buckets.
+
+For use in window functions, see `rolling()`.
+
+## Arguments
+
+The syntax is:
+
+```sql
+rollup(
+    ss StatsSummary1D
+) RETURNS StatsSummary1D
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| rollup | StatsSummary1D | A new statistical aggregate produced by combining the input statistical aggregates |
\ No newline at end of file
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/skewness.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/skewness.mdx
new file mode 100644
index 0000000..037c299
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/skewness.mdx
@@ -0,0 +1,56 @@
+---
+title: skewness()
+description: Calculate the skewness from a one-dimensional statistical aggregate
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: accessor
+  aggregates:
+    - stats_agg() (one variable)
+topics: [hyperfunctions]
+keywords: [statistics, hyperfunctions, Toolkit]
+---
+
+ Since 1.3.0 
+
+Calculate the skewness from the values in a statistical aggregate. The skewness is the third statistical moment. It is a
+measure of asymmetry in a data distribution.
+ +## Samples + +Calculate the skewness of a sample containing the integers from 0 to 100: + +```sql +SELECT skewness(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +```sql +skewness_x +---------- +0 +``` + +## Arguments + +The syntax is: + +```sql +skewness( + summary StatsSummary1D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the skewness. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| skewness | DOUBLE PRECISION | The skewness of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stats_agg.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stats_agg.mdx new file mode 100644 index 0000000..adad4cf --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stats_agg.mdx @@ -0,0 +1,45 @@ +--- +title: stats_agg() (one variable) +description: Aggregate data into an intermediate statistical aggregate form for further calculation +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: aggregate + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Aggregate data into an intermediate statistical aggregate form for further calculation. This is the first step for +performing any statistical aggregate calculations on one-dimensional data. Use `stats_agg` to create an intermediate +aggregate (`StatsSummary1D`) from your data. This intermediate form can then be used by one or more accessors in this +group to compute final results. Optionally, multiple such intermediate aggregate objects can be combined using +`rollup()` or `rolling()` before an accessor is applied. + +`stats_agg` is well suited for creating a continuous aggregate that can serve multiple purposes later. For example, you +can create a continuous aggregate using `stats_agg` to calculate average and sum. Later, you can reuse the same +`StatsSummary1D` objects to calculate standard deviation from the same continuous aggregate. + +## Arguments + +The syntax is: + +```sql +stats_agg( + value DOUBLE PRECISION +) RETURNS StatsSummary1D +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| value | DOUBLE PRECISION | - | ✔ | The variable to use for the statistical aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| stats_agg | StatsSummary1D | The statistical aggregate, containing data about the variables in an intermediate form. Pass the aggregate to accessor functions in the statistical aggregates API to perform final calculations. 
Or, pass the aggregate to rollup functions to combine multiple statistical aggregates into larger aggregates | diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stddev.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stddev.mdx new file mode 100644 index 0000000..2f4c6b3 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/stddev.mdx @@ -0,0 +1,56 @@ +--- +title: stddev() +description: Calculate the standard deviation from a one-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [standard deviation] +--- + + Since 1.3.0 + +Calculate the standard deviation from the values in a statistical aggregate. + +## Samples + +Calculate the standard deviation of a sample containing the integers from 0 to 100: + +```sql +SELECT stddev(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +``` +stddev_y +-------- +29.3002 +``` + +## Arguments + +The syntax is: + +```sql +stddev( + summary StatsSummary1D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the standard deviation. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| stddev | DOUBLE PRECISION | The standard deviation of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/sum.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/sum.mdx new file mode 100644 index 0000000..b002a8a --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/sum.mdx @@ -0,0 +1,53 @@ +--- +title: sum() +description: Calculate the sum from a one-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [sum] +--- + + Since 1.3.0 + +Calculate the sum of the values contained in a statistical aggregate. 
+ +## Samples + +Calculate the sum of the integers from 0 to 100: + +```sql +SELECT sum(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +``` +sum +----- +5050 +``` +## Arguments + +The syntax is: + +```sql +sum( + summary StatsSummary1D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| sum | DOUBLE PRECISION | The sum of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/variance.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/variance.mdx new file mode 100644 index 0000000..f595767 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-one-variable/variance.mdx @@ -0,0 +1,55 @@ +--- +title: variance() +description: Calculate the variance from a one-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (one variable) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Calculate the variance from the values in a statistical aggregate. + +## Samples + +Calculate the variance of a sample containing the integers from 0 to 100: + +```sql +SELECT variance(stats_agg(data)) + FROM generate_series(0, 100) data; +``` + +``` +variance +---------- +858.5 +``` + +## Arguments + +The syntax is: + +```sql +variance( + summary StatsSummary1D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary1D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the standard deviation. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| variance | DOUBLE PRECISION | The variance of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/average_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/average_y_x.mdx new file mode 100644 index 0000000..98425b8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/average_y_x.mdx @@ -0,0 +1,61 @@ +--- +title: average_y() | average_x() +description: Calculate the average from a two-dimensional statistical aggregate for the dimension specified +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [average, statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Calculate the average from a two-dimensional aggregate for the given dimension. For example, `average_y()` calculates +the average for all the values of the `y` variable, independent of the values of the `x` variable. 
+
+## Samples
+
+Calculate the average of the integers from 0 to 100:
+
+```sql
+SELECT average_x(stats_agg(y, x))
+    FROM generate_series(1, 5) y,
+         generate_series(0, 100) x;
+```
+
+```
+average
+-----------
+50
+```
+
+## Arguments
+
+The syntax is:
+
+```sql
+average_y(
+    summary StatsSummary2D
+) RETURNS DOUBLE PRECISION
+```
+
+```sql
+average_x(
+    summary StatsSummary2D
+) RETURNS DOUBLE PRECISION
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| average_y \| average_x | DOUBLE PRECISION | The average of the values in the statistical aggregate |
+
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/corr.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/corr.mdx
new file mode 100644
index 0000000..2929b41
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/corr.mdx
@@ -0,0 +1,61 @@
+---
+title: corr()
+description: Calculate the correlation coefficient from a two-dimensional statistical aggregate
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: accessor
+  aggregates:
+    - stats_agg() (two variables)
+topics: [hyperfunctions]
+keywords:
+  [
+    correlation coefficient,
+    statistics,
+    statistical aggregate,
+    hyperfunctions,
+    toolkit,
+  ]
+tags: [least squares, linear regression]
+---
+
+ Since 1.3.0 
+
+Calculate the correlation coefficient from a two-dimensional statistical aggregate. The calculation uses the standard
+least-squares fitting for linear regression.
+ +## Samples + +Calculate the correlation coefficient of independent variable `y` and dependent variable `x` for each 15-minute time +bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + corr(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +corr( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| corr | DOUBLE PRECISION | The correlation coefficient of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/covariance.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/covariance.mdx new file mode 100644 index 0000000..65b2d38 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/covariance.mdx @@ -0,0 +1,55 @@ +--- +title: covariance() +description: Calculate the covariance from a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: + [covariance, statistics, statistical aggregate, hyperfunctions, toolkit] +--- + + Since 1.3.0 + +Calculate the covariance from a two-dimensional statistical aggregate. The calculation uses the standard least-squares +fitting for linear regression. + +## Samples + +Calculate the covariance of independent variable `y` and dependent variable `x` for each 15-minute time bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + covariance(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +covariance( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the covariance. 
The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| covariance | DOUBLE PRECISION | The covariance of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/determination_coeff.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/determination_coeff.mdx new file mode 100644 index 0000000..ea20152 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/determination_coeff.mdx @@ -0,0 +1,60 @@ +--- +title: determination_coeff() +description: Calculate the determination coefficient from a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: + [ + determination coefficient, + statistics, + statistical aggregate, + hyperfunctions, + toolkit, + ] +--- + + Since 1.3.0 + +Calculate the determination coefficient from a two-dimensional statistical aggregate. The calculation uses the standard +least-squares fitting for linear regression. + +## Samples + +Calculate the determination coefficient of independent variable `y` and dependent variable `x` for each 15-minute time +bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + determination_coeff(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +determination_coeff( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| determination_coeff | DOUBLE PRECISION | The determination coefficient of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/index.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/index.mdx new file mode 100644 index 0000000..431d19f --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/index.mdx @@ -0,0 +1,99 @@ +--- +title: stats_agg (two variables) overview +sidebarTitle: Overview +description: Statistical analysis and linear regression functions for two-dimensional data +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.3.0 + +Perform linear regression analysis, for example to calculate correlation coefficient and covariance, on two-dimensional +data. You can also calculate common statistics, such as average and standard deviation, on each dimension separately. +These functions are similar to the PostgreSQL statistical aggregates, but they include more features and are easier to +use in continuous aggregates and window functions. The linear regressions are based on the standard least-squares +fitting method. + +These functions work on two-dimensional data. 
To work with one-dimensional data, for example to calculate the average +and standard deviation of a single variable, see [the one-dimensional `stats_agg` functions][stats_agg-1d]. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Calculate regression and statistical properties + +Create a statistical aggregate that summarizes daily statistical data about two variables, `val2` and `val1`, where +`val2` is the dependent variable and `val1` is the independent variable. Use the statistical aggregate to calculate the +average of the dependent variable and the slope of the linear-regression fit: + +```sql +WITH t AS ( + SELECT + time_bucket('1 day'::interval, ts) AS dt, + stats_agg(val2, val1) AS stats2D + FROM foo + WHERE id = 'bar' + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + average_x(stats2D), + slope(stats2D) +FROM t; +``` + +## Available functions + +### Aggregate +- [`stats_agg()`][stats_agg]: aggregate data into an intermediate statistical aggregate form for further calculation + +### Accessors for y variable statistics +- [`average_y()`][average_y_x]: calculate the average of the dependent variable from a statistical aggregate +- [`stddev_y()`][stddev_y_x]: calculate the standard deviation of the dependent variable from a statistical aggregate +- [`variance_y()`][variance_y_x]: calculate the variance of the dependent variable from a statistical aggregate +- [`skewness_y()`][skewness_y_x]: calculate the skewness of the dependent variable from a statistical aggregate +- [`kurtosis_y()`][kurtosis_y_x]: calculate the kurtosis of the dependent variable from a statistical aggregate +- [`sum_y()`][sum_y_x]: calculate the sum of the dependent variable from a statistical aggregate + +### Accessors for regression analysis +- [`corr()`][corr]: calculate the correlation coefficient from a statistical aggregate +- [`covariance()`][covariance]: calculate the covariance from a statistical aggregate +- [`determination_coeff()`][determination_coeff]: calculate the coefficient of determination (R²) from a statistical + aggregate +- [`slope()`][slope]: calculate the slope of the linear regression line from a statistical aggregate +- [`intercept()`][intercept]: calculate the y-intercept of the linear regression line from a statistical aggregate +- [`x_intercept()`][x_intercept]: calculate the x-intercept of the linear regression line from a statistical aggregate + +### Accessors for aggregate information +- [`num_vals()`][num_vals]: get the number of values contained in a statistical aggregate + +### Rollup +- [`rollup()`][rollup]: combine multiple two-dimensional statistical aggregates + +### Mutator +- [`rolling()`][rolling]: create a rolling window aggregate for use in window functions + +[two-step-aggregation]: #two-step-aggregation +[stats_agg-1d]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-one-variable/index +[stats_agg]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/stats_agg +[average_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/average_y_x +[stddev_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/stddev_y_x +[variance_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/variance_y_x 
+[skewness_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/skewness_y_x +[kurtosis_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/kurtosis_y_x +[sum_y_x]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/sum_y_x +[corr]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/corr +[covariance]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/covariance +[determination_coeff]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/determination_coeff +[slope]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/slope +[intercept]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/intercept +[x_intercept]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/x_intercept +[num_vals]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/num_vals +[rollup]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/rollup +[rolling]: /api-reference/timescaledb/hyperfunctions/statistical-and-regression-analysis/stats_agg-two-variables/rolling \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/intercept.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/intercept.mdx new file mode 100644 index 0000000..2953a48 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/intercept.mdx @@ -0,0 +1,52 @@ +--- +title: intercept() +description: Calculate the intercept from a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [intercept, least squares, linear regression] +--- + + Since 1.3.0 + +Calculate the y intercept from a two-dimensional statistical aggregate. The calculation uses the standard least-squares +fitting for linear regression. 
+ +## Samples + +Calculate the y intercept from independent variable `y` and dependent variable `x` for each 15-minute time bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + intercept(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` +## Arguments + +The syntax is: + +```sql +intercept( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| intercept | DOUBLE PRECISION | The y intercept of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/kurtosis_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/kurtosis_y_x.mdx new file mode 100644 index 0000000..f98ed0f --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/kurtosis_y_x.mdx @@ -0,0 +1,66 @@ +--- +title: kurtosis_y() | kurtosis_x() +description: Calculate the kurtosis from a two-dimensional statistical aggregate for the dimension specified +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [skew] +--- + + Since 1.3.0 + +Calculate the kurtosis from a two-dimensional statistical aggregate for the given dimension. For example, `kurtosis_y()` +calculates the kurtosis for all the values of the `y` variable, independent of values of the `x` variable. The kurtosis +is the fourth statistical moment. It is a measure of "tailedness" of a data distribution compared to a normal +distribution. + +## Samples + +Calculate the kurtosis of a sample containing the integers from 0 to 100: + +```sql +SELECT kurtosis_y(stats_agg(data, data)) + FROM generate_series(0, 100) data; +``` +This returns something like: +```sql +kurtosis_y +---------- +1.78195 +``` + +## Arguments + +The syntax is: + +```sql +kurtosis_y( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` + +```sql +kurtosis_x( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the kurtosis. 
The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| kurtosis_y \| kurtosis_x | DOUBLE PRECISION | The kurtosis of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/num_vals.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/num_vals.mdx new file mode 100644 index 0000000..eb36756 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/num_vals.mdx @@ -0,0 +1,53 @@ +--- +title: num_vals() +description: Calculate the number of values in a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [count, number] +--- + +Calculate the number of values contained in a two-dimensional statistical aggregate. + +## Samples + +Calculate the number of values from 1 to 5, and from 0 to 100, inclusive: + +```sql +SELECT num_vals(stats_agg(y, x)) + FROM generate_series(1, 5) y, + generate_series(0, 100) x; +``` + +``` +num_vals +-------- +505 +``` + +## Arguments + +The syntax is: + +```sql +num_vals( + summary StatsSummary2D +) RETURNS BIGINT +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| num_vals | DOUBLE PRECISION | The number of values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rolling.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rolling.mdx new file mode 100644 index 0000000..6482806 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rolling.mdx @@ -0,0 +1,48 @@ +--- +title: rolling() +description: Combine multiple two-dimensional statistical aggregates to calculate rolling window aggregates +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: rollup + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Combine multiple intermediate two-dimensional statistical aggregate (`StatsSummary2D`) objects into a single +`StatsSummary2D` object. It is optimized for use in a window function context for computing tumbling window statistical +aggregates. + + +This is especially useful for computing tumbling window aggregates from a continuous aggregate. It can be orders of +magnitude faster because it uses inverse transition and combine functions, with the possibility that bigger floating +point errors can occur in unusual scenarios. + +For re-aggregation in a non-window function context, such as combining hourly buckets into daily buckets, see +`rollup()`. 
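+
+## Samples
+
+As a minimal sketch, assuming a hypothetical hypertable `foo` with a `ts` time column and numeric `y` and `x` columns,
+you can mirror the one-variable `rolling()` example: build an hourly continuous aggregate of two-dimensional statistical
+aggregates, then apply accessors such as `slope()` and `corr()` over a rolling one-day window:
+
+```sql
+-- Hypothetical schema: foo(ts TIMESTAMPTZ, y DOUBLE PRECISION, x DOUBLE PRECISION)
+CREATE MATERIALIZED VIEW foo_hourly
+WITH (timescaledb.continuous)
+AS SELECT
+    time_bucket('1h'::interval, ts) AS bucket,
+    stats_agg(y, x) AS stats
+FROM foo
+GROUP BY 1;
+
+-- Rolling one-day window over the hourly two-dimensional aggregates
+SELECT
+    bucket,
+    slope(rolling(stats) OVER (ORDER BY bucket RANGE '1 day' PRECEDING)) AS rolling_slope,
+    corr(rolling(stats) OVER (ORDER BY bucket RANGE '1 day' PRECEDING)) AS rolling_corr
+FROM foo_hourly;
+```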
+
+
+## Arguments
+
+The syntax is:
+
+```sql
+rolling(
+    ss StatsSummary2D
+) RETURNS StatsSummary2D
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| rolling | StatsSummary2D | A new statistical aggregate produced by combining the input statistical aggregates |
\ No newline at end of file
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rollup.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rollup.mdx
new file mode 100644
index 0000000..cc14c17
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/rollup.mdx
@@ -0,0 +1,41 @@
+---
+title: rollup()
+description: Combine multiple two-dimensional statistical aggregates
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: rollup
+  aggregates:
+    - stats_agg() (two variables)
+topics: [hyperfunctions]
+keywords: [statistics, hyperfunctions, Toolkit]
+---
+
+ Since 1.3.0 
+
+Combine multiple intermediate two-dimensional statistical aggregate (`StatsSummary2D`) objects into a single
+`StatsSummary2D` object. For example, you can use `rollup` to combine statistical aggregates from 15-minute buckets into
+daily buckets.
+
+For use in window functions, see `rolling()`.
+
+## Arguments
+
+The syntax is:
+
+```sql
+rollup(
+    ss StatsSummary2D
+) RETURNS StatsSummary2D
+```
+| Name | Type | Default | Required | Description |
+|------|------|---------|----------|-------------|
+| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call |
+
+## Returns
+
+| Column | Type | Description |
+|--------|------|-------------|
+| rollup | StatsSummary2D | A new statistical aggregate produced by combining the input statistical aggregates |
\ No newline at end of file
diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/skewness_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/skewness_y_x.mdx
new file mode 100644
index 0000000..2dcc663
--- /dev/null
+++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/skewness_y_x.mdx
@@ -0,0 +1,64 @@
+---
+title: skewness_y() | skewness_x()
+description: Calculate the skewness from a two-dimensional statistical aggregate for the dimension specified
+license: community
+type: function
+toolkit: true
+hyperfunction:
+  family: statistical and regression analysis
+  type: accessor
+  aggregates:
+    - stats_agg() (two variables)
+topics: [hyperfunctions]
+keywords: [statistics, hyperfunctions, Toolkit]
+---
+
+ Since 1.3.0 
+
+Calculate the skewness from a two-dimensional statistical aggregate for the given dimension. For example, `skewness_y()`
+calculates the skewness for all the values of the `y` variable, independent of values of the `x` variable. The skewness
+is the third statistical moment. It is a measure of asymmetry in a data distribution.
+ +## Samples + +Calculate the skewness of a sample containing the integers from 0 to 100: + +```sql +SELECT skewness_x(stats_agg(data, data)) + FROM generate_series(0, 100) data; +``` + +```sql +skewness_x +---------- +0 +``` + +## Arguments + +The syntax is: + +```sql +skewness_y( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` + +```sql +skewness_x( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the skewness. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| skewness_y \| skewness_x | DOUBLE PRECISION | The skewness of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/slope.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/slope.mdx new file mode 100644 index 0000000..b70c811 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/slope.mdx @@ -0,0 +1,53 @@ +--- +title: slope() +description: Calculate the slope from a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [least squares, linear regression] +--- + + Since 1.3.0 + +Calculate the slope of the linear fitting line from a two-dimensional statistical aggregate. The calculation uses the +standard least-squares fitting for linear regression. 
+ +## Samples + +Calculate the slope from independent variable `y` and dependent variable `x` for each 15-minute time bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + slope(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +slope( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| slope | DOUBLE PRECISION | The slope of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stats_agg.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stats_agg.mdx new file mode 100644 index 0000000..92ba4cd --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stats_agg.mdx @@ -0,0 +1,42 @@ +--- +title: stats_agg() (two variables) +description: Aggregate data into an intermediate statistical aggregate form for further calculation +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: aggregate + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Aggregate data into an intermediate statistical aggregate form for further calculation. This is the first step for +performing any statistical aggregate calculations on two-dimensional data. Use `stats_agg` to create an intermediate +aggregate (`StatsSummary2D`) from your data. This intermediate form can then be used by one or more accessors in this +group to compute the final results. Optionally, multiple such intermediate aggregate objects can be combined using +`rollup()` or `rolling()` before an accessor is applied. + +## Arguments + +The syntax is: + +```sql +stats_agg( + y DOUBLE PRECISION, + x DOUBLE PRECISION +) RETURNS StatsSummary2D +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| y, x | DOUBLE PRECISION | - | ✔ | The variables to use for the statistical aggregate | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| stats_agg | StatsSummary2D | The statistical aggregate, containing data about the variables in an intermediate form. Pass the aggregate to accessor functions in the statistical aggregates API to perform final calculations. 
Or, pass the aggregate to rollup functions to combine multiple statistical aggregates into larger aggregates | diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stddev_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stddev_y_x.mdx new file mode 100644 index 0000000..89bd12c --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/stddev_y_x.mdx @@ -0,0 +1,64 @@ +--- +title: stddev_y() | stddev_x() +description: Calculate the standard deviation from a two-dimensional statistical aggregate for the dimension specified +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [standard deviation] +--- + + Since 1.3.0 + +Calculate the standard deviation from a two-dimensional statistical aggregate for the given dimension. For example, +`stddev_y()` calculates the standard deviation for all the values of the `y` variable, independent of values of the `x` variable. + +## Samples + +Calculate the standard deviation of a sample containing the integers from 0 to 100: + +```sql +SELECT stddev_y(stats_agg(data, data)) + FROM generate_series(0, 100) data; +``` + +``` +stddev_y +-------- +29.3002 +``` + +## Arguments + +The syntax is: + +```sql +stddev_y( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` + +```sql +stddev_x( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the standard deviation. The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| stddev_y \| stddev_x | DOUBLE PRECISION | The standard deviation of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/sum_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/sum_y_x.mdx new file mode 100644 index 0000000..e84a029 --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/sum_y_x.mdx @@ -0,0 +1,61 @@ +--- +title: sum_y() | sum_x() +description: Calculate the sum from a two-dimensional statistical aggregate for the dimension specified +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [sum] +--- + + Since 1.3.0 + +Calculate the sum from a two-dimensional statistical aggregate for the given dimension. For example, `sum_y()` +calculates the sum for all the values of the `y` variable, independent of values of the `x` variable.
+ +## Samples + +Calculate the sum of the numbers from 0 to 100: + +```sql +SELECT sum_y(stats_agg(data, data)) + FROM generate_series(0, 100) data; +``` + +``` +sum_y +----- +5050 +``` + +## Arguments + +The syntax is: + +```sql +sum_y( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` + +```sql +sum_x( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| sum | DOUBLE PRECISION | The sum of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/variance_y_x.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/variance_y_x.mdx new file mode 100644 index 0000000..7d8b14d --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/variance_y_x.mdx @@ -0,0 +1,63 @@ +--- +title: variance_y() | variance_x() +description: Calculate the variance from a two-dimensional statistical aggregate for the dimension specified +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +--- + + Since 1.3.0 + +Calculate the variance from a two-dimensional statistical aggregate for the given dimension. For example, `variance_y()` +calculates the variance for all the values of the `y` variable, independent of values of the `x` variable. + +## Samples + +Calculate the variance of a sample containing the integers from 0 to 100: + +```sql +SELECT variance_y(stats_agg(data, data)) +FROM generate_series(0, 100) data; +``` + +``` +variance_y +---------- +858.5 +``` + +## Arguments + +The syntax is: + +```sql +variance_y( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` + +```sql +variance_x( + summary StatsSummary2D, + [ method TEXT ] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | +| method | TEXT | `sample` | - | The method used for calculating the variance.
The two options are `population` and `sample`, which can be abbreviated to `pop` or `samp` | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| variance | DOUBLE PRECISION | The variance of the values in the statistical aggregate | + diff --git a/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/x_intercept.mdx b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/x_intercept.mdx new file mode 100644 index 0000000..156780e --- /dev/null +++ b/api-reference/timescaledb-toolkit/statistical-and-regression-analysis/stats_agg-two-variables/x_intercept.mdx @@ -0,0 +1,53 @@ +--- +title: x_intercept() +description: Calculate the x-intercept from a two-dimensional statistical aggregate +license: community +type: function +toolkit: true +hyperfunction: + family: statistical and regression analysis + type: accessor + aggregates: + - stats_agg() (two variables) +topics: [hyperfunctions] +keywords: [statistics, hyperfunctions, Toolkit] +tags: [least squares, linear regression] +--- + + Since 1.3.0 + +Calculate the x intercept from a two-dimensional statistical aggregate. The calculation uses the standard least-squares +fitting for linear regression. + +## Samples + +Calculate the x intercept from independent variable `y` and dependent variable `x` for each 15-minute time bucket: + +```sql +SELECT + id, + time_bucket('15 min'::interval, ts) AS bucket, + x_intercept(stats_agg(y, x)) AS summary +FROM foo +GROUP BY id, time_bucket('15 min'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +x_intercept( + summary StatsSummary2D +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| summary | StatsSummary2D | - | ✔ | The statistical aggregate produced by a `stats_agg` call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| intercept | DOUBLE PRECISION | The x intercept of the least-squares fit line | + diff --git a/api-reference/timescaledb-toolkit/time_weight/average.mdx b/api-reference/timescaledb-toolkit/time_weight/average.mdx new file mode 100644 index 0000000..18a14a8 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/average.mdx @@ -0,0 +1,60 @@ +--- +title: average() +description: Calculate the time-weighted average of values in a TimeWeightSummary +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.0.0 + +Calculate the time-weighted average. Equal to [`integral`][integral] divided by the elapsed time. Note that there is a +key difference to `avg()`: If there is exactly one value, `avg()` would return that value, but `average()` returns +`NULL`. 
+ + +## Samples + +Calculate the time-weighted average of the column `val`, using the 'last observation carried forward' interpolation +method: + +```sql +SELECT + id, + average(tws) +FROM ( + SELECT + id, + time_weight('LOCF', ts, val) AS tws + FROM foo + GROUP BY id +) t +``` + +## Arguments + +The syntax is: + +```sql +average( + tws TimeWeightSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| average | DOUBLE PRECISION | The time-weighted average. | + +[integral]: /api-reference/timescaledb/hyperfunctions/time_weight/integral \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/time_weight/first_time.mdx b/api-reference/timescaledb-toolkit/time_weight/first_time.mdx new file mode 100644 index 0000000..70b292b --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/first_time.mdx @@ -0,0 +1,57 @@ +--- +title: first_time() +description: Get the first timestamp from a TimeWeightSummary aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.11.0 + +Get the timestamp of the first point in a TimeWeightSummary aggregate. + + +## Samples + +Produce a linear TimeWeightSummary over the column `val` and get the first timestamp: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_time(tw) +FROM t; +``` + +## Arguments + +The syntax is: + +```sql +first_time( + tw TimeWeightSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| first_time | TIMESTAMPTZ | The time of the first point in the `TimeWeightSummary` | + diff --git a/api-reference/timescaledb-toolkit/time_weight/first_val.mdx b/api-reference/timescaledb-toolkit/time_weight/first_val.mdx new file mode 100644 index 0000000..ef5b39e --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/first_val.mdx @@ -0,0 +1,57 @@ +--- +title: first_val() +description: Get the first value from a TimeWeightSummary aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.11.0 + +Get the value of the first point in a TimeWeightSummary aggregate. 
+ + +## Samples + +Produce a linear TimeWeightSummary over the column `val` and get the first value: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + first_val(tw) +FROM t; +``` + +## Arguments + +The syntax is: + +```sql +first_val( + tw TimeWeightSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| first_val | DOUBLE PRECISION | The value of the first point in the `TimeWeightSummary` | + diff --git a/api-reference/timescaledb-toolkit/time_weight/index.mdx b/api-reference/timescaledb-toolkit/time_weight/index.mdx new file mode 100644 index 0000000..6c61941 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/index.mdx @@ -0,0 +1,139 @@ +--- +title: Time-weighted calculations overview +sidebarTitle: Overview +description: Calculate time-weighted summary statistics for unevenly sampled data +topics: [hyperfunctions] +license: community +toolkit: true +products: [cloud, mst, self_hosted] +--- + + Since 1.0.0 + +Calculate time-weighted summary statistics, such as averages (means) and integrals. Time weighting is used when data is +unevenly sampled over time. In that case, a straight average gives misleading results, as it biases towards more +frequently sampled values. + +For example, a sensor might silently spend long periods of time in a steady state, and send data only when a significant +change occurs. The regular mean counts the steady-state reading as only a single point, whereas a time-weighted mean +accounts for the long period of time spent in the steady state. In essence, the time-weighted mean takes an integral +over time, then divides by the elapsed time. + +import TwoStepAggregation from '/snippets/api-reference/timescaledb/hyperfunctions/_two-step-aggregation.mdx'; + +## Two-step aggregation + + + +## Samples + +### Aggregate data into a TimeWeightSummary and calculate the average + +Given a table `foo` with data in a column `val`, aggregate data into a daily `TimeWeightSummary`. Use that to calculate +the average for column `val`: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw + FROM foo + WHERE measure_id = 10 + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + average(tw) +FROM t; +``` + +### Advanced usage + +#### Parallelism and ordering + +Time-weighted average calculations are not strictly parallelizable, as defined by PostgreSQL. These calculations require +inputs to be strictly ordered, but in general, PostgreSQL parallelizes by assigning rows randomly to workers. + +However, the algorithm can be parallelized if it is guaranteed that all rows within some time range go to the same +worker. This is the case for both continuous aggregates and distributed hypertables. (Note that the partitioning keys of +the distributed hypertable must be within the `GROUP BY` clause, but this is usually the case.) + +#### Combining aggregates across measurement series + +If you try to combine overlapping `TimeWeightSummaries`, an error is thrown. 
For example, you might create a +`TimeWeightSummary` for `device_1` and a separate `TimeWeightSummary` for `device_2`, both covering the same period of +time. You can't combine these because the interpolation techniques only make sense when restricted to a single +measurement series. + +If you want to calculate a single summary statistic across all devices, use a simple average, like this: + +```sql +WITH t as (SELECT measure_id, + average( + time_weight('LOCF', ts, val) + ) as time_weighted_average + FROM foo + GROUP BY measure_id) +SELECT avg(time_weighted_average) -- use the normal avg function to average the time-weighted averages +FROM t; +``` + +#### Parallelism in multi-node + +The time-weighted average functions are not strictly parallelizable in the PostgreSQL sense. PostgreSQL requires that +parallelizable functions accept potentially overlapping input. As explained above, the time-weighted functions do not. +However, they do support partial aggregation and partition-wise aggregation in multi-node setups. + +#### Reducing memory usage + +Because the time-weighted aggregates require ordered sets, they build up a buffer of input data, sort it, and then +perform the aggregation steps. When memory is too small to build up a buffer of points, you might see Out of Memory +failures or other issues. In these cases, try using a multi-level aggregate. For example: + +```sql +WITH t as (SELECT measure_id, + time_bucket('1 day'::interval, ts), + time_weight('LOCF', ts, val) + FROM foo + GROUP BY measure_id, time_bucket('1 day'::interval, ts) + ) +SELECT measure_id, + average( + rollup(time_weight) + ) +FROM t +GROUP BY measure_id; +``` + +## Functions in this group + +### Aggregate +- [`time_weight()`][time_weight]: aggregate data into an intermediate time-weighted aggregate form for further + calculation + +### Accessors +- [`average()`][average]: calculate the time-weighted average of values in a TimeWeightSummary +- [`first_time()`][first_time]: get the timestamp of the first point in the TimeWeightSummary +- [`first_val()`][first_val]: get the value of the first point in the TimeWeightSummary +- [`integral()`][integral]: calculate the integral from a TimeWeightSummary +- [`interpolated_average()`][interpolated_average]: calculate the time-weighted average, interpolating at boundaries +- [`interpolated_integral()`][interpolated_integral]: calculate the integral, interpolating at boundaries +- [`last_time()`][last_time]: get the timestamp of the last point in the TimeWeightSummary +- [`last_val()`][last_val]: get the value of the last point in the TimeWeightSummary + +### Rollup +- [`rollup()`][rollup]: combine multiple TimeWeightSummaries + +[two-step-aggregation]: #two-step-aggregation +[blog-two-step-aggregates]: https://www.timescale.com/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design +[caggs]: /use-timescale/continuous-aggregates/about-continuous-aggregates/ +[time_weight]: /api-reference/timescaledb/hyperfunctions/time_weight/time_weight +[average]: /api-reference/timescaledb/hyperfunctions/time_weight/average +[first_time]: /api-reference/timescaledb/hyperfunctions/time_weight/first_time +[first_val]: /api-reference/timescaledb/hyperfunctions/time_weight/first_val +[integral]: /api-reference/timescaledb/hyperfunctions/time_weight/integral +[interpolated_average]: /api-reference/timescaledb/hyperfunctions/time_weight/interpolated_average +[interpolated_integral]: /api-reference/timescaledb/hyperfunctions/time_weight/interpolated_integral 
+[last_time]: /api-reference/timescaledb/hyperfunctions/time_weight/last_time +[last_val]: /api-reference/timescaledb/hyperfunctions/time_weight/last_val +[rollup]: /api-reference/timescaledb/hyperfunctions/time_weight/rollup diff --git a/api-reference/timescaledb-toolkit/time_weight/integral.mdx b/api-reference/timescaledb-toolkit/time_weight/integral.mdx new file mode 100644 index 0000000..e828e7e --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/integral.mdx @@ -0,0 +1,66 @@ +--- +title: integral() +description: Calculate the integral from a TimeWeightSummary +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.15.0 + +Calculate the integral, or the area under the curve formed by the data points. Equal to [`average`][average] multiplied +by the elapsed time. + + +## Samples + +Create a table to track irregularly sampled storage usage in bytes, and get the total storage used in byte-hours. Use +the 'last observation carried forward' interpolation method: + +```sql +-- Create a table to track irregularly sampled storage usage +CREATE TABLE user_storage_usage(ts TIMESTAMP, storage_bytes BIGINT); +INSERT INTO user_storage_usage(ts, storage_bytes) VALUES + ('01-01-2022 00:00', 0), + ('01-01-2022 00:30', 100), + ('01-01-2022 03:00', 300), + ('01-01-2022 03:10', 1000), + ('01-01-2022 03:25', 817); + +-- Get the total byte-hours used +SELECT + integral(time_weight('LOCF', ts, storage_bytes), 'hours') +FROM + user_storage_usage; +``` + + +## Arguments + +The syntax is: + +```sql +integral( + tws TimeWeightSummary + [, unit TEXT] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | +| unit | TEXT | second | | The unit of time to express the integral in. Can be `microsecond`, `millisecond`, `second`, `minute`, `hour`, or any alias for those units supported by Postgres | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| integral | DOUBLE PRECISION | The time-weighted integral. | + +[average]: /api-reference/timescaledb/hyperfunctions/time_weight/average diff --git a/api-reference/timescaledb-toolkit/time_weight/interpolated_average.mdx b/api-reference/timescaledb-toolkit/time_weight/interpolated_average.mdx new file mode 100644 index 0000000..227f8e0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/interpolated_average.mdx @@ -0,0 +1,85 @@ +--- +title: interpolated_average() +description: Calculate the time-weighted average over an interval, while interpolating the interval bounds +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.14.0 + +Calculate the time-weighted average over an interval, while interpolating the interval bounds. + +Similar to [`average`][average], but allows an accurate calculation across interval bounds when data has been bucketed +into separate time intervals, and there is no data point precisely at the interval bound. For example, this is useful in +a window function. 
+ +Values from the previous and next buckets are used to interpolate the values at the bounds, using the same interpolation +method used within the TimeWeightSummary itself. + +Equal to [`interpolated_integral`][interpolated_integral] divided by the elapsed time. + + +## Samples + +Calculate the time-weighted daily average of the column `val`, interpolating over bucket bounds using the 'last +observation carried forward' method: + +```sql +SELECT + id, + time, + interpolated_average( + tws, + time, + '1 day', + LAG(tws) OVER (PARTITION BY id ORDER by time), + LEAD(tws) OVER (PARTITION BY id ORDER by time) + ) +FROM ( + SELECT + id, + time_bucket('1 day', ts) AS time, + time_weight('LOCF', ts, val) AS tws + FROM foo + GROUP BY id, time +) t +``` + +## Arguments + +The syntax is: + +```sql +interpolated_average( + tws TimeWeightSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev TimeWeightSummary] + [, next TimeWeightSummary] +) RETURNS DOUBLE PRECISION +``` + +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval which the time-weighted average should cover (if there is a preceding point) | +| interval | INTERVAL | - | ✔ | The length of the interval which the time-weighted average should cover | +| prev | TimeWeightSummary | NULL | | The TimeWeightSummary from the prior interval, used to interpolate the value at `start`. If NULL, the first timestamp in `tws` is used for the starting value. The prior interval can be determined from the Postgres `lag()` function | +| next | TimeWeightSummary | NULL | | The TimeWeightSummary from the next interval, used to interpolate the value at `start` + `interval`. If NULL, the last timestamp in `tws` is used for the starting value. The next interval can be determined from the Postgres `lead()` function | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| average | DOUBLE PRECISION | The time-weighted average for the interval (`start`, `start` + `interval`), computed from the `TimeWeightSummary` plus end points interpolated from `prev` and `next` | + +[average]: /api-reference/timescaledb/hyperfunctions/time_weight/average +[interpolated_integral]: /api-reference/timescaledb/hyperfunctions/time_weight/interpolated_integral \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/time_weight/interpolated_integral.mdx b/api-reference/timescaledb-toolkit/time_weight/interpolated_integral.mdx new file mode 100644 index 0000000..6f506e9 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/interpolated_integral.mdx @@ -0,0 +1,91 @@ +--- +title: interpolated_integral() +description: Calculate the integral over an interval, while interpolating the interval bounds +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.15.0 + +Calculate the integral over an interval, while interpolating the interval bounds. + +Similar to [`integral`][integral], but allows an accurate calculation across interval bounds when data has been bucketed +into separate time intervals, and there is no data point precisely at the interval bound. For example, this is useful in +a window function. 
+ +Values from the previous and next buckets are used to interpolate the values at the bounds, using the same interpolation +method used within the TimeWeightSummary itself. + +Equal to [`interpolated_average`][interpolated_average] multiplied by the elapsed time. + + +## Samples + +Create a table to track irregularly sampled storage usage in bytes, and get the total storage used in byte-hours between +January 1 and January 6. Use the 'last observation carried forward' interpolation method: + +```sql +-- Create a table to track irregularly sampled storage usage +CREATE TABLE user_storage_usage(ts TIMESTAMP, storage_bytes BIGINT); +INSERT INTO user_storage_usage(ts, storage_bytes) VALUES + ('01-01-2022 20:55', 27), + ('01-02-2022 18:33', 100), + ('01-03-2022 03:05', 300), + ('01-04-2022 12:13', 1000), + ('01-05-2022 07:26', 817); + + +-- Get the total byte-hours used between Jan. 1 and Jan. 6 +SELECT + interpolated_integral( + time_weight('LOCF', ts, storage_bytes), + '01-01-2022', + '5 days', + NULL, + NULL, + 'hours' + ) +FROM + user_storage_usage; +``` + +## Arguments + +The syntax is: + +```sql +interpolated_integral( + tws TimeWeightSummary, + start TIMESTAMPTZ, + interval INTERVAL + [, prev TimeWeightSummary] + [, next TimeWeightSummary] + [, unit TEXT] +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | +| start | TIMESTAMPTZ | - | ✔ | The start of the interval which the time-weighted integral should cover (if there is a preceding point) | +| interval | INTERVAL | - | ✔ | The length of the interval which the time-weighted integral should cover | +| prev | TimeWeightSummary | NULL | | The TimeWeightSummary from the prior interval, used to interpolate the value at `start`. If NULL, the first timestamp in `tws` is used for the starting value. The prior interval can be determined from the Postgres `lag()` function | +| next | TimeWeightSummary | NULL | | The TimeWeightSummary from the next interval, used to interpolate the value at `start` + `interval`. If NULL, the last timestamp in `tws` is used for the starting value. The next interval can be determined from the Postgres `lead()` function | +| unit | TEXT | second | | The unit of time to express the integral in. 
Can be `microsecond`, `millisecond`, `second`, `minute`, `hour`, or any alias for those units supported by Postgres | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| integral | DOUBLE PRECISION | The time-weighted integral for the interval (`start`, `start` + `interval`), computed from the `TimeWeightSummary` plus end points interpolated from `prev` and `next` | + + +[integral]: /api-reference/timescaledb/hyperfunctions/time_weight/integral +[interpolated_average]: /api-reference/timescaledb/hyperfunctions/time_weight/interpolated_average diff --git a/api-reference/timescaledb-toolkit/time_weight/last_time.mdx b/api-reference/timescaledb-toolkit/time_weight/last_time.mdx new file mode 100644 index 0000000..b46b48e --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/last_time.mdx @@ -0,0 +1,57 @@ +--- +title: last_time() +description: Get the last timestamp from a TimeWeightSummary aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.11.0 + +Get the timestamp of the last point in a TimeWeightSummary aggregate. + + +## Samples + +Produce a linear TimeWeightSummary over the column `val` and get the last timestamp: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + last_time(tw) +FROM t; +``` + +## Arguments + +The syntax is: + +```sql +last_time( + tw TimeWeightSummary +) RETURNS TIMESTAMPTZ +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| last_time | TIMESTAMPTZ | The time of the last point in the `TimeWeightSummary` | + diff --git a/api-reference/timescaledb-toolkit/time_weight/last_val.mdx b/api-reference/timescaledb-toolkit/time_weight/last_val.mdx new file mode 100644 index 0000000..f13b2a0 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/last_val.mdx @@ -0,0 +1,57 @@ +--- +title: last_val() +description: Get the last value from a TimeWeightSummary aggregate +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: accessor + aggregates: + - time_weight() +--- + + Since 1.11.0 + +Get the value of the last point in a TimeWeightSummary aggregate. 
+ + +## Samples + +Produce a linear TimeWeightSummary over the column `val` and get the last value: + +```sql +WITH t as ( + SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw + FROM table + GROUP BY time_bucket('1 day'::interval, ts) +) +SELECT + dt, + last_val(tw) +FROM t; +``` + +## Arguments + +The syntax is: + +```sql +last_val( + tw TimeWeightSummary +) RETURNS DOUBLE PRECISION +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| tws | TimeWeightSummary | - | ✔ | The input TimeWeightSummary from a time_weight() call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| last_val | DOUBLE PRECISION | The value of the last point in the `TimeWeightSummary` | + diff --git a/api-reference/timescaledb-toolkit/time_weight/rollup.mdx b/api-reference/timescaledb-toolkit/time_weight/rollup.mdx new file mode 100644 index 0000000..1c89a3b --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/rollup.mdx @@ -0,0 +1,40 @@ +--- +title: rollup() +description: Combine multiple TimeWeightSummaries +topics: [hyperfunctions] +license: community +type: function +toolkit: true +products: [cloud, mst, self_hosted] +hyperfunction: + family: time-weighted calculations + type: rollup + aggregates: + - time_weight() +--- + + Since 1.0.0 + +Combine multiple intermediate time-weighted aggregate (TimeWeightSummary) objects produced by time_weight() into a +single intermediate TimeWeightSummary object. For example, you can use `rollup` to combine time-weighted aggregates from +15-minute buckets into daily buckets. + + +## Arguments + +The syntax is: + +```sql +rollup( + tws TimeWeightSummary +) RETURNS TimeWeightSummary +``` +| Name | Type | Default | Required | Description | +|------|------|---------|----------|-------------| +| time_weight | TimeWeightSummary | - | ✔ | The TimeWeightSummary aggregate produced by a time_weight call | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| rollup | TimeWeightSummary | A new TimeWeightSummary aggregate produced by combining the input TimeWeightSummary aggregates | \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/time_weight/time_weight.mdx b/api-reference/timescaledb-toolkit/time_weight/time_weight.mdx new file mode 100644 index 0000000..427a4e5 --- /dev/null +++ b/api-reference/timescaledb-toolkit/time_weight/time_weight.mdx @@ -0,0 +1,61 @@ +--- +title: time_weight() +description: Aggregate data into an intermediate time-weighted aggregate form for further calculation +topics: [hyperfunctions] +license: community +type: function +toolkit: true +hyperfunction: + family: time-weighted calculations + type: aggregate + aggregates: + - time_weight() +products: [cloud, mst, self_hosted] +--- + + Since 1.0.0 + +This is the first step for performing any time-weighted calculations. Use +`time_weight` to create an intermediate aggregate (`TimeWeightSummary`) from +your data. This intermediate form can then be used by one or more accessors +in this group to compute final results. + +Optionally, multiple such intermediate aggregate objects can be combined +using [`rollup()`](#rollup) before an accessor is applied. + + +## Samples + +Aggregate data from column `val` into daily time-weighted aggregates, +using the linear interpolation method. 
+ +```sql +SELECT + time_bucket('1 day'::interval, ts) as dt, + time_weight('Linear', ts, val) AS tw +FROM foo +GROUP BY time_bucket('1 day'::interval, ts) +``` + +## Arguments + +The syntax is: + +```sql +time_weight( + method TEXT, + ts TIMESTAMPTZ, + value DOUBLE PRECISION +) RETURNS TimeWeightSummary +``` +| Name | Type | Default | Required | Description | +|--|--|--|--|--| +| `method` | TEXT | - | ✔ | The weighting method to use. The available methods are `linear` (or its alias `trapezoidal`, for those familiar with numeric integration methods) and `LOCF`, which stands for 'last observation carried forward'. `linear` fills in missing data by interpolating linearly between the start and end points of the gap. `LOCF` fills in the gap by assuming that the value remains constant until the next value is seen. `LOCF` is most useful when a measurement is taken only when a value changes. `linear` is most useful if there are no such guarantees on the measurement. The method names are case-insensitive. | +| `ts` | TIMESTAMPTZ | - | ✔ | The time at each point. Null values are ignored. An aggregate evaluated on only `null` values returns `null`. | +| `value` | DOUBLE PRECISION | - | ✔ | The value at each point to use for the time-weighted aggregate. Null values are ignored. An aggregate evaluated on only `null` values returns `null`. | + +## Returns + +| Column | Type | Description | +|--------|------|-------------| +| time_weight | TimeWeightSummary | A `TimeWeightSummary` object that can be passed to other functions within the time-weighting API | diff --git a/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference-landing.mdx b/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference-landing.mdx deleted file mode 100644 index 2cfa4b4..0000000 --- a/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference-landing.mdx +++ /dev/null @@ -1,17 +0,0 @@ ---- -title: TimescaleDB Toolkit API Reference -description: Complete API reference for TimescaleDB toolkit functions and utilities -products: [cloud, mst, self_hosted] -keywords: [API, reference, toolkit, utilities, functions] -mode: "wide" ---- - - - - Complete API reference for TimescaleDB toolkit functions, utilities, and extended functionality. - - \ No newline at end of file diff --git a/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference.mdx b/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference.mdx deleted file mode 100644 index 1d2deb8..0000000 --- a/api-reference/timescaledb-toolkit/timescaledb-toolkit-api-reference.mdx +++ /dev/null @@ -1,7 +0,0 @@ ---- -title: TimescaleDB toolkit API reference -description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ---- - -Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - diff --git a/api-reference/timescaledb/administration/get_telemetry_report.mdx b/api-reference/timescaledb/administration/get_telemetry_report.mdx new file mode 100644 index 0000000..d1274a5 --- /dev/null +++ b/api-reference/timescaledb/administration/get_telemetry_report.mdx @@ -0,0 +1,29 @@ +--- +title: get_telemetry_report() +description: View the background telemetry string sent to Timescale +keywords: [administration, telemetry] +tags: [telemetry] +products: [cloud, mst, self_hosted] +license: apache +type: function +--- + +Returns the background [telemetry][telemetry] string sent to Timescale. 
+ +If telemetry is turned off, it sends the string that would be sent if telemetry were enabled. + +## Samples + +View the telemetry report: + +```sql +SELECT get_telemetry_report(); +``` + +## Returns + +|Column|Type|Description| +|-|-|-| +|telemetry_report|TEXT|The telemetry string that is or would be sent to Timescale| + +[telemetry]: /manage-data/self-hosted/configuration/telemetry diff --git a/api-reference/timescaledb/administration/index.mdx b/api-reference/timescaledb/administration/index.mdx new file mode 100644 index 0000000..251fffb --- /dev/null +++ b/api-reference/timescaledb/administration/index.mdx @@ -0,0 +1,101 @@ +--- +title: Administrative functions +sidebarTitle: Overview +description: Administration functions help you manage your service before and after recovery, as well as keeping track of your data +keywords: [administration] +tags: [backup, restore, set up] +products: [cloud, mst, self_hosted] +license: apache +--- + +import { TIMESCALE_DB } from '/snippets/vars.mdx'; + +Administrative APIs help you prepare a database before and after a restore event. They also help you keep track of your +{TIMESCALE_DB} setup data. + +## Samples + +### Prepare database for restore + +Before restoring a database backup, prepare the database: + +```sql +SELECT timescaledb_pre_restore(); +``` + +Then perform your restore operation: + +```bash +psql -d your_database < backup.sql +``` + +### Complete restore operation + +After restoring a database backup, complete the restore: + +```sql +SELECT timescaledb_post_restore(); +``` + +### View telemetry report + +Check what telemetry data is being collected and sent: + +```sql +SELECT get_telemetry_report(); +``` + +### Full backup and restore workflow + +Complete workflow for backing up and restoring a TimescaleDB database: + +```bash +# On source database +psql -d source_db -c "SELECT timescaledb_pre_restore();" +pg_dump -Fc -f backup.dump source_db + +# On target database +createdb target_db +psql -d target_db -c "CREATE EXTENSION IF NOT EXISTS timescaledb;" +psql -d target_db -c "SELECT timescaledb_pre_restore();" +pg_restore -d target_db backup.dump +psql -d target_db -c "SELECT timescaledb_post_restore();" +``` + +### Check TimescaleDB version and installation + +Verify the TimescaleDB extension is installed and check version: + +```sql +SELECT default_version, installed_version +FROM pg_available_extensions +WHERE name = 'timescaledb'; + +SELECT extversion +FROM pg_extension +WHERE extname = 'timescaledb'; +``` + +## Dump TimescaleDB meta data + +To help when asking for support and reporting bugs, {TIMESCALE_DB} includes an SQL dump script. It outputs metadata from +the internal {TIMESCALE_DB} tables, along with version information. + +This script is available in the source distribution in `scripts/`. To use it, run: + +```bash +psql [your connect flags] -d your_timescale_db < dump_meta_data.sql > dumpfile.txt +``` + +Inspect `dumpfile.txt` before sending it together with a bug report or support question. 
+ +## Available functions + +- [`get_telemetry_report()`][get_telemetry_report]: view the background telemetry string sent to Timescale +- [`timescaledb_post_restore()`][timescaledb_post_restore]: perform required operations after finishing a database + restore +- [`timescaledb_pre_restore()`][timescaledb_pre_restore]: prepare the database for a restore operation + +[get_telemetry_report]: /api-reference/timescaledb/administration/get_telemetry_report +[timescaledb_post_restore]: /api-reference/timescaledb/administration/timescaledb_post_restore +[timescaledb_pre_restore]: /api-reference/timescaledb/administration/timescaledb_pre_restore diff --git a/api-reference/timescaledb/administration/timescaledb_post_restore.mdx b/api-reference/timescaledb/administration/timescaledb_post_restore.mdx new file mode 100644 index 0000000..83c8824 --- /dev/null +++ b/api-reference/timescaledb/administration/timescaledb_post_restore.mdx @@ -0,0 +1,32 @@ +--- +title: timescaledb_post_restore() +description: Perform required operations after finishing a database restore +keywords: [administration, restore, backup] +tags: [backup, restore] +products: [cloud, mst, self_hosted] +license: apache +type: function +--- + +import { TIMESCALE_DB } from '/snippets/vars.mdx'; + +Perform the required operations after you have finished restoring the database using `pg_restore`. Specifically, this +resets the `timescaledb.restoring` GUC and restarts any background workers. + +For more information, see [Migrate using pg_dump and pg_restore][pg-dump-restore]. + +## Samples + +Prepare the database for normal use after a restore: + +```sql +SELECT timescaledb_post_restore(); +``` + +## Returns + +|Column|Type|Description| +|-|-|-| +|success|BOOLEAN|TRUE if the operation completed successfully| + +[pg-dump-restore]: /migrate/postgres/pg-dump-and-restore diff --git a/api-reference/timescaledb/administration/timescaledb_pre_restore.mdx b/api-reference/timescaledb/administration/timescaledb_pre_restore.mdx new file mode 100644 index 0000000..e4b3759 --- /dev/null +++ b/api-reference/timescaledb/administration/timescaledb_pre_restore.mdx @@ -0,0 +1,41 @@ +--- +title: timescaledb_pre_restore() +description: Prepare the database for a restore operation +keywords: [administration, restore, backup] +tags: [backup, restore] +products: [cloud, mst, self_hosted] +license: apache +type: function +--- + +import { TIMESCALE_DB } from '/snippets/vars.mdx'; + +Perform the required operations so that you can restore the database using `pg_restore`. Specifically, this sets the +`timescaledb.restoring` GUC to `on` and stops any background workers which could have been performing tasks. + +The background workers are stopped until the [`timescaledb_post_restore()`][timescaledb_post_restore] function is run, +after the restore operation is complete. + +For more information, see [Migrate using pg_dump and pg_restore][pg-dump-restore]. + + +After using `timescaledb_pre_restore()`, you need to run [`timescaledb_post_restore()`][timescaledb_post_restore] before +you can use the database normally. 
+ + +## Samples + +Prepare to restore the database: + +```sql +SELECT timescaledb_pre_restore(); +``` + +## Returns + +|Column|Type|Description| +|-|-|-| +|success|BOOLEAN|TRUE if the operation completed successfully| + +[timescaledb_post_restore]: /api-reference/timescaledb/administration/timescaledb_post_restore +[pg-dump-restore]: /migrate/postgres/pg-dump-and-restore diff --git a/api-reference/timescaledb/compression/add_compression_policy.mdx b/api-reference/timescaledb/compression/add_compression_policy.mdx new file mode 100644 index 0000000..985bfe8 --- /dev/null +++ b/api-reference/timescaledb/compression/add_compression_policy.mdx @@ -0,0 +1,93 @@ +--- +title: add_compression_policy() +description: Add policy to schedule automatic compression of chunks +topics: [compression, jobs] +keywords: [compression, policies] +tags: [scheduled jobs, background jobs, automation framework] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { CHUNK, HYPERTABLE, CAGG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [add_columnstore_policy()][add-columnstore-policy]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +Allows you to set a policy by which the system compresses a {CHUNK} automatically in the background after it reaches a +given age. + +Compression policies can only be created on {HYPERTABLE}s or {CAGG}s that already have compression enabled. To set +`timescaledb.compress` and other configuration parameters for {HYPERTABLE}s, use the [`ALTER +TABLE`][compression-alter-table] command. To enable compression on {CAGG}s, use the [`ALTER MATERIALIZED +VIEW`][compression-continuous-aggregate] command. To view the policies that you set or the policies that already exist, +see [informational views][informational-views]. + +## Samples + +Add a policy to compress {CHUNK}s older than 60 days on the `cpu` {HYPERTABLE}. + +```sql +SELECT add_compression_policy('cpu', compress_after => INTERVAL '60d'); +``` + +Add a policy to compress {CHUNK}s created 3 months before on the 'cpu' {HYPERTABLE}. + +```sql +SELECT add_compression_policy('cpu', compress_created_before => INTERVAL '3 months'); +``` + +Note above that when `compress_after` is used then the time data range present in the partitioning time column is used +to select the target {CHUNK}s. Whereas, when `compress_created_before` is used then the {CHUNK}s which were created 3 +months ago are selected. + +Add a compress {CHUNK}s policy to a {HYPERTABLE} with an integer-based time column: + +```sql +SELECT add_compression_policy('table_with_bigint_time', BIGINT '600000'); +``` + +Add a policy to compress {CHUNK}s of a {CAGG} called `cpu_weekly`, that are older than eight weeks: + +```sql +SELECT add_compression_policy('cpu_weekly', INTERVAL '8 weeks'); +``` + +## Arguments + +The syntax is: + +```sql +SELECT add_compression_policy( + hypertable = '', + compress_after = , + if_not_exists = true | false, + schedule_interval = , + initial_start = , + timezone = '', + compress_created_before = +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `hypertable` | REGCLASS | - | ✔ | Name of the {HYPERTABLE} or {CAGG} | +| `compress_after` | INTERVAL or INTEGER | - | ✔ | The age after which the policy job compresses {CHUNK}s. 
`compress_after` is calculated relative to the current time, so {CHUNK}s containing data older than `now - compress_after::interval` are compressed. This argument is mutually exclusive with `compress_created_before`. | +| `compress_created_before` | INTERVAL | NULL | - | {CHUNK}s with creation time older than this cut-off point are compressed. The cut-off point is computed as `now() - compress_created_before`. Defaults to `NULL`. Not supported for {CAGG}s yet. This argument is mutually exclusive with `compress_after`. | +| `schedule_interval` | INTERVAL | 12 hours for {HYPERTABLE}s with `chunk_interval` >= 1 day, `chunk_interval / 2` for others | - | The interval between the finish time of the last execution and the next start. | +| `initial_start` | TIMESTAMPTZ | NULL | - | Time the policy is first run. Defaults to NULL. If omitted, then the schedule interval is the interval from the finish time of the last execution to the next start. If provided, it serves as the origin with respect to which the next_start is calculated | +| `timezone` | TEXT | NULL | - | A valid time zone. If `initial_start` is also specified, subsequent executions of the compression policy are aligned on its initial start. However, daylight savings time (DST) changes may shift this alignment. Set to a valid time zone if this is an issue you want to mitigate. If omitted, UTC bucketing is performed. | +| `if_not_exists` | BOOLEAN | false | - | Setting to `true` causes the command to fail with a warning instead of an error if a compression policy already exists on the {HYPERTABLE}. | + +The `compress_after` parameter should be specified differently depending on the type of the time column of the +{HYPERTABLE} or {CAGG}: + +- For {HYPERTABLE}s with TIMESTAMP, TIMESTAMPTZ, and DATE time columns: the time interval should be an INTERVAL type. +- For {HYPERTABLE}s with integer-based timestamps: the time interval should be an integer type (this requires the + [integer_now_func][set-integer-now-func] to be set). + +[add-columnstore-policy]: /api-reference/timescaledb/hypercore/add_columnstore_policy +[compression-alter-table]: /api-reference/timescaledb/compression/alter_table_compression +[compression-continuous-aggregate]: /api-reference/timescaledb/continuous-aggregates/alter_materialized_view +[set-integer-now-func]: /api-reference/timescaledb/hypertables/set_integer_now_func +[informational-views]: /api-reference/timescaledb/informational-views/jobs diff --git a/api-reference/timescaledb/compression/alter_table_compression.mdx b/api-reference/timescaledb/compression/alter_table_compression.mdx new file mode 100644 index 0000000..2c305b4 --- /dev/null +++ b/api-reference/timescaledb/compression/alter_table_compression.mdx @@ -0,0 +1,73 @@ +--- +title: ALTER TABLE (Compression) +description: Change compression settings on a compressed hypertable +topics: [compression] +keywords: [compression] +tags: [settings, hypertables, alter, change] +license: community +type: command +products: [cloud, mst, self_hosted] +--- + +import { HYPERTABLE, CHUNK, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [ALTER TABLE (Hypercore)][alter-table-hypercore]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +Use 'ALTER TABLE' to turn on compression and set compression options. + +By itself, this `ALTER` statement alone does not compress a {HYPERTABLE}. 
To do so, either create a compression policy +using the [add_compression_policy][add-compression-policy] function or manually compress a specific {HYPERTABLE} {CHUNK} +using the [compress_chunk][compress-chunk] function. + +The syntax is: + +```sql +ALTER TABLE SET (timescaledb.compress, + timescaledb.compress_orderby = ' [ASC | DESC] [ NULLS { FIRST | LAST } ] [, ...]', + timescaledb.compress_segmentby = ' [, ...]', + timescaledb.compress_chunk_time_interval='interval' +); +``` + +## Samples + +Configure a {HYPERTABLE} that ingests device data to use compression. Here, if the {HYPERTABLE} is often queried about a +specific device or set of devices, the compression should be segmented using the `device_id` for greater performance. + +```sql +ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_orderby = 'time DESC', timescaledb.compress_segmentby = 'device_id'); +``` + +You can also specify compressed {CHUNK} interval without changing other compression settings: + +```sql +ALTER TABLE metrics SET (timescaledb.compress_chunk_time_interval = '24 hours'); +``` + +To disable the previously set option, set the interval to 0: + +```sql +ALTER TABLE metrics SET (timescaledb.compress_chunk_time_interval = '0'); +``` + +## Arguments + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `timescaledb.compress` | BOOLEAN | - | ✔ | Enable or disable compression | +| `timescaledb.compress_orderby` | TEXT | Descending order of the {HYPERTABLE}'s time column | - | Order used by compression, specified in the same way as the ORDER BY clause in a SELECT query. | +| `timescaledb.compress_segmentby` | TEXT | No segment by columns | - | Column list on which to key the compressed segments. An identifier representing the source of the data such as `device_id` or `tags_id` is usually a good candidate. | +| `timescaledb.compress_chunk_time_interval` | TEXT | - | - | EXPERIMENTAL: Set compressed {CHUNK} time interval used to roll {CHUNK}s into. This parameter compresses every {CHUNK}, and then irreversibly merges it into a previous adjacent {CHUNK} if possible, to reduce the total number of {CHUNK}s in the {HYPERTABLE}. Note that {CHUNK}s will not be split up during decompression. It should be set to a multiple of the current {CHUNK} interval. This option can be changed independently of other compression settings and does not require the `timescaledb.compress` argument. 
| + +## Parameters + +| Name | Type | Description | +|-|-|-| +| `table_name` | TEXT | {HYPERTABLE} that supports compression | +| `column_name` | TEXT | Column used to order by or segment by | +| `interval` | TEXT | Time interval used to roll compressed {CHUNK}s into | + +[alter-table-hypercore]: /api-reference/timescaledb/hypercore/alter_table +[add-compression-policy]: /api-reference/timescaledb/compression/add_compression_policy +[compress-chunk]: /api-reference/timescaledb/compression/compress_chunk diff --git a/api-reference/timescaledb/compression/chunk_compression_stats.mdx b/api-reference/timescaledb/compression/chunk_compression_stats.mdx new file mode 100644 index 0000000..24a6f6d --- /dev/null +++ b/api-reference/timescaledb/compression/chunk_compression_stats.mdx @@ -0,0 +1,99 @@ +--- +title: chunk_compression_stats() +description: Get compression-related statistics for chunks +topics: [compression] +keywords: [compression, statistics, chunks, information] +tags: [disk space, schemas, size] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { CHUNK, HYPERTABLE, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [chunk_columnstore_stats()][chunk-columnstore-stats]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +Get {CHUNK}-specific statistics related to {HYPERTABLE} compression. All sizes are in bytes. + +This function shows the compressed size of {CHUNK}s, computed when the `compress_chunk` is manually executed, or when a +compression policy processes the {CHUNK}. An insert into a compressed {CHUNK} does not update the compressed sizes. For +more information about how to compute {CHUNK} sizes, see the `chunks_detailed_size` section. + +## Samples + +```sql +SELECT * FROM chunk_compression_stats('conditions') + ORDER BY chunk_name LIMIT 2; + +-[ RECORD 1 ]------------------+---------------------- +chunk_schema | _timescaledb_internal +chunk_name | _hyper_1_1_chunk +compression_status | Uncompressed +before_compression_table_bytes | +before_compression_index_bytes | +before_compression_toast_bytes | +before_compression_total_bytes | +after_compression_table_bytes | +after_compression_index_bytes | +after_compression_toast_bytes | +after_compression_total_bytes | +node_name | +-[ RECORD 2 ]------------------+---------------------- +chunk_schema | _timescaledb_internal +chunk_name | _hyper_1_2_chunk +compression_status | Compressed +before_compression_table_bytes | 8192 +before_compression_index_bytes | 32768 +before_compression_toast_bytes | 0 +before_compression_total_bytes | 40960 +after_compression_table_bytes | 8192 +after_compression_index_bytes | 32768 +after_compression_toast_bytes | 8192 +after_compression_total_bytes | 49152 +node_name | +``` + +Use `pg_size_pretty` get the output in a more human friendly format. 
+ +```sql +SELECT pg_size_pretty(after_compression_total_bytes) AS total + FROM chunk_compression_stats('conditions') + WHERE compression_status = 'Compressed'; + +-[ RECORD 1 ]--+------ +total | 48 kB +``` + +## Arguments + +The syntax is: + +```sql +SELECT * FROM chunk_compression_stats( + hypertable = '' +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `hypertable` | REGCLASS | - | ✔ | Name of the {HYPERTABLE} | + +## Returns + +| Column | Type | Description | +|-|-|-| +| `chunk_schema` | TEXT | Schema name of the {CHUNK} | +| `chunk_name` | TEXT | Name of the {CHUNK} | +| `compression_status` | TEXT | the current compression status of the {CHUNK} | +| `before_compression_table_bytes` | BIGINT | Size of the heap before compression (NULL if currently uncompressed) | +| `before_compression_index_bytes` | BIGINT | Size of all the indexes before compression (NULL if currently uncompressed) | +| `before_compression_toast_bytes` | BIGINT | Size the TOAST table before compression (NULL if currently uncompressed) | +| `before_compression_total_bytes` | BIGINT | Size of the entire {CHUNK} table (table+indexes+toast) before compression (NULL if currently uncompressed) | +| `after_compression_table_bytes` | BIGINT | Size of the heap after compression (NULL if currently uncompressed) | +| `after_compression_index_bytes` | BIGINT | Size of all the indexes after compression (NULL if currently uncompressed) | +| `after_compression_toast_bytes` | BIGINT | Size the TOAST table after compression (NULL if currently uncompressed) | +| `after_compression_total_bytes` | BIGINT | Size of the entire {CHUNK} table (table+indexes+toast) after compression (NULL if currently uncompressed) | +| `node_name` | TEXT | nodes on which the {CHUNK} is located, applicable only to distributed {HYPERTABLE}s | + +[chunk-columnstore-stats]: /api-reference/timescaledb/hypercore/chunk_columnstore_stats diff --git a/api-reference/timescaledb/compression/compress_chunk.mdx b/api-reference/timescaledb/compression/compress_chunk.mdx new file mode 100644 index 0000000..f9cd0f6 --- /dev/null +++ b/api-reference/timescaledb/compression/compress_chunk.mdx @@ -0,0 +1,62 @@ +--- +title: compress_chunk() +description: Manually compress a given chunk +topics: [compression] +keywords: [compression] +tags: [chunks] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { CHUNK, HYPERTABLE, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [convert_to_columnstore()][convert-to-columnstore]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +The `compress_chunk` function is used for synchronous compression (or recompression, if necessary) of a specific +{CHUNK}. This is most often used instead of the [`add_compression_policy`][add-compression-policy] function, when a user +wants more control over the scheduling of compression. For most users, we suggest using the policy framework instead. + +You can also compress {CHUNK}s by [running the job associated with your compression policy][run-job]. `compress_chunk` +gives you more fine-grained control by allowing you to target a specific {CHUNK} that needs compressing. + + +You can get a list of {CHUNK}s belonging to a {HYPERTABLE} using the [`show_chunks` function][show-chunks]. + + +## Samples + +Compress a single {CHUNK}. 
+ +```sql +SELECT compress_chunk('_timescaledb_internal._hyper_1_2_chunk'); +``` + +## Arguments + +The syntax is: + +```sql +SELECT compress_chunk( + uncompressed_chunk = '', + if_not_compressed = true | false, + recompress = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `chunk_name` | REGCLASS | - | ✔ | Name of the {CHUNK} to be compressed | +| `if_not_compressed` | BOOLEAN | true | - | Disabling this will make the function error out on {CHUNK}s that are already compressed. | + +## Returns + +| Column | Type | Description | +|-|-|-| +| `compress_chunk` | REGCLASS | Name of the {CHUNK} that was compressed | + +[convert-to-columnstore]: /api-reference/timescaledb/hypercore/convert_to_columnstore +[add-compression-policy]: /api-reference/timescaledb/compression/add_compression_policy +[run-job]: /api-reference/timescaledb/jobs-automation/run_job +[show-chunks]: /api-reference/timescaledb/hypertables/show_chunks diff --git a/api-reference/timescaledb/compression/decompress_chunk.mdx b/api-reference/timescaledb/compression/decompress_chunk.mdx new file mode 100644 index 0000000..6468462 --- /dev/null +++ b/api-reference/timescaledb/compression/decompress_chunk.mdx @@ -0,0 +1,57 @@ +--- +title: decompress_chunk() +description: Decompress a compressed chunk +topics: [compression] +keywords: [compression, decompression, chunks, backfilling] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { CHUNK, HYPERTABLE, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [convert_to_rowstore()][convert-to-rowstore]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + + +Before decompressing {CHUNK}s, stop any compression policy on the {HYPERTABLE} you are decompressing. You can use +`SELECT alter_job(JOB_ID, scheduled => false);` to prevent scheduled execution. + + +## Samples + +Decompress a single {CHUNK}: + +```sql +SELECT decompress_chunk('_timescaledb_internal._hyper_2_2_chunk'); +``` + +Decompress all compressed {CHUNK}s in a {HYPERTABLE} named `metrics`: + +```sql +SELECT decompress_chunk(c, true) FROM show_chunks('metrics') c; +``` + +## Arguments + +The syntax is: + +```sql +SELECT decompress_chunk( + uncompressed_chunk = '', + if_compressed = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `chunk_name` | REGCLASS | - | ✔ | Name of the {CHUNK} to be decompressed. | +| `if_compressed` | BOOLEAN | true | - | Disabling this will make the function error out on {CHUNK}s that are not compressed. | + +## Returns + +| Column | Type | Description | +|-|-|-| +| `decompress_chunk` | REGCLASS | Name of the {CHUNK} that was decompressed. 
| + +[convert-to-rowstore]: /api-reference/timescaledb/hypercore/convert_to_rowstore diff --git a/api-reference/timescaledb/compression/hypertable_compression_stats.mdx b/api-reference/timescaledb/compression/hypertable_compression_stats.mdx new file mode 100644 index 0000000..2c29dd9 --- /dev/null +++ b/api-reference/timescaledb/compression/hypertable_compression_stats.mdx @@ -0,0 +1,89 @@ +--- +title: hypertable_compression_stats() +description: Get hypertable statistics related to compression +topics: [compression] +keywords: [compression, hypertables, information] +tags: [statistics, size] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { HYPERTABLE, CHUNK, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [hypertable_columnstore_stats()][hypertable-columnstore-stats]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +Get statistics related to {HYPERTABLE} compression. All sizes are in bytes. + +For more information about using {HYPERTABLE}s, including {CHUNK} size partitioning, see the [hypertable +section][hypertable-docs]. + +For more information about compression, see the [compression section][compression-docs]. + +## Samples + +```sql +SELECT * FROM hypertable_compression_stats('conditions'); + +-[ RECORD 1 ]------------------+------ +total_chunks | 4 +number_compressed_chunks | 1 +before_compression_table_bytes | 8192 +before_compression_index_bytes | 32768 +before_compression_toast_bytes | 0 +before_compression_total_bytes | 40960 +after_compression_table_bytes | 8192 +after_compression_index_bytes | 32768 +after_compression_toast_bytes | 8192 +after_compression_total_bytes | 49152 +node_name | +``` + +Use `pg_size_pretty` get the output in a more human friendly format. + +```sql +SELECT pg_size_pretty(after_compression_total_bytes) as total + FROM hypertable_compression_stats('conditions'); + +-[ RECORD 1 ]--+------ +total | 48 kB +``` + +## Arguments + +The syntax is: + +```sql +SELECT * FROM hypertable_compression_stats( + hypertable = '' +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `hypertable` | REGCLASS | - | ✔ | {HYPERTABLE} to show statistics for | + +## Returns + +| Column | Type | Description | +|-|-|-| +| `total_chunks` | BIGINT | The number of {CHUNK}s used by the {HYPERTABLE} | +| `number_compressed_chunks` | BIGINT | The number of {CHUNK}s used by the {HYPERTABLE} that are currently compressed | +| `before_compression_table_bytes` | BIGINT | Size of the heap before compression | +| `before_compression_index_bytes` | BIGINT | Size of all the indexes before compression | +| `before_compression_toast_bytes` | BIGINT | Size the TOAST table before compression | +| `before_compression_total_bytes` | BIGINT | Size of the entire table (table+indexes+toast) before compression | +| `after_compression_table_bytes` | BIGINT | Size of the heap after compression | +| `after_compression_index_bytes` | BIGINT | Size of all the indexes after compression | +| `after_compression_toast_bytes` | BIGINT | Size the TOAST table after compression | +| `after_compression_total_bytes` | BIGINT | Size of the entire table (table+indexes+toast) after compression | +| `node_name` | TEXT | nodes on which the {HYPERTABLE} is located, applicable only to distributed {HYPERTABLE}s | + + +Returns show `NULL` if the data is currently uncompressed. 
+ + +[hypertable-columnstore-stats]: /api-reference/timescaledb/hypercore/hypertable_columnstore_stats +[hypertable-docs]: /use-timescale/hypertables +[compression-docs]: /use-timescale/compression diff --git a/api-reference/timescaledb/compression/index.mdx b/api-reference/timescaledb/compression/index.mdx new file mode 100644 index 0000000..8787694 --- /dev/null +++ b/api-reference/timescaledb/compression/index.mdx @@ -0,0 +1,89 @@ +--- +title: Compression overview (Old API, replaced by hypercore) +sidebarTitle: Overview +description: TimescaleDB API reference for compressing your data. Includes SQL functions for compressing and decompressing chunks, managing compression policies, and getting compression stats +keywords: [compression] +tags: [hypertables] +products: [cloud, mst, self_hosted] +--- + +import { HYPERTABLE, CHUNK, HYPERCORE, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [Hypercore][hypercore]. + +Compression functionality is included in {HYPERCORE}. + +Before you set up compression, you need to [configure the {HYPERTABLE} for compression][configure-compression] and then +[set up a compression policy][add-compression-policy]. + + +Before you set up compression for the first time, read the compression [blog post][compression-blog] and +[documentation][compression-docs]. + + +You can also [compress {CHUNK}s manually][compress-chunk], instead of using an automated compression policy to compress +{CHUNK}s as they age. + +Compressed {CHUNK}s have the following limitations: + +- `ROW LEVEL SECURITY` is not supported on compressed {CHUNK}s. +- Creation of unique constraints on compressed {CHUNK}s is not supported. You can add them by disabling compression on + the {HYPERTABLE} and re-enabling after constraint creation. + +## Restrictions + +In general, compressing a {HYPERTABLE} imposes some limitations on the types of data modifications that you can perform +on data inside a compressed {CHUNK}. + +This table shows changes to the compression feature, added in different versions of {TIMESCALE_DB}: + +| {TIMESCALE_DB} version | Supported data modifications on compressed {CHUNK}s | +|-|-| +| 1.5 - 2.0 | Data and schema modifications are not supported. | +| 2.1 - 2.2 | Schema may be modified on compressed {HYPERTABLE}s. Data modification not supported. | +| 2.3 | Schema modifications and basic insert of new data is allowed. Deleting, updating and some advanced insert statements are not supported. | +| 2.11 | Deleting, updating and advanced insert statements are supported. | + +In {TIMESCALE_DB} 2.1 and later, you can modify the schema of {HYPERTABLE}s that have compressed {CHUNK}s. Specifically, +you can add columns to and rename existing columns of compressed {HYPERTABLE}s. + +In {TIMESCALE_DB} v2.3 and later, you can insert data into compressed {CHUNK}s and to enable compression policies on +distributed {HYPERTABLE}s. + +In {TIMESCALE_DB} v2.11 and later, you can update and delete compressed data. You can also use advanced insert +statements like `ON CONFLICT` and `RETURNING`. 
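For example, on {TIMESCALE_DB} v2.11 and later an upsert works against data in a compressed {CHUNK}. The following is a minimal sketch: the `metrics` {HYPERTABLE}, its columns, and the unique index on `(time, device_id)` created before compression are illustrative assumptions, not objects defined elsewhere on this page.

```sql
-- Upsert a reading whose target row may already live in a compressed chunk (v2.11+)
INSERT INTO metrics (time, device_id, value)
VALUES ('2024-01-01 00:00:00+00', 42, 21.5)
ON CONFLICT (time, device_id)
DO UPDATE SET value = EXCLUDED.value
RETURNING time, device_id, value;
```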
+ +## Available functions + +### Configuration + +- [`ALTER TABLE (Compression)`][alter-table-compression]: change compression settings on a compressed {HYPERTABLE} + +### Policies + +- [`add_compression_policy()`][add-compression-policy]: add policy to schedule automatic compression of {CHUNK}s +- [`remove_compression_policy()`][remove-compression-policy]: remove a compression policy from a {HYPERTABLE} + +### Manual compression + +- [`compress_chunk()`][compress-chunk]: manually compress a given {CHUNK} +- [`decompress_chunk()`][decompress-chunk]: decompress a compressed {CHUNK} +- [`recompress_chunk()`][recompress-chunk]: recompress a {CHUNK} that had new data inserted after compression + +### Statistics + +- [`chunk_compression_stats()`][chunk-compression-stats]: get compression-related statistics for {CHUNK}s +- [`hypertable_compression_stats()`][hypertable-compression-stats]: get {HYPERTABLE} statistics related to compression + +[hypercore]: /api-reference/timescaledb/hypercore +[configure-compression]: /api-reference/timescaledb/compression/alter_table_compression +[add-compression-policy]: /api-reference/timescaledb/compression/add_compression_policy +[compress-chunk]: /api-reference/timescaledb/compression/compress_chunk +[compression-blog]: https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database +[compression-docs]: /use-timescale/compression +[alter-table-compression]: /api-reference/timescaledb/compression/alter_table_compression +[remove-compression-policy]: /api-reference/timescaledb/compression/remove_compression_policy +[decompress-chunk]: /api-reference/timescaledb/compression/decompress_chunk +[recompress-chunk]: /api-reference/timescaledb/compression/recompress_chunk +[chunk-compression-stats]: /api-reference/timescaledb/compression/chunk_compression_stats +[hypertable-compression-stats]: /api-reference/timescaledb/compression/hypertable_compression_stats diff --git a/api-reference/timescaledb/compression/recompress_chunk.mdx b/api-reference/timescaledb/compression/recompress_chunk.mdx new file mode 100644 index 0000000..aacd14b --- /dev/null +++ b/api-reference/timescaledb/compression/recompress_chunk.mdx @@ -0,0 +1,98 @@ +--- +title: recompress_chunk() +description: Recompress a chunk that had new data inserted after compression +topics: [compression] +keywords: [compression, recompression, chunks] +tags: [hypertables] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { CHUNK, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [convert_to_columnstore()][convert-to-columnstore]. +However, compression APIs are still supported, you do not need to migrate to the hypercore APIs. + +Recompresses a compressed {CHUNK} that had more data inserted after compression. + +```sql +recompress_chunk( + chunk REGCLASS, + if_not_compressed BOOLEAN = false +) +``` + +You can also recompress {CHUNK}s by [running the job associated with your compression policy][run-job]. +`recompress_chunk` gives you more fine-grained control by allowing you to target a specific {CHUNK}. + + +`recompress_chunk` is deprecated since {TIMESCALE_DB} v2.14 and will be removed in the future. The procedure is now a +wrapper which calls [`compress_chunk`][compress-chunk] instead of it. + + + +`recompress_chunk` is implemented as an SQL procedure and not a function. Call the procedure with `CALL`. Don't use a +`SELECT` statement. 
+ + + +`recompress_chunk` only works on {CHUNK}s that have previously been compressed. To compress a {CHUNK} for the first +time, use [`compress_chunk`][compress-chunk]. + + +## Samples + +Recompress the {CHUNK} `_timescaledb_internal._hyper_1_2_chunk`: + +```sql +CALL recompress_chunk('_timescaledb_internal._hyper_1_2_chunk'); +``` + +## Arguments + +The syntax is: + +```sql +CALL recompress_chunk( + chunk = '', + if_not_compressed = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `chunk` | REGCLASS | - | ✔ | The {CHUNK} to be recompressed. Must include the schema, for example `_timescaledb_internal`, if it is not in the search path. | +| `if_not_compressed` | BOOLEAN | false | - | If `true`, prints a notice instead of erroring if the {CHUNK} is already compressed. | + +## Troubleshoot + +In {TIMESCALE_DB} 2.6.0 and above, `recompress_chunk` is implemented as a procedure. Previously, it was implemented as a +function. If you are upgrading to {TIMESCALE_DB} 2.6.0 or above, the `recompress_chunk` function could cause an error. For +example, trying to run `SELECT recompress_chunk(i.show_chunks, true) FROM...` gives the following error: + +```sql +ERROR: recompress_chunk(regclass, boolean) is a procedure +``` + +To fix the error, use `CALL` instead of `SELECT`. You might also need to write a procedure to replace the full +functionality in your `SELECT` statement. For example: + +```sql +DO $$ +DECLARE chunk regclass; +BEGIN + FOR chunk IN SELECT format('%I.%I', chunk_schema, chunk_name)::regclass + FROM timescaledb_information.chunks + WHERE is_compressed = true + LOOP + RAISE NOTICE 'Recompressing %', chunk::text; + CALL recompress_chunk(chunk, true); + END LOOP; +END +$$; +``` + +[convert-to-columnstore]: /api-reference/timescaledb/hypercore/convert_to_columnstore +[run-job]: /api-reference/timescaledb/jobs-automation/run_job +[compress-chunk]: /api-reference/timescaledb/compression/compress_chunk diff --git a/api-reference/timescaledb/compression/remove_compression_policy.mdx b/api-reference/timescaledb/compression/remove_compression_policy.mdx new file mode 100644 index 0000000..0d23d3d --- /dev/null +++ b/api-reference/timescaledb/compression/remove_compression_policy.mdx @@ -0,0 +1,51 @@ +--- +title: remove_compression_policy() +description: Remove a compression policy from a hypertable +topics: [compression, jobs] +keywords: [compression, policies, remove] +tags: [delete, drop] +license: community +type: function +products: [cloud, mst, self_hosted] +--- + +import { HYPERTABLE, CAGG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Old API since [{TIMESCALE_DB} v2.18.0](https://github.com/timescale/timescaledb/releases/tag/2.18.0). Superseded by [remove_columnstore_policy()][remove-columnstore-policy]. +However, compression APIs are still supported; you do not need to migrate to the hypercore APIs. + +Remove an existing compression policy from a {HYPERTABLE} or {CAGG}. To restart policy-based compression, add the policy again. To +view the policies that already exist, see [informational views][informational-views].
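For example, to check which compression policies are currently scheduled before removing one, you can query the jobs view. This is a sketch that assumes the standard `timescaledb_information.jobs` view, where compression policies are registered under the `policy_compression` proc name:

```sql
-- List scheduled compression policies and the hypertables they apply to
SELECT job_id, hypertable_name, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression';
```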
+ +## Samples + +Remove the compression policy from the 'cpu' table: + +```sql +SELECT remove_compression_policy('cpu'); +``` + +Remove the compression policy from the 'cpu_weekly' {CAGG}: + +```sql +SELECT remove_compression_policy('cpu_weekly'); +``` + +## Arguments + +The syntax is: + +```sql +SELECT remove_compression_policy( + hypertable = '', + if_exists = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `hypertable` | REGCLASS | - | ✔ | Name of the {HYPERTABLE} or {CAGG} the policy should be removed from | +| `if_exists` | BOOLEAN | false | - | Setting to true causes the command to fail with a notice instead of an error if a compression policy does not exist on the {HYPERTABLE}. | + +[remove-columnstore-policy]: /api-reference/timescaledb/hypercore/remove_columnstore_policy +[informational-views]: /api-reference/timescaledb/informational-views/jobs diff --git a/api-reference/timescaledb/configuration/gucs.mdx b/api-reference/timescaledb/configuration/gucs.mdx new file mode 100644 index 0000000..434c8c0 --- /dev/null +++ b/api-reference/timescaledb/configuration/gucs.mdx @@ -0,0 +1,19 @@ +--- +title: Grand Unified Configuration (GUC) parameters +description: Optimize the behavior of TimescaleDB using Grand Unified Configuration (GUC) parameters +keywords: [GUC, Configuration] +--- + +import TsdbGucsList from '/snippets/api-reference/timescaledb/configuration/_timescaledb-gucs.mdx'; +import { SERVICE_LONG } from '/snippets/vars.mdx'; + +You use the following Grand Unified Configuration (GUC) parameters to optimize the behavior of your {SERVICE_LONG}. + +The namespace of each GUC is `timescaledb`. +To set a GUC you specify `.`. For example: + +```sql +SET timescaledb.enable_tiered_reads = true; +``` + + diff --git a/api-reference/timescaledb/configuration/index.mdx b/api-reference/timescaledb/configuration/index.mdx new file mode 100644 index 0000000..e7851d2 --- /dev/null +++ b/api-reference/timescaledb/configuration/index.mdx @@ -0,0 +1,86 @@ +--- +title: Service configuration +description: Use the default PostgreSQL server configuration settings for your Tiger Cloud service, or customize them as needed +keywords: [configure] +products: [self_hosted, cloud] +sidebarTitle: Overview +--- + +import { SERVICE_LONG, SERVICE_SHORT, PG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +{TIMESCALE_DB} uses the default {PG} server configuration settings. You can optimize your {SERVICE_SHORT} configuration +using the following {TIMESCALE_DB} and Grand Unified Configuration (GUC) parameters. 
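For example, to see whether a particular parameter can be changed at runtime or requires a restart, inspect the `context` column in `pg_settings`. This is a sketch: the parameter name is only an example, and the output depends on your {TIMESCALE_DB} version and configuration.

```sql
-- 'context' indicates how the setting can be changed,
-- for example 'user', 'superuser', or 'postmaster' (restart required)
SELECT name, setting, context
FROM pg_settings
WHERE name = 'timescaledb.max_background_workers';
```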
+ +## Samples + +### View current TimescaleDB settings + +Check all TimescaleDB-specific configuration settings: + +```sql +SELECT name, setting, unit, short_desc +FROM pg_settings +WHERE name LIKE 'timescaledb%' +ORDER BY name; +``` + +### Enable chunkwise aggregation + +Enable query optimization for aggregations: + +```sql +ALTER DATABASE your_database SET timescaledb.enable_chunkwise_aggregation = 'on'; +``` + +Or set it for your current session: + +```sql +SET timescaledb.enable_chunkwise_aggregation = on; +``` + +### Enable vectorized aggregation + +Enable vectorized optimizations for compressed chunks: + +```sql +ALTER DATABASE your_database SET timescaledb.vectorized_aggregation = 'on'; +``` + +### Configure continuous aggregate refresh optimization + +Enable merge optimization for continuous aggregate refreshes: + +```sql +SET timescaledb.enable_merge_on_cagg_refresh = on; +``` + +### Disable telemetry + +Turn off telemetry reporting: + +```sql +ALTER SYSTEM SET timescaledb.telemetry_level = 'off'; +SELECT pg_reload_conf(); +``` + +### Check TimescaleDB version and license + +View the current TimescaleDB version and license: + +```sql +SELECT extname, extversion +FROM pg_extension +WHERE extname = 'timescaledb'; + +SHOW timescaledb.license; +``` + +## Available configuration options + +- [TimescaleDB configuration and tuning][tigerpostgres-config]: configure the TimescaleDB settings related to policies, + query planning and execution, distributed hypertables, and administration +- [Grand Unified Configuration (GUC) parameters][gucs]: optimize the behavior of TimescaleDB using Grand Unified + Configuration (GUC) parameters + +[tigerpostgres-config]: /api-reference/timescaledb/configuration/tiger-postgres +[gucs]: /api-reference/timescaledb/configuration/gucs diff --git a/api-reference/timescaledb/configuration/tiger-postgres.mdx b/api-reference/timescaledb/configuration/tiger-postgres.mdx new file mode 100644 index 0000000..a302cf7 --- /dev/null +++ b/api-reference/timescaledb/configuration/tiger-postgres.mdx @@ -0,0 +1,12 @@ +--- +title: TimescaleDB configuration and tuning +description: Configure the TimescaleDB settings related to policies, query planning and execution, distributed hypertables, and administration +products: [cloud] +keywords: [configuration, settings] +tags: [tune] +--- + +import TimescaleDBConfig from '/snippets/api-reference/timescaledb/configuration/_timescaledb-config.mdx'; +import { TIMESCALE_DB } from '/snippets/vars.mdx'; + + diff --git a/api-reference/timescaledb/continuous-aggregates/add_continuous_aggregate_policy.mdx b/api-reference/timescaledb/continuous-aggregates/add_continuous_aggregate_policy.mdx new file mode 100644 index 0000000..e498407 --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/add_continuous_aggregate_policy.mdx @@ -0,0 +1,96 @@ +--- +title: add_continuous_aggregate_policy() +description: Add policy to schedule automatic refresh of a continuous aggregate +topics: [continuous aggregates, jobs] +keywords: [continuous aggregates, policies] +tags: [scheduled jobs, refresh] +license: community +type: function +products: [cloud, self_hosted, mst] +--- + +import { CAGG, HYPERTABLE, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Create a policy that automatically refreshes a {CAGG}. To view the +policies that you set or the policies that already exist, see +[informational views][informational-views]. + +## Samples + +Add a policy that refreshes the last month once an hour, excluding the latest +hour from the aggregate. 
For performance reasons, we recommend that you +exclude buckets that see lots of writes: + +```sql +SELECT add_continuous_aggregate_policy('conditions_summary', + start_offset => INTERVAL '1 month', + end_offset => INTERVAL '1 hour', + schedule_interval => INTERVAL '1 hour'); +``` + +## Arguments + +The syntax is: + +```sql +SELECT add_continuous_aggregate_policy( + continuous_aggregate = '', + start_offset = , + end_offset = , + schedule_interval = , + if_not_exists = true | false, + initial_start = , + timezone = '', + include_tiered_data = true | false, + buckets_per_batch = , + max_batches_per_execution = , + refresh_newest_first = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `continuous_aggregate` | REGCLASS | - | ✔ | The {CAGG} to add the policy for | +| `start_offset` | INTERVAL or integer | - | ✔ | Start of the refresh window as an interval relative to the time when the policy is executed. `NULL` is equivalent to `MIN(timestamp)` of the {HYPERTABLE}. | +| `end_offset` | INTERVAL or integer | - | ✔ | End of the refresh window as an interval relative to the time when the policy is executed. `NULL` is equivalent to `MAX(timestamp)` of the {HYPERTABLE}. | +| `schedule_interval` | INTERVAL | 24 hours | ✔ | Interval between refresh executions in wall-clock time. | +| `initial_start` | TIMESTAMPTZ | NULL | - | Time the policy is first run. Defaults to `NULL`. If omitted, the schedule interval is the interval between the finish time of the last execution and the next start. If provided, it serves as the origin with respect to which `next_start` is calculated. | +| `if_not_exists` | BOOLEAN | false | - | Set to `true` to issue a notice instead of an error if the job already exists. | +| `timezone` | TEXT | NULL | - | A valid time zone. If you specify `initial_start`, subsequent executions of the refresh policy are aligned on `initial_start`. However, daylight saving time (DST) changes may shift this alignment. If this is an issue you want to mitigate, set `timezone` to a valid time zone. The default is `NULL`, in which case [UTC bucketing](https://docs.tigerdata.com/use-timescale/latest/time-buckets/about-time-buckets/) is performed. | +| `include_tiered_data` | BOOLEAN | NULL | - | Enable or disable reading tiered data. This setting overrides the current setting of the `timescaledb.enable_tiered_reads` GUC. The default is `NULL`, which means the current setting of the `timescaledb.enable_tiered_reads` GUC is used. | +| `buckets_per_batch` | INTEGER | 1 | - | Number of buckets to be refreshed by a batch. This value is multiplied by the {CAGG} bucket width to determine the size of the batch range. The default value is `1`, which means single-batch execution. Values less than `0` are not allowed. | +| `max_batches_per_execution` | INTEGER | 0 | - | Limit the maximum number of batches to run when a policy executes. If some batches remain, they are processed the next time the policy runs. The default value is `0`, for an unlimited number of batches. Values less than `0` are not allowed. | +| `refresh_newest_first` | BOOLEAN | TRUE | - | Control the order of incremental refreshes. Set to `TRUE` to refresh from the newest data to the oldest. Set to `FALSE` for oldest to newest. The default is `TRUE`. | + +The `start_offset` should be greater than `end_offset`.
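As an illustration, the following sketch reuses the `conditions_summary` {CAGG} from the sample above with arbitrary values: the refresh window starts 3 days ago and ends 1 hour ago, so `start_offset` (3 days) is greater than `end_offset` (1 hour).

```sql
SELECT add_continuous_aggregate_policy('conditions_summary',
    start_offset => INTERVAL '3 days',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');
```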
+ +You must specify the `start_offset` and `end_offset` parameters differently, +depending on the type of the time column of the {HYPERTABLE}: + +* For {HYPERTABLE}s with `TIMESTAMP`, `TIMESTAMPTZ`, and `DATE` time columns, + set the offset as an `INTERVAL` type. +* For {HYPERTABLE}s with integer-based timestamps, set the offset as an + `INTEGER` type. + + +While setting `end_offset` to `NULL` is possible, it is not recommended. To include the data between `end_offset` and +the current time in queries, enable [real-time aggregation](/use-timescale/latest/continuous-aggregates/real-time-aggregates/). + + +You can add [concurrent refresh policies](/use-timescale/latest/continuous-aggregates/refresh-policies/) on each {CAGG}, as long as the `start_offset` and `end_offset` does not overlap with another policy on the same {CAGG}. + + +Setting `buckets_per_batch` greater than zero means that the refresh window is split in batches of `bucket width` * +`buckets per batch`. For example, a given {CAGG} with `bucket width` of `1 day` and `buckets_per_batch` of 10 has a +batch size of `10 days` to process the refresh. +Because each `batch` is an individual transaction, executing a policy in batches make the data visible for the users +before the entire job is executed. Batches are processed from the most recent data to the oldest. + + +## Returns + +| Column | Type | Description | +|-|-|-| +| `job_id` | INTEGER | {TIMESCALE_DB} background job ID created to implement this policy | + +[informational-views]: /api-reference/timescaledb/informational-views/jobs diff --git a/api-reference/timescaledb/continuous-aggregates/add_policies.mdx b/api-reference/timescaledb/continuous-aggregates/add_policies.mdx new file mode 100644 index 0000000..5ba1198 --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/add_policies.mdx @@ -0,0 +1,90 @@ +--- +title: add_policies() +description: Add refresh, compression, and data retention policies on a continuous aggregate +topics: [continuous aggregates, jobs, compression, data retention] +keywords: [continuous aggregates, policies, add, compress, data retention] +license: community +type: function +experimental: true +products: [cloud, self_hosted, mst] +--- + +import { CAGG, HYPERTABLE, CHUNK } from '/snippets/vars.mdx'; + + Early access + +Add refresh, compression, and data retention policies to a {CAGG} +in one step. The added compression and retention policies apply to the +{CAGG}, _not_ to the original {HYPERTABLE}. + +```sql +timescaledb_experimental.add_policies( + relation REGCLASS, + if_not_exists BOOL = false, + refresh_start_offset "any" = NULL, + refresh_end_offset "any" = NULL, + compress_after "any" = NULL, + drop_after "any" = NULL) +) RETURNS BOOL +``` + + +`add_policies()` does not allow the `schedule_interval` for the {CAGG} to be set, instead using a default value of 1 +hour. + +If you would like to set this add your policies manually (see +[`add_continuous_aggregate_policy`][add_continuous_aggregate_policy]). + + +## Samples + +Given a {CAGG} named `example_continuous_aggregate`, add three +policies to it: + +1. Regularly refresh the {CAGG} to materialize data between 1 day + and 2 days old. +1. Compress data in the {CAGG} after 20 days. +1. Drop data in the {CAGG} after 1 year. 
+ +```sql +SELECT timescaledb_experimental.add_policies( + 'example_continuous_aggregate', + refresh_start_offset => '1 day'::interval, + refresh_end_offset => '2 day'::interval, + compress_after => '20 days'::interval, + drop_after => '1 year'::interval +); +``` + +## Arguments + +The syntax is: + +```sql +CALL add_policies( + relation = '', + refresh_start_offset = , + refresh_end_offset = , + compress_after = , + refresh_schedule_interval = +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `relation` | `REGCLASS` | - | ✔ | The {CAGG} that the policies should be applied to | +| `if_not_exists` | `BOOL` | false | - | When true, prints a warning instead of erroring if the {CAGG} doesn't exist. | +| `refresh_start_offset` | `INTERVAL` or `INTEGER` | - | - | The start of the {CAGG} refresh window, expressed as an offset from the policy run time. | +| `refresh_end_offset` | `INTERVAL` or `INTEGER` | - | - | The end of the {CAGG} refresh window, expressed as an offset from the policy run time. Must be greater than `refresh_start_offset`. | +| `compress_after` | `INTERVAL` or `INTEGER` | - | - | {CAGG} {CHUNK}s are compressed if they exclusively contain data older than this interval. | +| `drop_after` | `INTERVAL` or `INTEGER` | - | - | {CAGG} {CHUNK}s are dropped if they exclusively contain data older than this interval. | + +For arguments that could be either an `INTERVAL` or an `INTEGER`, use an +`INTERVAL` if your time bucket is based on timestamps. Use an `INTEGER` if your +time bucket is based on integers. + +## Returns + +Returns `true` if successful. + +[add_continuous_aggregate_policy]: /api-reference/timescaledb/continuous-aggregates/add_continuous_aggregate_policy diff --git a/api-reference/timescaledb/continuous-aggregates/alter_materialized_view.mdx b/api-reference/timescaledb/continuous-aggregates/alter_materialized_view.mdx new file mode 100644 index 0000000..c6b8743 --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/alter_materialized_view.mdx @@ -0,0 +1,79 @@ +--- +title: ALTER MATERIALIZED VIEW (Continuous Aggregate) +description: Change an existing continuous aggregate +sidebarTitle: ALTER MATERIALIZED VIEW +topics: [continuous aggregates] +keywords: [continuous aggregates] +tags: [materialized views, hypertables, alter, change] +license: community +type: command +products: [cloud, self_hosted, mst] +--- + +import { CAGG, HYPERTABLE, COLUMNSTORE, CHUNK, TIMESCALE_DB, PG } from '/snippets/vars.mdx'; + +You use the `ALTER MATERIALIZED VIEW` statement to modify some of the `WITH` +clause [options][create_materialized_view] for a {CAGG} view. You can only set the `continuous` and +`create_group_indexes` options when you [create a {CAGG}][create_materialized_view]. 
`ALTER MATERIALIZED VIEW` also +supports the following +[{PG} clauses][postgres-alterview] on the {CAGG} view: + +* `RENAME TO`: rename the {CAGG} view +* `RENAME [COLUMN]`: rename the {CAGG} column +* `SET SCHEMA`: set the new schema for the {CAGG} view +* `SET TABLESPACE`: move the materialization of the {CAGG} view to the new tablespace +* `OWNER TO`: set a new owner for the {CAGG} view + +## Samples + +- Enable real-time aggregates for a {CAGG}: + + ```sql + ALTER MATERIALIZED VIEW contagg_view SET (timescaledb.materialized_only = false); + ``` + +- Enable hypercore for a {CAGG}: + + Since 2.18.0 + + ```sql + ALTER MATERIALIZED VIEW contagg_view SET ( + timescaledb.enable_columnstore = true, + timescaledb.segmentby = 'symbol' ); + ``` + +- Rename a column for a {CAGG}: + + ```sql + ALTER MATERIALIZED VIEW contagg_view RENAME COLUMN old_name TO new_name; + ``` + +## Arguments + +The syntax is: + +```sql +ALTER MATERIALIZED VIEW SET ( timescaledb. = [, ... ] ) +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `view_name` | TEXT | - | ✔ | The name of the {CAGG} view to be altered. | +| `timescaledb.materialized_only` | BOOLEAN | `true` | - | Enable real-time aggregation. | +| `timescaledb.enable_columnstore` | BOOLEAN | `true` | - | Enable {COLUMNSTORE}. Effectively the same as `timescaledb.compress`. Since 2.18.0 | +| `timescaledb.compress` | TEXT | Disabled | - | Enable compression. | +| `timescaledb.orderby` | TEXT | Descending order on the time column in `table_name`. | - | Set the order in which items are used in the {COLUMNSTORE}. Specified in the same way as an `ORDER BY` clause in a `SELECT` query. Since 2.18.0 | +| `timescaledb.compress_orderby` | TEXT | Descending order on the time column in `table_name`. | - | Set the order used by compression. Specified in the same way as the `ORDER BY` clause in a `SELECT` query. | +| `timescaledb.segmentby` | TEXT | No segementation by column. | - | Set the list of columns used to segment data in the {COLUMNSTORE} for `table`. An identifier representing the source of the data such as `device_id` or `tags_id` is usually a good candidate. Since 2.18.0 | +| `timescaledb.compress_segmentby` | TEXT | No segementation by column. | - | Set the list of columns used to segment the compressed data. An identifier representing the source of the data such as `device_id` or `tags_id` is usually a good candidate. | +| `column_name` | TEXT | - | - | Set the name of the column to order by or segment by. | +| `timescaledb.compress_chunk_time_interval` | TEXT | - | - | Reduce the total number of compressed/{COLUMNSTORE} {CHUNK}s for `table`. If you set `compress_chunk_time_interval`, compressed/{COLUMNSTORE} {CHUNK}s are merged with the previous adjacent {CHUNK} within `chunk_time_interval` whenever possible. These {CHUNK}s are irreversibly merged. If you call to [decompress][decompress]/[convert_to_rowstore][convert_to_rowstore], merged {CHUNK}s are not split up. You can call `compress_chunk_time_interval` independently of other compression settings; `timescaledb.compress`/`timescaledb.enable_columnstore` is not required. | +| `timescaledb.enable_cagg_window_functions` | BOOLEAN | `false` | - | EXPERIMENTAL: enable window functions on {CAGG}s. Support is experimental, as there is a risk of data inconsistency. For example, in backfill scenarios, buckets could be missed. | +| `timescaledb.chunk_interval` (formerly `timescaledb.chunk_time_interval`) | INTERVAL | 10x the original {HYPERTABLE}. | - | Set the {CHUNK} interval. 
Renamed in {TIMESCALE_DB} V2.20. | + +[create_materialized_view]: /api-reference/timescaledb/continuous-aggregates/create_materialized_view#arguments +[postgres-alterview]: https://www.postgresql.org/docs/current/sql-alterview.html +[create-cagg]: /use-timescale/latest/continuous-aggregates/create-a-continuous-aggregate/ +[default_table_access_method]: https://www.postgresql.org/docs/17/runtime-config-client.html#GUC-DEFAULT-TABLE-ACCESS-METHOD +[convert_to_rowstore]: /api-reference/timescaledb/hypercore/convert_to_rowstore +[decompress]: /api-reference/timescaledb/compression/decompress_chunk diff --git a/api-reference/timescaledb/continuous-aggregates/alter_policies.mdx b/api-reference/timescaledb/continuous-aggregates/alter_policies.mdx new file mode 100644 index 0000000..44ebc3d --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/alter_policies.mdx @@ -0,0 +1,74 @@ +--- +title: alter_policies() +description: Alter refresh, compression, or data retention policies on a continuous aggregate +topics: [continuous aggregates, jobs, compression, data retention] +keywords: [continuous aggregates, policies, alter, compress, data retention] +tags: [change] +license: community +type: function +experimental: true +products: [cloud, self_hosted, mst] +--- + +import { CAGG, HYPERTABLE, COLUMNSTORE, CHUNK } from '/snippets/vars.mdx'; + + Early access + +Alter refresh, {COLUMNSTORE}, or data retention policies on a {CAGG}. The altered {COLUMNSTORE} and retention policies +apply to the +{CAGG}, _not_ to the original {HYPERTABLE}. + +```sql +timescaledb_experimental.alter_policies( + relation REGCLASS, + if_exists BOOL = false, + refresh_start_offset "any" = NULL, + refresh_end_offset "any" = NULL, + compress_after "any" = NULL, + drop_after "any" = NULL +) RETURNS BOOL +``` + +## Samples + +Given a {CAGG} named `example_continuous_aggregate` with an +existing {COLUMNSTORE} policy, alter the {COLUMNSTORE} policy to compress data older +than 16 days: + +```sql +SELECT timescaledb_experimental.alter_policies( + 'continuous_agg_max_mat_date', + compress_after => '16 days'::interval +); +``` + +## Arguments + +The syntax is: + +```sql +CALL alter_policies( + relation = '', + refresh_start_offset = , + refresh_end_offset = , + compress_after = , + refresh_schedule_interval = +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `relation` | `REGCLASS` | - | ✔ | The {CAGG} that you want to alter policies for | +| `if_not_exists` | `BOOL` | false | - | When true, prints a warning instead of erroring if the policy doesn't exist. | +| `refresh_start_offset` | `INTERVAL` or `INTEGER` | - | - | The start of the {CAGG} refresh window, expressed as an offset from the policy run time. | +| `refresh_end_offset` | `INTERVAL` or `INTEGER` | - | - | The end of the {CAGG} refresh window, expressed as an offset from the policy run time. Must be greater than `refresh_start_offset`. | +| `compress_after` | `INTERVAL` or `INTEGER` | - | - | {CAGG} {CHUNK}s are compressed into the {COLUMNSTORE} if they exclusively contain data older than this interval. | +| `drop_after` | `INTERVAL` or `INTEGER` | - | - | {CAGG} {CHUNK}s are dropped if they exclusively contain data older than this interval. | + +For arguments that could be either an `INTERVAL` or an `INTEGER`, use an +`INTERVAL` if your time bucket is based on timestamps. Use an `INTEGER` if your +time bucket is based on integers. + +## Returns + +Returns true if successful. 
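For instance, in this sketch the {CAGG} name is illustrative; a call that changes both the refresh window and the retention policy returns `t` on success:

```sql
SELECT timescaledb_experimental.alter_policies(
    'example_continuous_aggregate',
    refresh_start_offset => '1 day'::interval,
    refresh_end_offset => '3 days'::interval,
    drop_after => '6 months'::interval
);
```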
diff --git a/api-reference/timescaledb/continuous-aggregates/cagg_migrate.mdx b/api-reference/timescaledb/continuous-aggregates/cagg_migrate.mdx new file mode 100644 index 0000000..0ff1b02 --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/cagg_migrate.mdx @@ -0,0 +1,61 @@ +--- +title: cagg_migrate() +description: Migrate a continuous aggregate from the old format to the new format introduced in TimescaleDB 2.7 +topics: [continuous aggregates] +keywords: [continuous aggregates] +tags: [migrate] +license: community +type: procedure +products: [cloud, self_hosted, mst] +--- + +import { CAGG, TIMESCALE_DB } from '/snippets/vars.mdx'; + +Migrate a {CAGG} from the old format to the new format introduced +in {TIMESCALE_DB} 2.7. + +```sql +CALL cagg_migrate ( + cagg REGCLASS, + override BOOLEAN DEFAULT FALSE, + drop_old BOOLEAN DEFAULT FALSE +); +``` + +{TIMESCALE_DB} 2.7 introduced a new format for {CAGG}s that improves +performance. It also makes {CAGG}s compatible with more types of +SQL queries. + +The new format, also called the finalized format, stores the {CAGG} data exactly as it appears in the final view. The +old format, also +called the partial format, stores the data in a partially aggregated state. + +Use this procedure to migrate {CAGG}s from the old format to the +new format. + +For more information, see the [migration how-to guide][how-to-migrate]. + + +There are known issues with `cagg_migrate()` in version {TIMESCALE_DB} 2.8.0. +Upgrade to version 2.8.1 or above before using it. + + +## Arguments + +The syntax is: + +```sql +CALL cagg_migrate( + cagg = '', + override = true | false, + drop_old = true | false +); +``` + +| Name | Type | Default | Required | Description | +|-|-|-|-|-| +| `cagg` | `REGCLASS` | - | ✔ | The {CAGG} to migrate | +| `override` | `BOOLEAN` | `false` | - | If false, the old {CAGG} keeps its name. The new {CAGG} is named `_new`. If true, the new {CAGG} gets the old name. The old {CAGG} is renamed `_old`. | +| `drop_old` | `BOOLEAN` | `false` | - | If true, the old {CAGG} is deleted. Must be used together with `override`. | + +[how-to-migrate]: /use-timescale/latest/continuous-aggregates/migrate/ diff --git a/api-reference/timescaledb/continuous-aggregates/create_materialized_view.mdx b/api-reference/timescaledb/continuous-aggregates/create_materialized_view.mdx new file mode 100644 index 0000000..c6122d9 --- /dev/null +++ b/api-reference/timescaledb/continuous-aggregates/create_materialized_view.mdx @@ -0,0 +1,119 @@ +--- +title: CREATE MATERIALIZED VIEW (Continuous Aggregate) +sidebarTitle: CREATE MATERIALIZED VIEW +description: Create a continuous aggregate on a hypertable or another continuous aggregate +topics: [continuous aggregates] +keywords: [continuous aggregates, create] +tags: [materialized view, hypertables] +license: community +type: command +products: [cloud, self_hosted, mst] +--- + +import { CAGG, CAGG_CAP, HYPERTABLE, TIMESCALE_DB, PG, CHUNK } from '/snippets/vars.mdx'; + + Since 2.22.0 + +You use the `CREATE MATERIALIZED VIEW` statement to create {CAGG}s. To learn more, see the +[{CAGG} how-to guides][cagg-how-tos]. + +The syntax is: + +```sql +CREATE MATERIALIZED VIEW [ ( column_name [, ...] ) ] + WITH ( timescaledb.continuous [, timescaledb.