Search Shortcut cmd + k | ctrl + k
gaggle

A DuckDB extension for working with Kaggle datasets

Maintainer(s): habedi

Installing and Loading

INSTALL gaggle FROM community;
LOAD gaggle;

Example

-- 0. Assuming the extension is already installed and loaded

-- 1. Get extension version
SELECT gaggle_version();

-- 2. List files in the dataset
SELECT * FROM gaggle_ls('habedi/flickr-8k-dataset-clean', true) LIMIT 5;

-- 3. Read a Parquet file FROM local cache using a prepared statement
PREPARE rp as SELECT * FROM read_parquet(?) LIMIT 10;
EXECUTE rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));

-- 4. Alternatively, we can use a replacement scan to read directly via `kaggle:` prefix
SELECT COUNT(*) FROM 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';

-- 5. Check cache info
SELECT gaggle_cache_info();

-- 6. Check if cached dataset is current (is the newest version?)
SELECT gaggle_is_current('habedi/flickr-8k-dataset-clean');

About gaggle

Gaggle is a DuckDB extension that uses the Kaggle API to let you query Kaggle datasets directly with SQL. It aims to simplify the data science workflows by hiding the complexity of manually downloading, extracting, and managing dataset files from Kaggle.

For more information, like API references and usage examples, visit the project's GitHub repository.

Added Functions

function_name function_type description comment examples
gaggle_set_credentials scalar Sets Kaggle API credentials from SQL. Alternatively use env vars or ~/.kaggle/kaggle.json. Returns true on success. [select gaggle_set_credentials('your-username', 'your-api-key');]
gaggle_download scalar Downloads a Kaggle dataset to the local cache directory and returns the local dataset path. This function is idempotent. [select gaggle_download('habedi/flickr-8k-dataset-clean') as local_path;]
gaggle_search scalar Searches Kaggle datasets and returns a JSON array. Constraints: page >= 1, 1 <= page_size <= 100. [select gaggle_search('flickr', 1, 5);]
gaggle_info scalar Returns metadata for a dataset as JSON. For example, title, url, last_updated. [select gaggle_info('habedi/flickr-8k-dataset-clean') as dataset_metadata;]
gaggle_version scalar Returns the extension version string. For example, "0.1.0". [select gaggle_version();]
gaggle_clear_cache scalar Clears the dataset cache directory. Returns true on success. [select gaggle_clear_cache();]
gaggle_cache_info scalar Returns cache info JSON. Includes path, size_mb, limit_mb, usage_percent, is_soft_limit, and type fields. [select gaggle_cache_info();]
gaggle_enforce_cache_limit scalar Manually enforces cache size limit using LRU eviction. Returns true on success. (Automatic with soft limit by default). [select gaggle_enforce_cache_limit();]
gaggle_is_current scalar Checks if cached dataset is the latest version from Kaggle. Returns false if not cached or outdated. [select gaggle_is_current('owner/dataset') as is_current;]
gaggle_update_dataset scalar Forces update to latest version (ignores cache). Returns local path to freshly downloaded dataset. [select gaggle_update_dataset('owner/dataset') as updated_path;]
gaggle_version_info scalar Returns version info. Includes: cached_version, latest_version, is_current, is_cached. [select gaggle_version_info('owner/dataset') as version_info;]
gaggle_json_each scalar Expands a JSON object into newline-delimited JSON rows. Fields: key, value, type, path. Users normally shouldn't need to use this function. [select gaggle_json_each('{"a":1,"b":[true,{"c":"x"}]}') as rows;]
gaggle_file_path scalar Resolves a specific file's local path inside a downloaded dataset. Will retrieve (and cache if not downloaded). Set GAGGLE_STRICT_ONDEMAND=1 to prevent fallback to full download. [select gaggle_file_path('owner/dataset', 'file.parquet');]
gaggle_ls table Lists files in the dataset's local directory; non-recursive by default. Set recursive=true to walk subdirs. size is in MB. path is relative 'owner/dataset/'. [select * from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;]
gaggle_last_error scalar NULL NULL NULL