A DuckDB extension for working with Kaggle datasets
Maintainer(s):
habedi
Installing and Loading
INSTALL gaggle FROM community;
LOAD gaggle;
Example
-- 0. Assuming the extension is already installed and loaded
-- 1. Get extension version
SELECT gaggle_version();
-- 2. List files in the dataset
SELECT * FROM gaggle_ls('habedi/flickr-8k-dataset-clean', true) LIMIT 5;
-- 3. Read a Parquet file FROM local cache using a prepared statement
PREPARE rp as SELECT * FROM read_parquet(?) LIMIT 10;
EXECUTE rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
-- 4. Alternatively, we can use a replacement scan to read directly via `kaggle:` prefix
SELECT COUNT(*) FROM 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
-- 5. Check cache info
SELECT gaggle_cache_info();
-- 6. Check if cached dataset is current (is the newest version?)
SELECT gaggle_is_current('habedi/flickr-8k-dataset-clean');
About gaggle
Gaggle is a DuckDB extension that uses the Kaggle API to let you query Kaggle datasets directly with SQL. It aims to simplify the data science workflows by hiding the complexity of manually downloading, extracting, and managing dataset files from Kaggle.
For more information, like API references and usage examples, visit the project's GitHub repository.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| gaggle_set_credentials | scalar | Sets Kaggle API credentials from SQL. | Alternatively use env vars or ~/.kaggle/kaggle.json. Returns true on success. |
[select gaggle_set_credentials('your-username', 'your-api-key');] |
| gaggle_download | scalar | Downloads a Kaggle dataset to the local cache directory and returns the local dataset path. | This function is idempotent. | [select gaggle_download('habedi/flickr-8k-dataset-clean') as local_path;] |
| gaggle_search | scalar | Searches Kaggle datasets and returns a JSON array. | Constraints: page >= 1, 1 <= page_size <= 100. | [select gaggle_search('flickr', 1, 5);] |
| gaggle_info | scalar | Returns metadata for a dataset as JSON. | For example, title, url, last_updated. | [select gaggle_info('habedi/flickr-8k-dataset-clean') as dataset_metadata;] |
| gaggle_version | scalar | Returns the extension version string. | For example, "0.1.0". | [select gaggle_version();] |
| gaggle_clear_cache | scalar | Clears the dataset cache directory. | Returns true on success. | [select gaggle_clear_cache();] |
| gaggle_cache_info | scalar | Returns cache info JSON. | Includes path, size_mb, limit_mb, usage_percent, is_soft_limit, and type fields. | [select gaggle_cache_info();] |
| gaggle_enforce_cache_limit | scalar | Manually enforces cache size limit using LRU eviction. | Returns true on success. (Automatic with soft limit by default). | [select gaggle_enforce_cache_limit();] |
| gaggle_is_current | scalar | Checks if cached dataset is the latest version from Kaggle. | Returns false if not cached or outdated. | [select gaggle_is_current('owner/dataset') as is_current;] |
| gaggle_update_dataset | scalar | Forces update to latest version (ignores cache). | Returns local path to freshly downloaded dataset. | [select gaggle_update_dataset('owner/dataset') as updated_path;] |
| gaggle_version_info | scalar | Returns version info. | Includes: cached_version, latest_version, is_current, is_cached. | [select gaggle_version_info('owner/dataset') as version_info;] |
| gaggle_json_each | scalar | Expands a JSON object into newline-delimited JSON rows. | Fields: key, value, type, path. Users normally shouldn't need to use this function. | [select gaggle_json_each('{"a":1,"b":[true,{"c":"x"}]}') as rows;] |
| gaggle_file_path | scalar | Resolves a specific file's local path inside a downloaded dataset. | Will retrieve (and cache if not downloaded). Set GAGGLE_STRICT_ONDEMAND=1 to prevent fallback to full download. | [select gaggle_file_path('owner/dataset', 'file.parquet');] |
| gaggle_ls | table | Lists files in the dataset's local directory; non-recursive by default. | Set recursive=true to walk subdirs. size is in MB. path is relative 'owner/dataset/ |
[select * from gaggle_ls('habedi/flickr-8k-dataset-clean') limit 5;] |
| gaggle_last_error | scalar | NULL | NULL | NULL |