Getting started

This guide walks you through:

Creating your first R2 bucket and enabling its data catalog.
Creating an API token needed for query engines to authenticate with your data catalog.
Using PyIceberg to create your first Iceberg table in a marimo Python notebook.
Using PyIceberg to load sample data into your table and query it.

Prerequisites

Sign up for a Cloudflare account.
Install Node.js.

Node.js version manager

Use a Node version manager such as Volta or nvm to avoid permission issues and to switch between Node.js versions. Wrangler, used later in this guide, requires Node.js 16.17.0 or later.
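As a quick illustration of the version requirement above, the following minimal sketch compares a Node.js version string against the 16.17.0 minimum. The helper names `parse_node_version` and `node_version_ok` are hypothetical and exist only for this example. It compares numeric tuples rather than strings, because a plain string comparison would wrongly rank 16.9.1 above 16.17.0.

```python
import re

# Minimum Node.js version that this guide states Wrangler requires.
MIN_NODE = (16, 17, 0)


def parse_node_version(text):
    """Parse output such as 'v18.12.0' into a tuple of integers."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized Node.js version: {text!r}")
    return tuple(int(part) for part in match.groups())


def node_version_ok(text, minimum=MIN_NODE):
    """Return True when the given version is at least the minimum."""
    return parse_node_version(text) >= minimum
```

To use it, pass in the text printed by `node --version`, for example `node_version_ok("v20.11.1")`.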

1. Create an R2 bucket

If you are not already logged in to Cloudflare, run:
```
npx wrangler login
```

Create an R2 bucket:

npx wrangler r2 bucket create r2-data-catalog-tutorial

2. Enable the data catalog for your bucket


Enable the catalog on your R2 bucket:

npx wrangler r2 bucket catalog enable r2-data-catalog-tutorial

When you run this command, take note of the "Warehouse" and "Catalog URI". You will need these later.

3. Create an API token

Iceberg clients (including PyIceberg) must authenticate to the catalog with an R2 API token that has both R2 and catalog permissions.

In the Cloudflare dashboard, go to the R2 object storage Overview page.
Select Manage API tokens.
Select Create API token.
Select the R2 Token text to edit your API token name.
Under Permissions, choose the Admin Read & Write permission.
Select Create API Token.
Note the Token value.
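To make the authentication step more concrete, here is a minimal sketch of how a client might attach the token to a catalog request. It rests on two assumptions that this guide does not state: that the catalog serves the `GET /v1/config` endpoint defined by the Iceberg REST catalog specification under your Catalog URI, and that it accepts the token as a standard HTTP bearer token. The function name `build_config_request` is hypothetical. The code only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_config_request(catalog_uri, warehouse, token):
    """Build (but do not send) a GET request for the catalog configuration.

    Assumption: the endpoint path follows the Iceberg REST catalog
    specification and the token is passed as a bearer token.
    """
    query = urllib.parse.urlencode({"warehouse": warehouse})
    url = f"{catalog_uri.rstrip('/')}/v1/config?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. In practice PyIceberg handles this for you, so treat the sketch as an explanation rather than a required step.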

4. Install uv

You need a Python package manager. This guide uses uv. If you do not already have uv installed, follow the installing uv guide.

5. Install marimo and set up your project with uv

This guide uses marimo as the Python notebook.

Create a directory to store your notebook:
```
mkdir r2-data-catalog-notebook
```
Change into the new directory:
```
cd r2-data-catalog-notebook
```
Initialize a new uv project (this creates a .venv and a pyproject.toml):
```
uv init
```
Add marimo and required dependencies:
```
uv add marimo pyiceberg pyarrow pandas
```
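If you want to confirm that the dependencies landed in the project environment, a small check along the following lines can help. This is a sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list simply mirrors the `uv add` command above.

```python
from importlib.util import find_spec

# Packages added by the `uv add` command in this guide.
REQUIRED = ("marimo", "pyiceberg", "pyarrow", "pandas")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

Saved to a file and run with `uv run python <file>`, calling `missing_packages()` should return an empty list when everything is installed.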

6. Create a Python notebook to interact with the data warehouse

Create a file called r2-data-catalog-tutorial.py.

Paste the following code snippet into your r2-data-catalog-tutorial.py file:

import marimo

__generated_with = "0.11.31"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _():
    import pandas
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    from pyiceberg.catalog.rest import RestCatalog

    # Define catalog connection details (replace variables)
    WAREHOUSE = "<WAREHOUSE>"
    TOKEN = "<TOKEN>"
    CATALOG_URI = "<CATALOG_URI>"

    # Connect to R2 Data Catalog
    catalog = RestCatalog(
        name="my_catalog",
        warehouse=WAREHOUSE,
        uri=CATALOG_URI,
        token=TOKEN,
    )
    return (
        CATALOG_URI,
        RestCatalog,
        TOKEN,
        WAREHOUSE,
        catalog,
        pa,
        pandas,
        pc,
        pq,
    )


@app.cell
def _(catalog):
    # Create default namespace if needed
    catalog.create_namespace_if_not_exists("default")
    return


@app.cell
def _(pa):
    # Create simple PyArrow table
    df = pa.table({
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "score": [80.0, 92.5, 88.0],
    })
    return (df,)


@app.cell
def _(catalog, df):
    # Create or load Iceberg table
    test_table = ("default", "people")
    if not catalog.table_exists(test_table):
        print(f"Creating table: {test_table}")
        table = catalog.create_table(
            test_table,
            schema=df.schema,
        )
    else:
        table = catalog.load_table(test_table)
    return table, test_table


@app.cell
def _(df, table):
    # Append data
    table.append(df)
    return


@app.cell
def _(table):
    print("Table contents:")
    scanned = table.scan().to_arrow()
    print(scanned.to_pandas())
    return (scanned,)


@app.cell
def _():
    # Optional cleanup. To run it, uncomment the lines below and run this cell
    # print(f"Deleting table: {test_table}")
    # catalog.drop_table(test_table)
    # print("Table dropped.")
    return


if __name__ == "__main__":
    app.run()

Replace the WAREHOUSE and CATALOG_URI placeholders with the values from step 2, and the TOKEN placeholder with the token value from step 3.
Launch the notebook editor in your browser:
```
uv run marimo edit r2-data-catalog-tutorial.py
```
Once your notebook connects to the catalog, you'll see the catalog along with its namespaces and tables appear in marimo's Datasources panel.
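Pasting the token directly into the notebook file is the simplest approach, but it makes the file unsafe to share. One alternative, sketched below, is to read the three connection values from environment variables. The variable names `R2_CATALOG_WAREHOUSE`, `R2_CATALOG_URI`, and `R2_CATALOG_TOKEN` and the helper `read_catalog_settings` are assumptions made for this example, not names defined by Cloudflare or PyIceberg.

```python
import os

# Placeholder strings from the notebook that must not be used as real values.
PLACEHOLDERS = {"<WAREHOUSE>", "<TOKEN>", "<CATALOG_URI>"}

# Hypothetical environment variable names, chosen for this example only.
ENV_NAMES = {
    "warehouse": "R2_CATALOG_WAREHOUSE",
    "uri": "R2_CATALOG_URI",
    "token": "R2_CATALOG_TOKEN",
}


def read_catalog_settings(env=None):
    """Return the connection settings, failing early if any value is missing."""
    env = os.environ if env is None else env
    settings = {}
    for key, var in ENV_NAMES.items():
        value = env.get(var, "").strip()
        if not value or value in PLACEHOLDERS:
            raise ValueError(f"set {var} before running the notebook")
        settings[key] = value
    return settings
```

In the notebook, the connection cell could then call `RestCatalog(name="my_catalog", **read_catalog_settings())`, which passes the same `warehouse`, `uri`, and `token` keyword arguments as the original cell.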

In the Python notebook above, you:

Connect to your catalog.
Create the default namespace.
Create a simple PyArrow table.
Create (or load) the people table in the default namespace.
Append sample data to the table.
Print the contents of the table.
(Optional) Drop the people table we created for this tutorial.
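Two follow-ups to the steps above are worth sketching. First, the append cell adds the sample rows every time it runs, so re-running the notebook duplicates them. Second, the final cell reads the whole table, whereas a scan can also filter rows and select columns. The helpers below are hypothetical names written for this example. They rely on the `row_filter`, `selected_fields`, and `limit` arguments of PyIceberg's `Table.scan`, and they assume the `people` table created in the notebook.

```python
def append_if_empty(table, data):
    """Append `data` only when the table has no rows; return True if appended."""
    # limit=1 is enough to learn whether any row exists.
    if table.scan(limit=1).to_arrow().num_rows > 0:
        return False
    table.append(data)
    return True


def scan_min_score(table, min_score):
    """Return the name and score columns for rows with score >= min_score."""
    return table.scan(
        row_filter=f"score >= {float(min_score)}",
        selected_fields=("name", "score"),
    ).to_arrow()
```

In the notebook, `append_if_empty(table, df)` could replace the plain `table.append(df)` call, and `scan_min_score(table, 85).to_pandas()` would show only the higher-scoring rows. The emptiness check is a simple guard for a tutorial, not a general deduplication strategy.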

Learn more

Managing catalogs: Enable or disable R2 Data Catalog on your bucket, retrieve configuration details, and authenticate your Iceberg engine.

Connect to Iceberg engines: Find detailed setup instructions for Apache Spark and other common query engines.
