SQL Application#

This article describes how to use SQL statements in Python with Swordfish.

Create an In-Memory Table#

You can create a table using the CREATE TABLE statement.

import swordfish as sf
# Create an in-memory table
sf.sql('''
    CREATE TABLE trades (
        symbol SYMBOL,
        price DOUBLE,
        volume INT,
        timestamp TIMESTAMP
    )
''')

The created table object trades is managed within Swordfish. You can check all variables managed by Swordfish using sf.function.objs().

Insert Data#

Swordfish supports multiple data insertion methods through the INSERT statement, including single-row insertion, batch insertion, and data import from another table.

# Insert a single row
sf.sql("INSERT INTO trades VALUES ('AAPL', 182.5, 1000, 2023.10.01 09:30:00.000)")

# Batch insert from a query result
sf.sql('''
    CREATE TABLE temp_data (
        id INT,
        name STRING,
        category STRING
    )
''')
source_table = sf.table({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'category': ['tech', 'finance', 'tech']})
sf.sql('INSERT INTO temp_data SELECT * FROM source_table WHERE category="tech"', vars={'source_table': source_table})

Update and Delete Data#

Use the UPDATE and DELETE statements to perform conditional updates and batch deletions.

sf.sql('''
    UPDATE trades
    SET price = price * 1.05
    WHERE symbol = 'AAPL' AND timestamp < 2023.09.01
''')

sf.sql("DELETE FROM trades")

Access and Query Data#

The SELECT statement enables flexible queries and computations, and it works on both Swordfish tables and Python table variables.

# Query data from a table in Swordfish
sf.sql('SELECT symbol, price FROM trades WHERE volume > 100')

# Query data from pandas.DataFrame object
import pandas as pd
source_table = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'category': ['tech', 'finance', 'tech']})
sf.sql('SELECT * FROM source_table WHERE category="tech"', vars={'source_table': source_table})

Grouping and Aggregation#

Swordfish provides rich aggregate functions and supports both GROUP BY and the extended CONTEXT BY syntax for grouped computations.

GROUP BY#

GROUP BY groups data by specified columns, and applies aggregate functions (such as sum or avg) to each group. The result of the GROUP BY for each group is a scalar.

Example: Calculate the average score of each subject in every class.

# Simulate the scores
data = {
    "class": ["A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B"],
    "student": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob", "Charlie", "Charlie", "Charlie", "David", "David", "David", "Eve", "Eve", "Eve", "Frank", "Frank", "Frank", "Grace", "Grace", "Grace"],
    "subject": ["Math", "Science", "English", "Math", "Science", "English", "Math", "Science", "English", "Math", "Science", "English", "Math", "Science", "English", "Math", "Science", "English", "Math", "Science", "English"],
    "score": [85, 92, 78, 88, 91, 85, 90, 87, 79, 84, 95, 80, 89, 90, 76, 83, 85, 88, 92, 79, 81]
}
# Calculate the average score of each subject in every class
groupbyTable = sf.table(data)
sf.sql('''
    SELECT
        avg(score) AS avg_score
    FROM groupbyTable
    GROUP BY class, subject
''', vars={'groupbyTable': groupbyTable})

CONTEXT BY#

Unlike GROUP BY, CONTEXT BY supports vectorized computation and can be combined with aggregate, moving window, or cumulative functions.

Example: Calculate the difference between each student’s score and the class average for each subject.

sf.sql('''
    SELECT
        class,
        student,
        subject,
        score - avg(score) as deltas
    FROM groupbyTable
    CONTEXT BY class, subject
''', vars={'groupbyTable': groupbyTable})

Example: Calculate each student’s rank by subject within the grade.

sf.sql('''
    SELECT
        class, student, subject, rank(score)+1 as grade_rank
    FROM groupbyTable
    CONTEXT BY subject
''', vars={'groupbyTable': groupbyTable})

In summary, GROUP BY returns aggregated row per group,making it suitable for generating grouped statistical reports, while CONTEXT BY retains the original data along with grouped results, which is ideal for time-series or individual-level analysis.

PARTITION BY#

The PARTITION BY clause is also supported by Swordfish. It can achieve similar functionality to CONTEXT BY. However, CONTEXT BY is generally more concise and intuitive.

Example: The above CONTEXT BY examples can also be written using window functions in SQL:

# Calculate the difference between each student’s score and the class average for each subject.
window_query = '''
SELECT
        class,
        student,
        subject,
        score - avg(score) OVER (PARTITION BY class, subject) AS deltas
    FROM groupbyTable
    ORDER BY class, subject
'''
sf.sql(window_query, vars = {'groupbyTable': groupbyTable})

# Calculate each student's rank by subject within the grade
sf.sql('''
    SELECT
        class, student, subject, rank() OVER (PARTITION BY subject order by score) AS grade_rank
    FROM groupbyTable
''', vars={'groupbyTable': groupbyTable})

Data Pivoting#

PIVOT BY is similar to pandas.DataFrame.pivot, but more flexible and efficient. PIVOT BY rearranges one or more columns of a table along two dimensions and can be combined with data transformation or aggregation functions.

Example: Display the average score of each subject by class in a pivoted layout using PIVOT BY:

sf.sql('''
    SELECT
        avg(score) AS avg_score
    FROM groupbyTable
    PIVOT BY class, subject
''', vars={'groupbyTable': groupbyTable})

Table Joiners#

Swordfish SQL supports multiple types of table joiners.

# UNION
source_table1 = sf.table({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'category': ['tech', 'finance', 'tech']})
source_table2 = sf.table({'id': [4, 5, 6], 'name': ['Lily', 'Mary', 'Tom'], 'category': ['tech', 'finance', 'tech']})
sf.sql('''
    SELECT *
    FROM source_table1
    UNION source_table2
    ''', vars={'source_table1': source_table1,'source_table2': source_table2})
# JOIN
source_table3 = sf.table({'name': ['Alice', 'Mary', 'Tom'],'age': [20, 5, 35]})
sf.sql('''
    SELECT *
    FROM source_table1 as t1
    LEFT JOIN source_table3 as t3
    ON t1.name = t3.name
    ''', vars={'source_table1': source_table1,'source_table3': source_table3})

In addition to LEFT JOIN, Swordfish supports multiple join types for different scenarios.

Join Type	Description
INNER JOIN	Returns rows that satisfy the join condition in both tables, excluding non-matching rows.
LEFT JOIN / LEFT OUTER	Returns all rows from the left table and the matching rows from the right table. If there is no match in the right table, NULL is returned.
RIGHT JOIN / RIGHT OUTER	Returns all rows from the right table and the matching rows from the left table. If there is no match in the left table, NULL is returned.
FULL JOIN / FULL OUTER	Returns all rows from both tables. If there is no match, NULL is returned.
EQUI JOIN	A join based on equality conditions, usually using `=` to join columns from two tables.
CROSS JOIN	Returns all possible combinations of rows from the two tables, independent of any condition.
ASOF JOIN	Used for time-series data; the join condition is based on a time column.
WINDOW JOIN	A join performed using window functions, typically involving aggregation or analysis within a window of data.
PREFIX JOIN	A join based on matching column name prefixes, suitable when column names are similar but not identical.

Beyond ANSI SQL syntax, Swordfish also provides a functional API for table joiners. For example, LEFT JOIN can be performed using the lj function:

import swordfish.function as F
F.lj(source_table1 , source_table3, "name")

Compared with ANSI SQL syntax, the functional API is more concise and intuitive, making it suitable for users familiar with functional programming.