DataFrame Documentation

This is the official documentation for the DataFrame API provided by Raven Computing. This page unifies the documentation for all supported programming languages and gives examples of how to use the API. The specification strictly defines the behaviour of DataFrames as an in-memory data structure as well as a platform-independent file format for persistence. The objective is to provide a unified interface and the same experience when working with strongly typed tabular data in different languages.



Core Concepts

This section describes the core concepts of DataFrames. The specification defines two components that are both referred to as a DataFrame. On the one hand, a DataFrame denotes a data structure used by programs to handle data in memory in a specific way. On the other hand, a DataFrame is also specified as a file format, recognizable by the '.df' file extension. The following paragraphs explain in more detail how the specification defines DataFrames.

What are DataFrames?

First and foremost, a DataFrame is an in-memory data structure. DataFrames were originally popularized by the programming language R, where they are built in; other languages have since added support for DataFrames in the form of external libraries. A DataFrame can be thought of as a table of data, similar to a spreadsheet. This way of organizing data is very useful in computer science and software engineering since it presents content to humans as easily understandable two-dimensional data. One might think of the rows as individual data points, i.e. observations within the underlying dataset, and the columns as the individual variables of each observation. The columns are also often referred to as features. One might therefore think of a DataFrame as a collection of feature vectors, since the rows correspond to the entries in each feature vector at a specific index. When the dataset is cleanly structured, each feature has a specific predetermined type. Therefore, the feature vectors, i.e. columns, also have a specific type because each vector holds data of one and only one type.

In essence, a DataFrame is simply a collection of columns (feature vectors). As a data structure it provides methods to manage those columns and both query and manipulate individual data entries. In the object-oriented way of programming, a DataFrame is an object that binds strongly typed columns into one data structure and provides methods to work with that data during program execution.

In addition to the in-memory usage, a DataFrame can be persisted to the file system. The structure of the files read and written is strictly defined. All information in a DataFrame object when used in-memory is serializable and can therefore be written to the filesystem. Any process that knows how to read a DataFrame file can do so and thus load the persisted DataFrame object back into memory whenever desired.

Types

Depending on the underlying programming language, a DataFrame object is represented by a DataFrame interface or class. DataFrames can therefore be referenced in code by variables that have the type DataFrame. In dynamically typed languages (e.g. Python) the variable type is omitted. Every column inside a DataFrame is represented by an object of type Column. It is an abstract class defining methods that all concrete DataFrame columns must implement. The actual column data (internally used array) is handled by a specific implementation of the Column abstract class. The specification demands that a DataFrame can work with 10 different element types. This is manifested by the presence of a separate Column implementation for each element type.
The following table lists all supported types:

Type Name | Element Type | Description                      | Implementations
byte      | int8         | signed 8-bit integer             | ByteColumn, NullableByteColumn
short     | int16        | signed 16-bit integer            | ShortColumn, NullableShortColumn
int       | int32        | signed 32-bit integer            | IntColumn, NullableIntColumn
long      | int64        | signed 64-bit integer            | LongColumn, NullableLongColumn
float     | float32      | single precision 32-bit float    | FloatColumn, NullableFloatColumn
double    | float64      | double precision 64-bit float    | DoubleColumn, NullableDoubleColumn
string    | string       | arbitrary-length unicode string  | StringColumn, NullableStringColumn
char      | uint8        | single printable ASCII character | CharColumn, NullableCharColumn
boolean   | bool         | single boolean value             | BooleanColumn, NullableBooleanColumn
binary    | uint8 array  | arbitrary-length byte array      | BinaryColumn, NullableBinaryColumn


In the table above, the type name denotes the standardized name of the corresponding type. A Column implementation must hold data of the corresponding element type. For example, a ByteColumn holds signed 8-bit integers. It can therefore hold integer numbers in the range [-128, 127]. Every official DataFrame implementation must support the described types and provide an implementation for all corresponding Column classes.

Implementations

The specification describes two main DataFrame implementations. These implementations only differ in their treatment of null values. Both implementations provide the same API and overall behaviour.

A DefaultDataFrame is the implementation used by default. It works with primitives, which means that it does not support null values. Passing a null value to a DefaultDataFrame at any time will cause a runtime exception.

A NullableDataFrame is a more flexible implementation which can work with null values. Since many programming languages differentiate between primitive data types and objects, this implementation has to use wrapper objects for all primitives as the underlying structure of its columns to allow the use of null values. Generally, as a result, NullableDataFrames are less efficient than DefaultDataFrames. They usually require more memory and some operations are slower. When the usage of null values is not needed, you should always use a DefaultDataFrame.
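
The difference in null handling can be illustrated with a short sketch in Python (the import mechanism is covered in the Getting Started section below; the concrete exception type raised by a DefaultDataFrame is an implementation detail, so a generic exception is caught here):

from raven.struct.dataframe import (DefaultDataFrame,
                                    NullableDataFrame,
                                    IntColumn,
                                    NullableIntColumn)

# a NullableDataFrame can hold None values from the start
df_nullable = NullableDataFrame(
        NullableIntColumn("A", [1, None, 3]))

# a DefaultDataFrame only holds concrete (non-null) values
df_default = DefaultDataFrame(
        IntColumn("A", [1, 2, 3]))

# trying to put a null value into a DefaultDataFrame
# causes a runtime exception
try:
    df_default.set_int("A", 1, None)
except Exception as e:
    print("null values are not supported:", e)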

The concrete Column types to use depend on the DataFrame implementation used. For DefaultDataFrames, the concrete Column class is denoted by the type name of the column followed by the 'Column'-postfix. For NullableDataFrames, the concrete Column class additionally carries a 'Nullable'-prefix. (See the table in Sec. 1.2 Types).

Limitations

The DataFrame specification is designed to represent a general-purpose data structure, file format and data interchange format. The usage is not limited to pure data analysis tasks. However, the specification was not designed to support extremely large datasets (so-called "big data"). There are several reasons for this. First, the binary file format does not support random access to individual elements. This means that a DataFrame file must always be read and written in its entirety. Second, a DataFrame object always resides in memory in its entirety. Currently there is no mechanism for loading data "on demand". This means that DataFrames which do not fit into memory cannot be processed by the underlying system.

DataFrames are not intended to be used as databases. The limitations of the above paragraph apply. One might be tempted to view DataFrames as equivalent to tables of a relational database management system (RDBMS). However, the DataFrame file format was not designed to be used as a database. Although it might make sense to use DataFrames as relational data storage in a prototype or experimental environment, it is recommended to use a standard RDBMS for data storage in production systems. On the other hand, if random access is not required by the underlying use case, DataFrame files are much easier to use and maintain.

The DataFrame file format cannot represent DataFrames with more than 2^32 columns or 2^32 rows. Consider using another file format (e.g. HDF5) for larger datasets. The concrete programming language might further limit the usable size of DataFrames. For example, in Java, the maximum length of a Column is 2^31 - 1 because larger arrays cannot be directly allocated.

Based on the exact use case and requirements, you can split up a dataset into multiple DataFrames and index the individual files as desired.

Getting Started

This section describes how to add DataFrames to your project and how to import the API classes.

Adding DataFrames to your Project

Adding DataFrames to a project is easy, as precompiled packages are available for common dependency management systems.

For Java projects, the easiest way to add DataFrames is through the Claymore library, which is available on the Maven Central Repository.

For Python projects, our official implementation is available on PyPI.

Below you can find commands and dependency entries for your language:

pip install raven-pydf
                
<!-- Note: Replace major.minor.patch with concrete version numbers! -->

<!-- Maven -->
<dependency>
    <groupId>com.raven-computing</groupId>
    <artifactId>claymore</artifactId>
    <version>major.minor.patch</version>
</dependency>

<!-- Gradle -->
implementation 'com.raven-computing:claymore:major.minor.patch'
                

You can also add the corresponding library manually. Please see the Development section in the source code repository for the corresponding language.

Importing the Classes in your Code

The core usage during development is provided through the DataFrame interface/class. However, when directly referring to concrete implementation classes or concrete Columns, additional classes might have to be imported.

For the most basic example, let's see how to import DataFrames and construct a new DefaultDataFrame instance in code:

from raven.struct.dataframe import DataFrame

# create a DefaultDataFrame with 3 columns and 3 rows
df = DataFrame.Default(
        DataFrame.IntColumn("A", [1, 2, 3]),
        DataFrame.FloatColumn("B", [4.4, 5.5, 6.6]),
        DataFrame.StringColumn("C", ["cat", "dog", "horse"]))
import com.raven.common.struct.DataFrame;
import com.raven.common.struct.DefaultDataFrame;
import com.raven.common.struct.Column;

// create a DefaultDataFrame with 3 columns and 3 rows
DataFrame df = new DefaultDataFrame(
        Column.create("A", 1, 2, 3),
        Column.create("B", 4.4f, 5.5f, 6.6f),
        Column.create("C", "cat", "dog", "horse"));
                

The import statements are slightly different, depending on the language. The concrete way in which DataFrames and Columns are constructed is one of the few things which are not strictly defined by the specification. The API in each language might therefore vary slightly.

Each concrete DataFrame implementation and concrete Column must be implemented by a separate class. In order to reduce the number of import statements and therefore to make the manual construction of DataFrames more convenient, the core APIs usually provide convenience functions either through the Column API or through the DataFrame interface/class directly.
In the above example, we used the minimum number of import statements. Alternatively, one can construct concrete DataFrame and Column implementations by calling the corresponding constructors directly. The following example creates the same DataFrame as before:

from raven.struct.dataframe import (DefaultDataFrame,
                                    IntColumn,
                                    FloatColumn,
                                    StringColumn)

# create a DefaultDataFrame with 3 columns and 3 rows
df = DefaultDataFrame(
        IntColumn("A", [1, 2, 3]),
        FloatColumn("B", [4.4, 5.5, 6.6]),
        StringColumn("C", ["cat", "dog", "horse"]))
                
import com.raven.common.struct.DataFrame;
import com.raven.common.struct.DefaultDataFrame;
import com.raven.common.struct.IntColumn;
import com.raven.common.struct.FloatColumn;
import com.raven.common.struct.StringColumn;

// create a DefaultDataFrame with 3 columns and 3 rows
DataFrame df = new DefaultDataFrame(
        new IntColumn("A", new int[]{1, 2, 3}),
        new FloatColumn("B", new float[]{4.4f, 5.5f, 6.6f}),
        new StringColumn("C", new String[]{"cat", "dog", "horse"}));
                

Ultimately, how one decides to construct DataFrames and Columns is a matter of taste. Some might prefer to be explicit while others might want shorter code. Both are fine.

Note:
In all subsequent code examples, the import statements will not be explicitly mentioned again for the sake of brevity.

DataFrame API

This section describes the DataFrame API. Most of the calls and operations are standardized through the DataFrame specification. The aim is to provide a unified API for working with DataFrames as a data structure. However, since different programming languages have different features and peculiarities, there are some minor variations of the API, for example when manually constructing DataFrame instances.
This section gives descriptions of all API calls and usage examples through code samples in all supported languages. Expected output in all samples is indicated as commented text. In principle, you can directly copy the code samples and run them for example in an interactive Python REPL, provided that you have imported the necessary classes and created the corresponding DataFrame instance that the specific sample uses to illustrate an API call.

Construction

DataFrames can be created in various ways. As DataFrames are normal objects in the object-oriented sense, they can be created by means of a standard constructor. Of course, this also applies to Column objects. In the simplest case, you can construct an empty DataFrame by using the default constructor without specifying any arguments.
For example:

df1 = DataFrame.Default()
df2 = DataFrame.Nullable()

# or alternatively, when the necessary import statements are present:

df1 = DefaultDataFrame()
df2 = NullableDataFrame()
                
DataFrame df1 = new DefaultDataFrame();
DataFrame df2 = new NullableDataFrame();
                

The above example constructs two DataFrames, a DefaultDataFrame (non-nullable) and a NullableDataFrame. Since we have not defined and added any Columns yet, both DataFrames are completely empty. They are said to be uninitialized.
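
As a quick check, such an uninitialized DataFrame should report zero columns and zero rows when queried with the methods described in Sec. 3.3.1 Columns and Rows:

# both DataFrames from the example above are uninitialized,
# i.e. they hold no columns and no rows yet
print(df1.columns())
# 0
print(df1.rows())
# 0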

You can now add as many Columns as you want. However, you may also wish to specify all Columns inside a DataFrame at construction. Therefore, you can pass all Columns you want a DataFrame to hold directly to the constructor.
The following example demonstrates how to construct one labeled Column for each type, both for a DefaultDataFrame and a NullableDataFrame:

df1 = DataFrame.Default(
        DataFrame.ByteColumn("A", [10, 11, 12]),
        DataFrame.ShortColumn("B", [13, 14, 15]),
        DataFrame.IntColumn("C", [16, 17, 18]),
        DataFrame.LongColumn("D", [19, 20, 21]),
        DataFrame.FloatColumn("E", [22.1, 23.2, 24.3]),
        DataFrame.DoubleColumn("F", [25.4, 26.5, 27.6]),
        DataFrame.StringColumn("G", ["car", "airplane", "bike"]),
        DataFrame.CharColumn("H", ["a", "b", "c"]),
        DataFrame.BooleanColumn("I", [True, True, False]),
        DataFrame.BinaryColumn("J", [bytearray.fromhex("00aa"),
                                     bytearray.fromhex("0102bb"),
                                     bytearray.fromhex("030405cc")]))

df2 = DataFrame.Nullable(
        DataFrame.NullableByteColumn("A", [10, None, 12]),
        DataFrame.NullableShortColumn("B", [13, None, 15]),
        DataFrame.NullableIntColumn("C", [16, None, 18]),
        DataFrame.NullableLongColumn("D", [19, None, 21]),
        DataFrame.NullableFloatColumn("E", [22.1, None, 24.3]),
        DataFrame.NullableDoubleColumn("F", [25.4, None, 27.6]),
        DataFrame.NullableStringColumn("G", ["car", None, "bike"]),
        DataFrame.NullableCharColumn("H", ["a", None, "c"]),
        DataFrame.NullableBooleanColumn("I", [True, None, False]),
        DataFrame.NullableBinaryColumn("J", [bytearray.fromhex("00aa"),
                                             None,
                                             bytearray.fromhex("030405cc")]))

# or alternatively, when the necessary import statements are present:

df1 = DefaultDataFrame(
        ByteColumn("A", [10, 11, 12]),
        ShortColumn("B", [13, 14, 15]),
        IntColumn("C", [16, 17, 18]),
        LongColumn("D", [19, 20, 21]),
        FloatColumn("E", [22.1, 23.2, 24.3]),
        DoubleColumn("F", [25.4, 26.5, 27.6]),
        StringColumn("G", ["car", "airplane", "bike"]),
        CharColumn("H", ["a", "b", "c"]),
        BooleanColumn("I", [True, True, False]),
        BinaryColumn("J", [bytearray.fromhex("00aa"),
                           bytearray.fromhex("0102bb"),
                           bytearray.fromhex("030405cc")]))

df2 = NullableDataFrame(
        NullableByteColumn("A", [10, None, 12]),
        NullableShortColumn("B", [13, None, 15]),
        NullableIntColumn("C", [16, None, 18]),
        NullableLongColumn("D", [19, None, 21]),
        NullableFloatColumn("E", [22.1, None, 24.3]),
        NullableDoubleColumn("F", [25.4, None, 27.6]),
        NullableStringColumn("G", ["car", None, "bike"]),
        NullableCharColumn("H", ["a", None, "c"]),
        NullableBooleanColumn("I", [True, None, False]),
        NullableBinaryColumn("J", [bytearray.fromhex("00aa"),
                                   None,
                                   bytearray.fromhex("030405cc")]))
                
DataFrame df1 = new DefaultDataFrame(
        Column.create("A", (byte)10, (byte)11, (byte)12),
        Column.create("B", (short)13, (short)14, (short)15),
        Column.create("C", 16, 17, 18),
        Column.create("D", 19L, 20L, 21L),
        Column.create("E", 22.1f, 23.2f, 24.3f),
        Column.create("F", 25.4, 26.5, 27.6),
        Column.create("G", "car", "airplane", "bike"),
        Column.create("H", 'a', 'b', 'c'),
        Column.create("I", true, true, false),
        Column.create("J", new byte[]{0x00, (byte)0xaa},
                           new byte[]{0x01, 0x02, (byte)0xbb},
                           new byte[]{0x03, 0x04, 0x05, (byte)0xcc}));

DataFrame df2 = new NullableDataFrame(
        Column.nullable("A", (byte)10, null, (byte)12),
        Column.nullable("B", (short)13, null, (short)15),
        Column.nullable("C", 16, null, 18),
        Column.nullable("D", 19L, null, 21L),
        Column.nullable("E", 22.1f, null, 24.3f),
        Column.nullable("F", 25.4, null, 27.6),
        Column.nullable("G", "car", null, "bike"),
        Column.nullable("H", 'a', null, 'c'),
        Column.nullable("I", true, null, false),
        Column.nullable("J", new byte[]{0x00, (byte)0xaa},
                             null,
                             new byte[]{0x03, 0x04, 0x05, (byte)0xcc}));

// or alternatively, when the necessary import statements are present:

DataFrame df1 = new DefaultDataFrame(
        new ByteColumn("A", new byte[]{(byte)10, (byte)11, (byte)12}),
        new ShortColumn("B", new short[]{(short)13, (short)14, (short)15}),
        new IntColumn("C", new int[]{16, 17, 18}),
        new LongColumn("D", new long[]{19L, 20L, 21L}),
        new FloatColumn("E", new float[]{22.1f, 23.2f, 24.3f}),
        new DoubleColumn("F", new double[]{25.4, 26.5, 27.6}),
        new StringColumn("G", new String[]{"car", "airplane", "bike"}),
        new CharColumn("H", new char[]{'a', 'b', 'c'}),
        new BooleanColumn("I", new boolean[]{true, true, false}),
        new BinaryColumn("J", new byte[][]{new byte[]{0x00, (byte)0xaa},
                                           new byte[]{0x01, 0x02, (byte)0xbb},
                                           new byte[]{0x03, 0x04, 0x05, (byte)0xcc}}));

DataFrame df2 = new NullableDataFrame(
        new NullableByteColumn("A", new Byte[]{(byte)10, null, (byte)12}),
        new NullableShortColumn("B", new Short[]{(short)13, null, (short)15}),
        new NullableIntColumn("C", new Integer[]{16, null, 18}),
        new NullableLongColumn("D", new Long[]{19L, null, 21L}),
        new NullableFloatColumn("E", new Float[]{22.1f, null, 24.3f}),
        new NullableDoubleColumn("F", new Double[]{25.4, null, 27.6}),
        new NullableStringColumn("G", new String[]{"car", null, "bike"}),
        new NullableCharColumn("H", new Character[]{'a', null, 'c'}),
        new NullableBooleanColumn("I", new Boolean[]{true, null, false}),
        new NullableBinaryColumn("J", new byte[][]{new byte[]{0x00, (byte)0xaa},
                                                   null,
                                                   new byte[]{0x03, 0x04, 0x05, (byte)0xcc}}));
                

Please note that you can optionally also pass default Column instances to the NullableDataFrame constructor. However, since the Columns are internally converted to a corresponding nullable Column instance, this is less efficient because the conversion entails copying the Column values. Be aware that the opposite is not possible, i.e. the DefaultDataFrame constructor must be given non-nullable Column instances.
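
As a brief sketch of this conversion in Python (the column names and values below are arbitrary):

# passing default (non-nullable) Columns to a NullableDataFrame
# is allowed, but their values are copied into corresponding
# nullable Columns during construction
df = NullableDataFrame(
        IntColumn("A", [1, 2, 3]),
        StringColumn("B", ["cat", "dog", "horse"]))

# the result behaves like any other NullableDataFrame,
# e.g. entries can subsequently be set to None
df.set_int("A", 1, None)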

In the above example, all Columns were also labeled directly during their construction. Although it is recommended for most use cases to use labeled Columns, it is by no means a necessity. That is, you can also construct all Column instances without providing a name.
For example, the following code shows how to construct a DefaultDataFrame with unlabeled Columns:

# the following columns are not labeled.
# you have to omit the column names (leave them as None)
df = DefaultDataFrame(
    StringColumn(values=["good", "medium", "bad"]),
    IntColumn(values=[10, 5, 0]))
                
// the following columns are not labeled.
// you have to use the constructors of the columns directly
DataFrame df = new DefaultDataFrame(
    new StringColumn(new String[]{"good", "medium", "bad"}),
    new IntColumn(new int[]{10, 5, 0}));
                

You can set column names at any time by calling the appropriate methods, see Sec. 3.5 Column Names.

Value Access

All values inside a DataFrame can be accessed and set individually. This is done by using ordinary getters and setters. Since DataFrames use typed columns, methods for reading and writing values contain the type name in their signature.
Both read and write operations on individual values are guaranteed to be performed in constant time O(1).

Getters

Use the appropriate get method to retrieve an individual value inside a DataFrame. Two positions have to be specified when calling a get method: the column index and row index. The column index can be replaced by the column name if such a column exists inside the DataFrame.

The following example demonstrates how to access individual values for every supported type. The underlying DataFrame comes from the second example in Sec. 3.1 Construction.

mybyte = df1.get_byte("A", 1) # returns a Python int
myshort = df1.get_short("B", 1) # returns a Python int
myint = df1.get_int("C", 1) # returns a Python int
mylong = df1.get_long("D", 1) # returns a Python int
myfloat = df1.get_float("E", 1) # returns a Python float
mydouble = df1.get_double("F", 1) # returns a Python float
mystring = df1.get_string("G", 1) # returns a Python str
mychar = df1.get_char("H", 1) # returns a Python str of length 1
mybool = df1.get_boolean("I", 1) # returns a Python bool
mybytearray = df1.get_binary("J", 1) # returns a Python bytearray
                
// variable types can be primitives when working with DefaultDataFrames
// since all values are guaranteed to be non-null
byte mybyte = df1.getByte("A", 1);
short myshort = df1.getShort("B", 1);
int myint = df1.getInt("C", 1);
long mylong = df1.getLong("D", 1);
float myfloat = df1.getFloat("E", 1);
double mydouble = df1.getDouble("F", 1);
String mystring = df1.getString("G", 1);
char mychar = df1.getChar("H", 1);
boolean mybool = df1.getBoolean("I", 1);
byte[] mybytearray = df1.getBinary("J", 1);
                

Equivalently, the columns A-J can be referenced with their corresponding column index 0-9 when getting a value. When using DefaultDataFrames, all values returned by a get method are guaranteed to be non-null. However, when using NullableDataFrames, values returned by a get method might be null (or the equivalent for the underlying language).
For example, when calling the get methods on df2 instead of df1 then the returned values will be null because all entries in the row at index 1 were set to null in the second example in Sec. 3.1 Construction.

myfloat = df2.get_float("E", 1) # returns None
                
// variable types should be primitive wrapper objects when working
// with NullableDataFrames since the value returned by get methods
// might be null
Float myfloat = df2.getFloat("E", 1); // returns null
                

Setters

Analogous to get methods, you can use set methods to write individual values inside a DataFrame. Since columns are strongly typed, all DataFrames will enforce the correct type for each element in all operations, even in dynamically typed languages like Python.
For example, the following code will set a new value for each column in the DataFrame from the second example in Sec. 3.1 Construction at index 1:

df1.set_byte("A", 1, 42) # must be a Python int
df1.set_short("B", 1, 42) # must be a Python int
df1.set_int("C", 1, 42) # must be a Python int
df1.set_long("D", 1, 42) # must be a Python int
df1.set_float("E", 1, 42.123) # must be a Python float
df1.set_double("F", 1, 42.123) # must be a Python float
df1.set_string("G", 1, "Hello") # must be a Python str
df1.set_char("H", 1, "x") # must be a Python str of length 1
df1.set_boolean("I", 1, False) # must be a Python bool
df1.set_binary("J", 1, bytearray.fromhex("aabbccff")) # must be a Python bytearray
                
df1.setByte("A", 1, (byte)42);
df1.setShort("B", 1, (short)42);
df1.setInt("C", 1, 42);
df1.setLong("D", 1, 42L);
df1.setFloat("E", 1, 42.123f);
df1.setDouble("F", 1, 42.123);
df1.setString("G", 1, "Hello");
df1.setChar("H", 1, 'x');
df1.setBoolean("I", 1, false);
df1.setBinary("J", 1, new byte[]{(byte)0xaa, (byte)0xbb, (byte)0xcc, (byte)0xff});
                

Equivalently, the columns A-J can be referenced with their corresponding column index 0-9 when setting a value. When using DefaultDataFrames, the specified value must not be null. When using NullableDataFrames, the specified value may be null.
For example, when calling the set methods on df2 instead of df1 then the specified values may be null.

# set the value in the 'E' column at row index 2 to None
df2.set_float("E", 2, None)
                
// set the value in the 'E' column at row index 2 to null
df2.setFloat("E", 2, null);
                

Metrics

Because DataFrames are complex objects consisting of one or more columns, they exhibit certain properties. These properties can be queried at any time by calling the corresponding method.

Columns and Rows

The current number of columns and rows can be queried by simple method calls.

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad False
# 4| 55 aae True

print(df.columns())
# 3
print(df.rows())
# 5
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad false
// 4| 55 aae true

System.out.println(df.columns());
// 3
System.out.println(df.rows());
// 5
                

Capacity

The capacity of a DataFrame is the number of rows it can hold without requiring a resizing operation. This buffering exists so that adding, inserting and removing rows is more efficient, because copy operations do not have to be performed in every method call. The capacity is therefore the actual length of the internal array used by every Column within a particular DataFrame.
The following example illustrates how the capacity behaves when more rows are added to a DataFrame:

df = DefaultDataFrame(
        IntColumn("A", [11, 22, 33]),
        StringColumn("B", ["aaa", "aab", "aac"]),
        BooleanColumn("C", [True, True, False]))

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

print(df.capacity())
# 3
df.add_row([44, "aad", True])
print(df.capacity())
# 6
df.add_row([55, "aae", False])
print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad True
# 4| 55 aae False

print(df.capacity())
# 6
                
DataFrame df = new DefaultDataFrame(
        Column.create("A", 11, 22, 33),
        Column.create("B", "aaa", "aab", "aac"),
        Column.create("C", true, true, false));

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

System.out.println(df.capacity());
// 3
df.addRow(44, "aad", true);
System.out.println(df.capacity());
// 6
df.addRow(55, "aae", false);
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad true
// 4| 55 aae false

System.out.println(df.capacity());
// 6
                

In the above example a DataFrame with 3 columns and 3 rows is constructed. Since the arrays of each Column are specified at construction, the capacity of the created DataFrame is equal to the number of rows, i.e. there is no additional buffer present. When an additional row is added to the DataFrame, the capacity must be increased so that the row fits into the DataFrame. The concrete resizing strategy is an implementation detail of a DataFrame. In this example the capacity is doubled. When another row is added the capacity is not increased again because the underlying buffered space is large enough to hold the provided row data. Therefore, after having added the two rows, the length of the internal arrays in each Column is actually 6 even though the DataFrame has 5 rows.

The capacity of a DataFrame can be controlled via the flush() method (see Sec. 3.17.8 Flush).
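
A minimal sketch of this in Python, assuming that flush() simply trims the internal buffers so that the capacity matches the current number of rows (see Sec. 3.17.8 Flush for the exact behaviour):

# continuing the example above: the DataFrame holds 5 rows
# but has a capacity of 6 due to the resizing strategy
print(df.rows())
# 5
print(df.capacity())
# 6

# remove the unused buffer space
df.flush()
print(df.capacity())
# 5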

Column Operations

Columns can be added, set, inserted and removed at any time.

Add

New Column objects can be added to a DataFrame, which will place the specified Column at the right end and assign it the corresponding column index. When working with DefaultDataFrames, the length of the added Column must match the length of the already existing Columns. When working with NullableDataFrames, all Columns are resized if the provided Column has a different length, and missing values are set to null (a sketch of this case follows the example below). When an empty Column is provided, all column entries are set to either default values or null values, depending on the Column type.

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.add_column(FloatColumn("D", [1.0, 2.0, 3.0]))
print(df)
# _| A  B   C     D
# 0| 11 aaa True  1.0
# 1| 22 aab True  2.0
# 2| 33 aac False 3.0
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

df.addColumn(new FloatColumn("D", new float[]{1.0f, 2.0f, 3.0f}));
System.out.println(df);
// _| A  B   C     D
// 0| 11 aaa true  1.0
// 1| 22 aab true  2.0
// 2| 33 aac false 3.0
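
The null-padding behaviour described above for NullableDataFrames can be sketched as follows in Python (the column names and values are arbitrary):

df = NullableDataFrame(
        NullableIntColumn("A", [11, 22, 33]),
        NullableStringColumn("B", ["aaa", "aab", "aac"]))

# the added Column is shorter than the existing ones,
# so its missing entries are set to None
df.add_column(NullableFloatColumn("D", [1.0]))

print(df.get_float("D", 0))
# 1.0
print(df.get_float("D", 2))
# None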
                

The name by which a Column should be referenceable within a DataFrame can be explicitly set when adding a Column. Be aware that this will also override the name within the specified Column.
The following example illustrates this:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col = FloatColumn("D", [1.0, 2.0, 3.0])
print(col.get_name())
# D
df.add_column(col, name="F")
print(df)
# _| A  B   C     F
# 0| 11 aaa True  1.0
# 1| 22 aab True  2.0
# 2| 33 aac False 3.0

print(col.get_name())
# F
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col = new FloatColumn("D", new float[]{1.0f, 2.0f, 3.0f});
System.out.println(col.getName());
// D
df.addColumn("F", col);
System.out.println(df);
// _| A  B   C     F
// 0| 11 aaa true  1.0
// 1| 22 aab true  2.0
// 2| 33 aac false 3.0

System.out.println(col.getName());
// F
                

Insert

Column objects can be inserted at a specific position (column index) within a DataFrame. The Column at the specified index and all Columns to the right of that position are shifted to the right. Therefore, all Columns to the right of the specified index will be referenceable by their original column index incremented by 1. The Column names are not affected by this operation.
The following example shows the insertion of a Column:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

# insert a new column
df.insert_column(1, FloatColumn("D"))
print(df)
# _| A  D   B   C
# 0| 11 0.0 aaa True
# 1| 22 0.0 aab True
# 2| 33 0.0 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

// insert a new column
df.insertColumn(1, new FloatColumn("D"));
System.out.println(df);
// _| A  D   B   C
// 0| 11 0.0 aaa true
// 1| 22 0.0 aab true
// 2| 33 0.0 aac false
                

In the above example the inserted FloatColumn is originally empty. Because the DataFrame already has 3 rows, the inserted Column is resized and the missing values are replaced with the default value of the FloatColumn.
The name of the inserted Column can be explicitly set when inserting:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col = FloatColumn("D")
print(col.get_name())
# D
df.insert_column(1, col, name="F")
print(df)
# _| A  F   B   C
# 0| 11 0.0 aaa True
# 1| 22 0.0 aab True
# 2| 33 0.0 aac False

print(col.get_name())
# F
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col = new FloatColumn("D");
System.out.println(col.getName());
// D
df.insertColumn(1, "F", col);
System.out.println(df);
// _| A  F   B   C
// 0| 11 0.0 aaa true
// 1| 22 0.0 aab true
// 2| 33 0.0 aac false

System.out.println(col.getName());
// F
                

Remove

Column instances can be removed from a DataFrame in three ways: by column index, by column name and by object reference.
The corresponding method returns the removed Column instance when the Column argument is specified as an index or name:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col = df.remove_column("B") # returns the removed column
# or equivalently:
# col = df.remove_column(1)

print(df)
# _| A  C
# 0| 11 True
# 1| 22 True
# 2| 33 False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col = df.removeColumn("B"); // returns the removed column
// or equivalently:
// Column col = df.removeColumn(1);

System.out.println(df);
// _| A  C
// 0| 11 true
// 1| 22 true
// 2| 33 false
                

When the argument is specified as a Column instance, the corresponding method returns a boolean value which indicates whether the specified Column was successfully removed. If the specified Column is not part of the underlying DataFrame, then the method call has no effect and a boolean value of false is returned.
For example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col1 = df.get_column("B")
col2 = IntColumn("F")

val = df.remove_column(col1) # returns a bool
print(val)
# True

val = df.remove_column(col2)
print(val)
# False

print(df)
# _| A  C
# 0| 11 True
# 1| 22 True
# 2| 33 False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col1 = df.getColumn("B");
Column col2 = new IntColumn("F");

boolean val = df.removeColumn(col1);
System.out.println(val);
// true 

val = df.removeColumn(col2);
System.out.println(val);
// false

System.out.println(df);
// _| A  C
// 0| 11 true
// 1| 22 true
// 2| 33 false
                

Set Columns

Columns can be explicitly set. If the specified Column is not already part of the underlying DataFrame, then the behaviour is equivalent to adding the Column to the DataFrame. On the other hand, if the specified Column is already present within the underlying DataFrame, then the present Column will be replaced by the specified instance.

For example, a particular column inside a DataFrame can be replaced by another Column instance:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.set_column("B", FloatColumn())

print(df)
# _| A  B   C
# 0| 11 0.0 True
# 1| 22 0.0 True
# 2| 33 0.0 False

df.set_column(2, IntColumn())

print(df)
# _| A  B   C
# 0| 11 0.0 0
# 1| 22 0.0 0
# 2| 33 0.0 0
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

df.setColumn("B", new FloatColumn());

System.out.println(df);
// _| A  B   C
// 0| 11 0.0 true
// 1| 22 0.0 true
// 2| 33 0.0 false

df.setColumn(2, new IntColumn());

System.out.println(df);
// _| A  B   C
// 0| 11 0.0 0
// 1| 22 0.0 0
// 2| 33 0.0 0
                

The above example shows that the type of the Column specified as the method argument does not necessarily have to be equal to the type of the Column inside of the DataFrame.

If the specified column name is not present in the underlying DataFrame, then the argument Column is effectively added to the DataFrame:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.set_column("D", FloatColumn())

print(df)
# _| A  B   C     D
# 0| 11 aaa True  0.0
# 1| 22 aab True  0.0
# 2| 33 aac False 0.0
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

df.setColumn("D", new FloatColumn());

System.out.println(df);
// _| A  B   C     D
// 0| 11 aaa true  0.0
// 1| 22 aab true  0.0
// 2| 33 aac false 0.0
                

Get Columns

The Column objects themselves can be referenced both individually and as a group. Accessing one individual Column will simply provide a reference to a particular Column instance.
For example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col = df.get_column("B")
print(col)
# <raven.struct.dataframe.stringcolumn.StringColumn object at 0x7f9334c2f7f0>
print(col.get_value(1))
# aab

col = df.get_column(2)
print(col)
# <raven.struct.dataframe.booleancolumn.BooleanColumn object at 0x7f9334d231f0>
print(col.get_value(1))
# True
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col = df.getColumn("B");
System.out.println(col);
// com.raven.common.struct.StringColumn@5b3e8e3
System.out.println(col.getValue(1));
// aab

col = df.getColumn(2);
System.out.println(col);
// com.raven.common.struct.BooleanColumn@131b97
System.out.println(col.getValue(1));
// true
                

Alternatively, the DataFrame API provides a method to get multiple Columns bound together in a new DataFrame instance of the same type. The Columns can be selected by index, name or type.
The following example illustrates this:

print(df)
# _| A  B   C     D  E    F
# 0| 11 aaa True  44 11.0 bba
# 1| 22 aab True  55 12.0 bbb
# 2| 33 aac False 66 13.0 bbc

# get columns by names
df2 = df.get_columns(cols=("B", "F", "A"))
print(df2)
# _| B   F   A
# 0| aaa bba 11
# 1| aab bbb 22
# 2| aac bbc 33

# get columns by indices
df2 = df.get_columns(cols=(1, 2, 4))
print(df2)
# _| B   C     E
# 0| aaa True  11.0
# 1| aab True  12.0
# 2| aac False 13.0

# get columns by types
df2 = df.get_columns(types=("int", "float"))
print(df2)
# _| A  D  E
# 0| 11 44 11.0
# 1| 22 55 12.0
# 2| 33 66 13.0
                
System.out.println(df);
// _| A  B   C     D  E    F
// 0| 11 aaa true  44 11.0 bba
// 1| 22 aab true  55 12.0 bbb
// 2| 33 aac false 66 13.0 bbc

// get columns by names
DataFrame df2 = df.getColumns("B", "F", "A");
System.out.println(df2);
// _| B   F   A
// 0| aaa bba 11
// 1| aab bbb 22
// 2| aac bbc 33

// get columns by indices
df2 = df.getColumns(1, 2, 4);
System.out.println(df2);
// _| B   C     E
// 0| aaa true  11.0
// 1| aab true  12.0
// 2| aac false 13.0

// get columns by types
df2 = df.getColumns(Integer.class, Float.class);
System.out.println(df2);
// _| A  D  E
// 0| 11 44 11.0
// 1| 22 55 12.0
// 2| 33 66 13.0
                

The order of the arguments defines the order of the Columns in the returned DataFrame. One important thing to note is that all Columns are only added to the returned DataFrame by reference, i.e. selecting columns does not perform any copy operations. If you want a truly independent DataFrame as a result, you must explicitly copy the returned DataFrame.
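
The reference semantics can be demonstrated with a short sketch that continues with the DataFrame from the previous example: changing a value through the selected Columns is also visible in the original DataFrame.

df2 = df.get_columns(cols=("B", "F", "A"))

# both DataFrames share the same Column instances
df2.set_string("B", 0, "zzz")

print(df.get_string("B", 0))
# zzz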

Direct Access

Columns inside a DataFrame can be accessed as shown in the previous sections. As a Column is simply a container for data, it provides methods for reading and writing individual values directly. This can become useful when writing highly optimized code because certain access checks performed by DataFrames are then omitted.

Warning:
Directly accessing and manipulating values inside a Column instance is a low-level operation. Column instances are allowed to throw exceptions other than DataFrameException!

The following example shows how to get a reference to a Column instance and set a specific value directly:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

col = df.get_column("B")

col.set_value(1, "New Value")

print(df)
# _| A  B         C
# 0| 11 aaa       True
# 1| 22 New Value True
# 2| 33 aac       False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Column col = df.getColumn("B");

col.setValue(1, "New Value");

System.out.println(df);
// _| A  B         C
// 0| 11 aaa       true
// 1| 22 New Value true
// 2| 33 aac       false
                

You can even get a reference to the internal array of the underlying Column:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

# get the numpy array of column A (may include a capacity buffer)
array = df.get_column("A").as_array()

print(array)
# [11 22 33]

array[1] = 42

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 42 aab True
# 2| 33 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

// get the internal array object (may include a capacity buffer)
Column col = df.getColumn("A");
int[] array = ((IntColumn)col).asArray();

System.out.println(Arrays.toString(array));
// [11, 22, 33]

array[1] = 42;

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 42 aab true
// 2| 33 aac false
                
Warning:
No access checks by the underlying DataFrame are performed when handling internal arrays directly. Misuse of direct array access might lead to an invalid DataFrame state!

Column Iteration

DataFrames can be iterated over in multiple ways. The following example shows how to iterate over all columns within a DataFrame:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

# with a for-each loop:
for col in df:
    print(col.get_name())

# A
# B
# C

# with a classic range loop:
for i in range(df.columns()):
    print(df.get_column(i).get_name())

# A
# B
# C
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

// with a for-each loop:
for(Column col : df){
    System.out.println(col.getName());
}

// A
// B
// C

// with a classic c-style loop:
for(int i=0; i<df.columns(); ++i){
    System.out.println(df.getColumn(i).getName());
}

// A
// B
// C
                

Column Names

Column names can be queried and set, for individual Columns or the entire DataFrame.

Get, Set and Remove all Column Names

The DataFrame API provides methods to query and manipulate column names for the entire DataFrame at once. All column names are represented as strings. Please note that column names must never be specified as null or empty strings. Therefore, any valid column name consists of at least one character.
The following example shows how to get and set all column names at once:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

names = df.get_column_names() # returns a list

print(names)
# ['A', 'B', 'C']

df.set_column_names(["X", "Y", "Z"])

print(df)
# _| X  Y   Z
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

String[] names = df.getColumnNames();

System.out.println(Arrays.toString(names));
// [A, B, C]

df.setColumnNames("X", "Y", "Z");

System.out.println(df);
// _| X  Y   Z
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
                

Since a column name argument cannot be specified as null or empty, the DataFrame API provides methods to remove all column names and to explicitly query whether column names are set.
The following example illustrates this:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

val = df.has_column_names() # returns a bool

print(val)
# True

df.remove_column_names()

val = df.has_column_names()

print(val)
# False

print(df)
# _| 0  1   2
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

boolean val = df.hasColumnNames();

System.out.println(val);
// true

df.removeColumnNames();

val = df.hasColumnNames();

System.out.println(val);
// false

System.out.println(df);
// _| 0  1   2
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
                

The above example shows that columns without a name are represented with their column index when converting a DataFrame to a string representation.

Individual Columns

Column name operations can also be performed for individual Columns. The following example shows how to query and set names of individual Columns:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

name = df.get_column_name(1) # returns a str

print(name)
# 'B'

df.set_column_name(1, "NewName")

name = df.get_column_name(1)

print(name)
# 'NewName'

df.set_column_name("C", "OtherName")

name = df.get_column_name(2)

print(name)
# 'OtherName'

print(df)
# _| A  NewName OtherName
# 0| 11 aaa      True
# 1| 22 aab      True
# 2| 33 aac      False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

String name = df.getColumnName(1);

System.out.println(name);
// B

df.setColumnName(1, "NewName");

name = df.getColumnName(1);

System.out.println(name);
// NewName

df.setColumnName("C", "OtherName");

name = df.getColumnName(2);

System.out.println(name);
// OtherName

System.out.println(df);
// _| A  NewName OtherName
// 0| 11 aaa      true
// 1| 22 aab      true
// 2| 33 aac      false
                

The column index at which an individual Column is located inside a DataFrame can also be queried, as shown in the following example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

index = df.get_column_index("A") # returns an int

print(index)
# 0

index = df.get_column_index("B")

print(index)
# 1
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

int index = df.getColumnIndex("A");

System.out.println(index);
// 0

index = df.getColumnIndex("B");

System.out.println(index);
// 1
                

The existence of a Column with a particular name can be queried with a separate method:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

val = df.has_column("B") # returns a bool

print(val)
# True

val = df.has_column("Data")

print(val)
# False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

boolean val = df.hasColumn("B");

System.out.println(val);
// true

val = df.hasColumn("Data");

System.out.println(val);
// false
                

Row Operations

This section describes how to add, insert, set and remove rows inside a DataFrame. All rows can also be queried in various ways. Even when you think of a DataFrame as a collection of rows, it is important to understand that row values are not stored in memory as row objects by a DataFrame, but rather as column values. Therefore, a row can simply be seen as the array of values inside all Columns at a specific row index. The values in a row can be heterogeneous, i.e. each value has the type of the Column that the corresponding row item is part of.

Add

Rows can be added to a DataFrame. This might entail a resizing operation of all Columns if the underlying DataFrame has no free capacity to store the provided row items inside the corresponding Columns (see Sec. 3.3.2 Capacity).
The following example shows how to add a single row to a DataFrame:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.add_row([44, "aad", True])

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad True

# this would fail because of a wrong type:
#              v
# df.add_row([55.5, "aae", False])

# this would also fail because the row is too long:
#                                     v
# df.add_row([55, "aae", False, "anotherItem"])
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

df.addRow(44, "aad", true);

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad true

// this would fail because of a wrong type:
//             v
// df.addRow(55.5f, "aae", false);

// this would also fail because the row is too long:
//                                   v
// df.addRow(55, "aae", false, "anotherItem");
                

As seen in the above example, all row item types must exactly match the element type of each corresponding Column. Row item types are not automatically converted. This also means, for example, that you cannot specify a row item as the number 300 if the Column at the index of the row item is a ByteColumn, because a byte has only a valid range of [-128, +127].
Additionally, the length of the specified row must match the number of Columns within the DataFrame such that every row item has a corresponding Column to which it is added.
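
As a short sketch of these restrictions in Python (the concrete exception type is an implementation detail, so a generic exception is caught here):

df = DefaultDataFrame(
        ByteColumn("A", [10, 11, 12]),
        StringColumn("B", ["aaa", "aab", "aac"]))

# 300 is outside the valid byte range [-128, 127],
# so the row is rejected with a runtime exception
try:
    df.add_row([300, "aad"])
except Exception as e:
    print("invalid row item:", e)

# a row of the wrong length is rejected as well
try:
    df.add_row([13, "aae", "extraItem"])
except Exception as e:
    print("invalid row length:", e)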

Alternatively, rows can also be added from another DataFrame instance directly. The following example illustrates this:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

print(df2)
# _| A  B   C
# 0| 97 cca True
# 1| 98 ccb False
# 2| 99 ccc True

df.add_rows(df2)

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 97 cca True
# 4| 98 ccb False
# 5| 99 ccc True
            
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

System.out.println(df2);
// _| A  B   C
// 0| 97 cca true
// 1| 98 ccb false
// 2| 99 ccc true

df.addRows(df2);

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 97 cca true
// 4| 98 ccb false
// 5| 99 ccc true
                

Insert

Rows can be inserted into a DataFrame at a specific row index. The same restrictions as for row additions apply. The following example shows how to insert rows into a DataFrame:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.insert_row(1, [42, "AAA", False])

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 42 AAA False
# 2| 22 aab True
# 3| 33 aac False

df.insert_row(0, [99, "BBB", False])

print(df)
# _| A  B   C
# 0| 99 BBB False
# 1| 11 aaa True
# 2| 42 AAA False
# 3| 22 aab True
# 4| 33 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

// the first argument specifies the row index
df.insertRow(1, 42, "AAA", false);

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 42 AAA false
// 2| 22 aab true
// 3| 33 aac false

df.insertRow(0, 99, "BBB", false);

System.out.println(df);
// _| A  B   C
// 0| 99 BBB false
// 1| 11 aaa true
// 2| 42 AAA false
// 3| 22 aab true
// 4| 33 aac false
                

Remove

Rows can be removed in two ways. You can either specify a range of row indices which removes all rows within that range, or you can specify a column and regular expression which removes all rows that match the regex in the specified Column.

How to remove rows in a given range is shown in the following example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad False
# 4| 55 aae True
# 5| 66 aaf False
# 6| 77 aag True

df.remove_rows(from_index=1, to_index=5)

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 66 aaf False
# 2| 77 aag True
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad false
// 4| 55 aae true
// 5| 66 aaf false
// 6| 77 aag true

df.removeRows(1, 5);

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 66 aaf false
// 2| 77 aag true
                

How to remove rows that match specific values in a given Column is shown in the following example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad False
# 4| 55 aae True
# 5| 66 aaf False
# 6| 77 aag True

df.remove_rows("B", "aa[b-e]")

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 66 aaf False
# 2| 77 aag True
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad false
// 4| 55 aae true
// 5| 66 aaf false
// 6| 77 aag true

df.removeRows("B", "aa[b-e]");

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 66 aaf false
// 2| 77 aag true
                

Set Rows

Rows at specific indices can be set directly. The following example illustrates this:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

df.set_row(1, [42, "AAA", False])

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 42 AAA False
# 2| 33 aac False
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

df.setRow(1, 42, "AAA", false);

System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 42 AAA false
// 2| 33 aac false
                

Get Rows

You can get rows from a DataFrame in two ways. You can either select a single row by its row index or select multiple rows by a range of row indices.

How to get a row at a specific row index is shown in the following example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

row = df.get_row(1) #returns a list

print(row)
# [22, "aab", True]
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

Object[] row = df.getRow(1);

System.out.println(Arrays.toString(row));
// [22, aab, true]
                

How to get multiple rows, bound together in a DataFrame with the same column structure, is shown in the following example:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False
# 3| 44 aad False
# 4| 55 aae True
# 5| 66 aaf False
# 6| 77 aag True

df2 = df.get_rows(2, 5) # returns a DataFrame

print(df2)
# _| A  B   C
# 0| 33 aac False
# 1| 44 aad False
# 2| 55 aae True
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false
// 3| 44 aad false
// 4| 55 aae true
// 5| 66 aaf false
// 6| 77 aag true

DataFrame df2 = df.getRows(2, 5);

System.out.println(df2);
// _| A  B   C
// 0| 33 aac false
// 1| 44 aad false
// 2| 55 aae true
                

Row Iteration

DataFrames can be iterated over in multiple ways. The following example shows how to iterate over all rows within a DataFrame:

print(df)
# _| A  B   C
# 0| 11 aaa True
# 1| 22 aab True
# 2| 33 aac False

for i in range(df.rows()):
    print(df.get_row(i))

# [11, 'aaa', True]
# [22, 'aab', True]
# [33, 'aac', False]
                
System.out.println(df);
// _| A  B   C
// 0| 11 aaa true
// 1| 22 aab true
// 2| 33 aac false

for(int i=0; i<df.rows(); ++i){
    System.out.println(Arrays.toString(df.getRow(i)));
}

// [11, aaa, true]
// [22, aab, true]
// [33, aac, false]
                

Search Operations

Specific elements can be searched for. The elements can be specified as regular expressions to allow easy pattern matching without the need to write complex multi-line conditional code. Searching is always done with respect to one particular Column. You may search either for a single element or for all elements within a specified Column that match a given regular expression. Since the condition to match elements against is specified as a regular expression, the search term is always a string, even when searching for a specific number.

Search for a Single Element

The following example shows how to search for a single element in a specific Column:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

index = df.index_of("name", "Paul") #returns an int
print(index)
# 4

index = df.index_of("name", "Steven")
print(index)
# -1

index = df.index_of("age", "2\\d")
print(index)
# 2
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int index = df.indexOf("name", "Paul");
System.out.println(index);
// 4

index = df.indexOf("name", "Steven");
System.out.println(index);
// -1

index = df.indexOf("age", "2\\d");
System.out.println(index);
// 2
                

The above example shows that the indexOf() method simply returns the index of the first element that matches the specified search term, or -1 if no element in the specified Column matches the search term.

Optionally, you can specify a row index from which to start searching. This will effectively ignore all column values prior to that index in the search operation. The following example illustrates how to do that:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

index = df.index_of("age", "2\\d", start_from=3)
print(index)
# 4

index = df.index_of("name", "Bob", start_from=2)
print(index)
# -1
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int index = df.indexOf("age", 3, "2\\d");
System.out.println(index);
// 4

index = df.indexOf("name", 2, "Bob");
System.out.println(index);
// -1
                

Search for Multiple Elements

If you want to get the row indices of all elements that match the given search term, then the DataFrame API provides an alternative method for that purpose.

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

indices = df.index_of_all("age", "2\\d") #returns a list of int
print(indices)
# [2, 4, 5]

indices = df.index_of_all("active", "True")
print(indices)
# [0, 2, 3, 4]

indices = df.index_of_all("group", "F")
print(indices)
# []
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int[] indices = df.indexOfAll("age", "2\\d");
System.out.println(Arrays.toString(indices));
// [2, 4, 5]

indices = df.indexOfAll("active", "true");
System.out.println(Arrays.toString(indices));
// [0, 2, 3, 4]

indices = df.indexOfAll("group", "F");
System.out.println(Arrays.toString(indices));
// []
                

This operation is useful when you don't just care about the elements you are searching for but rather the row indices that they are located at. This can then easily be used to do something with other elements in those rows.

For example, the following code shows how to print the name of all people who are in their twenties:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

for index in df.index_of_all("age", "2\\d"):
    print(df.get_string("name", index))

# Mark
# Paul
# Simon
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

for(int index : df.indexOfAll("age", "2\\d")){
    System.out.println(df.getString("name", index));
}

// Mark
// Paul
// Simon
                

Since the indexOfAll() method never returns null but rather an empty array/list if there are no matches, the code from the above example can be safely used even in such a situation.

Filter Operations

The DataFrame API provides various methods to filter the content of any DataFrame. Generally, a filter operation acts on all rows that have a matching element in a specific Column. There are two ways you can treat matched rows: either retain or discard them. Therefore, you can use filter operations to specifically keep certain rows in a DataFrame or on the other hand specifically remove certain rows. There are two separate functions for each filter operation mode: one returns the result of the filter operation as a new DataFrame and leaves the original DataFrame unchanged, and the other directly changes the DataFrame that the filter operation is called upon. Additionally, it is also possible to retain a certain number of first or last rows.

The following table gives a summary over the behaviour of all filter operations:

Function   Operation                 Mode
filter()   Retains matching rows     Returns a new DataFrame
drop()     Discards matching rows    Returns a new DataFrame
include()  Retains matching rows     Directly changes the DataFrame
exclude()  Discards matching rows    Directly changes the DataFrame
head()     Retains the first n rows  Returns a new DataFrame
tail()     Retains the last n rows   Returns a new DataFrame


Please note that all filter operations return a DataFrame instance, regardless of the operation mode. That is, even the include() and exclude() functions return a DataFrame (the same instance that the function was called upon). In this way, you can arbitrarily combine filter operations simply by chaining function calls to get the desired result.
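
As a minimal illustrative sketch (using the people DataFrame from the examples below), such a chain could look as follows. Since include() and exclude() return the instance they are called upon, the calls can be strung together:

// illustrative sketch: each call returns the same DataFrame instance,
// so the filter steps can be chained in a single statement
df.include("active", "true")   // keep only active people
  .exclude("group", "B")       // then discard everyone in group B
  .include("age", "[23]\\d");  // finally keep ages 20 to 39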

Filter

The filter() function retains all rows that have a matching element in the specified Column. The result of the operation is returned as a new independent DataFrame instance.
The following example shows how to use the filter() function:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

res = df.filter("age", "2\\d") # returns a DataFrame

print(res)
# _| name  age active group
# 0| Mark  25  True   C
# 1| Paul  29  True   A
# 2| Simon 21  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame res = df.filter("age", "2\\d");

System.out.println(res);
// _| name  age active group
// 0| Mark  25  true   C
// 1| Paul  29  true   A
// 2| Simon 21  false  B
                

In the above example, a DataFrame containing attributes of some random people is filtered to only hold all people whose age is in the interval [20, 29]. The result of the computation is returned as a new DataFrame instance. The DataFrame instance that the filter() function was called upon is not changed.

Drop

The drop() function discards all rows that have a matching element in the specified Column. The result of the operation is returned as a new independent DataFrame instance.
The following example shows how to use the drop() function:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

res = df.drop("age", "2\\d") # returns a DataFrame

print(res)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Sofia 31  True   B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame res = df.drop("age", "2\\d");

System.out.println(res);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Sofia 31  true   B
                

In the above example, a DataFrame containing attributes of some random people is filtered to remove all people whose age is in the interval [20, 29]. The result of the computation is returned as a new DataFrame instance. The DataFrame instance that the drop() function was called upon is not changed.

Include

The include() function retains all rows that have a matching element in the specified Column. The operation is directly performed on the DataFrame instance.
The following example shows how to use the include() function:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

df.include("active", "True")

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Mark  25  True   C
# 2| Sofia 31  True   B
# 3| Paul  29  True   A
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

df.include("active", "true");

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Mark  25  true   C
// 2| Sofia 31  true   B
// 3| Paul  29  true   A
                

In the above example, a DataFrame containing attributes of some random people is filtered to only hold people whose active attribute is true. The computation is directly performed on the DataFrame instance. Therefore, the DataFrame instance that the include() function was called upon is changed.

Exclude

The exclude() function removes all rows that have a matching element in the specified Column. The operation is directly performed on the DataFrame instance.
The following example shows how to use the exclude() function:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

df.exclude("active", "True")

print(df)
# _| name  age active group
# 0| Bob   36  False  B
# 1| Simon 21  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

df.exclude("active", "true");

System.out.println(df);
// _| name  age active group
// 0| Bob   36  false  B
// 1| Simon 21  false  B
                

In the above example, a DataFrame containing attributes of some random people is filtered to discard all people whose active attribute is true. The computation is directly performed on the DataFrame instance. Therefore, the DataFrame instance that the exclude() function was called upon is changed.

Head and Tail

The head() function simply returns the first n rows inside the DataFrame whereas the tail() function returns the last n rows. The result of the computation is returned as a new independent DataFrame instance.
The following example illustrates this:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

first3 = df.head(3) # returns a DataFrame
last3 = df.tail(3) # returns a DataFrame

print(first3)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C

print(last3)
# _| name  age active group
# 0| Sofia 31  True   B
# 1| Paul  29  True   A
# 2| Simon 21  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame first3 = df.head(3);
DataFrame last3 = df.tail(3);

System.out.println(first3);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C

System.out.println(last3);
// _| name  age active group
// 0| Sofia 31  true   B
// 1| Paul  29  true   A
// 2| Simon 21  false  B
                

Value Replacement

The DataFrame API provides various ways to change and replace column values. Instead of setting specific values inside a column directly, you can use the replace() method to perform a bulk replacement. This involves three parts: the column to replace values in, an optional condition and the replacement. As always, the Column to replace values in can be specified either by index or by name. The condition is specified as a regular expression that all values to be replaced must match. Therefore, any value that does not match the specified regex is not changed inside the specified Column. The replacement can be specified either as a constant or as a function. The replace() method returns an integer which indicates how many values were replaced by the operation, i.e. the number of values that matched the specified condition and were therefore replaced by the specified value.

Conditional Replacement

In the simplest case, you can use the replace() method to change all values that match a given regex to some other value.
For example:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

replaced = df.replace("group", "A", "F") # returns an int

print(replaced)
# 2

print(df)
# _| name  age active group
# 0| Bill  34  True   F
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   F
# 5| Simon 21  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int replaced = df.replace("group", "A", 'F');

System.out.println(replaced);
// 2

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   F
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   F
// 5| Simon 21  false  B
                

As shown in the above example, the replace() method returns the number of replaced values. Consequently, if nothing matches the given condition, the replace() method returns 0 which indicates that nothing in the DataFrame was changed.
Be aware that the type of the replacement value must be equal to the element type of the underlying Column.

Replacement Function

As an alternative to a plain constant value, you can also use a replacement function. The easiest way to do that is to use a lambda expression.
The following example demonstrates how to use lambda expressions as a replacement function:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

# increase everyone's age by 1
replaced = df.replace("age", replacement=lambda v: v + 1)

print(replaced)
# 6

# increase the age of people in their twenties by 3
replaced = df.replace("age", "2\\d", lambda v: v + 3)

print(replaced)
# 2

print(df)
# _| name  age active group
# 0| Bill  35  True   A
# 1| Bob   37  False  B
# 2| Mark  29  True   C
# 3| Sofia 32  True   B
# 4| Paul  30  True   A
# 5| Simon 25  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

// increase everyone's age by 1
int replaced = df.replace("age", (Short v) -> (short)(v + 1));

System.out.println(replaced);
// 6

// increase the age of people in their twenties by 3
replaced = df.replace("age", "2\\d", (Short v) -> (short)(v + 3));

System.out.println(replaced);
// 2

System.out.println(df);
// _| name  age active group
// 0| Bill  35  true   A
// 1| Bob   37  false  B
// 2| Mark  29  true   C
// 3| Sofia 32  true   B
// 4| Paul  30  true   A
// 5| Simon 25  false  B
                

As shown in the above example, the type of the value returned by the replacement function must be equal to the element type of the underlying Column, e.g. if the age Column is modeled as a ShortColumn, then the replacement function must return a valid short value.

Optionally, if you need to know the row index of each value you are trying to replace, you can simply adjust the lambda expression to include the row index as an int.
For example, the following code only increases the age of a person if they are in their twenties and in group B:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

r = df.replace("age", "2\\d", lambda i, v: v + 1 if df.get_char("group", i) == "B" else v)

print(r)
# 1

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 22  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int r = df.replace("age", "2\\d", (int i, Short v) -> df.getChar("group", i) == 'B' ? (short)(v + 1) : v);

System.out.println(r);
// 1

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 22  false  B
                

The above example shows that if a Column value should not be changed (because it does not meet a specific condition), then you can simply return the original Column value (i.e. the replacement function parameter).
Of course, you can also move the age condition into the lambda expression if you want.

Alternatively, for more complex code, you could also define the replacement function somewhere else and then pass it to the replace method as an argument.
The above example could therefore be rewritten as:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

def my_fn(index, value):
    if df.get_char("group", index) == "B":
        return value + 1
    else:
        return value

df.replace("age", "2\\d", my_fn)

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 22  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

IndexedValueReplacement<Short> myFn = new IndexedValueReplacement<Short>(){
    @Override
    public Short replace(int index, Short value){
        if(df.getChar("group", index) == 'B'){
            return (short)(value + 1);
        }else{
            return value;
        }
    }
};

df.replace("age", "2\\d", myFn);

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 22  false  B
                

Replace Columns

Another way to replace values is to replace an entire Column with a Column from another DataFrame. Columns can be matched both by index and by name. Columns that cannot be matched are simply ignored.
The following example shows how to replace Columns by Columns from another DataFrame:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

print(df2)
# _| level active group
# 0| 9     False  C
# 1| 8     True   C
# 2| 7     False  B
# 3| 6     False  A
# 4| 7     True   B
# 5| 5     True   B

replaced = df.replace(df=df2) # returns an int

print(replaced)
# 2

print(df)
# _| name  age active group
# 0| Bill  34  False  C
# 1| Bob   36  True   C
# 2| Mark  25  False  B
# 3| Sofia 31  False  A
# 4| Paul  29  True   B
# 5| Simon 21  True   B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

System.out.println(df2);
// _| level active group
// 0| 9     false  C
// 1| 8     true   C
// 2| 7     false  B
// 3| 6     false  A
// 4| 7     true   B
// 5| 5     true   B

int replaced = df.replace(df2);

System.out.println(replaced);
// 2

System.out.println(df);
// _| name  age active group
// 0| Bill  34  false  C
// 1| Bob   36  true   C
// 2| Mark  25  false  B
// 3| Sofia 31  false  A
// 4| Paul  29  true   B
// 5| Simon 21  true   B
                

The above example shows that the level Column in the DataFrame argument passed to the replace() method is ignored in the operation. The int value returned by the replace() method indicates the number of replaced Columns.

Replace Categories with Factors

StringColumns often don't hold entirely unique data points but rather recurring designations for a particular attribute. Such data points are called categories. Categories are more convenient for humans to read but it is not easy to do numerical computations with them when they are represented as strings. Categories can be replaced by so-called factors. A factor is simply a numerical representation of a category. Therefore, replacing categories by their factors unambiguously maps every category to a unique number. This process is a simple way of encoding non-numerical values as integer numbers.

The following example shows how to replace the character categories inside a CharColumn with factors:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

print(df.get_column("group").type_name())
# char

cat_map = df.factor("group") # returns a dict

print(cat_map)
# {'A': 1, 'B': 2, 'C': 3}

print(df)
# _| name  age active group
# 0| Bill  34  True   1
# 1| Bob   36  False  2
# 2| Mark  25  True   3
# 3| Sofia 31  True   2
# 4| Paul  29  True   1
# 5| Simon 21  False  2

print(df.get_column("group").type_name())
# int
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

System.out.println(df.getColumn("group").typeName());
// char

Map<Object, Integer> catMap = df.factor("group");

System.out.println(catMap);
// {A=1, B=2, C=3}

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   1
// 1| Bob   36  false  2
// 2| Mark  25  true   3
// 3| Sofia 31  true   2
// 4| Paul  29  true   1
// 5| Simon 21  false  2

System.out.println(df.getColumn("group").typeName());
// int
                

As shown in the above example, the factor() method returns a map describing how each found category was replaced by its factor. It also shows how the group Column has been converted to an IntColumn. Please note that a DataFrame does not keep track of any category-factor maps, so if you later want to replace the factors with their corresponding categories again, you must keep a reference to the map object returned by the factor() method.
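
As a minimal sketch, the returned map can for instance be used to look up the factor of a category later on, e.g. to filter by the original category after factorization (the variable names follow the Java example above):

// minimal sketch: translate a category into its factor via the returned map
int factorB = catMap.get('B'); // 2 in the example above
DataFrame groupB = df.filter("group", String.valueOf(factorB));
System.out.println(groupB); // all rows that belonged to category 'B'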

Conversion

The DataFrame API provides methods and utilities to perform various conversions. You may convert specific Columns or an entire DataFrame at once. Converting Columns changes their (element) type. Such a conversion is not in-place because Column instances are immutable with respect to their type. That is, a Column conversion creates a new Column instance of the desired type and sets all of its values to the converted values of the underlying Column. The same principle applies to DataFrame conversions which can convert between DataFrame implementations.

Convert Columns

Every Column can be converted to any other Column type. The conversion of a Column may throw a DataFrameException if a column value cannot be meaningfully converted to the target type. For example, a conversion from a StringColumn to an IntColumn may fail if an encountered value is a non-numeric string, e.g. 'abcd'.
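
A minimal sketch of handling such a failure could look as follows, assuming that df holds a StringColumn named "C" with values like "Yes" and "No" (as in the example below) and that the thrown DataFrameException can be caught at the call site:

// hedged sketch: converting non-numeric strings to an int type is expected to fail
try {
    df.convert("C", IntColumn.TYPE_CODE);
} catch(DataFrameException ex) {
    System.out.println("Conversion failed: " + ex.getMessage());
}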

The following example shows the results of various Column conversions:

print(df)
# _| A    B  C
# 0| 11.1 42 Yes
# 1| 22.2 43 No
# 2| 33.3 44 1

print(df.info())
# Type:    Default
# Columns: 3
# Rows:    3
# _| column type   code
# 0| A      double 7
# 1| B      int    3
# 2| C      string 5

df.convert("A", "int")
df.convert("B", "double")
df.convert("C", "boolean")

print(df)
# _| A  B    C
# 0| 11 42.0 True
# 1| 22 43.0 False
# 2| 33 44.0 True

print(df.info())
# Type:    Default
# Columns: 3
# Rows:    3
# _| column type    code
# 0| A      int     3
# 1| B      double  7
# 2| C      boolean 9
                
System.out.println(df);
// _| A    B  C
// 0| 11.1 42 Yes
// 1| 22.2 43 No
// 2| 33.3 44 1

System.out.println(df.info());
// Type:    Default
// Columns: 3
// Rows:    3
// _| column type   code
// 0| A      double 7
// 1| B      int    3
// 2| C      string 5

df.convert("A", IntColumn.TYPE_CODE);
df.convert("B", DoubleColumn.TYPE_CODE);
df.convert("C", BooleanColumn.TYPE_CODE);

System.out.println(df);
// _| A  B    C
// 0| 11 42.0 true
// 1| 22 43.0 false
// 2| 33 44.0 true

System.out.println(df.info());
// Type:    Default
// Columns: 3
// Rows:    3
// _| column type    code
// 0| A      int     3
// 1| B      double  7
// 2| C      boolean 9
                

Convert DataFrames

A DataFrame can be converted to a different DataFrame implementation at runtime. That is, a DefaultDataFrame can be converted to a NullableDataFrame and a NullableDataFrame to a DefaultDataFrame. In the first case, the actual values in each Column are not changed. The reverse direction, however, is not lossless. Converting a NullableDataFrame to a DefaultDataFrame can result in a loss of information since all null values in each Column are converted to the corresponding default value of the underlying Column.

The following example illustrates the conversion of a NullableDataFrame to a DefaultDataFrame:

print(df)
# _| name  age  active group
# 0| Bill  34   True   A
# 1| Bob   36   False  B
# 2| Mark  25   True   C
# 3| null  null null   null
# 4| null  null null   null
# 5| Simon 21   False  B

print(df.is_nullable())
# True

df = DataFrame.convert_to(df, "default")

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| n/a   0   False  ?
# 4| n/a   0   False  ?
# 5| Simon 21  False  B

print(df.is_nullable())
# False
                
System.out.println(df);
// _| name  age  active group
// 0| Bill  34   true   A
// 1| Bob   36   false  B
// 2| Mark  25   true   C
// 3| null  null null   null
// 4| null  null null   null
// 5| Simon 21   false  B

System.out.println(df.isNullable());
// true

df = DataFrame.convert(df, DefaultDataFrame.class);

System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| n/a   0   false  ?
// 4| n/a   0   false  ?
// 5| Simon 21  false  B

System.out.println(df.isNullable());
// false
                

The above example shows how the values in the rows at indices 3 and 4 are converted to the corresponding default value of the converted Column. It also illustrates that if the DefaultDataFrame in the above example were converted back to a NullableDataFrame, the replaced (formerly null) values would not be converted back to null values.

Statistical Operations

The DataFrame API provides various methods to gain statistical information.

Minimum

The minimum can be computed for all numeric Columns. The minimum() method can be used in two ways. First, by only specifying the Column index or name, the method computes and returns the minimum, i.e. the smallest value, in the specified Column. Second, by additionally specifying a rank, the method computes and returns a DataFrame with the specified number of rows, sorted in ascending order by the specified Column.

The following example shows how to compute the minimum:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

# returns a Python int because 'age' is not FP
youngest = df.minimum("age")

print(youngest)
# 21

# returns a Python float because 'score' is FP
lowest_score = df.minimum("score")

print(lowest_score)
# 0.42
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

// can be safely cast if the 'age' column is not FP
int youngest = (int) df.minimum("age");

System.out.println(youngest);
// 21

double lowestScore = df.minimum("score");

System.out.println(lowestScore);
// 0.42
                

Optionally, you can pass an additional int to the minimum() method which specifies how many ranked minima you want the method to return.

The following example shows how to compute ranked minima:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

lowest_scores = df.minimum("score", 3) # returns a DataFrame

print(lowest_scores)
# _| name  age score
# 0| Sofia 31  0.42
# 1| Bill  34  0.45
# 2| Simon 21  0.57
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

DataFrame lowestScores = df.minimum("score", 3);

System.out.println(lowestScores);
// _| name  age score
// 0| Sofia 31  0.42
// 1| Bill  34  0.45
// 2| Simon 21  0.57
                

In the above example, the minimum() method returns the 3 rows with the lowest score, sorted in ascending order. Please note that this is the preferred way of computing ranked minima, as opposed to simply sorting the entire DataFrame by the specified Column and selecting the first few rows.

Maximum

The maximum can be computed for all numeric Columns. The maximum() method can be used in two ways. First, by only specifying the Column index or name, the method computes and returns the maximum, i.e. the largest value, in the specified Column. Second, by additionally specifying a rank, the method computes and returns a DataFrame with the specified number of rows, sorted in descending order by the specified Column.

The following example shows how to compute the maximum:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

# returns a Python int because 'age' is not FP
oldest = df.maximum("age")

print(oldest)
# 36

# returns a Python float because 'score' is FP
highest_score = df.maximum("score")

print(highest_score)
# 0.89
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

// can be safely cast if the 'age' column is not FP
int oldest = (int) df.maximum("age");

System.out.println(oldest);
// 36

double highestScore = df.maximum("score");

System.out.println(highestScore);
// 0.89
                

Optionally, you can pass an additional int to the maximum() method which specifies how many ranked maxima you want the method to return.

The following example shows how to compute ranked maxima:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

highest_scores = df.maximum("score", 3) # returns a DataFrame

print(highest_scores)
# _| name  age score
# 0| Paul  29  0.89
# 1| Mark  25  0.78
# 2| Bob   36  0.62
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

DataFrame highestScores = df.maximum("score", 3);

System.out.println(highestScores);
// _| name  age score
// 0| Paul  29  0.89
// 1| Mark  25  0.78
// 2| Bob   36  0.62
                

In the above example, the maximum() method returns the 3 rows with the highest score, sorted in descending order. Please note that this is the preferred way of computing ranked maxima, as opposed to simply sorting the entire DataFrame by the specified Column and selecting the first few rows.

Average

The average can be computed for all numeric Columns.
The following example shows how to compute the average:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

average_age = df.average("age") # returns a Python float

print(average_age)
# 29.333333333333332
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

double averageAge = df.average("age");

System.out.println(averageAge);
// 29.333333333333332
                

Median

The median can be computed for all numeric Columns.
The following example shows how to compute the median:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

median_age = df.median("age") # returns a Python float

print(median_age)
# 30.0
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

double medianAge = df.median("age");

System.out.println(medianAge);
// 30.0
                

Sum

The sum can be computed for all numeric Columns.
The following example shows how to compute the sum:

print(df)
# _| name  age score
# 0| Bill  34  0.45
# 1| Bob   36  0.62
# 2| Mark  25  0.78
# 3| Sofia 31  0.42
# 4| Paul  29  0.89
# 5| Simon 21  0.57

sum_scores = df.sum("score")

print(sum_scores)
# 3.73
                
System.out.println(df);
// _| name  age score
// 0| Bill  34  0.45
// 1| Bob   36  0.62
// 2| Mark  25  0.78
// 3| Sofia 31  0.42
// 4| Paul  29  0.89
// 5| Simon 21  0.57

double sumScores = df.sum("score");

System.out.println(sumScores);
// 3.73
                

Count

The occurrence of values can be counted. There are two ways that the count() method can be used. First, you can count the number of occurrences of all unique elements in a specific Column. The occurrences are modeled as a DataFrame. Second, you can specify an additional regular expression which defines the elements to count.
The following example shows how to count all values in a specific Column:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

groups = df.count("group") # returns a DataFrame

print(groups)
# _| group count %
# 0| A     2     0.3333333432674408
# 1| B     3     0.5
# 2| C     1     0.1666666716337204
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame groups = df.count("group");

System.out.println(groups);
// _| group count %
// 0| A     2     0.33333334
// 1| B     3     0.5
// 2| C     1     0.16666667
                

You can pass an additional argument to the count() method to specify a regular expression; only the elements in the specified Column that match it are counted.
The following example shows how to count specific elements:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

count_B = df.count("group", "B") # returns an int

print(count_B)
# 3
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int countB = df.count("group", "B");

System.out.println(countB);
// 3
                

Count Unique Values

You can directly count the number of unique values in a specific Column.
The following example shows how to count all unique values in a Column:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

val = df.count_unique("group") # returns an int

print(val)
# 3
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

int val = df.countUnique("group");

System.out.println(val);
// 3
                

Unique Values

You can compute a set of all unique values in a specific Column.
The following example shows how to create a set holding all unique values inside a Column:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

groups = df.unique("group") # returns a Python set

print(groups)
# {'A', 'B', 'C'}
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

Set<Character> groups = df.unique("group");

System.out.println(groups);
// [A, B, C]
                

Numerical Operations

The DataFrame API provides various methods to manipulate values in numerical Columns. These methods are bulk operations, which means that they apply to the entire Column.

Absolute

Changing numerical values to their absolute value essentially ensures that all values have a positive sign. The magnitude of each value, however, is not changed by this operation.
The following example shows how to use the absolute() method:

print(df)
# _| name  age active level
# 0| Bill  34  True   -42
# 1| Bob   36  False  12
# 2| Mark  25  True   56
# 3| Sofia 31  True   -13
# 4| Paul  29  True   -51
# 5| Simon 21  False  -46

df.absolute("level")

print(df)
# _| name  age active level
# 0| Bill  34  True   42
# 1| Bob   36  False  12
# 2| Mark  25  True   56
# 3| Sofia 31  True   13
# 4| Paul  29  True   51
# 5| Simon 21  False  46
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42
// 1| Bob   36  false  12
// 2| Mark  25  true   56
// 3| Sofia 31  true   -13
// 4| Paul  29  true   -51
// 5| Simon 21  false  -46

df.absolute("level");

System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   42
// 1| Bob   36  false  12
// 2| Mark  25  true   56
// 3| Sofia 31  true   13
// 4| Paul  29  true   51
// 5| Simon 21  false  46
                

Ceil

Ceiling numerical values replaces all values by the value returned by the mathematical ceil function.
The following example shows how to use the ceil() method:

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.4
# 1| Bob   36  False  12.5
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   -13.1
# 4| Paul  29  True   51.9
# 5| Simon 21  False  46.01

df.ceil("level")

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.0
# 1| Bob   36  False  13.0
# 2| Mark  25  True   57.0
# 3| Sofia 31  True   -13.0
# 4| Paul  29  True   52.0
# 5| Simon 21  False  47.0
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.4
// 1| Bob   36  false  12.5
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   -13.1
// 4| Paul  29  true   51.9
// 5| Simon 21  false  46.01

df.ceil("level");

System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.0
// 1| Bob   36  false  13.0
// 2| Mark  25  true   57.0
// 3| Sofia 31  true   -13.0
// 4| Paul  29  true   52.0
// 5| Simon 21  false  47.0
                

After ceiling values in a numerical Column you could convert that Column into, e.g. an IntColumn, without losing information.
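
A minimal sketch of that follow-up conversion, reusing the convert() call from the Conversion section:

// sketch: after ceil() every value is a whole number, so converting
// the Column to an int type does not lose information
df.ceil("level");
df.convert("level", IntColumn.TYPE_CODE);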

Floor

Flooring numerical values replaces all values by the value returned by the mathematical floor function.
The following example shows how to use the floor() method:

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.4
# 1| Bob   36  False  12.5
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   -13.1
# 4| Paul  29  True   51.9
# 5| Simon 21  False  46.01

df.floor("level")

print(df)
# _| name  age active level
# 0| Bill  34  True   -43.0
# 1| Bob   36  False  12.0
# 2| Mark  25  True   56.0
# 3| Sofia 31  True   -14.0
# 4| Paul  29  True   51.0
# 5| Simon 21  False  46.0
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.4
// 1| Bob   36  false  12.5
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   -13.1
// 4| Paul  29  true   51.9
// 5| Simon 21  false  46.01

df.floor("level");

System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -43.0
// 1| Bob   36  false  12.0
// 2| Mark  25  true   56.0
// 3| Sofia 31  true   -14.0
// 4| Paul  29  true   51.0
// 5| Simon 21  false  46.0
                

After flooring values in a numerical Column you could convert that Column into, e.g. an IntColumn, without losing information.

Round

Rounding numerical values replaces all values in a Column by the corresponding rounded value. You must specify the number of decimal places to round to.
The following example shows how to use the round() method:

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.459
# 1| Bob   36  False  12.525
# 2| Mark  25  True   56.879
# 3| Sofia 31  True   -13.148
# 4| Paul  29  True   51.999
# 5| Simon 21  False  46.452

df.round("level", 1)

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.5
# 1| Bob   36  False  12.5
# 2| Mark  25  True   56.9
# 3| Sofia 31  True   -13.1
# 4| Paul  29  True   52.0
# 5| Simon 21  False  46.5
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.459
// 1| Bob   36  false  12.525
// 2| Mark  25  true   56.879
// 3| Sofia 31  true   -13.148
// 4| Paul  29  true   51.999
// 5| Simon 21  false  46.452

df.round("level", 1);

System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.5
// 1| Bob   36  false  12.5
// 2| Mark  25  true   56.9
// 3| Sofia 31  true   -13.1
// 4| Paul  29  true   52.0
// 5| Simon 21  false  46.5
                

Clip

Clipping numerical values ensures that all values in a Column lie within a specified range by setting every value outside of that range to the corresponding range boundary. The range can be open on either side, i.e. you don't necessarily have to specify both lower and upper clip boundaries.
The following example shows how to use the clip() method:

print(df)
# _| name  age active level
# 0| Bill  34  True   -42.49
# 1| Bob   36  False  12.52
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   -13.14
# 4| Paul  29  True   51.999
# 5| Simon 21  False  26.4

df.clip("level", -20, 30)

print(df)
# _| name  age active level
# 0| Bill  34  True   -20.0
# 1| Bob   36  False  12.52
# 2| Mark  25  True   30.0
# 3| Sofia 31  True   -13.14
# 4| Paul  29  True   30.0
# 5| Simon 21  False  26.4
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -42.49
// 1| Bob   36  false  12.52
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   -13.14
// 4| Paul  29  true   51.999
// 5| Simon 21  false  26.4

df.clip("level", -20, 30);

System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   -20.0
// 1| Bob   36  false  12.52
// 2| Mark  25  true   30.0
// 3| Sofia 31  true   -13.14
// 4| Paul  29  true   30.0
// 5| Simon 21  false  26.4
                

Sort Operations

The DataFrame API provides methods to sort all rows of a DataFrame based on the values in a specific Column.
The following example shows how to sort a DataFrame in ascending order:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

df.sort_by("age")

# or alternatively:
# df.sort_ascending_by("age")

print(df)
# _| name  age active group
# 0| Simon 21  False  B
# 1| Mark  25  True   C
# 2| Paul  29  True   A
# 3| Sofia 31  True   B
# 4| Bill  34  True   A
# 5| Bob   36  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

df.sortBy("age");

// or alternatively:
// df.sortAscendingBy("age");

System.out.println(df);
// _| name  age active group
// 0| Simon 21  false  B
// 1| Mark  25  true   C
// 2| Paul  29  true   A
// 3| Sofia 31  true   B
// 4| Bill  34  true   A
// 5| Bob   36  false  B
                

In principle, you can sort by any Column. All values are sorted according to their natural order. For strings and chars this means that they are sorted lexicographically. Please note that values in BinaryColumns are sorted according to their length, i.e. the number of bytes in the byte array object.

You may also sort all rows in a DataFrame in descending order according to values in a specific Column.
The following example shows how to sort a DataFrame in descending order:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

df.sort_descending_by("age")

print(df)
# _| name  age active group
# 0| Bob   36  False  B
# 1| Bill  34  True   A
# 2| Sofia 31  True   B
# 3| Paul  29  True   A
# 4| Mark  25  True   C
# 5| Simon 21  False  B
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

df.sortDescendingBy("age");

System.out.println(df);
// _| name  age active group
// 0| Bob   36  false  B
// 1| Bill  34  true   A
// 2| Sofia 31  true   B
// 3| Paul  29  true   A
// 4| Mark  25  true   C
// 5| Simon 21  false  B
                

Regardless of whether a DataFrame is sorted in ascending or descending order, when using a NullableDataFrame all null values are moved to the end of the underlying DataFrame. When sorting values with respect to a Column containing float or double values, then any NaN values are moved to the end of the DataFrame but before any null values.

The following example illustrates the sort behaviour for a Column holding double values in a DataFrame containing null values:

print(df)
# _| name  age active level
# 0| Bill  34  True   null
# 1| Bob   36  False  NaN
# 2| Mark  25  True   12.3
# 3| Sofia 31  True   null
# 4| Paul  29  True   5.2
# 5| Simon 21  False  NaN

df.sort_by("level")

print(df)
# _| name  age active level
# 0| Paul  29  True   5.2
# 1| Mark  25  True   12.3
# 2| Bob   36  False  NaN
# 3| Simon 21  False  NaN
# 4| Sofia 31  True   null
# 5| Bill  34  True   null
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   null
// 1| Bob   36  false  NaN
// 2| Mark  25  true   12.3
// 3| Sofia 31  true   null
// 4| Paul  29  true   5.2
// 5| Simon 21  false  NaN

df.sortBy("level");

System.out.println(df);
// _| name  age active level
// 0| Paul  29  true   5.2
// 1| Mark  25  true   12.3
// 2| Bob   36  false  NaN
// 3| Simon 21  false  NaN
// 4| Sofia 31  true   null
// 5| Bill  34  true   null
                

Please note that the order between equal elements is not defined. Particularly, the sorting algorithm does not have to be stable.

Set Operations

The DataFrame API provides various methods for set-theoretic operations. These operations can be performed either with respect to columns or to rows. Three basic set-theoretic operations are supported: difference, union and intersection. All of these operations are performed by treating the columns or the rows of two separate DataFrames as two sets.

Warning:
Set-theoretic operations on Columns only copy the references to the corresponding Column instances. Changing the row structure of the result DataFrames can lead to an incoherent DataFrame state. Always copy the DataFrame returned by set-theoretic column operations when subsequently changing the row structure!
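
A minimal sketch of that precaution, assuming a copy utility such as a static DataFrame.copy() method is available in your implementation, using the DataFrames from the Difference Columns example below:

// hedged sketch: materialize an independent copy of the result before
// changing its row structure, so that df and df2 stay coherent
DataFrame diff = DataFrame.copy(df.differenceColumns(df2));
diff.removeRows("group", "B"); // safe: only the copied Columns are modified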

Difference Columns

This operation computes the set-theoretic difference with regard to all columns in two distinct DataFrame instances.
The following example shows how to compute the column difference of two DataFrames:

print(df)
# _| name  age active level
# 0| Bill  34  True   42.49
# 1| Bob   36  False  12.52
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   13.14
# 4| Paul  29  True   51.999
# 5| Simon 21  False  26.4

print(df2)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

diff = df.difference_columns(df2) # returns a DataFrame

print(diff)
# _| level  group
# 0| 42.49  A
# 1| 12.52  B
# 2| 56.87  C
# 3| 13.14  B
# 4| 51.999 A
# 5| 26.4   B
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   42.49
// 1| Bob   36  false  12.52
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   13.14
// 4| Paul  29  true   51.999
// 5| Simon 21  false  26.4

System.out.println(df2);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame diff = df.differenceColumns(df2);

System.out.println(diff);
// _| level  group
// 0| 42.49  A
// 1| 12.52  B
// 2| 56.87  C
// 3| 13.14  B
// 4| 51.999 A
// 5| 26.4   B
                

Union Columns

This operation computes the set-theoretic union with regard to all columns in two distinct DataFrame instances. The DataFrame returned by this operation contains the references to all Columns of the DataFrame that the method is called upon and of the specified DataFrame, omitting any duplicate Columns from the argument DataFrame.
The following example shows how to compute the column union of two DataFrames:

print(df)
# _| name  age active level
# 0| Bill  34  True   42.49
# 1| Bob   36  True   12.52
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   13.14
# 4| Paul  29  True   51.999
# 5| Simon 21  True   26.4

print(df2)
# _| name  active group
# 0| Bill  False  A
# 1| Bob   False  B
# 2| Mark  False  C
# 3| Sofia False  B
# 4| Paul  False  A
# 5| Simon False  B

union = df.union_columns(df2) # returns a DataFrame

print(union)
# _| name  age active level  group
# 0| Bill  34  True   42.49  A
# 1| Bob   36  True   12.52  B
# 2| Mark  25  True   56.87  C
# 3| Sofia 31  True   13.14  B
# 4| Paul  29  True   51.999 A
# 5| Simon 21  True   26.4   B
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   42.49
// 1| Bob   36  true   12.52
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   13.14
// 4| Paul  29  true   51.999
// 5| Simon 21  true   26.4

System.out.println(df2);
// _| name  active group
// 0| Bill  false  A
// 1| Bob   false  B
// 2| Mark  false  C
// 3| Sofia false  B
// 4| Paul  false  A
// 5| Simon false  B

DataFrame union = df.unionColumns(df2);

System.out.println(union);
// _| name  age active level  group
// 0| Bill  34  true   42.49  A
// 1| Bob   36  true   12.52  B
// 2| Mark  25  true   56.87  C
// 3| Sofia 31  true   13.14  B
// 4| Paul  29  true   51.999 A
// 5| Simon 21  true   26.4   B
                

Intersection Columns

This operation computes the set-theoretic intersection with regard to all columns in two distinct DataFrame instances. The DataFrame returned by this operation contains the references to all Columns of the DataFrame that the method is called upon that are also present in the specified DataFrame.
The following example shows how to compute the column intersection of two DataFrames:

print(df)
# _| name  age active level
# 0| Bill  34  True   42.49
# 1| Bob   36  True   12.52
# 2| Mark  25  True   56.87
# 3| Sofia 31  True   13.14
# 4| Paul  29  True   51.999
# 5| Simon 21  True   26.4

print(df2)
# _| name  active group
# 0| Bill  False  A
# 1| Bob   False  B
# 2| Mark  False  C
# 3| Sofia False  B
# 4| Paul  False  A
# 5| Simon False  B

intersec = df.intersection_columns(df2) # returns a DataFrame

print(intersec)
# _| name  active
# 0| Bill  True
# 1| Bob   True
# 2| Mark  True
# 3| Sofia True
# 4| Paul  True
# 5| Simon True
                
System.out.println(df);
// _| name  age active level
// 0| Bill  34  true   42.49
// 1| Bob   36  true   12.52
// 2| Mark  25  true   56.87
// 3| Sofia 31  true   13.14
// 4| Paul  29  true   51.999
// 5| Simon 21  true   26.4

System.out.println(df2);
// _| name  active group
// 0| Bill  false  A
// 1| Bob   false  B
// 2| Mark  false  C
// 3| Sofia false  B
// 4| Paul  false  A
// 5| Simon false  B

DataFrame intersec = df.intersectionColumns(df2);

System.out.println(intersec);
// _| name  active
// 0| Bill  true
// 1| Bob   true
// 2| Mark  true
// 3| Sofia true
// 4| Paul  true
// 5| Simon true
                

Difference Rows

This operation computes the set-theoretic difference with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row difference of two DataFrames:

print(df)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| aac 33 True
# 3| aad 44 False
# 4| aae 55 True

print(df2)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| aac 33 True
# 3| aad 44 False
# 4| ccc 22 False
# 5| ccc 33 True

diff = df.difference_rows(df2) # returns a DataFrame

print(diff)
# _| A   B  C
# 0| aae 55 True
# 1| ccc 22 False
# 2| ccc 33 True
                
System.out.println(df);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| aac 33 true
// 3| aad 44 false
// 4| aae 55 true

System.out.println(df2);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| aac 33 true
// 3| aad 44 false
// 4| ccc 22 false
// 5| ccc 33 true

DataFrame diff = df.differenceRows(df2);

System.out.println(diff);
// _| A   B  C
// 0| aae 55 true
// 1| ccc 22 false
// 2| ccc 33 true
                

Union Rows

This operation computes the set-theoretic union with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row union of two DataFrames:

print(df)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| aac 33 True

print(df2)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| ccc 33 False

union = df.union_rows(df2) # returns a DataFrame

print(union)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| aac 33 True
# 3| ccc 33 False
                
System.out.println(df);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| aac 33 true

System.out.println(df2);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| ccc 33 false

DataFrame union = df.unionRows(df2);

System.out.println(union);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| aac 33 true
// 3| ccc 33 false
                

Intersection Rows

This operation computes the set-theoretic intersection with regard to all rows in two distinct DataFrame instances.
The following example shows how to compute the row intersection of two DataFrames:

print(df)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
# 2| aac 33 True

print(df2)
# _| A   B  C
# 0| ccc 33 False
# 1| aaa 11 True
# 2| aab 22 False

intersec = df.intersection_rows(df2) # returns a DataFrame

print(intersec)
# _| A   B  C
# 0| aaa 11 True
# 1| aab 22 False
                
System.out.println(df);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
// 2| aac 33 true

System.out.println(df2);
// _| A   B  C
// 0| ccc 33 false
// 1| aaa 11 true
// 2| aab 22 false

DataFrame intersec = df.intersectionRows(df2);

System.out.println(intersec);
// _| A   B  C
// 0| aaa 11 true
// 1| aab 22 false
                

Group Operations

The DataFrame API provides various methods to group elements together and aggregate values from other numerical Columns by applying a statistical operation.

Minimum

This operation groups values from the specified Column and computes the minimum of all numerical Columns for each group.
The following example shows how to group minimum values:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

minima = df.group_minimum_by("group") # returns a DataFrame

print(minima)
# _| group age
# 0| A     29
# 1| B     21
# 2| C     25
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame minima = df.groupMinimumBy("group");

System.out.println(minima);
// _| group age
// 0| A     29
// 1| B     21
// 2| C     25
                

Maximum

This operation groups values from the specified Column and computes the maximum of all numerical Columns for each group.
The following example shows how to group maximum values:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

maxima = df.group_maximum_by("group") # returns a DataFrame

print(maxima)
# _| group age
# 0| A     34
# 1| B     36
# 2| C     25
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame maxima = df.groupMaximumBy("group");

System.out.println(maxima);
// _| group age
// 0| A     34
// 1| B     36
// 2| C     25
                

Average

This operation groups values from the specified Column and computes the average of all numerical Columns for each group.
The following example shows how to group average values:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

averages = df.group_average_by("group") # returns a DataFrame

print(averages)
# _| group age
# 0| A     31.5
# 1| B     29.333333333333332
# 2| C     25.0
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame averages = df.groupAverageBy("group");

System.out.println(averages);
// _| group age
// 0| A     31.5
// 1| B     29.333333333333332
// 2| C     25.0
                

Sum

This operation groups values from the specified Column and computes the sum of all numerical Columns for each group.
The following example shows how to group sum values:

print(df)
# _| name  age active group
# 0| Bill  34  True   A
# 1| Bob   36  False  B
# 2| Mark  25  True   C
# 3| Sofia 31  True   B
# 4| Paul  29  True   A
# 5| Simon 21  False  B

sums = df.group_sum_by("group") # returns a DataFrame

print(sums)
# _| group age
# 0| A     63.0
# 1| B     88.0
# 2| C     25.0
                
System.out.println(df);
// _| name  age active group
// 0| Bill  34  true   A
// 1| Bob   36  false  B
// 2| Mark  25  true   C
// 3| Sofia 31  true   B
// 4| Paul  29  true   A
// 5| Simon 21  false  B

DataFrame sums = df.groupSumBy("group");

System.out.println(sums);
// _| group age
// 0| A     63.0
// 1| B     88.0
// 2| C     25.0
                

Join Operations

The DataFrame API provides a method to perform joins on DataFrame instances. This is essentially equivalent to an SQL inner join: rows without a matching entry in the other DataFrame are not part of the result. Even though the join() method is called on a specific DataFrame instance, the operation itself is therefore commutative.

The following example shows how to use the join() method when specifying both column names:

print(df)
# _| id  name age
# 0| 101 Bill 34
# 1| 102 Bob  36
# 2| 103 Mark 25
# 3| 104 Paul 29

print(df2)
# _| key active group
# 0| 101 True   A
# 1| 102 False  B
# 2| 103 True   C

result = df.join(df2, "id", "key") # returns a DataFrame

print(result)
# _| id  name age active group
# 0| 101 Bill 34  True   A
# 1| 102 Bob  36  False  B
# 2| 103 Mark 25  True   C
                
System.out.println(df);
// _| id  name age
// 0| 101 Bill 34
// 1| 102 Bob  36
// 2| 103 Mark 25
// 3| 104 Paul 29

System.out.println(df2);
// _| key active group
// 0| 101 true   A
// 1| 102 false  B
// 2| 103 true   C

DataFrame result = df.join(df2, "id", "key");

System.out.println(result);
// _| id  name age active group
// 0| 101 Bill 34  true   A
// 1| 102 Bob  36  false  B
// 2| 103 Mark 25  true   C
                

Optionally, when both DataFrames involved in a join operation have exactly one Column with an identical name in common, you can omit the specification of the join keys when calling the join() method.
The following example illustrates this:

print(df)
# _| id  name age
# 0| 101 Bill 34
# 1| 102 Bob  36
# 2| 103 Mark 25
# 3| 104 Paul 29

print(df2)
# _| id active group
# 0| 101 True  A
# 1| 102 False B
# 2| 103 True  C

result = df.join(df2)

print(result)
# _| id  name age active group
# 0| 101 Bill 34  True   A
# 1| 102 Bob  36  False  B
# 2| 103 Mark 25  True   C
                
System.out.println(df);
// _| id  name age
// 0| 101 Bill 34
// 1| 102 Bob  36
// 2| 103 Mark 25
// 3| 104 Paul 29

System.out.println(df2);
// _| id active group
// 0| 101 true  A
// 1| 102 false B
// 2| 103 true  C

DataFrame result = df.join(df2);

System.out.println(result);
// _| id  name age active group
// 0| 101 Bill 34  true   A
// 1| 102 Bob  36  false  B
// 2| 103 Mark 25  true   C
                

Utilities

The DataFrame API provides a collection of standard methods for various purposes. All methods that do not directly fit into one of the already described sections are explained in the following subsections.

Info

The info() method provides a descriptive string summarizing the main properties of a DataFrame without showing the actual data. The actual string returned by the method is not strictly defined by the specification and may therefore be implementation-dependent.
The following example shows what the info string looks like:

print(df)
# _| id  name  age active group
# 0| 101 Bill  34  True   A
# 1| 102 Bob   36  False  B
# 2| 103 Mark  25  True   C
# 3| 104 Sofia 31  True   B

print(df.info())
# Type:    Default
# Columns: 5
# Rows:    4
# _| column type    code
# 0| id     int     3
# 1| name   string  5
# 2| age    short   2
# 3| active boolean 9
# 4| group  char    8
                
System.out.println(df);
// _| id  name  age active group
// 0| 101 Bill  34  true   A
// 1| 102 Bob   36  false  B
// 2| 103 Mark  25  true   C
// 3| 104 Sofia 31  true   B

System.out.println(df.info());
// Type:    Default
// Columns: 5
// Rows:    4
// _| column type    code
// 0| id     int     3
// 1| name   string  5
// 2| age    short   2
// 3| active boolean 9
// 4| group  char    8
                

The code column in the above info-string DataFrame refers to the unique type code of the corresponding Column instance, e.g. a (non-nullable) StringColumn has a unique type code of 5.
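
As a rough sketch of how such type codes could be inspected programmatically, the snippet below assumes that a Column can be retrieved by name via getColumn() and exposes its type code via typeCode(); these accessor names are assumptions here, so consult the API reference of your implementation for the exact names.

// Assumed accessors: getColumn() and typeCode() may be named
// differently in your implementation
Column nameColumn = df.getColumn("name");

System.out.println(nameColumn.typeCode());
// 5  (the unique type code of a non-nullable StringColumn)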

To Array

A DataFrame can be converted to a plain two-dimensional array/list. All values are copies of the values inside the underlying DataFrame, with the exception of byte arrays of BinaryColumns.
The following code gives an example of how to get a DataFrame as an array/list of objects:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

array = df.to_array() # returns a list of lists
print(array)
# [[101, 102, 103], ['Bill', 'Bob', 'Mark'], [True, False, True]]
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

Object[] array = df.toArray();
System.out.println(Arrays.deepToString(array));
// [[101, 102, 103], [Bill, Bob, Mark], [true, false, true]]
                
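
Because the returned values are copies (with the exception of the byte arrays of BinaryColumns), modifying the returned array afterwards does not affect the DataFrame itself. A minimal sketch:

Object[] array = df.toArray();

// Change the first entry of the copied 'name' column array
((Object[]) array[1])[0] = "Changed";

System.out.println(df);
// The DataFrame itself is unchanged and still shows "Bill" in row 0
// because toArray() returned copies of the values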

To String

A DataFrame can be represented as a string. This can be very helpful when working with DataFrames interactively or when debugging.
The following example shows a string representation of a DataFrame:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

string = df.to_string() # returns a Python str
print(string)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

String string = df.toString();
System.out.println(string);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

Clone

A DataFrame can be cloned (copied) in its entirety. This will create a deep copy, i.e. all values including byte arrays of BinaryColumns are copied to a new DataFrame instance.
The following example shows how to copy a DataFrame:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

copy = df.clone()

# or alternatively:
# copy = DataFrame.copy(df)

print(copy)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

DataFrame copy = df.clone();

// or alternatively:
// DataFrame copy = DataFrame.copy(df);

System.out.println(copy);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                
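
Because the copy is deep, subsequent modifications of the copy do not affect the original DataFrame, as the following minimal sketch shows:

DataFrame copy = df.clone();

// Changing a value in the copy leaves the original untouched
copy.setBoolean("active", 1, true);

System.out.println(df.equals(copy));
// false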

Hash Code

A hash code can be generated from any DataFrame. Please note that hash values are not required to be platform independent. Even two identical DataFrames in separate system processes do not necessarily have the same hash code. The method for computing the hash code of a DataFrame simply returns an integer value with at least 32 bits worth of information. Hash code values are therefore not sufficiently resistant to hash collisions.

Warning:
Do not use hash codes to determine if two DataFrames are equal. Use the equals() method instead!

The following example shows how to compute a hash code value:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

hashcode = df.hash_code() # returns an int

# or alternatively:
# hashcode = hash(df)

print(hashcode)
# 1486664104986588480
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

int hashcode = df.hashCode();

System.out.println(hashcode);
// -843453324
                
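
As a minimal sketch of the intended usage: within a single process, equal DataFrames are expected to produce equal hash codes, but equality itself should always be checked with the equals() method (see the Equals section below):

DataFrame df2 = df.clone();

// Equal DataFrames yield the same hash code within one process
System.out.println(df.hashCode() == df2.hashCode());
// true

// Equality must be determined via equals(), never by comparing hash codes
System.out.println(df.equals(df2));
// true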

Equals

Whether two DataFrames are equal can be determined with the equals() method. Two DataFrames are equal if they have the same column structure and all corresponding row elements are equal. The order of Columns in both DataFrames is taken into consideration when checking for equality.
The following example shows how to check if two DataFrames are equal:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

df2 = df.clone()

is_equal = df.equals(df2) # returns a Python bool

# or alternatively:
# is_equal = df == df2

print(is_equal)
# True

df2.set_boolean("active", 1, True)

is_equal = df.equals(df2)

print(is_equal)
# False
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

DataFrame df2 = df.clone();

boolean isEqual = df.equals(df2);

System.out.println(isEqual);
// true

df2.setBoolean("active", 1, true);

isEqual = df.equals(df2);

System.out.println(isEqual);
// false
                

Memory Usage

The memory usage of a DataFrame can be approximately determined at runtime. Please note that the value determined by the method is only an approximation, comparable to the size of the payload data in serialized form. The memory usage is denoted in bytes, and the actual memory consumption might be higher than the value returned by the method.

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

val = df.memory_usage() # returns an int

print(val)
# 39
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

int val = df.memoryUsage();

System.out.println(val);
// 27
                
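
A minimal sketch relating the reported value to the serialized form, using the serialize() function covered in the I/O Support section below; the serialized byte array additionally encodes the structure of the DataFrame, so the two numbers are related but generally not identical:

int approx = df.memoryUsage();
byte[] serialized = DataFrame.serialize(df);

// The approximation is comparable to the size of the serialized payload data;
// the full serialized array also contains structural metadata
System.out.println(approx);
System.out.println(serialized.length);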

Flush

A flush operation shrinks the internally used array of each Column to match the actual size needed by the DataFrame. This can be used in situations where unnecessarily allocated space should be freed in order to reduce the overall memory footprint of a process.
The following example shows how to flush a DataFrame:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False

print(df.capacity())
# 2

df.add_row([103, "Mark", True])

print(df.capacity())
# 4

df.flush()

print(df.capacity())
# 3
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false

System.out.println(df.capacity());
// 2

df.addRow(103, "Mark", true);

System.out.println(df.capacity());
// 4

df.flush();

System.out.println(df.capacity());
// 3
                

Merge

Multiple DataFrames can be merged into one DataFrame. Merging is performed with respect to all Columns. If the DataFrames have Columns with duplicate names, the affected Columns are automatically renamed.
The following example demonstrates how to merge DataFrames:

print(df)
# _| id  name
# 0| 101 Bill
# 1| 102 Bob

print(df2)
# _| name    active
# 0| Smith   True
# 1| Swanson False

print(df3)
# _| age level
# 0| 34  1.5
# 1| 36  2.3

res = DataFrame.merge(df, df2, df3) # returns a DataFrame
print(res)
# _| id  name_0 name_1  active age level
# 0| 101 Bill   Smith   True   34  1.5
# 1| 102 Bob    Swanson False  36  2.3
                
System.out.println(df);
// _| id  name
// 0| 101 Bill
// 1| 102 Bob

System.out.println(df2);
// _| name    active
// 0| Smith   true
// 1| Swanson false

System.out.println(df3);
// _| age level
// 0| 34  1.5
// 1| 36  2.3

DataFrame res = DataFrame.merge(df, df2, df3);
System.out.println(res);
// _| id  name_0 name_1  active age level
// 0| 101 Bill   Smith   true   34  1.5
// 1| 102 Bob    Swanson false  36  2.3
                

Like

The static function like() creates a new DataFrame instance with the same column structure as the provided DataFrame argument. The returned DataFrame is empty.
The following example shows how to copy the column structure of a DataFrame:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

df2 = DataFrame.like(df) # returns a DataFrame

print(df2)
# __| id name active

print(df2.is_empty())
# True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

DataFrame df2 = DataFrame.like(df);

System.out.println(df2);
// __| id name active

System.out.println(df2.isEmpty());
// true
                
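
Since the returned DataFrame has the same column structure as the provided argument, it can be populated directly with rows that match this structure. A minimal sketch, continuing the example above:

// The empty copy accepts rows that match the original column structure
df2.addRow(104, "Sofia", true);

System.out.println(df2.isEmpty());
// false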

I/O Support

The DataFrame API provides standard functions for serialization support.

Serialization

You may serialize any DataFrame into an array of bytes. These byte arrays can be deserialized again, which restores the original DataFrame object.
The following example shows DataFrame serialization:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

val = DataFrame.serialize(df) # returns a Python bytearray

print(val.hex())
# 7b763a323b64 ... 4d61726b00a0

df = DataFrame.deserialize(val) # returns a DataFrame

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

byte[] val = DataFrame.serialize(df);

System.out.println(Arrays.toString(val));
// [123, 118, 58, 50, 59, 100, ... , 77, 97, 114, 107, 0, -96]

df = DataFrame.deserialize(val);

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

The byte array in the above example is not compressed. You can compress the byte array by passing an additional boolean flag to the serialize() function. The following example shows how to serialize a DataFrame to a compressed byte array:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

val = DataFrame.serialize(df, compress=True)

print(val.hex())
# 6466ab2eb332 ... 020080d80d6e

df = DataFrame.deserialize(val)

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
import static com.raven.common.io.DataFrameSerializer.MODE_COMPRESSED;

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

byte[] val = DataFrame.serialize(df, MODE_COMPRESSED);

// or alternatively:
// byte[] val = DataFrame.serialize(df, true);

System.out.println(Arrays.toString(val));
// [100, 102, -85, 46, -77, 50, ... , 2, 0, -128, -40, 13, 110]

df = DataFrame.deserialize(val);

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

The byte array is automatically decompressed if applicable. Please note that both compression and decompression require additional processing time.

Read and Write Files

You may persist a DataFrame to a file. The file extension for DataFrame files is '.df'.

The following example shows how to persist a DataFrame to a file and read that DataFrame from the file again:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

DataFrame.write("myFile.df", df)

df = DataFrame.read("myFile.df")

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

DataFrame.write("myFile.df", df);

df = DataFrame.read("myFile.df");

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

Base64 Encoding

You may encode a DataFrame as a Base64 string and decode such a string back into a DataFrame.
The following example shows how to encode a DataFrame to Base64:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

string = DataFrame.to_base64(df) # returns a str

print(string)
# ZGarLr ... gNbg==

df = DataFrame.from_base64(string)

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

String string = DataFrame.toBase64(df);

System.out.println(string);
// ZGarLr ... gNbg==

df = DataFrame.fromBase64(string);

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

CSV Files

You can read a CSV file into a DataFrame and write a DataFrame to a CSV file. Although not strictly defined by the DataFrame specification, all available implementations provide support for handling CSV files.

The following example shows how to read a CSV file from the filesystem:

# id,name,active
# 101,Bill,True
# 102,Bob,False
# 103,Mark,True

df = DataFrame.read_csv("myFile.csv") # returns a DataFrame

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True
                
// id,name,active
// 101,Bill,True
// 102,Bob,False
// 103,Mark,True

import com.raven.common.io.CSVReader;

DataFrame df = new CSVReader("myFile.csv").read();

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true
                

All Columns in the returned DataFrame are StringColumns because CSV files do not carry any type information. You can explicitly define the column types in the returned DataFrame when reading a file.
The following example shows how to read a CSV file and specify the column types:

# id,name,active
# 101,Bill,True
# 102,Bob,False
# 103,Mark,True

df = DataFrame.read_csv("myFile.csv", types=("int", "string", "boolean"))

print(df.info())
# Type:    Default
# Columns: 3
# Rows:    3
# _| column type    code
# 0| id     int     3
# 1| name   string  5
# 2| active boolean 9
                    
// id,name,active
// 101,Bill,True
// 102,Bob,False
// 103,Mark,True

import com.raven.common.io.CSVReader;

DataFrame df = new CSVReader("myFile.csv")
        .useColumnTypes(Integer.class, String.class, Boolean.class)
        .read();

System.out.println(df.info());
// Type:    Default
// Columns: 3
// Rows:    3
// _| column type    code
// 0| id     int     3
// 1| name   string  5
// 2| active boolean 9
                    

The corresponding functions for reading CSV files have more parameters. See the source code documentation for details.

You can write a DataFrame to a CSV file in a similar way.
The following example shows how to write a DataFrame to a CSV file:

print(df)
# _| id  name active
# 0| 101 Bill True
# 1| 102 Bob  False
# 2| 103 Mark True

DataFrame.write_csv("myFile.csv", df)
                    
import com.raven.common.io.CSVWriter;

System.out.println(df);
// _| id  name active
// 0| 101 Bill true
// 1| 102 Bob  false
// 2| 103 Mark true

new CSVWriter("myFile.csv").write(df);
                    

The corresponding functions for writing CSV files have more parameters. See the source code documentation for details.