When people first hear about Iceberg, it's often difficult to understand just what a table format does. Table formats are a relatively new concept in the big data space because previously we only had one: Hive tables. I've found that the easiest way to think about it is to compare table formats to file formats. A table format tracks data files just like a file format tracks rows. And like a file format, the organization and metadata in a table format helps engines quickly find the data for a query and skip the rest.

There is more to a table format than just finding or skipping data; reliable, atomic commits are another critical piece, for a future post. This post will focus on how Iceberg's metadata forms an index that Iceberg uses to scale to hundreds of petabytes in a single table and to quickly find matching data, even on a single node.

To demonstrate metadata indexing, I'll use an example table of events that is written by Flink from a Kafka stream. This events table is organized into daily partitions of event time and bucketed by the device ID that sent an event.

SELECT * FROM events WHERE device_id = 74 AND event_time > 'T10:00:00' AND event_time < 'T20:00:00'

Iceberg stores data in Parquet, Avro, or ORC files. File formats like Parquet already make it possible to read a subset of columns and skip rows within each data file, so the purpose of table-level metadata is to efficiently select which files to read. Iceberg tracks each data file in a table, along with two main pieces of information: its partition and column-level metrics.

Partitions in Iceberg are the same idea as in Hive, but Iceberg tracks individual files and the partition each file belongs to, instead of keeping track of partitions and locating files by listing directories. The partition of a file is stored as a struct of metadata values that are automatically derived using the table's partition transforms. For our example table, a row with event_time=11:23:31.129055 and device_id=74 is stored in a Parquet file under partition (, 1), corresponding to the partition transforms days(event_time) and bucket(device_id, 64). Storing the values directly instead of as a string key makes it easy to run partition filters to select data files, without needing to parse a key or URL-encode values as in Hive.

Partitions tend to be fairly large, with tens or hundreds of files in each partition. Iceberg stores column-level metrics for each data file for an extra level of filtering within each partition. The metrics kept are lower and upper bounds, as well as null, NaN, and total counts for the columns in a file. After filtering by partition, Iceberg will check whether each file matches a given query filter using these stats. File formats will also use column stats to skip rows, but it is much more efficient to keep stats in table metadata to avoid even creating a task for a file with no matching rows.

In the events table, looking for an hour-sized range of event_time would require reading a whole day if filtering with only day-sized partitions. However, if data is being committed continuously from Kafka, each data file would hold only a small slice of event_time. Iceberg will compare the query's time range with each file's event_time range to find just the files that overlap, which can easily be a 10-100x reduction in data.

Iceberg keeps the partition and column metrics for data files in manifest files. Each manifest tracks a subset of the files in a table, to reduce write amplification and to allow parallel metadata operations. One of the most important reasons to split metadata across files is to be able to skip whole manifests, not just individual files. Just like Iceberg uses value ranges for columns to skip data files, it also uses value ranges for partition fields (along with whether there are null or NaN values) to skip manifest files. The metadata for every manifest in a given version, or snapshot, is kept in a manifest list file. When reading manifests from the list, Iceberg will compare a query's partition predicates with the range of values for each partition field and skip any manifest that doesn't overlap.

This additional level of indexing makes a huge difference in the amount of work needed to find the files for a query. When tables grow to tens or hundreds of petabytes, there can be gigabytes of metadata, and a brute-force scan through that metadata would require a long wait or a parallel job. Instead, Iceberg skips most of it.
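The file skipping described above boils down to a range-overlap check: a file (or a manifest) can only contain matching rows if its lower and upper bounds for a column intersect the query's predicate range. A minimal sketch of that idea, not the actual Iceberg implementation; the class names, helper functions, and the date-less timestamp strings are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    lower: str  # lower bound of the column's values in this file
    upper: str  # upper bound of the column's values in this file

@dataclass
class DataFile:
    path: str
    stats: dict  # column name -> ColumnStats

def overlaps(stats: ColumnStats, lo: str, hi: str) -> bool:
    # The file can match `column > lo AND column < hi` only if
    # [stats.lower, stats.upper] intersects the open interval (lo, hi).
    return stats.upper > lo and stats.lower < hi

def plan_files(data_files, column, lo, hi):
    # Keep only files whose value range for `column` overlaps the predicate;
    # files with no stats for the column cannot be skipped safely.
    return [f.path for f in data_files
            if column not in f.stats or overlaps(f.stats[column], lo, hi)]

files = [
    DataFile("f1.parquet", {"event_time": ColumnStats("T08:10:00", "T09:55:00")}),
    DataFile("f2.parquet", {"event_time": ColumnStats("T09:58:00", "T10:40:00")}),
    DataFile("f3.parquet", {"event_time": ColumnStats("T21:00:00", "T22:30:00")}),
]

# event_time > 'T10:00:00' AND event_time < 'T20:00:00'
print(plan_files(files, "event_time", "T10:00:00", "T20:00:00"))  # ['f2.parquet']
```

The same check applies one level up: a manifest list stores a value range per partition field for each manifest, so entire manifests are dropped before their file entries are ever read. Real implementations also consult the null and NaN counts, since a range check alone says nothing about whether nulls could match.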