Catalog layer
The Kinesis Data Firehose Delivery Stream needs a Glue table that holds the schema of the Parquet files to be able to produce them (incoming_events
table). The stats
and the events
tables are aggregated daily from the base incoming_events
table via cron jobs scheduled by Amazon EventBridge Rules at 00:16 AM.
incoming_events table
This table stores the events that are the result of the data transformation and dynamic partitioning Lambda function. The schema for the table incoming_events
looks like this:
Column name | Data type | Is partition key? | Description |
---|---|---|---|
domain_name | string | yes | The domain name |
event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
event_type | string | yes | The type of the event (pageview or track ) |
event_year | int | no | The year of the event_date (YYYY) |
event_month | int | no | The month of the event (MM) |
event_day | int | no | The day of the event (DD) |
event_timestamp | timestamp | no | The exact event timestamp |
arrival_timestamp | timestamp | no | The exact timestamp when the event arrived in the Kinesis Data Stream |
arrival_delay_ms | int | no | The difference between event_timestamp and arrival_timestamp in milliseconds |
edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
edge_state | string | no | The state of the edge location |
edge_country | string | no | The country of the edge location |
edge_country_code | string | no | The country code of the edge location |
edge_latitude | float | no | The latitude of the edge location |
edge_longitude | float | no | The longitude of the edge location |
edge_id | string | no | The original id of the edge location |
referrer | string | no | The referrer |
referrer_domain_name | string | no | The domain name of the referrer |
browser_name | string | no | The name of the browser |
browser_version | string | no | The version of the browser |
browser_os_name | string | no | The OS name of the browser |
browser_os_version | string | no | The OS version of the browser |
browser_timezone | string | no | The timezone of the browser |
browser_language | string | no | The language of the browser |
device_type | string | no | The device type |
device_vendor | string | no | The device vendor |
device_outer_resolution | string | no | The outer resolution of the device |
device_inner_resolution | string | no | The inner resolution of the device |
device_color_depth | int | no | The color depth of the device |
device_platform | string | no | The platform of the device |
device_memory | float | no | The memory of the device (in MB) |
device_cores | int | no | The number of cores of the device |
utm_source | string | no | Identifies which site sent the traffic |
utm_campaign | string | no | Identifies a specific product promotion or strategic campaign |
utm_medium | string | no | Identifies what type of link was used, such as cost per click or email |
utm_content | string | no | Identifies what specifically was clicked to bring the user to the site |
utm_term | string | no | Identifies search terms |
request_url | string | no | The full requested URL |
request_path | string | no | The path of the requested URL |
request_query_string | string | no | The query string of the requested URL |
request_bytes | int | no | The size of the request in bytes |
request_status_code | int | no | The HTTP status code of the request |
request_cache_status | string | no | The CloudFront cache status |
request_delivery_time_ms | int | no | The time in ms it took for CloudFront to complete the request |
request_asn | int | no | The ASN of the requestor |
request_is_bot | int | no | If the request is categorized as a bot, the value will be 1 , if not 0 |
event_name | string | no | The name of the event for tracking events |
event_data | string | no | The stringified event payload for tracking events |
page_view_id | string | no | The unique pageview id |
daily_page_view_id | string | no | The unique daily pageview id |
daily_visitor_id | string | no | The unique daily visitor id |
stats table
The pageviews and visitor aggregation table. Its schema looks like this:
Column name | Data type | Is partition key? | Description |
---|---|---|---|
domain_name | string | yes | The domain name |
event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
event_hour | int | no | The hour part of the event timestamp |
edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
edge_country | string | no | The country of the edge location |
edge_latitude | float | no | The latitude of the edge location |
edge_longitude | float | no | The longitude of the edge location |
referrer_domain_name | string | no | The domain name of the referrer |
browser_name | string | no | The name of the browser |
browser_os_name | string | no | The OS name of the browser |
device_type | string | no | The device type |
device_vendor | string | no | The device vendor |
utm_source | string | no | Identifies which site sent the traffic |
utm_campaign | string | no | Identifies a specific product promotion or strategic campaign |
utm_medium | string | no | Identifies what type of link was used, such as cost per click or email |
utm_content | string | no | Identifies what type of link was used, such as cost per click or email |
utm_term | string | no | Identifies search terms |
request_path | string | no | The path of the requested URL |
page_view_cnt | int | no | The number of page views |
visitor_cnt | int | no | The number of daily visitors |
bounces_cnt | int | no | The number of bounces (visited only one page) |
visit_duration_sec_avg | int | no | The average duration of a visit (in seconds) |
events table
The schema for the table events
looks like this:
Column name | Data type | Is partition key | Description |
---|---|---|---|
domain_name | string | yes | The domain name |
event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
event_name | string | yes | The name of the event for tracking events |
event_timestamp | timestamp | no | The exact event timestamp |
edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
edge_country | string | no | The country of the edge location |
edge_latitude | float | no | The latitude of the edge location |
edge_longitude | float | no | The longitude of the edge location |
request_path | string | no | The path of the requested URL |
page_view_id | string | no | The unique pageview id |
daily_visitor_id | string | no | The unique daily visitor id |
event_data | string | no | The stringified event payload for tracking events |