Catalog layer
The Kinesis Data Firehose Delivery Stream needs a Glue table that holds the schema of the Parquet files to be able to produce them (incoming_events table). The stats and the events tables are aggregated daily from the base incoming_events table via cron jobs scheduled by Amazon EventBridge Rules at 00:16 AM.
incoming_events table
This table stores the events that are the result of the data transformation and dynamic partitioning Lambda function. The schema for the table incoming_events looks like this:
| Column name | Data type | Is partition key? | Description |
|---|---|---|---|
| domain_name | string | yes | The domain name |
| event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
| event_type | string | yes | The type of the event (pageview or track) |
| event_year | int | no | The year of the event_date (YYYY) |
| event_month | int | no | The month of the event (MM) |
| event_day | int | no | The day of the event (DD) |
| event_timestamp | timestamp | no | The exact event timestamp |
| arrival_timestamp | timestamp | no | The exact timestamp when the event arrived in the Kinesis Data Stream |
| arrival_delay_ms | int | no | The difference between event_timestamp and arrival_timestamp in milliseconds |
| edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
| edge_state | string | no | The state of the edge location |
| edge_country | string | no | The country of the edge location |
| edge_country_code | string | no | The country code of the edge location |
| edge_latitude | float | no | The latitude of the edge location |
| edge_longitude | float | no | The longitude of the edge location |
| edge_id | string | no | The original id of the edge location |
| referrer | string | no | The referrer |
| referrer_domain_name | string | no | The domain name of the referrer |
| browser_name | string | no | The name of the browser |
| browser_version | string | no | The version of the browser |
| browser_os_name | string | no | The OS name of the browser |
| browser_os_version | string | no | The OS version of the browser |
| browser_timezone | string | no | The timezone of the browser |
| browser_language | string | no | The language of the browser |
| device_type | string | no | The device type |
| device_vendor | string | no | The device vendor |
| device_outer_resolution | string | no | The outer resolution of the device |
| device_inner_resolution | string | no | The inner resolution of the device |
| device_color_depth | int | no | The color depth of the device |
| device_platform | string | no | The platform of the device |
| device_memory | float | no | The memory of the device (in MB) |
| device_cores | int | no | The number of cores of the device |
| utm_source | string | no | Identifies which site sent the traffic |
| utm_campaign | string | no | Identifies a specific product promotion or strategic campaign |
| utm_medium | string | no | Identifies what type of link was used, such as cost per click or email |
| utm_content | string | no | Identifies what specifically was clicked to bring the user to the site |
| utm_term | string | no | Identifies search terms |
| request_url | string | no | The full requested URL |
| request_path | string | no | The path of the requested URL |
| request_query_string | string | no | The query string of the requested URL |
| request_bytes | int | no | The size of the request in bytes |
| request_status_code | int | no | The HTTP status code of the request |
| request_cache_status | string | no | The CloudFront cache status |
| request_delivery_time_ms | int | no | The time in ms it took for CloudFront to complete the request |
| request_asn | int | no | The ASN of the requestor |
| request_is_bot | int | no | If the request is categorized as a bot, the value will be 1, if not 0 |
| event_name | string | no | The name of the event for tracking events |
| event_data | string | no | The stringified event payload for tracking events |
| page_view_id | string | no | The unique pageview id |
| daily_page_view_id | string | no | The unique daily pageview id |
| daily_visitor_id | string | no | The unique daily visitor id |
stats table
The pageviews and visitor aggregation table. Its schema looks like this:
| Column name | Data type | Is partition key? | Description |
|---|---|---|---|
| domain_name | string | yes | The domain name |
| event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
| event_hour | int | no | The hour part of the event timestamp |
| edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
| edge_country | string | no | The country of the edge location |
| edge_latitude | float | no | The latitude of the edge location |
| edge_longitude | float | no | The longitude of the edge location |
| referrer_domain_name | string | no | The domain name of the referrer |
| browser_name | string | no | The name of the browser |
| browser_os_name | string | no | The OS name of the browser |
| device_type | string | no | The device type |
| device_vendor | string | no | The device vendor |
| utm_source | string | no | Identifies which site sent the traffic |
| utm_campaign | string | no | Identifies a specific product promotion or strategic campaign |
| utm_medium | string | no | Identifies what type of link was used, such as cost per click or email |
| utm_content | string | no | Identifies what type of link was used, such as cost per click or email |
| utm_term | string | no | Identifies search terms |
| request_path | string | no | The path of the requested URL |
| page_view_cnt | int | no | The number of page views |
| visitor_cnt | int | no | The number of daily visitors |
| bounces_cnt | int | no | The number of bounces (visited only one page) |
| visit_duration_sec_avg | int | no | The average duration of a visit (in seconds) |
events table
The schema for the table events looks like this:
| Column name | Data type | Is partition key | Description |
|---|---|---|---|
| domain_name | string | yes | The domain name |
| event_date | string | yes | The date of the event (YYYY-MM-DD), as string |
| event_name | string | yes | The name of the event for tracking events |
| event_timestamp | timestamp | no | The exact event timestamp |
| edge_city | string | no | The name of the edge city (all edge location info is derived from the x-edge-location field in the logs) |
| edge_country | string | no | The country of the edge location |
| edge_latitude | float | no | The latitude of the edge location |
| edge_longitude | float | no | The longitude of the edge location |
| request_path | string | no | The path of the requested URL |
| page_view_id | string | no | The unique pageview id |
| daily_visitor_id | string | no | The unique daily visitor id |
| event_data | string | no | The stringified event payload for tracking events |