Skip to content

Catalog layer

The Kinesis Data Firehose Delivery Stream needs a Glue table that holds the schema of the Parquet files to be able to produce them (incoming_events table). The stats and the events tables are aggregated daily from the base incoming_events table via cron jobs scheduled by Amazon EventBridge Rules at 00:16 AM.

incoming_events table

This table stores the events that are the result of the data transformation and dynamic partitioning Lambda function. The schema for the table incoming_events looks like this:

Column nameData typeIs partition key?Description
domain_namestringyesThe domain name
event_datestringyesThe date of the event (YYYY-MM-DD), as string
event_typestringyesThe type of the event (pageview or track)
event_yearintnoThe year of the event_date (YYYY)
event_monthintnoThe month of the event (MM)
event_dayintnoThe day of the event (DD)
event_timestamptimestampnoThe exact event timestamp
arrival_timestamptimestampnoThe exact timestamp when the event arrived in the Kinesis Data Stream
arrival_delay_msintnoThe difference between event_timestamp and arrival_timestamp in milliseconds
edge_citystringnoThe name of the edge city (all edge location info is derived from the x-edge-location field in the logs)
edge_statestringnoThe state of the edge location
edge_countrystringnoThe country of the edge location
edge_country_codestringnoThe country code of the edge location
edge_latitudefloatnoThe latitude of the edge location
edge_longitudefloatnoThe longitude of the edge location
edge_idstringnoThe original id of the edge location
referrerstringnoThe referrer
referrer_domain_namestringnoThe domain name of the referrer
browser_namestringnoThe name of the browser
browser_versionstringnoThe version of the browser
browser_os_namestringnoThe OS name of the browser
browser_os_versionstringnoThe OS version of the browser
browser_timezonestringnoThe timezone of the browser
browser_languagestringnoThe language of the browser
device_typestringnoThe device type
device_vendorstringnoThe device vendor
device_outer_resolutionstringnoThe outer resolution of the device
device_inner_resolutionstringnoThe inner resolution of the device
device_color_depthintnoThe color depth of the device
device_platformstringnoThe platform of the device
device_memoryfloatnoThe memory of the device (in MB)
device_coresintnoThe number of cores of the device
utm_sourcestringnoIdentifies which site sent the traffic
utm_campaignstringnoIdentifies a specific product promotion or strategic campaign
utm_mediumstringnoIdentifies what type of link was used, such as cost per click or email
utm_contentstringnoIdentifies what specifically was clicked to bring the user to the site
utm_termstringnoIdentifies search terms
request_urlstringnoThe full requested URL
request_pathstringnoThe path of the requested URL
request_query_stringstringnoThe query string of the requested URL
request_bytesintnoThe size of the request in bytes
request_status_codeintnoThe HTTP status code of the request
request_cache_statusstringnoThe CloudFront cache status
request_delivery_time_msintnoThe time in ms it took for CloudFront to complete the request
request_asnintnoThe ASN of the requestor
request_is_botintnoIf the request is categorized as a bot, the value will be 1, if not 0
event_namestringnoThe name of the event for tracking events
event_datastringnoThe stringified event payload for tracking events
page_view_idstringnoThe unique pageview id
daily_page_view_idstringnoThe unique daily pageview id
daily_visitor_idstringnoThe unique daily visitor id

stats table

The pageviews and visitor aggregation table. Its schema looks like this:

Column nameData typeIs partition key?Description
domain_namestringyesThe domain name
event_datestringyesThe date of the event (YYYY-MM-DD), as string
event_hourintnoThe hour part of the event timestamp
edge_citystringnoThe name of the edge city (all edge location info is derived from the x-edge-location field in the logs)
edge_countrystringnoThe country of the edge location
edge_latitudefloatnoThe latitude of the edge location
edge_longitudefloatnoThe longitude of the edge location
referrer_domain_namestringnoThe domain name of the referrer
browser_namestringnoThe name of the browser
browser_os_namestringnoThe OS name of the browser
device_typestringnoThe device type
device_vendorstringnoThe device vendor
utm_sourcestringnoIdentifies which site sent the traffic
utm_campaignstringnoIdentifies a specific product promotion or strategic campaign
utm_mediumstringnoIdentifies what type of link was used, such as cost per click or email
utm_contentstringnoIdentifies what type of link was used, such as cost per click or email
utm_termstringnoIdentifies search terms
request_pathstringnoThe path of the requested URL
page_view_cntintnoThe number of page views
visitor_cntintnoThe number of daily visitors
bounces_cntintnoThe number of bounces (visited only one page)
visit_duration_sec_avgintnoThe average duration of a visit (in seconds)

events table

The schema for the table events looks like this:

Column nameData typeIs partition keyDescription
domain_namestringyesThe domain name
event_datestringyesThe date of the event (YYYY-MM-DD), as string
event_namestringyesThe name of the event for tracking events
event_timestamptimestampnoThe exact event timestamp
edge_citystringnoThe name of the edge city (all edge location info is derived from the x-edge-location field in the logs)
edge_countrystringnoThe country of the edge location
edge_latitudefloatnoThe latitude of the edge location
edge_longitudefloatnoThe longitude of the edge location
request_pathstringnoThe path of the requested URL
page_view_idstringnoThe unique pageview id
daily_visitor_idstringnoThe unique daily visitor id
event_datastringnoThe stringified event payload for tracking events