[Impression Analytics]

[11/06/2025]

Problem Statement

The next large set of data that we can leverage for internal and external use is our user impressions. These include active impressions, like button clicks or banner clicks, and passive impressions, like an element loading on the page or time spent on certain pages. We need to build out a system that gathers all of these impressions, stores them, and allows them to be visualized. This design should take the following into consideration:

  1. Scalability
    • Will this be able to support 10k users, 100k users, and so on?
  2. Cost
    • Is this reasonable for our current number of users, and does it scale well? Costs can't balloon with increased usage.
    • We also want a cost that is reasonable for our current data needs.
  3. Ease of integration
    • Does it work well with our existing architecture?
  4. Flexibility
    • How easy is it to query the data in different ways or to add new impressions?
  5. Data Freshness
    • Do we need real-time queries or daily reports?
      • I am erring toward daily reports at the moment
  6. Front End Speed
    • We do not want the front end lagging due to impression gathering
  7. Data quality
    • Make sure impressions are meaningful (e.g. double clicks or network retries are counted once instead of many times)
  8. User Consent?
    • Is there anything legal (e.g. privacy/consent requirements) we need to recognize here?

Proposed Solution

Overview of Redshift vs. RDS vs. S3

  • Redshift is recommended when we have close to a PB of data; currently our DB has a few GB. Its starting cost is fairly high, and we can't leverage its massively parallel processing (MPP) with a relatively small amount of data.
  • RDS is a transactional DB and supports OLTP operations. This impression data will rarely need to be edited or deleted, so OLAP-style storage is preferred. S3 also scales better, and data can easily be moved to Redshift Spectrum (a middle ground between Redshift and S3) in the future.
  • S3 can store data in Parquet files, which can then be easily read and queried by AWS Athena. Glue Crawlers can be used to build schemas from the S3 Parquet files.

Plan

Have code in the front end that batches impression events and sends them to Firehose. Firehose can run a transformation Lambda that enriches the data and filters out duplicate events. Firehose then batches the data and writes it to S3 for storage, with specific partitions by date and files stored as Parquet. Read this data with AWS Glue Crawlers for schema discovery and use Athena for queries. This can then be linked up to the dev panel or QuickSight.
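
The client-side batching described above can be sketched as follows. This is a minimal sketch, not a final implementation: the class name, thresholds, and the injected `sendBatch` function are all illustrative assumptions.

```javascript
// Minimal client-side impression batcher (sketch; names and thresholds are
// placeholders, not final values). Events accumulate in memory and are
// flushed either when the batch fills up or on a timer, so the UI thread
// never blocks on network I/O.
class ImpressionBatcher {
  constructor(sendBatch, { maxBatchSize = 25, flushIntervalMs = 10000 } = {}) {
    this.sendBatch = sendBatch;        // async function that posts a batch to our ingest endpoint
    this.maxBatchSize = maxBatchSize;  // flush when this many events are queued
    this.flushIntervalMs = flushIntervalMs;
    this.queue = [];
    this.timer = null;
  }

  // Record one impression; flushes automatically when the batch is full.
  track(event) {
    this.queue.push({ ...event, clientTs: Date.now() });
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  // Send whatever is queued and reset the timer.
  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    // Fire-and-forget so rendering never waits on analytics.
    Promise.resolve(this.sendBatch(batch)).catch(() => { /* drop on failure */ });
  }
}
```

In the browser, `sendBatch` could wrap `fetch` with `keepalive: true` (or `navigator.sendBeacon`) so a final batch can survive page unloads; that choice is part of the batching logic still to be determined.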

Architectural & Technical Details

  1. Capture impressions with JavaScript code
    • async; events are added to a batch, and the batch is sent asynchronously
  2. Batch and send data
    • client-side batching
  3. Data Handling Backend
    • Firehose, with a Lambda transformation if needed
  4. Queue for processing
    • Kinesis Firehose for batching
    • Lambda transformation for processing
  5. Batch write to storage
    • Firehose writes to S3 every 15 minutes or at 100 MB, whichever comes first
  6. Storage for Data
    • S3
    • Parquet - columnar binary format
      • smaller than JSON and faster to query
      • ideal file size is 100-500 MB
      • need to include date partitions for efficiency
  7. Partitioning Strategies
    • Date partitioning
      • year/month/day
  8. Discover schema
    • Use AWS Glue Crawler
      • Automatically discovers schemas
      • Allows for easy SQL queries by Athena
  9. Query Data
    • AWS Athena
      • cheap at our scale - pay per query
      • $5 / TB of data scanned
  10. Visualize Data
    • Amazon QuickSight
      • priced per user
      • AWS's BI tool; integrates with Athena
    • Dev panel
      • manually create dashboards
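
Step 4's Lambda transformation could look like the sketch below: a Firehose transformation handler (Node.js) that drops duplicate events within a buffered batch, addressing the data-quality concern about double clicks and retries. The `eventId` field is an assumption about our event schema, and this only dedupes within one Firehose buffer; cross-batch dedup would need an external store.

```javascript
// Sketch of a Firehose data-transformation Lambda. Firehose passes records
// with base64-encoded `data`; each record must be returned with its original
// recordId and a result of 'Ok', 'Dropped', or 'ProcessingFailed'.
// In the Lambda deployment this would be wired up as `exports.handler = handler;`.
const handler = async (event) => {
  const seen = new Set(); // eventIds observed in this batch
  const records = event.records.map((record) => {
    const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    // Drop repeats of the same client-generated event id (double clicks, retries).
    if (payload.eventId && seen.has(payload.eventId)) {
      return { recordId: record.recordId, result: 'Dropped' };
    }
    if (payload.eventId) seen.add(payload.eventId);
    // Enrichment hook: stamp server-side processing time.
    payload.processedAt = new Date().toISOString();
    // Newline-delimited JSON keeps downstream Parquet conversion happy.
    const data = Buffer.from(JSON.stringify(payload) + '\n').toString('base64');
    return { recordId: record.recordId, result: 'Ok', data };
  });
  return { records };
};
```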

Cost Estimate

Service            1M impressions/month       10M impressions/month     100M impressions/month
Kinesis Firehose   $0.29 (10GB)               $2.90 (100GB)             $29 (1TB)
Transform Lambda   $0.02 (8.6K invocations)   $0.10 (86K invocations)   $1 (864K invocations)
S3 Storage         $0.50                      $2                        $10
Glue Crawler       $0.44 (daily runs)         $0.44                     $0.44
Athena Queries     $2-5                       $5-10                     $10-25
Data Transfer      $0.50                      $2                        $10
TOTAL              ~$4-7/month                ~$13-18/month             ~$60-75/month

Key Cost Drivers:

  • Firehose: $0.029 per GB ingested (very cost-effective at small scale)
  • Transform Lambda: invoked per Firehose buffer (not per request), keeping costs low
  • Athena: $5 per TB scanned - use date partitioning to minimize scanned data
  • Optional: QuickSight adds $9/user/month (reader) or $24/user/month (author)
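
The Athena cost driver above can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes a roughly 1 TB/year dataset (the 100M-impressions tier) and uniform daily volume; the point is that querying one day's partition scans ~1/365 of the data a full scan would.

```javascript
// Back-of-envelope Athena cost check (sketch, illustrative numbers only).
// Athena charges $5 per TB of data scanned, which is why date partitioning
// matters: a partition-pruned query scans only the matching prefixes.
const ATHENA_USD_PER_TB = 5;
const queryCostUsd = (bytesScanned) => (bytesScanned / 1e12) * ATHENA_USD_PER_TB;

const fullScan = queryCostUsd(1e12);       // full scan of a 1 TB table: $5.00
const oneDay  = queryCostUsd(1e12 / 365);  // one day's partition: ~$0.014
```

Note that per-query minimum billing granularity means tiny queries won't actually round to fractions of a cent, but the relative savings from partition pruning hold.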

Summary

  • This is a very cheap option for our impression analytics, and it scales well as we add many more users.

Next Steps

  • Start creating impression logic: identify all areas in the Client where impressions can be tracked and determine batching logic
    • Impressions: the number of times elements are loaded on the page; these are elements that don't need to be clicked/interacted with
    • Reach: the unique number of users that interact with the app in any form; can also be measured at a finer grain (e.g. the fight button has a reach of 32 whereas trends has a reach of 24)
    • Engagements: all active actions taken by the user, including button clicks; we already track some of this when things occur on the backend, like when a fight is created
  • Create the AWS Firehose stream, link it to the Client, and set up its parameters (timeout, batch size, Parquet transformation to S3)
  • Create S3 bucket for storage
    • Figure out partition logic and organization of file storage
  • Set up AWS Glue Crawlers for schema recognition
  • Set up Amazon Athena and simple queries to test - linked to S3
  • Determine whether QuickSight or the dev panel is a better fit for us
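
As a starting point for the partition-logic step above, a sketch of a Hive-style date prefix builder is below. The bucket-relative prefix name `impressions/` is a placeholder; `key=value` path segments are what let Glue and Athena prune partitions by date.

```javascript
// Sketch of the year/month/day partition layout (prefix name is a
// placeholder, not final). Hive-style key=value segments allow Glue
// Crawlers to register partitions and Athena to prune them.
const partitionPrefix = (date) => {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(date.getUTCDate()).padStart(2, '0');
  return `impressions/year=${yyyy}/month=${mm}/day=${dd}/`;
};
// e.g. partitionPrefix(new Date(Date.UTC(2025, 10, 6)))
//   -> "impressions/year=2025/month=11/day=06/"
```

Firehose's S3 prefix configuration can emit a similar timestamped layout natively, so this function mainly documents the target layout we would configure.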

Action Items

  • Need review of this contribution and feedback on design
  • Start with basic impressions tracking and getting minimal version of this pipeline working, then can add more and more impression/reach/engagement tracking

Open Questions

  1. Firehose is cheaper than setting up API Gateway + Lambda + SQS for data ingestion into S3. However, it has a maximum buffer interval of 15 minutes for uploads to S3, which could result in small files being uploaded. This shouldn't be too much of a problem, but I want to highlight it anyway.
  2. Do we get any perks for using new AWS tools? I know this was mentioned a while ago, and this design would use multiple new AWS services.

Approvals

This design needs architectural approval from Trace Carrasco and product approval from Filip Pacyna / Troy Lenihan.

  • Trace Carrasco
  • Filip Pacyna
  • Troy Lenihan