[Impression Analytics]

[11/06/2025]

Problem Statement

The next large set of data that we can leverage for internal and external use is our user impressions. These include active impressions, like button clicks or banner clicks, and passive impressions, like an element loading on the page or time spent on certain pages. We need to build out a system that gathers all of these impressions, stores them, and allows them to be visualized. This design should take the following into consideration:

  1. Scalability
    • Will this be able to support 10k users, 100k users, and so on?
  2. Cost
    • Is this reasonable for our current number of users, and does it scale well? Costs can't balloon with increased usage.
    • We also want a cost that is reasonable for our current data needs.
  3. Ease of integration
    • Does it work well with our existing architecture?
  4. Flexibility
    • How easy is it to query the data in different ways or to add new impressions?
  5. Data Freshness
    • Do we need real-time queries or daily reports?
      • I am erring toward daily reports at the moment
  6. Front End Speed
    • We do not want the front end lagging due to impression gathering
  7. Data quality
    • Make sure impressions are meaningful (e.g. double clicks or network retries are counted once instead of many times)
  8. User Consent?
    • Is there anything legal (e.g. privacy/consent requirements) we need to recognize here?

Proposed Solution

Overview of Redshift vs. RDS vs. S3

  • Redshift is recommended when we have close to a PB of data; currently our DB has a few GB. Its starting cost is fairly high, and we can't leverage its massively parallel processing (MPP) with a relatively small amount of data.
  • RDS is a transactional DB and supports OLTP operations. This impression data will rarely need to be edited or deleted, so OLAP-style storage is preferred. S3 also scales better, and data can easily be moved to Redshift Spectrum (a middle ground between Redshift and S3) in the future.
  • S3 can store data in Parquet files, which can then be easily read and queried by AWS Athena. Glue Crawlers can be used to build schemas from the S3 Parquet files.

Plan

Have code in the front end that batches impression events and sends them to Firehose. Firehose can run a transformation Lambda that enriches the data and filters out duplicate events. Firehose then batches the data and writes it to S3 for storage, with specific partitions by date and files stored as Parquet. Read this data with AWS Glue Crawlers for schema discovery and use Athena for queries. This can then be linked up to the dev panel or QuickSight.
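
The client-side batching described above can be sketched as follows. This is a minimal sketch, not a final implementation: the class name, thresholds, and the injected `sendBatch` function are all illustrative assumptions.

```javascript
// Minimal client-side impression batcher (sketch; names and thresholds are
// placeholders, not final values). Events accumulate in memory and are
// flushed either when the batch fills up or on a timer, so the UI thread
// never blocks on network I/O.
class ImpressionBatcher {
  constructor(sendBatch, { maxBatchSize = 25, flushIntervalMs = 10000 } = {}) {
    this.sendBatch = sendBatch;        // async function that posts a batch to our ingest endpoint
    this.maxBatchSize = maxBatchSize;  // flush when this many events are queued
    this.flushIntervalMs = flushIntervalMs;
    this.queue = [];
    this.timer = null;
  }

  // Record one impression; flushes automatically when the batch is full.
  track(event) {
    this.queue.push({ ...event, clientTs: Date.now() });
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  // Send whatever is queued and reset the timer.
  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    // Fire-and-forget so rendering never waits on analytics.
    Promise.resolve(this.sendBatch(batch)).catch(() => { /* drop on failure */ });
  }
}
```

In the browser, `sendBatch` could wrap `fetch` with `keepalive: true` (or `navigator.sendBeacon`) so a final batch can survive page unloads; that choice is part of the batching logic still to be determined.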

Architectural & Technical Details

  1. Capture impressions with JavaScript code
    • async; events are added to a batch, and the batch is sent asynchronously
  2. Batch and send data
    • client-side batching
  3. Data Handling Backend
    • Firehose, with a Lambda transformation if needed
  4. Queue for processing
    • Kinesis Firehose for batching
    • Lambda transformation for processing
  5. Batch write to storage
    • Firehose writes to S3 every 15 minutes or at 100 MB, whichever comes first
  6. Storage for Data
    • S3
    • Parquet - columnar binary format
      • smaller than JSON and faster to query
      • ideal file size is 100-500 MB
      • need to include date partitions for efficiency
  7. Partitioning Strategies
    • Date partitioning
      • year/month/day
  8. Discover schema
    • Use AWS Glue Crawler
      • Automatically discovers schemas
      • Allows for easy SQL queries by Athena
  9. Query Data
    • AWS Athena
      • cheap at our scale - pay per query
      • $5 / TB of data scanned
  10. Visualize Data
    • Amazon QuickSight
      • priced per user
      • AWS's BI tool; integrates with Athena
    • Dev panel
      • manually create dashboards
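
Step 4's Lambda transformation could look like the sketch below: a Firehose transformation handler (Node.js) that drops duplicate events within a buffered batch, addressing the data-quality concern about double clicks and retries. The `eventId` field is an assumption about our event schema, and this only dedupes within one Firehose buffer; cross-batch dedup would need an external store.

```javascript
// Sketch of a Firehose data-transformation Lambda. Firehose passes records
// with base64-encoded `data`; each record must be returned with its original
// recordId and a result of 'Ok', 'Dropped', or 'ProcessingFailed'.
// In the Lambda deployment this would be wired up as `exports.handler = handler;`.
const handler = async (event) => {
  const seen = new Set(); // eventIds observed in this batch
  const records = event.records.map((record) => {
    const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));
    // Drop repeats of the same client-generated event id (double clicks, retries).
    if (payload.eventId && seen.has(payload.eventId)) {
      return { recordId: record.recordId, result: 'Dropped' };
    }
    if (payload.eventId) seen.add(payload.eventId);
    // Enrichment hook: stamp server-side processing time.
    payload.processedAt = new Date().toISOString();
    // Newline-delimited JSON keeps downstream Parquet conversion happy.
    const data = Buffer.from(JSON.stringify(payload) + '\n').toString('base64');
    return { recordId: record.recordId, result: 'Ok', data };
  });
  return { records };
};
```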

Cost Estimate

Service            1M impressions/month       10M impressions/month     100M impressions/month
Kinesis Firehose   $0.29 (10GB)               $2.90 (100GB)             $29 (1TB)
Transform Lambda   $0.02 (8.6K invocations)   $0.10 (86K invocations)   $1 (864K invocations)
S3 Storage         $0.50                      $2                        $10
Glue Crawler       $0.44 (daily runs)         $0.44                     $0.44
Athena Queries     $2-5                       $5-10                     $10-25
Data Transfer      $0.50                      $2                        $10
TOTAL              ~$4-7/month                ~$13-18/month             ~$60-75/month

Key Cost Drivers:

  • Firehose: $0.029 per GB ingested (very cost-effective at small scale)
  • Transform Lambda: invoked per Firehose buffer (not per request), keeping costs low
  • Athena: $5 per TB scanned - use date partitioning to minimize scanned data
  • Optional: QuickSight adds $9/user/month (reader) or $24/user/month (author)
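
The Athena cost driver above can be sanity-checked with some back-of-envelope arithmetic. The sketch below assumes a roughly 1 TB/year dataset (the 100M-impressions tier) and uniform daily volume; the point is that querying one day's partition scans ~1/365 of the data a full scan would.

```javascript
// Back-of-envelope Athena cost check (sketch, illustrative numbers only).
// Athena charges $5 per TB of data scanned, which is why date partitioning
// matters: a partition-pruned query scans only the matching prefixes.
const ATHENA_USD_PER_TB = 5;
const queryCostUsd = (bytesScanned) => (bytesScanned / 1e12) * ATHENA_USD_PER_TB;

const fullScan = queryCostUsd(1e12);       // full scan of a 1 TB table: $5.00
const oneDay  = queryCostUsd(1e12 / 365);  // one day's partition: ~$0.014
```

Note that per-query minimum billing granularity means tiny queries won't actually round to fractions of a cent, but the relative savings from partition pruning hold.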

Summary

  • This is a very cheap option for our impression analytics, and it scales well as we add many more users.

Next Steps

  • Start creating impression logic: identify all areas in the Client where impressions can be tracked and determine batching logic
    • Impressions: the number of times elements are loaded on the page; these are elements that don't need to be clicked/interacted with
    • Reach: the unique number of users that interact with the app in any form; can also be measured at a finer grain (e.g. the fight button has a reach of 32 whereas trends has a reach of 24)
    • Engagements: all active actions taken by the user, including button clicks; we already track some of this when things occur on the backend, like when a fight is created
  • Create the AWS Firehose stream, link it to the Client, and set up its parameters (timeout, batch size, Parquet transformation to S3)
  • Create S3 bucket for storage
    • Figure out partition logic and organization of file storage
  • Set up AWS Glue Crawlers for schema recognition
  • Set up Amazon Athena and simple queries to test - linked to S3
  • Determine whether QuickSight or the dev panel is a better fit for us
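
As a starting point for the partition-logic step above, a sketch of a Hive-style date prefix builder is below. The bucket-relative prefix name `impressions/` is a placeholder; `key=value` path segments are what let Glue and Athena prune partitions by date.

```javascript
// Sketch of the year/month/day partition layout (prefix name is a
// placeholder, not final). Hive-style key=value segments allow Glue
// Crawlers to register partitions and Athena to prune them.
const partitionPrefix = (date) => {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0');
  const dd = String(date.getUTCDate()).padStart(2, '0');
  return `impressions/year=${yyyy}/month=${mm}/day=${dd}/`;
};
// e.g. partitionPrefix(new Date(Date.UTC(2025, 10, 6)))
//   -> "impressions/year=2025/month=11/day=06/"
```

Firehose's S3 prefix configuration can emit a similar timestamped layout natively, so this function mainly documents the target layout we would configure.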

Action Items

  • Need review of this contribution and feedback on design
  • Start with basic impressions tracking and getting minimal version of this pipeline working, then can add more and more impression/reach/engagement tracking

Open Questions

  1. Firehose is cheaper than setting up API Gateway + Lambda + SQS for data ingestion into S3. However, it has a maximum buffer interval of 15 minutes for uploads to S3, which could result in small files being uploaded. This shouldn't be too much of a problem, but I want to highlight it anyway.
  2. Do we get any perks for using new AWS tools? I know this was mentioned a while ago, and this design would use multiple new AWS services.

Approvals

This design needs architectural approval from Trace Carrasco and product approval from Filip Pacyna / Troy Lenihan.

  • Trace Carrasco
  • Filip Pacyna
  • Troy Lenihan