Why Most Performance Tests Fail at the Foundation

Here is the pattern I have seen repeated hundreds of times across 25 years of performance engineering. A team is told their system needs load testing. Someone installs JMeter or Gatling. They record a few user journeys. They configure the test for 100 virtual users because it is a round number. They run it, collect response times, produce a report, and move on. Everyone feels better. Nobody asks the question that matters: does this test simulate anything resembling what real users actually do?

It does not. And the consequences are predictable.

"100 virtual users" is not a load profile. It is a guess. It tells you nothing about what those users are doing, how quickly they are doing it, whether 100 is the right number, or whether steady concurrent load bears any resemblance to how your system is actually used. A system that handles 100 simultaneous logins might collapse under 100 simultaneous checkout transactions because the database contention pattern is completely different. The number of users is the least interesting dimension of load. What those users do, and in what proportions, is where the real information lives.

The consequences of testing without a load profile are severe and consistent:

  • False confidence. The test passes because it was never designed to find the problems that matter. Everyone signs off. Production fails anyway.
  • Missed bottlenecks. Without the correct transaction mix, the test exercises the wrong code paths. The payment service that handles 3% of transactions but causes 80% of contention never gets stressed.
  • Production surprises. The system that handled your test beautifully falls over on launch day because the actual traffic pattern looks nothing like what you simulated.
  • Wasted investment. The team spent weeks building and running tests that cannot answer the business question: will this system handle real production load?

Every step downstream of the load profile depends on it. Script design, data preparation, environment sizing, success criteria, results analysis: all of it is anchored to the load profile. If the profile is wrong, everything built on top of it is unanchored. You are generating data, not evidence.

Case Study: The Launch Day Surprise

A retail company tested their new e-commerce platform with 500 concurrent users browsing products. The test passed with flying colours: sub-second response times across the board. On launch day, the site collapsed within 20 minutes.

The problem was not capacity. It was the transaction mix. The test had 500 users browsing. In reality, a marketing campaign drove users directly to the checkout flow. The actual traffic was 60% add-to-cart and checkout, not 5%. The payment gateway, inventory check, and order write path had never been tested at anything close to real proportions. The browsing test proved the CDN worked. It said nothing about whether the system could process orders.

A load profile built from the marketing team's campaign plan would have caught this in the first test run.

What a Load Profile Actually Is

A load profile is a quantified description of how your system is used under the conditions you need to test. It translates real user behaviour into numbers that a testing tool can simulate. It answers six questions.

Transaction Mix: What Users Do, in What Proportion

Not all user actions are equal. On an e-commerce site, browsing might represent 60% of all transactions, search 20%, add-to-cart 12%, and checkout 8%. These proportions matter because each transaction exercises different parts of the system. Browsing hits the CDN and product catalogue. Checkout hits the payment gateway, inventory service, order database, and email notification system simultaneously. A test that runs all transactions in equal proportions is not simulating your system; it is simulating a system that does not exist.

The transaction mix should reflect production reality, not the order in which the development team built features.

Concurrency Model: How Many Users, Arriving How

Concurrent users is the most commonly cited metric and the most commonly misunderstood. There is a critical difference between users who are active on the system (executing transactions) and users who have a session open (mostly idle, occasionally clicking). A system with 10,000 logged-in users might have only 300 actively executing transactions at any moment. Those 300 active users are the load. The other 9,700 are holding sessions and consuming memory, which matters for a different reason, but they are not generating the transaction load that stresses the application tier.

Equally important is the arrival pattern. Do users arrive at a steady rate throughout the day, or do they arrive in bursts? A flash sale, a marketing email, or a news event can send thousands of users to your site within seconds. A steady-state test will not reveal how your system handles that burst.

Think Times and Pacing: Realistic Human Behaviour

A real user browses a product page for 15 to 45 seconds before clicking. They read search results. They fill in forms. This idle time between actions is think time, and it dramatically affects the load a given number of users places on a system. Remove think time and 100 virtual users behave like 2,000 real users, hammering the system with back-to-back requests at machine speed. The test results will look catastrophic, but they will not represent reality.

Think time is not optional. It is the difference between a realistic simulation and a denial-of-service attack on your own system. Calibrate it from production data (real user session recordings, analytics click intervals) or, at minimum, use evidence-based estimates.
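
To make that arithmetic concrete, here is a minimal sketch in Python using the closed workload model that most load tools follow. The response and think times are illustrative assumptions, not measurements from any particular system:

    # Closed workload model: each virtual user loops through
    # (response time + think time), so
    #   requests per second = users / (response_time + think_time)

    def requests_per_second(users: int, response_time_s: float, think_time_s: float) -> float:
        return users / (response_time_s + think_time_s)

    USERS = 100
    RESPONSE_TIME_S = 1.0    # assumed average response time (illustrative)
    THINK_TIME_S = 20.0      # assumed average think time (illustrative)

    realistic = requests_per_second(USERS, RESPONSE_TIME_S, THINK_TIME_S)
    machine_speed = requests_per_second(USERS, RESPONSE_TIME_S, 0.0)

    print(f"100 users with 20s think time: {realistic:.1f} req/s")      # ~4.8 req/s
    print(f"100 users with no think time:  {machine_speed:.1f} req/s")  # 100.0 req/s

    # Removing think time multiplies the request rate roughly 21x here,
    # the footprint of about 2,100 realistic users.
    print(f"Equivalent realistic users: {machine_speed / realistic * USERS:.0f}")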

Data Volume Context: The Database Under Real Conditions

A database with 1,000 product records behaves differently from one with 5 million. Query plans change. Indexes that work at small scale become bottlenecks at large scale. Full-table scans that complete in milliseconds against a test database take minutes against production volumes. The load profile must account for the data context in which the system operates, not just the user traffic hitting it.

This is why I always recommend volume testing as a separate concern. Even at single-user concurrency, running your transactions against production-scale data reveals problems that no amount of concurrent load against an empty database will find.
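
One way to get there, as a minimal sketch: batch-load a test database to production-like row counts before running any transactions against it. The table name, columns and row count below are invented for illustration; in practice you would mirror your own schema and your documented production volumes:

    import sqlite3

    # Seed a test database to production-scale volume so that query plans,
    # index selectivity and scan costs resemble the real system.
    # Schema and volumes here are illustrative placeholders.

    conn = sqlite3.connect("loadtest.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL)"
    )

    TARGET_ROWS = 2_400_000   # match your documented production volume
    BATCH = 10_000

    for start in range(0, TARGET_ROWS, BATCH):
        rows = [
            (start + i,
             f"product-{start + i}",
             f"category-{(start + i) % 200}",
             round((start + i) % 500 + 0.99, 2))
            for i in range(BATCH)
        ]
        conn.executemany(
            "INSERT OR IGNORE INTO products (id, name, category, price) VALUES (?, ?, ?, ?)",
            rows,
        )
        conn.commit()

    conn.close()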

Time Patterns: Peak Hours, Seasonal Spikes, Growth Trends

Most systems do not experience uniform load throughout the day. An internal HR system peaks at 9am when everyone logs in. A streaming service peaks at 8pm. A tax filing system peaks in the final week before deadline. The load profile must specify which time period it represents and why that period was chosen.

Testing at the average load is a common and dangerous mistake. The average is, by definition, the load your system handles comfortably most of the time. The peak is what breaks it. Seasonal patterns add another layer: a system that handles Tuesday peaks easily might fail during the Christmas trading period when baseline traffic doubles.

Safety Margins: Headroom Beyond Observed Peaks

A load profile built exactly to observed peak levels tells you whether the system can handle today's peak. It does not tell you whether it can handle tomorrow's. Safety margins (typically 20% to 50% above observed peak, depending on the risk profile and growth trajectory) provide headroom for organic growth, unexpected traffic spikes, and the inevitable gap between the model and reality.

The margin should be proportionate to the risk. A system protecting revenue or public safety warrants a larger margin than an internal reporting tool. The margin should also be documented and justified, not arbitrary.

Self-Assessment Prompt

Can you describe your system's peak load in specific, quantified terms: the number of concurrent users, the transaction mix, the peak duration, and the safety margin? If you cannot, your performance testing is built on assumptions rather than evidence.

Where the Numbers Come From

A load profile is only as good as its data sources. The best profiles triangulate from multiple sources to build a picture that no single source provides alone.

Production Analytics

This is the gold standard. Google Analytics, Adobe Analytics, or equivalent tools tell you exactly how many users visit your site, what pages they view, how long they spend, and what they do. APM tools (Dynatrace, AppDynamics, Datadog, New Relic) provide the server-side view: transaction rates, response times, error rates, and resource utilisation. Server access logs give you raw request counts per endpoint, per hour, per day.

The key metrics to extract:

  • Hourly and daily active users over the past 3 to 6 months, including any seasonal peaks
  • Top user journeys by volume (the 10 to 15 paths that represent 80% or more of all activity)
  • Transaction rates per endpoint during the peak hour, not the daily average (a short log-parsing sketch follows this list)
  • Session duration and pages per session to derive realistic think times
  • Error rates and timeout patterns to identify existing capacity limits
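
The raw material for the third bullet above is often already in the web server's access logs. Here is a minimal log-parsing sketch in Python; the log path, timestamp format and regular expressions assume a combined-format log and are illustrative, so adjust them to your own format:

    import re
    from collections import Counter

    # Count requests per endpoint per hour from an access log, then report
    # the busiest hour and its endpoint breakdown. Log path and regexes are
    # illustrative; real logs will need their own parsing rules.

    HOUR = re.compile(r'\[(\d{2}/\w{3}/\d{4}):(\d{2})')
    PATH = re.compile(r'"(?:GET|POST|PUT|DELETE) (\S+?)(?:\?\S*)? HTTP')

    hourly = Counter()    # (date, hour) -> total requests
    by_endpoint = {}      # (date, hour) -> Counter of endpoint paths

    with open("access.log") as f:
        for line in f:
            when = HOUR.search(line)
            what = PATH.search(line)
            if not (when and what):
                continue
            key = (when.group(1), when.group(2))
            hourly[key] += 1
            by_endpoint.setdefault(key, Counter())[what.group(1)] += 1

    peak_hour, peak_total = hourly.most_common(1)[0]
    print(f"Peak hour: {peak_hour[0]} {peak_hour[1]}:00 with {peak_total} requests")
    for endpoint, count in by_endpoint[peak_hour].most_common(10):
        print(f"  {endpoint:40s} {count:8d}  ({count / peak_total:.1%} of mix)")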

Business Stakeholders

Production data tells you what happened. Stakeholders tell you what is about to happen. The marketing team knows a campaign is launching next month that will drive 3x normal traffic to a specific landing page. The product team knows a new feature will shift user behaviour. The sales team knows a major client is onboarding 50,000 users in Q3.

This information rarely appears in analytics dashboards. You have to ask for it, and you have to ask the right people. In my experience, the most valuable 30 minutes of any load profiling exercise is a conversation with the head of marketing and the head of product. They know things the data does not yet reflect.

Historical Incident Data

Every production incident is a data point. If the system fell over at 2,000 concurrent users last Black Friday, that is a hard boundary. If response times degraded beyond acceptable limits at 500 transactions per second during a flash sale, that tells you where the current ceiling sits. Incident reports, post-mortems, and support ticket volumes during peak events all feed into the profile.

Growth Projections

If the business plans to grow by 40% this year, the load profile must account for that. Growth is not always linear; a product launch or market expansion can create step changes. Planned features may also change the traffic pattern entirely. A new recommendation engine might increase page views per session. A mobile app launch might shift the protocol mix from web to API.

Competitive Benchmarks

Industry data provides useful context, particularly for user expectations. Research consistently shows that users expect pages to load in under 2 to 3 seconds and will abandon after 3 to 5 seconds. Competitor performance (measured through publicly available tools) sets the bar your users compare you against, whether you like it or not.

When Production Data Does Not Exist

New systems present a challenge. There is no production data to analyse. In this situation, three approaches work in combination:

  • Stakeholder workshops. Bring together product, marketing, sales, and engineering to build a demand model from business projections. How many users are expected in month one, month six, month twelve? What will they do? This will be approximate, but an informed estimate is infinitely better than "let's test with 100 users."
  • Comparable systems. If you are building a new e-commerce platform, look at traffic patterns for similar platforms at similar scale. Industry benchmarks, case studies, and published architectures provide reference points. Your system will differ, but the order of magnitude will be informative.
  • Phased approach. Start with a conservative profile based on best estimates. Run the test. Instrument heavily. Launch with close monitoring. Use the first weeks of production data to refine the profile for the next round. This iterative approach acknowledges uncertainty honestly rather than pretending precision that does not exist.

Case Study: The Data That Was Already There

A financial services company asked me to load test a customer portal. The team had been testing with 200 concurrent users for two years. When I asked where the number came from, nobody knew. "It's what the previous tester set up."

I spent half a day in Google Analytics and their APM tool. The portal had 12,000 daily active users, with a peak hour between 9:00am and 10:00am during which 3,200 users were active simultaneously. The transaction mix was 40% account summary views, 25% transaction history searches, 15% fund transfers, 12% document downloads, and 8% profile updates. Average session duration was 8 minutes with a median of 4 page interactions.

The load profile based on this data was fundamentally different from "200 concurrent users." The test revealed that the document download service, which had never been stressed because it was only 12% of the mix, became a bottleneck at realistic proportions because it made synchronous calls to a legacy document management system with a connection pool limited to 20.

The data was there all along. Nobody had looked at it.

Self-Assessment Prompt

How many data sources feed your current load profile? If the answer is zero or one, your profile has blind spots. The minimum credible profile draws from production analytics and at least one business stakeholder conversation.

Building the Profile: A Step by Step Guide

Here is the process I follow on every engagement. The steps are sequential because each builds on the previous one.

  • Identify the business transactions. Start from the user's perspective, not the technical architecture. "Browse products" is a business transaction; "GET /api/v2/products?category=shoes&page=1" is a technical implementation detail. List the 10 to 20 transactions that represent how real users interact with the system. Group related API calls into the business transaction they support. A single "checkout" transaction might involve 8 API calls, but it is one unit of user behaviour.
  • Gather production metrics (or estimates). For each business transaction, collect the hourly volume during the peak period. Use analytics, APM tools, and server logs. If production data is unavailable, work with stakeholders to produce informed estimates. Document the source and confidence level of each number.
  • Define the peak period. Identify the specific time window the test will simulate. This should be the busiest period your system faces, or will face. "Peak hour on the busiest day of the busiest week" is a common choice for systems with seasonal patterns. For systems with more uniform load, the peak hour of a typical busy day may suffice. Document why this period was selected.
  • Calculate concurrent users and transaction rates. From the hourly volumes and average session duration, derive the number of users active on the system at any moment during the peak period. Calculate the transactions per second (TPS) for each transaction type. These are the numbers your testing tool needs; a worked calculation follows this list.
  • Apply safety margins. Add headroom above observed peaks. The margin depends on risk appetite and growth trajectory. For revenue-critical systems, I typically recommend 30% to 50%. For lower-risk internal systems, 20% may suffice. Document the margin and the rationale.
  • Document assumptions and establish a review cycle. Every load profile contains assumptions. Traffic will continue to follow current patterns. The marketing campaign will drive the projected volume. Growth will proceed at the estimated rate. Write these down. Schedule a review (quarterly for stable systems, monthly for rapidly changing ones) to validate assumptions against actual data and update the profile.
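
To illustrate steps 4 and 5, here is a worked calculation as a minimal Python sketch. The inputs are the same figures used in the example profile below (18,000 hourly active users, 8-minute average sessions, a 30% margin); the conversion from arrival rate to concurrency is the standard Little's Law relationship:

    # Steps 4 and 5: derive peak concurrency and per-transaction TPS targets
    # from hourly volumes. Figures match the example profile in this article.

    HOURLY_ACTIVE_USERS = 18_000
    AVG_SESSION_SECONDS = 8 * 60
    SAFETY_MARGIN = 0.30

    # Little's Law: concurrency = arrival rate x time spent in the system
    arrival_rate = HOURLY_ACTIVE_USERS / 3600            # users arriving per second
    baseline_concurrency = arrival_rate * AVG_SESSION_SECONDS
    target_concurrency = baseline_concurrency * (1 + SAFETY_MARGIN)

    print(f"Baseline peak concurrent users: {baseline_concurrency:.0f}")   # 2400
    print(f"Test target with 30% margin:    {target_concurrency:.0f}")     # 3120

    # Per-transaction TPS targets from the hourly volumes in the mix
    hourly_volumes = {
        "Browse / Search Products": 32_400,
        "Checkout (Payment)": 4_320,
        # ... remaining transactions follow the same calculation
    }
    for name, volume in hourly_volumes.items():
        tps_target = volume / 3600 * (1 + SAFETY_MARGIN)
        print(f"{name:26s} {tps_target:5.1f} TPS target")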

The Load Profile Document: What to Deliver

The output of this process is a load profile document that your entire team can reference. It should be concise, specific, and testable. Below is an example for a mid-sized e-commerce platform based on a composite of real engagements.

Example: Peak Hour Load Profile for E-Commerce Platform

Peak period: Wednesday 12:00 to 13:00 (lunchtime shopping peak, based on 6 months of analytics data).

Baseline peak concurrent users: 2,400 (derived from 18,000 hourly active users with average session duration of 8 minutes).

Safety margin: 30% (to account for planned Q4 marketing campaign and organic growth). Test target: 3,120 concurrent users.

Business Transaction        | % of Mix | Hourly Volume (Peak) | TPS Target (with 30% margin) | Avg Think Time (sec) | Concurrent Users (with margin)
----------------------------|----------|----------------------|------------------------------|----------------------|-------------------------------
Browse / Search Products    | 45%      | 32,400               | 11.7                         | 25                   | 1,404
View Product Detail         | 22%      | 15,840               | 5.7                          | 35                   | 686
Add to Cart                 | 12%      | 8,640                | 3.1                          | 10                   | 374
Checkout (Payment)          | 6%       | 4,320                | 1.6                          | 60                   | 187
Account Login               | 8%       | 5,760                | 2.1                          | 5                    | 250
Order History / Tracking    | 5%       | 3,600                | 1.3                          | 20                   | 156
Account Update / Settings   | 2%       | 1,440                | 0.5                          | 45                   | 63
Total                       | 100%     | 72,000               | 26.0                         | —                    | 3,120

Key assumptions documented:

  • Transaction mix based on Google Analytics data from January to June 2026, validated against APM transaction counts
  • Think times derived from median session interaction intervals in analytics (not averages, to exclude outlier idle sessions)
  • 30% safety margin accounts for Q4 marketing campaign (projected 20% traffic increase) plus 10% contingency
  • Database pre-populated to production volume: 2.4 million product records, 8.6 million customer accounts, 45 million order history records
  • Background load included: batch inventory sync (every 15 minutes), recommendation engine reindexing (continuous), third-party price feed updates (every 5 minutes)
  • Profile review scheduled for end of Q3 to incorporate actual Q2/Q3 traffic data before Q4 peak testing

This table drives every subsequent decision. The script design follows the transaction list. The data preparation follows the concurrent user counts (you need at least 3,120 unique user accounts, plus product search terms and payment methods). The test environment must handle the aggregate TPS. The success criteria map to each transaction type individually, because a 2-second checkout is acceptable but a 2-second product browse is not (users expect browsing to be near-instant).
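
To show how the table translates into tool configuration, here is a minimal sketch using Locust (one of several tools that could express this; the endpoint paths and payloads are invented placeholders, while the task weights mirror the % of mix column):

    from locust import HttpUser, task, between

    # Weighted transaction mix from the load profile table. Task weights mirror
    # the % of mix column; the wait_time range is a rough blend of the table's
    # think times. Endpoint paths and payloads are illustrative placeholders.

    class PeakHourShopper(HttpUser):
        wait_time = between(20, 30)

        @task(45)
        def browse_products(self):
            self.client.get("/products?category=shoes")

        @task(22)
        def view_product_detail(self):
            self.client.get("/products/12345")

        @task(12)
        def add_to_cart(self):
            self.client.post("/cart", json={"product_id": 12345, "qty": 1})

        @task(8)
        def login(self):
            self.client.post("/login", json={"user": "load-test-user", "password": "secret"})

        @task(6)
        def checkout(self):
            self.client.post("/checkout", json={"payment_method": "card"})

        @task(5)
        def order_history(self):
            self.client.get("/orders")

        @task(2)
        def update_account(self):
            self.client.post("/account/settings", json={"newsletter": True})

In practice each transaction would carry its own think time (the single wait_time range here is a simplification), and the run would ramp to the 3,120-user target derived above.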

Self-Assessment Prompt

Does your load profile exist as a documented table with specific numbers, or does it live informally as "we test with X users"? Can you trace each number in the table back to a data source? If your profile is informal or untraceable, that is the first thing to fix.

Common Mistakes

Over 25 years, I have seen the same mistakes repeated across industries and organisation sizes. Each one can invalidate an entire testing effort.

Confusing Concurrent Users with Transactions Per Second

These are related but different metrics. 1,000 concurrent users with 30-second think times generate a very different load from 1,000 concurrent users with 1-second think times. The first scenario produces roughly 33 TPS. The second produces 1,000 TPS. If your success criteria are expressed in concurrent users but your bottleneck is transaction throughput, you will miss it. Always define both.

Ignoring Think Time

Removing or minimising think time turns virtual users into robots that fire requests at machine speed. This creates artificial contention that does not exist in production. Worse, it makes your test results look terrible, which either causes unnecessary panic or (more commonly) teaches the team to distrust test results. Neither outcome is useful. Calibrate think times from real user behaviour data.

Testing Only Steady State

A system that handles 2,000 concurrent users at steady state may collapse when 2,000 users arrive within 60 seconds. Ramp-up behaviour exposes connection pool exhaustion, cache cold-start penalties, thread pool limits, and session creation bottlenecks that steady-state testing never touches. Include spike tests and ramp-up scenarios in your test plan, not just the sustained load plateau.
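
One way to express a spike scenario, extending the Locust sketch shown earlier (a minimal illustration; the stage durations, user counts and spawn rate are assumptions, and a real spike profile should be shaped by the arrival patterns in your own analytics):

    from locust import LoadTestShape

    # A simple spike shape: hold a modest baseline, jump to the full peak
    # target within about a minute, then hold it. Stage values are illustrative.

    class SpikeShape(LoadTestShape):
        stages = [
            (300, 500),      # first 5 minutes: 500 users warming caches
            (360, 3_120),    # one-minute spike to the full peak target
            (1_800, 3_120),  # hold the peak for the remainder of the run
        ]

        def tick(self):
            run_time = self.get_run_time()
            for end_time, users in self.stages:
                if run_time < end_time:
                    return (users, 100)   # (target user count, spawn rate per second)
            return None                   # stop the test once all stages complete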

Using Average Load Instead of Peak Load

If your system averages 500 concurrent users but peaks at 3,000 during the lunch hour, testing at 500 tells you nothing about whether the system survives the lunch hour. Always test at the peak, with safety margin. The average is the load your system handles easily. The peak is where it earns its keep.

Forgetting Background Load

Production systems do not serve user traffic in isolation. Batch jobs run overnight (and sometimes overrun into business hours). Integration feeds push and pull data continuously. Monitoring agents consume resources. Search indexes rebuild. Recommendation engines retrain. If your test does not include this background load, it is testing a cleaner environment than production will ever provide. Identify and simulate the background processes that run during your peak period.

"We'll Just Test with 1,000 Users"

Arbitrary round numbers are the most common substitute for a load profile. They are also the most dangerous, because they create an illusion of rigour. The test runs. Results are produced. Reports are written. Everyone feels confident. But the test answered the question "can the system handle 1,000 arbitrary users doing arbitrary things?" which is a question nobody actually needed answered. The question that matters is "can the system handle real peak traffic?" and only a load profile can answer that.

Case Study: The Batch Job Nobody Mentioned

A government digital service tested their citizen portal at 5,000 concurrent users and passed comfortably. Two weeks after launch, the system ground to a halt every morning at 8:30am with only 1,200 users online.

The cause was a data synchronisation batch job that ran at 8:00am, pulling records from a legacy mainframe system. The job locked critical database tables for 15 to 25 minutes while reconciling overnight changes. User requests that touched those tables queued behind the lock, causing cascading timeouts that consumed the application server thread pool.

The batch job had run for years on the legacy system without issue. When the new portal was built, nobody mentioned it to the performance testing team. It was not in the load profile because nobody asked about background processes. The fix was straightforward (restructuring the batch to use row-level locking), but the incident cost two weeks of emergency work and significant embarrassment during a high-profile public launch.

Connection to the Maturity Model

Load profiling is the thread that runs through the Non-Functional Testing Maturity Model. Its presence or absence is one of the clearest indicators of where an organisation sits on the capability spectrum.

  • At Level 1 (Reactive), no load profile exists. The team cannot describe what peak load looks like. Tests run against arbitrary numbers, if they run at all. There is no connection between the test and production reality.
  • At Level 2 (Repeatable), load profiles, if they exist, are based on guesswork or inherited from a previous tester without validation. The team tests consistently but against numbers that may have no relationship to real traffic.
  • At Level 3 (Defined), the methodology starts with load profiling. Profiles are grounded in evidence: analytics data, stakeholder input, historical incidents. The team can describe peak load in specific terms and justify each number in the profile. This is the threshold where performance testing starts producing evidence rather than data.
  • At Level 4 (Managed), production data feeds load profile development systematically. Real transaction mixes, observed peak concurrency, seasonal patterns, and growth trends are captured and fed into profiles as part of an established process, not a one-off effort.
  • At Level 5 (Optimising), load profiles evolve continuously through production feedback loops. The gap between the profile and reality is measured and minimised. The test answers harder questions: not just "does it pass?" but "how much headroom before failure?" and "what does failure look like?"

If you have read this article and recognised that your organisation operates without a load profile, or with one based on assumptions rather than evidence, you are likely at Level 1 or Level 2. The single highest-value improvement you can make is to build an evidence-based load profile using the process described above. Everything else in performance testing improves once this foundation is in place.

For the full assessment framework covering all six dimensions, see the Non-Functional Testing Maturity Model.