FastAPI, Concurrency, and Parallelism — Post 1 (Bullet Draft)
FastAPI has moved from niche to mainstream. Surveys show steady growth (1)See JetBrains/PSF 2024 and Stack Overflow 2025 for survey references. , and teams use it from startups to enterprises. Developers praise its developer experience: Pydantic type-driven validation catches errors before runtime, first-class OpenAPI generates interactive docs automatically, and a lightweight dependency-injection system keeps endpoints clean and testable. Under the hood, it builds on Starlette for production-ready building blocks like WebSockets, background tasks, and static files. These features all contribute to its popularity and, as the name suggests, make FastAPI fast to develop with; but on top of all that, FastAPI is also fast at runtime.
1) Introduction: FastAPI Is Fast #
Understanding the Protocol Difference
In classic Python web stacks, a typical deployment involves the web server passing each incoming request to a worker process. That worker then handles the request—processing it from start to finish—before it becomes available to take on another. This is commonly deployed as a synchronous, one-request-at-a-time approach under WSGI (Web Server Gateway Interface). While WSGI itself is only a specification for interfacing between servers and applications, and doesn’t require single-threaded or single-request workers, most popular frameworks and servers have historically adopted this one-request-per-worker pattern, so concurrency is scaled by adding more worker processes rather than overlapping work in a single one.
In contrast, async-capable deployments let one worker handle many overlapping I/O-bound tasks by leveraging an event loop. Python’s async tools split workloads into coroutines that take turns running, so many requests can be “in flight” at once—only blocking when truly necessary. ASGI (Asynchronous Server Gateway Interface) makes this possible, building on the event loop and modern async frameworks; it also enables things like WebSockets and background jobs beyond WSGI’s reach. (2)Timeline: WSGI standardized in 2003 (updated 2010). Django released 2005; Flask in 2010—both built for WSGI’s sync model. asyncio landed in 3.4 (2014); native async/await in 3.5 (2015, PEP 492). ASGI 3.0 arrived 2019.
One Loop, Many Requests #
FastAPI’s speed comes from a shift in how work is scheduled. Instead of handling requests in parallel, a single worker cycles through many I/O-bound requests, pausing only when outside resources must respond. This model—cooperative scheduling—means tasks voluntarily yield control while waiting
(3)Cooperative: Unlike preemptive multitasking (where the OS forcibly switches), here each task yields control itself. A blocking call blocks the whole loop because it never yields.
. When one request waits on a database or remote API, it “yields”; the loop immediately runs another ready task. The await keyword is your tool for this: it signals, “let others run while I wait.” That’s how a single process achieves eye-popping throughput in benchmarks—by overlapping I/O waits, not running code simultaneously.
This dispatch-and-resume pattern is the event loop.
Compare this to classic Python web stacks. Frameworks like Flask use the traditional one-at-a-time approach: each worker takes a request and holds it until it’s finished, even if it’s just idling on network responses. This is the one-request-per-worker setup. (4)Flask identifies as a WSGI framework. You can pair it with async-capable components, but its core interface stays synchronous; ASGI alternatives (e.g., Quart) target native async.
Some frameworks are evolving. For example, in the right deployment, Django can let a single worker juggle multiple requests—each advancing whenever it’s waiting on I/O—as long as it’s running in async mode on ASGI. This is the many-requests-per-worker model. (5)Django’s async journey: 3.0 (2019) brought ASGI support; 3.1 (2020) added async views and middleware. These features work best with ASGI servers.
Ultimately, the key isn’t a framework’s label. What matters is whether your runtime overlaps many I/O-bound requests in one worker, or if each request monopolizes a worker until finished. This is the difference between concurrent I/O within a process and the old model of serial, one-per-worker handling that scales only by adding more processes. (6)Python’s modern async shape took years: event loop via PEP 3156 (2013/2014), then first-class async/await via PEP 492 (2015). Frameworks created earlier quite reasonably targeted WSGI.
Where It Breaks: Blocking and CPU #
Real services aren’t pure I/O. You mix JSON parsing, cryptography, image operations, and third‑party SDKs that still block. FastAPI protects the event loop by offloading blocking work to a bounded thread pool. That pool is shared and finite; when you saturate it, tasks queue and p95 stretches. Meanwhile, “true” parallelism for CPU‑heavy paths still requires multiple processes—Python threads contend for the GIL on CPU‑bound code—so you scale with worker processes, not just more threads. And even if your app side is perfect, external pools (DB connections, HTTP client limits) gate effective concurrency; overdriving the app knobs just moves the bottleneck downstream unless you know what you’re doing.
Post Overview #
- A) Inside one worker: event loop & task concurrency.
- B) Offloading sync work: how the thread pool works and when it backfires.
- C) Parallelism: process workers for CPU‑bound paths.
- D) External gates: DB/HTTP pools & timeouts.
- E) Reference architecture + tuning checklist.
2) A — Inside a Worker: Event Loop and Task Concurrency #
FastAPI feels fast because the event loop overlaps I/O within one worker. The loop doesn’t run Python code in parallel; it lets many in-flight requests make progress by taking turns when one waits. Internalizing this idea prevents most production surprises.
Concurrency vs. parallelism (two different wins) #
CONCURRENCY
Many requests making progress on one core.
In an async app, each request is a coroutine. When it reaches an await on I/O—an HTTP call, a DB query, a sleep—the coroutine yields control. The event loop immediately runs another ready coroutine. No core sits idle waiting for the network.
PARALLELISM
Work running at the same time on multiple cores.
In Python that typically means multiple processes (workers). Inside a single worker, your Python handlers still execute one at a time.
This post section is about that first win: making one worker excellent at overlapping I/O waits.
What actually runs inside a FastAPI worker #
A FastAPI “worker” (think uvicorn --workers 1) is one OS process with: 
(7)“Orchestrated by … via …”? For years, Python’s async community had two major event loop implementations—asyncio in the standard library, and the alternative, more structured trio. Many libraries picked one or the other, fragmenting the ecosystem. AnyIO was created as an abstraction layer: it lets framework authors (like Starlette/FastAPI) support both backends with a single API, so your app code and dependencies can work with either event loop under the hood. This unification means more robust, flexible async support regardless of which engine runs beneath.
- One event loop (from asyncio, orchestrated by AnyIO via Starlette).
- Many coroutines (tasks) representing requests in flight.
- A small thread‑pool for offloading sync functions (we will treat this thoroughly in Section 3, not here).
Only one coroutine’s Python bytecode runs at any instant in that process. Concurrency comes from yielding at the right times so others can run.
The request journey (happy path) #
- The socket is accepted by the server and handed to your worker.
- The ASGI stack parses HTTP and routes to your async def endpoint.
- Your handler starts running on the event loop.
- It awaits I/O (await client.get(...), await db.fetch_one(...), etc.).
- On each await, the coroutine pauses and the loop picks a different ready task.
- When I/O completes, the coroutine resumes exactly where it left off.
- The response is assembled and sent back out over the socket.
Notice what did not happen: the worker didn’t create a new thread for every request, and it didn’t spin the CPU while waiting.
What yields vs. what blocks #
YIELDS
Awaited network calls (HTTP, DB, cache, MQ), timers (await asyncio.sleep(...) as a stand‑in for I/O), async file APIs, and any library that truly integrates with the loop.
BLOCKS
time.sleep, long pure‑Python loops, heavy serialization/compression/crypto in the handler, and any sync DB/HTTP client you call directly from async def. These keep the loop busy and prevent other requests from running.
Blocking the loop stalls every in‑flight request on that worker—classic head‑of‑line blocking.
A tiny demo you can run #
Save as app.py:
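A minimal sketch of what app.py might look like, matching the description below (/io awaits asyncio.sleep(1) and yields; /block calls time.sleep(1) and holds the loop):

```python
# app.py: /io yields to the event loop, /block holds it.
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/io")
async def io():
    # Non-blocking wait: the coroutine yields, so other requests run meanwhile.
    await asyncio.sleep(1)
    return {"path": "io"}


@app.get("/block")
async def block():
    # Blocking wait: time.sleep never yields, so the whole loop stalls.
    time.sleep(1)
    return {"path": "block"}
```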
Run one worker to isolate the effect:
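For example (assuming the file above is saved as app.py):

```bash
uvicorn app:app --workers 1
```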
Hit /io with 50 concurrent requests; they finish in about one second because the loop overlaps the waits. Do the same for /block; they complete one after another because the loop is held by time.sleep. Times may vary by machine; the shape holds.
Why this works (cooperative scheduling) #
The event loop is a cooperative scheduler. It switches tasks only when a task voluntarily yields at an await. Awaiting a non‑blocking operation tells the loop, “I’m waiting on the kernel or another resource—go run someone else.” When the kernel signals completion, the loop schedules your task to resume. Context switches are microseconds; for millisecond I/O, the overhead is noise.
Cost model and practical limits #
- Task switching is cheap; I/O is expensive. You can keep hundreds or thousands of requests in flight if they spend most of their time waiting on I/O and your code yields promptly.
- Memory still matters. Each in‑flight request holds some state: Python stack frames, request/response objects, and pending futures. Observe memory when you turn up concurrency.
- One blocking call freezes the worker. Only one Python thread executes handlers; any blocking section freezes all other coroutines until it finishes.
The boundary of this technique #
Within a single FastAPI worker, true parallelism does not happen for Python code that holds the GIL. Concurrency is still incredibly valuable—it hides I/O latency and lets one core serve many clients—but once you mix in CPU‑bound work, you will either offload (Section 3) or add more worker processes (Section 4). Keep that boundary in mind as you reason about p95 under real traffic.
A FastAPI worker delivers high concurrency by cooperatively switching at yield points. If your handler path is non-blocking, one process can keep many requests moving. If you accidentally block the loop, you do not just slow down that one request—you stall the entire worker. Of course, not all code yields. Some functions must block, either because they’re CPU-bound or because they rely on synchronous libraries. FastAPI handles these by offloading them to a bounded thread-pool—a clever but limited escape hatch that keeps the event loop responsive. That’s the focus of Section 3.
3) B — Offloading Sync Work: Thread-Pool Mechanics & Limits #
The event loop makes one worker feel large—until code doesn’t yield. Starlette protects the loop by offloading synchronous work to a thread pool so other requests can keep moving. Some offloading is automatic; some you do explicitly. This section explains the thread pool mechanics.
What the framework offloads vs. what you must do explicitly #
Starlette automatically runs several synchronous paths in a thread pool so they don’t block the loop: def endpoints, synchronous dependencies
(8)We haven’t covered FastAPI’s magnificent Dependency Injection system in this post, but “dependencies” here means Annotated[…, Depends(your_func)], not import pandas. Different kind of dependency—equally capable of ruining your day if you get it wrong.
, file responses/uploads, and synchronous background tasks. Internally it uses AnyIO’s to_thread.run_sync for this work. In other words, if you declare a path operation with plain def, FastAPI will schedule that handler on a worker thread and await it; likewise, a dependency declared with def is offloaded to the pool and awaited.
For everything else, meaning code inside an async def handler or anything the framework doesn’t offload for you, you are responsible: any blocking call (a legacy SDK, time.sleep, heavy file I/O, CPU blips) must be wrapped with Starlette/AnyIO helpers so the loop can yield.
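A minimal sketch of the explicit offload, assuming a made-up blocking_sdk_call standing in for any sync library:

```python
import time

from fastapi import FastAPI
from starlette.concurrency import run_in_threadpool

app = FastAPI()


def blocking_sdk_call(customer_id: str) -> dict:
    # Stand-in for a legacy sync SDK or heavy file I/O; it blocks the calling thread.
    time.sleep(0.5)
    return {"customer_id": customer_id}


@app.get("/report/{customer_id}")
async def report(customer_id: str):
    # Offload to the shared thread pool so the event loop keeps scheduling others.
    # The AnyIO spelling is equivalent: await anyio.to_thread.run_sync(blocking_sdk_call, customer_id)
    return await run_in_threadpool(blocking_sdk_call, customer_id)
```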
When code won’t cooperate, offload it.
What actually happens under the hood #
Concretely, sync callables that FastAPI is responsible for running are wrapped in run_in_threadpool(...), while async callables run directly on the event loop. FastAPI’s routing layer is responsible for this. In fact, FastAPI adopts Starlette’s routing-layer function and slightly modifies it to accommodate yield dependencies.
Key hooks in routing.py
(9)Source (~4k LoC): fastapi/fastapi/routing.py
:
- Endpoint wrapper picks thread‑pool for sync callables
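A simplified paraphrase of that dispatch (not the verbatim source; details vary across versions):

```python
# Sketch of FastAPI's endpoint dispatch in fastapi/routing.py (simplified).
from starlette.concurrency import run_in_threadpool


async def run_endpoint_function(*, dependant, values, is_coroutine):
    if is_coroutine:
        # async def endpoint: run directly on the event loop.
        return await dependant.call(**values)
    # plain def endpoint: offload to the shared thread pool and await the result.
    return await run_in_threadpool(dependant.call, **values)
```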
FastAPI’s dependency injection system needs to set up dependencies before the endpoint is called, and then guarantee teardown after the response is sent. To do that cleanly across nested dependencies and different call styles, FastAPI inserts an AsyncExitStack into the ASGI scope and drives cleanup from there.
File sends/uploads and sync streaming acquire/release a thread pool token per chunk; sync background tasks use the same pool.
FastAPI offloads sync work so the loop stays unblocked. But it doesn’t inspect your code to verify if it’s truly non-blocking. This is all based on cooperation: the framework trusts that anything you mark as async def is truly async, and that you’re not sneaking slow, blocking calls into those paths. Rule of thumb: if a handler is async, the calls it makes should be async too.
The bounded pool mechanics and why “40 tokens” exist #
Offloads share a bounded pool (AnyIO capacity limiter; ~40 by default today). This keeps the event loop responsive while preventing unbounded thread creation; when all tokens 
(10)Well, only cache invalidation is harder than naming things. Here “tokens” are the slots in the AnyIO CapacityLimiter that caps concurrent offloads: roughly to_thread.run_sync(func, limiter=CapacityLimiter(40)), and the limiter’s capacity is literally named total_tokens. Starlette calls them “tokens” with no shame.
 are borrowed, new offloads wait until one releases. You can adjust the cap, but AnyIO and Starlette caution against raising it.
Why bound it at all? These are real OS threads with non-trivial overhead (stack space, scheduler work). More threads increase memory footprint and context switching, and—because of CPython’s GIL—they don’t deliver parallel speedup for CPU-bound Python bytecode anyway. Threads here primarily exist to keep the event loop healthy; for pure-Python CPU work, more threads usually just add latency.
Thread-pool sizing interacts with the GIL in counterintuitive ways. There are cases where you might actually benefit from increasing the token limit: specifically, when your def endpoints or dependencies spend most of their time in GIL-releasing blocking I/O or C extensions (see the sketch after this list). For example:
- Database queries via psycopg2, mysqlclient (11)unlike pymysql, which is pure Python and does not release the GIL , and SQLAlchemy with a sync, GIL-releasing driver. (12)An async database client still has significantly higher throughput than its GIL-releasing synchronous counterpart, because it avoids thread-creation overhead and context switching.
- File I/O operations (open(), read(), write()) release the GIL during kernel-level syscalls.
- C extensions that release the GIL during network or disk I/O
- Most notably: the requests library, boto3 (13)the underlying urllib3 library releases the GIL during network operations S3 operations, and Pillow image processing.
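If you do decide to raise the cap, the knob is AnyIO’s default thread limiter; a sketch of adjusting it at startup (the value 100 is purely illustrative):

```python
from contextlib import asynccontextmanager

from anyio import to_thread
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Raise the shared thread-pool cap from the default (~40 tokens) to 100.
    # Only worthwhile if the offloaded work releases the GIL, as in the list above.
    to_thread.current_default_thread_limiter().total_tokens = 100
    yield


app = FastAPI(lifespan=lifespan)
```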
Takeaway: the thread pool is a safety valve that keeps the loop responsive when you must call sync code. Treat it like a bounded, shared resource, because that is exactly what it is meant to be. If you try to turn it into your throughput lever, you’ll need to right-size it based on your workload and infrastructure constraints, and instrument it.
4) C — True Parallelism: Multi‑Process Workers & CPU‑Bound Work #
Concurrency within one worker can’t overcome the GIL. CPU-bound work needs cores—i.e., processes. Threads inside a single process contend on the Global Interpreter Lock (GIL); only one thread executes Python bytecode at a time. True parallel execution comes from multiple processes, each with its own GIL.
What a “worker” really is #
A Uvicorn (or Gunicorn+Uvicorn) worker is an OS process. Each worker runs one event loop (plus a small, bounded thread‑pool for offloaded blocking I/O). When you set --workers N, you are asking the OS to run up to N copies of your app in parallel 
(14)On Linux with Gunicorn’s pre-fork model, the master loads your application code then forks, creating workers via copy-on-write so read-only pages (bytecode, loaded modules) remain shared until written. On Windows, or with Uvicorn’s spawn model, each worker starts from scratch with full memory duplication—fork isn’t available or safe with asyncio on that platform.
. That’s the lever that moves CPU‑bound throughput and stabilizes tail latency under load.
Request distribution happens at the kernel level through socket sharing, not application logic. All workers call accept() on the same socket descriptor; the kernel distributes incoming connections using round-robin or availability-based algorithms. There’s no master-side routing—this is pure OS-level load balancing. When one worker hangs or crashes, only requests actively being handled by that worker fail; the master detects the exit via SIGCHLD and spawns a replacement in 50-150ms while other workers continue serving traffic normally.
Isolation aids fault tolerance but costs memory. Each worker maintains its own database connection pool, its own Redis client, its own in-memory caches, and—if you’re doing ML inference—its own loaded models. A 2GB model in 4 workers consumes 8GB of RAM (or 3-4GB initially with copy-on-write, growing over time as Python’s reference counting breaks the sharing).
In production, a mature process manager (commonly Gunicorn running Uvicorn workers) adds graceful reloads, worker timeouts, and lifecycle control. When using Gunicorn with FastAPI (ASGI), use the uvicorn.workers.UvicornWorker class so workers speak ASGI rather than WSGI. This improves resilience and makes multi‑process parallelism operationally safe.
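A typical invocation might look like this (module path, worker count, and timeouts are illustrative):

```bash
gunicorn myapp.main:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 60 --graceful-timeout 30
```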
Sizing workers and the 2×CPU+1 cargo cult #
The traditional formula workers = (2 × cores) + 1 comes from synchronous WSGI deployments where workers block on I/O. One thread waits on the database while another serves requests—you need the 2× multiplier to keep cores busy during I/O waits. But async FastAPI already handles concurrency through the event loop within each worker. For async services, start with workers ≈ vCPUs and tune from there.
- CPU‑heavy service: start with workers = number of vCPUs available to the container/VM.
- Mixed I/O + CPU: still start at vCPUs; only increase toward vCPUs × 1.5–2 if profiling shows idle CPU during I/O waits. Oversubscribing the CPU inflates context switching and makes p95/p99 worse.
- In containers, size by the CPU quota granted to the pod, not the host’s core count.
Inline CPU, Process Pool, or Job Queue: Deciding where CPU work executes #
Your CPU work needs to execute somewhere, and there are three main runtime options, each with distinct tradeoffs for latency, isolation, and resource control. Inline execution runs synchronously on the event loop thread. The loop is busy for the entire duration: other requests on this worker have to wait. This is acceptable when the work completes in a few tens of milliseconds or less and happens infrequently. A cryptographic hash of a small payload fits this profile; a second-long image-processing job does not.
(15)You may get lucky with operations that release the GIL: especially if you’re using a library like Pillow or OpenCV and the code path was offloaded to the thread pool, the event loop may be spared from blocking while the operation completes. Don’t count on it, though.
The boundary here is sharp; exceed it and you introduce head-of-line blocking
(16)Head-of-line blocking occurs when the first item in a queue stalls, preventing all items behind it from progressing—even if they’re ready. In FastAPI, one blocking call makes all other requests wait their turn.
that cascades across every request queued behind it on that worker.
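A sketch of the acceptable end of that spectrum: hashing a small payload inline (the route name is illustrative):

```python
import hashlib

from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/checksum")
async def checksum(request: Request):
    payload = await request.body()
    # Fine inline: hashing a small payload takes well under a millisecond.
    # A second-long Pillow resize in the same spot would stall every other
    # request on this worker.
    return {"sha256": hashlib.sha256(payload).hexdigest()}
```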
Your next option, Process Pool offload, moves the work to a separate child process and immediately returns the loop to service other requests. The HTTP handler awaits the result, but the loop itself remains free to schedule other tasks. This pattern can handle work ranging from ~50 milliseconds to ~30 seconds: think ML inference, medium-sized data transformations, or more complex cryptographic operations. The boundary condition is serialization overhead—arguments and results travel through a pipe, so compact payloads (IDs, small arrays, file paths) work well; multi-megabyte blobs may not.
(17)You can either use AnyIO’s to_process.run_sync or the more traditional concurrent.futures.ProcessPoolExecutor. The first is simpler and more efficient; the second is more flexible and can be used with more complex tasks. In both cases, the Inter-Process Communication (IPC) is done via pickle serialization.
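A sketch of the executor route using concurrent.futures.ProcessPoolExecutor with loop.run_in_executor (the run_inference function and pool size are illustrative):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI

app = FastAPI()
process_pool = ProcessPoolExecutor(max_workers=2)  # illustrative size


def run_inference(feature_ids: list[int]) -> list[float]:
    # CPU-bound work runs in a child process. Arguments and results cross a
    # pickle pipe, so keep them compact (IDs, small arrays, file paths).
    return [float(i) ** 0.5 for i in feature_ids]


@app.post("/score")
async def score(feature_ids: list[int]):
    loop = asyncio.get_running_loop()
    # The handler awaits the child process; the event loop stays free meanwhile.
    scores = await loop.run_in_executor(process_pool, run_inference, feature_ids)
    return {"scores": scores}
```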
Your final option, Job Queue delegation, returns 202 Accepted immediately and hands the work to a separate worker fleet managed by a “Job Queue” system like Celery, Arq, Dramatiq, or RQ. The HTTP request completes in milliseconds; the event loop does no CPU work for the job itself. Use this for long, bursty, or user-indifferent operations: report generation, batch image pipelines, model retraining, or anything where the user expects to check back later. This pattern introduces operational complexity—you now manage a queue, persistence layer, and separate worker processes—but it decouples request latency from job duration entirely.
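A minimal sketch of the 202 pattern with Celery (the broker URL and task body are placeholders; Arq, Dramatiq, and RQ follow the same shape):

```python
from celery import Celery
from fastapi import FastAPI, status

app = FastAPI()
celery_app = Celery("worker", broker="redis://localhost:6379/0")  # illustrative broker


@celery_app.task
def generate_report(report_id: str) -> None:
    ...  # long-running work, executed by the separate Celery worker fleet


@app.post("/reports/{report_id}", status_code=status.HTTP_202_ACCEPTED)
async def enqueue_report(report_id: str):
    # Hand the job off and return immediately; the event loop does no CPU work here.
    generate_report.delay(report_id)
    return {"report_id": report_id, "status": "queued"}
```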
Takeaway: for CPU‑bound paths, parallelism is a process‑level concern. Use multiple workers for throughput and predictability, and be mindful of where you let CPU work execute and how it affects the event loop.
5) D — External Concurrency Gates: DB & HTTP Pools #
Your FastAPI app can be perfectly tuned – event loop humming, thread pool sized right, workers scaled, job queues watching – and still hit a wall. The real bottleneck sometimes lives outside: in the databases your app talks to, or the HTTP calls it makes to external services. You can’t control those systems from inside your app, but how you interact with them from your app is something you need to get right.
This section unpacks the mechanics for Postgres + SQLAlchemy (with an async or sync driver), then generalizes to upstream HTTP services.
DB connections are costly—pool and release deliberately #
PostgreSQL is process-per-connection: each client session gets its own backend OS process. That makes a connection more than “just a socket,” and it makes raw connection count map directly to server memory and scheduler work.
Before a session can run queries, the wire protocol walks a startup → authentication → ready sequence (and, if enabled, TLS negotiation). Only after the server sends ReadyForQuery does normal traffic begin.
Because establishing a fresh connection is real work—and the server caps how many can exist at once (max_connections)—you don’t create one per request. You pool: keep sessions warm and hand them out briefly, then return them to the pool.
On the app side, SQLAlchemy’s Engine uses a bounded pool with optional overflow to control concurrent checkouts; when the pool is exhausted, new work waits rather than spawning unbounded connections. Remember each worker process owns its own Engine and pool—scaling workers multiplies potential DB load. When your app tier grows (more workers, services, or pods) and raw connection counts outpace what the Postgres server can comfortably serve, you may need to put a gateway pooler (e.g., PgBouncer) in front of it.
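A sketch of a bounded async Engine per worker (the DSN and numbers are illustrative starting points, not recommendations):

```python
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

# Each worker process builds its own Engine and pool at startup.
engine = create_async_engine(
    "postgresql+asyncpg://app:secret@db/app",  # illustrative DSN
    pool_size=10,        # steady-state connections held per worker
    max_overflow=5,      # temporary extras beyond pool_size for bursts
    pool_timeout=5,      # seconds to wait for a checkout before erroring
    pool_pre_ping=True,  # detect stale connections before handing them out
)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)
```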
The real problem: holding a connection longer than needed #
In practice the ceiling isn’t “how many connections you own,” it’s how long each request holds one. By default, a SQLAlchemy Session autobegins a transaction on first use and keeps a pooled connection checked out until you commit/rollback or close. That means any unrelated work you do while the transaction is open—HTTP calls, file I/O, JSON shaping—stretches the checkout window. Keep DB work narrow: open as late as possible, do the minimum, then end the transaction so the pool can hand that connection to the next request.
LLM responses or other large payloads introduce a second trap: streaming. Server-side cursors (e.g., stream_results / yield_per) fetch in chunks and pin the connection for the duration of the stream. If you truly need to stream, budget concurrency for it; otherwise, materialize the result (or stage it to a buffer), release the connection, and only then send the body. The gain isn’t theoretical—SQLAlchemy’s streaming modes explicitly select server-side cursors, which tie the connection to the in-progress result set.
FastAPI’s yield dependencies add an ordering subtlety: the code after yield (your teardown/close) runs after the response lifecycle. For normal responses that’s fine; for streaming endpoints it means teardown happens after the stream completes. Don’t couple long-lived responses to a still-open ORM session; keep lookups short-lived and independent so the DB connection is released immediately.
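A sketch of a yield dependency that keeps the checkout narrow, assuming the SessionLocal factory from the previous sketch (the table and route names are illustrative):

```python
from fastapi import Depends, FastAPI
from sqlalchemy import text

app = FastAPI()


async def get_session():
    # Yield dependency: teardown (closing the session, returning the connection)
    # runs after the response lifecycle, so keep the response independent of it.
    async with SessionLocal() as session:
        yield session


@app.get("/orders/{order_id}")
async def get_order(order_id: int, session=Depends(get_session)):
    # Narrow window: run the query and materialize the row right away.
    row = (
        await session.execute(
            text("SELECT total FROM orders WHERE id = :id"), {"id": order_id}
        )
    ).first()
    # End the transaction now so the pooled connection is released immediately,
    # rather than at dependency teardown (which runs after the response is sent).
    await session.commit()
    return {"order_id": order_id, "total": row.total if row else None}
```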
Finally, add backstops so capacity isn’t tied up indefinitely. On the app side, set a pool-acquire wait (pool_timeout) so requests don’t block forever waiting for a connection. On the database side, use session-level timeouts—statement_timeout for long queries, lock_timeout for lock waits, and idle_in_transaction_session_timeout for forgotten open transactions—and align them with your endpoint deadlines. Together these caps keep “stuck” work from hoarding the pool.
All in all, treat your DB connections as the scarce resource they are. Keep the DB work as narrow as possible: open right before the query, close right after. No unrelated network/file I/O inside a transaction. The moment you call an external API inside a DB transaction, you’ve basically handed over control of your database to the network.
Your HTTP client is a gate, too #
A single shared HTTP client maintains a pool and reuses connections to the same host; building a new client (or using ad‑hoc top‑level calls) discards that reuse and pays the handshake cost again. In FastAPI, create one client at startup and close it on shutdown so the pool is long‑lived and sockets are cleaned up.
Treat that pool like any other bounded resource. Set per‑host limits explicitly so concurrency to an upstream is deliberate rather than accidental. When the pool is full, calls will wait to acquire a connection even if your event loop is free—exactly the backpressure you want.
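A sketch with HTTPX (the limit and timeout values are illustrative):

```python
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # One long-lived client per worker: the connection pool and keep-alive
    # sockets are reused across requests instead of re-handshaking each time.
    app.state.http = httpx.AsyncClient(
        limits=httpx.Limits(max_connections=50, max_keepalive_connections=10),
        timeout=httpx.Timeout(5.0),  # align with the endpoint's overall budget
    )
    yield
    await app.state.http.aclose()


app = FastAPI(lifespan=lifespan)


@app.get("/price/{sku}")
async def price(sku: str):
    # When the pool is full, this call waits for a free connection:
    # deliberate backpressure rather than unbounded fan-out.
    resp = await app.state.http.get(f"https://vendor.example.com/price/{sku}")
    return resp.json()
```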
If the upstream speaks HTTP/2, you get multiplexing: many in‑flight requests can share a single connection. That doesn’t make the gate disappear—the server announces how many concurrent streams it will allow, and your client’s connection limits still apply. Size your concurrency with both in mind.
Be intentional with streaming responses. While you’re inside a streaming block, the connection remains in use until you finish reading or exit the context; long‑lived streams can pin pool capacity and cause other calls to queue. If you truly need to stream, budget for it; otherwise, read the body, release the connection, then send your response.
Finally, deadlines free capacity. Use timeouts that match your endpoint’s overall budget so a slow or stalled upstream doesn’t monopolize sockets and threads; HTTPX enforces timeouts by default, but making them explicit keeps the policy obvious in code. Pair short timeouts with conservative, idempotent retries.