DRGs: Dual-Hub, Dual-Home Networking on OCI

I came to OCI from AWS. Before that, GCP. In neither of those worlds did I ever think about BGP, route announcements, or transit routing. I didn’t have to. You spin up a VPC, maybe you peer it with another VPC, and you move on with your life.

Then I got to OCI and someone said “you need a DRG” and I said “what’s a DRG?”

That was the start of a long education.

What’s a DRG?

A Dynamic Routing Gateway is a virtual router. That’s it. If you were running your own data center, this would be the physical router sitting in your rack, handling route tables, announcing paths, and forwarding traffic between networks. OCI just virtualizes it.

But DRGs aren’t just a router. They’re the router. The central nervous system of OCI networking. Site-to-site VPNs, FastConnect peering, on-prem connections, cross-region traffic, inter-VCN communication: all of it flows through a DRG. If your VCNs are rooms in a building, the DRG is every hallway, stairwell, and elevator connecting them.

They’re also the least understood component in the entire OCI ecosystem. The documentation reads like it was written by a greybeard who’s been racking switches for 30 years and assumes you already know what an AS path is. Coming from AWS where you just click “create peering connection” and move on, DRGs feel like being handed a soldering iron when you asked for a light switch.

That rawness is actually the point. OCI is the closest thing to running your own data center that I’ve found in any public cloud. Less magic, less abstraction, more engineering and ops. DRGs, RPCs, route distributions: these aren’t dumbed-down wrappers. They’re powerful constructs that just happen to be blocks on a well-architected diagram instead of physical hardware in a rack.

The Setup

Here’s an example of a network that’s entirely reasonable to stand up on OCI:

Chicago: Management VCN. Jumpbox, logging, metrics, CI/CD, ops tooling. The control plane for everything.
Amsterdam: Two OKE clusters (staging and production), each in its own VCN.
Querétaro: Two more OKE clusters (staging and production), each in its own VCN.
Phoenix: A standalone VCN for auxiliary workloads.

Six VCNs across four regions. OKE likes to consume a /16 CIDR block, so you’re standing up a full VCN per cluster. That’s fine. Address space is free and isolation is good.
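One thing worth doing before any of the wiring: make sure those /16s don't overlap, because routes between VCNs only make sense when every CIDR is unique. Here's a minimal sketch of how I'd plan that. The VCN names and the 10.0.0.0/8 parent range are my own placeholders, not anything OCI requires.

```python
import ipaddress

# Hypothetical VCN names matching the example layout. The 10.0.0.0/8 parent
# range is an assumption for illustration only.
VCN_NAMES = ["chi-mgmt", "ams-stg", "ams-prod", "qro-stg", "qro-prod", "phx-aux"]


def plan_vcn_cidrs(names, parent="10.0.0.0/8", prefix=16):
    """Give each VCN the next free /16 carved out of the parent range."""
    subnets = ipaddress.ip_network(parent).subnets(new_prefix=prefix)
    return {name: next(subnets) for name in names}


def overlapping_pairs(plan):
    """Return every pair of VCN names whose CIDR blocks overlap."""
    items = sorted(plan.items())
    return [
        (name_a, name_b)
        for i, (name_a, net_a) in enumerate(items)
        for name_b, net_b in items[i + 1:]
        if net_a.overlaps(net_b)
    ]


plan = plan_vcn_cidrs(VCN_NAMES)
for name, cidr in plan.items():
    print(f"{name:10s} {cidr}")
print("overlaps:", overlapping_pairs(plan))
```

Running the overlap check again whenever a new VCN joins the plan is cheap insurance against two networks claiming the same prefix.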

Now: how do you connect them?

The Wrong Way (I Did This First)

My first instinct was to wire each VCN directly to every other VCN. Full mesh. Point-to-point peering for everything.

That’s n(n-1)/2 connections. For six VCNs: 15 peering relationships. Fifteen sets of route tables to maintain. Fifteen things to debug when traffic stops flowing at 2 AM.

It works. It also doesn’t scale and it makes you want to quit.

The whole point of being on a cloud where private networking isn’t a toll booth at every hop is the ability to spin up new VCNs and expand into new regions as you need them. But on a full mesh, adding a seventh VCN means six more peering relationships, each with its own RPCs, NSG rules, and route tables. That’s not engineering. That’s bookkeeping with a blast radius.
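The arithmetic is small enough to check by hand, but here it is as a quick sketch anyway. The function names are mine.

```python
def full_mesh_links(n):
    """Peering relationships needed to connect n networks point-to-point."""
    return n * (n - 1) // 2


def links_added_by_next_network(n):
    """New peerings required when one more network joins a full mesh of n."""
    return full_mesh_links(n + 1) - full_mesh_links(n)


for count in (6, 7, 10):
    print(count, "networks ->", full_mesh_links(count), "peerings")
print("cost of the 7th network:", links_added_by_next_network(6), "new peerings")
```

Every new network has to be wired to all the existing ones, so the cost of growing keeps going up as the mesh gets bigger.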

Hub-and-Spoke

Hub-and-spoke is borrowed from physical networking (and the airline industry). Instead of connecting every node to every other node, you designate one as the hub and connect everything else (the spokes) to it. All traffic between spokes transits through the hub. Quadratic connection growth becomes linear connection growth. One place to manage routing. One place to monitor.

        Full Mesh

     CHI ---- AMS-Stg
      |\  \     |\
      | \  \    | \
      |  \  \   |  \
      |   \  \  |   \
      |    \  \ |    \
     PHX -- QRO-Stg  AMS-Prod
      |\      |\
      | \     | \
     ... (15 connections)

        Hub-and-Spoke

      AMS-Stg      AMS-Prod
              \   /
    QRO-Stg -- CHI (Hub) -- PHX
                |
            QRO-Prod

          (5 connections)

In OCI, the DRG is the hub. DRG v2 natively supports transit routing without a separate hub VCN running a virtual appliance. Attach your VCNs, configure the route tables, traffic flows spoke-to-spoke. Clean.

But this example isn’t in a single region. It’s in four.

Dual-Hub Hub-and-Spoke (The Real Pattern)

Single hub means a single regional dependency, one route-policy blast radius, and a potential latency bottleneck. The fix: two hubs, and you dual-home every spoke by giving each one an uplink to both.

Hub-and-spoke is the shape of the network. Dual-home is the redundancy strategy. Combine them and you get a topology that’s simple to manage and genuinely resilient.

Phoenix and Amsterdam are the hubs. Chicago and Querétaro peer with both. The hubs peer with each other.

Why these two? You want hubs far enough apart, on different continents here, that no single failure takes both offline. And let’s be honest: most outages aren’t cable cuts or natural disasters. They’re config mistakes and human error. You’re protecting against your own stupidity at 2 AM as much as anything else. A bad route table push to Amsterdam doesn’t matter when Phoenix is still routing traffic like nothing happened.

Dual-home also buys you path efficiency.

Imagine you only had one hub: Amsterdam. Chicago needs to talk to Querétaro. Both in North America. Doesn’t matter. Amsterdam is the hub, so the full round trip is:

QRO -> AMS -> CHI -> AMS -> QRO

Four Atlantic crossings for two regions that are essentially neighbors. Looks fine on a diagram. Murders your latency.

With two hubs, DRG route selection can prefer the lower-network-distance RPC path when the propagated routes line up:

QRO -> PHX -> CHI -> PHX -> QRO

Same continent the entire time. When Chicago needs to reach Europe, it can route through Amsterdam instead. Traffic can gravitate toward the better hub without you hard-coding every pairwise path.
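To make the difference concrete, here's a toy model of the hub choice. The latency numbers are invented for illustration, not measured on Oracle's backbone, and the DRG does not literally run this calculation. It just shows why the nearer hub wins when "network distance" roughly tracks geography.

```python
# Illustrative one-way latencies in milliseconds. These are made-up numbers
# for the example, not measurements.
LATENCY_MS = {
    frozenset({"QRO", "PHX"}): 30,
    frozenset({"CHI", "PHX"}): 40,
    frozenset({"QRO", "AMS"}): 130,
    frozenset({"CHI", "AMS"}): 100,
    frozenset({"AMS", "PHX"}): 140,
}
HUBS = ("AMS", "PHX")


def one_way_via(src, hub, dst):
    """Latency of a spoke-to-spoke path that transits the given hub."""
    return LATENCY_MS[frozenset({src, hub})] + LATENCY_MS[frozenset({hub, dst})]


def best_hub(src, dst, hubs=HUBS):
    """Pick the hub that gives the lowest one-way latency between two spokes."""
    return min(hubs, key=lambda hub: one_way_via(src, hub, dst))


for hub in HUBS:
    print(f"QRO -> {hub} -> CHI: {one_way_via('QRO', hub, 'CHI')} ms one way")
print("preferred hub:", best_hub("QRO", "CHI"))
```

Swap in real measurements from your own regions and the same comparison tells you which hub each spoke pair should gravitate toward.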

         AMS (Hub) ------- PHX (Hub)
          /    \            /    \
        CHI    QRO        CHI    QRO

Each spoke region has two peering relationships, one to each hub. The hubs have one peering relationship to each other. In the four-region version, the inter-region backbone is five peering relationships: CHI to AMS, CHI to PHX, QRO to AMS, QRO to PHX, and AMS to PHX. That means ten RPC resources, because each relationship has one RPC on each side. If you choose one DRG per VCN instead of one DRG per region, the count goes up, but it still grows linearly instead of turning into a full mesh.
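The same count as a sketch, assuming one DRG per region and every spoke uplinked to every hub. The function names are mine.

```python
def dual_hub_peerings(spokes, hubs=2):
    """Peering relationships: every spoke to every hub, plus hub-to-hub links."""
    spoke_uplinks = spokes * hubs
    hub_links = hubs * (hubs - 1) // 2
    return spoke_uplinks + hub_links


def rpc_resources(peerings):
    """Each peering relationship needs one RPC on each side."""
    return peerings * 2


for spokes in (2, 3, 4):
    peerings = dual_hub_peerings(spokes)
    print(f"{spokes} spokes -> {peerings} peerings, {rpc_resources(peerings)} RPCs")
```

Each extra spoke region adds exactly two peerings, no matter how many spokes already exist. That's the linear growth.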

This is not ECMP (Equal-Cost Multi-Path) in the FastConnect or IPsec tunnel sense. RPC paths are redundant propagated routes. The DRG selects the preferred route, and when that path disappears, route propagation converges on the remaining path. No intervention. No runbook.
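Here's that behavior as a minimal sketch. It's a mental model, not the DRG's actual selection algorithm: the `distance` field is a stand-in I made up for whatever preference rules the DRG really applies, and the attachment names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str   # hypothetical name of the RPC attachment the route came from
    distance: int   # lower wins; a stand-in for the DRG's real preference rules


class RouteTable:
    def __init__(self):
        self._routes = set()

    def learn(self, route):
        self._routes.add(route)

    def withdraw_next_hop(self, next_hop):
        """Drop every route learned through a path that has gone away."""
        self._routes = {r for r in self._routes if r.next_hop != next_hop}

    def best(self, prefix):
        """One preferred route per prefix (no load sharing), or None."""
        candidates = [r for r in self._routes if r.prefix == prefix]
        return min(candidates, key=lambda r: r.distance, default=None)


# The Querétaro DRG's view of the Chicago management VCN (hypothetical CIDR).
table = RouteTable()
table.learn(Route("10.0.0.0/16", "rpc-to-phx", distance=1))
table.learn(Route("10.0.0.0/16", "rpc-to-ams", distance=2))
print("preferred:", table.best("10.0.0.0/16").next_hop)

table.withdraw_next_hop("rpc-to-phx")  # the Phoenix path disappears
print("after failover:", table.best("10.0.0.0/16").next_hop)
```

Only one route is active at a time. The second one sits there until the first is withdrawn, which is the difference from ECMP.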

Adding São Paulo? Two peering relationships: one to Phoenix, one to Amsterdam. Done.

flowchart TB
    backbone["Oracle Backbone<br/>(private inter-region fabric)"]
    ams["DRG-AMS<br/>hub"]
    phx["DRG-PHX<br/>hub"]
    chi["DRG-CHI<br/>spoke"]
    qro["DRG-QRO<br/>spoke"]
    chiVcn["Management VCN"]
    qroVcns["OKE VCNs<br/>staging + production"]
    amsVcns["OKE VCNs<br/>staging + production"]
    phxVcn["Aux VCN"]

    backbone --- ams
    backbone --- phx
    ams --- phx
    chi --- ams
    chi --- phx
    qro --- ams
    qro --- phx
    chiVcn --- chi
    qroVcns --- qro
    amsVcns --- ams
    phxVcn --- phx

Why You Won’t See This on AWS

On AWS you’d use Transit Gateway. It charges per attachment, per GB processed, and adds more per-GB for cross-region peering. For the equivalent six VPCs across four regions, that’s a real line item every month just to let your networks talk to each other.

So teams compromise. They keep everything in one region. They reduce cross-region communication. They accept higher latency. They build around the billing model instead of building what’s right.

On OCI, the DRGs are free and the RPCs are free. Private cross-region traffic still runs over Oracle’s backbone, but billing class and transport path are not the same thing; inter-region traffic can still show up as outbound data transfer. The important difference is that there is no Transit Gateway-style attachment fee or per-GB processing tax just for letting your private networks talk.

That matters. The architecture can follow the users instead of contorting itself around a managed transit-router bill.

The OCI Object Model

This is not the full Terraform guide. The implementation walkthrough is coming separately. But the object model matters, because most DRG confusion comes from assuming OCI is doing more magic than it is.

There are four moving parts:

VCN attachment: connects a VCN to a DRG in the same region.
Remote peering connection (RPC): connects one DRG to another DRG, typically in a different region.
DRG route table: assigned to an attachment; decides where packets arriving from that attachment can go.
Route distribution: controls which routes get imported into or exported from a DRG route table.

The important bit: a remote VCN does not become reachable just because an RPC is up. The DRGs have to learn and re-advertise the right prefixes through the right route tables. The packet path and the route-learning path are separate enough that you can have healthy peering and still drop traffic into a black hole.

In the dual-hub pattern, each spoke DRG learns its local VCN CIDRs from VCN attachments, exports them toward both hub RPCs, and imports remote prefixes back from the hubs. Each hub imports from spokes and exports those learned routes to the other spokes and the other hub. That is the mental model. The exact Terraform resources are just the serialization format.
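That mental model fits in a few dozen lines. This is a simplified simulation under my own assumptions: one DRG per region, hypothetical CIDRs, and two passes of route exchange instead of real route distributions. It is not how OCI implements anything. It only shows who ends up knowing which prefix.

```python
PEERINGS = [("CHI", "AMS"), ("CHI", "PHX"), ("QRO", "AMS"), ("QRO", "PHX"), ("AMS", "PHX")]

# Hypothetical VCN CIDRs attached to each regional DRG.
LOCAL_CIDRS = {
    "CHI": ["10.0.0.0/16"],
    "AMS": ["10.1.0.0/16", "10.2.0.0/16"],
    "QRO": ["10.3.0.0/16", "10.4.0.0/16"],
    "PHX": ["10.5.0.0/16"],
}


def propagate(local_cidrs, peerings, hubs):
    """Return learned[drg][cidr] = set of neighbor DRGs advertising that CIDR."""
    links = {drg: set() for drg in local_cidrs}
    for a, b in peerings:
        links[a].add(b)
        links[b].add(a)

    learned = {drg: {} for drg in local_cidrs}

    # Pass 1: every DRG exports its own VCN CIDRs over each RPC it has.
    for origin, cidrs in local_cidrs.items():
        for peer in links[origin]:
            for cidr in cidrs:
                learned[peer].setdefault(cidr, set()).add(origin)

    # Snapshot what each DRG learned directly, before any re-export happens.
    direct = {drg: {c: set(s) for c, s in table.items()} for drg, table in learned.items()}

    # Pass 2: hubs re-export directly learned routes to their other peers.
    # Spokes never re-export learned routes.
    for hub in sorted(hubs):
        for cidr, origins in direct[hub].items():
            for peer in links[hub] - origins:
                learned[peer].setdefault(cidr, set()).add(hub)
    return learned


with_hubs = propagate(LOCAL_CIDRS, PEERINGS, hubs={"AMS", "PHX"})
print("QRO reaches CHI via:", sorted(with_hubs["QRO"]["10.0.0.0/16"]))

without_reexport = propagate(LOCAL_CIDRS, PEERINGS, hubs=set())
print("QRO knows CHI without hub re-export:", "10.0.0.0/16" in without_reexport["QRO"])
```

The second call is the black hole from earlier: every RPC is up, but with no hub re-exporting, Querétaro never hears about Chicago.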

Route Tables and Import/Export Distributions

This is where “BGP handles it” meets reality.

Every DRG has route tables, and every attachment (VCN, RPC, VPN tunnel, FastConnect circuit) gets assigned to one. The route table controls what that attachment knows how to reach.

The mechanism is import and export route distributions. Think of it like a mail room: every attachment sends and receives route announcements, and the distributions decide who gets whose mail.

Export distributions control what routes an attachment advertises. Import distributions control what routes it’s willing to learn. If your Chicago VCN exports its CIDR but the route table on the RPC to Phoenix doesn’t import from VCN attachments, Phoenix never learns how to reach Chicago. Traffic gets dropped. No error. No log. Silence.

The topology can be perfect: hubs placed correctly, RPCs connected, BGP sessions healthy. Traffic still won’t flow if an import distribution is missing an attachment type. The DRG won’t guess what you meant.
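Here's the Chicago-to-Phoenix failure as a sketch. The attachment type labels are simplified versions of the categories in this post, not the literal strings the OCI API uses, and the names and CIDRs are hypothetical.

```python
from dataclasses import dataclass

# Attachment categories as this post names them. Simplified labels, not the
# literal OCI API values.
VCN, RPC, VPN, FASTCONNECT = "VCN", "RPC", "VPN", "FASTCONNECT"


@dataclass(frozen=True)
class Announcement:
    prefix: str
    source_type: str   # category of the attachment announcing the route
    source_name: str   # hypothetical attachment name


def import_routes(announcements, accepted_types):
    """Split announcements into routes the table learns and routes it ignores."""
    learned, ignored = [], []
    for ann in announcements:
        (learned if ann.source_type in accepted_types else ignored).append(ann)
    return learned, ignored


# What the Chicago DRG could learn for the route table on its RPC to Phoenix.
announcements = [
    Announcement("10.0.0.0/16", VCN, "chi-mgmt-vcn"),
    Announcement("10.1.0.0/16", RPC, "rpc-to-ams"),
]

learned, ignored = import_routes(announcements, accepted_types={RPC})
print("learned:", [a.prefix for a in learned])
print("silently ignored:", [a.prefix for a in ignored])
```

Nothing raises, nothing logs. The Chicago CIDR just never makes it into the table, so Phoenix never hears about it.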

The Gotchas

Your import distributions need every attachment type you’re using. Every single one.

This sounds obvious. It is not obvious at 2 AM when your AWS-to-OCI traffic is silently disappearing.

I had a DRG with VCN and RPC attachments, all routing fine. Added a site-to-site VPN to bridge workloads back to AWS. Tunnel came up. BGP session established. Routes announced. Traffic went into a black hole.

The import distribution was pulling from VCN and RPC attachments because those were the only types that existed when I configured it. Never added VPN. The DRG knew the VPN was there, BGP was exchanging routes at the protocol level, but the route table refused to learn any of them. Ignoring its own VPN connection.

One line fix. One evening of debugging.

Every time you add a new attachment type to a DRG, check your import distributions. VCN, RPC, VPN, FastConnect are separate categories. Miss one and traffic dies quietly.
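That check is easy to automate. Below is a minimal sketch that applies the blunt rule above: every import distribution should accept every attachment type in use on the DRG. The input shapes, names, and type labels are my own assumptions. In practice you'd fill them in from your Terraform state or an inventory export.

```python
def missing_import_types(attachments, import_distributions):
    """Report which attachment types each route table's import distribution lacks.

    attachments: {attachment_name: attachment_type}
    import_distributions: {route_table_name: set of accepted attachment types}
    """
    in_use = set(attachments.values())
    report = {}
    for table, accepted in import_distributions.items():
        gap = in_use - set(accepted)
        if gap:
            report[table] = sorted(gap)
    return report


# Hypothetical DRG right after a VPN to AWS has been added.
attachments = {
    "chi-mgmt-vcn": "VCN",
    "rpc-to-phx": "RPC",
    "rpc-to-ams": "RPC",
    "vpn-to-aws": "VPN",
}
import_distributions = {
    "rt-for-vcn-attachments": {"VCN", "RPC"},         # written before the VPN existed
    "rt-for-rpc-attachments": {"VCN", "RPC", "VPN"},
}

for table, gap in missing_import_types(attachments, import_distributions).items():
    print(f"{table}: not importing from {', '.join(gap)}")
```

Run something like this after every attachment change and the missing type shows up in seconds instead of at 2 AM.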

mtr is your best friend for debugging DRG routing.

When traffic isn’t flowing, or you want to verify that a path weight change actually took effect, reach for mtr. It combines traceroute and ping into a continuous hop-by-hop view in real time. Adjust path weights in a DRG and watch the traffic shift live. Takes the guesswork out of “did that config change actually propagate?”
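The live view is great at a terminal, but for before-and-after comparisons I'd rather capture a report. Here's a small sketch that wraps mtr's report mode. It assumes mtr is installed on the jumpbox, and the target IP in the usage comment is a placeholder.

```python
import shutil
import subprocess


def build_mtr_command(target, cycles=10, resolve_dns=False):
    """Build an mtr invocation that runs a fixed number of cycles and exits."""
    cmd = ["mtr", "--report", "--report-cycles", str(cycles)]
    if not resolve_dns:
        cmd.append("--no-dns")  # show raw hop IPs instead of waiting on lookups
    cmd.append(target)
    return cmd


def run_mtr(target, cycles=10):
    """Run mtr in report mode and return its text output."""
    if shutil.which("mtr") is None:
        raise RuntimeError("mtr is not installed or not on PATH")
    result = subprocess.run(
        build_mtr_command(target, cycles),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage (placeholder target): capture one report before a routing change and
# one after, then compare the hop lists.
#   before = run_mtr("10.3.0.10")
#   ... change the DRG config ...
#   after = run_mtr("10.3.0.10")
```

If the hops in the two reports are identical, the change either hasn't propagated or didn't affect that path.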

What’s Next

OCI gives you the networking primitives, not the guardrails. DRGs are frustrating until you stop treating them like cloud peering widgets and start treating them like routers. Then the model clicks: attach networks, exchange routes, control what gets learned, and let the backbone do the work.

That is the real difference from AWS and GCP. Less polish, fewer defaults, more sharp edges. But when the economics let you build the right topology instead of the cheapest billable topology, the extra machinery is worth learning.

Second post in a series about building real infrastructure on OCI. First post: The Cloud Egress Tax.