Technical Architecture

System Overview

BR-ACC is built as a modern, containerized application stack that ingests Brazilian public data, normalizes it into a graph database, and exposes it through a REST API and web interface.

Architecture Layers

1. Data Sources Layer

BR-ACC integrates 45+ Brazilian public data sources:

The sources fall into four categories: government portals, sector-specific, international, and derived. The government portals include:

Receita Federal - CNPJ company registry (60M+ companies)
Portal da Transparência - Federal spending, sanctions, contracts
TSE - Electoral data, donations, candidate assets, party membership
ComprasNet/PNCP - Federal procurement and bids
TCU - Federal Court of Auditors sanctions
TransfereGov - Federal transfers to states/municipalities

All sources are accessed through their official APIs or open data portals. No scraping of private data occurs.

2. ETL Layer

The ETL (Extract, Transform, Load) layer is built with Python and follows a modular pipeline architecture.

Pipeline Structure

Each data source has a dedicated pipeline in etl/src/bracc_etl/pipelines/:

# Example: etl/src/bracc_etl/pipelines/cnpj.py
import pandas as pd
from neo4j import Driver

from bracc_etl.base import Pipeline
from bracc_etl.loader import Neo4jBatchLoader
from bracc_etl.transforms import (
    deduplicate_rows,
    format_cnpj,
    normalize_name,
    parse_date,
)


class CNPJPipeline(Pipeline):
    def extract(self) -> pd.DataFrame:
        """Download and read CNPJ data from Receita Federal"""
        ...  # Download logic omitted in this excerpt

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Normalize and clean data"""
        df['cnpj'] = df['cnpj'].apply(format_cnpj)
        df['name'] = df['razao_social'].apply(normalize_name)
        df['start_date'] = df['data_inicio'].apply(parse_date)
        return deduplicate_rows(df)

    def load(self, df: pd.DataFrame, driver: Driver) -> None:
        """Write to Neo4j in batches"""
        loader = Neo4jBatchLoader(driver)
        loader.load_companies(df)
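The extract, transform, load contract can be illustrated with a small self-contained sketch. The `SketchPipeline` class and its `run()` method are assumptions made for illustration, since the real `bracc_etl.base.Pipeline` is not shown here, and plain lists of dicts stand in for DataFrames and the Neo4j driver.

```python
from abc import ABC, abstractmethod

Row = dict[str, str]


class SketchPipeline(ABC):
    """Hypothetical stand-in for bracc_etl.base.Pipeline."""

    @abstractmethod
    def extract(self) -> list[Row]: ...

    @abstractmethod
    def transform(self, rows: list[Row]) -> list[Row]: ...

    @abstractmethod
    def load(self, rows: list[Row], sink: list[Row]) -> None: ...

    def run(self, sink: list[Row]) -> int:
        """Run the three stages in order and return the number of rows loaded."""
        rows = self.transform(self.extract())
        self.load(rows, sink)
        return len(rows)


class InMemoryCompanyPipeline(SketchPipeline):
    def extract(self) -> list[Row]:
        # Stand-in for downloading and reading a source file.
        return [
            {"cnpj": "12345678000190", "razao_social": "  acme ltda "},
            {"cnpj": "12345678000190", "razao_social": "ACME LTDA"},
        ]

    def transform(self, rows: list[Row]) -> list[Row]:
        # Normalize names, then keep the first row seen for each CNPJ.
        seen: dict[str, Row] = {}
        for row in rows:
            clean = {"cnpj": row["cnpj"], "name": row["razao_social"].strip().upper()}
            seen.setdefault(clean["cnpj"], clean)
        return list(seen.values())

    def load(self, rows: list[Row], sink: list[Row]) -> None:
        # Stand-in for a batched write to Neo4j.
        sink.extend(rows)
```

Keeping each stage a separate method means a stage can be tested on its own, for example by calling `transform()` on a handful of hand-written rows.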

Common Transforms

Shared transformation utilities in etl/src/bracc_etl/transforms/:

# transforms/document_formatting.py
from bracc_etl.transforms import format_cnpj, format_cpf, strip_document

# CNPJ: 12345678000190 → 12.345.678/0001-90
formatted = format_cnpj("12345678000190")

# CPF: 12345678900 → 123.456.789-00 (masked in public mode)
formatted = format_cpf("12345678900")

# Remove formatting: 12.345.678/0001-90 → 12345678000190
stripped = strip_document("12.345.678/0001-90")
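For readers who want to see what such helpers might look like, here is a minimal sketch. The function bodies are assumptions, not the project's code: they cover only digit stripping and punctuation, and they leave out the public-mode CPF masking mentioned above.

```python
import re


def strip_document(value: str) -> str:
    """Drop every non-digit character from a CNPJ or CPF string."""
    return re.sub(r"[^0-9]", "", value)


def format_cnpj(value: str) -> str:
    """Format 14 digits as NN.NNN.NNN/NNNN-NN."""
    digits = strip_document(value)
    if len(digits) != 14:
        raise ValueError(f"CNPJ must have 14 digits, got {len(digits)}")
    return f"{digits[:2]}.{digits[2:5]}.{digits[5:8]}/{digits[8:12]}-{digits[12:]}"


def format_cpf(value: str) -> str:
    """Format 11 digits as NNN.NNN.NNN-NN (no masking in this sketch)."""
    digits = strip_document(value)
    if len(digits) != 11:
        raise ValueError(f"CPF must have 11 digits, got {len(digits)}")
    return f"{digits[:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9:]}"
```

Because both formatters strip first, they accept already formatted input and return it unchanged.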

ETL Orchestration

From etl/src/bracc_etl/runner.py:

# Run single pipeline
python -m bracc_etl.runner --source cnpj

# Run multiple pipelines
python -m bracc_etl.runner --source cnpj,tse,transparencia

# Full orchestration (all 45+ sources)
make bootstrap-all

The bootstrap-all workflow:

1. Loads the source contract from config/bootstrap_all_contract.yml
2. Runs pipelines in dependency order
3. Continues on errors and classifies outcomes
4. Generates audit reports in audit-results/bootstrap-all/
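The ordering and error-handling steps of this workflow can be sketched with the standard library. Everything named here is hypothetical: `run_in_dependency_order`, the outcome labels, and the choice to skip a pipeline whose dependency failed are assumptions, not the behavior of the real runner.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_in_dependency_order(
    pipelines: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> dict[str, str]:
    """Run each pipeline once, dependencies first, and classify every outcome."""
    outcomes: dict[str, str] = {}
    graph = {name: depends_on.get(name, set()) for name in pipelines}
    for name in TopologicalSorter(graph).static_order():
        # Assumed policy: do not run a pipeline whose dependency did not succeed.
        if any(outcomes.get(dep) != "ok" for dep in depends_on.get(name, set())):
            outcomes[name] = "skipped"
            continue
        try:
            pipelines[name]()
            outcomes[name] = "ok"
        except Exception as exc:  # keep going so one bad source cannot stop the run
            outcomes[name] = f"failed: {exc}"
    return outcomes
```

The returned mapping is the kind of per-source summary an audit report could be built from.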

3. Graph Database Layer

Neo4j 5 Community Edition stores all data as a labeled property graph.

Schema Overview

From infra/neo4j/init.cypher, the schema defines:

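Node labels in four groups: core entities, transactions, sanctions and restrictions, and sector-specific. The uniqueness constraints for the first three groups follow.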

Person - Individuals (CPF)

CREATE CONSTRAINT person_cpf_unique IF NOT EXISTS
  FOR (p:Person) REQUIRE p.cpf IS UNIQUE;

Company - Organizations (CNPJ)

CREATE CONSTRAINT company_cnpj_unique IF NOT EXISTS
  FOR (c:Company) REQUIRE c.cnpj IS UNIQUE;

Partner - Company partners/shareholders

CREATE CONSTRAINT partner_id_unique IF NOT EXISTS
  FOR (p:Partner) REQUIRE p.partner_id IS UNIQUE;

Contract - Public contracts

CREATE CONSTRAINT contract_contract_id_unique IF NOT EXISTS
  FOR (c:Contract) REQUIRE c.contract_id IS UNIQUE;

Bid - Procurement bids

CREATE CONSTRAINT bid_id_unique IF NOT EXISTS
  FOR (b:Bid) REQUIRE b.bid_id IS UNIQUE;

Finance - Financial transactions

CREATE CONSTRAINT finance_id_unique IF NOT EXISTS
  FOR (f:Finance) REQUIRE f.finance_id IS UNIQUE;

Sanction - CEIS/CNEP sanctions

CREATE CONSTRAINT sanction_sanction_id_unique IF NOT EXISTS
  FOR (s:Sanction) REQUIRE s.sanction_id IS UNIQUE;

Embargo - IBAMA environmental embargoes

CREATE CONSTRAINT embargo_id_unique IF NOT EXISTS
  FOR (e:Embargo) REQUIRE e.embargo_id IS UNIQUE;

InternationalSanction - OFAC, EU, UN

CREATE CONSTRAINT international_sanction_id_unique IF NOT EXISTS
  FOR (s:InternationalSanction) REQUIRE s.sanction_id IS UNIQUE;

Graph Relationships

Common relationship types:

// Ownership and control
(Person)-[:PARTNER_OF]->(Company)
(Company)-[:OWNS]->(Company)  // Holdings
(Company)-[:HOLDING_GROUP]->(Holding)

// Transactions
(Company)-[:HAS_CONTRACT]->(Contract)
(Company)-[:SUBMITTED_BID]->(Bid)
(Company)-[:RECEIVED_FINANCE]->(Finance)

// Issues
(Company)-[:HAS_SANCTION]->(Sanction)
(Company)-[:HAS_EMBARGO]->(Embargo)
(Company)-[:UNDER_INVESTIGATION]->(Inquiry)

// Source attribution
(Node)-[:SOURCED_FROM]->(SourceDocument)

Query Examples

From api/src/bracc/queries/:

// queries/public_graph_company.cypher
MATCH (c:Company {cnpj: $company_identifier})
OPTIONAL MATCH path = (c)-[*1..3]-(related)
WHERE NOT related:Person OR $allow_person = true
RETURN c, collect(distinct related) as nodes,
       collect(distinct relationships(path)) as rels
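Keeping Cypher in separate `.cypher` files implies some way of loading a template by name. The following is a minimal sketch of one approach; `make_query_loader` and its name validation are assumptions, not the project's loader.

```python
from functools import lru_cache
from pathlib import Path
from typing import Callable


def make_query_loader(query_dir: Path) -> Callable[[str], str]:
    """Return a cached loader that maps a query name to its Cypher text."""

    @lru_cache(maxsize=None)
    def load_query(name: str) -> str:
        # Allow only letters, digits, and underscores so a name cannot escape query_dir.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"invalid query name: {name!r}")
        return (query_dir / f"{name}.cypher").read_text(encoding="utf-8").strip()

    return load_query
```

Values such as `$company_identifier` stay as query parameters and are passed to the driver at execution time, so the template text itself never needs string interpolation.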

4. Backend API Layer

FastAPI (Python 3.12+) provides an async REST API with automatic OpenAPI documentation.

Project Structure

api/src/bracc/
├── main.py              # FastAPI app, middleware, lifespan
├── config.py            # Settings (Pydantic BaseSettings)
├── dependencies.py      # Neo4j driver, auth dependencies
├── routers/             # API endpoints
│   ├── public.py        # Public graph/pattern endpoints
│   ├── meta.py          # Health, stats, sources
│   ├── search.py        # Entity search
│   ├── graph.py         # Graph expansion
│   ├── entity.py        # Entity details
│   ├── patterns.py      # Pattern detection
│   ├── investigation.py # Investigation management
│   └── auth.py          # Authentication
├── services/            # Business logic
│   ├── neo4j_service.py            # Query execution
│   ├── public_guard.py             # Privacy enforcement
│   ├── intelligence_provider.py    # Pattern engine
│   └── source_registry.py          # Data source metadata
├── models/              # Pydantic models
│   ├── entity.py
│   ├── graph.py
│   ├── pattern.py
│   └── investigation.py
├── middleware/          # HTTP middleware
│   ├── security_headers.py
│   ├── cpf_masking.py
│   └── rate_limit.py
└── queries/             # Cypher query templates
    └── *.cypher

Key Endpoints

From api/src/bracc/routers/public.py and meta.py:

The endpoints below fall into three groups: public endpoints, health and stats, and pattern detection.

# public.py
@router.get("/api/v1/public/meta")
async def public_meta(session: AsyncSession) -> dict:
    """Aggregated metrics and source health"""
    # Illustrative figures, not live counts
    return {
        "total_nodes": 40_000_000,
        "company_count": 60_000_000,
        "contract_count": 5_000_000,
        "source_health": {...}
    }

@router.get("/api/v1/public/graph/company/{company_ref}")
async def public_graph_for_company(
    company_ref: str,
    depth: int = 2
) -> GraphResponse:
    """Public company subgraph (CNPJ or ID)"""
    # Enforces public-safe defaults
    # Filters out Person nodes in public mode
    # Returns nodes, edges, and center_id

# meta.py
@router.get("/api/v1/meta/health")
async def neo4j_health(session: AsyncSession):
    """Neo4j connection health check"""
    return {"neo4j": "connected"}

@router.get("/api/v1/meta/stats")
async def database_stats(session: AsyncSession):
    """Comprehensive graph statistics (cached 5min)"""
    return {
        "total_nodes": ...,
        "company_count": ...,
        "implemented_sources": 45,
        "loaded_sources": 38,
        # ... further fields elided
    }

# public.py
@router.get("/api/v1/public/patterns/company/{cnpj_or_id}")
async def public_patterns_for_company(
    cnpj_or_id: str,
    lang: str = "pt"
) -> PatternResponse:
    """Run pattern detection on company"""
    # Requires PATTERNS_ENABLED=true
    # Returns detected patterns with evidence
    patterns = [
        "split_contracts_below_threshold",
        "sanctioned_still_receiving",
        "debtor_contracts",
        ...
    ]

Privacy & Security

From api/src/bracc/main.py:

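# Middleware stack: Starlette runs middleware in reverse registration order,
# so the last one added (CORS) is outermost and sees each request first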
app.add_middleware(CPFMaskingMiddleware)        # Mask CPF in responses
app.add_middleware(SecurityHeadersMiddleware)   # CSP, HSTS, etc.
app.add_middleware(SlowAPIMiddleware)           # Rate limiting
app.add_middleware(CORSMiddleware)              # CORS headers

# Public-safe defaults enforced in services/public_guard.py
def enforce_person_access_policy(labels: list[str]) -> None:
    """Block Person nodes in public mode"""
    if settings.public_mode and has_person_labels(labels):
        raise HTTPException(403, "Person data not accessible in public mode")

def sanitize_public_properties(props: dict) -> dict:
    """Remove CPF and other sensitive fields"""
    return {k: v for k, v in props.items() if k not in SENSITIVE_KEYS}
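To make the idea of masking CPFs in responses concrete, here is a minimal regex-based sketch. The pattern, the mask string, and the `mask_cpfs` name are assumptions; the actual `CPFMaskingMiddleware` may work differently.

```python
import re

# Formatted (123.456.789-00) or bare 11-digit CPFs that are not part of a longer digit run.
CPF_PATTERN = re.compile(
    r"(?<![0-9])(?:[0-9]{3}\.[0-9]{3}\.[0-9]{3}-[0-9]{2}|[0-9]{11})(?![0-9])"
)


def mask_cpfs(text: str) -> str:
    """Replace every CPF-looking token in the text with a fixed mask."""
    return CPF_PATTERN.sub("***.***.***-**", text)
```

The lookarounds keep 14-digit CNPJs intact, which matters because company identifiers are public while personal ones are not. A bare 11-digit number that is not a CPF would also be masked, which is the conservative direction to err in.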

Environment flags:

PUBLIC_MODE=true                    # Enable public-safe mode
PUBLIC_ALLOW_PERSON=false           # Block Person node access
PUBLIC_ALLOW_ENTITY_LOOKUP=false    # Block direct entity lookup
PUBLIC_ALLOW_INVESTIGATIONS=false   # Block investigation features
PATTERNS_ENABLED=false              # Disable pattern engine
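A minimal sketch of how these flags could be read into a typed settings object follows. It uses a standard-library dataclass as a stand-in for the Pydantic settings class in config.py; the `PublicFlags` name, the accepted truthy spellings, and the fallback defaults (chosen to match the values listed above) are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = frozenset({"1", "true", "yes", "on"})


def read_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean flag; unset variables fall back to the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


@dataclass(frozen=True)
class PublicFlags:
    public_mode: bool
    allow_person: bool
    allow_entity_lookup: bool
    allow_investigations: bool
    patterns_enabled: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "PublicFlags":
        source = os.environ if env is None else env
        return cls(
            public_mode=read_flag(source, "PUBLIC_MODE", True),
            allow_person=read_flag(source, "PUBLIC_ALLOW_PERSON", False),
            allow_entity_lookup=read_flag(source, "PUBLIC_ALLOW_ENTITY_LOOKUP", False),
            allow_investigations=read_flag(source, "PUBLIC_ALLOW_INVESTIGATIONS", False),
            patterns_enabled=read_flag(source, "PATTERNS_ENABLED", False),
        )
```

Defaulting to the restrictive value means a missing variable cannot accidentally expose Person data.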

5. Frontend Layer

React 19 + TypeScript + Vite provides the web interface.

Tech Stack

From frontend/package.json (the // annotations are added here for explanation; JSON itself does not allow comments):

{
  "dependencies": {
    "react": "^19.0.0",
    "react-dom": "^19.0.0",
    "react-router": "^7.0.0",
    "react-force-graph-2d": "^1.26.0",  // Graph visualization
    "react-hook-form": "^7.71.2",        // Form management
    "zustand": "^5.0.0",                  // State management
    "i18next": "^24.0.0",                 // Internationalization
    "lucide-react": "^0.575.0",          // Icons
    "zod": "^4.3.6"                       // Schema validation
  },
  "devDependencies": {
    "vite": "^6.0.0",
    "typescript": "^5.7.0",
    "vitest": "^3.0.0",
    "@testing-library/react": "^16.0.0"
  }
}

Key Features

Company Search - Search by CNPJ, name, or ID
Graph Visualization - Interactive force-directed graph with react-force-graph-2d
Entity Details - Comprehensive entity information panels
Investigation Workspace - Create and manage investigation collections
Pattern Highlighting - Visual indicators for detected patterns
Multi-language - Portuguese (pt-BR) and English (en) via i18next

Docker Compose Architecture

From docker-compose.yml:

services:
  neo4j:
    image: neo4j:5-community
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/${NEO4J_PASSWORD}
      NEO4J_PLUGINS: '["apoc"]'
    volumes:
      - neo4j-data:/data
      - ./infra/neo4j/init.cypher:/var/lib/neo4j/init.cypher
    healthcheck:
      test: cypher-shell -u neo4j -p $NEO4J_PASSWORD "RETURN 1"
      interval: 10s
      retries: 8

  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
    depends_on:
      neo4j:
        condition: service_healthy

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      VITE_API_URL: http://localhost:8000
    depends_on:
      api:
        # service_healthy requires a healthcheck for api, defined here or in its Dockerfile
        condition: service_healthy

  etl:
    build: ./etl
    profiles: ["etl"]  # Only starts with --profile etl
    volumes:
      - .:/workspace
    depends_on:
      neo4j:
        condition: service_healthy

Data Flow Example

Let’s trace a complete request flow:

User requests company graph

Frontend calls: GET /api/v1/public/graph/company/12345678000190?depth=2

API routes request

FastAPI router in routers/public.py receives request:

@router.get("/graph/company/{company_ref}")
async def public_graph_for_company(
    company_ref: str,
    depth: int = 2
) -> GraphResponse:
    ...  # handler body elided

Resolve company identifier

Helper function _resolve_company() queries Neo4j:

MATCH (c:Company)
WHERE c.cnpj = $company_identifier
   OR elementId(c) = $company_id
RETURN elementId(c) as entity_id, labels(c) as labels
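Before running this lookup, the handler has to decide whether the path segment is a CNPJ or a node ID. A minimal sketch of that decision follows; `classify_company_ref` is a hypothetical helper, and the format in which CNPJs are stored in the graph (bare digits or punctuated) is not stated here, so the sketch simply returns bare digits.

```python
import re


def classify_company_ref(company_ref: str) -> tuple[str, str]:
    """Return ("cnpj", digits) for CNPJ-like input, otherwise ("element_id", ref)."""
    ref = company_ref.strip()
    # Remove the punctuation used in formatted CNPJs: dots, slash, hyphen, spaces.
    digits = re.sub(r"[./\-\s]", "", ref)
    if re.fullmatch(r"[0-9]{14}", digits):
        return ("cnpj", digits)
    return ("element_id", ref)
```

The result can then be mapped onto the `$company_identifier` and `$company_id` parameters, leaving the one that does not apply as null.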

Execute graph query

Load query from queries/public_graph_company.cypher:

MATCH (c:Company) WHERE elementId(c) = $company_id
CALL apoc.path.subgraphAll(c, {
    maxLevel: $depth,
    relationshipFilter: ">"
})
YIELD nodes, relationships
RETURN nodes, relationships, $company_id as center_id

Enforce privacy policies

Filter results through public_guard.py:

Remove Person nodes if PUBLIC_MODE=true
Strip CPF fields from all nodes
Filter sensitive properties
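The three filtering steps can be sketched on plain dictionaries. The node and edge shapes, the `filter_public_subgraph` name, and the contents of `SENSITIVE_KEYS` are assumptions for illustration; the real logic lives in public_guard.py.

```python
from typing import Any

SENSITIVE_KEYS = frozenset({"cpf"})  # assumed; the real set is not shown in this document

Node = dict[str, Any]
Edge = dict[str, Any]


def filter_public_subgraph(
    nodes: list[Node], edges: list[Edge], public_mode: bool = True
) -> tuple[list[Node], list[Edge]]:
    """Drop Person nodes in public mode, strip sensitive keys, and remove dangling edges."""
    kept = [n for n in nodes if not (public_mode and "Person" in n["labels"])]
    kept_ids = {n["id"] for n in kept}
    safe_nodes = [
        {**n, "properties": {k: v for k, v in n["properties"].items() if k not in SENSITIVE_KEYS}}
        for n in kept
    ]
    # An edge survives only if both of its endpoints survived.
    safe_edges = [e for e in edges if e["source"] in kept_ids and e["target"] in kept_ids]
    return safe_nodes, safe_edges
```

Removing edges along with the nodes they touch matters: a leftover edge would still reveal that a hidden Person is connected to the company.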

Transform to response model

Convert Neo4j results to Pydantic models:

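nodes = [GraphNode(
    id=node.element_id,
    label=node.get('razao_social'),  # None for nodes without this property
    type=sorted(node.labels)[0],     # labels is a frozenset, so it cannot be indexed directly
    document_id=node.get('cnpj'),
    properties=sanitize_public_properties(dict(node)),
    sources=[SourceAttribution(database="CNPJ")]
) for node in raw_nodes]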

Return JSON response

FastAPI serializes to JSON:

{
  "nodes": [{"id": "...", "label": "Acme Corp", ...}],
  "edges": [{"id": "...", "source": "...", "target": "..."}],
  "center_id": "4:abc123:0"
}

Frontend renders graph

React component uses react-force-graph-2d to visualize:

<ForceGraph2D
  graphData={{nodes, links: edges}}
  nodeLabel={node => node.label}
  nodeColor={node => getColorByType(node.type)}
  onNodeClick={handleNodeClick}
/>

Performance Considerations

Neo4j Optimization

Memory Configuration

For production scale (40M+ nodes):

NEO4J_HEAP_INITIAL=4G
NEO4J_HEAP_MAX=8G
NEO4J_PAGECACHE=12G

The heap maximum and page cache together reserve 20G for Neo4j alone, so the host requires 32GB+ RAM (64GB recommended).

Query Optimization

Every uniqueness constraint is backed by an index that Neo4j creates automatically
Use elementId() for fast node lookup
Limit depth in graph traversals (max 3 recommended)
Use APOC procedures for complex graph operations

Batch Loading

ETL uses Neo4jBatchLoader with:

Batch size: 1000-5000 nodes
UNWIND for bulk inserts
Periodic commits for large datasets
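The batching and UNWIND pattern can be sketched without a database by injecting the function that executes a query. The `MERGE_COMPANIES` statement, the property names it sets, and the `load_companies` signature are assumptions, not the code of `Neo4jBatchLoader`.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator

Row = dict[str, Any]

# One parameterized statement handles a whole batch via UNWIND.
MERGE_COMPANIES = """
UNWIND $rows AS row
MERGE (c:Company {cnpj: row.cnpj})
SET c.razao_social = row.razao_social
"""


def batched(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def load_companies(
    run_query: Callable[[str, dict[str, Any]], Any],
    rows: Iterable[Row],
    batch_size: int = 1000,
) -> int:
    """Send one UNWIND statement per batch and return the number of rows sent."""
    total = 0
    for batch in batched(rows, batch_size):
        run_query(MERGE_COMPANIES, {"rows": batch})
        total += len(batch)
    return total
```

With a real driver, `run_query` could wrap a session call so that each batch commits in its own transaction, which keeps memory use bounded on large sources.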

API Performance

Caching

Stats endpoint cached for 5 minutes
Neo4j connection pooling (default: 50 connections)
Response compression via FastAPI middleware
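A five-minute cache for an expensive stats query can be as small as the sketch below. `TTLCache` is a hypothetical helper, not the project's implementation; the injectable clock exists only to make the expiry behavior easy to test.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache a single computed value and recompute it after ttl_seconds."""

    def __init__(
        self,
        compute: Callable[[], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        # Recompute on first use and whenever the stored value has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

This version is not safe against several requests recomputing at once; an async service would likely guard the recompute with a lock.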

Rate Limiting

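from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.get("/api/v1/public/graph/company/{ref}")
@limiter.limit("100/minute")
async def get_company_graph(request: Request, ref: str):
    # slowapi needs the Request argument to identify the caller
    ...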

Async Operations

FastAPI with async/await throughout
Neo4j async driver (neo4j.AsyncDriver)
Concurrent query execution where possible
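Concurrent execution of independent queries can be sketched with `asyncio.gather`. The `gather_stats` helper and the counter callables are hypothetical; with the Neo4j async driver, each concurrent query would need its own session, because a session is not meant to be shared between coroutines running at the same time.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_stats(counters: dict[str, Callable[[], Awaitable[int]]]) -> dict[str, int]:
    """Await all counter coroutines concurrently and key the results by name."""
    names = list(counters)
    values = await asyncio.gather(*(counters[name]() for name in names))
    return dict(zip(names, values))
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why zipping them back onto the names is safe.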

Security Architecture

Defense in Depth

Environment-based Configuration
- Public mode disables sensitive endpoints
- Tier system (community vs enterprise)
- Feature flags for experimental features

Middleware Stack
- Security headers (CSP, HSTS, X-Frame-Options)
- CPF masking in all responses
- Rate limiting per IP
- CORS with explicit origin whitelist

Data Access Control
- Public-safe defaults block Person nodes
- Property-level sanitization
- Query-level enforcement (not relying on frontend)

Authentication (when enabled)
- JWT tokens with secure secret
- Password hashing with bcrypt
- Invite-code system for registration

Audit Trail
- All ingestion runs logged to graph
- Source attribution on every node
- Temporal violation tracking

Deployment Architectures

Three deployment modes are described: local development, production single-server, and production distributed. Local development uses Docker Compose:

docker compose up -d

Single-machine setup:

Neo4j, API, Frontend in containers
Seed data for testing
Hot reload for development

Monitoring & Observability

Neo4j Metrics

Query performance via dbms.querylog
Memory usage
Transaction throughput
Cache hit rates

API Metrics

Request rate and latency
Error rates by endpoint
Rate limit hits
Health check status

ETL Metrics

Pipeline success/failure rates
Data quality metrics
Source availability
Ingestion duration

System Metrics

CPU and memory usage
Disk I/O
Network throughput
Docker container health

Next Steps

API Reference

Explore all available endpoints

ETL Deep Dive

Learn about data pipelines

Graph Schema

Complete Neo4j schema reference

Contributing

Contribute to the project
