System Overview
BR-ACC is built as a modern, containerized application stack that ingests Brazilian public data, normalizes it into a graph database, and exposes it through a REST API and web interface.
Architecture Layers
1. Data Sources Layer
BR-ACC integrates 45+ public data sources, grouped as follows:

Government Portals:
Receita Federal - CNPJ company registry (60M+ companies)
Portal da Transparência - Federal spending, sanctions, contracts
TSE - Electoral data, donations, candidate assets, party membership
ComprasNet/PNCP - Federal procurement and bids
TCU - Federal Court of Auditors sanctions
TransfereGov - Federal transfers to states/municipalities

Sector-Specific:
DataSUS/CNES - Health facilities
INEP - Education census (schools)
IBAMA - Environmental embargoes
RAIS/CAGED - Labor statistics
BNDES - Development bank loans
CVM - Securities commission proceedings and funds

International:
ICIJ Offshore Leaks - Offshore entities
OpenSanctions - Global PEPs and sanctions
OFAC - US sanctions
EU Sanctions - European sanctions
UN Sanctions - UN Security Council
World Bank - Debarred entities

Derived:
Holdings - Corporate group detection from CNPJ
Entity Resolution - Cross-source entity matching
All sources are accessed through their official APIs or open data portals. No scraping of private data occurs.
2. ETL Layer
The ETL (Extract, Transform, Load) layer is built with Python and follows a modular pipeline architecture.
Pipeline Structure
Each data source has a dedicated pipeline in etl/src/bracc_etl/pipelines/:
# Example: etl/src/bracc_etl/pipelines/cnpj.py
import pandas as pd
from neo4j import Driver

from bracc_etl.base import Pipeline
from bracc_etl.loader import Neo4jBatchLoader
from bracc_etl.transforms import (
    format_cnpj,
    format_cpf,
    normalize_name,
    parse_date,
    deduplicate_rows,
)


class CNPJPipeline(Pipeline):
    def extract(self) -> pd.DataFrame:
        """Download and read CNPJ data from Receita Federal"""
        # Download logic

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Normalize and clean data"""
        df['cnpj'] = df['cnpj'].apply(format_cnpj)
        df['name'] = df['razao_social'].apply(normalize_name)
        df['start_date'] = df['data_inicio'].apply(parse_date)
        return deduplicate_rows(df)

    def load(self, df: pd.DataFrame, driver: Driver) -> None:
        """Write to Neo4j in batches"""
        loader = Neo4jBatchLoader(driver)
        loader.load_companies(df)
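The source does not show `bracc_etl.base.Pipeline` itself. As a rough illustration of the extract/transform/load contract described above, here is a minimal sketch that uses plain lists of dicts instead of pandas and Neo4j. `SketchPipeline`, `DemoPipeline`, `run`, and the `sink` argument are all hypothetical names chosen for this example, not part of the project.

```python
from abc import ABC, abstractmethod
from typing import Any

Row = dict[str, Any]


class SketchPipeline(ABC):
    """Hypothetical stand-in for bracc_etl.base.Pipeline (not the real class)."""

    @abstractmethod
    def extract(self) -> list[Row]: ...

    @abstractmethod
    def transform(self, rows: list[Row]) -> list[Row]: ...

    @abstractmethod
    def load(self, rows: list[Row], sink: list[Row]) -> None: ...

    def run(self, sink: list[Row]) -> int:
        """Chain the three stages and return the number of rows loaded."""
        rows = self.transform(self.extract())
        self.load(rows, sink)
        return len(rows)


class DemoPipeline(SketchPipeline):
    def extract(self) -> list[Row]:
        # Stand-in for a download step: two spellings of the same company.
        return [{"razao_social": " acme ltda "}, {"razao_social": "ACME LTDA"}]

    def transform(self, rows: list[Row]) -> list[Row]:
        # Normalize names, then drop duplicates while keeping first-seen order.
        seen: set[str] = set()
        out: list[Row] = []
        for row in rows:
            name = row["razao_social"].strip().upper()
            if name not in seen:
                seen.add(name)
                out.append({"name": name})
        return out

    def load(self, rows: list[Row], sink: list[Row]) -> None:
        # A list plays the role of the database in this sketch.
        sink.extend(rows)
```

Keeping each stage as a separate method means a pipeline can be tested stage by stage without a live database.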
Shared transformation utilities in etl/src/bracc_etl/transforms/ cover document formatting, name normalization, date formatting, and deduplication. Document formatting example:
# transforms/document_formatting.py
from bracc_etl.transforms import format_cnpj, format_cpf, strip_document

# CNPJ: 12345678000190 → 12.345.678/0001-90
formatted = format_cnpj("12345678000190")

# CPF: 12345678900 → 123.456.789-00 (masked in public mode)
formatted = format_cpf("12345678900")

# Remove formatting: 12.345.678/0001-90 → 12345678000190
stripped = strip_document("12.345.678/0001-90")
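The bodies of these helpers are not shown in the source. The sketch below is one plausible way to get the input/output pairs listed above; the length checks and error messages are assumptions, and the real `bracc_etl.transforms` functions may behave differently (for example, around masking or check-digit validation).

```python
import re


def strip_document(value: str) -> str:
    """Drop every non-digit character from a CNPJ or CPF string."""
    return re.sub(r"\D", "", value)


def format_cnpj(value: str) -> str:
    """Format 14 digits as NN.NNN.NNN/NNNN-NN."""
    digits = strip_document(value)
    if len(digits) != 14:
        raise ValueError(f"CNPJ must have 14 digits, got {len(digits)}")
    return f"{digits[:2]}.{digits[2:5]}.{digits[5:8]}/{digits[8:12]}-{digits[12:]}"


def format_cpf(value: str) -> str:
    """Format 11 digits as NNN.NNN.NNN-NN (no masking in this sketch)."""
    digits = strip_document(value)
    if len(digits) != 11:
        raise ValueError(f"CPF must have 11 digits, got {len(digits)}")
    return f"{digits[:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9:]}"
```

Stripping before formatting makes both formatters idempotent: already-formatted input produces the same output as raw digits.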
ETL Orchestration
From etl/src/bracc_etl/runner.py:
# Run single pipeline
python -m bracc_etl.runner --source cnpj
# Run multiple pipelines
python -m bracc_etl.runner --source cnpj,tse,transparencia
# Full orchestration (all 45+ sources)
make bootstrap-all
The bootstrap-all workflow:
Loads source contract from config/bootstrap_all_contract.yml
Runs pipelines in dependency order
Continues on errors and classifies outcomes
Generates audit reports in audit-results/bootstrap-all/
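The workflow above (dependency order, continue on errors, classify outcomes) can be sketched with the standard library. This is not the project's runner: `run_in_dependency_order`, the outcome labels, and the choice to skip a pipeline whose dependency did not succeed are all assumptions made for illustration, and every dependency is assumed to be a key in `pipelines`.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_in_dependency_order(
    pipelines: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> dict[str, str]:
    """Run each pipeline after its dependencies; never abort the whole run."""
    graph = {name: depends_on.get(name, set()) for name in pipelines}
    outcomes: dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        # Assumption: a pipeline is skipped when any dependency did not succeed.
        if any(outcomes.get(dep) != "ok" for dep in graph[name]):
            outcomes[name] = "skipped"
            continue
        try:
            pipelines[name]()
            outcomes[name] = "ok"
        except Exception:
            # Record the failure and keep going with the remaining sources.
            outcomes[name] = "failed"
    return outcomes
```

The returned mapping is the kind of per-source classification an audit report could be built from.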
3. Graph Database Layer
Neo4j 5 Community Edition stores all data as a labeled property graph.
Schema Overview
From infra/neo4j/init.cypher, the schema defines:
Core Entities:

Person - Individuals (CPF)
    CREATE CONSTRAINT person_cpf_unique IF NOT EXISTS
    FOR (p:Person) REQUIRE p.cpf IS UNIQUE;

Company - Organizations (CNPJ)
    CREATE CONSTRAINT company_cnpj_unique IF NOT EXISTS
    FOR (c:Company) REQUIRE c.cnpj IS UNIQUE;

Partner - Company partners/shareholders
    CREATE CONSTRAINT partner_id_unique IF NOT EXISTS
    FOR (p:Partner) REQUIRE p.partner_id IS UNIQUE;

Transactions:

Contract - Public contracts
    CREATE CONSTRAINT contract_contract_id_unique IF NOT EXISTS
    FOR (c:Contract) REQUIRE c.contract_id IS UNIQUE;

Bid - Procurement bids
    CREATE CONSTRAINT bid_id_unique IF NOT EXISTS
    FOR (b:Bid) REQUIRE b.bid_id IS UNIQUE;

Finance - Financial transactions
    CREATE CONSTRAINT finance_id_unique IF NOT EXISTS
    FOR (f:Finance) REQUIRE f.finance_id IS UNIQUE;

Sanctions & Restrictions:

Sanction - CEIS/CNEP sanctions
    CREATE CONSTRAINT sanction_sanction_id_unique IF NOT EXISTS
    FOR (s:Sanction) REQUIRE s.sanction_id IS UNIQUE;

Embargo - IBAMA environmental embargoes
    CREATE CONSTRAINT embargo_id_unique IF NOT EXISTS
    FOR (e:Embargo) REQUIRE e.embargo_id IS UNIQUE;

InternationalSanction - OFAC, EU, UN
    CREATE CONSTRAINT international_sanction_id_unique IF NOT EXISTS
    FOR (s:InternationalSanction) REQUIRE s.sanction_id IS UNIQUE;

Sector-Specific:

Health - Healthcare facilities (CNES)
Education - Schools (INEP)
LaborStats - Employment data (RAIS)
Amendment - Budget amendments (SIOP)
Expense - Government expenses (CPGF)
GovTravel - Government travel
TaxWaiver - Tax incentives

And 20+ more specialized node types…
Graph Relationships
Common relationship types:
// Ownership and control
(Person)-[:PARTNER_OF]->(Company)
(Company)-[:OWNS]->(Company)  // Holdings
(Company)-[:HOLDING_GROUP]->(Holding)

// Transactions
(Company)-[:HAS_CONTRACT]->(Contract)
(Company)-[:SUBMITTED_BID]->(Bid)
(Company)-[:RECEIVED_FINANCE]->(Finance)

// Issues
(Company)-[:HAS_SANCTION]->(Sanction)
(Company)-[:HAS_EMBARGO]->(Embargo)
(Company)-[:UNDER_INVESTIGATION]->(Inquiry)

// Source attribution
(Node)-[:SOURCED_FROM]->(SourceDocument)
Query Examples
From api/src/bracc/queries/:
Find Sanctioned Companies
Aggregate Statistics
Pattern Detection
// queries/public_graph_company.cypher
MATCH (c:Company {cnpj: $company_identifier})
OPTIONAL MATCH path = (c)-[*1..3]-(related)
WHERE NOT related:Person OR $allow_person = true
RETURN c, collect(DISTINCT related) AS nodes,
       collect(DISTINCT relationships(path)) AS rels
4. Backend API Layer
FastAPI (Python 3.12+) provides an async REST API with automatic OpenAPI documentation.
Project Structure
api/src/bracc/
├── main.py # FastAPI app, middleware, lifespan
├── config.py # Settings (Pydantic BaseSettings)
├── dependencies.py # Neo4j driver, auth dependencies
├── routers/ # API endpoints
│ ├── public.py # Public graph/pattern endpoints
│ ├── meta.py # Health, stats, sources
│ ├── search.py # Entity search
│ ├── graph.py # Graph expansion
│ ├── entity.py # Entity details
│ ├── patterns.py # Pattern detection
│ ├── investigation.py # Investigation management
│ └── auth.py # Authentication
├── services/ # Business logic
│ ├── neo4j_service.py # Query execution
│ ├── public_guard.py # Privacy enforcement
│ ├── intelligence_provider.py # Pattern engine
│ └── source_registry.py # Data source metadata
├── models/ # Pydantic models
│ ├── entity.py
│ ├── graph.py
│ ├── pattern.py
│ └── investigation.py
├── middleware/ # HTTP middleware
│ ├── security_headers.py
│ ├── cpf_masking.py
│ └── rate_limit.py
└── queries/ # Cypher query templates
└── *.cypher
Key Endpoints
From api/src/bracc/routers/public.py and meta.py:
Public Endpoints:

# public.py
@router.get("/api/v1/public/meta")
async def public_meta(session: AsyncSession) -> dict:
    """Aggregated metrics and source health"""
    # Illustrative values only
    return {
        "total_nodes": 40_000_000,
        "company_count": 60_000_000,
        "contract_count": 5_000_000,
        "source_health": {...},
    }

@router.get("/api/v1/public/graph/company/{company_ref}")
async def public_graph_for_company(
    company_ref: str,
    depth: int = 2,
) -> GraphResponse:
    """Public company subgraph (CNPJ or ID)"""
    # Enforces public-safe defaults
    # Filters out Person nodes in public mode
    # Returns nodes, edges, and center_id

Health & Stats:

# meta.py
@router.get("/api/v1/meta/health")
async def neo4j_health(session: AsyncSession):
    """Neo4j connection health check"""
    return {"neo4j": "connected"}

@router.get("/api/v1/meta/stats")
async def database_stats(session: AsyncSession):
    """Comprehensive graph statistics (cached 5min)"""
    return {
        "total_nodes": ...,
        "company_count": ...,
        "implemented_sources": 45,
        "loaded_sources": 38,
        # ... (remaining fields elided)
    }

Pattern Detection:

# public.py
@router.get("/api/v1/public/patterns/company/{cnpj_or_id}")
async def public_patterns_for_company(
    cnpj_or_id: str,
    lang: str = "pt",
) -> PatternResponse:
    """Run pattern detection on company"""
    # Requires PATTERNS_ENABLED=true
    # Returns detected patterns with evidence
    patterns = [
        "split_contracts_below_threshold",
        "sanctioned_still_receiving",
        "debtor_contracts",
        ...
    ]
Privacy & Security
From api/src/bracc/main.py:
# Middleware stack: the last middleware added is the outermost layer,
# so requests pass through these in reverse order of registration
app.add_middleware(CPFMaskingMiddleware)       # Mask CPF in responses
app.add_middleware(SecurityHeadersMiddleware)  # CSP, HSTS, etc.
app.add_middleware(SlowAPIMiddleware)          # Rate limiting
app.add_middleware(CORSMiddleware)             # CORS headers

# Public-safe defaults enforced in services/public_guard.py
def enforce_person_access_policy(labels: list[str]) -> None:
    """Block Person nodes in public mode"""
    if settings.public_mode and has_person_labels(labels):
        raise HTTPException(403, "Person data not accessible in public mode")

def sanitize_public_properties(props: dict) -> dict:
    """Remove CPF and other sensitive fields"""
    return {k: v for k, v in props.items() if k not in SENSITIVE_KEYS}
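The CPF masking middleware is named above but not shown. As a hedged illustration of the idea, the sketch below masks CPF-shaped strings anywhere inside a JSON-like payload. The regular expression, the `***.***.***-**` mask, and the `mask_cpf` name are assumptions for this example; the project's `cpf_masking.py` may use a different pattern or mask format.

```python
import re
from typing import Any

# Matches formatted (123.456.789-00) and bare (12345678900) CPF numbers.
# Word boundaries keep 14-digit CNPJ values from matching.
CPF_PATTERN = re.compile(r"\b\d{3}\.?\d{3}\.?\d{3}-?\d{2}\b")
MASK = "***.***.***-**"  # assumed mask format


def mask_cpf(value: Any) -> Any:
    """Recursively replace CPF-shaped substrings in strings, dicts, and lists."""
    if isinstance(value, str):
        return CPF_PATTERN.sub(MASK, value)
    if isinstance(value, dict):
        return {key: mask_cpf(item) for key, item in value.items()}
    if isinstance(value, list):
        return [mask_cpf(item) for item in value]
    return value
```

A pattern this broad also masks any other standalone 11-digit number, which errs on the side of privacy at the cost of occasional false positives.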
Environment flags:
PUBLIC_MODE=true                   # Enable public-safe mode
PUBLIC_ALLOW_PERSON=false          # Block Person node access
PUBLIC_ALLOW_ENTITY_LOOKUP=false   # Block direct entity lookup
PUBLIC_ALLOW_INVESTIGATIONS=false  # Block investigation features
PATTERNS_ENABLED=false             # Disable pattern engine
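The project reads these flags through Pydantic settings in config.py, which is not reproduced here. The standard-library sketch below only illustrates how such flags might be parsed and combined. `PublicFlags`, `env_flag`, the set of accepted truthy strings, the restrictive defaults, and the rule in `person_access_allowed` are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag; unset variables fall back to the default."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


@dataclass(frozen=True)
class PublicFlags:
    public_mode: bool
    allow_person: bool
    allow_entity_lookup: bool
    allow_investigations: bool
    patterns_enabled: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "PublicFlags":
        # Assumed defaults: the most restrictive value for each flag.
        return cls(
            public_mode=env_flag("PUBLIC_MODE", True, env),
            allow_person=env_flag("PUBLIC_ALLOW_PERSON", False, env),
            allow_entity_lookup=env_flag("PUBLIC_ALLOW_ENTITY_LOOKUP", False, env),
            allow_investigations=env_flag("PUBLIC_ALLOW_INVESTIGATIONS", False, env),
            patterns_enabled=env_flag("PATTERNS_ENABLED", False, env),
        )

    def person_access_allowed(self) -> bool:
        """Assumed rule: allowed outside public mode, or when explicitly enabled."""
        return (not self.public_mode) or self.allow_person
```

Defaulting to the restrictive value means a missing variable never widens access by accident.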
5. Frontend Layer
React 19 + TypeScript + Vite provides the web interface.
Tech Stack
From frontend/package.json:
{
  "dependencies": {
    "react": "^19.0.0",
    "react-dom": "^19.0.0",
    "react-router": "^7.0.0",
    "react-force-graph-2d": "^1.26.0",  // Graph visualization
    "react-hook-form": "^7.71.2",       // Form management
    "zustand": "^5.0.0",                // State management
    "i18next": "^24.0.0",               // Internationalization
    "lucide-react": "^0.575.0",         // Icons
    "zod": "^4.3.6"                     // Schema validation
  },
  "devDependencies": {
    "vite": "^6.0.0",
    "typescript": "^5.7.0",
    "vitest": "^3.0.0",
    "@testing-library/react": "^16.0.0"
  }
}
Key Features
Company Search - Search by CNPJ, name, or ID
Graph Visualization - Interactive force-directed graph with react-force-graph-2d
Entity Details - Comprehensive entity information panels
Investigation Workspace - Create and manage investigation collections
Pattern Highlighting - Visual indicators for detected patterns
Multi-language - Portuguese (pt-BR) and English (en) via i18next
Docker Compose Architecture
From docker-compose.yml:
services:
  neo4j:
    image: neo4j:5-community
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/${NEO4J_PASSWORD}
      NEO4J_PLUGINS: '["apoc"]'
    volumes:
      - neo4j-data:/data
      - ./infra/neo4j/init.cypher:/var/lib/neo4j/init.cypher
    healthcheck:
      test: cypher-shell -u neo4j -p $NEO4J_PASSWORD "RETURN 1"
      interval: 10s
      retries: 8

  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_PASSWORD: ${NEO4J_PASSWORD}
    depends_on:
      neo4j:
        condition: service_healthy

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      VITE_API_URL: http://localhost:8000
    depends_on:
      api:
        condition: service_healthy

  etl:
    build: ./etl
    profiles: ["etl"]  # Only starts with --profile etl
    volumes:
      - .:/workspace
    depends_on:
      neo4j:
        condition: service_healthy
Data Flow Example
Let’s trace a complete request flow:
1. User requests company graph
Frontend calls: GET /api/v1/public/graph/company/12345678000190?depth=2

2. API routes request
FastAPI router in routers/public.py receives request:

    @router.get("/graph/company/{company_ref}")
    async def public_graph_for_company(
        company_ref: str,
        depth: int = 2,
    ) -> GraphResponse:

3. Resolve company identifier
Helper function _resolve_company() queries Neo4j:

    MATCH (c:Company)
    WHERE c.cnpj = $company_identifier
       OR elementId(c) = $company_id
    RETURN elementId(c) AS entity_id, labels(c) AS labels

4. Execute graph query
Load query from queries/public_graph_company.cypher:

    MATCH (c:Company) WHERE elementId(c) = $company_id
    CALL apoc.path.subgraphAll(c, {
      maxLevel: $depth,
      relationshipFilter: ">"
    })
    YIELD nodes, relationships
    RETURN nodes, relationships, $company_id AS center_id

5. Enforce privacy policies
Filter results through public_guard.py:
Remove Person nodes if PUBLIC_MODE=true
Strip CPF fields from all nodes
Filter sensitive properties

6. Transform to response model
Convert Neo4j results to Pydantic models:

    nodes = [GraphNode(
        id=node.element_id,
        label=node['razao_social'],
        type=node.labels[0],
        document_id=node['cnpj'],
        properties=sanitize_properties(node),
        sources=[SourceAttribution(database="CNPJ")],
    ) for node in raw_nodes]

7. Return JSON response
FastAPI serializes to JSON:

    {
      "nodes": [{ "id": "...", "label": "Acme Corp", ... }],
      "edges": [{ "id": "...", "source": "...", "target": "..." }],
      "center_id": "4:abc123:0"
    }

8. Frontend renders graph
React component uses react-force-graph-2d to visualize:

    <ForceGraph2D
      graphData={{ nodes, links: edges }}
      nodeLabel={node => node.label}
      nodeColor={node => getColorByType(node.type)}
      onNodeClick={handleNodeClick}
    />
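The privacy-enforcement step in this flow can be sketched as a pure function over the query result. The node and edge shapes used here (dicts with `id`, `labels`, `properties`, `source`, `target`), the `SENSITIVE_KEYS` contents, and the `filter_public_subgraph` name are assumptions for illustration; the actual logic lives in public_guard.py and is not shown in the source.

```python
from typing import Any

SENSITIVE_KEYS = {"cpf"}  # assumed contents; the real set is not shown

Node = dict[str, Any]
Edge = dict[str, Any]


def filter_public_subgraph(
    nodes: list[Node], edges: list[Edge], public_mode: bool = True
) -> tuple[list[Node], list[Edge]]:
    """Drop Person nodes, strip sensitive properties, and prune dangling edges."""
    if not public_mode:
        return nodes, edges
    kept_nodes = [
        {
            **node,
            "properties": {
                key: value
                for key, value in node.get("properties", {}).items()
                if key not in SENSITIVE_KEYS
            },
        }
        for node in nodes
        if "Person" not in node.get("labels", [])
    ]
    kept_ids = {node["id"] for node in kept_nodes}
    # An edge survives only if both endpoints are still present.
    kept_edges = [
        edge for edge in edges
        if edge["source"] in kept_ids and edge["target"] in kept_ids
    ]
    return kept_nodes, kept_edges
```

Pruning edges along with nodes matters: an edge that points at a removed Person node would otherwise leak the existence of that node and its identifier.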
Neo4j Optimization
For production scale (40M+ nodes):
    NEO4J_HEAP_INITIAL=4G
    NEO4J_HEAP_MAX=8G
    NEO4J_PAGECACHE=12G
Requires 32GB+ RAM (64GB recommended)
All unique constraints create indexes automatically
Use elementId() for fast node lookup
Limit depth in graph traversals (max 3 recommended)
Use APOC procedures for complex graph operations
ETL uses Neo4jBatchLoader with:
Batch size: 1000-5000 nodes
UNWIND for bulk inserts
Periodic commits for large datasets
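The batching approach described here can be illustrated without a database. In the sketch below, `load_companies`, `batched`, the `MERGE_COMPANIES` query, and the `run_query` callback are hypothetical; the real `Neo4jBatchLoader` is not shown in the source and may structure its queries differently.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator

# One UNWIND statement writes a whole batch of rows in a single round trip.
MERGE_COMPANIES = """
UNWIND $rows AS row
MERGE (c:Company {cnpj: row.cnpj})
SET c.razao_social = row.razao_social
"""

Row = dict[str, Any]


def batched(rows: Iterable[Row], size: int) -> Iterator[list[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def load_companies(
    rows: Iterable[Row],
    run_query: Callable[[str, dict[str, Any]], None],
    batch_size: int = 1000,
) -> int:
    """Send one UNWIND query per batch and return the number of rows sent."""
    total = 0
    for batch in batched(rows, batch_size):
        run_query(MERGE_COMPANIES, {"rows": batch})
        total += len(batch)
    return total
```

With the official driver, `run_query` could be a thin wrapper around `session.run(cypher, params)`; passing it in as a callback keeps the batching logic testable on its own.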
Stats endpoint cached for 5 minutes
Neo4j connection pooling (default: 50 connections)
Response compression via FastAPI middleware
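A five-minute cache for an expensive statistics query can be as small as the sketch below. `TTLCache` and its injectable `clock` are hypothetical names for this example, and the sketch ignores concurrent access; the source does not say how the project implements its stats cache.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache a single computed value for ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None

    def get_or_compute(self, compute: Callable[[], Any]) -> Any:
        """Return the cached value, recomputing it once the TTL has elapsed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self._ttl
        return self._value
```

Using a monotonic clock avoids stale or prematurely expired entries when the system wall clock is adjusted.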
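from fastapi import Request
from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.get("/api/v1/public/graph/company/{ref}")
@limiter.limit("100/minute")
async def get_company_graph(request: Request, ref: str):  # slowapi needs the request argument
    ...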
FastAPI with async/await throughout
Neo4j async driver (neo4j.AsyncDriver)
Concurrent query execution where possible
Security Architecture
Defense in Depth
Environment-based Configuration
Public mode disables sensitive endpoints
Tier system (community vs enterprise)
Feature flags for experimental features
Middleware Stack
Security headers (CSP, HSTS, X-Frame-Options)
CPF masking in all responses
Rate limiting per IP
CORS with explicit origin whitelist
Data Access Control
Public-safe defaults block Person nodes
Property-level sanitization
Query-level enforcement (not relying on frontend)
Authentication (when enabled)
JWT tokens with secure secret
Password hashing with bcrypt
Invite-code system for registration
Audit Trail
All ingestion runs logged to graph
Source attribution on every node
Temporal violation tracking
Deployment Architectures
Local Development

Single-machine setup:
Neo4j, API, Frontend in containers
Seed data for testing
Hot reload for development

Production Single-Server

Requirements:
64GB RAM
8+ CPU cores
500GB+ SSD
Ubuntu 22.04 LTS

Services:
Neo4j with production heap settings
Caddy reverse proxy (HTTPS, caching)
API via systemd
Frontend static files served by Caddy

Production Distributed

Database Tier:
Neo4j Enterprise with clustering
Read replicas for query scaling

Application Tier:
Multiple API instances behind load balancer
Horizontal scaling based on load

Frontend Tier:
CDN for static assets
Edge caching

ETL Tier:
Dedicated machines for data ingestion
Job queue (Celery) for pipeline orchestration
Monitoring & Observability
Neo4j Metrics
Query performance via dbms.querylog
Memory usage
Transaction throughput
Cache hit rates
API Metrics
Request rate and latency
Error rates by endpoint
Rate limit hits
Health check status
ETL Metrics
Pipeline success/failure rates
Data quality metrics
Source availability
Ingestion duration
System Metrics
CPU and memory usage
Disk I/O
Network throughput
Docker container health
Next Steps
API Reference - Explore all available endpoints
ETL Deep Dive - Learn about data pipelines
Graph Schema - Complete Neo4j schema reference
Contributing - Contribute to the project