Advanced Level

AI Resume Analyzer & ATS Parser

Architect and build an enterprise-grade AI Resume Analyzer. You will learn how to extract unstructured text from PDFs, structure it using OpenAI's GPT-4o-mini API, and semantically score candidates against Job Descriptions using PostgreSQL and pgvector.

The Problem

HR departments and recruiting agencies receive hundreds of resumes daily, many of them unqualified. Manual screening is slow, prone to fatigue, and open to bias. An automated AI parser extracts matching skills in seconds, saving recruiters up to 80% of screening time while standardizing the evaluation criteria.

Real-World Use Case

Technology Stack

Prerequisites

Python 3.11+ proficiency

OpenAI API account & key

Basic understanding of natural language processing (NLP)

Docker for local database setup (optional but recommended)

Architecture & Design

Folder Structure

resume_analyzer/
├── backend/
│   ├── app.py
│   ├── pdf_processor.py
│   ├── llm_client.py
│   ├── db.py
│   └── requirements.txt
├── frontend/
│   ├── index.html
│   └── styles.css
├── Dockerfile
└── README.md

API Design

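Integration API

Integrates with the OpenAI GPT-4o-mini API. Crucially, it uses the Structured Outputs feature (a Pydantic model passed as response_format) so that the response conforms to the declared JSON schema, eliminating the need for fragile regex parsing of LLM outputs. The guarantee covers completed responses only: a model refusal, or a reply cut off at the token limit, still has to be handled in code.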

Step-by-Step Implementation

Securely accept multipart/form-data PDF file uploads.

<h3>1. Environment & Database Setup</h3><p>First, create a virtual environment and install the required dependencies. Because we handle file uploads and call the LLM from request handlers, we will use <strong>FastAPI</strong>.</p><pre><code class='language-bash'>python -m venv .venv
source .venv/bin/activate
pip install fastapi uvicorn pypdf2 openai python-multipart pydantic</code></pre><p>For the database, if you are building the full enterprise version, launch a PostgreSQL instance with pgvector using Docker:</p><pre><code class='language-bash'>docker run --name pgvector-resume -e POSTGRES_PASSWORD=secret -p 5432:5432 ankane/pgvector</code></pre>

import io
import os
from typing import List

import openai
import PyPDF2
import uvicorn
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from pydantic import BaseModel

MAX_FILE_SIZE = 5 * 1024 * 1024  # 5 MB upload limit

app = FastAPI(title='AI ATS Parser API')
# Synchronous client: the LLM call below blocks the event loop while it waits.
client = openai.Client(api_key=os.getenv('OPENAI_API_KEY'))


class ResumeAnalysis(BaseModel):
    score: int
    matched_skills: List[str]
    missing_skills: List[str]
    summary: str


def extract_text_from_pdf(file_bytes: bytes) -> str:
    reader = PyPDF2.PdfReader(io.BytesIO(file_bytes))
    # Call extract_text() once per page and skip pages without a text layer.
    pages = (page.extract_text() for page in reader.pages)
    return '\n'.join(text for text in pages if text)


@app.post('/analyze')
async def analyze(
    file: UploadFile = File(...),
    job_description: str = Form(...)
):
    if file.content_type != 'application/pdf':
        raise HTTPException(400, 'PDF only')

    content = await file.read()
    if len(content) > MAX_FILE_SIZE:
        raise HTTPException(413, 'File > 5MB')

    try:
        resume_text = extract_text_from_pdf(content)
    except Exception as e:
        # A malformed or password-protected PDF is a client error, not a server fault.
        raise HTTPException(422, f'PDF extraction failed: {e}')
    if not resume_text.strip():
        raise HTTPException(422, 'No extractable text found in PDF')

    try:
        resp = client.beta.chat.completions.parse(
            model='gpt-4o-mini',
            messages=[{'role': 'user', 'content': f'Job:\n{job_description}\n\nResume:\n{resume_text}'}],
            response_format=ResumeAnalysis
        )
    except Exception as e:
        raise HTTPException(500, str(e))

    parsed = resp.choices[0].message.parsed
    if parsed is None:
        # parsed is None when the model refuses instead of filling the schema.
        raise HTTPException(502, 'Model returned no structured result')
    return parsed.model_dump()


if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=8000)
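
The upload checks in this listing trust the client-supplied Content-Type header and read the whole body before measuring it. A minimal sketch of a stricter check follows. The helper names read_capped and looks_like_pdf and the UploadTooLarge exception are hypothetical, not part of the original project; the idea is to read in chunks, stop as soon as the limit is passed, and confirm the payload starts with the PDF signature.

```python
import io
from typing import BinaryIO

MAX_FILE_SIZE = 5 * 1024 * 1024  # same 5 MB limit as the endpoint
CHUNK_SIZE = 64 * 1024


class UploadTooLarge(Exception):
    """Raised when the stream holds more than the allowed number of bytes."""


def read_capped(stream: BinaryIO, max_bytes: int = MAX_FILE_SIZE) -> bytes:
    """Read a stream in chunks, aborting once max_bytes is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise UploadTooLarge(f'upload exceeds {max_bytes} bytes')
        chunks.append(chunk)
    return b''.join(chunks)


def looks_like_pdf(data: bytes) -> bool:
    """Strict signature check instead of trusting the Content-Type header."""
    return data.startswith(b'%PDF-')
```

In the endpoint, the underlying file object (file.file) could be passed to read_capped, with UploadTooLarge mapped to the 413 response. This bounds memory use inside the handler; a body-size limit at the reverse proxy is still the place to stop oversized transfers before they reach the app.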

Extract raw text and perform basic sanitization (removing unprintable characters).

<h3>2. The PDF Extraction Pipeline</h3><p>Extracting text from PDFs is notoriously difficult because a PDF describes page layout, not semantic structure. We will use PyPDF2 for a lightweight approach, but in production you might consider AWS Textract or Unstructured.io.</p><pre><code class='language-python'>import io

import PyPDF2


def extract_text_from_pdf(file_bytes: bytes) -> str:
    try:
        reader = PyPDF2.PdfReader(io.BytesIO(file_bytes))
        text = ''
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + '\n'
        return text.strip()
    except Exception as e:
        raise ValueError(f'PDF extraction failed: {e}') from e</code></pre><p>Notice the try-except block. PDFs can be malformed or password-protected. Handling these edge cases gracefully is what separates junior code from senior code.</p>
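
The step description calls for basic sanitization, which the extraction function itself does not perform. One possible version is sketched here using only the standard library. The function name sanitize_text and the exact cleanup rules are assumptions made for illustration.

```python
import re
import unicodedata


def sanitize_text(raw: str) -> str:
    """Drop unprintable characters and tidy whitespace in extracted PDF text."""
    # Fold compatibility forms such as ligatures (U+FB01 becomes "fi").
    text = unicodedata.normalize('NFKC', raw)
    # Keep newlines and tabs; drop every other control, format, or unassigned character.
    text = ''.join(
        ch for ch in text
        if ch in '\n\t' or not unicodedata.category(ch).startswith('C')
    )
    # Collapse runs of spaces and tabs, trim spaces around newlines,
    # then cap consecutive blank lines at one.
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r' ?\n ?', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```

Calling sanitize_text on the output of extract_text_from_pdf before building the prompt keeps null bytes and zero-width characters out of the LLM request.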

Prompt the LLM to output a strict JSON schema containing arrays of matched and missing skills.

<h3>3. OpenAI Structured Output Integration</h3><p>We do not want the LLM to just 'chat' with us. We need a JSON object with a fixed, predictable shape. We achieve this using Pydantic and OpenAI's Structured Outputs.</p><pre><code class='language-python'>import os
from typing import List

import openai
from pydantic import BaseModel

client = openai.Client(api_key=os.getenv('OPENAI_API_KEY'))


class ResumeAnalysis(BaseModel):
    score: int
    matched_skills: List[str]
    missing_skills: List[str]
    summary: str


def analyze_resume_llm(resume_text: str, job_description: str) -> ResumeAnalysis:
    prompt = f"""
Analyze the following resume against the job description.

Job Description:
{job_description}

Resume:
{resume_text}
"""
    response = client.beta.chat.completions.parse(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': prompt}],
        response_format=ResumeAnalysis
    )
    parsed = response.choices[0].message.parsed
    if parsed is None:
        # parsed is None when the model refuses instead of filling the schema.
        raise ValueError('Model returned no structured result')
    return parsed</code></pre>

Calculate an 'ATS Compatibility Score' based on exact keyword overlap and semantic similarity.

<h3>4. API Endpoints & Error Handling</h3><p>Finally, we wrap this in a FastAPI endpoint. Security is critical here. You must limit the file upload size to prevent Denial of Service (DoS) attacks.</p><pre><code class='language-python'>from fastapi import FastAPI, UploadFile, File, Form, HTTPException

app = FastAPI()

MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB


@app.post('/api/analyze')
async def analyze_endpoint(
    file: UploadFile = File(...),
    job_description: str = Form(...)
):
    if file.content_type != 'application/pdf':
        raise HTTPException(status_code=400, detail='Only PDF files are allowed.')

    file_bytes = await file.read()
    if len(file_bytes) > MAX_FILE_SIZE:
        raise HTTPException(status_code=413, detail='File too large. Max 5MB.')

    try:
        text = extract_text_from_pdf(file_bytes)
        analysis = analyze_resume_llm(text, job_description)
        return analysis.model_dump()
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))</code></pre>
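
The step description mentions a score built from exact keyword overlap and semantic similarity, while the endpoint lets the LLM produce the score directly. One way to compute it explicitly is sketched below. The function names, the 50/50 weighting, and the assumption that the caller already holds embedding vectors for the resume and the job description are illustrative choices, not part of the original project.

```python
import math
from typing import Iterable, Sequence


def keyword_overlap(required: Iterable[str], resume_text: str) -> float:
    """Fraction of required skills found verbatim (case-insensitive) in the resume."""
    skills = {s.strip().lower() for s in required if s.strip()}
    if not skills:
        return 0.0
    haystack = resume_text.lower()
    matched = sum(1 for skill in skills if skill in haystack)
    return matched / len(skills)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ats_score(overlap: float, similarity: float, keyword_weight: float = 0.5) -> int:
    """Blend both signals into a 0-100 score (the weighting is an assumption)."""
    similarity = max(0.0, similarity)  # treat negative cosine as no match
    blended = keyword_weight * overlap + (1 - keyword_weight) * similarity
    return round(100 * blended)
```

Substring matching is deliberately simple: it counts "Java" as present in a resume that only mentions "JavaScript". Tokenizing the resume and comparing whole terms would be stricter.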

Return the JSON response to the frontend for visualization.
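
The response body mirrors the fields of the ResumeAnalysis model. As a sketch of what a consumer receives, the snippet below parses a sample payload with the standard library only. The sample values are made up for illustration, and AnalysisResult and parse_analysis are hypothetical client-side names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class AnalysisResult:
    score: int
    matched_skills: List[str]
    missing_skills: List[str]
    summary: str


def parse_analysis(body: str) -> AnalysisResult:
    """Parse the endpoint's JSON body; raises TypeError on missing or unexpected keys."""
    data = json.loads(body)
    return AnalysisResult(**data)


# Illustrative payload only; real values come from the model.
sample = json.dumps({
    'score': 72,
    'matched_skills': ['Python', 'FastAPI'],
    'missing_skills': ['Kubernetes'],
    'summary': 'Strong backend profile with no container orchestration experience.',
})
result = parse_analysis(sample)
print(result.score, result.matched_skills)
```

A frontend chart needs only the two skill arrays and the score, so the summary can be rendered separately as plain text.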

Common Errors

PyPDF2 extracts garbled text or whitespace.

This happens with scanned PDFs, which contain page images and no text layer, or with unusual font encodings. For production, switch to OCR solutions like Tesseract or AWS Textract.

OpenAI throws a Context Window limit error.

Truncate the resume_text to the first 10,000 characters before sending it to the LLM.
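
A minimal sketch of that truncation, assuming a hypothetical helper named truncate_resume. Character count is only a rough proxy for tokens, so the 10,000 limit from the text is kept as a tunable constant.

```python
MAX_RESUME_CHARS = 10_000


def truncate_resume(resume_text: str, limit: int = MAX_RESUME_CHARS) -> str:
    """Return at most `limit` characters, cutting at the last space when possible."""
    if len(resume_text) <= limit:
        return resume_text
    cut = resume_text[:limit]
    # Back up to the previous space so the final word is not split in half.
    head, sep, _ = cut.rpartition(' ')
    return head if sep else cut
```

Applying it just before the prompt is built keeps the request size bounded no matter how long the uploaded resume is.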

JSON parsing errors from the LLM.

Always use `client.beta.chat.completions.parse` with a Pydantic model so the response is validated against your schema, and check `message.refusal` in case the model declines to answer.

Security & Performance

Upload a valid 1-page PDF resume.

Upload a massive 50MB PDF to ensure the 413 Payload Too Large error triggers.

Upload a .docx file to ensure the 400 Bad Request triggers.

Verify that the LLM responds in less than 5 seconds using gpt-4o-mini.

Write unit tests mocking the OpenAI client so CI/CD doesn't consume real API credits.
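
One way to write such a test is sketched below with unittest.mock from the standard library. It assumes the LLM call is moved into a function that receives the client as an argument; analyze_with_client is a hypothetical name, and the dataclass is a stand-in for the Pydantic ResumeAnalysis model so the example runs without third-party packages.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import List
from unittest.mock import MagicMock


@dataclass
class ResumeAnalysis:  # stand-in for the Pydantic model used by the app
    score: int
    matched_skills: List[str] = field(default_factory=list)
    missing_skills: List[str] = field(default_factory=list)
    summary: str = ''


def analyze_with_client(client, resume_text: str, job_description: str):
    """Same call as the endpoint, with the client injected so tests can replace it."""
    response = client.beta.chat.completions.parse(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': f'Job:\n{job_description}\n\nResume:\n{resume_text}'}],
        response_format=ResumeAnalysis,
    )
    return response.choices[0].message.parsed


def test_analyze_uses_mocked_client():
    fake_parsed = ResumeAnalysis(score=80, matched_skills=['Python'],
                                 missing_skills=['Docker'], summary='Good fit')
    fake_client = MagicMock()
    # Shape the fake response like the real one: choices[0].message.parsed
    fake_client.beta.chat.completions.parse.return_value = SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(parsed=fake_parsed))]
    )

    result = analyze_with_client(fake_client, 'resume text', 'job text')

    assert result.score == 80
    # The mock records the call, so the test can inspect what would have been sent.
    fake_client.beta.chat.completions.parse.assert_called_once()
    kwargs = fake_client.beta.chat.completions.parse.call_args.kwargs
    assert kwargs['model'] == 'gpt-4o-mini'
    assert 'job text' in kwargs['messages'][0]['content']
```

Because no network request is made, the test runs in CI without an API key and without spending credits.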

Implement pgvector: Store embeddings of the job description and compare them against candidate embeddings to search millions of resumes in milliseconds.
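
A sketch of what the pgvector piece could look like. The table and column names are hypothetical, the 1536 dimension assumes OpenAI's text-embedding-3-small model, and the SQL is held in Python strings so it can be passed to whichever PostgreSQL driver the project uses. The `<=>` operator is pgvector's cosine distance, so smaller values mean closer matches.

```python
from typing import Sequence

# Hypothetical schema: one embedding per stored resume.
# The HNSW index type requires pgvector 0.5.0 or newer.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS resumes (
    id BIGSERIAL PRIMARY KEY,
    candidate_name TEXT NOT NULL,
    embedding vector(1536) NOT NULL
);
CREATE INDEX IF NOT EXISTS resumes_embedding_idx
    ON resumes USING hnsw (embedding vector_cosine_ops);
"""

# %s placeholders follow the psycopg parameter style.
SEARCH_SQL = """
SELECT id, candidate_name, 1 - (embedding <=> %s::vector) AS similarity
FROM resumes
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a Python sequence in pgvector's text form, e.g. '[0.1,0.2]'."""
    return '[' + ','.join(repr(float(x)) for x in embedding) + ']'


def search_params(job_embedding: Sequence[float], limit: int = 10) -> tuple:
    """Build the parameter tuple for SEARCH_SQL."""
    literal = to_vector_literal(job_embedding)
    return (literal, literal, limit)
```

The job description is embedded once per search, and the index lets PostgreSQL return approximate nearest neighbours without comparing against every stored resume.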

Add LangChain to support multiple document loaders (.docx, .txt).

Build a React dashboard displaying a radar chart comparing Candidate Skills vs Required Skills.

Interview Questions

Q: Is it safe to send user resumes to OpenAI?

A: According to OpenAI's enterprise privacy policy, data sent via the API is not used to train their models by default. Resumes still contain personal data, so your app's Privacy Policy should state that they are sent to a third-party processor.

Q: Why use gpt-4o-mini instead of gpt-4?

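A: GPT-4o-mini is optimized for fast, low-cost responses, supports Structured Outputs, and costs over 90% less per token than GPT-4. For simple classification and extraction, its output quality is usually close enough to GPT-4 that the larger model is not worth the extra cost.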
