2025-12-03 20:37:36 -03:00
parent 970ffa8856
commit 85a5cc75d1
20 changed files with 1265 additions and 0 deletions

29
Dockerfile Normal file

@@ -0,0 +1,29 @@
FROM python:3.13-slim
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY src/ ./src/
COPY gunicorn.conf.py ./
# Create non-root user
RUN useradd -m -u 1000 appuser
RUN mkdir -p /app/tmp && \
chown -R appuser:appuser /app
USER appuser
EXPOSE 5000
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/')" || exit 1
CMD ["gunicorn", "-c", "gunicorn.conf.py", "src.app:app"]

123
README.md Normal file

@@ -0,0 +1,123 @@
# ML Converter
A Python web application that processes Excel files, automatically converting numeric values stored as text to the appropriate numeric format.
Especially useful for Mercado Pago statements.
## Features
- **Smart Number Detection and Conversion**: Identifies columns whose numeric values are stored as text (e.g. "$1,234.56", "1.234,56", "(123.45)") and converts them to real numbers, handling international formats, currency symbols, and negatives in parentheses.
- **Preserves Text Data**: Text columns (names, categories) remain unchanged. Columns whose name contains "fecha" or "liberación" are parsed as dates and written in `yyyy-mm-dd` format.
- **Simple Web Interface**: Responsive UI in Spanish with drag & drop support, clear success/error messages, and a summary of total, income, and expenses after processing.
- **Safe File Handling**: Temporary storage in the project's `tmp/` directory (`/app/tmp` inside the container). A cleanup job runs every 10 minutes and deletes files older than 30 minutes.
- **Real File Validation**: Verifies that the uploaded file is actually an Excel file by checking its signature and reading it with pandas, not just by extension.
- **Supported Formats**: Accepts `.xlsx` and `.xls` (max. 16 MB).
- **Automated Tests**: Includes pytest tests for the index endpoint and the conversion helpers.
- **Security Headers**: Additional HTTP headers (HSTS, CSP, etc.).
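The number detection listed above can be illustrated with a minimal sketch. `parse_amount` is a hypothetical helper written for this README, not the project's implementation (that logic lives in `src/converters.py`). It assumes that when both `.` and `,` appear, the one that occurs last is the decimal mark, and it only strips the `$` symbol.

```python
from typing import Optional


def parse_amount(text: str) -> Optional[float]:
    """Hypothetical helper: parse a number stored as text, or return None."""
    cleaned = text.strip().replace("$", "").replace(" ", "")
    negative = False
    # Accounting style: "(123.45)" means -123.45
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = cleaned[1:-1]
        negative = True
    if cleaned.startswith("-"):
        cleaned = cleaned[1:]
        negative = True
    if "." in cleaned and "," in cleaned:
        # Whichever separator appears last is treated as the decimal mark.
        if cleaned.rfind(".") > cleaned.rfind(","):
            cleaned = cleaned.replace(",", "")
        else:
            cleaned = cleaned.replace(".", "").replace(",", ".")
    elif cleaned.count(",") == 1 and len(cleaned.split(",")[1]) <= 2:
        # A single comma followed by at most two digits is a decimal comma.
        cleaned = cleaned.replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value
```

Returning `None` for unparseable text lets a caller decide to leave a whole column untouched when any of its values fails to parse, which is roughly how mixed-content columns are kept as text.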
## Tech Stack
- **Backend**: Flask (Python), pandas, openpyxl
- **Frontend**: HTML5, Tailwind CSS, Jinja2
- **File Cleanup**: APScheduler
- **Containerization**: Docker Compose
## Environment Setup
1. **Copy and edit the environment file:**
```bash
cp env.example .env
nvim .env
```
2. **Generate a secure SECRET_KEY:**
```bash
python3 -c "import secrets; print('SECRET_KEY=' + secrets.token_urlsafe(32))"
```
Update `SECRET_KEY` and `DOMAIN` in `.env`.
3. **Important variables:**
- `FLASK_ENV`: set to `production` for production
- `MAX_CONTENT_LENGTH`: Maximum upload size in bytes (default: 16777216, i.e. 16 MB)
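As a rough illustration of how these variables can be consumed, the sketch below reads them with fallbacks. `load_settings` is a hypothetical helper, not a function in this repository; the default values mirror the ones mentioned in this README and in `env.example`.

```python
import os

DEFAULT_MAX_CONTENT_LENGTH = 16 * 1024 * 1024  # 16 MB expressed in bytes


def load_settings(env=None):
    """Hypothetical helper: collect settings from an environment-style mapping."""
    env = os.environ if env is None else env
    return {
        "flask_env": env.get("FLASK_ENV", "production"),
        # Fallback key for local development only; always override in production.
        "secret_key": env.get("SECRET_KEY", "dev-key-change-in-production"),
        # Environment values are strings, so convert explicitly to int.
        "max_content_length": int(
            env.get("MAX_CONTENT_LENGTH", DEFAULT_MAX_CONTENT_LENGTH)
        ),
    }
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.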
## Quick Start
### Option 1: Python Virtual Environment
```bash
./run.sh
```
Or manually:
```bash
cd ml-converter
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python src/app.py
```
Visit [http://localhost:5000](http://localhost:5000)
### Option 2: Docker
```bash
docker compose up --build
```
The bundled `compose.yml` publishes no host port. It attaches the service to an external `proxy` network and routes it through Traefik, so the app is reachable at the host configured in `DOMAIN`.
## Usage Flow
1. **Upload** your Excel file (.xlsx/.xls) by dragging it in or selecting it.
2. **Process**: The file is submitted as soon as it is selected or dropped. Numeric columns stored as text are converted automatically.
3. **Download** the processed file.
4. **Cleanup**: Temporary files are deleted automatically once they are older than 30 minutes.
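The cleanup step can be sketched as follows. `remove_expired_files` is a hypothetical, standard-library-only helper written for illustration. It uses the file modification time, which is an assumption of this sketch, and the 30-minute default matches the retention period stated above.

```python
import os
import time


def remove_expired_files(directory: str, max_age_seconds: int = 30 * 60) -> list:
    """Hypothetical helper: delete regular files older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    now = time.time()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip subdirectories; only plain files are candidates for removal.
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

Because such a job runs on an interval rather than exactly at expiry, a file may survive slightly longer than the nominal 30 minutes.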
## API Endpoints
- `GET /` Main upload page
- `POST /upload` File upload and processing (the file goes in the `file` form field)
- `GET /download/<filename>` Download of the processed file
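For scripted use, the upload endpoint can be called with a multipart form. The sketch below builds such a request with the standard library only and does not send it. `build_upload_request` is a hypothetical helper, and the base URL is an assumption about where an instance is running. The `file` field name is the one the upload handler reads.

```python
import urllib.request
import uuid


def build_upload_request(
    base_url: str, filename: str, content: bytes
) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a multipart POST to /upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/upload",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# To send it against a running instance (assumed here at localhost:5000):
# with open("resumen.xlsx", "rb") as fh:
#     req = build_upload_request("http://localhost:5000", "resumen.xlsx", fh.read())
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```

The response is an HTML page rather than JSON, so a script would need to extract the download link from the returned markup.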
## Production Deployment
### Gunicorn
```bash
gunicorn -w 4 -b 0.0.0.0:5000 src.app:app
```
### Docker
```bash
docker compose up -d
```
## File Structure
```
ml-converter/
├── src/
│   ├── app.py              # Main Flask application
│   ├── converters.py       # Conversion helpers
│   ├── static/
│   │   └── globals.css     # Base styles
│   └── templates/
│       ├── index.html      # Upload page
│       └── download.html   # Download page
├── tests/                  # Automated tests
├── requirements.txt        # Dependencies
├── gunicorn.conf.py        # Gunicorn configuration
├── env.example             # Environment template
├── run.sh                  # Development launcher
├── Dockerfile
├── compose.yml
└── README.md
```
## Security
- **Safe file names** (`secure_filename`)
- **File type validation** (.xlsx/.xls extension and file signature)
- **Size limits** (16 MB)
- **Automatic cleanup** (30 minutes)
- **Unique file names** (UUIDs to prevent collisions)
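Two of these measures, signature checking and unique names, can be sketched in a few lines. The helper names below are hypothetical. The byte signatures are the ZIP and OLE2 headers that `.xlsx` and `.xls` files start with. A matching signature only shows that the container format is right, so it is a first filter and not a full validation.

```python
import uuid

XLSX_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")  # ZIP container
XLS_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file


def looks_like_excel(header: bytes) -> bool:
    """Hypothetical helper: check the first bytes of a file against known signatures."""
    return header.startswith(XLSX_SIGNATURES) or header.startswith(XLS_SIGNATURE)


def unique_name(safe_filename: str) -> str:
    """Hypothetical helper: prefix an already-sanitized name with a random UUID."""
    return f"{uuid.uuid4()}_processed_{safe_filename}"
```

A random UUID prefix keeps two uploads with the same original name from overwriting each other and makes stored file names hard to guess.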

22
compose.yml Normal file

@@ -0,0 +1,22 @@
services:
  ml-converter:
    build: .
    environment:
      - FLASK_ENV=${FLASK_ENV:-production}
      - SECRET_KEY=${SECRET_KEY:-change-this-in-production}
    restart: unless-stopped
    networks:
      - proxy
    labels:
      - "traefik.enable=true"
      # HTTP Router
      - "traefik.http.routers.ml-converter.rule=Host(`${DOMAIN:-localhost}`)"
      - "traefik.http.routers.ml-converter.entrypoints=websecure"
      - "traefik.http.routers.ml-converter.tls.certresolver=letsencrypt"
      - "traefik.http.services.ml-converter.loadbalancer.server.port=5000"
      # Optional: File size limit for uploads (16MB)
      - "traefik.http.middlewares.ml-converter-limit.buffering.maxRequestBodyBytes=16777216"
      - "traefik.http.routers.ml-converter.middlewares=ml-converter-limit"
networks:
  proxy:
    external: true

12
env.example Normal file

@@ -0,0 +1,12 @@
# Copy this file to .env and update the values
# Domain
DOMAIN=your-domain.com
# Flask Configuration
FLASK_ENV=production
SECRET_KEY=change-this-very-long-random-secret-key-in-production
# File Upload Limits
# 16 MB = 16 * 1024 * 1024 = 16777216 bytes
MAX_CONTENT_LENGTH=16777216

36
gunicorn.conf.py Normal file

@@ -0,0 +1,36 @@
# Gunicorn configuration file for ML Converter
# Usage: gunicorn -c gunicorn.conf.py src.app:app
# Server socket
bind = "0.0.0.0:5000"
backlog = 2048
# Worker processes
workers = 4
worker_class = "sync"
worker_connections = 1000
timeout = 30
keepalive = 2
# Restart workers after this many requests; the jitter adds a random 0-50 extra
# requests per worker so that workers do not all restart at the same time
max_requests = 1000
max_requests_jitter = 50
# Logging
accesslog = "-"
errorlog = "-"
loglevel = "info"
# Process naming
proc_name = "ml-converter"
# Server mechanics
daemon = False
pidfile = "/tmp/ml-converter.pid"
user = None
group = None
tmp_upload_dir = None
# SSL (uncomment and configure for HTTPS)
# keyfile = "/path/to/keyfile"
# certfile = "/path/to/certfile"

6
main.py Normal file

@@ -0,0 +1,6 @@
def main():
    print("Hello from ml-converter!")


if __name__ == "__main__":
    main()

3
pytest.ini Normal file

@@ -0,0 +1,3 @@
[pytest]
testpaths = tests
python_files = test_*.py

9
requirements.txt Normal file

@@ -0,0 +1,9 @@
Flask==3.1.2
pandas==2.3.3
openpyxl==3.1.5
xlrd==2.0.2
APScheduler==3.11.1
Werkzeug==3.1.4
gunicorn==23.0.0
XlsxWriter==3.2.9
pytest==9.0.1

19
run.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
echo "Starting ML Converter Development Server"
if [ ! -d "venv" ]; then
    echo "Creating virtual environment..."
    python -m venv venv
fi
echo "Activating virtual environment..."
source venv/bin/activate
echo "Installing dependencies..."
pip install -r requirements.txt
mkdir -p tmp
echo "Starting Flask development server..."
echo "Access the application at: http://localhost:5000"
python src/app.py

408
src/app.py Normal file

@@ -0,0 +1,408 @@
import atexit
import logging
import os
import uuid
from datetime import datetime, timedelta
from pathlib import Path
from threading import Lock
import pandas as pd
from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask, flash, redirect, render_template, request, send_file, url_for
from werkzeug.utils import secure_filename
# Support both "src.app" imports (gunicorn, tests) and direct script execution
try:
    from src.converters import (
        convert_text_columns_to_numbers,
        find_columns_with_keywords,
        normalize_column_name,
    )
except ModuleNotFoundError as exc:
    if exc.name == "src":
        from converters import (
            convert_text_columns_to_numbers,
            find_columns_with_keywords,
            normalize_column_name,
        )
    else:
        raise
app = Flask(__name__)
app.secret_key = os.environ.get("SECRET_KEY", "dev-key-change-in-production")
# Logging setup
logging.basicConfig(
level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s"
)
logger = logging.getLogger(__name__)
UPLOAD_FOLDER = os.path.join(os.path.dirname(os.path.dirname(__file__)), "tmp")
ALLOWED_EXTENSIONS = {"xlsx", "xls"}
MAX_CONTENT_LENGTH_DEFAULT = 16 * 1024 * 1024 # 16MB max file size
MAX_CONTENT_LENGTH = int(
os.environ.get("MAX_CONTENT_LENGTH", str(MAX_CONTENT_LENGTH_DEFAULT))
)
app.config["UPLOAD_FOLDER"] = UPLOAD_FOLDER
app.config["MAX_CONTENT_LENGTH"] = MAX_CONTENT_LENGTH
# Ensure upload directory exists for all entrypoints (app import, gunicorn workers, tests)
os.makedirs(UPLOAD_FOLDER, exist_ok=True)
HSTS_POLICY = "max-age=31536000; includeSubDomains"
CSP_POLICY = (
"default-src 'self'; "
"style-src 'self' 'unsafe-inline' https://cdn.jsdelivr.net; "
"script-src 'self' 'unsafe-inline'; "
"img-src 'self' data:; "
"font-src 'self' data:; "
"connect-src 'self'; "
"form-action 'self'; "
"frame-ancestors 'none'; "
"base-uri 'self'"
)
PERMISSIONS_POLICY = "geolocation=(), microphone=(), camera=()"
@app.after_request
def apply_security_headers(response):
    """Apply modern security headers to every response."""
    response.headers["Strict-Transport-Security"] = HSTS_POLICY
    response.headers["Content-Security-Policy"] = CSP_POLICY
    response.headers["Permissions-Policy"] = PERMISSIONS_POLICY
    response.headers["X-Content-Type-Options"] = "nosniff"
    response.headers["X-Frame-Options"] = "DENY"
    response.headers["X-XSS-Protection"] = "1; mode=block"
    response.headers["Referrer-Policy"] = "strict-origin-when-cross-origin"
    return response


def allowed_file(filename):
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
def is_valid_excel_file(file_path):
    """
    Validate that the file is actually an Excel file by checking file signature (magic bytes)
    and attempting to read it with pandas.
    """
    try:
        file_size = os.path.getsize(file_path)
        if file_size == 0:
            logger.warning(f"Empty file rejected: {file_path}")
            return False
        if file_size > MAX_CONTENT_LENGTH:
            logger.warning(f"File too large rejected: {file_path} ({file_size} bytes)")
            return False
        # Check file signature (magic bytes)
        with open(file_path, "rb") as f:
            header = f.read(8)
        # Excel file signatures
        # .xlsx files start with PK (ZIP format)
        # .xls files start with specific OLE signatures
        xlsx_signature = (
            header.startswith(b"PK\x03\x04")
            or header.startswith(b"PK\x05\x06")
            or header.startswith(b"PK\x07\x08")
        )
        xls_signature = header.startswith(
            b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
        )  # OLE2 signature
        if not (xlsx_signature or xls_signature):
            logger.warning(f"Invalid file signature for {file_path}: {header.hex()}")
            return False
        # Try to read with pandas as additional validation
        # Use nrows=1 to minimize resource usage and prevent potential DoS
        df = pd.read_excel(file_path, nrows=1)
        if df is None:
            return False
        return True
    except Exception as e:
        logger.warning(f"File validation failed for {file_path}: {str(e)}")
        return False
def cleanup_old_files():
    """Remove files older than 30 minutes from the temp directory."""
    try:
        current_time = datetime.now()
        for filename in os.listdir(UPLOAD_FOLDER):
            file_path = os.path.join(UPLOAD_FOLDER, filename)
            if os.path.isfile(file_path):
                file_time = datetime.fromtimestamp(os.path.getctime(file_path))
                if current_time - file_time > timedelta(minutes=30):
                    os.remove(file_path)
                    logger.info("Deleted old file: %s", filename)
    except Exception as e:
        logger.exception("Error during cleanup: %s", e)
_scheduler_lock = Lock()
_scheduler = None
_scheduler_shutdown_registered = False


def _shutdown_scheduler():
    global _scheduler
    with _scheduler_lock:
        if _scheduler and _scheduler.running:
            logger.info("Shutting down cleanup scheduler")
            _scheduler.shutdown()


def start_cleanup_scheduler():
    """Ensure the cleanup scheduler starts only once per process."""
    global _scheduler, _scheduler_shutdown_registered
    with _scheduler_lock:
        if _scheduler is None:
            _scheduler = BackgroundScheduler()
            _scheduler.add_job(
                func=cleanup_old_files,
                trigger="interval",
                minutes=10,
                id="cleanup-old-files",
                replace_existing=True,
            )
        if not _scheduler.running:
            logger.info("Starting cleanup scheduler")
            _scheduler.start()
        if not _scheduler_shutdown_registered:
            atexit.register(_shutdown_scheduler)
            _scheduler_shutdown_registered = True
    return _scheduler


start_cleanup_scheduler()
@app.route("/")
def index():
    return render_template("index.html")
@app.route("/upload", methods=["GET", "POST"])
def upload_file():
    if request.method == "GET":
        return redirect(url_for("index"))
    if "file" not in request.files:
        flash("No se seleccionó ningún archivo")
        logger.info("Upload attempted with no file in request")
        return redirect(request.url)
    file = request.files["file"]
    if file.filename == "":
        flash("No se seleccionó ningún archivo")
        logger.info("Upload attempted with empty filename")
        return redirect(request.url)
    if file and allowed_file(file.filename):
        try:
            original_filename = secure_filename(file.filename)
            unique_id = str(uuid.uuid4())
            upload_path = os.path.join(
                app.config["UPLOAD_FOLDER"], f"{unique_id}_original_{original_filename}"
            )
            file.save(upload_path)
            logger.info("File uploaded: %s -> %s", original_filename, upload_path)
            if not is_valid_excel_file(upload_path):
                os.remove(upload_path)
                flash(
                    "El archivo no es un archivo Excel válido. Por favor sube un archivo Excel real."
                )
                logger.warning("Invalid Excel file rejected: %s", original_filename)
                return redirect(url_for("index"))
            logger.info("Starting processing of %s", upload_path)
            df = pd.read_excel(upload_path)
            processed_df, converted_columns = convert_text_columns_to_numbers(df)
            date_keywords = ["fecha", "liberacion", "liberación"]
            date_cols = find_columns_with_keywords(processed_df.columns, date_keywords)
            for col in date_cols:
                processed_df[col] = pd.to_datetime(processed_df[col], errors="coerce")
                # Remove timezone info if present (Excel does not support tz-aware datetimes)
                if pd.api.types.is_datetime64_any_dtype(processed_df[col]):
                    try:
                        processed_df[col] = processed_df[col].dt.tz_localize(None)
                    except (AttributeError, TypeError):
                        pass
            # Totals are computed from the eighth column (column H in Excel)
            sum_h = sum_h_pos = sum_h_neg = None
            if processed_df.shape[1] > 7:
                col_h = processed_df.iloc[:, 7]
                col_h_numeric = pd.to_numeric(col_h, errors="coerce")
                sum_h = col_h_numeric.sum(skipna=True)
                sum_h_pos = col_h_numeric[col_h_numeric > 0].sum(skipna=True)
                sum_h_neg = col_h_numeric[col_h_numeric < 0].sum(skipna=True)
            processed_filename = f"{unique_id}_processed_{original_filename}"
            processed_path = os.path.join(
                app.config["UPLOAD_FOLDER"], processed_filename
            )
            # Use ExcelWriter to set date, ID, and money column formats
            with pd.ExcelWriter(
                processed_path, engine="xlsxwriter", date_format="yyyy-mm-dd"
            ) as writer:
                processed_df.to_excel(writer, index=False)
                workbook = writer.book
                worksheet = writer.sheets["Sheet1"]
                date_format = workbook.add_format({"num_format": "yyyy-mm-dd"})
                id_format = workbook.add_format({"num_format": "0", "align": "left"})
                money_format = workbook.add_format({"num_format": "$ #,##0.00"})
                header_format = workbook.add_format(
                    {
                        "text_wrap": True,
                        "bold": True,
                        "align": "center",
                        "valign": "vcenter",
                    }
                )
                worksheet.set_row(0, 40)
                # Set all columns to width 20
                for col_idx in range(len(processed_df.columns)):
                    worksheet.set_column(col_idx, col_idx, 20)
                # Overwrite header row with header_format to ensure wrap
                for col_idx, value in enumerate(processed_df.columns):
                    worksheet.write(0, col_idx, value, header_format)
                # Define normalized money columns
                money_col_targets = [
                    "valor de la compra",
                    "comision mas iva",
                    "comisión más iva",
                    "monto neto de operacion",
                    "monto neto de operación",
                    "impuestos cobrados por retenciones iibb",
                ]
                # Set date columns
                for col in date_cols:
                    col_idx = processed_df.columns.get_loc(col)
                    worksheet.set_column(col_idx, col_idx, 20, date_format)
                # Set ID columns to integer format, wide enough to avoid scientific notation
                for col in processed_df.columns:
                    norm_col = normalize_column_name(col)
                    if "id" in norm_col:
                        col_idx = processed_df.columns.get_loc(col)
                        worksheet.set_column(col_idx, col_idx, 15, id_format)
                # Set money columns to currency format
                for col in processed_df.columns:
                    norm_col = normalize_column_name(col)
                    if norm_col in money_col_targets:
                        col_idx = processed_df.columns.get_loc(col)
                        worksheet.set_column(col_idx, col_idx, 15, money_format)
            logger.info("Processed file saved: %s", processed_path)
            os.remove(upload_path)
            logger.info("Removed original uploaded file: %s", upload_path)
            return render_template(
                "download.html",
                filename=processed_filename,
                original_name=original_filename,
                sum_h=sum_h,
                sum_h_pos=sum_h_pos,
                sum_h_neg=sum_h_neg,
            )
        except Exception as e:
            # Clean up uploaded file in case of any error
            try:
                if "upload_path" in locals() and os.path.exists(upload_path):
                    os.remove(upload_path)
                    logger.info("Cleaned up file after error: %s", upload_path)
            except Exception as cleanup_error:
                logger.exception("Error during cleanup: %s", cleanup_error)
            # Generic error message to avoid information disclosure
            flash(
                "Error procesando el archivo. Por favor verifica que sea un archivo Excel válido."
            )
            logger.exception(
                "File processing error for %s: %s",
                original_filename if "original_filename" in locals() else "unknown",
                str(e),
            )
            return redirect(url_for("index"))
    else:
        flash(
            "Tipo de archivo inválido. Por favor sube un archivo Excel (.xlsx o .xls)"
        )
        logger.info(
            "Rejected upload - invalid file type: %s", file.filename if file else None
        )
        return redirect(url_for("index"))
@app.route("/download/<filename>")
def download_file(filename):
    try:
        logger.info("Download requested for: %s", filename)
        normalized_filename = secure_filename(filename)
        if not normalized_filename:
            logger.warning(
                "Rejected download with empty normalized filename: %s", filename
            )
            flash("Archivo no encontrado o ha expirado")
            return redirect(url_for("index"))
        if normalized_filename != filename:
            logger.info(
                "Normalized download filename from %s to %s",
                filename,
                normalized_filename,
            )
        upload_root = Path(app.config["UPLOAD_FOLDER"]).resolve()
        requested_path = upload_root / normalized_filename
        try:
            resolved_path = requested_path.resolve(strict=True)
        except FileNotFoundError:
            logger.info("File not found or expired: %s", requested_path)
            flash("Archivo no encontrado o ha expirado")
            return redirect(url_for("index"))
        # Reject any path that resolves outside the upload directory
        try:
            resolved_path.relative_to(upload_root)
        except ValueError:
            logger.warning(
                "Rejected download outside upload directory: %s -> %s",
                filename,
                resolved_path,
            )
            flash("Archivo no encontrado o ha expirado")
            return redirect(url_for("index"))
        if resolved_path.is_file():
            logger.info("Serving file: %s", resolved_path)
            download_name = f"convertido_{normalized_filename.split('_', 2)[-1]}"
            return send_file(
                resolved_path, as_attachment=True, download_name=download_name
            )
        logger.info("Path is not a regular file or has expired: %s", resolved_path)
        flash("Archivo no encontrado o ha expirado")
        return redirect(url_for("index"))
    except Exception as e:
        logger.exception("Error serving download: %s", e)
        # Generic error message to avoid information disclosure
        flash("Error descargando el archivo. Por favor intenta nuevamente.")
        return redirect(url_for("index"))
if __name__ == "__main__":
    os.makedirs(UPLOAD_FOLDER, exist_ok=True)
    app.run(debug=True, host="0.0.0.0", port=5000)

176
src/converters.py Normal file

@@ -0,0 +1,176 @@
"""Utilities for normalizing and converting tabular data columns."""
from __future__ import annotations
import math
import unicodedata
from typing import Iterable, List, Optional, Tuple
import pandas as pd
_ID_KEYWORDS: Tuple[str, ...] = ("id",)
_CURRENCY_SYMBOLS: Tuple[str, ...] = ("$", "€", "£", "¥", "", "", "")
def normalize_column_name(name: object) -> str:
    """Return a normalized, accent-free column identifier."""
    if not isinstance(name, str):
        return ""
    normalized = unicodedata.normalize("NFKD", name.strip().lower())
    return "".join(char for char in normalized if not unicodedata.combining(char))


def _strip_currency_symbols(value: str) -> str:
    cleaned = value
    for symbol in _CURRENCY_SYMBOLS:
        cleaned = cleaned.replace(symbol, "")
    return cleaned


def _coerce_to_string(value: object) -> Optional[str]:
    if value is None:
        return None
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if isinstance(value, float) and math.isnan(value):
            return None
        return str(value)
    text = str(value).strip()
    return text or None
def _parse_numeric_text(text_value: object) -> Tuple[Optional[str], bool]:
    """Clean a numeric-like string and return (normalized_value, is_negative)."""
    text = _coerce_to_string(text_value)
    if text is None:
        return None, False
    cleaned = unicodedata.normalize("NFKC", text)
    cleaned = cleaned.replace("\xa0", "")
    cleaned = _strip_currency_symbols(cleaned)
    is_negative = False
    # Accounting-style negatives: "(123.45)", "123.45-", "-123.45"
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = cleaned[1:-1]
        is_negative = True
    if cleaned.endswith("-"):
        cleaned = cleaned[:-1]
        is_negative = True
    if cleaned.startswith("-"):
        cleaned = cleaned[1:]
        is_negative = True
    if cleaned.startswith("+"):
        cleaned = cleaned[1:]
    cleaned = cleaned.replace(" ", "")
    if "." in cleaned and "," in cleaned:
        # The separator that appears last is the decimal mark
        last_dot = cleaned.rfind(".")
        last_comma = cleaned.rfind(",")
        if last_dot > last_comma:
            cleaned = cleaned.replace(",", "")
        else:
            cleaned = cleaned.replace(".", "")
            cleaned = cleaned.replace(",", ".")
    elif cleaned.count(",") == 1 and len(cleaned.split(",")[1]) <= 2:
        # A single comma followed by at most two digits is a decimal comma
        cleaned = cleaned.replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    if cleaned.count(".") > 1:
        # Several dots: all but the last are thousands separators
        parts = cleaned.split(".")
        cleaned = "".join(parts[:-1]) + "." + parts[-1]
    cleaned = cleaned.replace("'", "")
    try:
        float(cleaned)
    except (TypeError, ValueError):
        return None, False
    return cleaned, is_negative
def is_numeric_like(text_value: object) -> bool:
    """Return True if a value can be safely interpreted as a number."""
    cleaned, _ = _parse_numeric_text(text_value)
    return cleaned is not None


def convert_numeric_text(text_value: object) -> Optional[float]:
    """Convert numeric-like text into a float. Returns pandas NA on failure."""
    if text_value is None:
        return pd.NA
    if isinstance(text_value, (int, float)) and not isinstance(text_value, bool):
        if isinstance(text_value, float) and math.isnan(text_value):
            return pd.NA
        return float(text_value)
    cleaned, is_negative = _parse_numeric_text(text_value)
    if cleaned is None:
        return pd.NA
    try:
        result = float(cleaned)
    except (TypeError, ValueError):
        return pd.NA
    return -result if is_negative else result


def _should_force_numeric(norm_column_name: str) -> bool:
    return any(keyword in norm_column_name for keyword in _ID_KEYWORDS)
def convert_text_columns_to_numbers(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Convert numeric-like object columns in ``df`` into numeric dtypes."""
    converted_columns: List[str] = []
    for column in df.columns:
        series = df[column]
        if pd.api.types.is_numeric_dtype(series):
            continue
        normalized_name = normalize_column_name(column)
        force_numeric = _should_force_numeric(normalized_name)
        if not (
            force_numeric
            or series.dtype == object
            or pd.api.types.is_string_dtype(series)
        ):
            continue
        non_null = series.dropna()
        if non_null.empty and not force_numeric:
            continue
        cleaned_non_null = non_null.map(_coerce_to_string).dropna()
        if cleaned_non_null.empty and not force_numeric:
            continue
        # Convert only when every non-null value parses, unless the column is forced
        if force_numeric or cleaned_non_null.map(is_numeric_like).all():
            numeric_series = series.map(convert_numeric_text)
            df[column] = pd.to_numeric(numeric_series, errors="coerce")
            converted_columns.append(column)
    return df, converted_columns
def find_columns_with_keywords(
    columns: Iterable[str], keywords: Iterable[str]
) -> List[str]:
    """Return columns whose normalized name contains any of the provided keywords."""
    normalized_keywords = tuple(normalize_column_name(keyword) for keyword in keywords)
    matches: List[str] = []
    for column in columns:
        normalized_column = normalize_column_name(column)
        if any(
            keyword and keyword in normalized_column for keyword in normalized_keywords
        ):
            matches.append(column)
    return matches

32
src/static/globals.css Normal file

@@ -0,0 +1,32 @@
:root {
  color-scheme: dark;
}

html,
body {
  font-family: 'Inter', 'Nunito Sans', 'Segoe UI', sans-serif;
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

body {
  background-color: #111827;
}

a {
  color: #60a5fa;
  transition: color 120ms ease-in-out;
}

a:hover {
  color: #3b82f6;
}

button {
  cursor: pointer;
}

.focus-ring {
  outline: 2px solid #2563eb;
  outline-offset: 2px;
}

View File

@@ -0,0 +1,73 @@
<!DOCTYPE html>
<html lang="es">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Descargar Archivo Procesado - ML Converter</title>
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css" rel="stylesheet">
</head>
<body class="bg-gradient-to-br from-gray-900 to-gray-800 min-h-screen py-12">
<div class="max-w-2xl mx-auto">
<div class="bg-gray-900 bg-opacity-80 border border-gray-700 rounded-2xl shadow-xl p-8">
<div class="text-center mb-8">
<div class="mx-auto flex items-center justify-center h-14 w-14 rounded-full bg-green-200 mb-4">
<svg class="h-8 w-8 text-green-600" fill="none" viewBox="0 0 24 24" stroke="currentColor">
<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M5 13l4 4L19 7" />
</svg>
</div>
<h1 class="text-3xl font-extrabold text-white mb-2">¡Procesamiento Completado!</h1>
</div>
{% with messages = get_flashed_messages() %}
{% if messages %}
<div class="mb-4">
{% for message in messages %}
<div class="bg-green-100 border border-green-400 text-green-700 px-4 py-3 rounded mb-2">
{{ message }}
</div>
{% endfor %}
</div>
{% endif %}
{% endwith %}
<div class="bg-gray-800 border border-gray-700 rounded-lg shadow p-4 mb-6">
<h2 class="text-lg font-semibold text-gray-200 mb-3 tracking-wide">Archivo procesado: {{ original_name }}</h2>
{% if sum_h is not none %}
<div class="grid grid-cols-1 sm:grid-cols-3 gap-2 text-sm text-gray-200 text-center">
<div>
<span class="block font-medium text-gray-400">Total</span>
<span class="block font-semibold text-gray-100">{{ "$ {:,.2f}".format(sum_h) }}</span>
</div>
<div>
<span class="block font-medium text-green-400">Ingresos</span>
<span class="block font-semibold text-gray-100">{{ "$ {:,.2f}".format(sum_h_pos) }}</span>
</div>
<div>
<span class="block font-medium text-red-400">Egresos</span>
<span class="block font-semibold text-gray-100">{{ "$ {:,.2f}".format(sum_h_neg) }}</span>
</div>
</div>
{% endif %}
</div>
<div class="space-y-4">
<a href="/download/{{ filename }}"
class="w-full flex justify-center items-center py-3 px-4 border border-transparent rounded-md shadow text-base font-semibold text-white bg-green-600 hover:bg-green-700 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-green-400">
<svg class="w-5 h-5 mr-2" fill="none" stroke="currentColor" viewBox="0 0 24 24">
<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M12 10v6m0 0l-3-3m3 3l3-3m2 8H7a2 2 0 01-2-2V5a2 2 0 012-2h5.586a1 1 0 01.707.293l5.414 5.414a1 1 0 01.293.707V19a2 2 0 01-2 2z" />
</svg>
Descargar Archivo Procesado
</a>
<a href="/"
class="w-full flex justify-center items-center py-3 px-4 border border-gray-600 rounded-md shadow text-base font-semibold text-gray-200 bg-gray-800 hover:bg-gray-700 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-blue-400">
Procesar Otro Archivo
</a>
</div>
<div class="mt-8 text-xs text-gray-400 text-center">
<p>⚠️ Los archivos se eliminan automáticamente después de 30 minutos por seguridad.</p>
</div>
</div>
</div>
</body>
</html>

133
src/templates/index.html Normal file

@@ -0,0 +1,133 @@
<!DOCTYPE html>
<html lang="es">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ML Converter - Resumen de Mercado Libre/Pago para Excel</title>
<link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.19/dist/tailwind.min.css" rel="stylesheet">
<link rel="stylesheet" href="{{ url_for('static', filename='globals.css') }}">
</head>
<body class="bg-gradient-to-br from-gray-900 to-gray-800 min-h-screen py-12">
<div class="max-w-2xl mx-auto">
<!-- Card: Upload -->
<div class="bg-gray-900 bg-opacity-80 border border-gray-700 rounded-2xl shadow-xl p-8 mb-8">
<div class="text-center mb-8">
<h1 class="text-4xl font-extrabold text-white mb-2">ML Converter</h1>
<p class="text-gray-300 text-lg">Convertí tu resumen de Mercado Pago para que sea legible en Excel</p>
</div>
{% with messages = get_flashed_messages() %}
{% if messages %}
<div class="mb-4">
{% for message in messages %}
<div class="bg-red-100 border border-red-400 text-red-700 px-4 py-3 rounded mb-2">
{{ message }}
</div>
{% endfor %}
</div>
{% endif %}
{% endwith %}
<form action="/upload" method="post" enctype="multipart/form-data" class="space-y-6">
<div class="flex justify-center px-6 pt-8 pb-8 border-2 border-dashed border-gray-500 rounded-xl bg-gray-800">
<div class="space-y-4 text-center">
<div class="flex justify-center">
<svg class="h-14 w-14 text-blue-400" fill="none" stroke="currentColor" viewBox="0 0 48 48">
<path d="M28 8H12a4 4 0 00-4 4v20m32-12v8m0 0v8a4 4 0 01-4 4H12a4 4 0 01-4-4v-4m32-4l-3.172-3.172a4 4 0 00-5.656 0L28 28M8 32l9.172-9.172a4 4 0 015.656 0L28 28m0 0l4 4m4-24h8m-4-4v8m-12 4h.02" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" />
</svg>
</div>
<div class="flex flex-col items-center text-gray-300">
<label for="file" class="relative cursor-pointer bg-blue-600 hover:bg-blue-700 text-white rounded-md font-semibold py-2 px-6 text-base shadow focus:outline-none focus:ring-2 focus:ring-blue-400 focus:ring-offset-2">
<span>Elegir Archivo</span>
<input id="file" name="file" type="file" class="sr-only" accept=".xlsx,.xls" required>
</label>
<span class="mt-2 text-sm">o arrastrá y soltá tu archivo Excel aquí</span>
</div>
<p class="text-xs text-gray-400">Archivos Excel hasta 16MB</p>
</div>
</div>
<div class="mt-4">
<button type="submit" class="w-full flex justify-center py-3 px-4 border border-transparent rounded-md shadow text-base font-semibold text-white bg-blue-600 hover:bg-blue-700 focus:outline-none focus:ring-2 focus:ring-offset-2 focus:ring-blue-400">
Subir Archivo Excel
</button>
</div>
</form>
</div>
<!-- Card: Cómo funciona -->
<div class="bg-gray-900 bg-opacity-80 border border-gray-700 rounded-2xl shadow-xl p-6">
<h2 class="text-lg font-bold text-white mb-4">Cómo funciona:</h2>
<ul class="space-y-2 text-gray-200 text-sm pl-4 list-disc">
<li>Subí tu archivo de resumen de Mercado Libre/Pago (.xlsx o .xls).</li>
<li>Las columnas como "VALOR DE LA COMPRA" y "MONTO NETO" se convierten automáticamente.</li>
<li>Los datos quedan listos para análisis en Excel con formato numérico correcto.</li>
<li>Las columnas de texto (tipos de pago, estados) permanecen sin cambios.</li>
<li>Los archivos se eliminan automáticamente después de 30 minutos por seguridad.</li>
</ul>
</div>
</div>
<script>
(function() {
    const fileInput = document.getElementById('file');
    if (!fileInput) return;
    const label = document.querySelector('[for="file"]');
    const dropZone = label ? label.closest('.border-dashed') : document.querySelector('.border-dashed');

    function preventDefaults(e) {
        e.preventDefault();
        e.stopPropagation();
    }

    function highlight() {
        if (dropZone) dropZone.classList.add('border-indigo-500', 'border-solid');
    }

    function unhighlight() {
        if (dropZone) dropZone.classList.remove('border-indigo-500', 'border-solid');
    }

    function handleDrop(e) {
        const dt = e.dataTransfer;
        const files = dt && dt.files;
        if (files && files.length > 0) {
            fileInput.files = files;
            setTimeout(() => fileInput.form && fileInput.form.submit(), 10);
        }
    }

    if (dropZone) {
        ['dragenter', 'dragover', 'dragleave', 'drop'].forEach(eventName => {
            dropZone.addEventListener(eventName, preventDefaults, false);
        });
        ['dragenter', 'dragover'].forEach(eventName => {
            dropZone.addEventListener(eventName, highlight, false);
        });
        ['dragleave', 'drop'].forEach(eventName => {
            dropZone.addEventListener(eventName, unhighlight, false);
        });
        dropZone.addEventListener('drop', handleDrop, false);
    }

    const uploadButton = document.querySelector('button');
    if (uploadButton) {
        uploadButton.addEventListener('click', function(e) {
            e.preventDefault();
            fileInput.click();
        }, false);
    }

    fileInput.addEventListener('change', function() {
        if (fileInput.files && fileInput.files.length > 0) {
            fileInput.form && fileInput.form.submit();
        }
    });
})();
</script>
</body>
</html>
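
The "How it works" card promises that uploaded files are deleted after 30 minutes. This diff does not show how the application schedules that cleanup, so the following is only a minimal sketch of one way to do it, assuming deletion is based on file modification time. The function name `remove_stale_files` and the `now` parameter are hypothetical and exist here purely for illustration.

```python
import os
import time

MAX_AGE_SECONDS = 30 * 60  # the 30-minute retention window stated in the UI


def remove_stale_files(folder, max_age=MAX_AGE_SECONDS, now=None):
    """Delete regular files in `folder` older than `max_age` seconds.

    Hypothetical helper: returns the sorted names of the files it removed.
    """
    now = time.time() if now is None else now
    # Snapshot the directory first so deletion does not disturb iteration.
    with os.scandir(folder) as iterator:
        entries = list(iterator)
    removed = []
    for entry in entries:
        # Skip directories and symlinks; only plain files are candidates.
        if not entry.is_file(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A function like this could be called from a background thread or at the start of each request; which of those the project uses is not visible in this section.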

1
tests/__init__.py Normal file

@@ -0,0 +1 @@
# Init for tests package

13
tests/test_app.py Normal file

@@ -0,0 +1,13 @@
import pytest
from src.app import app

@pytest.fixture
def client():
    app.config['TESTING'] = True
    with app.test_client() as client:
        yield client

def test_index(client):
    response = client.get('/')
    assert response.status_code == 200
    assert b"ML Converter" in response.data or b"Subir Archivo" in response.data

50
tests/test_converters.py Normal file

@@ -0,0 +1,50 @@
import pandas as pd
import pytest

from src import converters

def test_converts_currency_strings_to_numbers():
    df = pd.DataFrame(
        {
            'Monto Neto de Operacion': ['\u20ac1.234,56', '$ 1,234.56', '(1.234,56)'],
            'descripcion': ['uno', 'dos', 'tres'],
        }
    )

    processed, converted = converters.convert_text_columns_to_numbers(df)

    assert 'Monto Neto de Operacion' in converted
    assert processed['Monto Neto de Operacion'].iloc[0] == pytest.approx(1234.56)
    assert processed['Monto Neto de Operacion'].iloc[1] == pytest.approx(1234.56)
    assert processed['Monto Neto de Operacion'].iloc[2] == pytest.approx(-1234.56)

def test_force_converts_id_columns_even_with_padding():
    df = pd.DataFrame(
        {
            'Operacion ID': ['000123', ' 456 ', None],
        }
    )

    processed, converted = converters.convert_text_columns_to_numbers(df)

    assert 'Operacion ID' in converted
    assert processed['Operacion ID'].dropna().tolist() == [123.0, 456.0]

def test_mixed_content_column_is_not_converted():
    df = pd.DataFrame(
        {
            'monto': ['$123', 'no aplicar', '$456'],
        }
    )

    processed, converted = converters.convert_text_columns_to_numbers(df)

    assert 'monto' not in converted
    assert processed['monto'].dtype == object

def test_convert_numeric_text_returns_na_for_invalid_strings():
    assert pd.isna(converters.convert_numeric_text('no es numero'))
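
The tests above exercise three input styles: a European amount (`€1.234,56`), a US amount (`$ 1,234.56`), and an accounting-style negative in parentheses (`(1.234,56)`). The implementation in `src/converters.py` is not part of this section, so the sketch below is not the project's code. It is one possible way to get the same results, assuming the rule "whichever separator appears last is the decimal mark". The name `parse_amount` and both regular expressions are hypothetical.

```python
import math
import re

# Hypothetical helper for illustration; not the project's converters module.
_NOISE = re.compile(r"[\s$€£]")           # whitespace and a few currency symbols
_NUMERIC = re.compile(r"-?[0-9][0-9.,]*")  # digits with optional separators


def parse_amount(text: str) -> float:
    """Parse a currency-like string; return NaN when it is not numeric."""
    raw = text.strip()
    # Accounting style: a value wrapped in parentheses is negative.
    negative = raw.startswith("(") and raw.endswith(")")
    if negative:
        raw = raw[1:-1]
    cleaned = _NOISE.sub("", raw)
    if not _NUMERIC.fullmatch(cleaned):
        return math.nan
    # Assumption: whichever separator appears last is the decimal mark.
    if cleaned.rfind(",") > cleaned.rfind("."):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:  # for example "1.2.3"
        return math.nan
    return -value if negative else value
```

The last-separator rule is ambiguous for inputs such as `1,234`, which this sketch reads as 1.234 rather than 1234. A real converter would need a column-level heuristic or an explicit locale to settle such cases.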

29
tests/test_errors.py Normal file

@@ -0,0 +1,29 @@
import io
import pandas as pd
import pytest
from src.app import app

@pytest.fixture
def client():
    app.config['TESTING'] = True
    with app.test_client() as client:
        yield client

def test_upload_no_file(client):
    response = client.post('/upload', data={}, follow_redirects=True)
    assert response.status_code == 200
    assert b"Archivo" in response.data or b"Subir Archivo" in response.data

def test_upload_invalid_extension(client):
    response = client.post('/upload', data={
        'file': (io.BytesIO(b"fake data"), 'test.txt')
    }, content_type='multipart/form-data', follow_redirects=True)
    assert response.status_code == 200
    assert b"Archivo" in response.data or b"Subir Archivo" in response.data

def test_upload_empty_file(client):
    response = client.post('/upload', data={
        'file': (io.BytesIO(), '')
    }, content_type='multipart/form-data', follow_redirects=True)
    assert response.status_code == 200
    assert b"Archivo" in response.data or b"Subir Archivo" in response.data

31
tests/test_security.py Normal file

@@ -0,0 +1,31 @@
from src import app as app_module

def test_rejects_invalid_signature(tmp_path):
    """Files with non-Excel signatures should be blocked early."""
    bogus_excel = tmp_path / "malicious.xlsx"
    bogus_excel.write_text("not really an excel file", encoding="utf-8")

    assert app_module.is_valid_excel_file(str(bogus_excel)) is False

def test_rejects_empty_file(tmp_path):
    """Empty uploads fail validation."""
    empty_excel = tmp_path / "empty.xlsx"
    empty_excel.touch()

    assert app_module.is_valid_excel_file(str(empty_excel)) is False

def test_rejects_oversized_file(tmp_path, monkeypatch):
    """Respect the MAX_CONTENT_LENGTH guardrail for large uploads."""
    oversized_limit = 10
    monkeypatch.setattr(app_module, "MAX_CONTENT_LENGTH", oversized_limit)
    monkeypatch.setitem(app_module.app.config, "MAX_CONTENT_LENGTH", oversized_limit)

    large_excel = tmp_path / "huge.xlsx"
    large_excel.write_bytes(
        b"PK\x03\x040" * 4
    )  # Valid ZIP header repeated; file > limit

    assert app_module.is_valid_excel_file(str(large_excel)) is False
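
These tests pin down three rejection cases for `is_valid_excel_file`: a wrong signature, an empty file, and a file above the size limit. The real function lives in `src/app.py` and is not shown here. As a rough sketch of the idea only, a signature check can compare the first bytes of the file against the two well-known container formats: `.xlsx` files are ZIP archives and `.xls` files are OLE2 compound documents. The name `looks_like_excel` and the `max_bytes` parameter are hypothetical; the 16 MB default mirrors the limit stated in the README.

```python
import os

# Well-known magic numbers for the two Excel container formats.
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (ZIP archive)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # .xls (OLE2 compound file)
MAX_BYTES = 16 * 1024 * 1024                       # assumed limit, per the README


def looks_like_excel(path, max_bytes=MAX_BYTES):
    """Hypothetical check: size within bounds and a plausible Excel signature."""
    try:
        size = os.path.getsize(path)
    except OSError:
        return False  # missing or unreadable file
    if size == 0 or size > max_bytes:
        return False
    with open(path, "rb") as handle:
        header = handle.read(8)
    return header.startswith(ZIP_MAGIC) or header == OLE2_MAGIC
```

A matching signature only shows that the container type is plausible. Any ZIP archive would pass, so a stricter validator would also try to open the workbook.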


@@ -0,0 +1,60 @@
import io
import os
import pandas as pd
import pytest
from src.app import app

@pytest.fixture
def client():
    app.config['TESTING'] = True
    with app.test_client() as client:
        yield client

def test_index_page(client):
    response = client.get('/')
    assert response.status_code == 200
    assert b"ML Converter" in response.data or b"Subir Archivo" in response.data

def test_upload_and_download(client):
    # Create a simple Excel file in memory
    df = pd.DataFrame({'words': ['one', 'two', 'three']})
    excel_file = io.BytesIO()
    df.to_excel(excel_file, index=False)
    excel_file.seek(0)
    # Upload the file
    response = client.post('/upload', data={
        'file': (excel_file, 'test.xlsx')
    }, content_type='multipart/form-data', follow_redirects=True)
    assert response.status_code == 200
    assert b"Descargar Archivo Procesado" in response.data or b"Procesamiento Completado" in response.data

def test_download_normalizes_and_confines_filename(client, tmp_path, monkeypatch):
    upload_dir = tmp_path / "uploads"
    upload_dir.mkdir()
    monkeypatch.setitem(app.config, 'UPLOAD_FOLDER', str(upload_dir))

    safe_name = '123_processed_test.xlsx'
    file_path = upload_dir / safe_name
    file_path.write_bytes(b'dummy excel bytes')

    response = client.get(f"/download/..%5C{safe_name}")

    assert response.status_code == 200
    assert b'dummy excel bytes' in response.data
    content_disposition = response.headers.get('Content-Disposition', '')
    assert "attachment;" in content_disposition
    assert "convertido_test.xlsx" in content_disposition

def test_download_rejects_symlink_escape(client, tmp_path, monkeypatch):
    upload_dir = tmp_path / "uploads"
    upload_dir.mkdir()
    outside_file = tmp_path / "outside.txt"
    outside_file.write_text("secret")
    monkeypatch.setitem(app.config, 'UPLOAD_FOLDER', str(upload_dir))

    symlink_path = upload_dir / "escape"
    os.symlink(outside_file, symlink_path)

    response = client.get("/download/escape", follow_redirects=False)

    # Should redirect back to index instead of serving the symlink target
    assert response.status_code == 302
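
The last two tests describe the behavior expected of the download route: a request such as `..\123_processed_test.xlsx` is normalized to the bare file name, and a symlink that points outside the upload folder is refused. The route itself is not included in this section. The sketch below shows one common way to get both properties, assuming the check is done with `os.path.realpath` and `os.path.commonpath`. The function name `resolve_inside` is hypothetical.

```python
import os
from typing import Optional


def resolve_inside(base_dir: str, requested: str) -> Optional[str]:
    """Hypothetical helper: real path of `requested` inside base_dir, or None."""
    # Treat backslashes as separators too, then keep only the final component.
    name = os.path.basename(requested.replace("\\", "/"))
    if name in ("", ".", ".."):
        return None
    base = os.path.realpath(base_dir)
    # realpath follows symlinks, so a link that points outside base
    # resolves to a path that fails the containment check below.
    candidate = os.path.realpath(os.path.join(base, name))
    if os.path.commonpath([base, candidate]) != base:
        return None
    return candidate if os.path.isfile(candidate) else None
```

A route built on a helper like this would serve the file when a path is returned and redirect to the index page when it gets `None`, which matches the 302 asserted above.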