Daily Weather Data Workflow for AI/ML for Windows OS using Python
This workflow assumes that a biological event is primarily driven by weather. I used it to predict the occurrence of disease in potatoes; it is just an example. Running the disease-severity model is also covered here.
Because some subscribers raised the importance of data structure, I have briefly paused my logical series of posts to publish this one. Presenting the raw data, cleaning it, and feeding gold-level data to the desired models is essential.
Screenshots of Python scripts are embedded for practice. Please prepare up to step 3 or 4; those who are familiar may go up to step 9.
Although I use a Mac laptop, I am providing the Windows-specific version. Mac users may adapt these steps, or, if requested, I shall post a Mac-specific workflow.
0) One-box setup (once)
Install: Anaconda (Python), JupyterLab, Git for Windows, Chocolatey (optional: MinIO).
Create folders (PowerShell):
```powershell
$base="C:\Users\acer\Data"
mkdir "$base\bronze" "$base\silver" "$base\gold" "$base\catalog" "$base\backups" -Force
```
1) Define a schema for daily weather data (the raw-data layer, called Bronze)
Fields: station_id, lat, lon, elev_m, datetime, variable, value, unit, qc_flag
Variables: TMAX, TMIN, RH, RAIN, WS
Data dictionary: C:\Users\acer\Data\catalog\weather_dictionary.xlsx
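As a sketch, the schema can also be pinned down in Python so later ingestion code can check incoming frames against it. The `conforms` helper below is my own illustrative addition, not part of the data dictionary:

```python
import pandas as pd

# Illustrative Bronze schema: column names and variable codes follow the post.
SCHEMA_COLUMNS = ["station_id", "lat", "lon", "elev_m", "datetime",
                  "variable", "value", "unit", "qc_flag"]
VALID_VARIABLES = {"TMAX", "TMIN", "RH", "RAIN", "WS"}

def conforms(df: pd.DataFrame) -> bool:
    """True if every schema column is present and only known variable codes appear."""
    if not set(SCHEMA_COLUMNS) <= set(df.columns):
        return False
    return bool(df["variable"].isin(VALID_VARIABLES).all())
```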
2) Create Excel upload template
Build in Excel with fixed headers matching the schema.
Add Codes & Units sheet with dropdowns.
Save as C:\Users\acer\Data\catalog\templates\weather_template.xlsx
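The template can also be generated programmatically with openpyxl instead of building it by hand. This is a sketch: the sheet names, the example units, and the dropdown range are my own assumptions, not prescribed by the post:

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

HEADERS = ["station_id", "lat", "lon", "elev_m", "datetime",
           "variable", "value", "unit", "qc_flag"]

def build_template(path: str) -> None:
    """Write an .xlsx template with fixed headers and a Codes & Units sheet."""
    wb = Workbook()
    data = wb.active
    data.title = "Data"
    data.append(HEADERS)
    # Codes & Units sheet; the units shown are illustrative assumptions
    codes = wb.create_sheet("Codes & Units")
    codes.append(["variable", "unit"])
    for var, unit in [("TMAX", "degC"), ("TMIN", "degC"), ("RH", "%"),
                      ("RAIN", "mm"), ("WS", "m/s")]:
        codes.append([var, unit])
    # Dropdown on the variable column (column F) of the Data sheet
    dv = DataValidation(type="list", formula1='"TMAX,TMIN,RH,RAIN,WS"')
    data.add_data_validation(dv)
    dv.add("F2:F1000")
    wb.save(path)
```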
3) Ingestion (daily/weekly): Bronze to Silver data
Drop raw station files in C:\Users\acer\Data\bronze\agromet\YYYY\
Run ingestion script (example PowerShell):
```powershell
conda activate base
python C:\Users\acer\projects\weather\ingest_weather.py `
--in "C:\Users\acer\Data\bronze\agromet" `
--out "C:\Users\acer\Data\silver" `
--dict "C:\Users\acer\Data\catalog\weather_dictionary.xlsx"
```
QC failures are written to the bronze folder as *_rejected.csv and QC_report.json.
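The ingestion script itself is not shown in the post; a minimal sketch of the per-file logic it might contain could look like the following. The function name `ingest_file`, the rejection rules, and the use of CSV output (instead of Parquet) are all my assumptions:

```python
import json
from pathlib import Path

import pandas as pd

VALID_VARIABLES = {"TMAX", "TMIN", "RH", "RAIN", "WS"}

def ingest_file(raw_csv: str, silver_dir: str) -> dict:
    """Standardize one raw station file and split off rows that fail basic QC."""
    df = pd.read_csv(raw_csv, parse_dates=["datetime"])
    bad = df["value"].isna() | ~df["variable"].isin(VALID_VARIABLES)
    rejected, clean = df[bad], df[~bad]
    out = Path(silver_dir)
    out.mkdir(parents=True, exist_ok=True)
    clean.to_csv(out / f"{Path(raw_csv).stem}_clean.csv", index=False)
    # QC artefacts land next to the raw file, as described in step 3
    rejected.to_csv(Path(raw_csv).with_name(f"{Path(raw_csv).stem}_rejected.csv"),
                    index=False)
    report = {"rows_in": len(df), "rows_rejected": int(bad.sum())}
    Path(raw_csv).with_name("QC_report.json").write_text(json.dumps(report))
    return report
```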
4) Validation & promotion
Checks: RH within 0–100, valid dates, no duplicates, required columns.
Promote passing data to gold:
C:\Users\acer\Data\gold\agromet\<station>\vX.Y\data.parquet, metadata.json
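The four checks above can be sketched as one function whose empty return means the frame is safe to promote; the `validate` name and the exact problem strings are my own:

```python
import pandas as pd

REQUIRED = ["station_id", "datetime", "variable", "value", "unit", "qc_flag"]

def validate(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame can be promoted to gold."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    rh = df.loc[df["variable"] == "RH", "value"]
    if not rh.between(0, 100).all():
        problems.append("RH outside 0-100")
    if pd.to_datetime(df["datetime"], errors="coerce").isna().any():
        problems.append("invalid dates")
    if df.duplicated(["station_id", "datetime", "variable"]).any():
        problems.append("duplicate station/datetime/variable rows")
    return problems
```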
5) Access & retrieval
Open gold Parquet in JupyterLab or export to CSV.
Maintain index at C:\Users\acer\Data\catalog\index.xlsx
6) Feature building for AI/ML
From gold, compute weather indices such as 4-day and 7-day rolling means, rainfall totals, growing degree days (GDD), heat persistence, leaf wetness, etc.
Save features to C:\Users\acer\Data\gold\features\weather\feature_table.parquet
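A few of these features can be sketched with pandas rolling windows. The wide input layout, the window lengths, and the GDD base of 10 degC are illustrative assumptions, not prescriptions:

```python
import pandas as pd

def build_features(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: one row per date with columns datetime, TMAX, TMIN, RH, RAIN (wide form)."""
    f = daily.sort_values("datetime").set_index("datetime")
    out = pd.DataFrame(index=f.index)
    out["tmean"] = (f["TMAX"] + f["TMIN"]) / 2
    out["rh_7d_mean"] = f["RH"].rolling(7, min_periods=1).mean()
    out["rain_4d_total"] = f["RAIN"].rolling(4, min_periods=1).sum()
    out["gdd"] = (out["tmean"] - 10).clip(lower=0)   # base-10 growing degree days
    out["gdd_cum"] = out["gdd"].cumsum()
    return out.reset_index()
```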
7) Experiment tracking
Train in JupyterLab with models: scikit-learn/XGBoost/LightGBM.
Track with MLflow:
```powershell
mlflow ui --backend-store-uri "C:\Users\acer\mlruns"
```
8) Baselines & model menus
Occurrence: Logistic Regression → Random Forest → XGBoost.
Severity: Linear → Random Forest → XGBoost/LightGBM.
Always include a naïve baseline.
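A minimal sketch of the occurrence menu's first rung against a naive baseline, using synthetic stand-in data (the signal is artificial, not potato-disease data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "weather features" with a planted linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # occurrence label
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logreg = LogisticRegression().fit(X_tr, y_tr)
base_acc = baseline.score(X_te, y_te)
lr_acc = logreg.score(X_te, y_te)
```

If logistic regression cannot beat the dummy, the features carry no usable signal and the fancier models in the menu are not worth fitting yet.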
9) Batch scoring
score.py loads the latest features and the production model, then writes results to the predictions folder.
```powershell
python C:\Users\acer\projects\weather\score.py `
--features "C:\Users\acer\Data\gold\features\weather\feature_table.parquet" `
--model "C:\Users\acer\models\weather\prod\model.pkl" `
--out "C:\Users\acer\Data\gold\predictions\weather\v1.0"
```
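The body of score.py is not shown in the post; a stripped-down sketch of what it might do (argument parsing omitted, CSV used in place of Parquet for illustration, and the `score` name my own) could be:

```python
import pickle
from pathlib import Path

import pandas as pd

def score(features_path: str, model_path: str, out_dir: str) -> Path:
    """Load features and a pickled model, write predictions, return the output path."""
    feats = pd.read_csv(features_path)
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    preds = pd.DataFrame({"prediction": model.predict(feats.values)})
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    dest = out / "predictions.csv"
    preds.to_csv(dest, index=False)
    return dest
```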
10) Minimal governance
Roles: You = Data Steward & ML Lead.
Weekly: review QC reports, approve gold promotions, back up gold and models.
Keep local/simple until scale demands MinIO/SQL/orchestration.
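The weekly backup of gold and models can stay a one-liner while everything is local; a stdlib sketch (the `backup` helper and date-stamped naming are my own convention):

```python
import shutil
from datetime import date
from pathlib import Path

def backup(src_dir: str, backup_dir: str) -> Path:
    """Zip a folder (e.g. gold or models) into the backups area with a date stamp."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(backup_dir) / f"{Path(src_dir).name}_{date.today():%Y%m%d}"
    return Path(shutil.make_archive(str(stem), "zip", src_dir))
```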