Daily Weather Data Workflow for AI/ML for Windows OS using Python
This workflow assumes that a biological event is primarily driven by weather. I used it to predict the occurrence of disease in potatoes; it is just an example. Running the disease-severity model is also covered here.
Because some subscribers raised the importance of data structure, I have briefly paused my logical series of posts to publish this one. Presenting the raw data, cleaning it, and feeding gold-level data to the desired models is essential.
Screenshots of Python scripts are embedded for practice. Please prepare up to step 3 or 4; those who are familiar may go up to step 9.
Although I use a Mac laptop, I am providing the Windows-specific version. Mac users may adapt these steps, or, if requested, I shall post a Mac-specific workflow.
0) One-box setup (once)
Install: Anaconda (Python), JupyterLab, Git for Windows, Chocolatey (optional: MinIO).
Create folders (PowerShell):
```powershell
$base="C:\Users\acer\Data"
mkdir "$base\bronze" "$base\silver" "$base\gold" "$base\catalog" "$base\backups" -Force
```
1) Define a schema for daily weather data (the raw-data layer, called Bronze)
Fields: station_id, lat, lon, elev_m, datetime, variable, value, unit, qc_flag
Variables: TMAX, TMIN, RH, RAIN, WS
Data dictionary: C:\Users\acer\Data\catalog\weather_dictionary.xlsx
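As a sketch, the schema can also be pinned down in Python so later ingestion code can check incoming frames against it. The `conforms` helper below is my own illustrative addition, not part of the data dictionary:

```python
import pandas as pd

# Illustrative Bronze schema: column names and variable codes follow the post.
SCHEMA_COLUMNS = ["station_id", "lat", "lon", "elev_m", "datetime",
                  "variable", "value", "unit", "qc_flag"]
VALID_VARIABLES = {"TMAX", "TMIN", "RH", "RAIN", "WS"}

def conforms(df: pd.DataFrame) -> bool:
    """True if every schema column is present and only known variable codes appear."""
    if not set(SCHEMA_COLUMNS) <= set(df.columns):
        return False
    return bool(df["variable"].isin(VALID_VARIABLES).all())
```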
2) Create Excel upload template
Build in Excel with fixed headers matching the schema.
Add Codes & Units sheet with dropdowns.
Save as C:\Users\acer\Data\catalog\templates\weather_template.xlsx
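The template can also be generated programmatically with openpyxl instead of building it by hand. This is a sketch: the sheet names, the example units, and the dropdown range are my own assumptions, not prescribed by the post:

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

HEADERS = ["station_id", "lat", "lon", "elev_m", "datetime",
           "variable", "value", "unit", "qc_flag"]

def build_template(path: str) -> None:
    """Write an .xlsx template with fixed headers and a Codes & Units sheet."""
    wb = Workbook()
    data = wb.active
    data.title = "Data"
    data.append(HEADERS)
    # Codes & Units sheet; the units shown are illustrative assumptions
    codes = wb.create_sheet("Codes & Units")
    codes.append(["variable", "unit"])
    for var, unit in [("TMAX", "degC"), ("TMIN", "degC"), ("RH", "%"),
                      ("RAIN", "mm"), ("WS", "m/s")]:
        codes.append([var, unit])
    # Dropdown on the variable column (column F) of the Data sheet
    dv = DataValidation(type="list", formula1='"TMAX,TMIN,RH,RAIN,WS"')
    data.add_data_validation(dv)
    dv.add("F2:F1000")
    wb.save(path)
```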
3) Ingestion (daily/weekly): Bronze to Silver data
Drop raw station files in C:\Users\acer\Data\bronze\agromet\YYYY\
Run ingestion script (example PowerShell):
```powershell
conda activate base
python C:\Users\acer\projects\weather\ingest_weather.py `
--in "C:\Users\acer\Data\bronze\agromet" `
--out "C:\Users\acer\Data\silver" `
--dict "C:\Users\acer\Data\catalog\weather_dictionary.xlsx"
```
QC failures are written to the bronze folder as *_rejected.csv and QC_report.json.
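The ingestion script itself is not shown in the post; a minimal sketch of the per-file logic it might contain could look like the following. The function name `ingest_file`, the rejection rules, and the use of CSV output (instead of Parquet) are all my assumptions:

```python
import json
from pathlib import Path

import pandas as pd

VALID_VARIABLES = {"TMAX", "TMIN", "RH", "RAIN", "WS"}

def ingest_file(raw_csv: str, silver_dir: str) -> dict:
    """Standardize one raw station file and split off rows that fail basic QC."""
    df = pd.read_csv(raw_csv, parse_dates=["datetime"])
    bad = df["value"].isna() | ~df["variable"].isin(VALID_VARIABLES)
    rejected, clean = df[bad], df[~bad]
    out = Path(silver_dir)
    out.mkdir(parents=True, exist_ok=True)
    clean.to_csv(out / f"{Path(raw_csv).stem}_clean.csv", index=False)
    # QC artefacts land next to the raw file, as described in step 3
    rejected.to_csv(Path(raw_csv).with_name(f"{Path(raw_csv).stem}_rejected.csv"),
                    index=False)
    report = {"rows_in": len(df), "rows_rejected": int(bad.sum())}
    Path(raw_csv).with_name("QC_report.json").write_text(json.dumps(report))
    return report
```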
4) Validation & promotion
Checks: RH within 0–100, valid dates, no duplicates, required columns.
Promote passing data to gold:
C:\Users\acer\Data\gold\agromet\<station>\vX.Y\data.parquet, metadata.json
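The four checks above can be sketched as one function whose empty return means the frame is safe to promote; the `validate` name and the exact problem strings are my own:

```python
import pandas as pd

REQUIRED = ["station_id", "datetime", "variable", "value", "unit", "qc_flag"]

def validate(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame can be promoted to gold."""
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    rh = df.loc[df["variable"] == "RH", "value"]
    if not rh.between(0, 100).all():
        problems.append("RH outside 0-100")
    if pd.to_datetime(df["datetime"], errors="coerce").isna().any():
        problems.append("invalid dates")
    if df.duplicated(["station_id", "datetime", "variable"]).any():
        problems.append("duplicate station/datetime/variable rows")
    return problems
```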
5) Access & retrieval
Open gold Parquet in JupyterLab or export to CSV.
Maintain index at C:\Users\acer\Data\catalog\index.xlsx
6) Feature building for AI/ML
From gold, compute weather indices such as 4-day and 7-day rolling means, rainfall totals, growing degree days (GDD), heat persistence, leaf wetness, etc.
Save features to C:\Users\acer\Data\gold\features\weather\feature_table.parquet
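A few of these features can be sketched with pandas rolling windows. The wide input layout, the window lengths, and the GDD base of 10 degC are illustrative assumptions, not prescriptions:

```python
import pandas as pd

def build_features(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: one row per date with columns datetime, TMAX, TMIN, RH, RAIN (wide form)."""
    f = daily.sort_values("datetime").set_index("datetime")
    out = pd.DataFrame(index=f.index)
    out["tmean"] = (f["TMAX"] + f["TMIN"]) / 2
    out["rh_7d_mean"] = f["RH"].rolling(7, min_periods=1).mean()
    out["rain_4d_total"] = f["RAIN"].rolling(4, min_periods=1).sum()
    out["gdd"] = (out["tmean"] - 10).clip(lower=0)   # base-10 growing degree days
    out["gdd_cum"] = out["gdd"].cumsum()
    return out.reset_index()
```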
7) Experiment tracking
Train in JupyterLab with models: scikit-learn/XGBoost/LightGBM.
Track with MLflow:
```powershell
mlflow ui --backend-store-uri "C:\Users\acer\mlruns"
```
8) Baselines & model menus
Occurrence: Logistic Regression → Random Forest → XGBoost.
Severity: Linear → Random Forest → XGBoost/LightGBM.
Always include a naïve baseline.
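A minimal sketch of the occurrence menu's first rung against a naive baseline, using synthetic stand-in data (the signal is artificial, not potato-disease data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic "weather features" with a planted linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # occurrence label
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
logreg = LogisticRegression().fit(X_tr, y_tr)
base_acc = baseline.score(X_te, y_te)
lr_acc = logreg.score(X_te, y_te)
```

If logistic regression cannot beat the dummy, the features carry no usable signal and the fancier models in the menu are not worth fitting yet.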
9) Batch scoring
score.py loads the latest features and the production model, then writes results to the predictions folder.
```powershell
python C:\Users\acer\projects\weather\score.py `
--features "C:\Users\acer\Data\gold\features\weather\feature_table.parquet" `
--model "C:\Users\acer\models\weather\prod\model.pkl" `
--out "C:\Users\acer\Data\gold\predictions\weather\v1.0"
```
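The body of score.py is not shown in the post; a stripped-down sketch of what it might do (argument parsing omitted, CSV used in place of Parquet for illustration, and the `score` name my own) could be:

```python
import pickle
from pathlib import Path

import pandas as pd

def score(features_path: str, model_path: str, out_dir: str) -> Path:
    """Load features and a pickled model, write predictions, return the output path."""
    feats = pd.read_csv(features_path)
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    preds = pd.DataFrame({"prediction": model.predict(feats.values)})
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    dest = out / "predictions.csv"
    preds.to_csv(dest, index=False)
    return dest
```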
10) Minimal governance
Roles: You = Data Steward & ML Lead.
Weekly: review QC reports, approve gold promotions, back up gold and models.
Keep local/simple until scale demands MinIO/SQL/orchestration.
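The weekly backup of gold and models can stay a one-liner while everything is local; a stdlib sketch (the `backup` helper and date-stamped naming are my own convention):

```python
import shutil
from datetime import date
from pathlib import Path

def backup(src_dir: str, backup_dir: str) -> Path:
    """Zip a folder (e.g. gold or models) into the backups area with a date stamp."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    stem = Path(backup_dir) / f"{Path(src_dir).name}_{date.today():%Y%m%d}"
    return Path(shutil.make_archive(str(stem), "zip", src_dir))
```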