Wilson Yip

Data Engineer & Scientist bridging the gap between complex mathematical modeling and robust data infrastructure. Leverages a background in Mathematics to optimize NLP algorithms and build high-throughput asynchronous scrapers. Expert in Python, dbt, and Airflow, with a focus on creating scalable, secure, and observable data environments that ensure data integrity.

Experience

London
Oct, 2023 - Present

Data Engineering & Infrastructure

  • Architected and maintained ELT pipelines using Airflow, dbt, GCS, and BigQuery.
  • Developed schema-detection tooling to trigger full refreshes upon schema changes in upstream sources.
  • Optimized storage by implementing BigQuery-GCS External Tables, eliminating data redundancy and enabling near real-time access.
  • Reduced query costs by implementing Hive-partitioned directory structures for external storage (see the sketch after this list).
  • Deployed CI/CD pipelines to automate testing on pull requests, reducing production errors by 90%.
  • Engineered custom dbt materializations for BigQuery Functions to provide functionality ahead of native dbt-core support.
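
A minimal sketch of the Hive-partitioned external-table setup referenced above, using the google-cloud-bigquery client; the bucket, project, and dataset names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS prefix; data lives under dt=YYYY-MM-DD/ subdirectories.
SOURCE_PREFIX = "gs://example-lake/events"

# External data config pointing BigQuery at Parquet files in GCS,
# so the data is queryable in place with no duplicate copy in BigQuery.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [f"{SOURCE_PREFIX}/*"]

# Hive-style partitioning lets the engine prune partitions and cut scan costs.
hive_partitioning = bigquery.HivePartitioningOptions()
hive_partitioning.mode = "AUTO"
hive_partitioning.source_uri_prefix = SOURCE_PREFIX
external_config.hive_partitioning = hive_partitioning

table = bigquery.Table("example-project.analytics.events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```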

Data Observability & Cost Optimization

  • Engineered a cost-governance framework by aggregating metadata from the dbt manifest.json, BigQuery INFORMATION_SCHEMA, GCP Audit Logs, and GCS Inventory Reports (see the sketch after this list).
  • Developed centralized observability tables to monitor tables, jobs, and GCS blobs, with automated reporting in Looker Studio.
  • Reduced BigQuery expenditure by 80% through strategic partitioning, incremental modeling, query tuning, and storage billing optimization.
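
A sketch of how such a framework can join dbt metadata to BigQuery job costs; the manifest path, region qualifier, and 30-day window are illustrative assumptions:

```python
import json
from google.cloud import bigquery

client = bigquery.Client()

# Map each dbt model to the relation it builds (manifest path is illustrative).
with open("target/manifest.json") as f:
    manifest = json.load(f)
models = {
    f'{n["database"]}.{n["schema"]}.{n["name"]}': n["unique_id"]
    for n in manifest["nodes"].values()
    if n["resource_type"] == "model"
}

# Aggregate billed bytes per destination table from the jobs metadata view.
query = """
    SELECT
      CONCAT(destination_table.project_id, '.',
             destination_table.dataset_id, '.',
             destination_table.table_id) AS table_ref,
      SUM(total_bytes_billed) AS bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY table_ref
"""
for row in client.query(query).result():
    if row.table_ref in models:
        print(models[row.table_ref], row.bytes_billed)
```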

Cloud Infrastructure & Security

  • Provisioned and managed GCP infrastructure using Terraform and Docker.
  • Deployed Cloud Functions as webhooks for event-driven architecture.
  • Implemented granular security protocols, including column-level access control and dataset-specific permissions.
  • Containerized Airflow instances for scalable deployment to cloud services.

Machine Learning & Analytics

  • Engineered features and conducted EDA using PySpark and Elasticsearch, processing large-scale datasets to improve model training quality.
  • Developed and deployed ML models to predict YouTube audience demographics, serving predictions via a high-performance FastAPI backend.
  • Optimized NLP matching algorithms by introducing soft-cosine similarity, resulting in a 5–10% increase in top-performer identification.
  • Built asynchronous URL scrapers to resolve millions of shortened links, reducing execution time by 90% through concurrent processing (sketched after this list).
  • Architected and maintained PostgreSQL databases, collaborating with stakeholders to design schemas for complex business requirements.
  • Orchestrated ETL pipelines using Airflow to ingest and transform agency performance and operational data.
  • Implemented system observability by performing log analysis with Grafana Loki and building performance dashboards in Grafana.
  • Accelerated internal workflows via rapid application development, automating document generation using Google APIs, the Slack API, and Elasticsearch.
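
A minimal sketch of the concurrent link-resolution pattern referenced above, assuming aiohttp; the semaphore size, timeout, and example URL are placeholders:

```python
import asyncio
import aiohttp

async def resolve(session: aiohttp.ClientSession,
                  sem: asyncio.Semaphore, url: str) -> tuple[str, str]:
    # HEAD keeps payloads tiny; following redirects means the response URL
    # is the fully resolved destination of the shortened link.
    async with sem, session.head(url, allow_redirects=True) as resp:
        return url, str(resp.url)

async def resolve_all(urls: list[str],
                      concurrency: int = 100) -> list[tuple[str, str]]:
    sem = asyncio.Semaphore(concurrency)  # cap concurrent connections
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(resolve(session, sem, u) for u in urls))

if __name__ == "__main__":
    print(asyncio.run(resolve_all(["https://bit.ly/example"])))
```
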
Various Universities in Hong Kong
Research Assistant (Data Scientist)
Hong Kong
Sept, 2017 - Jan, 2022

  • Performed statistical analysis and deployed machine learning models, including A/B testing, PCA, Poisson regression, k-means and hierarchical clustering, and LDA topic modelling, across varied datasets.
  • Developed and maintained R Shiny dashboards to visualise analysis results.

Education

Society of Actuaries
Probability (P) Exam
Hong Kong
Mar, 2017
University of Hong Kong
Bachelor of Science
Hong Kong
Sept, 2014 - Jul, 2017

Major: Mathematics/Physics
Minor: Computational and Financial Mathematics

Skills
Highlights
Python
Proficient in OOP and design patterns (Factory, Singleton). Built non-blocking systems with Asyncio. Managed complex state with Dataclasses. Engineered recursive schema-inference engines for JSON-to-BigQuery mapping and implemented high-throughput streaming via the BigQuery Storage Write API with dynamic schema handling.
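
An illustrative sketch of the recursive schema-inference idea above; the sample record and type mappings are simplified assumptions:

```python
from google.cloud import bigquery

def infer_field(name: str, value) -> bigquery.SchemaField:
    """Recursively map a JSON value to a BigQuery SchemaField."""
    if isinstance(value, list):
        # REPEATED field: infer the element type from the first item.
        inner = infer_field(name, value[0])
        return bigquery.SchemaField(name, inner.field_type,
                                    mode="REPEATED", fields=inner.fields)
    if isinstance(value, dict):
        sub = [infer_field(k, v) for k, v in value.items()]
        return bigquery.SchemaField(name, "RECORD", fields=sub)
    if isinstance(value, bool):  # bool before int: bool subclasses int
        return bigquery.SchemaField(name, "BOOLEAN")
    if isinstance(value, int):
        return bigquery.SchemaField(name, "INTEGER")
    if isinstance(value, float):
        return bigquery.SchemaField(name, "FLOAT")
    return bigquery.SchemaField(name, "STRING")

record = {"user": {"id": 1, "tags": ["a", "b"]}, "score": 0.9}
schema = [infer_field(k, v) for k, v in record.items()]
```
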
GCP
Managed IAM roles and permissions for accessing various GCP services such as BigQuery, GCS, Pub/Sub, and Secret Manager. Utilised Cloud Functions and Pub/Sub to stream data from various sources into BigQuery without data loss. Hosted a dbt-core Docker instance on Cloud Run to perform CI checks upon pull requests. Utilised Artifact Registry to store custom Docker images and monitored BigQuery Audit Logs.
Docker
Wrote Dockerfiles to containerise the full Airflow stack: the underlying database, the webserver, Celery workers, and Flower monitoring. Built custom images to host dbt-core for CI checks.
Terraform
Utilised Terraform to provision and manage GCP resources such as IAM roles, Pub/Sub topics, and BigQuery Policy Tags.
GitHub Actions
Set up CI/CD pipelines to automate testing and deployment of data pipelines. Automated the deployment of Airflow images upon merging to the main branch, and the running of dbt tests and models upon pull requests to ensure data quality and integrity before merging.
AWS
Developed an automated lifecycle management system within Airflow that triggers AWS Auto Scaling via Python (Boto3) upon pipeline completion, effectively achieving zero idle-compute costs. Also familiar with Lambda and Fargate for serverless and containerised workloads.
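
A minimal sketch of the scale-to-zero callback described above, assuming Boto3; the Auto Scaling group name is hypothetical:

```python
import boto3

def scale_down(asg_name: str) -> None:
    """Release idle workers once the pipeline finishes."""
    client = boto3.client("autoscaling")
    client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=0,     # drop to zero idle instances
        HonorCooldown=False,   # scale immediately, ignoring cooldown timers
    )

# e.g. invoked from an Airflow task's on_success_callback
scale_down("etl-worker-asg")
```
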
Rust
Developed a CLI utility in Rust to automate GCP authentication. Utilised ‘reqwest’ for asynchronous HTTP handling and ‘serde’ for type-safe JSON parsing to manage service account keys and generate OAuth2 bearer tokens for API interactions.
Data Processing
R
Leveraged libraries such as tidyverse, plyr, and dplyr for data manipulation. Performed various statistical analyses such as regression, hypothesis testing, and time-series analysis. Used ggplot2 and plotly for data visualisation. Developed R Shiny applications for interactive data exploration and reporting.
Airflow
Built custom operators and DAGs with factory classes. Utilised Dataclasses to define DAG and task configurations; with dynamic imports, these configurations are serialisable, so they can be stored in a database and visualised in a dashboard. Worked with DAG parameters to offer flexibility from the UI. Implemented pre- and post-execute hooks to handle common tasks such as checking data types between source and destination tables.
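
A minimal sketch of the dataclass-driven DAG-factory pattern described above (Airflow 2.x syntax); the config fields and task names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

@dataclass
class DagConfig:
    """Serialisable DAG configuration (illustrative fields)."""
    dag_id: str
    schedule: str
    tasks: list[str] = field(default_factory=list)

def build_dag(cfg: DagConfig) -> DAG:
    with DAG(dag_id=cfg.dag_id, schedule=cfg.schedule,
             start_date=datetime(2024, 1, 1), catchup=False) as dag:
        prev = None
        for task_id in cfg.tasks:
            task = EmptyOperator(task_id=task_id)
            if prev:
                prev >> task  # chain tasks sequentially
            prev = task
    return dag

# Configs could equally be deserialised from a database.
cfg = DagConfig("daily_ingest", "@daily", ["extract", "load", "transform"])
globals()[cfg.dag_id] = build_dag(cfg)  # register with the scheduler
```
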
BigQuery
Maintained the data warehouse with dbt. Utilised partitioned and clustered tables to optimise query performance and cost. Implemented row- and column-level security to restrict data access based on user roles. Set up Analytics Hub to securely share datasets across organisations. Connected BigQuery to GCS with External Tables to prevent data duplication, partitioning the data Hive-style.
dbt
Utilised different materialisations, including custom ones, to optimise performance and cost. Created custom macros to standardise commonly used SQL snippets across multiple models. Implemented tests to ensure data quality and integrity, including uniqueness, referential integrity, and custom business-logic tests.
Spark
Handled hundreds of millions of records with PySpark. Optimised query performance by implementing broadcast joins. Leveraged SparkSQL for complex analytical views along with custom MapReduce functions.
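
A short PySpark sketch of the broadcast-join pattern described above; the input paths, column names, and rollup are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
events = spark.read.parquet("gs://example-lake/events")
channels = spark.read.parquet("gs://example-lake/channels")

# Broadcasting the small side ships it to every executor,
# avoiding a full shuffle of the large table.
joined = events.join(broadcast(channels), on="channel_id", how="left")
joined.createOrReplaceTempView("enriched_events")

# SparkSQL over the joined view for an analytical rollup.
spark.sql("""
    SELECT channel_name, COUNT(*) AS n_events
    FROM enriched_events
    GROUP BY channel_name
""").show()
```
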
Looker Studio
Built interactive dashboards for data observability. Identified build and storage costs for each table and dataset in BigQuery, and monitored all query jobs as well as blobs in GCS.
Grafana
Implemented observability by querying Prometheus metrics and Loki logs using PromQL and LogQL.
PostgreSQL
Optimised production PostgreSQL through B-Tree/GIN indexing. Engineered automated ETL pipelines to synchronise relational data from Postgres to BigQuery.
Elasticsearch
Architected complex search queries and implemented soft-cosine similarity via Painless scripting, replacing the plain dot product with a bilinear form (xAyᵀ) over a term-correlation matrix A to enhance matching performance.
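
A NumPy sketch of the soft-cosine idea, outside Elasticsearch; the correlation matrix A and vectors are toy values:

```python
import numpy as np

def soft_cosine(x: np.ndarray, y: np.ndarray, A: np.ndarray) -> float:
    """Cosine similarity where the term-correlation matrix A turns the
    dot product into a bilinear form x A yᵀ."""
    num = x @ A @ y
    den = np.sqrt(x @ A @ x) * np.sqrt(y @ A @ y)
    return float(num / den)

# Toy example: terms 0 and 1 are partially correlated.
A = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.0, 0.0])  # document containing only term 0
y = np.array([0.0, 1.0, 0.0])  # document containing only term 1
print(soft_cosine(x, y, A))    # 0.6, where plain cosine would give 0
```
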
TensorFlow
Leveraged a strong mathematical background to engineer custom neural network architectures. Developed and tuned LSTM models for time-series forecasting, specifically targeting stock price prediction patterns.
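
A minimal Keras sketch of such an LSTM forecaster; the window size, layer widths, and training call are illustrative:

```python
import tensorflow as tf

WINDOW, FEATURES = 30, 1  # 30 past closes predicting the next one

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),  # next-step price regression
])
model.compile(optimizer="adam", loss="mse")

# x_train: (batch, 30, 1) windows of prices; y_train: (batch, 1) next price.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```
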
Miscellaneous
Apps Script
Utilised it as a free hosting layer to serve dbt model scripts (compiled and Jinja-templated) to other BigQuery users. Also used it to automate Google Sheets and send automated reports via email.
Linux Bash
System administration using Bash scripting. Utilised GNU commands (grep, sed, awk) for log parsing, filesystem management (chmod/chown), and remote container diagnostics via Docker/SSH. Integrated cloud CLIs (gcloud, aws-cli) into CI/CD pipelines for automated infrastructure scaling.
Administrative
Markdown
Everyday documentation, including this resume. Utilised LaTeX for mathematical equations, Mermaid for diagrams, and Pandoc for conversion from Markdown to HTML with CSS and Lua filters.
LaTeX
Academic writing and typesetting. Utilised packages such as amsmath, biblatex, geometry, hyperref, graphicx, xcolor, tikz, and pgfplots.
Languages
English: Fluent
Cantonese: Native
Mandarin: Fluent