Both research projects seek to extract parallel corpora for European languages. In ParaCrawl we extract parallel corpora for 27 European languages from petabyte-scale web crawls and Internet Archive data by performing unsupervised multilingual sentence alignment and matching. In EuroPat, we extract parallel corpora from European and US patent files.
For ParaCrawl, I maintain the production pipeline, process new data as they arrive from partners, collaborate with other members of the research consortium, and write project deliverables.
For EuroPat, I have set up a text-extraction pipeline for PDF documents; I run the pipeline on downloaded patent files and prepare the corresponding project deliverables.
I worked in the finance and economics programme where, under the supervision of Ulrich Germann, I developed a state-of-the-art system for large-scale news-article monitoring, fraud detection and question answering for an industrial partner (Credit Suisse).
To this end, I used a combination of traditional information-retrieval tools, transformer-based pre-trained language models, search-engine technology (Elasticsearch) and a vector-similarity engine (FAISS).
In addition, the pipeline featured similarity- and hash-based document deduplication, an "auto"-labelling module (built on a few manually labelled examples, the semantic-similarity engine and probabilistic labelling, the latter based on the Snorkel library), several pre-trained (re-trainable) classifiers, dimensionality-reduction options, and wrapper libraries around Elasticsearch and FAISS.
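To illustrate the hash-based side of the deduplication step, the sketch below collapses documents whose normalised text hashes to the same digest. This is a minimal illustrative example using only the standard library; the function name and normalisation are my own, and the production pipeline additionally used similarity-based matching for near-duplicates.

```python
import hashlib

def dedup_exact(docs):
    """Drop documents whose normalised text hashes to an already-seen digest."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalise case and whitespace so trivially different copies collide.
        normalised = " ".join(doc.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Breaking:  markets fall", "breaking: markets fall", "Markets recover"]
print(dedup_exact(docs))  # the first two collapse into one entry
```

Exact hashing like this only removes byte-identical (after normalisation) copies; catching paraphrased or lightly edited duplicates is what the similarity-based component is for.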
The project was delivered as a Docker application with three interconnected containers: one running Elasticsearch and Kibana, another running FAISS, and one running the main app.
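The three-container layout could be described with a compose file along these lines. This is a hypothetical sketch: the service names, image names and ports are assumptions for illustration, not the project's actual configuration.

```yaml
# Hypothetical docker-compose sketch of the three-container layout.
# Image names and ports are illustrative placeholders.
services:
  search:            # Elasticsearch + Kibana bundled in one container
    image: example/elastic-kibana:latest
    ports:
      - "9200:9200"  # Elasticsearch REST API
      - "5601:5601"  # Kibana UI
  vectors:           # FAISS vector-similarity service
    image: example/faiss-service:latest
    ports:
      - "8080:8080"
  app:               # main pipeline, talks to both backends
    image: example/pipeline:latest
    depends_on:
      - search
      - vectors
```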
On a 2018 PowerMac, the pipeline could ingest and process a daily Factiva feed of 100,000 articles in under four hours.