© unsplash/@utsavsrestha

Agriculture & Meteorology

Entity recognition in news articles and free text

Context

News reports on extreme weather events are important for evaluating weather forecasts in retrospect and for correlating weather events with their consequences, such as flooding or other damage. Moreover, the severity of an extreme weather event is difficult to quantify. Weather forecasting services, such as the Deutscher Wetterdienst (DWD), therefore analyze press articles and releases to gain insights into the date, location and severity of a weather event.

Challenges

Once the primary sources of weather-related articles are defined, the articles need to be retrieved from the different publishers. Press agencies such as dpa or Reuters deliver XML feeds that can easily be integrated into the data pipeline. However, other publishers, such as regional newspapers, do not offer comparable services, so their articles need to be scraped from different websites. Depending on the scraping method, the formats can vary from PDF to plain text.

Once the data has been preprocessed so that the articles are machine-readable, the challenge remains of how to detect, classify and evaluate the different entities, such as location, damage or type of weather event.

Potential solution approaches

Depending on the entities to be extracted and the diversity of their inputs, different technical approaches can be chosen. For fairly uniformly formatted entities, such as date or time, regular expressions can be written to match common formats. For dates, this could be dd/mm/yyyy or similar.
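As a minimal sketch of the regular-expression approach, the following Python snippet extracts dates from free text. The accepted formats (dd/mm/yyyy and dd.mm.yyyy) and the function name extract_dates are assumptions made for illustration; a real pipeline would need to cover whatever formats the chosen publishers actually use.

```python
import re
from datetime import date

# Assumed formats: dd/mm/yyyy and dd.mm.yyyy (day first, four-digit year).
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")


def extract_dates(text: str) -> list[date]:
    """Return all calendar-valid dates found in the text, in order of appearance."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Matches the pattern but is not a real date (e.g. 31/02/2020).
            continue
    return found


article = "The storm hit Hamburg on 09/02/2020 and eased by 11.02.2020."
dates = extract_dates(article)
```

Validating each match with datetime.date filters out strings that merely look like dates, which a pattern alone cannot do.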

For more complicated entities, a dictionary of synonyms and ontologies can be developed to classify and map text to entities and topics. Topic modelling approaches such as Latent Dirichlet allocation (LDA) can be used to measure the similarity between text components. More advanced approaches, which can lead to more promising results, include BERT or domain-specific word embeddings (such as BioBERT for biomedical language), or supervised learning approaches based on labelled data.
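The dictionary-based approach can be sketched in a few lines of Python. The synonym lists, the labels and the function name tag_weather_events below are hypothetical and only illustrate the idea; a production dictionary would be far larger, would likely be in German for DWD sources, and would be curated with domain experts.

```python
import re

# Hypothetical synonym dictionary: canonical entity label -> surface forms.
WEATHER_EVENT_SYNONYMS = {
    "flood": ["flood", "floods", "flooding", "inundation", "high water"],
    "storm": ["storm", "gale", "hurricane-force winds"],
    "hail": ["hail", "hailstorm", "hailstones"],
}


def build_pattern(terms):
    """Compile one case-insensitive pattern matching any of the terms as whole words."""
    # Longest terms first so that "hailstorm" is preferred over "hail".
    alternatives = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in alternatives)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


PATTERNS = {
    label: build_pattern(terms)
    for label, terms in WEATHER_EVENT_SYNONYMS.items()
}


def tag_weather_events(text):
    """Return (label, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, surface) for _, label, surface in sorted(hits)]


tags = tag_weather_events("Heavy flooding followed the hailstorm near Dresden.")
```

Whole-word matching keeps "storm" from firing inside "hailstorm". Such a dictionary is transparent and easy to extend, but it cannot handle unseen wording, which is where the learned approaches mentioned above come in.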

Related Case Studies

Natural Language Processing

Legal review of rental contracts

Different methods from the field of NLP helped us to create software that spots errors in legal contracts.

Related webinars

Text recognition (OCR) - The first step on the way to a successful implementation of an NLP project

In this talk we will deal with the topic of text recognition.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Labeling Tools - The second step on the way to the successful implementation of an NLP project

The success of an NLP project consists of a series of steps from data preparation to modeling and deployment. Since the input data are often scanned documents, the data preparation step initially involves the use of text recognition tools (OCR for short) and later on also the use of so-called labeling tools. In this webinar we will deal with the topic of selecting a suitable labeling tool.

Ewelina Fiebig

Machine Learning Scientist

Fabian Gringel

Machine Learning Scientist

Semantic search and understanding of natural text with neural networks: BERT

In this webinar you will get an introduction to the application of BERT for Semantic Search using a real case study: Every year millions of citizens interact with public authorities and are regularly overwhelmed by the technical language used there. We have successfully used BERT to deliver the right answer from government documents with the help of colloquial queries - without having to use technical terms in the queries.

Konrad Schultka

Machine Learning Scientist

Jona Welsch

Machine Learning Scientist

Recurrent neural networks: How computers learn to read

The webinar will give an introduction to how RNNs work and illustrate their use in an example project from the field of legal tech.

Fabian Gringel

Machine Learning Scientist