In this blog post I'm going to present the three best free text annotation tools for manually labeling documents in NLP (Natural Language Processing) projects. You will learn how to install, configure and use them, and find out which one suits your purposes best.
The tools I'm going to present are brat, doccano and INCEpTION.
The selection is based on this comprehensive scientific review article and our hands-on experience at dida.
I will discuss the tools one by one. For each of them, I will first give a general overview of what the tool is suited for, and then provide details (or links) regarding installation, configuration and usage.
brat rapid annotation tool
What you can do with it
brat is an online environment for collaborative text annotation that can be run on a (possibly local) server and then used in a browser.
brat is primarily meant for annotating single expressions and the relationships between them.
Annotating significantly longer text spans (e.g. paragraphs) turns out to be really inconvenient (see Usage below).
Input documents must come as text files. The user interface (UI) presentation of a text file in brat is not necessarily true to its original formatting. For these two reasons, brat isn't well suited for annotating structured documents, where you might rather want to annotate PDFs directly.
Configuring a labeling scheme is easy and flexible. You can define span entities, relations and attributes, along with constraints on them, which brat checks automatically. Furthermore, there are separate configuration files for defining a non-default visual configuration (e.g. colours of labels) and for tools like sentence segmentation (splitting) or tokenization.
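To make the idea of a labeling scheme more concrete, here is a minimal Python sketch that writes a small annotation.conf file for a brat collection. The type names (Person, Organization, Employment, Uncertain) are made up for illustration, and the section layout reflects my understanding of brat's configuration format, so please check it against the brat documentation before relying on it.

```python
from pathlib import Path

# Hypothetical labeling scheme: all type names below are illustrative only.
SCHEME = {
    "entities": ["Person", "Organization"],
    "relations": ["Employment Arg1:Person, Arg2:Organization"],
    "events": [],
    "attributes": ["Uncertain Arg:Person"],
}


def render_annotation_conf(scheme: dict[str, list[str]]) -> str:
    """Render the scheme as the text of an annotation.conf file."""
    blocks = []
    # All four section headers are written, even if a section is empty.
    for section in ("entities", "relations", "events", "attributes"):
        lines = [f"[{section}]", *scheme.get(section, [])]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def write_annotation_conf(collection_dir: Path, scheme: dict[str, list[str]]) -> Path:
    """Write annotation.conf into the given collection directory."""
    collection_dir.mkdir(parents=True, exist_ok=True)
    target = collection_dir / "annotation.conf"
    target.write_text(render_annotation_conf(scheme), encoding="utf-8")
    return target
```

Generating the file from a small dictionary like this is only a convenience; writing annotation.conf by hand works just as well.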
The annotations are also stored in text files. We found that parsing the annotations works smoothly if the labeled entities are words or sub-sentence expressions, but becomes tedious for longer spans.
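As an illustration of what parsing these annotation files can look like, here is a minimal Python sketch for brat's tab-separated standoff format. It only handles text-bound annotations (IDs starting with T) and relations (IDs starting with R); everything else is skipped. The sample content and the label names are invented, and the class and function names are my own, not part of brat.

```python
from dataclasses import dataclass


@dataclass
class TextBound:
    ann_id: str                       # e.g. "T1"
    label: str                        # e.g. "Person"
    fragments: list[tuple[int, int]]  # one (start, end) offset pair per fragment
    text: str                         # covered text as stored in the file


@dataclass
class Relation:
    ann_id: str           # e.g. "R1"
    label: str
    args: dict[str, str]  # role -> annotation id, e.g. {"Arg1": "T1"}


# Invented sample content for the text "Alice works at dida. ..."
SAMPLE = (
    "T1\tPerson 0 5\tAlice\n"
    "T2\tOrganization 15 19\tdida\n"
    "R1\tEmployment Arg1:T1 Arg2:T2\n"
    "T3\tNote 21 30;31 42\tfirst line second line\n"
)


def parse_ann(content: str) -> tuple[list[TextBound], list[Relation]]:
    """Parse text-bound annotations and relations; skip all other lines."""
    spans, relations = [], []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            label, offsets = fields[1].split(" ", 1)
            # A span may consist of several fragments separated by ";".
            fragments = [
                (int(start), int(end))
                for start, end in (frag.split() for frag in offsets.split(";"))
            ]
            spans.append(TextBound(ann_id, label, fragments, fields[2]))
        elif ann_id.startswith("R") and len(fields) >= 2:
            label, *arg_fields = fields[1].split()
            args = dict(arg.split(":", 1) for arg in arg_fields)
            relations.append(Relation(ann_id, label, args))
    return spans, relations
```

Calling parse_ann(SAMPLE) returns three spans and one relation. The third span shows why longer spans are more tedious: as far as I can tell, a span that crosses a line break is stored as several offset fragments, which you have to stitch back together yourself.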
brat provides some functionality for collaborative labeling: Multiple users are supported, and there is an integrated annotation comparison.
brat comes with detailed installation instructions on its website.
Let me just add a couple of hints that might make your life easier:
If you just want to install and run brat on your local machine, then the standalone server is what you want.
Make sure to check out the Detailed Instructions -> Placing Data section of the instructions to learn how to set up the annotation files.
brat is not compatible with Python 3. Thus you might have to modify the command that starts the server so that it explicitly calls a Python 2 interpreter.
We've found that brat works best with Google Chrome.
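Regarding the data setup mentioned in the hints above: to my knowledge, brat expects an annotation file (.ann) next to every text file (.txt) in a collection. The following Python sketch creates the missing empty annotation files. The function name and the directory you pass in are assumptions for illustration, and you can run this with any recent Python 3, independently of the interpreter brat itself uses.

```python
from pathlib import Path


def create_missing_ann_files(collection_dir: Path) -> list[Path]:
    """Create an empty .ann file next to every .txt file that lacks one."""
    created = []
    # rglob also covers text files in sub-collections (subdirectories).
    for txt_path in sorted(collection_dir.rglob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if not ann_path.exists():
            ann_path.touch()
            created.append(ann_path)
    return created
```

Existing annotation files are left untouched, so it is safe to run the function again after adding new documents.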
Using brat is fairly straightforward: Marking a text span opens a pop-up menu. The options in the menu depend on the configuration of the labeling scheme.
doccano
What you can do with it
doccano is another annotation tool solely for text files. It's easier to use and simpler than brat.
Just like brat, it runs on a server and has a browser UI.
The main differences in comparison with brat are that
all configuration is done in the web user interface and
the use case is limited to document classification, sequence labeling and sequence-to-sequence.
This means that doccano is more beginner-friendly (and probably more user-friendly in general) than brat, but unlike in brat, you cannot define relationships or attributes. Depending on the chosen use case, labels exist only at document level or only at span level.
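To illustrate the difference between document-level and span-level labels, here is a small Python sketch. The classes are purely hypothetical and do not reflect doccano's internal data model or its export format.

```python
from dataclasses import dataclass, field


@dataclass
class SpanLabel:
    start: int  # character offset where the span begins
    end: int    # character offset where the span ends (exclusive)
    label: str


@dataclass
class LabeledDocument:
    text: str
    # Document classification: labels apply to the whole text.
    document_labels: list[str] = field(default_factory=list)
    # Sequence labeling: labels apply to character spans.
    span_labels: list[SpanLabel] = field(default_factory=list)

    def labeled_spans(self) -> list[tuple[str, str]]:
        """Return (covered text, label) pairs for all span labels."""
        return [(self.text[s.start:s.end], s.label) for s in self.span_labels]
```

A document classification project would only fill document_labels, while a sequence labeling project would only fill span_labels. Nothing in this structure links two spans to each other, which is exactly the relation functionality you get in brat but not in doccano.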
doccano supports multiple users, but apart from that there are no additional features for collaborative labeling.
There are two extra features that you don't find in brat: You can write and save labeling guidelines right within the app (in Markdown), and get a basic diagrammatic overview of the labeling stats.
The installation is easy and fully described on doccano's GitHub repo.
You don't need to understand what Docker is and does in order to install and use doccano. Just make sure Docker is installed.
There is not much to configure in doccano. You can create and edit labels directly in the browser UI, as well as labeling guidelines.
I recommend trying out doccano's live demos to get acquainted with its functionality.
INCEpTION
What you can do with it
Like the first two tools, INCEpTION uses a browser UI. It can be set up for a group of users on a server or as a standalone version.
INCEpTION is a much more heavyweight tool than either doccano or brat:
It can handle both text files and PDFs that contain text information (e.g. because they have been created from text files or by OCR software).
It features an extensive "Settings" section that lets you configure virtually everything you could wish for.
It provides functionality to facilitate collaborative labeling and to evaluate the annotations statistically.
It can export annotations in a broad range of common NLP labeling formats.
That said, INCEpTION might be a bit overwhelming at first. Feel free to ignore the features you don't know what to do with and concentrate on what you need for your project.
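To give an idea of what evaluating annotations statistically can mean, here is a minimal Python sketch of Cohen's kappa, a common agreement measure for two annotators who labeled the same items. I use it purely as an illustration; it is not meant to describe how INCEpTION computes its statistics, and the function name is my own.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, while a value around 0 means the annotators agree no more often than chance would predict. Low values are often a sign that the labeling guidelines need to be sharpened.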
Running INCEpTION is especially easy, because you can execute the downloaded JAR file without installing it.
Configuration and usage
There is so much to configure in INCEpTION that I cannot even really start to cover it here.
However, chances are that you are interested in INCEpTION because of its PDF labeling capacity, so I want to show you at least how to do that. In the following video I will
create a new project,
import a document,
define a label,
change the document viewer settings to display the document as a PDF file,
annotate the document.
To understand what else INCEpTION has to offer and how to use it, you really need to spend some time trying things out and reading the user guide.
You can start with the online demo version.
I have presented the three best free NLP labeling tools and pointed out how to use them.
To conclude, here is a rough guideline for choosing the right tool among the three:
If you work with text files, your task can be categorized as document classification, sequence labeling or sequence-to-sequence, you don't need relations, and you want to start labeling as soon as possible without lengthy configuration, then choose doccano.
If you work with text files and want to keep things as simple as possible, but need more functionality than doccano provides, then try out brat.
If for some reason you want to work with (native) PDF files, or you are not afraid of a more complex annotation tool that takes some time to get acquainted with but rewards you with an extensive range of features, then INCEpTION is right for you.
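The guideline above can also be written down as a small decision function. This Python sketch is just my condensed reading of the three rules; the field names are invented, and real projects will of course have requirements that don't fit into three flags.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_pdf: bool = False                 # annotate (native) PDF files directly
    needs_more_than_doccano: bool = False   # e.g. relations or attributes
    wants_extensive_features: bool = False  # willing to learn a more complex tool


def recommend_tool(req: Requirements) -> str:
    """Map project requirements to one of the three tools."""
    if req.needs_pdf or req.wants_extensive_features:
        return "INCEpTION"
    if req.needs_more_than_doccano:
        return "brat"
    # Plain text, a standard task type and a quick start.
    return "doccano"
```

The order of the checks matters: PDF support is the hard constraint, so it is tested first, and doccano is the default when nothing speaks against the simplest option.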
If these labeling tools aren't enough to successfully realize your machine learning project and you need further consultation, feel free to book a Machine Learning Expert Talk with us.