Summary of A Small Claims Court for the NLP: Judging Legal Text Classification Strategies with Small Datasets, by Mariana Yukari Noguti, Eduardo Vellasques, and Luiz Eduardo Soares Oliveira
A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets
by Mariana Yukari Noguti, Eduardo Vellasques, Luiz Eduardo Soares Oliveira
First submitted to arXiv on: 9 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This research paper investigates strategies for text classification in domains where labeling requires expert annotators, such as the legal domain. The task is to assign each record of a demand made to a Brazilian Public Prosecutor's Office to one of 50 predefined topics. Given the scarcity of Portuguese-language resources in the legal domain, the authors compare classic supervised models (logistic regression and SVM) and ensemble methods (random forest and gradient boosting), all built on embeddings extracted from word2vec, against transformer-based models pre-trained on unlabeled data and fine-tuned on the small labeled dataset. BERT-based models used as classifiers outperform the classic approaches, and the best result, 80.7% accuracy, is obtained with Unsupervised Data Augmentation (UDA), which combines BERT with data augmentation and semi-supervised learning strategies. |
Low | GrooveSquid.com (original content) | This paper looks at how to make computers better at understanding text in fields that need a lot of expertise, like law. The authors use models trained on lots of unlabeled text and then fine-tune them with a small amount of labeled data. The goal is to assign text descriptions to specific categories. They test different approaches using records from a Brazilian public prosecutor's office and find that some classic methods work better than others, but BERT-based models perform best when used as classifiers. The top result comes from combining BERT with data augmentation and semi-supervised learning, achieving 80.7% accuracy. |
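The classic baseline the summaries describe (word2vec document embeddings fed into a linear classifier such as logistic regression) can be sketched as follows. This is an illustrative toy, not the paper's setup: the vocabulary, labels, 8-dimensional random vectors, and tiny corpus are all stand-ins, and a real pipeline would use pre-trained Portuguese word2vec vectors and the actual 50-topic dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for pre-trained word2vec vectors (8 dims here; real models
# typically use 100-300 dimensions learned from a large corpus).
rng = np.random.default_rng(0)
vocab = ["consumer", "contract", "refund", "noise", "neighbor", "complaint"]
w2v = {w: rng.normal(size=8) for w in vocab}

def doc_vector(text):
    """Average the word2vec vectors of in-vocabulary words; zeros if none match."""
    vecs = [w2v[t] for t in text.lower().split() if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(8)

# Tiny illustrative dataset: demand descriptions mapped to hypothetical topic labels.
docs = [
    "consumer contract refund",
    "refund contract",
    "noise neighbor complaint",
    "neighbor noise",
]
labels = ["consumer_law", "consumer_law", "neighborhood", "neighborhood"]

# Fit a linear classifier on the averaged embeddings.
X = np.stack([doc_vector(d) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

prediction = clf.predict([doc_vector("contract refund dispute")])[0]
print(prediction)
```

Swapping `LogisticRegression` for an SVM, random forest, or gradient-boosting classifier reproduces the other classic baselines the paper compares; the BERT-based approaches replace the averaged-embedding step with a fine-tuned transformer encoder.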
Keywords
» Artificial intelligence » BERT » Boosting » Data augmentation » Logistic regression » Random forest » Semi-supervised » Supervised » Text classification » Transformer » Unsupervised » Word2vec