Context
Predicting whether a sentence is finished is one of the more high-level classification tasks in NLP. A model for it can help, for example, to detect sentences that users forgot to finish or that leave too much room for interpretation. In many applications, this helps considerably with cleaning text data.
With this dataset, you can build a classification model for such a task.
Content
Each item consists of a sentence and its target is_finished. Your goal is to predict whether a sentence is finished or not, e.g.:
finished: "Kaggle is such a great platform, where Data Scientists from all over the world can share their ideas and data!"
not finished: "I believe that we should just" [... just what?]
The data is collected from various news headlines. The labeling is weakly supervised using our labeling software onetask, i.e. we labeled the data both programmatically, using labeling functions (based on dependency parsers, for example), and manually.
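To give a feel for what programmatic labeling with labeling functions can look like, here is a minimal sketch. It is not onetask's API and not the labeling functions actually used for this dataset: the function names, the `DANGLING_WORDS` list and the majority vote are all assumptions made for illustration, and simple string heuristics stand in for a dependency parser.

```python
import re
from collections import Counter

FINISHED, NOT_FINISHED, ABSTAIN = 1, 0, -1

# Hypothetical list of words that rarely end a complete sentence.
DANGLING_WORDS = {"a", "an", "the", "and", "or", "but", "to", "of",
                  "just", "that", "because"}


def last_word(sentence):
    """Return the last alphabetic word of the sentence in lowercase."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return words[-1] if words else ""


def lf_dangling_word(sentence):
    # Vote "not finished" if the sentence stops on a function word.
    return NOT_FINISHED if last_word(sentence) in DANGLING_WORDS else ABSTAIN


def lf_terminal_punctuation(sentence):
    # Headlines often omit the final period, so missing punctuation
    # leads to an abstention instead of a "not finished" vote.
    ends = sentence.rstrip().endswith((".", "!", "?"))
    return FINISHED if ends else ABSTAIN


def lf_trailing_separator(sentence):
    # A trailing comma, colon or semicolon suggests more text was expected.
    ends = sentence.rstrip().endswith((",", ":", ";"))
    return NOT_FINISHED if ends else ABSTAIN


LABELING_FUNCTIONS = [lf_dangling_word, lf_terminal_punctuation,
                      lf_trailing_separator]


def weak_label(sentence):
    """Combine the labeling functions by majority vote; ties abstain."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Sentences on which the functions abstain or disagree are the natural candidates for the manual labeling pass.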
Acknowledgements
Thanks to my colleague Henrik Wenck, who provided me with the idea to publish this task on Kaggle 🙏
Inspiration
Let's build and discuss ideas! From my point of view, this task can be approached by, for example:
- parsing text using linguistic algorithms, to detect dependencies within the text that indicate whether a sentence is finished or not
- using a recurrent neural network (RNN)
- using vanilla algorithms with fine-tuned embeddings that represent the context of a sentence
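As a starting point for discussion, here is a minimal baseline sketch in the spirit of the "vanilla algorithm" idea. It is a simplification: a small logistic regression trained with plain SGD, and sparse indicator features for the last two tokens take the place of fine-tuned embeddings. The function names and the `TOY_EXAMPLES` sentences are invented for illustration and are not taken from the dataset.

```python
import math
import re


def features(sentence):
    """Indicator features for the last two tokens (words or end punctuation)."""
    tokens = re.findall(r"[a-z']+|[.!?]", sentence.lower())
    last = tokens[-1] if tokens else "<empty>"
    prev = tokens[-2] if len(tokens) > 1 else "<start>"
    return {"bias", "last=" + last, "prev=" + prev}


def predict_proba(weights, sentence):
    """Probability that the sentence is finished under the current weights."""
    score = sum(weights.get(f, 0.0) for f in features(sentence))
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression weights with stochastic gradient descent."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            error = label - predict_proba(weights, sentence)
            for f in features(sentence):
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights


# Invented toy sentences (1 = finished, 0 = not finished), NOT from the dataset.
TOY_EXAMPLES = [
    ("The council approved the new budget.", 1),
    ("She finished the report on time.", 1),
    ("Prices rose again this month.", 1),
    ("Can we meet tomorrow?", 1),
    ("I believe that we should just", 0),
    ("The council approved the", 0),
    ("She wanted to", 0),
    ("Prices rose again because of", 0),
]

weights = train(TOY_EXAMPLES)
```

With the real data, `TOY_EXAMPLES` would be replaced by the (sentence, is_finished) pairs, and `predict_proba(weights, sentence) > 0.5` would be read as "finished". A baseline like this mostly memorizes sentence endings, which makes it a useful reference point for the parser-based and recurrent approaches.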