NLP project that works with news (NER, context generation, news trend analytics)

Last update: Jan 4, 2023

Overview

СоАвтор (CoAuthor)

СоАвтор is a platform and an open set of tools for newsrooms and freelance journalists, designed to make content creation as comfortable and fast as possible.

The СоАвтор tools are built on top of the open analytics platform OT. Full integration of the application with the platform is planned for the near future: data collection and processing, running analytical algorithms, and building and launching the application will all take place on the platform. A public repository with the OT platform tools is coming soon.

We are currently developing the following tools:

Real-time tracking of events and trends (working with structured news formats and parsing news sources). For this, we are writing a module that continuously parses news outlets, and we are working out how to track informative changes in articles.
Selection of articles relevant to a finished piece, to automatically build a background module (reference information or the backstory of an event). For this, we use tools for finding semantically similar texts in the archive and tools for generating a summary from several documents.
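
One of the open questions in the first tool is how to tell an informative change in an article from a cosmetic one. The sketch below is only one possible starting point, not the project's implementation: it compares two versions of an article word by word with the standard-library difflib module and treats an update as informative when enough words were added or removed. The function names and the min_changed_words threshold are hypothetical and chosen for illustration.

```python
import difflib
from typing import List, Tuple


def changed_words(old: str, new: str) -> Tuple[List[str], List[str]]:
    """Return (added, removed) words between two versions of an article."""
    old_w, new_w = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_w, b=new_w, autojunk=False)
    added: List[str] = []
    removed: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both lists, "delete"/"insert" to one each.
        if tag in ("replace", "delete"):
            removed.extend(old_w[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_w[j1:j2])
    return added, removed


def is_informative_update(old: str, new: str, min_changed_words: int = 5) -> bool:
    """Hypothetical rule: an update is informative if enough words changed."""
    added, removed = changed_words(old, new)
    return len(added) + len(removed) >= min_changed_words


old_version = "The mayor spoke on Monday. The budget was approved."
new_version = old_version + " Two council members resigned after the vote."
print(changed_words(old_version, new_version))
print(is_informative_update(old_version, new_version))
```

A word-count threshold ignores small edits such as a corrected typo, but it cannot judge meaning: a single changed number can matter more than a rewritten paragraph, so a real module would likely need something richer.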

Development is carried out together with the professional community, so that the workflow is as convenient as possible for newsrooms and freelancers. The СоАвтор platform has a modular structure. You can come up with a new tool that simplifies working with text, or take part in the work on the tools already in development. Join our community on Discord and send us your #ideas on how СоАвтор can be used when working with content.

Launch this project locally

Installation

Download the project files, or fork the repository and use git clone.
Download the data files: ru_stopwords.txt and news_df.parquet.
Download the models: rubert_tiny and rut5_base_sum.
In the terminal, change to the project directory.
Run pip install -r requirements.txt to install all required libraries.

Launch

Change the paths to the data files and models in config.yaml.
In the terminal, change to the project directory.
Run streamlit run menu.py.
By default, the app is available at http://localhost:8501.
P.S.: the app can be run on your own dataset, provided it follows the expected format. A sample dataset and a description of the format are in the data directory.

How to participate in this project

Current tasks

Continuously updated news feed
Module for connecting to social networks
Trend analysis based on social network posts
Evergreen news classification

Help resolve one of the current issues

Check whether there is an open issue in Issues that you would like to work on.
If you have your own idea or have found a bug, open a new issue (instructions below).
Fork the project, start working on the issue, and submit your changes as a pull request.

Add an issue

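If you find a bug or something unfinished, describe it in Issues with the bug tag.
If you have questions about the functionality or the code, or are not sure whether something is a bug or a feature, ask in Issues with the question tag.
If you have an idea for a new feature but are not sure you can implement it yourself, describe it in Issues with the enhancement tag.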

What else can I do

Take part in the discussion of this project, or of your own ideas, in our Discord community WellnessDataClub.
Use this project as a base for your own open source product. We currently work with news; you can choose another data type :)
Take part in our other open source project, OpenMask.

Comments

Config is probably outdated

Line 11 of the context_gen.py script requests the file "config_local.yaml", which is not in the repository because it is listed in .gitignore. There is a config.yaml file, but it appears to be outdated, since it also returns an error.

opened by karuna-heks 1
Searching for related texts

What is the best way to find similar texts when I have the input text, its keywords and named entities, and its embedding? I have the same data for every text in the dataset. I need to find the closest related texts. Texts about the same person or location (i.e. the same NE) or about the same events (perhaps the same keywords) must have the highest similarity scores. This is currently done in util.kwne_similarity.py and in the generate_context function in context_gen.py, but it does not work very well. Sometimes the top related texts include texts that share a theme with the input text but have a completely different subject. For example, my input text is about Covid. Among the top related texts there are texts that also have a health and medicine theme but are not about Covid, and these non-Covid texts still get a higher similarity score than some Covid texts from the dataset.
help wanted

opened by annachikina 0
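
One common way to approach the ranking problem described in this issue is to combine embedding similarity with explicit overlap of named entities and keywords, so that a text sharing the same entity can outrank a text that is only thematically close. The sketch below is an illustration under stated assumptions, not what util.kwne_similarity.py does: the record layout (id, embedding, entities, keywords), the function names, and the weights are all hypothetical, and the weights would need tuning on real data.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of common items between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_score(query: Dict, doc: Dict,
                   w_emb: float = 0.4, w_ne: float = 0.4, w_kw: float = 0.2) -> float:
    """Weighted sum of embedding, named-entity and keyword similarity.

    The weights are arbitrary placeholders, not values from the project.
    """
    return (w_emb * cosine(query["embedding"], doc["embedding"])
            + w_ne * jaccard(query["entities"], doc["entities"])
            + w_kw * jaccard(query["keywords"], doc["keywords"]))


def rank(query: Dict, docs: List[Dict], top_k: int = 5) -> List[Tuple[float, str]]:
    """Return up to top_k (score, id) pairs, best match first."""
    scored = [(combined_score(query, d), d["id"]) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Toy data: the "health" text is closer in embedding space,
# but the "covid" text shares the entity and the keyword.
query = {"embedding": [1.0, 0.0], "entities": {"covid"}, "keywords": {"vaccine"}}
docs = [
    {"id": "health_article", "embedding": [0.8, 0.6],
     "entities": {"flu"}, "keywords": {"hospital"}},
    {"id": "covid_article", "embedding": [0.6, 0.8],
     "entities": {"covid"}, "keywords": {"vaccine"}},
]
print(rank(query, docs))
```

With these toy values the Covid text ranks first even though its embedding is further from the query, which is the behaviour the issue asks for. Whether a fixed weighted sum is enough on the real archive is an open question.
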
Better algorithm for keyword extraction

At the moment, TextRank on n-grams (adjectives + nouns) is used for keyword extraction: the collect_np function in util.data_preprocessing.py collects the n-grams, and util.textrank computes the TextRank scores. This approach produces very long, unconnected n-grams as output keywords. For example, 'вопрос доставка мигрант белоруссия представитель пресс-служба еврокомиссия стефан' becomes a single keyword. The question is how to split these long keywords. Should we collect noun phrases in a different way (the collect_np function in util.data_preprocessing)? Or should we post-process the output of the current TextRank algorithm and split the long keywords after we have them?
help wanted

opened by annachikina 0
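
For the second option raised in this issue (post-processing the TextRank output), the simplest baseline is to cut each long keyword into consecutive chunks with a capped number of tokens. The sketch below shows only that baseline; split_long_keyword and max_len are hypothetical names, and fixed-size chunks can cut a phrase in a linguistically wrong place, so this is a reference point to compare against rather than a proposed fix.

```python
from typing import List


def split_long_keyword(keyword: str, max_len: int = 3) -> List[str]:
    """Cut a space-separated keyword into consecutive chunks of at most max_len tokens."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    tokens = keyword.split()
    return [" ".join(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


long_keyword = ("вопрос доставка мигрант белоруссия представитель "
                "пресс-служба еврокомиссия стефан")
for chunk in split_long_keyword(long_keyword):
    print(chunk)
```

The resulting chunks would still need to be re-scored, since the TextRank score was computed for the original long n-gram.
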
Add file not found error

Check that model files exist before reading them. Functions where this check should be implemented: embed_bert_cls in util.text_embedding.py and remove_stop_words_punct in util.data_preprocessing.py.
good first issue

opened by annachikina 0
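
The check requested in this issue could be sketched as a small helper that is called before a model or data file is read. This is a minimal illustration using only the standard library; require_path and its description parameter are hypothetical names, and how the helper would be wired into embed_bert_cls or remove_stop_words_punct is left open because their signatures are not shown here.

```python
from pathlib import Path
from typing import Union


def require_path(path: Union[str, Path], description: str = "file") -> Path:
    """Return the path as a Path object, or raise FileNotFoundError with a readable message.

    exists() is used instead of is_file() because a model may be stored
    as a directory rather than a single file.
    """
    resolved = Path(path).expanduser()
    if not resolved.exists():
        raise FileNotFoundError(
            f"{description} not found at '{resolved}'; check the path in config.yaml"
        )
    return resolved
```

A caller would wrap the configured path, for example require_path(stopwords_path, "stop word list"), so that a wrong path fails early with a message that points at the configuration rather than deep inside a library call.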
