Catalogue data - Python scripts to prepare catalogue data

BigScience Workshop

Last update: Mar 3, 2022


Overview

catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub: https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory:

HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

path_to_file_with_dataset_ratios: path to a JSON file containing a dict with dataset names (keys) and their ratios (values) between 0 and 1.
dir_path_to_save_aggregated_dataset: directory path to save the aggregated dataset
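As an illustration of the ratios file described above, here is a minimal sketch that validates and writes such a dict. The helper names (validate_ratios, write_ratios), the example ratio values, and the use of full repository IDs as dataset names are assumptions, not taken from the repository.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example: dataset names (keys) mapped to ratios (values) in [0, 1].
# Whether keys must be full repo IDs or short names is an assumption here.
dataset_ratios = {
    "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
    "bigscience-catalogue-lm-data/lm_en_wikinews_filtered": 0.5,
}


def validate_ratios(ratios):
    """Raise ValueError if any ratio falls outside the range [0, 1]."""
    for name, ratio in ratios.items():
        if not 0 <= ratio <= 1:
            raise ValueError(f"Ratio for {name!r} must be between 0 and 1, got {ratio}")
    return ratios


def write_ratios(ratios, path):
    """Validate the ratios and write them to `path` as a JSON dict."""
    path = Path(path)
    path.write_text(json.dumps(validate_ratios(ratios), indent=2))
    return path
```

The resulting file path would then be passed as --dataset_ratios_path.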

Comments

Add wiki filter on "type" meta field

This PR adds a filter that removes all the examples whose "type" field inside their "meta" value is not equal to "text".

I've tested it on lm_en_wikinews_filtered; here are the logs:

03/03/2022 11:46:20 - INFO - __main__ - Applied filter: filter_wiki_non_text_type
03/03/2022 11:46:20 - INFO - __main__ -      Initial number of samples: 54387 samples
03/03/2022 11:46:20 - INFO - __main__ -      Removed samples: 24736 samples
03/03/2022 11:46:20 - INFO - __main__ -      Removed percentage: 45.48 %

Partially solves #5
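For readers who want to see the idea, a filter of this kind could look roughly like the sketch below. It reuses the filter name from the logs, but the body is an assumption: it supposes that each "meta" value is already a dict and that the filter receives a batch (a dict of columns) and returns one bool per example. The actual implementation in the PR may differ.

```python
def filter_wiki_non_text_type(batch):
    """Return one bool per example; True means the example is kept.

    Assumption: batch["meta"] is a list of dicts. An example is kept only
    if its meta dict has a "type" field equal to "text".
    """
    return [meta.get("type") == "text" for meta in batch["meta"]]


# Illustrative batch (made-up values, not from the dataset).
example_batch = {
    "text": ["An article body.", "A table dump.", "No type information."],
    "meta": [{"type": "text"}, {"type": "table"}, {}],
}
keep_mask = filter_wiki_non_text_type(example_batch)
```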

opened by SaulLu 4

Code doesn't need to run deduplication script

The code doesn't need to run the deduplication script, as document-level deduplication was already done and line-level deduplication is undesired. Can you confirm @lvwerra @TevenLeScao? We could also run document-level deduplication just in case. LMK.

opened by thomasw21 3

change way to compute the size of the text

As discussed on slack, the method getsizeof() doesn't measure the same thing as len(text.encode()). This PR proposes to change the way we compute the size of the text.

Note, however, that I think this change will not be detected by the caches used by the map and filter methods. The easiest solution would be to pass load_from_cache_file=False to these methods, but we may want to keep using the caches for other filters / cleanings we have already executed.
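The difference between the two measurements can be seen in a small sketch. sys.getsizeof() reports the in-memory size of the Python str object (an interpreter detail that includes object overhead), while len(text.encode()) counts the bytes of the UTF-8 encoding. The helper names and the 500-byte threshold semantics (keep documents with at least 500 bytes) are assumptions for illustration.

```python
import sys


def utf8_size_in_bytes(text):
    """Size of the text once encoded as UTF-8 (str.encode defaults to UTF-8)."""
    return len(text.encode())


def keep_docs_with_min_bytes(texts, min_bytes=500):
    """Hypothetical size filter: one bool per text, True if it is large enough."""
    return [utf8_size_in_bytes(text) >= min_bytes for text in texts]


sample = "héllo"
object_size = sys.getsizeof(sample)        # size of the str object in memory
encoded_size = utf8_size_in_bytes(sample)  # 6: "é" takes two bytes in UTF-8
```

Because the object overhead dominates for short strings, the two values can differ a lot on small documents, which is consistent with the different byte totals in the logs below.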

Test

I tested by running:

python clean.py \
    --dataset-path bigscience-catalogue-lm-data/lm_en_wikinews_filtered \
    --maps-and-filters filter_small_docs_bytes_500 \
    --save-path /home/lucile/data/result_filtering_cleaning/lm_en_wikinews_filtered.jsonl \
    --checks-save-path /home/lucile/data/result_filtering_cleaning/lm_en_wikinews_filtered_checks \
    --num-proc 4 \
    --sampling-size-map-check 100000 \
    --sampling-size-filter-check 100000 \
    --batch-size 100

The output before

03/07/2022 11:56:18 - INFO - __main__ - Applied filter: filter_small_docs_bytes_500
03/07/2022 11:56:18 - INFO - __main__ -      Initial number of samples: 54387 samples
03/07/2022 11:56:18 - INFO - __main__ -      Removed samples: 22412 samples
03/07/2022 11:56:18 - INFO - __main__ -      Removed percentage: 41.21 %
03/07/2022 11:56:18 - INFO - __main__ -      Final number of samples: 31975 samples
03/07/2022 11:56:18 - INFO - __main__ -      Initial size in bytes: 0.3100 GB
03/07/2022 11:56:18 - INFO - __main__ -      Removed bytes: 0.0232 GB
03/07/2022 11:56:18 - INFO - __main__ -      Removed percentage in bytes: 7.48 %
03/07/2022 11:56:18 - INFO - __main__ -      Final size in bytes: 0.2868 GB

The output after

03/07/2022 11:57:59 - INFO - __main__ - Applied filter: filter_small_docs_bytes_500
03/07/2022 11:57:59 - INFO - __main__ -      Initial number of samples: 54387 samples
03/07/2022 11:57:59 - INFO - __main__ -      Removed samples: 23856 samples
03/07/2022 11:57:59 - INFO - __main__ -      Removed percentage: 43.86 %
03/07/2022 11:57:59 - INFO - __main__ -      Final number of samples: 30531 samples
03/07/2022 11:57:59 - INFO - __main__ -      Initial size in bytes: 0.0778 GB
03/07/2022 11:57:59 - INFO - __main__ -      Removed bytes: 0.0079 GB
03/07/2022 11:57:59 - INFO - __main__ -      Removed percentage in bytes: 10.18 %
03/07/2022 11:57:59 - INFO - __main__ -      Final size in bytes: 0.0699 GB

opened by SaulLu 3

new way to simplify dedup url

This PR proposes to modify the way the URL is simplified before hashing it for deduplication.

The first modification is to keep the id in the query parameters.

More testing should be done to check whether other query parameters in the URLs are needed to distinguish two examples. For example, in the lm_en_pseudocrawl-filtered_619_www_qut_edu_au dataset, I see URLs of the form https://www.qut.edu.au/study/unit?unitCode=ERB316. I don't know whether this example overlaps with https://www.qut.edu.au/study/unit?unitCode=LLB346 (as it stands, this dedup assumes that it does).
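To make the behaviour concrete, here is a minimal sketch of a URL simplification that drops every query parameter except id before hashing. The function names, the choice of MD5, and the exact simplified form (host plus path) are assumptions; the PR's actual code may simplify URLs differently.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumption: only the "id" query parameter is kept, as described in this PR.
KEPT_QUERY_PARAMS = ("id",)


def simplify_url(url, kept_params=KEPT_QUERY_PARAMS):
    """Keep host and path, plus only the whitelisted query parameters."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in kept_params]
    simplified = parts.netloc + parts.path
    if kept:
        simplified += "?" + urlencode(kept)
    return simplified


def url_hash(url):
    """Hash of the simplified URL, usable as a deduplication key."""
    return hashlib.md5(simplify_url(url).encode()).hexdigest()
```

With this sketch, the two unitCode URLs above collapse to the same key, which is exactly the open question raised in this PR.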

opened by SaulLu 2

Add sentence splitter functions

Adds sentence splitter functions as well as a function to remove newlines. I think we should remove the newlines after adding them, as the process can be imperfect and a whitespace at the wrong place is more natural than a newline.

For the wiki-rest datasets the steps should be:

add newlines with sentence splitter

line deduplication

remove newlines

@thomasw21 should we add this to your scripts that create the filters for each dataset programmatically? The rule would be: add the three steps above to all datasets where the dataset name contains wiki but not wikipedia.
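The three steps and the naming rule can be sketched as follows. This is only an illustration: the function names are hypothetical, and the regex-based splitter is a naive stand-in for whatever sentence splitter the PR actually adds.

```python
import re


def needs_wiki_rest_steps(dataset_name):
    """Rule from the discussion: name contains "wiki" but not "wikipedia"."""
    return "wiki" in dataset_name and "wikipedia" not in dataset_name


def add_newlines(text):
    """Naive splitter (assumption): newline after '.', '!' or '?' plus whitespace."""
    return re.sub(r"(?<=[.!?])\s+", "\n", text)


def dedup_lines(texts):
    """Drop every line that was already seen in this or an earlier document."""
    seen = set()
    deduplicated = []
    for text in texts:
        kept = []
        for line in text.split("\n"):
            if line not in seen:
                seen.add(line)
                kept.append(line)
        deduplicated.append("\n".join(kept))
    return deduplicated


def remove_newlines(text):
    """Replace the newlines added by the splitter with spaces."""
    return text.replace("\n", " ")


def wiki_rest_pipeline(texts):
    """Step 1: add newlines. Step 2: line deduplication. Step 3: remove newlines."""
    split_texts = [add_newlines(text) for text in texts]
    return [remove_newlines(text) for text in dedup_lines(split_texts)]
```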

opened by lvwerra 2

Remove excessive duplicates

Deduplication works. Now it's time to fine-tune the parameters. Tested on bigscience-catalogue-lm-data/lm_fr_pseudocrawl-filtered_530_www_mediapart_fr

Selon des informations confidentielles dont dispose Mediapart, Bernard Tapie a rompu au début du mois d’août les négociations qu’il menait avec l’armateur de CMA CGM, Rodolphe Saadé, en vue de lui céder le contrôle du quotidien La Provence. Du même coup, Xavier Niel qui semblait disposé à apporter à l’acquéreur les 11 % qu’il contrôle lui-même, devrait garder sa participation, sans chercher, selon nos sources, ni à la réduire ni à l’augmenter.
Iskandar Safa
La Provence
Nice-Matin
Rodolphe Saadé
Tapie
Xavier Niel
 ===== ====== ====== ====
La Provence: Tapie rompt les négociations de vente avec Saadé
Au printemps dernier, Bernard Tapie souhaitait vendre le quotidien à l’armateur de CMA CGM, Rodolphe Saadé, et Xavier Niel était disposé à lui céder aussi sa participation. Selon nos informations, les négociations ont capoté au début de l’été.
Selon des informations confidentielles dont dispose Mediapart, Bernard Tapie a rompu au début du mois d’août les négociations qu’il menait avec l’armateur de CMA CGM, Rodolphe Saadé, en vue de lui céder le contrôle du quotidien La Provence. Du même coup, Xavier Niel qui semblait disposé à apporter à l’acquéreur les 11 % qu’il contrôle lui-même, devrait garder sa participation, sans chercher, selon nos sources, ni à la réduire ni à l’augmenter.
Iskandar Safa
La Provence
Nice-Matin
Rodolphe Saadé
Tapie
Xavier Niel
Le Brexit n’aura pas d’impact sur l’échange de renseignements (responsable européen) Par Agence France-Presse
La Chine impose des mesures nationales de dépistage dans les transports Par Agence France-Presse
Bolivie: la présidente par intérim annonce sa candidature à la présidentielle Par Agence France-Presse
Virus: l’armée chinoise déploie du personnel médical à Wuhan Par Agence France-Presse
Emmanuel Macron a reçu Juan Guaido au palais de l’Elysée Par Agence France-Presse

Aide aux étrangers: pourquoi la Cimade a claqué la porte du plus gros centre de rétention de France
La Cimade a claqué la porte du plus gros centre de rétention de France, près de l’aéroport de Roissy. Trop de violences. Alors que l’association (sous contrat avec le ministère de l’intérieur) est chargée d’accompagner juridiquement les étrangers enfermés en vue de leur expulsion, ses salariés exercent leur droit de retrait depuis déjà deux semaines.
Fouilles au centre de rétention de Rennes : le témoignage d'une visiteuse 24 mai 2019 Par Fini de rire
Trois mois d’enfermement en rétention: 2019 marque un tournant dans la répression 2 janv. 2019 Par La Cimade
 ===== ====== ====== ====
Aide aux étrangers: pourquoi la Cimade a claqué la porte du plus gros centre de rétention de France
Chargés d’aider les étrangers enfermés près de Roissy, les salariés de la Cimade viennent de se retirer du centre de rétention. Trop de violences. « Le climat est devenu terrible », explique le président de l’association Christophe Deltombe à Mediapart, qui dénonce « une politique du tout enfermement » généralisée. Entretien.
La Cimade a claqué la porte du plus gros centre de rétention de France, près de l’aéroport de Roissy. Trop de violences. Alors que l’association (sous contrat avec le ministère de l’intérieur) est chargée d’accompagner juridiquement les étrangers enfermés en vue de leur expulsion, ses salariés exercent leur droit de retrait depuis déjà deux semaines.
Islamophobie: un homme poursuivi pour avoir crevé les pneus de femmes voilées Par Camille Polloni
A Dunkerque, le succès des bus gratuits écrase la campagne municipale Par Ludovic Lamant
A Marseille, des campagnes municipales sous pression sociale Par Jean-Marie Leforestier (Marsactu)
Classement truqué des grands crus de Saint-Emilion: le parquet ridiculisé Par Michel Deléan
Fonds pour la gestion de l’emploi agricole: un système de prélèvements sociaux au bénéfice de la FNSEA Par Amélie Poinssot
Violences sexuelles: le coup de com’ de Schiappa sur le dos des étrangers Par Mathilde Mathieu et Ellen Salvi
Immigration: l’exécutif dégaine 20 mesures de bric et de broc pour occuper le terrain Par Mathilde Mathieu
Débat sur l’immigration: Macron plonge ses ministres dans l’embarras Par Mathilde Mathieu et Ellen Salvi
Immigration: «10 faits» brandis par l’exécutif et combien de biais? Par Mathilde Mathieu
Aide médicale aux étrangers: Macron veut un débat sur ses «excès» Par Mathilde Mathieu
PORTFOLIO Asile: à Manus, les marques d'une violence inouïe Par Photos transmises à la rédaction de Mediapart
Fouilles au centre de rétention de Rennes : le témoignage d'une visiteuse 24 mai 2019 Par Fini de rire
Accueil des migrants : 13 maires de grandes villes lancent un appel à l'État 24 avr. 2019 Par Patrick Cahez
Trois mois d’enfermement en rétention: 2019 marque un tournant dans la répression 2 janv. 2019 Par La Cimade
Immersion dans la logique pédocriminelle de Gabriel Matzneff Par Antoine Perraud
Au Togo, l’opposition est piégée par une élection verrouillée Par François Hume-Ferkatadji et Olivia Macadré
Aux confins de Pékin, la crainte du virus bouleverse le quotidien Par Jordan Pouille
Dans le budget de l’UE, cette clause qui chiffonne Viktor Orbán Par Ludovic Lamant
Avant la primaire du Nevada, Bernie Sanders plus que jamais favori et contesté Par Mathieu Magnaudeix
Le système Woerth cerné par les juges Par Fabrice Arfi, Michel Deléan, Laurent Mauduit et Yann Philippin
L’individualisation à outrance de la réforme des retraites Par Manuel Jardinaud

Après son triomphe, Jeremy Corbyn doit affronter son propre parti
Contre la guerre et contre l'austérité, le nouveau leader du parti travailliste a pris la tête du parti avec 59,5 % des voix. Mais avant même de convaincre l’électorat, il lui faudra rassembler un Labour en ébullition.
De notre correspondant à Londres (Royaume-Uni).- Il est 15 h 00, ce samedi 12 septembre, sur la place du Parlement, face à Westminster, et des dizaines de milliers de personnes déferlent avec des panneaux en soutien aux réfugiés. La manifestation est prévue depuis quelque temps déjà mais elle prend soudain un tour très politique. Jeremy Corbyn, élu à la tête du parti travailliste trois heures et demie plus tôt, y prononce son tout premier discours. « Je n’ai jamais vu la place du Parlement aussi belle et heureuse qu’aujourd’hui !, lance-t-il. Nous n’avons plus à avoir peur de l’extrême droite et des racistes. Un soulèvement populaire en faveur de la décence et de l’humanité est en marche. »
Mais qui est Jeremy Corbyn? 10 déc. 2019 Par Martin Benoit
 ===== ====== ====== ====
Après son triomphe, Jeremy Corbyn doit affronter son propre parti
Contre la guerre et contre l'austérité, le nouveau leader du parti travailliste a pris la tête du parti avec 59,5 % des voix. Mais avant même de convaincre l’électorat, il lui faudra rassembler un Labour en ébullition.
De notre correspondant à Londres (Royaume-Uni).- Il est 15 h 00, ce samedi 12 septembre, sur la place du Parlement, face à Westminster, et des dizaines de milliers de personnes déferlent avec des panneaux en soutien aux réfugiés. La manifestation est prévue depuis quelque temps déjà mais elle prend soudain un tour très politique. Jeremy Corbyn, élu à la tête du parti travailliste trois heures et demie plus tôt, y prononce son tout premier discours. « Je n’ai jamais vu la place du Parlement aussi belle et heureuse qu’aujourd’hui !, lance-t-il. Nous n’avons plus à avoir peur de l’extrême droite et des racistes. Un soulèvement populaire en faveur de la décence et de l’humanité est en marche. »
Jair Bolsonaro mène une offensive généralisée contre les autochtones Par Jean-Mathieu Albertini
La BCE se rêve en chef d’orchestre de la transition écologique Par Martine Orange
Un féminicide étalé dans les journaux révulse le Mexique Par Marie Hibon
Football Leaks: l’UEFA bannit Manchester City de la Ligue des champions Par Yann Philippin
Aux Etats-Unis, le pouvoir des images contre les violences policières Par Mathieu Magnaudeix
MediapartLive «hors les murs» avec ceux qui veulent sauver l’hôpital public Par La rédaction de Mediapart
Un droit de réponse d’Augustin de Romanet Par La rédaction de Mediapart
Municipales: altercation lors d’une séance de tractage à Vitry-sur-Seine Par La rédaction de Mediapart
L’enquête sur la mort d’un jeune homme en cellule de dégrisement relancée Par La rédaction de Mediapart
Gilets jaunes: Maxime Nicolle et Eric Drouet verbalisés à Paris Par La rédaction de Mediapart
PODCAST Après le Brexit, Londres risque de devenir un «super-paradis fiscal» Par Ludovic Lamant
WEBDOC «Colis suspect», à qui profite la fermeture des frontières européennes Par Sofia Català Vidal & Rosa Pérez Masdeu
PODCAST Bruxelles, place forte des lobbies? Par Ludovic Lamant
Le ministère de l'intérieur refuse la naturalisation d'un Britannique 1 févr. 2020 Par Patrick Cahez
Élection britannique: les raisons de l’échec du corbynisme 16 déc. 2019 Par Philippe Marlière
Mais qui est Jeremy Corbyn? 10 déc. 2019 Par Martin Benoit
Retraites: les gauches veulent «tenir la tranchée», la majorité souhaite (enfin) débattre Par Manuel Jardinaud
Le chômage baisse, mais pas de «miracle de l'emploi» Par Romaric Godin
Au Brésil, une offensive généralisée contre les autochtones Par Jean-Mathieu Albertini
Querelle sur Léonard de Vinci: un expert attaque le Louvre Par Karl Laske
Un nouveau féminicide révulse le Mexique Par Marie Hibon
Usul. L’affaire Mila est un «révélateur» Par Usul et Rémi Liechti
Castaner veut contrôler le droit de filmer Par Pascale Pascariello
Les abus policiers en neuf vidéos Par Donatien Huet
Aux Etats-Unis, le pouvoir des images Par Mathieu Magnaudeix

Neymar: La police recommande d'abandonner l'enquête pour viol
La police brésilienne annoncé mardi avoir recommandé l'abandon de l'enquête pour viol visant l'international Neymar da Silva Santos Junior, dit Neymar, expliquant ne pas avoir recueilli suffisamment de preuves susceptibles de l'incriminer.
SAO PAULO (Reuters) - La police brésilienne annoncé mardi avoir recommandé l'abandon de l'enquête pour viol visant l'international Neymar da Silva Santos Junior, dit Neymar, expliquant ne pas avoir recueilli suffisamment de preuves susceptibles de l'incriminer.
 ===== ====== ====== ====
Neymar: La police recommande d'abandonner l'enquête pour viol
La police brésilienne annoncé mardi avoir recommandé l'abandon de l'enquête pour viol visant l'international Neymar da Silva Santos Junior, dit Neymar, expliquant ne pas avoir recueilli suffisamment de preuves susceptibles de l'incriminer.
SAO PAULO (Reuters) - La police brésilienne annoncé mardi avoir recommandé l'abandon de l'enquête pour viol visant l'international Neymar da Silva Santos Junior, dit Neymar, expliquant ne pas avoir recueilli suffisamment de preuves susceptibles de l'incriminer.
Mexique: près de 15 ans de prison pour l’un des assassins du journaliste Javier Valdez Par Agence France-Presse
Coronavirus: le complexe Tokyo DisneyLand/DisneySea ferme deux semaines Par Agence France-Presse
Coronavirus: les stars de la K-pop, BTS, annulent quatre dates à Séoul Par Agence France-Presse
Coronavirus: un cas recensé au Nigeria, le 1er en Afrique subsaharienne Par Agence France-Presse
Le risque d’escalade en Syrie « augmente d’heure en heure » si rien n’est fait Par Agence France-Presse

La marche à la mort de Jamal Khashoggi
Proche de la famille royale, des services secrets et de Ben Laden, le journaliste connaissait les arcanes du pouvoir saoudien. Entré en dissidence contre Mohammed ben Salmane, il réunissait derrière lui les libéraux et les islamistes, qu’il voulait faire entrer en démocratie. Un an après son assassinat, l’affaire est semi-enterrée. Le prince héritier a admis sa responsabilité. Pour montrer qu’il tenait le pays.
L’Algérien Abdullah Anas a combattu dix ans aux côtés du célèbre commandant Ahmad Shah Massoud dans les montagnes d’Afghanistan. Dans son récit To the Mountains – My Life in Jihad (Hurst Publishers), qui vient de paraître, il évoque brièvement Jamal Khashoggi. On y lit que c’est bien dans les maquis afghans que le journaliste assassiné a rencontré Oussama ben Laden. Dans un passage du livre, l’auteur mentionne le fait que Khashoggi, Ben Laden, lui-même et d’autres volontaires arabes ayant rejoint la guérilla s’emploient à convaincre les factions afghanes, peu avant la chute du « tyran rouge » Mohammad Najibullah, en avril 1992, de s’unir pour éviter la guerre civile – l’échec sera retentissant.
PODCAST L’Arabie saoudite envisage pour la première fois d’exécuter une militante des droits humains Par jean-pierre perrin
Jamal Khashoggi a été "étranglé" puis "démembré" à l'ambassade saoudienne! 1 nov. 2018 Par Freddy Mulongo
Erdoǧan, Ben Salmane, al-Sissi et Donald Trump 27 oct. 2018 Par Gabas
 ===== ====== ====== ====
La marche à la mort de Jamal Khashoggi
Proche de la famille royale, des services secrets et de Ben Laden, le journaliste connaissait les arcanes du pouvoir saoudien. Entré en dissidence contre Mohammed ben Salmane, il réunissait derrière lui les libéraux et les islamistes, qu’il voulait faire entrer en démocratie. Un an après son assassinat, l’affaire est semi-enterrée. Le prince héritier a admis sa responsabilité. Pour montrer qu’il tenait le pays.
L’Algérien Abdullah Anas a combattu dix ans aux côtés du célèbre commandant Ahmad Shah Massoud dans les montagnes d’Afghanistan. Dans son récit To the Mountains – My Life in Jihad (Hurst Publishers), qui vient de paraître, il évoque brièvement Jamal Khashoggi. On y lit que c’est bien dans les maquis afghans que le journaliste assassiné a rencontré Oussama ben Laden. Dans un passage du livre, l’auteur mentionne le fait que Khashoggi, Ben Laden, lui-même et d’autres volontaires arabes ayant rejoint la guérilla s’emploient à convaincre les factions afghanes, peu avant la chute du « tyran rouge » Mohammad Najibullah, en avril 1992, de s’unir pour éviter la guerre civile – l’échec sera retentissant.
L’Allemagne se prépare à un dépistage encore plus massif Par Thomas Schnee
En Côte d’Ivoire, les prisonniers vivent dans «des conditions inhumaines» Par Olivia Macadré et François Hume-Ferkatadji
Le marché pétrolier affronte une crise centenaire Par martine orange
Une histoire politique du balcon Par Ludovic Lamant
Bolsonaro rejette le confinement et s’isole politiquement Par Jean-Mathieu Albertini
Le pouvoir iranien ébranlé par l’épidémie Par Jean-Pierre Perrin
Arabie saoudite: le prince héritier frappe la monarchie en plein cœur Par Jean-Pierre Perrin
Iran: la chercheuse franco-iranienne Fariba Adelkhah en danger de mort Par Jean-Pierre Perrin
«Lettre à Franco»: l’honneur d’un philosophe Par Jean-Pierre Perrin
Accord «historique» de paix en Afghanistan: les talibans remportent la première manche Par Jean-Pierre Perrin
PODCAST Pétrole: les risques d'un baril à 100 dollars Par martine orange
PODCAST Quatre années de guerre ont renvoyé le Yémen 100 ans en arrière Par Thomas Cantaloube
PODCAST L’Arabie saoudite envisage pour la première fois d’exécuter une militante des droits humains Par jean-pierre perrin
Meurtre de Khashoggi : le verdict de la justice saoudienne enfin rendu 5 janv. 2019 Par Jérôme Henriques
Jamal Khashoggi a été "étranglé" puis "démembré" à l'ambassade saoudienne! 1 nov. 2018 Par Freddy Mulongo
Erdoǧan, Ben Salmane, al-Sissi et Donald Trump 27 oct. 2018 Par Gabas
L’exécutif se prépare à une «tragédie» dans les Ehpad Par Ellen Salvi
«A l’air libre». L’Outre-mer, lutter contre les violences domestiques, la Côte d’Ivoire et Rodolphe Burger Par La rédaction de Mediapart
Coronavirus: les libertés et la démocratie mises à mal Par Michel Deléan
Journal de bord des internes: «C’est comme “Un jour sans fin”» Par Antton Rouget
La crise du Covid-19 en direct. «La fin du confinement ne se fera pas en une fois et pour tout le monde» Par La rédaction de Mediapart
Faire une révolution intérieure avec Xavier de Maistre, Mona Chollet et Thomas Clerc Par Lise Wajeman
Hôpital public: la note explosive de la Caisse des dépôts Par Laurent Mauduit et martine orange
Dans les corons, un infirmier face aux urgences Par Jeremy Lempin
Confinement: la députée LREM Laetitia Avia démentie par sa collaboratrice Par David Perrotin

opened by thomasw21 2

Add feature to see the modified examples by a map operation

This PR proposes a feature to view and save examples which are different after a map operation.

~If we think that the created dataset mapped_diff_ds might be too big, we can implement in a follow-up PR something to only save a subset of it.~

~I've modified replace_newline_with_space to match the new requirement of in and out text columns~

~Please note: the git diff is shown relative to PR #15~
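A minimal sketch of the idea, assuming the texts before and after the map are available as two aligned lists. The function name and the record fields are hypothetical and not the PR's API.

```python
def collect_map_diffs(texts_before, texts_after):
    """Return one record per example whose text was changed by the map."""
    return [
        {"index": index, "old_text": before, "new_text": after}
        for index, (before, after) in enumerate(zip(texts_before, texts_after))
        if before != after
    ]


# Illustrative values only.
before = ["line one\nline two", "unchanged"]
after = ["line one line two", "unchanged"]
diffs = collect_map_diffs(before, after)
```

Saving only the changed examples keeps the inspection output small when a map touches few documents.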

opened by SaulLu 2

First draft of generic cleaning script

I think we should code all the functions to work on batches, i.e. take a batch of documents as input and output a batch of documents for maps, and a list of bools for filters. That allows for more expressivity and is faster, IMO.
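The batch contract described here could be sketched as below, with a batch represented as a dict of columns. The filter name, the helper apply_to_batch, and the "text" column name are assumptions; the functions follow the shape accepted by Dataset.map and Dataset.filter with batched=True in the Hugging Face datasets library, but this sketch does not depend on it.

```python
def replace_newline_with_space(batch):
    """Map: takes a batch of documents and returns a batch of documents."""
    return {"text": [text.replace("\n", " ") for text in batch["text"]]}


def filter_empty_docs(batch):
    """Filter (hypothetical): returns one bool per document; True keeps it."""
    return [len(text.strip()) > 0 for text in batch["text"]]


def apply_to_batch(batch, maps, filters):
    """Apply all maps, then keep the rows accepted by every filter."""
    for map_fn in maps:
        batch = {**batch, **map_fn(batch)}
    keep = [True] * len(batch["text"])
    for filter_fn in filters:
        keep = [kept and ok for kept, ok in zip(keep, filter_fn(batch))]
    return {
        column: [value for value, kept in zip(values, keep) if kept]
        for column, values in batch.items()
    }
```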

opened by thomasw21 2

Non-Wikipedia Wikis Dedup script

This dedup script is based on Yacine's pseudocrawl filter script, which Sasha and I adapted for the wikis. It needs to be tuned on a project basis, possibly even a language basis.

opened by cakiki 1

Add substring remover mapper

This function is meant to strip repeated strings like the ones here: https://github.com/bigscience-workshop/catalogue_data/issues/5#issuecomment-1057424610

opened by cakiki 1

Remove short lines

Initially I developed this because I thought we didn't want some of the short sentences in pt_bwarc, which seemed to be linked to a specific template. However, looking more in depth, I think removing them breaks consistency. Examples where I thought it would make sense:

Venda de Rural em Gurupi No Bairro Tocantins Por 23750 -
Tocantins , Gurupi , Tocantins
R$23.750.000
Descrição
5.475 Hectares Sendo 1.131 Alqueires - R$ 21.000,00/alqueire .Toda Formada Em Pastagem , Baquerao , Atropologo e Otima Para Agricultura , Soja e Eucalipto .100% de Aproveitamento Tirando As Reservas .Topografia / relevo - Plana . a Benfeitoria Conta Com 03 Sedes , 08 Retiros , Todos Com Energia , 08 Casas Para Funcionarios , e 03 Currais .18 Km de Beira de Rio e Mais 03 Corregos Dentro da Propriedade .

======

Casa Residencial À Venda , Nossa Senhora de Fátima , Contagem .
Nossa Senhora de Fátima , Contagem , Minas Gerais
3Quartos
2Banheiros
200m²Superfície
R$480.000
Descrição
Linda Casa em excelente construção de alvenaria .03 quartos com piso em laminado de madeira sendo 01 com suíte .Sala de estar grande com rebaixamento de teto , piso em cerâmica .Ampla sala de jantar com excelente arejamento .Ampla e maravilhosa cozinha com armários sob pia e bancadas em granito com excelente posicionamento .Banho social com box de vidro blindex e piso em cerâmica .Área de serviço externa com bom tamanho e churrasqueira em construção .Portão eletrônico , 02 vagas de garagem cobertas e espaço no quintal para até + 06 veículos .Localização privilegiada no bairro , próximo de bancos , escolas e comércio em geral .Linda fachada residencial .Venda :Abraão Imóveis 31 3398-3517 - 8772-9215 - 9950-9556 - 22/07/2016

======

    Travessa Antonio Rosa , Campo Grande
Travessa Antonio Rosa , Campo Grande , Mato Grosso do Sul
R$350
Descrição
Tenho preferencia por MENINAS , estudante ou profissional , que tenha bom senso , que seja organizada , que respeite as diferenças das outras moradoras , que respeite as regras impostas na casa . . .O apartamento é bem localizado , próximo de ponto de ônibus , moro aqui tem quase 4 anos , e posso dizer considerar o lugar seguro . . .Demais detalhes me chamem , nove meia trinta e nove quarenta e cinco oitenta .

This PR is mostly to share the code, and if anyone needs it to use it for another dataset.
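For anyone who wants to try the idea on another dataset, a short-line remover could look like the sketch below. The function name and the 20-character threshold are assumptions, not the values used in this PR.

```python
def remove_short_lines(text, min_length=20):
    """Drop lines whose stripped length is below `min_length` characters."""
    return "\n".join(
        line for line in text.split("\n") if len(line.strip()) >= min_length
    )
```

As noted above, applying this to listings like these removes fields such as the price and leaves only the long description, which is why it can hurt consistency.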

opened by thomasw21 0

S2ORC vs Arxiv vs PMC

Currently we have four datasets containing S2ORC, Arxiv, and PMC data:

lm_en_s2orc_ai2_pdf_parses

lm_en_s2orc_ai2_abstracts

lm_en_arxiv

lm_en_pmc

There are a few concerns:

Overlap between abstracts and pdf parses of S2ORC. Since there are many more abstracts than full pdf parses we probably don't want to discard all abstracts. Currently investigating if we can match on paper_id to discard abstracts of papers that have pdf parses.

There is probably significant overlap between Arxiv, PMC <-> S2ORC pdf parses, but the former are probably larger. So it would make sense to exclude the Arxiv/PMC sources from S2ORC. The source info exists in principle in the S2ORC dataset but seems not to be present in the datasets above. Asked Kyle if there is a way to get that info.

The Arxiv/PMC sources are less preprocessed and, e.g., references should be removed. This requires a custom filter/map.
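The paper_id matching mentioned in the first concern could be sketched like this, assuming each record in both S2ORC datasets carries a paper_id field (still under investigation above). The function name and the record layout are hypothetical.

```python
def drop_abstracts_with_pdf_parse(abstracts, pdf_parses):
    """Keep only the abstracts whose paper has no full pdf parse.

    Assumption: every record is a dict with a "paper_id" key.
    """
    parsed_ids = {doc["paper_id"] for doc in pdf_parses}
    return [doc for doc in abstracts if doc["paper_id"] not in parsed_ids]


# Illustrative records only.
abstracts = [{"paper_id": "1", "text": "abstract A"}, {"paper_id": "2", "text": "abstract B"}]
pdf_parses = [{"paper_id": "2", "text": "full text B"}]
remaining = drop_abstracts_with_pdf_parse(abstracts, pdf_parses)
```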

opened by lvwerra 6

Catching crawling noise + ads

Some datasets (e.g. bigscience-catalogue-lm-data/lm_es_pseudocrawl-filtered_396_www_eldiario_es) happen to have a mix of:

crawling noise

ad JavaScript that results in code-like snippets, which can be caught by looking for { and } characters. I have seen a few stray { typos that are unrelated for some reason; I don't think we want to remove them, but I don't think we really care either. I'd advocate either removing any line that contains } in the pseudocrawled newspaper datasets, or removing all of the {...} groups.
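Both options can be sketched in a few lines. The function names are hypothetical, and the regex approach to {...} groups is one possible reading of the proposal: it removes innermost groups repeatedly so that nested braces disappear too, and it leaves a stray unmatched { untouched.

```python
import re

# Matches an innermost {...} group (no braces inside).
BRACE_GROUP = re.compile(r"\{[^{}]*\}")


def remove_lines_with_closing_brace(text):
    """Option 1: drop every line that contains a '}' character."""
    return "\n".join(line for line in text.split("\n") if "}" not in line)


def remove_brace_groups(text):
    """Option 2: strip all {...} groups, including nested ones."""
    previous = None
    while previous != text:
        previous = text
        text = BRACE_GROUP.sub("", text)
    return text
```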

opened by TevenLeScao 6

Removing dataset lm_en_a_million_news_headlines_abc_australia

I think it would be better to remove this dataset from the list.

Here are some random examples of documents; they are all like this:

Doc 0: adrian bayley minimum prison term extended 10 years over rapes

Doc 1: egg farm break in

Doc 2: stoner claims grand prix in portugal

Doc 3: palau typhoon bopha watch

Doc 4: dna breakthrough on unsolved rape

Doc 5: labor says mortgage stress at record high

Doc 6: concerns raised over carbon capture

Doc 7: nigeria to set up regional anti boko haram force

Doc 8: habib says torturers used information from

Doc 9: mixed bag for wine production

Doc 10: sue butler said it

Doc 11: dog hitches ride from queensland to sa

Doc 12: tests show beach algae harmless

Doc 13: more support sought for chamber of commerce

Doc 14: australia india engaged together to stop people

Doc 15: tim costello on financial crisis

Doc 16: push to save womens army camp ruins from roe highway extension

Doc 17: serial rapist convicted over knifepoint attacks

Doc 18: grandmother lorn cheng jailed for smuggling heroin from cambodia

Doc 19: fact check bradfield scheme barnaby joyce drought

Doc 20: gold coast man attacked with tomahawk

opened by HugoLaurencon 2

Repeated lines across examples

Several datasets have repeated text across examples:

crawled newspapers tend to have the links to other articles at the bottom, which are nearly always the same

datasets like the wiki datasets tend to have templates at the start, also always the same.

The difficulty is that some datasets have legitimate repetitions, such as parliamentary proceedings (e.g. lm_en_the_pile_europarl).
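One way to inspect such repetitions before deciding what to remove is to count, for each line, how many examples contain it. The sketch below is an assumption about how this could be done (function name and threshold are hypothetical); it only detects candidates and cannot tell boilerplate from legitimate repetition.

```python
from collections import Counter


def find_repeated_lines(texts, min_doc_fraction=0.5):
    """Return the non-empty lines that occur in at least this fraction of texts."""
    doc_counts = Counter()
    for text in texts:
        # Use a set so a line is counted at most once per document.
        doc_counts.update(set(text.split("\n")))
    threshold = min_doc_fraction * len(texts)
    return {line for line, count in doc_counts.items() if line and count >= threshold}
```

For datasets with legitimate repetitions, the returned lines would need manual review or a higher threshold.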

opened by TevenLeScao 3

Owner

BigScience Workshop

Research workshop on large language models - The Summer of Language Models 21

GitHub

Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

1 Jan 9, 2022

Analysis scripts for QG equations

qg-edgeofchaos Analysis scripts for QG equations FIle/Folder Structure eigensolvers.py - Spectral and finite-difference solvers for Rossby wave eigenf

2 Sep 27, 2022

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

3.7k Jan 3, 2023

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 9, 2023

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

???? ??. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

97 Dec 8, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

1 Jan 16, 2022

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

359 Dec 22, 2022

A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data structure.

40 Dec 12, 2022

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Functional Data Analysis Python package

Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid

184 Dec 27, 2022

Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier. Fancy data functions that will make your life as a data scientist easier. Installing To ins

3 Oct 3, 2022

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline. Description: This is a project to extract, transform, and load large amounts of data from NYC Taxi

2 Dec 12, 2021

A COVID data pipeline built with PySpark and MySQL that collects a data stream from an API, processes it, and stores it in a MySQL database.

2 Nov 20, 2021

Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

1 Dec 27, 2021

PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application lifecycle data and design an ETL pipeline architecture for ingesting data from multiple sources into Redshift. This project is composed of two parts: Part1 and Part2.

1 Jan 19, 2022

Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan: Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials.

27 Nov 1, 2022

Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Analysis scripts for QG equations

PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.

ForecastGA is a Python tool to forecast Google Analytics data using several popular time series models.

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN